Text and Fonts - PDF Explained

/Descent -205

/ItalicAngle 0

/StemV 90

/MissingWidth 602

/FontFile2 12 0 R % The actual font file, here in TrueType format

>>

endobj

The details of the actual font formats (Type 1, TrueType, etc.) are not discussed here. In fact, they are not discussed in the PDF standard either, but in external documents from the providers of those font formats.

Extracting Text from a Document

It is customary to include enough information in a file's font dictionaries to allow the actual character identities (rather than just the glyphs) to be retrieved. This is important to allow users to search and copy text from PDF viewing applications like Adobe Reader. It can also be used, in a more limited capacity, to allow edits to be made to the textual content of a document.

There are two mechanisms for this: the /Encoding entry in the font (which maps character codes to Adobe Glyph List entries like /bullet), and a more modern mechanism, the /ToUnicode entry, which provides a program in a language defined by Adobe (a CMap) that maps character codes directly to Unicode code points. Here is an example of a /ToUnicode program:

23 0 obj

<< /Length 317 >>

stream

/CIDInit /ProcSet findresource begin 12 dict begin begincmap /CIDSystemInfo <<

/Registry (Symbol+0) /Ordering (T1UV) /Supplement 0 >> def

/CMapName /Symbol+0 def

1 begincodespacerange <01> <01> endcodespacerange

1 beginbfrange

<01> <01> <2022> % Maps character code 1 to Unicode U+2022, the bullet point

endbfrange

endcmap CMapName currentdict /CMap defineresource pop end end

endstream

endobj
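
To make the mapping concrete, here is a minimal Python sketch of how a text extractor might read the bfchar and bfrange sections of an already-decompressed /ToUnicode stream. The names parse_tounicode and _decode are hypothetical, chosen for illustration only. The sketch assumes every destination is an even number of hex digits holding UTF-16BE code units, and it ignores the codespace ranges and any other CMap operators.

```python
import re

HEX = r"<([0-9A-Fa-f]+)>"

def _decode(hex_digits):
    # Assumption: destinations are UTF-16BE code units written in hex.
    return bytes.fromhex(hex_digits).decode("utf-16-be")

def parse_tounicode(cmap_text):
    """Return a dict mapping character codes (int) to Unicode strings."""
    mapping = {}
    # bfchar sections hold <code> <unicode> pairs.
    for body in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for code, dest in re.findall(HEX + r"\s*" + HEX, body):
            mapping[int(code, 16)] = _decode(dest)
    # bfrange entries are <first> <last> followed by either a start
    # value or a bracketed array with one destination per code.
    entry = HEX + r"\s*" + HEX + r"\s*(?:" + HEX + r"|\[(.*?)\])"
    for body in re.findall(r"beginbfrange(.*?)endbfrange", cmap_text, re.S):
        for lo, hi, start, array in re.findall(entry, body, re.S):
            first, last = int(lo, 16), int(hi, 16)
            if start:
                # Consecutive codes map to consecutive destination values.
                for offset in range(last - first + 1):
                    value = int(start, 16) + offset
                    digits = format(value, "0%dX" % len(start))
                    mapping[first + offset] = _decode(digits)
            else:
                for offset, dest in enumerate(re.findall(HEX, array)):
                    mapping[first + offset] = _decode(dest)
    return mapping

sample = "1 beginbfrange\n<01> <01> <2022>\nendbfrange"
print(parse_tounicode(sample))  # {1: '•'}
```

Applied to the stream above, this yields a single entry mapping character code 1 to the bullet point. A real extractor would also need to decode the stream's filters first and to honour the codespace ranges when splitting strings into character codes.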

Another difficulty in extracting text is reconstructing it from the text operators within the content stream. Operators may split up the text for kerning or justification, and hyphenation at the end of lines can interrupt the stream of characters. Indeed, it is even possible that the text operators are out of reading order. Usually, though, a good reconstruction of the text can be produced from most modern files.
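
The two reconstruction problems mentioned above, text split for kerning and words hyphenated across lines, can be illustrated with a rough Python sketch. It assumes the operand of a TJ operator has already been parsed into a list of decoded strings and numbers. The function names and the threshold of 200 (in thousandths of a text space unit) are illustrative guesses, not values from the PDF standard. In a TJ array a number is subtracted from the horizontal position, so a large negative number opens a gap that may stand for a word space.

```python
def join_tj(operands, gap_threshold=200):
    """Join the strings of one TJ array, guessing where spaces belong."""
    pieces = []
    for item in operands:
        if isinstance(item, str):
            pieces.append(item)
        elif item <= -gap_threshold and pieces and not pieces[-1].endswith(" "):
            # A large negative adjustment moves the next glyph to the right.
            pieces.append(" ")
        # Small adjustments are treated as kerning and ignored.
    return "".join(pieces)

def join_lines(lines):
    """Join extracted lines, undoing end-of-line hyphenation heuristically."""
    text = ""
    for line in lines:
        line = line.strip()
        if text.endswith("-") and line[:1].islower():
            # Assume the hyphen only marked a line break inside a word.
            text = text[:-1] + line
        elif text:
            text += " " + line
        else:
            text = line
    return text

print(join_tj(["T", 80, "ext", -300, "and", -20, " fonts"]))
print(join_lines(["a good recon-", "struction of text"]))
```

Both functions are heuristics. The hyphen rule, for example, would wrongly merge a genuinely hyphenated compound that happens to break at the end of a line, and neither function addresses operators that appear out of reading order, which requires sorting text by position on the page.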
