/Descent -205
/ItalicAngle 0
/StemV 90
/MissingWidth 602
/FontFile2 12 0 R % The actual font file, here in TrueType format
>>
endobj
The details of the actual font formats (Type 1, TrueType, etc.) are not discussed here.
In fact, they are not defined in the PDF standard either, but in external documents
from the providers of those font formats.
Extracting Text from a Document
It is customary to include enough information in a file's font dictionaries to allow the
actual character identities (rather than just the glyphs) to be retrieved. This is important
to allow users to search and copy text from PDF viewing applications like Adobe
Reader. It can also be used, in a more limited capacity, to allow edits to be made to
the textual content of a document.
There are two mechanisms for this: the /Encoding entry in the font, which maps character
codes to glyph names from the Adobe Glyph List, such as /bullet, and a more modern
mechanism, the /ToUnicode entry, which refers to a stream containing a CMap, a small
program in a language defined by Adobe that maps character codes directly to Unicode
code points. Here is an example of a /ToUnicode program:
23 0 obj
<< /Length 317 >>
stream
/CIDInit /ProcSet findresource begin 12 dict begin begincmap /CIDSystemInfo <<
/Registry (Symbol+0) /Ordering (T1UV) /Supplement 0 >> def
/CMapName /Symbol+0 def
1 begincodespacerange <01> <01> endcodespacerange
1 beginbfrange
<01> <01> <2022> % Maps character code 1 to Unicode U+2022, the bullet point
endbfrange
endcmap CMapName currentdict /CMap defineresource pop end end
endstream
endobj
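To make the mapping concrete, the following Python sketch reads the bfchar and bfrange sections of a /ToUnicode program and builds a table from character codes to Unicode text. It is only an illustration under stated assumptions: the function name parse_to_unicode and its helpers are hypothetical, the stream is assumed to have been extracted and decompressed already, hex strings are assumed to have an even number of digits, and the codespace ranges are ignored.

```python
import re

HEX = r"<([0-9A-Fa-f]+)>"
# One bfchar entry: <code> <unicode>
BFCHAR_ENTRY = re.compile(HEX + r"\s*" + HEX)
# One bfrange entry: <first> <last> <start>  or  <first> <last> [<u1> <u2> ...]
BFRANGE_ENTRY = re.compile(HEX + r"\s*" + HEX + r"\s*(?:" + HEX + r"|\[(.*?)\])", re.S)


def decode_utf16(hex_digits):
    """Decode a hex string from the CMap as big-endian UTF-16 text."""
    return bytes.fromhex(hex_digits).decode("utf-16-be")


def parse_to_unicode(cmap_text):
    """Return a dict mapping character codes (int) to Unicode strings.

    Hypothetical helper for illustration; not a complete CMap interpreter.
    """
    mapping = {}
    # Drop %-comments so that annotations are not mistaken for data.
    cmap_text = re.sub(r"%[^\r\n]*", "", cmap_text)

    for section in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for code, dst in BFCHAR_ENTRY.findall(section):
            mapping[int(code, 16)] = decode_utf16(dst)

    for section in re.findall(r"beginbfrange(.*?)endbfrange", cmap_text, re.S):
        for first, last, start, array in BFRANGE_ENTRY.findall(section):
            first, last = int(first, 16), int(last, 16)
            if start:
                # Consecutive codes map to consecutive values of the last character.
                text = decode_utf16(start)
                for offset in range(last - first + 1):
                    mapping[first + offset] = text[:-1] + chr(ord(text[-1]) + offset)
            else:
                # Array form: one destination string per code in the range.
                for offset, dst in enumerate(re.findall(HEX, array)):
                    mapping[first + offset] = decode_utf16(dst)
    return mapping
```

Applied to the program in object 23 above, the table contains a single entry, mapping character code 1 to the bullet character U+2022.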
Another difficulty in extracting text is reconstructing it from the text operators within
the content stream. Operators may split up the text for kerning or justification, and
hyphenation at the end of lines can interrupt the stream of characters. It is even
possible for the text operators to appear out of reading order. Usually, though, a good
reconstruction of the text can be produced from most modern files.
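The two problems of split text and end-of-line hyphenation can be illustrated with a rough Python sketch. Everything here is an assumption made for illustration: the function names text_from_tj and join_lines are hypothetical, the gap threshold of 200 is an arbitrary heuristic, the strings are assumed to use character codes that coincide with ASCII, and hex strings, octal escapes and nested parentheses are not handled. A real extractor would also map the codes through /Encoding or /ToUnicode and use glyph positions to recover reading order.

```python
import re

# A TJ operand is an array of strings and numbers, e.g. [(Hel) 15 (lo) -350 (world)].
# Each number shifts the next string; a large negative value widens the gap.
TJ_ITEM = re.compile(r"\(((?:\\.|[^\\()])*)\)|(-?\d*\.?\d+)")


def text_from_tj(array_source, gap_threshold=200):
    """Join the strings of one TJ array, guessing where word spaces fall.

    gap_threshold is a heuristic assumption, in thousandths of text space units.
    """
    parts = []
    for string, number in TJ_ITEM.findall(array_source):
        if number:
            # Small adjustments are kerning; a big gap is treated as a space.
            if float(number) < -gap_threshold:
                parts.append(" ")
        else:
            # Undo simple backslash escapes such as \( \) and \\ only.
            parts.append(re.sub(r"\\(.)", r"\1", string))
    return "".join(parts)


def join_lines(lines):
    """Join extracted lines, treating a trailing hyphen as a line-break hyphen.

    This wrongly merges genuine hyphenated compounds split at a line end;
    telling the two apart would need a dictionary.
    """
    out = ""
    for line in lines:
        line = line.strip()
        if out.endswith("-"):
            out = out[:-1] + line
        elif out:
            out += " " + line
        else:
            out = line
    return out
```

For example, text_from_tj("[(Hel) 15 (lo) -350 (wor) 20 (ld)]") yields "Hello world", and join_lines(["a good recon-", "struction of text"]) yields "a good reconstruction of text".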