How a PDF File is Read
To read a PDF file, converting it from a flat series of bytes into a graph of objects in
memory, the following steps might typically occur:
1. Read the PDF header from the beginning of the file, checking that this is, indeed,
a PDF document and retrieving its version number.
2. Find the end-of-file marker by searching backward from the end of the file.
The trailer dictionary can then be read, and the byte offset of the start of the
cross-reference table retrieved.
3. The cross-reference table can now be read. We now know where each object in the
file is.
4. At this stage, all the objects can be read and parsed, or we can leave this process
until each object is actually needed, reading it on demand.
5. We can now use the data, extracting the pages, parsing graphical content,
extracting metadata, and so on.
This is not an exhaustive description, since there are many possible complications
(encryption, linearization, object streams, and cross-reference streams).
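Steps 1 to 4 can be sketched in Python as follows. This is a minimal illustration rather than a real parser: it assumes a classic cross-reference table with a single section, "\n" line endings, and no encryption, linearization, or object or cross-reference streams. The function names (build_minimal_pdf, read_header, find_startxref, read_xref_table, read_object) are hypothetical and chosen only for this example.

```python
import re

def build_minimal_pdf() -> bytes:
    """Assemble a tiny PDF-like file in memory so the sketch is self-contained."""
    objects = [
        b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n",
        b"2 0 obj\n<< /Type /Pages /Kids [] /Count 0 >>\nendobj\n",
    ]
    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for obj in objects:
        offsets.append(len(out))          # byte offset where this object starts
        out += obj
    xref_pos = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"       # entry for object 0 (always free)
    for off in offsets:
        out += b"%010d 00000 n \n" % off  # 20-byte in-use entries
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out)

def read_header(data: bytes) -> str:
    """Step 1: check the header and return the version number as text."""
    m = re.match(rb"%PDF-(\d\.\d)", data)
    if m is None:
        raise ValueError("not a PDF file")
    return m.group(1).decode("ascii")

def find_startxref(data: bytes) -> int:
    """Step 2: search backward for %%EOF, then read the startxref offset."""
    eof = data.rfind(b"%%EOF")
    if eof == -1:
        raise ValueError("no end-of-file marker")
    key = data.rfind(b"startxref", 0, eof)
    if key == -1:
        raise ValueError("no startxref keyword")
    return int(data[key + len(b"startxref"):eof].split()[0])

def read_xref_table(data: bytes, pos: int) -> dict:
    """Step 3: map object numbers to byte offsets for in-use entries."""
    lines = data[pos:].split(b"\n")
    if lines[0].strip() != b"xref":
        raise ValueError("cross-reference table not found")
    offsets = {}
    i = 1
    while i < len(lines) and lines[i].strip() != b"trailer":
        first, count = (int(x) for x in lines[i].split())
        for n in range(count):
            fields = lines[i + 1 + n].split()
            if fields[2] == b"n":         # "n" = in use, "f" = free
                offsets[first + n] = int(fields[0])
        i += 1 + count
    return offsets

def read_object(data: bytes, offsets: dict, num: int) -> bytes:
    """Step 4, on demand: return the raw bytes of one object.
    Searching for "endobj" is naive; stream data could contain that keyword."""
    start = offsets[num]
    end = data.index(b"endobj", start)
    return data[start:end + len(b"endobj")]

data = build_minimal_pdf()
version = read_header(data)
xref_offsets = read_xref_table(data, find_startxref(data))
pages_raw = read_object(data, xref_offsets, 2)
```

Here read_object is called only for the object that is actually wanted, which corresponds to the on-demand option in step 4; turning the raw bytes into an object graph is left out of the sketch.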
The following recursive data structure, given in pseudocode, can hold a PDF object.
pdfobject ::= Null
            | Boolean of bool
            | Integer of int
            | Real of real
            | String of string
            | Name of string
            | Array of pdfobject array
            | Dictionary of (string, pdfobject) array    (* array of (string, pdfobject) pairs *)
            | Stream of (pdfobject, bytes)               (* stream dictionary and stream data *)
            | Indirect of int
For example, the object << /Kids [2 0 R] /Count 1 /Type /Pages >> might be
represented as:
Dictionary
((Name (/Kids), Array (Indirect 2)),
(Name (/Count), Integer (1)),
(Name (/Type), Name (/Pages)))
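One possible rendering of this data structure in Python uses one small class per variant. This is a sketch under assumptions, not the book's code: the class and field names, and the helpers lookup and indirect_refs, are hypothetical. Dictionary keys are stored as plain strings, following the (string, pdfobject) pairs in the type definition.

```python
from dataclasses import dataclass
from typing import Union

@dataclass(frozen=True)
class Null:
    pass

@dataclass(frozen=True)
class Boolean:
    value: bool

@dataclass(frozen=True)
class Integer:
    value: int

@dataclass(frozen=True)
class Real:
    value: float

@dataclass(frozen=True)
class String:
    value: str

@dataclass(frozen=True)
class Name:
    value: str

@dataclass(frozen=True)
class Array:
    items: "tuple[PdfObject, ...]"

@dataclass(frozen=True)
class Dictionary:
    entries: "tuple[tuple[str, PdfObject], ...]"   # (key, value) pairs

@dataclass(frozen=True)
class Stream:
    dictionary: "PdfObject"   # the stream dictionary
    data: bytes               # the stream data

@dataclass(frozen=True)
class Indirect:
    number: int               # object number of the referenced object

PdfObject = Union[Null, Boolean, Integer, Real, String, Name,
                  Array, Dictionary, Stream, Indirect]

def lookup(d: Dictionary, key: str):
    """Return the value stored under key, or None if the key is absent."""
    for k, v in d.entries:
        if k == key:
            return v
    return None

def indirect_refs(obj) -> list:
    """Recursively collect the object numbers of all indirect references."""
    if isinstance(obj, Indirect):
        return [obj.number]
    if isinstance(obj, Array):
        return [n for item in obj.items for n in indirect_refs(item)]
    if isinstance(obj, Dictionary):
        return [n for _, v in obj.entries for n in indirect_refs(v)]
    if isinstance(obj, Stream):
        return indirect_refs(obj.dictionary)
    return []

# << /Kids [2 0 R] /Count 1 /Type /Pages >>
pages = Dictionary((
    ("/Kids", Array((Indirect(2),))),
    ("/Count", Integer(1)),
    ("/Type", Name("/Pages")),
))
```

With this representation, lookup(pages, "/Count") yields Integer(1), and indirect_refs(pages) walks the structure recursively to find the single reference to object 2.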
Figure 3-1, shown earlier in the chapter, shows the object graph for the file in
Example 3-1.