File Structure - PDF Explained

How a PDF File is Read

To read a PDF file, converting it from a flat series of bytes into a graph of objects in memory, the following steps might typically occur:

1. Read the PDF header from the beginning of the file, checking that this is, indeed, a PDF document and retrieving its version number.
2. The end-of-file marker is now found, by searching backward from the end of the file. The trailer dictionary can now be read, and the byte offset of the start of the cross-reference table retrieved.
3. The cross-reference table can now be read. We now know where each object in the file is.
4. At this stage, all the objects can be read and parsed, or we can leave this process until each object is actually needed, reading it on demand.
5. We can now use the data, extracting the pages, parsing graphical content, extracting metadata, and so on.
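Steps 1 to 3 above can be sketched in Python. This is a minimal illustration rather than a real reader: it assumes a classic cross-reference table with a single section and no incremental updates, and it does not parse the trailer dictionary. All function names (`build_sample_pdf`, `read_header`, `find_startxref`, `read_xref_table`) are hypothetical, and the sample file is assembled in memory only so that the byte offsets are correct.

```python
import re


def build_sample_pdf() -> bytes:
    """Assemble a tiny PDF-like file in memory, computing real byte offsets."""
    data = b"%PDF-1.4\n"
    bodies = [
        b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n",
        b"2 0 obj\n<< /Type /Pages /Kids [] /Count 0 >>\nendobj\n",
    ]
    offsets = []
    for body in bodies:
        offsets.append(len(data))  # byte offset where this object starts
        data += body
    xref_pos = len(data)
    data += b"xref\n0 3\n0000000000 65535 f \n"
    for off in offsets:
        data += b"%010d 00000 n \n" % off  # each entry is 20 bytes long
    data += b"trailer\n<< /Size 3 /Root 1 0 R >>\n"
    data += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return data


def read_header(data: bytes) -> str:
    """Step 1: check the %PDF- marker and return the version string."""
    m = re.match(rb"%PDF-(\d+\.\d+)", data)
    if m is None:
        raise ValueError("not a PDF file: missing %PDF- header")
    return m.group(1).decode("ascii")


def find_startxref(data: bytes) -> int:
    """Step 2: search backward for %%EOF, then for the startxref offset."""
    eof = data.rfind(b"%%EOF")
    if eof == -1:
        raise ValueError("missing %%EOF marker")
    pos = data.rfind(b"startxref", 0, eof)
    if pos == -1:
        raise ValueError("missing startxref keyword")
    return int(data[pos + len(b"startxref"):eof].split()[0])


def read_xref_table(data: bytes, xref_pos: int) -> dict:
    """Step 3: map object number -> byte offset for in-use entries."""
    lines = data[xref_pos:].splitlines()
    if lines[0].strip() != b"xref":
        raise ValueError("expected 'xref' keyword at the given offset")
    table = {}
    i = 1
    while i < len(lines) and lines[i].strip() != b"trailer":
        first, count = (int(x) for x in lines[i].split())  # subsection header
        for n in range(count):
            offset, _generation, kind = lines[i + 1 + n].split()
            if kind == b"n":  # 'n' = in use, 'f' = free
                table[first + n] = int(offset)
        i += 1 + count
    return table


if __name__ == "__main__":
    pdf = build_sample_pdf()
    print("version:", read_header(pdf))
    print("object offsets:", read_xref_table(pdf, find_startxref(pdf)))
```

With the offset table in hand, step 4 amounts to seeking to `table[n]` and parsing the object found there, either eagerly for every entry or lazily when an object is first requested.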

This is not an exhaustive description, since there are many possible complications (encryption, linearization, object streams, and cross-reference streams).

The following recursive data structure, given in pseudocode, can hold a PDF object.

pdfobject ::= Null
            | Boolean of bool
            | Integer of int
            | Real of real
            | String of string
            | Name of string
            | Array of pdfobject array
            | Dictionary of (string, pdfobject) array    (* Array of (string, pdfobject) pairs *)
            | Stream of (pdfobject, bytes)               (* Stream dictionary and stream data *)
            | Indirect of int

For example, the object << /Kids [2 0 R] /Count 1 /Type /Pages >> might be represented as:

Dictionary
  ((Name (/Kids), Array (Indirect 2)),
   (Name (/Count), Integer (1)),
   (Name (/Type), Name (/Pages)))
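One possible translation of this data structure into Python is sketched below, using one small dataclass per variant. The class and field names, the `lookup` and `resolve` helpers, and the contents of the `objects` table are all assumptions made for illustration. Dictionary keys are stored as plain strings, following the type definition.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple, Union


@dataclass
class Null:
    pass

@dataclass
class Boolean:
    value: bool

@dataclass
class Integer:
    value: int

@dataclass
class Real:
    value: float

@dataclass
class String:
    value: str

@dataclass
class Name:
    value: str

@dataclass
class Array:
    items: List["PdfObject"]

@dataclass
class Dictionary:
    entries: List[Tuple[str, "PdfObject"]]  # list of (key, value) pairs

@dataclass
class Stream:
    dictionary: "PdfObject"  # the stream dictionary
    data: bytes              # the stream data

@dataclass
class Indirect:
    number: int              # object number of the referenced object


PdfObject = Union[Null, Boolean, Integer, Real, String, Name,
                  Array, Dictionary, Stream, Indirect]


def lookup(d: Dictionary, key: str) -> Optional[PdfObject]:
    """Return the value stored under key, or None if it is absent."""
    for k, v in d.entries:
        if k == key:
            return v
    return None


def resolve(obj: PdfObject, objects: Dict[int, PdfObject]) -> PdfObject:
    """Follow indirect references until a direct object is reached."""
    while isinstance(obj, Indirect):
        obj = objects[obj.number]
    return obj


# The object << /Kids [2 0 R] /Count 1 /Type /Pages >>
pages = Dictionary([
    ("/Kids", Array([Indirect(2)])),
    ("/Count", Integer(1)),
    ("/Type", Name("/Pages")),
])

# Hypothetical object table: object 2 is assumed to be a page dictionary.
objects: Dict[int, PdfObject] = {2: Dictionary([("/Type", Name("/Page"))])}

if __name__ == "__main__":
    first_kid = resolve(lookup(pages, "/Kids").items[0], objects)
    print(lookup(first_kid, "/Type"))
```

Keeping `Indirect` as its own variant, rather than substituting the referenced object in place, is what lets a reader defer loading: `resolve` is the only point at which the object table is consulted.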

Figure 3-1, shown earlier in the chapter, shows the object graph for the file in Example 3-1.
