Page Description Languages: PostScript and PDF (Digital Library) Part 2

Compatibility with Unicode

Character-identifier keyed, or CID-keyed, fonts provide a newer format designed for use with Unicode. They map multiple byte values to character codes in much the same way that the encoding vector works in base fonts—except that the mapping is not restricted to 256 entries. The CID-keyed font specification is independent of PostScript and can be used in other environments. The data is also external to the document file: font and encoding-vector resources are accessed by reading external files into dictionaries.

OpenType is a new font description that goes beyond the provisions of CID-keyed fonts. It encapsulates Type 1 and TrueType fonts into the same kind of wrapper, yielding a portable, scalable font platform that is backward-compatible. The basic approach of CID-keyed fonts is used to map numeric identifiers to character codes. OpenType includes multilingual character sets with full Unicode support, and extended character sets that support small caps, ligatures, and fractions, all within the same font. It includes a way of automatically substituting a single glyph for a given sequence (e.g., the ligature fi can be substituted for the sequence f followed by i) and vice versa. Substitution can be context sensitive. For example, a swash letter, which is an ornamental letter (often a decorated italic capital) used to open paragraphs, can be introduced automatically at the beginning of words or lines.

Text extraction

It is useful to be able to extract plain text from PostScript files. To build a full-text index for a digital library, the raw text needs to be available. An approximation to the formatting information may be useful too—perhaps to display HTML versions of documents in a Web browser. For this, structural features like paragraph boundaries and font characteristics must be identified from PostScript.

PostScript allows complete flexibility in how documents are described—for example, the characters do not have to be in any particular order. In practice, actual PostScript documents tend to be more constrained. However, the text they contain is often fragmented and inextricably muddled up with other character strings that do not appear in the output. Figure 4.19 shows an example, along with the text extracted from it. Characters to be placed on the page appear in the PostScript file as parenthesized strings.

Figure 4.19: A PostScript document and the text extracted from it

But font names, file names, and other internal information are represented in the same way—examples can be seen in the first few lines of the figure. Also the division of text into words is not immediately apparent. Spaces are implied by the character positions rather than being present explicitly. Text is written out in fragments, and each parenthetical string sometimes represents only part of a word. Deciding which fragments to concatenate is difficult. Although heuristics might be devised to cover common cases, they are unlikely to lead to a robust solution that can deal satisfactorily with the variety of files found in practice.

This is why text extraction based on scanning a PostScript document for strings of text meets with limited success. It also fails to extract any formatting information. Above all, it does not address the fundamental issue that PostScript is a programming language whose output, in principle, cannot be determined merely by scanning the file—for example, in a PostScript document the raw text could be (and often is) compressed, to be decompressed by the interpreter every time the document is displayed. As it happens, this deep-rooted issue leads to a solution that is far more robust than scanning for text, can account for formatting information, and decodes any programmed information.

If a PostScript code fragment is prepended to a document and the document is then run through a standard PostScript interpreter, the placement of text characters can be intercepted, producing text in a file, rather than pixels on a page. The central trick is to redefine PostScript’s show operator, which is responsible for placing text on the page. Regardless of how a program is constructed, all printed text passes through this operator (or a variant, as mentioned later). The new code fragment redefines it to write its argument, a text string, to a file instead of rendering it on the screen. Then, when the document is executed, a text file is produced instead of the usual physical pages.

A simple text extraction program

The idea can be illustrated by a simple program. Prepending the incantation /show { print} def, shown in Figure 4.20a, to the document of Figure 4.19 redefines the show operator. The effect is to define the name show to be print instead—and therefore print the characters to a file. The result appears at the right of Figure 4.20a. One problem has been solved: winnowing the text destined for a page from the remainder of the parenthesized text in the original file.

The problem of identifying whole words from fragments must still be addressed, for the text in Figure 4.20a contains no spaces. Printing a space between each fragment yields the text in Figure 4.20b. Spaces do appear between each word, but they also appear within words, such as m ultiple and imp ortan t.

To put spaces in their proper places, it is necessary to consider where fragments are placed on the page. Between adjacent characters, the print position moves only a short distance from one fragment to the next; if a space intervenes, the distance is larger. An appropriate threshold will depend on the type size and should be chosen accordingly; however, we use a fixed value for illustration.

The program fragment in Figure 4.20c implements this modification. The symbol X records the horizontal coordinate of the right-hand side of the previous fragment. The new show procedure obtains the current x coordinate using the currentpoint operator (the pop discards the y coordinate) and subtracts the previous coordinate held in X. If the difference exceeds a preset threshold—in this case, five points—a space is printed. Then the fragment itself is printed.

Figure 4.20: Extracting text from PostScript: (a) printing all fragments rendered by show; (b) putting spaces between every pair of fragments; (c) putting spaces between fragments with a separation of at least five points; (d) catering for variants of the show operator

In order to record the new x coordinate, the fragment must actually be rendered. Unfortunately, Figures 4.20a and b have suppressed rendering by redefining the show operator. The line systemdict /show get exec retrieves the original definition of show from the system dictionary (systemdict /show get) and executes it (exec) with the original string as argument. This renders the text and updates the current point, which is recorded in X on the next line. Executing the original show operator provides a foolproof way of updating coordinates exactly as they are when the text is rendered. This new procedure produces the text in Figure 4.20c, in which all words are segmented correctly. Line breaks are detected by analyzing vertical coordinates in the same way and comparing the difference with another fixed threshold.

PostScript (to be precise, Level 1 PostScript) has four variants of the show command (ashow, widthshow, awidthshow, and kshow), and they should all be treated similarly. In Figure 4.20d, a procedure is defined to do the work. It is called with two arguments, the text string and the name of the appropriate show variant. Just before it finishes, the code for the appropriate command is located in the system dictionary and executed.

Improving the output

Notwithstanding the use of fixed thresholds for word and line breaks, this scheme is quite effective for extracting text from many PostScript documents. However, several enhancements can be made to improve the quality of the output. First, fixed thresholds fail when the text is printed in an unusually large or small font. With large fonts, interfragment gaps are mistakenly identified as interword gaps, and words break up. With small ones, interword gaps are mistaken for interfragment gaps, and words run together. To solve this problem, the word-space threshold can be expressed as a fraction of the average character width. This is calculated for the fragments on each side of the break by dividing the rendered width of the fragment by the number of characters in it. As a rule of thumb, the interword threshold should be about 30 percent greater than the average character width.
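A font-relative threshold might be sketched as follows in Python. The factor of 1.3 follows the rule of thumb just stated; pooling the two neighboring fragments into a single average is one possible reading of "the fragments on each side of the break" and is an assumption, as are the function name and the sample coordinates.

```python
from typing import List, Optional, Tuple

Fragment = Tuple[str, float, float]  # (text, x_start, x_end), hypothetical layout


def join_adaptive(fragments: List[Fragment], factor: float = 1.3) -> str:
    """Insert a space when the gap exceeds factor * average character width."""
    pieces: List[str] = []
    prev: Optional[Fragment] = None
    for frag in fragments:
        text, x_start, x_end = frag
        if prev is not None:
            p_text, p_start, p_end = prev
            chars = len(p_text) + len(text)
            width = (p_end - p_start) + (x_end - x_start)
            # Average character width over the fragments on both sides.
            mean_width = width / chars if chars else 0.0
            if x_start - p_end > factor * mean_width:
                pieces.append(" ")
        pieces.append(text)
        prev = frag
    return "".join(pieces)


# Large type: a 6-point gap inside a word would fool a fixed 5-point threshold.
large = [("Big", 0.0, 60.0), ("ger", 66.0, 126.0), ("type", 156.0, 236.0)]
# Small type: a 3-point gap between words would be missed by a fixed threshold.
small = [("fine", 0.0, 8.0), ("print", 11.0, 21.0)]

print(join_adaptive(large))
print(join_adaptive(small))
```

The factor is left as a parameter because the best value is likely to vary between document collections.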

Second, line breaks in PostScript documents are designed for typeset text with proportionally spaced fonts. The corresponding lines of plain text are rarely all of the same length. Moreover, the best line wrapping often depends on context—such as the width of the window that displays the text. Paragraph breaks, on the other hand, have significance in terms of document content and should be preserved. Line and paragraph breaks can be distinguished in two ways. Usually paragraphs are separated by more vertical space than lines are. In this case, any advance that exceeds the nominal line space can be treated as a paragraph break. The nominal spacing can be taken as the most common nontrivial change in y coordinate throughout the document.
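The vertical-spacing rule can be illustrated with a short Python sketch. It assumes each extracted line comes with its baseline y coordinate in PostScript's convention (y increases up the page); the cutoff for a "nontrivial" advance and the tolerance are hypothetical parameters, not values from the text.

```python
from collections import Counter
from typing import List, Tuple

Line = Tuple[float, str]  # (baseline y in points, text), hypothetical layout


def split_paragraphs(lines: List[Line], min_advance: float = 1.0,
                     tolerance: float = 0.5) -> List[str]:
    """Group lines into paragraphs using the most common line advance."""
    if not lines:
        return []
    # Downward advance between consecutive lines (y decreases down the page).
    advances = [round(prev_y - y, 1)
                for (prev_y, _), (y, _) in zip(lines, lines[1:])]
    nontrivial = [a for a in advances if a > min_advance]
    nominal = Counter(nontrivial).most_common(1)[0][0] if nontrivial else None
    paragraphs = [[lines[0][1]]]
    for advance, (_, text) in zip(advances, lines[1:]):
        if nominal is not None and advance > nominal + tolerance:
            paragraphs.append([text])   # larger gap: start a new paragraph
        else:
            paragraphs[-1].append(text)  # ordinary line break: rewrap
    return [" ".join(p) for p in paragraphs]


lines = [
    (700.0, "It is useful to"),
    (688.0, "extract plain text."),
    (664.0, "PostScript allows"),
    (652.0, "complete flexibility."),
]
print(split_paragraphs(lines))
```

Negative advances, such as the jump back to the top of a new page, are simply treated as ordinary line breaks in this sketch.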

Sometimes paragraphs are distinguished by horizontal indentation rather than vertical spacing. Treating indented lines as paragraph breaks sometimes fails, however—quotations and bulleted text are often indented too. Additional heuristics are needed to detect these cases. For example, an indented line may open a new paragraph if it starts with a capital letter; if its right margin and the right margin of the following line are at about the same place; and if the following line is not also indented. Although not infallible, these rules work reasonably well in practice.
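The three conditions above translate directly into a predicate. In this Python sketch the line representation, the function name, and both tolerances are assumptions chosen for illustration.

```python
from typing import Tuple

Line = Tuple[float, float, str]  # (left x, right x, text), hypothetical layout


def opens_paragraph(line: Line, next_line: Line, left_margin: float,
                    indent_tol: float = 2.0, margin_tol: float = 10.0) -> bool:
    """Apply the indentation heuristic to a line and its successor."""
    x_left, x_right, text = line
    next_left, next_right, _ = next_line
    indented = x_left - left_margin > indent_tol
    next_indented = next_left - left_margin > indent_tol
    return (
        indented
        and text[:1].isupper()                       # starts with a capital
        and abs(x_right - next_right) <= margin_tol  # right margins agree
        and not next_indented                        # following line is flush left
    )


margin = 72.0
new_para = ((90.0, 500.0, "Sometimes paragraphs"), (72.0, 498.0, "are distinguished"))
quotation = ((90.0, 480.0, "To be, or not"), (90.0, 478.0, "to be"))
bullet = ((90.0, 300.0, "\u2022 first item"), (72.0, 500.0, "Body text"))

print(opens_paragraph(*new_para, left_margin=margin))
print(opens_paragraph(*quotation, left_margin=margin))
print(opens_paragraph(*bullet, left_margin=margin))
```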

Third, more complex processing is needed to deal properly with different fonts. For instance, ligatures, bullets, and printer’s quotes (“ ” and ‘ ’ rather than " and ') are non-ASCII values that can be recognized and mapped appropriately. Mathematical formulas with complex sub-line spacing, Greek letters, and special mathematical symbols are difficult to deal with satisfactorily. A simple dodge is to flag unknown characters with a question mark, because there is no truly satisfactory plain-text representation for mathematics.

Fourth, when documents are justified to a fixed right margin, words are often hyphenated. Output will be improved if this process is reversed, but simply deleting hyphens from the end of lines inadvertently removes them from compound words that happen to straddle line breaks.
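The text does not prescribe a remedy. One possible heuristic, offered here purely as an assumption, is to drop an end-of-line hyphen only when the rejoined word appears in a word list. The Python sketch below uses a tiny hypothetical vocabulary; a real system would need a much larger one.

```python
from typing import Iterable, List, Set


def rejoin_hyphenated(lines: Iterable[str], vocabulary: Set[str]) -> str:
    """Undo end-of-line hyphenation, keeping hyphens in likely compounds."""
    out: List[str] = []
    carry = ""  # word fragment ending in "-" carried over from the previous line
    for line in lines:
        words = line.split()
        if carry and words:
            joined = carry[:-1] + words[0]
            # Look the candidate up without trailing punctuation.
            if joined.lower().strip(".,;:!?") in vocabulary:
                words[0] = joined             # ordinary hyphenation: drop hyphen
            else:
                words[0] = carry + words[0]   # probably a compound: keep hyphen
            carry = ""
        if words and len(words[-1]) > 1 and words[-1].endswith("-"):
            carry = words.pop()
        out.extend(words)
    if carry:
        out.append(carry)
    return " ".join(out)


vocabulary = {"important", "well", "known"}  # hypothetical word list
lines = ["This is an impor-", "tant and well-", "known result."]
print(rejoin_hyphenated(lines, vocabulary))
```

This still fails when both the compound and the run-together form are valid words, so it reduces the problem without eliminating it.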

Finally, printed pages often appear in reverse order. This is for mechanical convenience: when pages are placed face up on the output tray, the first one produced is the last page of the document. PostScript’s document-structuring conventions include a way of specifying page ordering, but it is often not followed in actual document files. Of several possible heuristics for detecting page order, a robust one is to extract numbers from the text adjacent to page breaks. These are usually page numbers, and you can tell that a document is reversed because they decrease rather than increase. Even if some numbers in the text are erroneously identified as page numbers, the method is quite reliable if the final decision is based on the overall majority.
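The majority decision can be sketched in a few lines of Python. The input is assumed to be one candidate number per page break, in file order; how those candidates are extracted is outside the scope of this illustration, and the function name is hypothetical.

```python
from typing import List


def is_reversed(candidates: List[int]) -> bool:
    """Return True if most consecutive candidate page numbers decrease."""
    pairs = list(zip(candidates, candidates[1:]))
    decreasing = sum(1 for a, b in pairs if b < a)
    increasing = sum(1 for a, b in pairs if b > a)
    return decreasing > increasing


# 1998 and 42 stand for numbers wrongly picked up from the running text.
print(is_reversed([12, 11, 10, 1998, 8, 7]))
print(is_reversed([1, 2, 3, 42, 5]))
```

A single spurious number disturbs at most two comparisons, which is why the vote remains reliable.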

Using PostScript in a digital library

Some early digital libraries were built from PostScript source documents, with contemporary versions shifting to PDF (discussed below) or a combination of the two. PostScript’s ability to display print-quality documents using a variety of fonts and graphics on virtually any computer platform is a wonderful feature. Because the files are 7-bit ASCII, they can be distributed electronically using lowest-common-denominator e-mail protocols. And although PostScript predates Unicode, characters from different character sets can be freely mixed because documents can contain many different fonts. Embedding fonts in documents makes them faithfully reproducible even when sent to printers and computer systems that lack the necessary fonts.

The fact that PostScript is a programming language, however, introduces problems that are not normally associated with documents. A document is a program. And programs crash for a variety of obscure reasons, leaving the user with at best an incomplete document and no clear recovery options. Although PostScript is supposed to be portable, in practice people often experience difficulty printing PostScript files, particularly on different computer platforms. When a document crashes, it does not necessarily mean that the file is corrupt. Just as subtle differences occur among compilers for high-level languages like C++, the behavior of PostScript interpreters can differ in unpredictable ways. Life was simpler in the early days, when there was one level of PostScript and a small set of different interpreters. Now, with the proliferation of PostScript support, any laxity in the code an application generates may not surface locally, but instead cause unpredictable problems at a different time on a computer far away.

Trouble often surfaces as a stack overflow or stack underflow error. Overflow means that the available memory has been exceeded on the particular machine that is executing the document. Underflow occurs when an insufficient number of elements are left on the stack to satisfy the operator currently being executed. For example, if the stack contains a single value when the add operator is issued, a stack underflow error occurs. Other complications can be triggered by conflicting definitions of what a "new-line" character means on a given operating system, something we have already encountered with plain text files. Although PostScript classes both the carriage-return and line-feed characters (CR and LF in Table 4.1) as white space (along with "tab" and "space," HT and SPAC, respectively), not all interpreters honor this.
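The two stack errors are easy to reproduce in a toy postfix evaluator. The Python sketch below is not a PostScript interpreter: it understands only numbers and add, and the stack limit of 500 is an arbitrary assumption standing in for the memory of a particular machine.

```python
from typing import List


class StackError(Exception):
    """Raised for the two stack failures described above."""


def run(tokens: List[str], max_depth: int = 500) -> List[float]:
    """Evaluate a sequence of numbers and 'add' operators in postfix order."""
    stack: List[float] = []
    for token in tokens:
        if token == "add":
            if len(stack) < 2:
                # Too few operands left for the operator being executed.
                raise StackError("stackunderflow")
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right)
        else:
            if len(stack) >= max_depth:
                # The (hypothetical) capacity of this machine is exhausted.
                raise StackError("stackoverflow")
            stack.append(float(token))
    return stack


print(run(["3", "4", "add"]))
try:
    run(["3", "add"])  # a single value on the stack when add is issued
except StackError as error:
    print("error:", error)
```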

PostScript versions of word-processor files are invariably far larger than the native format, particularly when they include uncompressed images. Level 1 does not explicitly provide compressed data formats. However, PostScript is a programming language and so this ability can be programmed in. A document can incorporate compressed data so long as it also includes a decompression routine that is called whenever the compressed data is handled. This keeps image data compact, yet retains Level 1 compatibility. Drawbacks are that every document duplicates the decompression program, and decompression is slow because it is performed by an interpreted program rather than a precompiled one. These are not usually serious. When the document is displayed online, only the current page’s images need be decompressed, and when it is printed, decompression is quick compared with the physical printing time. Note that PostScript-based digital library repositories commonly include Level 1 legacy documents.

The ideas behind PostScript make it attractive for digital libraries. However, there are caveats. First, it was not designed for online display. Second, if advantage is taken of additions and upgrades, such as those embodied in comments, encapsulated PostScript, and higher levels of PostScript, digital library users must upgrade their viewing software accordingly (or, more likely, some users will encounter mysterious errors when viewing certain documents). Third, extracting text for indexing purposes is not trivial, and the problem is compounded by international character sets and creative typography.

Portable Document Format: PDF

PDF is a page description language that arose out of PostScript and addresses its shortcomings. It has precisely the same imaging model. Page-based, it paints sequences of graphical primitives, modified by transformations and clipping. It has the same graphical shapes—lines, curves, text, and sampled images. Again, text and images receive special attention, as befits their leading role in documents. The concept of current path, stroked or filled, also recurs. PDF is device independent and expressed in ASCII.

There are two major differences between PDF and PostScript. First, PDF is not a full-scale programming language. (In reality, as we have seen, this feature limits PostScript’s portability.) Gone are procedures, variables, and control structures. Features like compression and encryption are built in; there is no opportunity to program them. Second, PDF includes new features for interactive display. The overall file structure is imposed, rather than being embodied in document-structuring conventions as with PostScript. This provides random access to pages, hierarchically structured content, and navigation within a document. Also, hyperlinks are supported.

There are many less significant differences. Operators are still postfix—that is, they come after their arguments—but their names are shorter and more cryptic, often only one letter, such as S for stroke and f for fill. To avoid confusion among the different conventions of different operating systems, the nature and use of white space are carefully specified. PDF files include byte offsets to other parts of the file and are always generated by software applications (rather than being written by hand as small PostScript programs occasionally are).

PDF has been through several versions since its introduction in 1993. In 1999, JavaScript support was added for greater interactivity (Version 1.3), and in 2003 (Version 1.5) support for JPEG 2000 was added (see Section 5.3). Adobe has released successive free versions of its Reader application (and associated browser plug-ins) to enable users to access the enhanced features. Although other readers are available, most users view PDF documents in Adobe applications. In 2008, PDF (Version 1.7) became an ISO international standard.

Different subsets of PDF have been defined to target specific user groups. Of particular relevance to digital librarians is the subset aimed at long-term archiving, known as PDF/A. PDF/A documents are intended to be self-contained and static. They require all fonts to be embedded, no use of external resources, no JavaScript actions, and device-independent color spaces. The first archival standard, PDF/A-1, is based on PDF version 1.4. The majority of applications that generate PDF provide options that allow users to save their document in whatever version they require. As with other evolving file formats, there is a trade-off between using the latest new features and ensuring that your documents can be widely read.

Inside a PDF file

Figure 4.18d is a PDF file that produces an exact replica of Figure 4.18a. The first line encodes the type and version as a comment, in the same way that PostScript does. Five lines near the end of the first column specify the text Welcome in several languages. The glyph ä is generated as the character \344 in the Windows extended 8-bit character set (selected by the line starting /Encoding in the second column), and Tj is equivalent to PostScript’s show. Beyond these similarities, the PDF syntax is far removed from its PostScript counterpart.

PDF files split into four sections: header, objects, cross-references, and trailer. The header is the first line of Figure 4.18d. The object section follows and accounts for most of the file. Here it comprises a sequence of six objects in the form <num> <num> obj … endobj; these define a graph structure (explained below). Then follows the cross-reference section, with numbers (eight lines of them) that give the position of each object in the file as a byte offset from the beginning. The first line says how many entries there are; subsequent ones provide the lookup information (we expand on this later). Finally comes the trailer, which specifies the root of the graph structure, followed by the byte offset of the beginning of the cross-reference section.

The object section in Figure 4.18d defines the graph structure in Figure 4.18e. The root points to a Catalog object (number 1), which in turn points to a Pages object, which points to (in this case) a single Page object. The Page object (number 3) contains a pointer back to its parent. Its definition in Figure 4.18d also includes pointers to Contents, which in this case is a Stream object that produces the actual text, and two Resources, one of which (Font, object 6) selects a particular font and size (14-point Helvetica), while the other (ProcSet, object 5) is an array called the procedure set array that is used when the document is printed.

A rendered document is the result of traversing this network of objects. Only one of the six objects in Figure 4.18d generates actual marks on the page (object 4, stream). Every object has a unique numeric identifier within the file (the first of the <num> fields). Statements such as 5 0 R (occurring in object 3) define references to other objects—object 5 in this case. The 0 that follows each object number is its generation number. Applications that allow documents to be updated incrementally alter this number when defining new versions of objects.
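References of the form 5 0 R can be picked out with a regular expression. The Python sketch below is a simplification, since a real parser would tokenize the object and would not be misled by similar text inside strings or streams. The page dictionary shown is a hypothetical stand-in, not a copy of object 3 in Figure 4.18d.

```python
import re
from typing import List, Tuple

# object number, generation number, then the keyword R
REFERENCE = re.compile(r"(\d+)\s+(\d+)\s+R\b")


def find_references(body: str) -> List[Tuple[int, int]]:
    """Return (object number, generation number) for each indirect reference."""
    return [(int(num), int(gen)) for num, gen in REFERENCE.findall(body)]


# Hypothetical page object, loosely modeled on the description in the text.
page = (
    "<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
    "/Contents 4 0 R "
    "/Resources << /ProcSet 5 0 R /Font << /F1 6 0 R >> >> >>"
)
print(find_references(page))
```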

Object networks are hierarchical graph structures that reflect the nature of documents. Of course they are generally far more complex than the simple example in Figure 4.18e. Most documents are composed of pages; many pages have a header, the main text, and a footer; documents often include nested sections. The physical page structure and the logical section structure usually represent parallel hierarchical structures, and the object network is specifically designed for describing such structures—indeed, any number of parallel structures can be built. These object networks are quite different from the linear interpretation sequence of PostScript programs. They save space by eliminating duplication (of headers and footers, for example). But most importantly they support the development of online reading aids that navigate around the structure and display appropriate parts of it, as described in the next subsection.

The network’s root is specified in the trailer section. The cross-reference section provides random access to all objects. Objects are numbered from zero upward (some, such as object 0, may not be specified in the object section). The cross-reference section includes one line for each, giving the byte offset of its beginning, the generation number, and its status (n means it is in use, f means it is free). Object 0 is always free and has a generation number of 65,535. Each line in the cross-reference section is padded to exactly 20 bytes with leading zeros.
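The fixed-width layout makes cross-reference entries simple to write and read. The following Python sketch formats and parses a single-subsection table. The byte offsets are invented for illustration, and the helper names format_entry and parse_xref are hypothetical.

```python
from typing import Dict, Tuple

Entry = Tuple[int, int, bool]  # (byte offset, generation number, in use?)


def format_entry(offset: int, generation: int, in_use: bool) -> str:
    """Build one 20-byte entry: 10-digit offset, 5-digit generation, status."""
    return f"{offset:010d} {generation:05d} {'n' if in_use else 'f'} \n"


def parse_xref(section: str) -> Dict[int, Entry]:
    """Parse a cross-reference section that contains a single subsection."""
    lines = section.splitlines()
    if lines[0] != "xref":
        raise ValueError("not a cross-reference section")
    first, count = (int(n) for n in lines[1].split())
    table: Dict[int, Entry] = {}
    for number, line in enumerate(lines[2:2 + count], start=first):
        offset, generation, status = line.split()
        table[number] = (int(offset), int(generation), status == "n")
    return table


# Invented offsets; the first entry describes object 0, which is free.
entries = [(0, 65535, False), (9, 0, True), (58, 0, True), (115, 0, True)]
section = "xref\n0 %d\n" % len(entries) + "".join(format_entry(*e) for e in entries)

print(section)
print(parse_xref(section))
```

Because every entry has the same length, a reader can seek straight to the entry for a given object without scanning the table.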

To render a PDF document, you start at the end. PDF files always end with %%EOF—otherwise they are malformed and an error is issued. The preceding startxref statement gives the byte offset of the cross-reference section, which shows where each object begins. The trailer statement specifies the root node.
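Reading from the end can be sketched in Python as follows. The file built here is a deliberately incomplete toy (a real catalog needs more entries), assembled only so that the offsets are consistent; the function name find_startxref is hypothetical.

```python
def find_startxref(data: bytes) -> int:
    """Return the byte offset of the cross-reference section."""
    tail = data.rstrip()
    if not tail.endswith(b"%%EOF"):
        raise ValueError("malformed PDF: missing %%EOF")
    marker = tail.rfind(b"startxref")
    if marker < 0:
        raise ValueError("malformed PDF: missing startxref")
    # The offset is the number between the startxref keyword and %%EOF.
    return int(tail[marker + len(b"startxref"):-len(b"%%EOF")].split()[0])


# Toy file, assembled so that the recorded offsets match the actual bytes.
header = b"%PDF-1.4\n"
catalog = b"1 0 obj\n<< /Type /Catalog >>\nendobj\n"
body = header + catalog
xref = (b"xref\n0 2\n"
        b"0000000000 65535 f \n"
        + b"%010d 00000 n \n" % len(header))
trailer = (b"trailer\n<< /Size 2 /Root 1 0 R >>\n"
           b"startxref\n" + b"%d\n" % len(body) + b"%%EOF\n")
data = body + xref + trailer

offset = find_startxref(data)
print(offset, data[offset:offset + 4])
```

From that offset a reader can load the cross-reference table, read the trailer to find the root, and then follow references from object to object.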

The example in Figure 4.18d contains various data types: number (integer or real), string (array of unsigned 8-bit values), name, array, dictionary, and stream. All but the last have their origin in PostScript. A dictionary is delimited by double angle brackets, << . . . >>—a notational convenience that was introduced in PostScript Level 2. The stream type specifies a "raw" data section delimited by stream … endstream. It includes a dictionary (delimited by angle brackets in object 4 of Figure 4.18d) that contains associated elements. The preceding /Length gives the length of the raw data, 118 bytes. Optional elements that perform processing operations on the stream may also be included—/Filter, for example, specifies how to decode it.
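The interplay of /Length and /Filter can be illustrated with Python's zlib module, whose output is compatible with PDF's FlateDecode filter. The stream object below is hypothetical and its content is not the 118-byte stream of Figure 4.18d; the dictionary handling is a shortcut, not a real PDF parser.

```python
import zlib

# A hypothetical page content stream: select a font, move, and show a string.
content = b"BT /F1 14 Tf 72 720 Td (Welcome) Tj ET"
compressed = zlib.compress(content)

stream_object = (
    b"4 0 obj\n<< /Length " + str(len(compressed)).encode("ascii")
    + b" /Filter /FlateDecode >>\nstream\n"
    + compressed
    + b"\nendstream\nendobj\n"
)


def decode_stream(obj: bytes) -> bytes:
    """Extract the raw stream bytes using /Length, then apply /Filter."""
    header, _, rest = obj.partition(b"stream\n")
    length = int(header.split(b"/Length")[1].split()[0])
    raw = rest[:length]  # /Length says how many bytes belong to the stream
    if b"/FlateDecode" in header:
        return zlib.decompress(raw)
    return raw


print(decode_stream(stream_object))
```

The reader relies on /Length rather than searching for endstream, because compressed data may contain any byte sequence.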

PDF has types for Boolean, date, and specialized composite types such as rectangle—an array of four numbers. There is a text type that contains 16-bit unsigned values that can be used for Unicode text (the UTF-16 variant described in topic 8), although non-Unicode extensions are also supported.

Features of PDF

The PDF object network supports a variety of different browsing features. Figure 4.21 shows a document (which is in fact the language reference manual) displayed using the Acrobat PDF reader. The navigation panel on the left presents a hierarchical structure of section headings known as bookmarks, which the user can expand and contract at will and use to bring up particular sections of the document in the main panel. This simply corresponds to displaying different parts of the object network tree illustrated in Figure 4.18e, at different levels of detail. Bookmarks are implemented using the PDF object type Outline.

Thumbnail pictures of each page can also be included in this panel. These images can be embedded in the PDF file at the time it is created, by creating new objects and linking them into the network. Some PDF readers are capable of generating thumbnail images on the fly even if they are not explicitly included in the PDF file. Hyperlinks can be placed in the main text so that you can jump from one document to another. For each navigational feature, corresponding objects must appear in the PDF network, such as the Outline objects mentioned earlier that represent bookmarks.

Figure 4.21: Reading a bookmark-enabled PDF document with Acrobat

PDF has a searchable image option that is particularly relevant to collections derived from paper documents. Using it, invisible characters can be overlaid on top of an image. Highlighting and searching operations utilize the hidden information, but the visual appearance is that of the image. Using this option, a PDF document can comprise the original scanned page, backed up by text generated by optical character recognition. Errors in the text do not mar the document’s appearance at all. The overall result combines the accuracy of image display with the flexibility of textual operations such as searching and highlighting. In terms of implementation, PDF files containing searchable images are typically generated as an output option by OCR programs (see Section 4.2). They specify each entire page as a single image, linked into the object network in such a way that it is displayed as a background to the text of the page.

There are many other interactive features. PDF provides a means of annotation that includes video and audio as well as text. Actions can be specified that launch an application. Forms can be defined for gathering fielded information. PDF has moved a long way from its origins in document printing, and today its interactive capabilities rival those of HTML.

Compression is an integral part of the language and is more convenient to use than the piecemeal development found in PostScript. It can be applied to individual stream components and helps reduce overall storage requirements and minimize download times—important factors for a digital library.

Linearized PDF

The regular PDF file structure makes it impossible to display the opening pages of documents until the complete file has been received. Even with compression, large documents can take a long time to arrive. Linearization is an extension that allows parts of the document to be shown before downloading finishes. Linearized PDF documents obey rules governing object order but include more than one cross-reference section.

The integrity of the PDF format is maintained: any PDF viewer can display linearized documents. However, applications can take advantage of the additional information to produce pages faster. The display order can be tailored to the document—the first pages displayed are not necessarily the document’s opening pages, and images can be deferred to later.

Security and PDF documents

The PDF document format has four features related to information security:

• encryption

• digital rights management

• phoning home

• redaction.

PDF files can be encrypted so that a password is needed to edit or view the contents. Two separate encryption systems are defined within PDF; it also includes a way in which third-party security schemes can be used for documents.

A separate facility is provided for PDF files to embed digital rights management (DRM) restrictions that can limit copying, editing, or printing. DRM restrictions provide only limited security, however, because they depend on the reader software to obey them. Alternatively, if you want to print a document that does not allow printing, you could use a screen-capture tool to capture the page images and print them. Of course, the resolution will suffer, but there are tools that convert PDF documents to high-resolution images, so this need not be a problem. If you need a fresh copy of the PDF document, just OCR the images.

Like HTML files, PDF files can submit information to a Web server. This facility could be used to make documents "phone home" when they are opened, and report the network (IP) address of the reader’s computer. Many PDF readers will notify the user via a dialogue box that the document’s supplier is auditing usage of the file and offer the option of withdrawing or continuing.

Redaction means removing information from documents. In the old days, secure redaction was achieved by physically cutting out parts of the text with scissors or knife and photocopying it against a black background. A similar effect can be achieved less securely using a black marker to strike through the text. However, as many people have discovered to their cost, covering up information in an electronic file is not the same as deleting it.

Whereas redacting paper documents is safe and easy, PDF files have created a trap for the unwary. The graphical tools available in Adobe Acrobat can be used to make it appear as though material has been redacted when in fact it has not, because the text remains in the PDF file and can still be extracted. For example, you might use a highlighter tool as a black marker, or use a rectangle tool to cover text. But these tools annotate; they do not redact. There have been several incidents where organizations have tried, and failed, to redact information in PDF files. For instance, in a 2008 legal case involving Facebook, some settlement details were kept confidential—the press were barred from the courtroom. However, when the improperly redacted PDF transcript was released, a simple copy-and-paste operation revealed the hidden text.

One sure way of redaction is to print out the document, redact the paper version, and scan it in again. Alternatively, you could visually cover the text using graphical tools, convert the PDF file to TIFF images (see Section 5.3), and convert these back to PDF. Finally, Adobe’s Acrobat Professional program allows true redaction on the PDF form of the document.

PDF and PostScript

PDF is a sophisticated document description language that was developed by Adobe as a successor to PostScript. It addresses various serious deficiencies that had arisen with PostScript, principally lack of portability. While PostScript is a programming language, PDF is a format, and this bestows the advantage of increased robustness in rendering. Also, PostScript has a reputation for verbosity that PDF has avoided (PostScript now incorporates compression, but not all software uses it). Another feature of PDF is that metadata can be included in PDF files using the Extensible Metadata Platform, XMP (see Section 6.3).

PDF incorporates additional features that support online display. Its design draws on expertise that ranges from traditional printing to hypertext and structured document display. It is a complex format that presents challenging programming problems. However, a wide selection of software tools is readily available. There are utilities that convert between PostScript and PDF. Because they share the same imaging model, the documents’ printed forms are equivalent. But PDF is not a full programming language, so when converting PostScript to it, loops and other constructs must be explicitly unwound. In the other direction, when a PDF file is converted to PostScript, its interactive features are lost.

Today, PDF is the format of choice for presenting finished documents online. But PostScript is pervasive. Any application that can print a document can save it as a PostScript file, whereas standard desktop environments sometimes lack software to generate PDF. From a digital library perspective, collections (for example, CiteSeer for scientific publications) frequently contain a mixture of PostScript and PDF documents. The problems of extracting text for indexing purposes are similar and can be solved in the same way. Some software viewers can display both formats.