Page Description Languages: PostScript and PDF (Digital Library) Part 1

The purpose of page description languages is to express typeset documents in a way that is independent of the particular output device used. Early word-processing programs and drawing packages incorporated code for sending documents to particular printers and could not be used with other devices. With the advent of page description languages, programs can generate documents in a device-independent format that will print on any device equipped with a driver for that language.

Most of the time, digital libraries can treat documents in these description languages by processing them using standard "black boxes": generate this report in a particular syntax, display it here, transfer it there, and print. However, to build coherent collections from the documents, you need internal knowledge of these formats to understand what can and cannot be accomplished: whether the text can be indexed, bookmarks inserted, images extracted, and so on. For this reason we describe two page description languages in detail.

PostScript, the first commercially developed page description language, was released in 1985, whereupon it was rapidly adopted by software companies and printer manufacturers as a platform-independent way of describing printed pages that could include both text and graphics. Soon it was being coupled with software applications (notably, in the early days, PageMaker on the Apple Macintosh) that ensure that "what you see" graphically on the computer’s raster display is "what you get" on the printed page. PDF, Portable Document Format, is a page description language that arose out of PostScript and addresses many of its shortcomings.


PostScript fundamentals

PostScript files comprise a sequence of drawing instructions, including ones that draw particular letters from particular fonts. The instructions are like this: move to the point defined by these x and y coordinates and then draw a straight line to here; using the following x and y coordinates as control points, draw a smooth curve around them with such-and-such a thickness; display a character from this font at this position and in this point size; display the following image data, scaled and rotated by this amount. Instructions are included to specify such things as page size, clipping away all parts of a picture that lie outside a given region, and when to move to the next page.

But PostScript is more than just a file format. It is a high-level programming language that supports diverse data types and operations on them. Variables and predefined operators allow the usual kinds of data manipulation. New operations can be encapsulated as user-defined functions. Data can be stored in files and retrieved. A PostScript document is more accurately referred to as a PostScript program. It is printed or displayed by passing it to a PostScript interpreter, a full programming language interpreter.

Being a programming language, PostScript allows print-quality documents that comprise text and graphical components to be expressed in an exceptionally versatile way. Ultimately, when interpreted, the abstract PostScript description is converted into a matrix of dots or pixels through a process known as rasterization or rendering. The dot structure is imperceptible to the eye—commonly available printers have a resolution of 600 dpi, and publishing houses use 1,200 dpi and above (see Table 4.2). This very book is an example of what can be described using the language.

Modern computers are powerful enough that a PostScript description can be quickly rasterized and displayed on the screen. This adds an important dimension to online document management: computers without the original software used to compose a document can still display the finished product exactly as it was intended to be displayed. Indeed, in the late 1980s one computer manufacturer, NeXT, took the idea to an extreme by developing an operating system (NeXTSTEP) in which the display was controlled entirely from PostScript, and all applications generated their on-screen results in this form.

However, PostScript was not designed for screen displays. As is true for ASCII, limitations often arise when a standard is put to use in situations for which it was not designed. Just as ASCII is being superseded by Unicode, the Portable Document Format (PDF) has been devised as the successor to PostScript for online documents. Today the Apple Macintosh uses PDF throughout, just as the NeXT used PostScript.

The language

PostScript is page-based. Graphical marks are drawn one by one until an operator called showpage is encountered, whereupon the page is presented. When one page is complete, the next is begun. Placement is like painting: if a new mark covers a previously painted area, it completely obliterates the old paint. Marks can be black and white, grayscale, or color. They are "clipped" to fit within a given area (not necessarily the page boundary) before being placed on the page. This process defines the imaging model used by PostScript.
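The painting behavior just described can be imitated on a tiny grid of characters. The following is only a sketch under simplifying assumptions: the helper names (new_page, paint, render_text), the grid size, and the letter "colors" are invented for illustration, rows are listed top first rather than from PostScript's bottom-left origin, and only rectangular marks and rectangular clip regions are handled.

```python
# Sketch of an opaque painting model with a rectangular clip region.
# All names here are illustrative; none of them is a PostScript operator.

def new_page(width, height, background="."):
    """Return a blank page as a list of rows of single-character 'colors'."""
    return [[background] * width for _ in range(height)]

def paint(page, x0, y0, x1, y1, color, clip=None):
    """Paint the rectangle [x0, x1) x [y0, y1) in one color.

    New paint overwrites whatever was there before (opaque model).
    clip, if given, is (cx0, cy0, cx1, cy1); pixels outside it are untouched.
    """
    height, width = len(page), len(page[0])
    if clip is None:
        clip = (0, 0, width, height)
    cx0, cy0, cx1, cy1 = clip
    for y in range(max(y0, cy0, 0), min(y1, cy1, height)):
        for x in range(max(x0, cx0, 0), min(x1, cx1, width)):
            page[y][x] = color

def render_text(page):
    """Join the rows into a printable string."""
    return "\n".join("".join(row) for row in page)

page = new_page(6, 4)
paint(page, 0, 0, 4, 3, "A")                     # first mark
paint(page, 2, 1, 6, 4, "B")                     # later mark covers the overlap
paint(page, 0, 0, 6, 4, "C", clip=(5, 0, 6, 1))  # full-page mark, clipped to one pixel
print(render_text(page))
```

Where the second rectangle overlaps the first, only the second survives, and the third mark touches nothing outside its clip region.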

Table 4.3 summarizes PostScript’s main graphical components. Various geometric primitives are supplied. Circles and ellipses can be produced using the arc primitive; general curves are drawn using splines, curved lines whose shapes are controlled precisely by a number of control points. A path is a sequence of graphical primitives interspersed with geometric operations and stylistic attributes. Once a path has been defined, it is necessary to specify how it is to be painted: for example, stroke for a line or fill for a solid shape. The moveto operator moves the pen without actually drawing, so that paths do not have to prescribe contiguous runs of paint. An operator called closepath forms a closed shape by generating a line from the latest point back to the last location moved to. The origin of coordinates is located at the bottom left-hand corner of a page, and the unit of distance is set to be one printer’s point, a typographical measure whose size is 1/72 inch.
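To make the roles of moveto, closepath, and the 1/72-inch unit concrete, here is a minimal sketch. The Path class and the points_to_pixels helper are hypothetical names chosen for illustration; the class mimics only three operators and ignores curves, painting, and coordinate transformations.

```python
POINTS_PER_INCH = 72  # one unit of distance is one printer's point, 1/72 inch

class Path:
    """Records straight segments; imitates moveto, lineto and closepath only."""

    def __init__(self):
        self.segments = []         # list of ((x0, y0), (x1, y1)) pairs
        self.current = None        # current point
        self.subpath_start = None  # the last location moved to

    def moveto(self, x, y):
        # Moves the pen without adding a segment.
        self.current = (x, y)
        self.subpath_start = (x, y)

    def lineto(self, x, y):
        self.segments.append((self.current, (x, y)))
        self.current = (x, y)

    def closepath(self):
        # Adds a line from the latest point back to the last moveto location.
        if self.current != self.subpath_start:
            self.segments.append((self.current, self.subpath_start))
        self.current = self.subpath_start

def points_to_pixels(value_in_points, dpi):
    """Convert a length in points to device pixels at the given resolution."""
    return value_in_points * dpi / POINTS_PER_INCH

path = Path()
path.moveto(72, 72)      # one inch in from the bottom left-hand corner
path.lineto(144, 72)
path.lineto(144, 144)
path.closepath()         # third side of the triangle is generated here
print(path.segments)
print(points_to_pixels(72, 600))
```

Two lineto calls plus closepath yield a three-sided closed shape, and a one-inch length (72 points) corresponds to 600 pixels on a 600-dpi device.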

Table 4.3: Graphical components in PostScript

Graphical primitives: straight lines, arcs, general curves, sampled images, and text

Geometrical operations: scale, translate, and rotate

Line attributes: width, dashed, start and end caps, joining lines/corner mitre (style)

Font attributes: font, typeface, size

Color: color currently in use

Paths: sequence of graphical primitives and attributes

Rendering: how to render paths (grayscale, color, or outline)

Clipping: restricts what is shown of the path

In PostScript, text characters are just another graphical primitive: they can be rotated, translated, scaled, and colored just like any other object. However, because of its importance, text receives special treatment. The PostScript interpreter stores information about the current font: font name, font type, point size, and so on, and operators like findfont and scalefont are provided to manipulate these components. There is also a special operator called image for sampled images.

Files containing PostScript programs are represented in 7-bit ASCII, but this does not restrict the fonts and characters that can be displayed on a page. A percentage symbol (%) indicates that the remainder of the line contains a comment; however, comments marked with a double percent (%%) extend the language by giving structured information that can be utilized by a PostScript interpreter.

Figure 4.18b shows a simple PostScript program that, when executed, produces the result in Figure 4.18a, which contains the greeting Welcome in five languages. The first line, which is technically a comment but must be present in all PostScript programs, defines the file to be of type PostScript.

Figure 4.18: (a) Result of executing a PostScript program; (b) the PostScript program; (c) Encapsulated PostScript version

Figure 4.18, cont’d: (d) PDF version; (e) network of objects in the PDF version

The next two lines set the font to be 14-point Helvetica, and then the current path is moved to a position (10,10) points from the lower left-hand corner of the page.

The five show lines display the Welcome text (plus a space). PostScript, unlike most computer languages, uses a stack-based form of notation where commands follow their arguments. The show commands "show" the text that precedes them; parentheses are used to group characters together into text strings. In the fifth example, the text Akw is "shown" or painted on the page; then there is a relative move (rmoveto) of the current position forward two printer’s points (the coordinate specification (2, 0)); then the character \310 is painted (octal 310, which is the dieresis, or umlaut, accent in PostScript’s standard encoding); the current position is moved back six points; and the characters aba are "shown." The effect is to generate the composite character ä in the middle of the word. Finally the showpage operator is issued, causing the graphics that have been painted on the virtual page to be printed on a physical page.

The PostScript program in Figure 4.18b handles the composite character ä inelegantly. It depends on the spacing embodied in the particular font chosen: moving forward two points, printing an umlaut, and moving back six points must happen to position the forthcoming a directly underneath. There are better ways to accomplish this, using, for instance, ISOLatin1Encoding or composite fonts, but they are beyond the scope of this simple example.
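The stack-based notation, in which operators follow their arguments, can be illustrated with a toy interpreter. This is a sketch, not a PostScript implementation: the run function is a hypothetical name, it understands only numbers, moveto, rmoveto, and add, and it leaves out show (which in a real interpreter also advances the current position by the width of each character painted).

```python
def run(program):
    """Execute a tiny postfix program; return (operand stack, current point)."""
    stack = []
    current = None
    for token in program.split():
        if token == "moveto":
            # Arguments were pushed as x then y, so y comes off the stack first.
            y = stack.pop()
            x = stack.pop()
            current = (x, y)
        elif token == "rmoveto":
            dy = stack.pop()
            dx = stack.pop()
            current = (current[0] + dx, current[1] + dy)
        elif token == "add":
            b = stack.pop()
            a = stack.pop()
            stack.append(a + b)
        else:
            # Anything else is treated as a number and pushed.
            stack.append(float(token))
    return stack, current

# Start at (10, 10), move forward two points, then back six.
stack, point = run("10 10 moveto 2 0 rmoveto -6 0 rmoveto")
print(point)
```

Each operator takes its arguments from the top of the stack, which is why the numbers precede the operator name in the program text.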

Evolution

Standards and formats evolve. There is a tension between stability, an important feature for any language, and currency, the need to extend in response to the ever-changing face of computing technology. To help resolve the tension, levels of PostScript are defined. The conventions a file conforms to are declared in its first line, as can be seen in Figure 4.18b (strictly speaking, PS-Adobe-3.0 identifies version 3.0 of the document structuring conventions described below rather than the language level itself). Care is taken to ensure that levels are backward-compatible.

What we have described so far is basic Level 1 PostScript. Level 2 includes

• improved virtual memory management

• device-independent color

• composite fonts

• filters.

The virtual memory enhancements use whatever memory space is available more efficiently, which is advantageous because PostScript printers sometimes run out of memory when processing large documents. Composite fonts, which significantly help internationalization, are described below. Filters provide built-in support for compression, decompression, and other common ways of encoding information.

Level 2 was announced in 1991, six years after PostScript’s original introduction. The additions were quite substantial, and it was a long time before it became widely adopted. Level 3 (sometimes called PostScript 3) was introduced in 1998. Its additions are minor by comparison, and include

• more fonts, and provision for describing them more concisely;

• improved color control and smoother shading;

• advanced processing methods that accelerate rendering.

While PostScript per se does not impose an overall structure on a document, applications can take advantage of a prescribed set of rules known as the document structuring conventions (DSC). These divide documents into three sections: a prolog, document pages, and a trailer. The divisions are expressed as PostScript "comments." For example, %%BeginProlog and %%Trailer define section boundaries. Other conventions are embedded in the document—such as %%BoundingBox, discussed below. There are around 40 document structuring commands in all.

Document structuring commands provide additional information about the document but do not affect how it is rendered. Since the commands are couched as comments, applications that do not use the conventions are unaffected. However, other applications can take advantage of the information.
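Because the conventions are couched as comments, an application can harvest them with a simple line scan and never interpret the PostScript itself. The sketch below assumes the common "%%Keyword: value" layout; read_dsc_comments is a hypothetical helper, and the sample text, including its title and bounding box numbers, is invented for illustration rather than copied from Figure 4.18.

```python
def read_dsc_comments(text):
    """Collect %%Keyword or %%Keyword: value comments from PostScript text."""
    comments = {}
    for line in text.splitlines():
        if not line.startswith("%%"):
            continue  # ordinary comments and PostScript code are skipped
        keyword, _, value = line[2:].partition(":")
        comments[keyword.strip()] = value.strip()
    return comments

# Invented sample, not the program shown in the figure.
sample = """%!PS-Adobe-3.0
%%Title: Welcome example
%%BoundingBox: 0 0 150 35
%%BeginProlog
% an ordinary comment, ignored by this reader
/Helvetica findfont 14 scalefont setfont
%%Trailer
"""

comments = read_dsc_comments(sample)
print(comments)
```

Keywords that mark section boundaries, such as %%BeginProlog and %%Trailer, carry no value and are stored with an empty string.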

Applications that generate PostScript, such as word processors, commonly use the prolog to define procedures that are helpful in generating document pages and use the trailer to tidy up any global operations associated with the document or to include information (such as a list of all fonts used) that is not known until the end of the file. This convention enables pages to be expressed more concisely and clearly.

Encapsulated PostScript

Encapsulated PostScript is a variant designed for expressing documents of a single page or less. It is widely used to incorporate artwork created using a software application like a drawing package into a larger document, such as a report being composed in a word processor. Encapsulated PostScript is built on top of the document-structuring conventions.

Figure 4.18c shows the Welcome example expressed in Encapsulated PostScript. The first line is augmented to reflect this (the encapsulation convention has levels as well; this is EPSF-3.0). The %%BoundingBox command that specifies the size of the drawing is mandatory in Encapsulated PostScript. Calculated in points from the origin (bottom left-hand corner), it defines the smallest rectangle that entirely encloses the marks constituting the rendered picture. The rectangle is specified by four numbers: the first pair give the coordinates of the lower left corner, and the second pair define the upper right corner. Figure 4.18c also shows document-structuring commands for the creator of the document (more commonly it gives the name and version number of the software application that generated the file), a suitable title for it, and a list of fonts used (in this case, just Helvetica).
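The bounding box calculation amounts to taking minima and maxima over every painted point. The sketch below assumes the marks have already been reduced to a list of (x, y) coordinates in points, and it rounds outward to whole numbers on the assumption that the comment is written with integers; both function names and the sample coordinates are hypothetical.

```python
import math

def bounding_box(points):
    """Smallest rectangle enclosing all points: (llx, lly, urx, ury)."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys), max(xs), max(ys))

def format_bounding_box(box):
    """Write the box as a comment, rounding outward so no mark is cut off."""
    llx, lly, urx, ury = box
    return "%%BoundingBox: {} {} {} {}".format(
        math.floor(llx), math.floor(lly), math.ceil(urx), math.ceil(ury)
    )

# Invented coordinates standing in for the extremes of a rendered picture.
marks = [(10, 10), (120.4, 8.2), (95, 24.7)]
box = bounding_box(marks)
print(format_bounding_box(box))
```

The first pair of numbers is the lower left corner and the second pair the upper right corner, matching the order described above.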

An Encapsulated PostScript file contains raw PostScript along with a few special comments. It can be embedded verbatim, header and all, into a context that is also PostScript. For this to work properly, operators that affect the global state of the rendering process must be avoided. These restrictions are listed in the specification for Encapsulated PostScript and in practice are not unduly limiting.

Fonts

PostScript supports two broad categories of fonts: base and composite fonts. Base fonts accommodate alphabets up to 256 characters. Composite fonts extend the character set beyond this point and also permit several glyphs to be combined into a single character—making them suitable for languages with large alphabets, such as Chinese, and with frequent character combinations, such as Korean.

In Figure 4.18b the findfont operator is used to set the font to Helvetica. This searches PostScript’s font directory for the named font (/Helvetica), returning a font dictionary that contains all the information necessary to render characters in that font. Most PostScript products have a built-in font directory with descriptions of 13 standard fonts from the Times, Helvetica, Courier, and Symbol families. Helvetica is an example of a base font.

The execution of a show command such as (Welcome) show takes place in two steps. For each character, its numeric value (0 to 255) is used to access an array known as the encoding vector. This provides a name, such as /W (or, for nonalphabetic characters, a name like /hyphen). This name is then used to look up a glyph description in a subsidiary dictionary. A name is one of the basic PostScript types: it is a label that binds itself to an object. The act of executing the glyph object renders the required mark. The font dictionary is a top-level object that binds these operations together.
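The two-step lookup can be mimicked with an ordinary list and dictionary. In this sketch the font is a hypothetical toy: only two characters are defined, glyph "descriptions" are functions that return a string instead of drawing, and the keys Encoding and CharStrings are used simply as convenient labels for the encoding vector and the subsidiary glyph dictionary.

```python
def make_toy_font():
    """Build a toy font dictionary containing two glyphs."""
    encoding = [".notdef"] * 256          # encoding vector: code -> glyph name
    encoding[ord("W")] = "W"
    encoding[ord("-")] = "hyphen"
    glyphs = {                            # glyph name -> executable description
        ".notdef": lambda: "(nothing)",
        "W": lambda: "outline of W",
        "hyphen": lambda: "outline of hyphen",
    }
    return {"Encoding": encoding, "CharStrings": glyphs}

def show_text(font, text):
    """Return (glyph name, rendered mark) for each character of text."""
    marks = []
    for ch in text:
        code = ord(ch)                      # numeric value, assumed to be 0..255
        name = font["Encoding"][code]       # step 1: encoding vector gives a name
        glyph = font["CharStrings"][name]   # step 2: name selects a description
        marks.append((name, glyph()))       # executing it "renders" the mark
    return marks

font = make_toy_font()
print(show_text(font, "W-W"))
```

Changing an entry in the encoding list redirects a character code to a different glyph without touching the glyph descriptions, which is the effect of changing the encoding vector.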

In addition to the built-in font directory, PostScript lets you provide your own graphical descriptions for the glyphs, which are then embedded in the PostScript file. You can also change the encoding vector.

Font formats

The original specification for PostScript included a means of defining typographical fonts. At the time there were no standard formats for describing character forms digitally. PostScript fonts, which were built into the LaserWriter printer in 1985 and subsequently adopted in virtually all typesetting devices, sparked a revolution in printing technology. However, to protect its investment, Adobe, the company that introduced PostScript, kept the font specification secret. This spurred Apple to introduce a new font description format six years later (and this format was subsequently adopted by the Windows operating system). Adobe then published its format.

Level 3 PostScript incorporates both ways of defining fonts. The original method is called Type 1; the rival scheme is TrueType. For example, Times Roman, Helvetica, and Courier are Type 1 fonts, while Times New Roman, Arial, and Courier New are the TrueType equivalents.

Technically, the two font description schemes have much in common. Both describe glyphs in terms of the straight lines and curves that make up the outline of the character. This means that standard geometric transformations—translation, scaling, rotation—can be applied to text as well as to graphic primitives. One difference between Type 1 and TrueType is the way in which curves are specified. Both use spline curves, but the former uses a kind of cubic spline called a Bezier curve, whereas the latter uses a kind of quadratic spline called a B-spline. From a user perspective these differences are minimal—but they do create incompatibilities.
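The difference in curve type can be seen by evaluating both forms directly. The sketch below treats one span of a quadratic outline as a single quadratic Bézier segment (an assumption made here for simplicity) and shows the standard degree-elevation formula that re-expresses it exactly as a cubic; the function names and control points are illustrative. The reverse conversion, from an arbitrary cubic to a single quadratic, is not exact in general.

```python
def quadratic_point(p0, p1, p2, t):
    """Point at parameter t on a quadratic curve with one control point p1."""
    u = 1 - t
    return tuple(u * u * a + 2 * u * t * b + t * t * c
                 for a, b, c in zip(p0, p1, p2))

def cubic_point(p0, p1, p2, p3, t):
    """Point at parameter t on a cubic curve with two control points p1, p2."""
    u = 1 - t
    return tuple(u ** 3 * a + 3 * u * u * t * b + 3 * u * t * t * c + t ** 3 * d
                 for a, b, c, d in zip(p0, p1, p2, p3))

def quadratic_to_cubic(p0, p1, p2):
    """Degree elevation: the cubic that traces the same curve as the quadratic."""
    c1 = tuple(a + 2 * (b - a) / 3 for a, b in zip(p0, p1))
    c2 = tuple(c + 2 * (b - c) / 3 for b, c in zip(p1, p2))
    return p0, c1, c2, p2

quad = ((0, 0), (50, 100), (100, 0))
cubic = quadratic_to_cubic(*quad)
for t in (0.0, 0.25, 0.5, 1.0):
    print(t, quadratic_point(*quad, t), cubic_point(*cubic, t))
```

Both evaluations agree at every parameter value up to floating-point rounding, even though the two descriptions store different control points.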

Both representations are resolution independent. Characters may be resized by scaling the outlines up or down—although a particular implementation may impose practical upper and lower limits. It is difficult to scale down to very small sizes. When a glyph comprises only a few dots, inconsistencies arise in certain letter features depending on where they are placed on the page, because even though the glyphs are the same size and shape, they sit differently on the pixel grid. For example, the width of letter stems may vary from one instance of a letter to another; worse still, when scaled down, key features may disappear altogether.

Both Type 1 and TrueType deal with this by putting additional information called hints into fonts to make it possible to render small glyphs consistently. However, the way that hints are specified is different in each case. Type 1 fonts give hints for vertical and horizontal features, overshoots, snapping stems to the pixel grid, and so on, and in many cases there is a threshold pixel size at which they are activated. TrueType hints define flexible instructions that can do much more. They give the font producer fine control over what happens when characters are rendered under different conditions, but to use them to full advantage, individual glyphs must be manually coded. This is such a daunting undertaking that, in practice, many fonts omit this level of detail. Of course this does not usually affect printed text, because even tiny fonts can be displayed accurately, without hinting, on a 600-dpi device. Hinting is only really important for screen displays.
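The small-size problem and the idea of snapping stems to the pixel grid can be sketched in one dimension. The model here is deliberately crude and is an assumption of this example, not the mechanism of either font format: a pixel is painted when its center lies inside the stem, and the "hint" simply rounds the stem's left edge and width to whole pixels.

```python
import math

def covered_pixels(left, width):
    """Pixel columns whose centers (i + 0.5) lie inside [left, left + width)."""
    right = left + width
    candidates = range(math.floor(left), math.ceil(right) + 1)
    return [i for i in candidates if left <= i + 0.5 < right]

def snapped_pixels(left, width):
    """Crude stand-in for a stem hint: round the edge and width to the grid."""
    snapped_left = round(left)
    snapped_width = max(1, round(width))   # never let the stem vanish
    return list(range(snapped_left, snapped_left + snapped_width))

# The same 1.4-pixel-wide stem at two different positions on the page.
for left in (0.2, 0.6):
    print(left, covered_pixels(left, 1.4), snapped_pixels(left, 1.4))
```

Without snapping, the identical stem comes out two pixels wide at one position and one pixel wide at the other; with snapping it is one pixel wide in both places.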

Composite fonts

Composite fonts became standard in Level 2 PostScript. They are based on two key concepts. First, instead of using a single dictionary for mapping character values, as base fonts do, composite fonts use a hierarchy of dictionaries. At its root, a composite font dictionary directs character mappings to subsidiary dictionaries, which can contain either base fonts or further composite fonts (up to a depth limit of five). Second, the show operator no longer decodes its argument one byte at a time. Instead, a font number and character selector pair are used. The font number locates a dictionary within the hierarchy, while the character selector uses the encoding vector stored with that dictionary to select a glyph description name to use when rendering the character. This latter step is analogous to the way base fonts are used.

The arguments of show can be decoded in several ways. Options include 16 bits per font number and character selector pair, separated into one byte each (note that this differs from a Unicode representation), or using an escape character to change the current font dictionary. The method used is determined by a value in the root dictionary.
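The first decoding option, 16 bits per pair with one byte for the font number and one for the character selector, can be sketched as follows. The root dictionary, its "fonts" key, the two toy base fonts, and the function names are all hypothetical; the hierarchy is only one level deep, and the escape-character option is not shown.

```python
def decode_pairs(data):
    """Split a show argument into (font number, character selector) pairs."""
    if len(data) % 2:
        raise ValueError("expected an even number of bytes")
    return [(data[i], data[i + 1]) for i in range(0, len(data), 2)]

def glyph_names(root, data):
    """Resolve each pair to a glyph name via the selected font's encoding."""
    names = []
    for font_number, selector in decode_pairs(data):
        base_font = root["fonts"][font_number]        # locate the dictionary
        names.append(base_font["Encoding"].get(selector, ".notdef"))
    return names

# Toy hierarchy: font 0 holds a Latin letter, font 1 a Greek one.
root = {
    "fonts": [
        {"Encoding": {0x41: "A"}},
        {"Encoding": {0x61: "alpha"}},
    ]
}

data = bytes([0, 0x41, 1, 0x61])   # (font 0, code 0x41), (font 1, code 0x61)
print(decode_pairs(data))
print(glyph_names(root, data))
```

The second step, from character selector to glyph name, is the same encoding-vector lookup that a base font performs on each single byte.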
