Word-Processor Documents (Digital Library) Part 1

When word processors store documents, they do so in ways that are specifically designed to support editing. Microsoft Word—currently a leading product—is a good example. Three different styles of document format are associated with Word: Rich Text Format (RTF), a widely published specification dating from 1987; a proprietary internal format that we call simply native Word, which has evolved over many years; and an XML-based format called Microsoft Office Open XML, which is planned for 2010. We also describe the Open Document format, which is based on XML, XSL, and associated standards, and does not differ markedly from Microsoft’s product. We end this discussion by describing an example of a completely different style of document description language, LaTeX, which is widely used in the scientific and mathematical community. LaTeX is very flexible and is capable of producing documents of excellent quality; however, it has the reputation of being difficult to learn and unsuitable for casual use.

RTF is designed to allow word-processor documents to be transferred between applications. Like PostScript and PDF, it uses ASCII text to describe page-based documents that contain a mixture of formatted text and graphics. Unlike them, it is specifically designed to support the editing features we have come to expect in word processors. For example, when Word reads an RTF document generated by WordPerfect or OpenOffice (or vice versa), the file must contain enough information to allow the program to edit the text, change the typesetting, and adjust the pictures and tables. This contrasts with PostScript, where the typography of the document might as well be engraved on a Chinese stone tablet. PDF, as we have seen, supports limited forms of editing—adding annotations or page numbers, for example—but is not designed to have anything approaching the flexibility of RTF.

Many online documents are in native Word format, a binary format that is more compact than RTF, and thus yields faster download and display times. Native Word also supports a wider range of features and is tightly coupled with Internet Explorer, Microsoft’s Web browser, so that a Web-based digital library using Word documents can provide a seamless interface. But native Word has disadvantages. Non-Microsoft communities may be locked out of digital libraries unless other formats are offered. Although documents can be converted to forms like HTML using scriptable utilities, Word’s proprietary nature makes this challenging—and it is hard to keep up to date. Even Microsoft products sometimes can’t read Word documents properly; indeed, opening a file in a version of Word other than the one with which it was created can cause incorrect display of the document. Native Word is really a family of formats rather than a single one and it has nasty legacy problems.

Rendering documents in XML ensures—at least in theory—that they can be read even when the software that created them is not available, provided that details of the format have been published. Microsoft has been working on XML versions of its internal document formats since before 2000, and a new file format for Word was announced in 2002. In 2004, the European Union recommended that Microsoft standardize this, and the following year Microsoft announced it would do so. The result is Office Open XML (OOXML), which was approved as an ISO international standard in 2008. At the time of writing, Microsoft Office 14 (the working title of the next version) has been billed as the first version to support this standard. It also appeared that a service pack would be released to enable Word 2007 to read and write ISO standard OOXML files.

The Open Document Format for Office Applications (ODF) is another standard that represents word-processor documents in XML format. It was created under the auspices of the Organization for the Advancement of Structured Information Standards and is supported by several office productivity tools, most notably the open-source OpenOffice suite. Both ODF and OOXML are billed as a good solution for long-term document preservation, and in fact the differences between them are not large. However, the debate between their proponents is heated.

Rich Text Format: RTF

Figure 4.22a recasts the Welcome example of Figure 4.18 in minimal RTF form. It renders the same text in the same font and size as the PostScript and PDF versions, although it relies on defaults for such things as margins, line spacing, and foreground and background colors.

RTF uses the backslash (\) to denote the start of formatting commands. Commands contain letters only, so when a number (positive or negative) occurs, it is interpreted as a command parameter; thus \yr2001 invokes the \yr command with the value 2001. The command name can also be delimited by a single space, and any symbols that follow, even subsequent spaces, are part of the parameter. For example, {\title Welcome example} is a \title command with the parameter Welcome example.
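The command syntax just described can be captured in a few lines of Python. This is only a minimal sketch: the function name parse_control_word is our own, and the pattern covers commands made of letters with an optional numeric parameter, not every construct an RTF reader must handle.

```python
import re

# Control word: backslash, ASCII letters, optional signed integer parameter,
# and an optional single space that acts as a delimiter.
CONTROL_WORD = re.compile(r"\\([a-zA-Z]+)(-?\d+)? ?")

def parse_control_word(text, pos=0):
    """Return (name, parameter, next_pos) for the control word at text[pos]."""
    match = CONTROL_WORD.match(text, pos)
    if match is None:
        raise ValueError("no control word at position %d" % pos)
    name, number = match.group(1), match.group(2)
    parameter = int(number) if number is not None else None
    return name, parameter, match.end()

print(parse_control_word(r"\yr2001"))        # ('yr', 2001, 7)
print(parse_control_word(r"\fs28 Welcome"))  # ('fs', 28, 6)
```

In the second call the single space after \fs28 is consumed as the delimiter, so the text Welcome begins at position 6.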

Figure 4.22: More ways of producing the document of Figure 4.18a: (a) RTF specification; (b) OpenDocument

Braces {…} group together logical units, which can themselves contain further groups. This allows hierarchical structure and permits the effects of formatting instructions to be lexically scoped. An inheritance mechanism is used. For example, if an instruction is not explicitly specified at the current level of the hierarchy, a value that is specified at a higher level will be used instead.
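One common way to implement this scoping is a stack of formatting states, sketched below. The token stream is a simplification of our own (braces as strings, commands as tuples, text as plain strings), and the starting size of 24 half points is an assumption of the sketch.

```python
def font_sizes_by_run(tokens):
    """Track the active font size through nested groups.

    tokens is a simplified, pre-tokenized stream (an assumption of this
    sketch): "{" and "}" for groups, ("fs", n) for a size command in half
    points, and plain strings for text.
    """
    stack = [{"fs": 24}]                    # assumed default: 24 half points
    runs = []
    for token in tokens:
        if token == "{":
            stack.append(dict(stack[-1]))   # inner group inherits the outer state
        elif token == "}":
            stack.pop()                     # leaving a group restores the outer state
        elif isinstance(token, tuple):
            name, value = token
            stack[-1][name] = value         # a command changes the current level only
        else:
            runs.append((token, stack[-1]["fs"]))
    return runs

tokens = ["{", ("fs", 28), "Welcome ", "{", ("fs", 40), "big", "}", "back", "}"]
print(font_sizes_by_run(tokens))
# [('Welcome ', 28), ('big', 40), ('back', 28)]
```

The text back returns to size 28 because the inner group's change is discarded when its closing brace is reached.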

Line 1 of Figure 4.22a gives an RTF header and specifies the character encoding (ANSI 7-bit ASCII), default font number (0), and a series of fonts that can be used in the document’s body. The remaining lines represent the document’s content, including some basic metadata. On line 3, in preparation for generating text, \pard sets the paragraph mode to its default, while \plain initializes the font character properties. Next, \f1 makes entry 1 in the font table—which was set to Helvetica in the header—the currently active font. This overrides the default, set up in our example to be font entry 0 (Times Roman). Following this, the command \fs28—whose units are measured in half points—sets the character size to 14 points.

Text that appears in the body of an RTF file but is not a command parameter is part of the document content and is rendered accordingly. Thus lines 4 through 8 produce the greeting in several languages. Characters outside the 7-bit ASCII range are accessed using backslash commands. Unicode is specified by \u: here we see it used to specify the decimal value 228, which is LATIN SMALL LETTER A WITH DIAERESIS, the fourth letter of Akwäba.

This is a small example. Real documents have headers with more controlling parameters, and the body is far larger. Even so, this example is enough to illustrate that RTF, unlike PostScript, is not intended to be laid out visually. Rather, it is designed to make it easy to write software tools that parse document files quickly and efficiently.

Additions are backward-compatible to avoid disturbing existing files. In Figure 4.22a’s opening line, the numeric parameter in \rtf1 gives the version number, 1. The format has grown rapidly because, as well as keeping up with developments like Unicode in the print world, it must support an ever-expanding set of word-processor features, a trend that continues.

Basic types

Now we flesh out some of the details. While RTF’s syntax has not changed since its inception, the command repertoire continues to grow. There are five basic types of command: flag, toggle, value, destination, and symbol.

A flag command has no argument. (If present, arguments are ignored.) One example is \box, which generates a border around the current paragraph; another is \pard, which, as we have seen, sets the paragraph mode to its default. A toggle command has two states. No argument (or any nonzero value) turns it on; zero turns it off. For example, \b and \b0 switch boldface on and off, respectively. A value command sets a variable to the value of the argument. The \deff0 in Figure 4.22a is a value command that sets the default font to entry zero in the font table.
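The difference between flag, toggle, and value commands can be made concrete with a small dispatcher. This is a sketch only: the state dictionary and the function apply_command are hypothetical, and just four of the commands mentioned in the text are handled.

```python
def apply_command(state, name, parameter=None):
    """Apply one flag, toggle, or value command to a formatting state dict."""
    if name == "pard":                  # flag: any argument is ignored
        state["paragraph"] = "default"
    elif name == "b":                   # toggle: absent or nonzero is on, zero is off
        state["bold"] = parameter is None or parameter != 0
    elif name in ("deff", "fs"):        # value: store the argument
        state[name] = parameter
    return state

state = {}
apply_command(state, "b")               # \b
print(state["bold"])                    # True
apply_command(state, "b", 0)            # \b0
print(state["bold"])                    # False
apply_command(state, "fs", 28)          # \fs28, measured in half points
print(state["fs"] / 2)                  # 14.0
```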

A destination command has a text parameter. That text may be used elsewhere, at a different destination (hence the command’s name), or not at all. For example, text given to the \footnote command appears at the bottom of the page; the argument supplied to \author defines metadata that does not actually appear in the document. Destination commands must be grouped in braces with their parameter, which might itself be a group. Both commands specified in {\info{\title Welcome example}} are destination commands.

A symbol command represents a single character. For instance, \bullet generates the bullet symbol (•), and \{ and \} produce braces, escaping their special grouping property in RTF.

Backward compatibility

An important symbol command that was built in from the beginning is \*. Placed in front of any destination command, it signals that if the command is unrecognized it should be ignored. The aim is to maintain backward compatibility with old RTF applications.

For instance, there was no Unicode when RTF was born. An old application would choke on the Welcome example of Figure 4.22a because the \u command is a recent addition. In fact it would ignore it, producing Akwba—not a good approximation.

The \* command provides a better solution. As well as \u, two further new commands are added for Unicode support. Rather than generating Akwäba by writing Akw\u228ba (which works correctly if Unicode support is present but produces Akwba otherwise), one instead writes

{\upr{Akwaba}{\*\ud{Akw\u228ba}}}

The actions performed by the two new commands \upr and \ud are very simple, but before we reveal what they are, consider the effect of this command sequence on an older RTF reader that does not know about them. Unknown commands are ignored but their text arguments are printed, so when the reader works its way through the two destination commands, the first generates the text Akwaba while the second is ignored because it starts with \*. This text is a far more satisfactory approximation than Akwba. Now consider the action of a reader that knows how to process these directives. The first directive, \upr, ignores its first argument and processes the second one. The second directive, \ud, just outputs its argument—it is really a null operator and is only present to satisfy the constraint that \* is followed by a destination command.
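The behavior of the two kinds of reader can be simulated with a toy interpreter. The sketch below assumes the command sequence has the form {\upr{Akwaba}{\*\ud{Akw\u228ba}}}, which matches the description above; the functions parse and render are our own. It follows the simplified account given here and does not model the fallback-character handling that the RTF specification attaches to \u.

```python
import re

TOKEN = re.compile(r"\\\*|\\([a-z]+)(-?\d+)? ?|[{}]|[^\\{}]+")

def parse(rtf):
    """Parse a tiny RTF fragment into nested lists of tokens."""
    root, stack = [], []
    current = root
    for match in TOKEN.finditer(rtf):
        token = match.group(0)
        if token == "{":
            stack.append(current)
            group = []
            current.append(group)
            current = group
        elif token == "}":
            current = stack.pop()
        elif token == "\\*":
            current.append(("*", None))
        elif match.group(1):
            number = match.group(2)
            current.append((match.group(1), int(number) if number else None))
        else:
            current.append(token)
    return root

def render(group, known):
    """Render a group to text; known is the set of commands the reader supports."""
    items = list(group)
    if items and items[0] == ("*", None):
        # Ignorable destination: skip the whole group if the command is unknown.
        if len(items) < 2 or not isinstance(items[1], tuple) or items[1][0] not in known:
            return ""
        items = items[1:]
    if items and items[0] == ("upr", None) and "upr" in known:
        # \upr ignores its first argument and processes the second.
        groups = [item for item in items if isinstance(item, list)]
        return render(groups[1], known)
    out = []
    for item in items:
        if isinstance(item, list):
            out.append(render(item, known))
        elif isinstance(item, tuple):
            name, number = item
            if name == "u" and "u" in known:
                out.append(chr(number))
            # Any other command contributes no text in this sketch.
        else:
            out.append(item)
    return "".join(out)

fragment = r"{\upr{Akwaba}{\*\ud{Akw\u228ba}}}"
print(render(parse(fragment), set()))                 # old reader: Akwaba
print(render(parse(fragment), {"upr", "ud", "u"}))    # new reader: Akwäba
print(render(parse(r"Akw\u228ba"), set()))            # old reader, no fallback: Akwba
```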

File structure

Figure 4.23 shows the structure of an RTF file. Braces enclose the entire description, which is divided into two parts: header followed by body. We have already encountered some header components; there are many others. A commonly used construct is table, which reserves space and initializes data—the font table, for example. The table command takes a sequence of items—each a group in its own right, or separated using a delimiter, such as semicolon—and stores the information away so that other parts of the document can access it. A variety of techniques are deployed to retrieve the information. In a delimited list, an increasing sequence of numeric values is implied for storage, while other tables permit each item to designate its numeric label, and still others support textual labels.

The first command in the header must be \rtf, which encodes the version number followed by the character set used in the file. The default is ASCII, but other encoding schemes can be used. Next, font data is initialized. There are two parts: the default font number (optional) and the font table (mandatory). Both appear in the Welcome example, although the font table has many more capabilities, including the ability to embed fonts.

Figure 4.23: Structure of an RTF file

The remaining tables are optional. The file table is a mechanism for naming related files and is only used when the document consists of subdocuments in separate files. The color table comprises red, green, and blue value commands, which can then be used to select foreground and background colors through the commands \cf1 and \cb2, respectively. The style sheet is also a form of table. It corresponds to the notion of styles in word processing. Each item specifies a collection of character, paragraph, and section formatting. Items can be labeled; they may define a new style or augment an existing one. When specified in the document body, the appropriate formatting instructions are brought to the fore. List tables provide a mechanism for bulleted and enumerated lists (which can be hierarchically nested). Revision tables provide a way of tracking revisions of a document by multiple authors.
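The color table is a convenient illustration of a delimited list with implied indexes. The following sketch parses the body of a color table; the function name and the sample colors are hypothetical, and treating an empty entry as "let the application choose" reflects common practice rather than anything stated above.

```python
import re

def parse_color_table(body):
    """Map implied indexes to (red, green, blue) tuples.

    body is the text that follows the colortbl command inside its group.
    Entries are separated by semicolons, and the index of each entry is
    simply its position in the list.
    """
    colors = {}
    entries = body.split(";")[:-1]      # text after the final ";" is not an entry
    for index, entry in enumerate(entries):
        values = {name: int(number)
                  for name, number in re.findall(r"\\(red|green|blue)(\d+)", entry)}
        if values:
            colors[index] = (values.get("red", 0),
                             values.get("green", 0),
                             values.get("blue", 0))
        else:
            colors[index] = None        # empty entry: left to the application
    return colors

table = parse_color_table(r";\red255\green0\blue0;\red0\green0\blue255;")
print(table[1])   # (255, 0, 0), the foreground selected by \cf1
print(table[2])   # (0, 0, 255), the background selected by \cb2
```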

The document body contains three parts, shown in Figure 4.23: top-level information, document formatting, and a sequence of sections (there must be at least one). It begins with an optional information group that specifies document-level metadata—in our example this was used to specify the title. There are over 20 fields, among them author, organization, keywords, subject, version number, creation time, revision time, last time printed, number of pages, and word count.
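For a digital library, the information group is a ready source of metadata. As a rough sketch, simple text-valued fields can be pulled out with a regular expression; the function name and the sample author are invented for illustration, and fields whose values are themselves commands (such as times) are deliberately ignored.

```python
import re

def parse_info_group(info_body):
    """Return text-valued metadata fields found in the body of an info group."""
    fields = {}
    # Matches groups of the form {\name some text} that contain no nested braces.
    for name, value in re.findall(r"\{\\([a-z]+) ([^{}]*)\}", info_body):
        fields[name] = value
    return fields

body = r"{\title Welcome example}{\author A. N. Author}"
print(parse_info_group(body))
# {'title': 'Welcome example', 'author': 'A. N. Author'}
```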

Next comes a sequence of formatting commands (also optional). Again there are dozens of possible commands: they govern such things as the direction of the text, how words are hyphenated, whether backups are made automatically, and the default tab size (measured in twips, an interestingly named unit that corresponds to one-twentieth of a point).
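Since a twip is one-twentieth of a point, and there are 72 points to the inch, converting RTF measurements is simple arithmetic. The helper names below are our own, and the value 720 is just an example.

```python
TWIPS_PER_POINT = 20
POINTS_PER_INCH = 72

def twips_to_points(twips):
    return twips / TWIPS_PER_POINT

def twips_to_inches(twips):
    return twips / (TWIPS_PER_POINT * POINTS_PER_INCH)

print(twips_to_points(720))   # 36.0
print(twips_to_inches(720))   # 0.5
```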

Finally, the last part of the body specifies the document text. Even here the actual text is surrounded by multiple layers of structure. First the document can be split into a series of sections, each of which consists of paragraphs (at least one). Sections correspond to section breaks inserted by an author using, for instance, Microsoft Word. Sections and paragraphs can both begin with formatting instructions. For sections, formatting instructions control such things as the number and size of columns on a page, page layout, page numbering, borders, and text flow and are followed by commands that specify headers and footers. For paragraphs, formatting instructions include tab settings, revision marks, indenting, spacing, borders, shading, text wrapping, and so forth. Eventually you get down to the actual text. Further formatting instructions can be interspersed to change such things as the active font size.

Other features

So far we have seen how RTF specifies typographic text, based around the structure of sections and paragraphs. It has many other features. Different sampled image formats are supported, including open standards like JPEG and PNG (see Section 5.3) and proprietary formats like Microsoft’s Windows Metafile and Macintosh’s PICT. The raw image data can be specified in hexadecimal using plain text (the default) or as raw binary—in which case care must be taken when transferring the file between operating systems (recall the discussion of FTP’s new-line handling in Section 4.1).
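When image data is stored as hexadecimal text, recovering the original bytes is straightforward. The sketch below assumes the hexadecimal digits have already been isolated from the surrounding picture commands; the function name is hypothetical, and the sample digits are the first eight bytes of a PNG file.

```python
def decode_hex_picture(hex_text):
    """Decode picture data written as hexadecimal digits, ignoring line breaks."""
    digits = "".join(hex_text.split())
    return bytes.fromhex(digits)

data = decode_hex_picture("89504e47\n0d0a1a0a")
print(data[:4])   # b'\x89PNG'
```

Because the hexadecimal form uses only printable characters, it is unaffected by new-line conversion, which is why it is the safer choice for transfer between systems.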

Built into many word processors are tools that draw lines, boxes, arcs, splines, filled-in shapes, text, and other vector graphic primitives. RTF contains over 100 commands to draw, color, fill, group, and transform such shapes. The resulting shapes resemble the graphical shapes that can be described in PostScript and PDF.

Authors use annotations to add comments to a document. RTF can embed within a paragraph a destination command with two parts: a comment and an identifying label (typically used to name the person responsible for the annotation).

Field entities introduce dynamically calculated values, interactive features, and other objects requiring interpretation. They are used to embed today’s date, the current page number, mathematical equations, and hyperlinks into a paragraph. They bind a field instruction command together with its most recently calculated value—which provides backup should an application fail to recognize the field. Accompanying parameters influence what information is displayed, and how. Fields allow metadata such as title and author to be associated with a document, and this information is stored in the RTF file in the form of an \info command. RTF uses the field mechanism to support indexes and a table of contents.

In a word-processor document, bookmarks are a means of navigation. RTF includes begin- and end-bookmark commands that mark segments of the text along with text labels, accessible through the word-processing application. Microsoft has a scheme called object linking and embedding (OLE) that places information created by one application within another. For example, an Excel document can be incorporated into a Word file and still function as a spreadsheet. RTF calls such entities objects and provides commands that wrap the data with basic information, such as the object’s width and height and whether it is linked or embedded.

Commands in the document format section control the overall formatting of footnotes (which in RTF terminology includes endnotes). The footnote command is then used within paragraphs to provide a footnote mark and the accompanying text.

RTF tables are produced by commands that define cells, rows, and the table itself. Formatting commands control each component’s dimensions and govern how text items are displayed—e.g., pad all cells by 20 twips, set this cell’s width to 720 twips and center its text, and so on. However, there is a twist. While the other entities described earlier are embedded within a paragraph, an RTF table is a paragraph and cannot be embedded in one—this definition reflects the practice visible in Word, where inserting a table always introduces a new paragraph.

Using RTF in a digital library

When building a digital library collection from RTF documents, the format’s editable nature is of minor importance. Digital libraries generally deal with completed documents—information that is ready to be shared with others. What matters is how to index the text and display the document.

To extract rudimentary text from an RTF file, simply ignore the backslash commands. The quality of the output improves if other factors, such as the character set, are taken into account. Ultimately, full-text extraction involves parsing the file. RTF was designed to be easy to parse. Three golden rules are emphasized in the specification:

• Ignore control words that you don’t understand.

• Always understand \*.

• Remember that binary data can occur when skipping over information.
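The three rules translate directly into a simple text extractor. What follows is a minimal sketch, not a complete RTF reader. It assumes the file has been decoded so that each byte is one character (for example as Latin-1), it treats every \* destination as unknown, the list of skipped destinations is partial, and hexadecimal character escapes are not handled. The sample document is a hypothetical fragment, not the contents of Figure 4.22a.

```python
import re

CONTROL = re.compile(r"\\([a-zA-Z]+)(-?\d+)? ?")
# Destinations whose text is not part of the visible body (a partial, assumed list).
SKIPPED = {"fonttbl", "colortbl", "stylesheet", "info", "pict"}

def extract_text(rtf):
    out = []
    skip_depth = None       # group depth at which skipping started, or None
    depth = 0
    i = 0
    while i < len(rtf):
        ch = rtf[i]
        if ch == "{":
            depth += 1
            i += 1
        elif ch == "}":
            if skip_depth is not None and depth == skip_depth:
                skip_depth = None
            depth -= 1
            i += 1
        elif ch == "\\":
            match = CONTROL.match(rtf, i)
            if match:
                name, number = match.group(1), match.group(2)
                i = match.end()
                if name == "bin" and number:
                    i += int(number)        # rule 3: jump over raw binary data
                elif name in SKIPPED and skip_depth is None:
                    skip_depth = depth
                elif name == "par" and skip_depth is None:
                    out.append("\n")
                # rule 1: every other control word is ignored
            elif rtf.startswith("\\*", i):
                if skip_depth is None:
                    skip_depth = depth      # rule 2: skip the ignorable destination
                i += 2
            else:
                # control symbol such as \{ or \}: keep the escaped character
                if skip_depth is None and i + 1 < len(rtf) and rtf[i + 1] in "{}\\":
                    out.append(rtf[i + 1])
                i += 2
        else:
            if skip_depth is None and ch not in "\r\n":
                out.append(ch)
            i += 1
    return "".join(out)

sample = (r"{\rtf1\ansi\deff0{\fonttbl{\f0 Times;}{\f1 Helvetica;}}"
          r"{\info{\title Welcome example}}"
          r"\pard\plain\f1\fs28 Welcome\par Haere mai{\*\unknown hidden}}")
print(extract_text(sample))
```

For the sample, the extractor keeps the two lines of greeting text and discards the font table, the title metadata, and the ignorable group.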

RTF files can usually be viewed on Macintosh and Windows computers. Different formats may be more appropriate for other platforms or for speedier access. For example, software is available to convert RTF documents to HTML.

Native Word formats

For much of its history the native Microsoft Word format has been proprietary and its details shrouded in mystery. Although Microsoft has published "as is" their internal technical manual for the Word 97 version, the format continued to evolve. Finally, in 2008, Microsoft made the specification publicly available. (Of course, this does not help people with legacy documents.) Native Word is primarily a binary format, and the abstract structures deployed reflect those of RTF. Documents include summary information, font information, style sheets, sections, paragraphs, headers, footers, bookmarks, annotations, footnotes, embedded pictures—the list goes on. The native Word representation provides more functionality than RTF and is therefore more intricate.

A serious complication is that documents can be written to disk in Fast Save mode, which no longer preserves the order of the text. Instead, new edits are appended, and whatever program reads the file must reconstruct its current state. If this feature has been used, the header marks the file type as "complex."

Using native Word in a digital library

To extract text from Word documents for indexing, one solution is to first convert them to RTF, whose format is better described. The Save As option in Microsoft Word does this, and the process can be automated through scripting. (Visual Basic is well suited to this task.) It may be more expeditious to deliver native Word than RTF because it is more compact. However, non-Microsoft users will typically need a more widely supported option, although this trend is shifting with the advent of OpenOffice.

Word has a Save As HTML option. While the result displays well in Microsoft’s Internet Explorer browser, it is generally less pleasing in other browsers (although it can be improved by performing certain postprocessing operations). Public domain conversion software cannot fully implement the Fast Save format because of lack of documentation and may generate all the text in the file rather than just the text in the final version. The solution is simple: switch this option off and resave all documents (using scripting).

Office Open XML: OOXML

Office Open XML is a new standard designed by Microsoft to represent, in human-readable XML form, Word and other Microsoft Office documents (spreadsheets, presentations, etc.). It was designed specifically for the features in Microsoft Office and is constrained by the need to be backward-compatible with documents created in the binary format. There has been vigorous discussion of the relative merits of OOXML and the Open Document (ODF) format (described in the next section). ODF was designed as a general office document markup language, unconstrained by the features of existing word processors. It is helpful to keep in mind the fact that these two languages were designed for different purposes and will probably be used in different ways.

Although the next section contains a brief technical account of ODF, we omit a technical description of OOXML. One reason is that these are extremely complex standards: the OOXML specification is more than 6,000 pages long, and the ODF specification runs to 722 pages. Instead, we give a flavor of the debate, which is often heated. Critics of OOXML complain that (for example)

• The standard is controlled by a single commercial company (Microsoft).

• It does not follow established standards for dates, graphics, and formulas.

• Parts of it use non-XML formatting codes and are unreadable by XML parsers.

• Components of OOXML can be linked to Windows applications and can be understood only in a Windows context.

Proponents counter these criticisms by pointing out that naturally some OOXML elements are not supported by other software or other formats. Indeed, a project to translate documents from OOXML to ODF (funded in part by Microsoft) found that not all aspects of Office documents could be faithfully rendered. The problem basically stems from the fact that its design requirements force OOXML to capture all features of Microsoft Office, no matter how idiosyncratic or arcane they appear to others.

On the other side, critics of ODF complain that (for example)

• It lacks standard ways of rendering macros, tables, and mathematical symbols.

• It ignores the semantics of spreadsheet formulae.

• The documentation is insufficiently detailed for vendors who seek to build their own ODF software.

• It is not well enough defined to support fully interoperable applications.

In practice, its proponents say, ODF developers will naturally look to OpenOffice for canonical implementations that clarify features like spreadsheet formulae (which seem to have become a specific bone of contention). The code is there for anyone to view, and the XML output can be trivially inspected—providing a supplement to the standard in the form of an operational implementation. Of course, this is a very different philosophy from Microsoft’s.

Although there are certainly differences between the two standards, advocates on both sides usually admit that the differences are not essential—certainly not when it comes to the interoperability and preservation of textual documents. Perhaps the real issue is the potential effect of each standard in the marketplace. ODF may be specified in less detail, but it offers the flexibility to create new products, which encourages competition. OOXML is aimed at reproducing Office documents and makes it easier for other products to work with Office—but does not encourage vendors to replace it.

At any rate, OOXML will be a great improvement over the native Word formats for designers of digital libraries.