Word-Processor Documents (Digital Library) Part 2

Open Document format: ODF

The Open Document format can describe word-processor documents, spreadsheets, presentations, drawings, graphics, images, charts, mathematical formulas, and documents that combine elements of any of these. Documents are stored in files whose extensions consist of ".od" followed by a single letter indicating the document type: .odt for text documents, .ods for spreadsheets, .odp for presentations, .odg for graphics.

A basic ODF document is an XML file with <office:document> as its root element. (Open Document files can also take the form of a ZIP compressed archive containing a number of files and directories, as described below.) Figure 4.22b shows the Welcome example of Figure 4.18a in minimal ODF form. It renders the same text as the PostScript, PDF, and RTF versions, although it uses the default font and size. After the obligatory XML declaration, the document is defined by its root element, <office:document>, in the office namespace. This begins with four namespace definitions. In practice, ODF files usually define more namespaces, including one for Formatting Objects (called fo) used in Section 4.4, but these four are enough to make this rudimentary example work. The namespaces that begin with urn are Uniform Resource Names (see Section 7.3).

Metadata can be stored along with documents, either with a set of metadata elements that are predefined in Open Document or using any user-defined metadata set. We have illustrated this by including two metadata elements specified in the Dublin Core standard (see Section 6.2). The namespace beginning with http:// is for the Dublin Core standard, and is only necessary because the example uses it to specify some metadata.

Following the metadata is the text of the document, nested inside the appropriate office tags. The <text:p> tag specifies a paragraph. As you can see, Unicode characters can be included using the ordinary &# numeric character reference notation of HTML and XML.
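To make the structure described above concrete, here is a minimal sketch in Python that reads a single-file ODF document and pulls out its Dublin Core metadata and paragraph text. The sample document is a hypothetical stand-in, not a copy of Figure 4.22b, and the function name read_flat_odf is our own; only the namespace URIs come from the OpenDocument and Dublin Core specifications.

```python
import xml.etree.ElementTree as ET

# Namespace URIs as defined by OpenDocument and Dublin Core.
NS = {
    "office": "urn:oasis:names:tc:opendocument:xmlns:office:1.0",
    "text": "urn:oasis:names:tc:opendocument:xmlns:text:1.0",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical minimal single-file document (an assumption for illustration).
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<office:document
    xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
    xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    office:mimetype="application/vnd.oasis.opendocument.text">
  <office:meta>
    <dc:title>Welcome example</dc:title>
    <dc:creator>A. N. Author</dc:creator>
  </office:meta>
  <office:body>
    <office:text>
      <text:p>Caf&#233; welcome</text:p>
    </office:text>
  </office:body>
</office:document>
"""


def read_flat_odf(xml_bytes):
    """Return (Dublin Core metadata dict, list of paragraph strings)."""
    root = ET.fromstring(xml_bytes)
    dc_prefix = "{" + NS["dc"] + "}"
    metadata = {}
    meta_el = root.find("office:meta", NS)
    if meta_el is not None:
        for child in meta_el:
            # Keep only elements in the Dublin Core namespace.
            if child.tag.startswith(dc_prefix):
                metadata[child.tag[len(dc_prefix):]] = (child.text or "").strip()
    # itertext() also collects text nested in inline elements such as spans.
    paragraphs = ["".join(p.itertext()) for p in root.iterfind(".//text:p", NS)]
    return metadata, paragraphs


if __name__ == "__main__":
    print(read_flat_odf(SAMPLE.encode("utf-8")))
```

The numeric character reference in the sample paragraph is resolved by the XML parser, so the extracted text contains the actual Unicode character.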

In order to make the text appear in Helvetica font as it does in Figure 4.18a, the <text:p> statement near the end of Figure 4.22b needs to be augmented with a style-name attribute to read

To make this work, the Normal text style must be explicitly defined using a statement like
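As a rough sketch of the idea (not the figure's actual statements), the following Python fragment attaches a style-name attribute to each paragraph and defines a matching named style. The element and attribute names used for the style definition (style:style, style:family, style:text-properties, fo:font-family) follow our reading of the OpenDocument schema and should be treated as assumptions here; the helper names q and apply_named_style are hypothetical, and no schema validation is attempted.

```python
import xml.etree.ElementTree as ET

OFFICE = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"
TEXT = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
STYLE = "urn:oasis:names:tc:opendocument:xmlns:style:1.0"
FO = "urn:oasis:names:tc:opendocument:xmlns:xsl-fo-compatible:1.0"

# Ask ElementTree to serialize with the conventional prefixes.
for prefix, uri in (("office", OFFICE), ("text", TEXT), ("style", STYLE), ("fo", FO)):
    ET.register_namespace(prefix, uri)


def q(namespace, local_name):
    """Build an ElementTree qualified name such as {uri}p."""
    return "{%s}%s" % (namespace, local_name)


def apply_named_style(root, style_name="Normal", font="Helvetica"):
    """Define a paragraph style and reference it from every <text:p>."""
    styles = root.find(q(OFFICE, "styles"))
    if styles is None:
        # Place <office:styles> just before <office:body> when a body exists.
        styles = ET.Element(q(OFFICE, "styles"))
        body = root.find(q(OFFICE, "body"))
        index = list(root).index(body) if body is not None else len(root)
        root.insert(index, styles)
    style = ET.SubElement(
        styles,
        q(STYLE, "style"),
        {q(STYLE, "name"): style_name, q(STYLE, "family"): "paragraph"},
    )
    ET.SubElement(style, q(STYLE, "text-properties"), {q(FO, "font-family"): font})
    for paragraph in root.iter(q(TEXT, "p")):
        paragraph.set(q(TEXT, "style-name"), style_name)
    return root


# Hypothetical minimal document used only to exercise the function.
DOC = (
    '<office:document xmlns:office="%s" xmlns:text="%s">'
    "<office:body><office:text><text:p>Welcome</text:p></office:text></office:body>"
    "</office:document>" % (OFFICE, TEXT)
)

if __name__ == "__main__":
    print(ET.tostring(apply_named_style(ET.fromstring(DOC)), encoding="unicode"))
```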

In reality the specifications are slightly more convoluted and verbose, which is why we have not included the details here. One cardinal advantage of XML-style document specifications over the others discussed previously is that it is very easy to create examples and study them to see what is going on.

Open Document files

Figure 4.22b shows a single file containing an entire document in the root XML tag <office:document>. This represents the document content, metadata, and any styles defined in the document (there are none in Figure 4.22b). It also includes any document "settings," which are properties that are not related to the document’s content or layout, like the zoom factor or the cursor position (again, there are none here).

These four parts are usually kept apart by placing them in separate XML files (content.xml, meta.xml, styles.xml, and settings.xml). The single top-level <office:document> is then replaced by four top-level roots, one for each file:

• <office:document-content>

• <office:document-meta>

• <office:document-styles>

• <office:document-settings>

The content file is the most important, and carries the actual content of the document (except for binary data, like images). The style file contains most of the stylistic information—Open Document makes heavy use of styles for formatting and layout.

Although Figure 4.22b shows a readable XML file, ODF files are normally packaged as ZIP archives, which compresses them to reduce their size. Furthermore, an additional file (called mimetype) must be present in the archive, containing a single line specifying the document type: text document, spreadsheet, presentation, or graphics. This makes the file extension (.odt, .ods, .odp, or .odg) immaterial to the format: it’s only there for the user’s benefit.
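Because the document type is recorded inside the archive, software can identify a file without looking at its extension. The sketch below, in Python, reads the mimetype entry from a package; the function names and the demo package builder are hypothetical, while the four media type strings are the registered OpenDocument ones.

```python
import io
import zipfile

# Registered media types for the four document kinds mentioned in the text.
ODF_MIMETYPES = {
    "application/vnd.oasis.opendocument.text": "text document",
    "application/vnd.oasis.opendocument.spreadsheet": "spreadsheet",
    "application/vnd.oasis.opendocument.presentation": "presentation",
    "application/vnd.oasis.opendocument.graphics": "graphics",
}


def odf_document_type(package):
    """Return a human-readable document type from the 'mimetype' entry.

    `package` may be a file path or a seekable binary file object.
    """
    with zipfile.ZipFile(package) as archive:
        mimetype = archive.read("mimetype").decode("ascii").strip()
    return ODF_MIMETYPES.get(mimetype, "unknown")


def make_demo_package(mimetype):
    """Build a tiny in-memory package (an assumption, for demonstration only)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        # The mimetype entry is stored uncompressed so it can be read directly.
        archive.writestr("mimetype", mimetype, compress_type=zipfile.ZIP_STORED)
        archive.writestr("content.xml", "<placeholder/>",
                         compress_type=zipfile.ZIP_DEFLATED)
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    demo = make_demo_package("application/vnd.oasis.opendocument.spreadsheet")
    print(odf_document_type(demo))
```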

Formatting

Open Document has a comprehensive repertoire of formatting controls that dictate how information is displayed. Style types include:

• paragraph styles

• page styles

• character styles

• frame styles

• list styles.

There are many attributes that dictate the style of specific parts of the text, paragraphs, sections, tables, columns, lists, and fills. Characters can have their font, size, and other properties set. The vertical arrangement of paragraphs can be controlled through attributes that keep lines together and avoid widows and orphans; other attributes (such as "drop caps") provide special formatting for parts of a paragraph.

The usual range of document structuring options is provided, including headings at different levels, numbered and unnumbered lists, numbered paragraphs, and change tracking. Section attributes can be used to control how the text is displayed. Documents can include hyperlinks, bookmarks, and references. Text fields can contain automatically generated content, and there are mechanisms for generating tables of contents, indexes, and bibliographies.

Page layout is controlled by attributes such as page size, number format, paper tray, print orientation, margins, borders, padding, shadow, background, columns, print page order, first page number, scale, table centering, maximum footnote height and separator, and many layout grid properties. Headers and footers can have defined fixed and minimum heights, margins, border line width, padding, background, shadow, and dynamic spacing.

Using ODF in a digital library

Like any well-defined XML-based standard, the Open Document format makes it easy to handle, process, and re-present documents in the way that digital libraries do. As the name implies, the standard is "open." The natural verbosity of XML is curbed using a standard compression mechanism, ZIP, which renders the resulting files significantly smaller than other document files, such as Word’s .doc files. Yet the information is readable and processable. To see the contents of an .odt file, you first decompress it, and then the data is exposed in simple text-based XML files whose content can be easily examined, modified, and processed.

The use of separate files for the content, metadata, style, and settings is designed to make it easy to process these components in different ways. Most digital libraries will simply ignore the settings. The fact that metadata resides in its own file eliminates the need to parse the entire document just to determine its metadata. The fact that style information is separate means that indexers can focus on the textual content of documents. Future digital libraries that deal only with such documents will be easier to construct and maintain, because they will not have to grapple with the complexity that is presently required to isolate such components as plain text and metadata in document formats like PostScript, PDF, and native Word. Legacy documents, unfortunately, will continue to pose problems.
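The separation of components can be illustrated with a short Python sketch in which a metadata reader opens only meta.xml and an indexer opens only content.xml. This is one possible arrangement under simplifying assumptions, not a prescribed design: the function names are hypothetical, the demo package contains only two of the usual files, and only Dublin Core elements, paragraphs, and headings are considered.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
DC_NS = "http://purl.org/dc/elements/1.1/"


def read_metadata(package_bytes):
    """Parse only meta.xml and return its Dublin Core elements as a dict."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as archive:
        root = ET.fromstring(archive.read("meta.xml"))
    prefix = "{%s}" % DC_NS
    return {el.tag[len(prefix):]: (el.text or "").strip()
            for el in root.iter() if el.tag.startswith(prefix)}


def read_index_text(package_bytes):
    """Parse only content.xml and return heading and paragraph text."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as archive:
        root = ET.fromstring(archive.read("content.xml"))
    wanted = {"{%s}p" % TEXT_NS, "{%s}h" % TEXT_NS}
    return "\n".join("".join(el.itertext()) for el in root.iter() if el.tag in wanted)


def build_demo_package():
    """Create a small in-memory package (hypothetical content, for the demo)."""
    meta = (
        '<office:document-meta xmlns:office="%s" xmlns:dc="%s"><office:meta>'
        "<dc:title>Demo</dc:title><dc:creator>A. N. Author</dc:creator>"
        "</office:meta></office:document-meta>" % (OFFICE_NS, DC_NS)
    )
    content = (
        '<office:document-content xmlns:office="%s" xmlns:text="%s">'
        "<office:body><office:text>"
        '<text:h text:outline-level="1">Heading</text:h>'
        "<text:p>First <text:span>paragraph</text:span>.</text:p>"
        "</office:text></office:body></office:document-content>" % (OFFICE_NS, TEXT_NS)
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("mimetype", "application/vnd.oasis.opendocument.text")
        archive.writestr("meta.xml", meta)
        archive.writestr("content.xml", content)
    return buffer.getvalue()


if __name__ == "__main__":
    package = build_demo_package()
    print(read_metadata(package))
    print(read_index_text(package))
```

Neither function touches styles.xml or settings.xml, which mirrors the point that a digital library can ignore the parts it does not need.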

Scientific documents: LaTeX

LaTeX—pronounced la-tech or lay-tech—takes a completely different approach to document representation. Word processors present users with a "what you see is what you get" interface that is specifically intended to hide the gory details of internal representation. In contrast, LaTeX documents are expressed in plain ASCII text and contain typed formatting commands: they explicitly and intentionally give the user direct access to all the internal representation details. Any text editor on any platform can be used to compose a LaTeX document. To view the formatted document, or to generate hard copy, the LaTeX program converts it to a page description language—generally PostScript, but PDF and HTML are possible too.

LaTeX is versatile, flexible, and powerful. It can generate documents of exceptionally high typographical quality. The downside, however, is an esoteric syntax that many people find unsettling and hard to learn. LaTeX is particularly good for mathematical typesetting and has been enthusiastically adopted by members of the academic, scientific, and technical communities. It is a nonproprietary system, and excellent implementations are freely available.

Figure 4.24 shows a simple example. Commands in the LaTeX source (Figure 4.24a) are prefixed by the backslash character, \. All documents have the same overall structure. They open with \documentclass, which specifies the document’s principal characteristics (article, report, book, etc.) and gives options, such as paper size, base font size, and whether to print single-sided or back-to-back. Then follows a preamble that gives an opportunity to set up global features before the document content begins. Here "packages" of code can be included. For example, \usepackage{epsfig} allows Encapsulated PostScript files, generally containing the artwork for figures, to be included.

The document content lies between \begin{document} and \end{document} commands. This \begin … \end structure is used to encapsulate many structural items: tables, figures, lists, bibliography, abstract. The list is endless, because LaTeX allows users to define their own commands. Furthermore, you can wrap up useful features and publish them on Internet sites so that others can download them and access them through \usepackage.

As a document is written, most text is entered normally. Blank lines are used to separate paragraphs. A few characters carry special meaning and must therefore be "escaped" by a preceding backslash whenever they occur in the text; Figure 4.24 contains examples. Structural commands include \section, which generates an automatically numbered section heading (\section* omits the numbering, while \subsection, \subsubsection, and so on are used for nested headings). Formatting commands include \emph, which uses italics to emphasize text, and \", which superimposes an umlaut on the character that follows. There are hundreds more.

The last part of Figure 4.24a specifies a mathematical expression. The \begin{displaymath} and \end{displaymath} commands switch to a mode that is tuned to represent formulas, which activates additional commands specially tailored to this purpose. LaTeX contains many shortcuts—for example, math mode can alternatively be entered by using dollar signs as delimiters.
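The two ways of entering math mode can be compared in a small, self-contained LaTeX document. The formulas are illustrative choices of our own and are not the expression used in Figure 4.24a.

```latex
\documentclass{article}
\begin{document}

% Inline math: dollar signs act as delimiters within running text.
Euler's identity, $e^{i\pi} + 1 = 0$, fits inside a sentence.

% Displayed math: the formula is set on its own line.
\begin{displaymath}
\sum_{k=1}^{n} k = \frac{n(n+1)}{2}
\end{displaymath}

\end{document}
```

Commands such as \sum and \frac are only meaningful inside math mode, which is what the text means by additional commands being activated.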

Figure 4.24: (a) LaTeX source document

Using LaTeX in a digital library

LaTeX is a popular source format for collections of mathematical and scientific documents. Of course these documents can be converted to PostScript or PDF and handled in this form instead, which allows them to be mixed with documents produced by other means. However, this lowest-common-denominator approach loses structural and metadata information. In the case of LaTeX, such information is signaled by commands for title, abstract, nested section headings, and so on.

If, on the other hand, the source documents are obtained in LaTeX form and parsed to extract structural and metadata information, the digital library collection will be richer and provide its users with more support. It is easy to parse LaTeX to identify plain text, commands and their arguments, and the document’s block structure.
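A deliberately simple Python sketch shows how such information might be picked out of LaTeX source. It is an approximation rather than a real LaTeX parser: it assumes each command takes a single brace-delimited argument with no nested braces, and it ignores comments and optional arguments. The function name and sample source are hypothetical.

```python
import re

# Matches \title{...}, \author{...}, \section{...} and so on, with an optional
# star, where the argument contains no nested braces (a simplifying assumption).
COMMAND = re.compile(
    r"\\(title|author|section|subsection|subsubsection)\*?\{([^{}]*)\}"
)


def extract_structure(source):
    """Return (metadata dict, list of (heading level, heading text))."""
    metadata = {}
    headings = []
    for name, argument in COMMAND.findall(source):
        if name in ("title", "author"):
            metadata[name] = argument.strip()
        else:
            headings.append((name, argument.strip()))
    return metadata, headings


# Hypothetical source used only to exercise the function.
SAMPLE = r"""
\documentclass{article}
\title{A Demo Paper}
\author{A. N. Author}
\begin{document}
\maketitle
\section{Introduction}
Some text with \emph{emphasis}.
\subsection*{Background}
\end{document}
"""

if __name__ == "__main__":
    print(extract_structure(SAMPLE))
```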

However, there are two problems. The first is that documents no longer occupy a single file—they use external files such as the "packages" mentioned earlier—and even the document content can be split over several files if desired. In practice it can be surprisingly difficult to obtain the exact set of supporting files that were intended to be used with a particular document. Experience with writing LaTeX documents is necessary to understand which files need copying and, in the case of extra packages, where they might be installed.

The second problem is that LaTeX is highly customizable, and different authors adapt standard commands and invent new ones as they see fit. This makes it difficult to know in advance which commands to seek to extract standard metadata. However, new commands in LaTeX are composites of existing ones, and one solution is to expand all commands to use only built-in basic features.
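The expansion idea can be sketched in Python for the simplest case: author-defined commands that take no arguments and whose definitions contain no nested braces. This is an assumption-laden illustration, not how LaTeX itself expands macros; for instance, it does not swallow the space that follows a command name, and the function name is hypothetical.

```python
import re

# \newcommand{\name}{body} with no arguments and no braces inside the body.
NEWCOMMAND = re.compile(r"\\newcommand\{\\([A-Za-z]+)\}\{([^{}]*)\}")


def expand_simple_macros(source, max_passes=10):
    """Remove simple \\newcommand definitions and substitute their uses."""
    macros = dict(NEWCOMMAND.findall(source))
    body = NEWCOMMAND.sub("", source)
    for _ in range(max_passes):  # repeat so macros that use macros resolve
        changed = False
        for name, replacement in macros.items():
            # The lookahead stops \dl from matching the start of \dlong.
            pattern = re.compile(r"\\" + name + r"(?![A-Za-z])")
            # A function replacement avoids backslash-escape processing.
            new_body = pattern.sub(lambda match, text=replacement: text, body)
            if new_body != body:
                body, changed = new_body, True
        if not changed:
            break
    return body


# Hypothetical source: the author has renamed \title and abbreviated a phrase.
SAMPLE = "\n".join([
    r"\newcommand{\dl}{digital library}",
    r"\newcommand{\mytitle}{\title}",
    r"\mytitle{Building a \dl}",
])

if __name__ == "__main__":
    print(expand_simple_macros(SAMPLE).strip())
```

After expansion, the author's private \mytitle command has become the standard \title command, which a metadata extractor can then look for.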