Other Documents (Digital Library)

Many other document types might be included in a digital library. Prominent among them are multimedia documents, which are discussed in the next topic, but some other kinds of predominantly textual documents are worth a brief mention.

Spreadsheets and presentation files

We have mentioned spreadsheets and presentation files (such as PowerPoint) when discussing Office Open XML and the Open Document format. Like native Word documents, both have traditionally been encoded in proprietary binary file formats. They may be presented to the user in their native form, as PDF image files, or both. Of course, converting to PDF loses much information, notably the dynamic functionality of a spreadsheet's formulas and the dynamic aspects of a presentation.

To index such files, the text can be extracted using the same procedure as for native-format text documents: use a Save As option to generate a more text-friendly form, such as ASCII (using the CSV, or comma-separated values, format) or XML for spreadsheets, and HTML or PDF for presentations. In future, open XML-based formats such as OOXML or ODF (which, as discussed earlier, apply to these files as well as to textual documents) should make text extraction much easier.
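As a minimal sketch of the CSV route, the following Python fragment flattens the cells of an exported spreadsheet into a single string that could be handed to an indexer. The function name `csv_to_index_text` and the sample data are hypothetical, and the sketch assumes the export contains computed values rather than formulas.

```python
import csv
import io


def csv_to_index_text(csv_source: str) -> str:
    """Flatten every non-empty cell of a CSV export into one text string."""
    reader = csv.reader(io.StringIO(csv_source))
    # Keep cell order (row by row) and skip cells that are empty or blank.
    cells = [cell.strip() for row in reader for cell in row if cell.strip()]
    return " ".join(cells)


# Hypothetical export: the quoted cell contains a comma, which the csv
# module handles; naive splitting on "," would break it in two.
sample = 'Item,Quantity,Total\n"Paper, A4",3,12.50\n'
index_text = csv_to_index_text(sample)
print(index_text)
```

Using the `csv` module rather than splitting on commas keeps quoted cells intact, which matters because spreadsheet text cells frequently contain commas.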

E-mail

E-mail documents are also candidates for inclusion in digital libraries. Early in the history of international computer networks, there were many e-mail clients with incompatible formats. Seeking interoperability, the U.S. Department of Defense funded efforts to create standards. The result is a set of international standards for e-mail and e-mail extensions called Multipurpose Internet Mail Extensions (MIME), which is the de facto standard on the Internet.


However, corporate e-mail systems often use their own internal format and communicate with servers using a vendor-specific, proprietary protocol. The servers act as gateways for sending and receiving messages over the Internet, which involves undertaking any necessary reformatting. For mail sent and received within a single company, the entire transaction may take place within the corporate system.

Mail is not usually sent directly to a digital library but is imported from wherever it happens to be stored. This can be the user’s e-mail client, or their server, or both places. There are two common standard formats for mailboxes (Maildir and mbox), but several prominent clients use their own proprietary format and conversion software is needed to transfer mail between them—or to ingest it into digital libraries.
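For mail that is already in mbox form, an import step might look like the sketch below, which uses Python's standard `mailbox` module. The function name `ingest_mbox`, the file name, and the two sample messages are hypothetical; a real ingest would store far more than sender and subject.

```python
import mailbox
import os
import tempfile


def ingest_mbox(path):
    """Return (sender, subject) pairs for every message in an mbox file."""
    box = mailbox.mbox(path, create=False)  # fail if the file is missing
    try:
        return [(msg["From"], msg["Subject"]) for msg in box]
    finally:
        box.close()


# Build a small throwaway mbox so the sketch is self-contained.
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample.mbox")
writer = mailbox.mbox(sample_path)
writer.add("From: alice@example.org\nSubject: Minutes\n\nSee attached.\n")
writer.add("From: bob@example.org\nSubject: Re: Minutes\n\nThanks.\n")
writer.flush()
writer.close()

records = ingest_mbox(sample_path)
print(records)
```

The same module also provides a `mailbox.Maildir` class with a similar interface, so the reading loop would change little for the other standard format. Proprietary client formats still need a separate conversion step first.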

Internet e-mail messages have a header and a body, separated by a blank line. The header contains metadata and is structured into fields, such as sender, recipient, date, subject, and other information. The date field includes the clock time and time zone, which together define the actual time the message was sent. The body contains the message content in the form of unstructured text, and sometimes ends with a signature block.
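The header and body structure can be illustrated with Python's standard `email` package. This is only a sketch: the message text and addresses are invented, and converting the sent time to UTC is one possible way to normalize the clock time and time zone for metadata.

```python
from datetime import timezone
from email import message_from_string
from email.policy import default
from email.utils import parsedate_to_datetime

# A hypothetical message: header fields, a blank line, then the body.
raw = (
    "From: alice@example.org\n"
    "To: bob@example.org\n"
    "Date: Tue, 05 Mar 2024 14:30:00 +0100\n"
    "Subject: Library ingest\n"
    "\n"
    "The scans are ready.\n"
)

msg = message_from_string(raw, policy=default)

# Header fields become metadata for the digital library record.
metadata = {
    "sender": str(msg["From"]),
    "recipient": str(msg["To"]),
    "subject": str(msg["Subject"]),
}

# Clock time plus time zone identify the instant the message was sent.
sent = parsedate_to_datetime(str(msg["Date"]))
sent_utc = sent.astimezone(timezone.utc)

# Everything after the blank line is the unstructured body text.
body = msg.get_content()

print(metadata, sent_utc.isoformat(), body)
```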

E-mails often contain attachments, which are files that are sent along with the message, encoded as part of the message to which they are attached. In the Internet MIME format, messages and their attachments are sent as a single multipart message, using an encoding scheme (called "base64") for non-text attachments that represents binary information as printable ASCII.
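To make the encoding concrete, the sketch below builds a multipart message with a small binary attachment using Python's standard `email` package and shows that the attachment travels as base64 text. The addresses, file name, and four-byte payload are invented for illustration.

```python
import base64
from email.message import EmailMessage

msg = EmailMessage()
msg["From"] = "alice@example.org"
msg["To"] = "bob@example.org"
msg["Subject"] = "Scan"
msg.set_content("Scan attached.")

# A hypothetical binary attachment; these bytes are not printable ASCII.
payload = bytes([0, 255, 16, 128])
msg.add_attachment(
    payload,
    maintype="application",
    subtype="octet-stream",
    filename="scan.bin",
)

# The whole message, attachment included, serializes to plain text.
wire_text = msg.as_string()

# The attachment appears in that text in its base64 form.
encoded = base64.b64encode(payload).decode("ascii")

# On ingest, the original bytes can be recovered from the message part.
attachment = next(msg.iter_attachments())
recovered = attachment.get_content()

print(msg.get_content_type(), encoded, recovered == payload)
```

Because base64 maps every three bytes to four printable characters, encoded attachments are roughly a third larger than the original files.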

There are many problems with e-mail as it is used today. For example, a standard way to quote text is to start each line with the ">" character, possibly followed by a space. Unfortunately, e-mail readers usually wrap lines to fit the screen size, or to some predetermined maximum length, and each level of quoting lengthens the lines until they overflow. Figure 4.25 shows a message that has been quoted several times. The mailer has tried to wrap the lines, but it has put only a single quote on the continuation lines, not realizing that the text being wrapped has four levels of quotes. Messages become unintelligible after a few cycles of such mutilation.
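One way to avoid this mutilation is to measure the quote depth before wrapping and repeat the full marker on every continuation line. The following is a minimal sketch of that idea, not a description of how any particular mailer works. The function name `rewrap_quoted` is hypothetical, and normalizing the marker to a compact form such as ">>>> " is an assumption made for simplicity.

```python
import re
import textwrap

# Leading run of ">" characters, each optionally followed by whitespace.
QUOTE_PREFIX = re.compile(r"^((?:>\s?)*)(.*)$")


def rewrap_quoted(line, width=40):
    """Wrap one quoted line, keeping its quote depth on every output line."""
    prefix, text = QUOTE_PREFIX.match(line).groups()
    depth = prefix.count(">")
    marker = ">" * depth + (" " if depth else "")
    wrapped = textwrap.wrap(
        text.strip(),
        width=width,
        initial_indent=marker,
        subsequent_indent=marker,  # continuation lines keep all quote levels
    )
    # textwrap returns an empty list for empty text; keep the bare marker.
    return wrapped or [marker.rstrip()]


sample = ">> > > The quick brown fox jumps over the lazy dog near the river bank"
for out_line in rewrap_quoted(sample, width=30):
    print(out_line)
```

The sketch handles a single line at a time. A fuller treatment would first join consecutive lines that share the same quote depth, so that paragraphs already broken by an earlier mailer are reflowed as a unit.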
