Web Documents: HTML and XML (Digital Library)

HTML, the Hypertext Markup Language, is the underlying document format of the World Wide Web, which makes it an important baseline for interactive viewing. Like all major, long-standing document formats, HTML has undergone growing pains, and its history reflects the anarchy that has characterized the Web’s evolution. Since HTML’s conception in 1989, its development has been largely driven by software vendors who compete for the Web browser market by inventing new features to make their product distinctive: the so-called "browser wars."

Many of the introduced features played on people’s desires to exert more control over how their documents appear. Who gets to control font attributes like typeface and size—writer or reader? If you think this is a trivial issue, imagine what it means for the visually disabled. Allowing authors to dictate details of how their documents appear conflicts sharply with the original vision for HTML, which divorced document structure from presentation and left decisions about rendering documents to the browser itself. It makes the pages less predictable because viewing platforms differ in the support they provide. For example, HTML text can be marked up as "emphasized," and while it is common practice to render such items in italics, there is no requirement to follow this convention: boldface would convey the same intention.

Out of the maelstrom, a series of HTML standards has emerged, although HTML’s evolution continues. Indeed, a major new revision, HTML 5, is under development, and the warring parties (with vested interests) still clash: on this occasion a key controversy centers on whether native support for patent-free audio and video formats should be included in the standard.


The birth of HTML did not occur in a vacuum. Dipping further back in time, during the 1970s and 1980s, a generalized system for structural markup was developed called the Standard Generalized Markup Language (SGML); it was ratified as an ISO international standard in 1986. SGML is not a markup language but a metalanguage for describing markup formats. The original HTML format was expressed using SGML, and large organizations like government offices and the military also made significant use of it; however, SGML is rather intricate, and it has proven difficult to develop flexible software tools for the fully blown standard. This fact was the catalyst for the extensible markup language, XML.

XML is a simplified version of SGML designed specifically for interoperability over the Web. Informally speaking, it is a dialect of SGML (whereas HTML is an example of a markup language that SGML can describe). XML provides a flexible way of characterizing document structure and metadata, making it well suited to digital libraries. It has achieved widespread use in a very short stretch of time.

XML has strict syntactic rules that prevent it from describing ancient forms of HTML exactly. The differences expose parts of the early specifications that were loosely formed—ones that cause difficulty when parsing and processing documents. However, with a little trickery—for example, judicious placement of white space—it is possible to generate an XML specification of an extremely close approximation to HTML. Put another way, you can take advantage of HTML’s sloppy specification to produce files that are valid XML. Such files have twin virtues: they can be viewed in any Web browser, and they can be parsed and processed by XML tools. The idea is formalized in HTML 5, which includes parallel specifications: one for "classic" HTML and another, XHTML, which is XML-compliant.

We start this section by describing the development of markup languages and their relation to stylesheet languages. Then we describe the basics of HTML and explain how it can be used in a digital library. Following that, we describe the XML metalanguage and again discuss its role in digital libraries. Section 4.4 covers stylesheet languages for both HTML and XML.

Markup and stylesheet languages

Web culture has advanced at an extraordinary pace, creating a melee of incremental—and at times conflicting—additions and revisions to HTML, XML, and related standards. Figure 4.5 summarizes the main developments.

Although it has been retrospectively fitted with XML descriptions, HTML was created before XML was conceived and drew on the more general expressive capabilities of SGML. It was also forged in the heat of the browser wars, in which Web browsers sprouted a proliferation of innovative nonstandard features that vendors thought would make their products more appealing. As a result, browsers became forgiving: they process files that flagrantly violate SGML syntax. One example is tag scope overlap: writing <i>one <b>two </i>three </b> renders "one" in italics, "two" in bold italics, and "three" in bold, despite SGML’s requirement that tags be strictly nested. During subsequent attempts at standardization, more tags were added that control typeface and layout, features deliberately excluded from HTML’s original design.

Figure 4.5: The relationship among XML, SGML, and HTML

The notion of style sheets was introduced to resolve the conflict between presentation and structure by moving formatting and layout specifications to a separate file. Style sheets purify the HTML markup to reflect, once again, nothing but document structure. Different documents can share a uniform appearance by adopting the same style sheet. Equally, different style sheets can be associated with the same document, for instance defining one for on-screen viewing and another for printing. Style sheets can specify a sequence, or cascade, of inherited stylistic properties and are therefore dubbed cascading style sheets.

The first specification of cascading style sheets was in 1996, and this was quickly followed by an expanded backward-compatible version two years later. Style sheets can be adapted to different media by including formatting commands that are grouped together and associated with a given medium—screen, print, projector, handheld device, and so on. Guided by the user (or otherwise), applications that process the document use the relevant set of style commands. A Web browser might choose screen for online display but switch to print when rendering the document in PostScript.

Modern versions of HTML promote the use of style sheets. Moreover, they encourage them by officially deprecating formatting tags and other elements that affect presentation rather than structure. This is accomplished through three subcategories to the standard. Strict HTML expresses layout exclusively through style sheets: frameset commands and all deprecated tags and elements listed in the standard are excluded. In transitional HTML, style sheets are the principal way of specifying layout, but to provide compatibility with older browsers deprecated commands are permitted, although framesets are prohibited. Frameset HTML permits frameset commands and deprecated elements. Modern HTML files declare their subcategory at the start of the document. The format also adds improved support for multidirectional text (not just left to right) and enhancements for improved access by people with disabilities.

An HTML subset called XHTML has been defined that obeys the stricter syntactic rules imposed by the XML metalanguage. For instance, tags in XML are case sensitive, so XHTML tags and attributes are defined to be lowercase. Attribute values within a tag must be enclosed in quotes. Each opening tag must be balanced by a corresponding closing one (there are also single tags that combine opening and closing, with their own special syntax).
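As a rough illustration of these rules, the sketch below feeds two fragments to the XML parser in Python's standard library. Both fragments are invented for this example: the first uses habits that browsers tolerate (an unquoted attribute value and unclosed tags), while the second follows the XHTML conventions just listed. The helper name is_well_formed is likewise an assumption of the sketch.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragments, invented for illustration.
sloppy_html = '<p>An image: <img src=logo.gif><p>Second paragraph'
xhtml = '<div><p>An image: <img src="logo.gif" /></p><p>Second paragraph</p></div>'

def is_well_formed(text):
    """Return True if text parses as XML, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

sloppy_ok = is_well_formed(sloppy_html)   # unquoted value, unclosed tags
xhtml_ok = is_well_formed(xhtml)          # quoted, lowercase, balanced
print(sloppy_ok, xhtml_ok)
```

Only the second fragment is accepted by the XML parser, although a Web browser would display both.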

The power and flexibility of XML are further increased by related standards. Three are given in Figure 4.5 (there are many others). The extensible stylesheet language XSL described in Section 4.4 represents a more sophisticated approach than cascading style sheets: it can also transform data. The XML linking language XLink provides a more powerful method for connecting resources than HTML hyperlinks: it has bidirectional links, can link more than two entities, and associates metadata with links. Finally, XML Schema provides a rich mechanism for combining components and controlling the overall structure, attributes, and data types used in a document.

From a technical standpoint, it is easier to work with XML and its siblings than HTML because they conform to a strictly defined syntax and are therefore easier to parse. In reality, however, digital libraries have to handle legacy material gracefully. Today’s browsers cope remarkably well with the wide range of HTML files: they take backward compatibility to truly impressive levels. To help promote standardization, an open source software utility called HTML Tidy has been developed that converts older formats. The process is largely automatic, but human intervention may be required if files deviate radically from recognized norms.

Basic HTML

Modern markup languages use words enclosed in angle brackets as tags to annotate text. For example, <title>A really exciting story</title> defines the title element of an HTML document. In HTML, tag names are case insensitive—<Title> is the same as <title>. For each tag, the language defines a "closing" version, which gives the tag name preceded by a slash character (/). However, closing tags can be omitted in certain situations—a practice that some decry as impure while others endorse as legitimate shorthand. For example, <p> is used to mark up paragraphs, and subsequent <p>s are assumed to automatically end the previous paragraph—no intervening </p> is necessary. The shortcut is possible because nesting a paragraph within a paragraph—the only other plausible interpretation on encountering the second <p>—is invalid in HTML.

Opening tags can include a list of qualifiers known as attributes. These have the form name="value". For example, <img src="gsdl.gif" width="537" height="17"> specifies an image with source file name gsdl.gif and dimensions 537 × 17 pixels.

Because the language uses characters such as <, >, and " as special markers, a way is needed to display these characters literally. In HTML these characters are represented as special forms called entities and given names like &lt; for "less than" (<) and &gt; for "greater than" (>). This convention makes ampersand (&) into a special character, which is displayed by &amp; when it appears literally in documents. The semicolon needs no such treatment because its literal use and its use as a terminator are syntactically distinct. The same kind of special form can be used to specify Unicode characters beyond the ASCII range, such as &egrave; for è.
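The escaping convention can be tried out with the html module in Python's standard library, as in the minimal sketch below. The sample string is made up for illustration; html.escape and html.unescape are the library's own functions.

```python
import html

raw = 'if a < b && c > "d"'
escaped = html.escape(raw)          # replaces &, <, > and, by default, quotes
restored = html.unescape(escaped)   # turns the entities back into characters

print(escaped)
print(restored == raw)
print(html.unescape("cr&egrave;me"))   # a named entity beyond the ASCII range
```

Note that the ampersand is escaped along with the angle brackets, which is what allows the original text to be recovered exactly.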

Figure 4.6 shows a sample page that illustrates several parts of HTML, along with a snapshot of how it is rendered by a Web browser. It contains "typical" code that you find on the Web, rather than exemplary HTML. Some attributes miss out double quotes (such as align and valign used in some of the table elements), and not all the elements stipulate all the attributes they should (e.g., two of the <img> tags are missing their alt attribute through which an alternative text description is given). The example would not even pass the test for transitional HTML, let alone strict HTML; however, as Figure 4.6b shows, the Web browser renders it just fine.

HTML documents are divided into a header and a body. The header gives global information: the title of the document, the character encoding scheme, any metadata. The <meta> tag is used in Figure 4.6a to acknowledge the New Zealand Digital Library Project as the document’s creator. Creator imitates the Dublin Core metadata element (see Section 6.2) that is used to represent the name of the entity responsible for generating a document, be it a person, organization, or software application; however, there is no requirement in HTML to conform to such standards. Following the header is a comment and a command that sets the background to a Polynesian motif.

This particular page is laid out as two tables. The first controls the main layout. The second, nested within it, lays out the poem and the image of a greenstone pendant. The tags <tr> and <td> are used to mark table rows and cells, respectively.

Figure 4.6: (a) Sample HTML code involving graphics, text, and some special symbols;

The list item <li> near the end illustrates various special characters. Most take the &…; form, but the last two (; and #) do not need to be escaped because their normal meaning is syntactically unambiguous. To generate the letter ā (an a with a line above, called a macron and used in the Maori language), the appropriate Unicode value is given in decimal (&#257;), demonstrating one way of specifying non-ASCII characters. The example illustrates several other features, including images specified by the <img> tag, paragraphs beginning with <p>, italicized words given by <i>, and a bulleted list introduced by <ul> (for "unordered list"), along with a <li> tag for each list item (just one in this case).

Figure 4.6, cont’d: (b) snapshot rendered by a Web browser

Hyperlinks are an important feature of HTML. In the example, the tag pair <a> … </a> near the end defines a link anchor element. The document to link to—in this case, another page on the Web—is specified as an attribute. Hyperlinks can reference PDF documents, audio and video material, and many other formats—such as the Virtual Reality Modeling Language, VRML, which specifies a navigable virtual reality experience. Browsers display the anchor text—the text appearing between the start and end hyperlink tag—differently to emphasize the presence of a link. When the hyperlink is clicked, the browser loads the new document.
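For a digital library it is often useful to pull the link targets and anchor text out of a page. The following is one possible sketch using Python's standard html.parser module, which tolerates unclosed tags and mixed-case tag names. The class name LinkCollector and the sample page are assumptions made for illustration, not part of Figure 4.6.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect (href, anchor text) pairs from <a> elements."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None    # href of the anchor currently open, if any
        self._text = []      # pieces of anchor text seen so far

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive in lowercase.
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

# Made-up page fragment; note the uppercase tag and the unclosed <p>.
page = '<p>Visit the <A HREF="http://www.nzdl.org">New Zealand <i>Digital</i> Library</A> today.'
collector = LinkCollector()
collector.feed(page)
collector.close()
print(collector.links)
```

Anchors without an href attribute (locally defined link targets, for example) are skipped by this sketch.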

HTML was originally encoded in ASCII for transmission over byte-oriented protocols. Other encoding schemes are supported by setting the charset attribute in the header element to the appropriate encoding name. In Figure 4.6a, line 5 sets it explicitly to UTF-8, which, as mentioned in Section 4.1, is a representation scheme for Unicode. In fact UTF-8 (which is backward-compatible with ASCII) is now the default, and the behavior would be the same if the attribute were omitted.

HTML has many more features. For example, locally defined link anchors permit navigation within a single document. Fonts, colors, and page backgrounds can be specified explicitly. Forms can be created that collect data from the user—such as text data, fielded data, and selections from lists of items.

A mechanism called frames allows an HTML document to be tiled into smaller, independent segments, each an HTML page in its own right. A set of frames, called a frameset, can be displayed simultaneously. This is often used to add a navigation bar to every page of a Web site, along the top or down the side of the browser pane. When a link in the navigation bar is clicked, a new page is loaded into the main display frame, and the bar remains in place. Clicking on a link in the main display frame also loads the new page into the main frame.

Frames were introduced by one vendor during the browser wars and were soon supported by other browsers too. However, they have serious drawbacks. For instance, now that a browser can display more than one HTML document at a time, what happens when you create a bookmark? People often click around a site to reach an interesting document, then bookmark it in the usual way—only to find that the bookmark returns not to the intended page but to the point where the site splits into frames instead. This can be very frustrating.

Many of the effects for which frames were invented—such as persistent navigation bars—can also be accomplished by the newer and more principled mechanism of style sheets, avoiding the problems of frames; hence the three demarcated forms of HTML in more recent specifications of the standard: frameset, transitional, and strict, increasing in conformity to XML. We describe style sheets in Section 4.4.

Using HTML in a digital library

As the lingua franca for the Web, HTML underpins virtually all digital library interfaces. Moreover, digital library source documents are often presented in HTML. This eliminates most of the difficulties associated with the plain text representation introduced earlier. For example, the HTML header disambiguates the character set, while the <br> and <p> tags disambiguate line and paragraph breaks.

To extract text from HTML documents for indexing purposes, the obvious strategy of parsing them according to a well-defined grammar quickly runs into difficulty. The permissive nature of Web browsers encourages authors to depart from the defined standard. A better way to identify and remove tags is to describe them in the form of "regular expressions" (a scheme described in the next section), which generally achieves greater success for less effort in this particular circumstance. An alternative is to use the very application that caused the complication in the first place: Web browsers. A plain text browser called lynx provides a fast and reliable method of extracting text from HTML documents: you give it a command-line argument (-dump) and a URL, and it dumps out the contents of that URL in the form of plain text.
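The regular-expression approach might look something like the sketch below. It is deliberately crude and rests on stated assumptions: the patterns, the function name strip_tags, and the sample page are all invented here, and the code ignores complications such as script and style content or a > character inside an attribute value.

```python
import html
import re

COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)   # HTML comments
TAG = re.compile(r"<[^>]*>")                     # anything between < and >
SPACE = re.compile(r"\s+")                       # runs of white space

def strip_tags(page):
    """Crude text extraction for indexing: drop comments and tags, decode entities."""
    text = COMMENT.sub(" ", page)
    text = TAG.sub(" ", text)
    text = html.unescape(text)
    return SPACE.sub(" ", text).strip()

# Made-up sample with unclosed and overlapping tags.
sample = ('<p>Kia ora <i>koutou</i><!-- greeting --><br>'
          'Fish &amp; chips<p>Unclosed <b>tags</i> are fine')
extracted = strip_tags(sample)
print(extracted)
```

Because the patterns never try to match opening tags with closing ones, badly nested or unclosed markup causes no difficulty, which is the reason this approach often succeeds where a strict parser fails.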

As the example in Figure 4.6 illustrates, HTML allows metadata to be specified explicitly using <meta> tags. However, this mechanism is rather limited. For one thing, you might hesitate before tampering with source documents by inserting new metadata (perhaps determined separately, perhaps mined from the document content) in this way. When developing a digital library you need to consider whether it is wise (or even ethical, as discussed in Section 1.5) to add new information that cannot be disentangled from that present in the source document. Users might legitimately object if you serve up an altered version in place of the original.

Figure 4.7: Sample XML document

Basic XML

Figure 4.7 shows a formatted list of information about United Nations agencies encoded in XML. For each agency, the file records its full name, an optional abbreviation, and the URL of a picture of its headquarters. Included with the name is the address of the headquarters, stored as an attribute.

The file contains three broad sections, separated by comments in the form <!-- . . . -->. Line 1 is a header: it uses the special notation <? . . . ?> to denote an application-processing instruction. This syntax originates in SGML, which uses it to embed information for specific application programs that process the document. Here it is used to declare the version of XML, the character encoding (UTF-8), and whether or not external files are used. Lines 5 to 19 dictate the syntactic structure in which the remainder of the file is expressed, in the form of a Document Type Definition (DTD). Lines 21 to 44 provide the content of the document.

The style of the content section is reminiscent of HTML. The tag specifications have the same syntactic conventions, and many tags are identical—examples are <Head>, <Title>, and <Body>. However, in lines 27 to 40 the markup creates structures that HTML cannot represent.

Because it is a metalanguage, XML gives document designers a great deal of freedom. In Figure 4.7 the main document structure resembles HTML, but this does not have to be the case. Different element names could be chosen, and different ways could be used to express the information. For example, Figure 4.7 gives the headquarters address as the hq attribute of the <Name> element. Alternatively, a new element could have been defined to contain this information. It could be constrained to appear immediately following the <Name> element, or left optional, or sited anywhere within the <Agency> element.

Structural decisions are recorded in the DTD (lines 5-19). DTD tags use the special syntax <! . . . > and express keywords in block capitals. For example, ELEMENT and ATTLIST are used to define elements and element attributes. Our document designer decided to capitalize the initial letter of all document elements and leave attributes in lowercase. This improves the legibility of Figure 4.7 considerably.

Line 5 starts the DTD, and the square bracket syntax [. . .] indicates that the DTD will appear in-line. (It must, for line 1 declares that the file stands alone.) Alternatively, the DTD could be placed in an external file, referred to by a URL—which is the usual practice.

New elements are introduced in lines 6 to 11 by the keyword ELEMENT, followed by the new tag name and a description of what the element may contain. A leaf is an element that comprises plain text, with no markup. This is accomplished through parsed character data (#PCDATA), in which special characters may be included. For example, when the <Title> tag defined on line 10 is used, markup characters may appear in the title’s text, encoded in the familiar HTML way (&lt;, &amp;, and so on). (This convention originated in SGML.)

Lines 6 to 9 describe nonleaf structures. These are defined in a form known as a regular expression. Here a comma signifies an ordered sequence: line 6 declares that the top-level element <NGODoc> contains a <Head> element followed by a <Body> element. A vertical bar (|) represents a choice of one element from a list of named alternatives, and an asterisk (*) indicates zero or more occurrences. Thus <Body> (line 8) is a mixture of parsed character data and <Agency> elements where it is permissible for nothing at all to appear. A plus sign (+) means one or more occurrences, and a question mark (?) signifies either nothing or just one occurrence. Line 9 includes all four symbols |, *, +, and ?: it declares that <Agency> must include a <Name> element, but that <Abbrev> is optional and there can be zero or more occurrences of <Photo> (the example is contrived: there are more concise ways of expressing the same thing). The inner pair of brackets binds these last two tags together, adding the extra stipulation that there must be at least one occurrence of the <Abbrev> and <Photo> specifications.
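The kinship between DTD content models and ordinary regular expressions can be made concrete with a small sketch. The content model below is a hypothetical, simplified one chosen for illustration (it is not line 9 of Figure 4.7), and the function name conforms is likewise an assumption. The list of child element names is flattened into a string so that a standard regular expression can check it.

```python
import re

# Hypothetical, simplified content model (not the one in Figure 4.7):
#   <!ELEMENT Agency (Name, Abbrev?, Photo*)>
# Each child tag name is written followed by a comma, so the children of an
# element become one string that an ordinary regular expression can match.
AGENCY_MODEL = re.compile(r"Name,(Abbrev,)?(Photo,)*")

def conforms(children):
    """Check a list of child element names against the content model."""
    sequence = "".join(name + "," for name in children)
    return AGENCY_MODEL.fullmatch(sequence) is not None

print(conforms(["Name"]))                              # Name alone
print(conforms(["Name", "Abbrev", "Photo", "Photo"]))  # all parts present
print(conforms(["Abbrev", "Name"]))                    # wrong order
print(conforms(["Name", "Photo", "Abbrev"]))           # Abbrev after Photo
```

The first two sequences satisfy the model and the last two do not, because the comma operator in the model fixes the order of the children.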

Attributes also give a set of possible values, but here there is no nesting. Lines 12 and 13 show an example. The attribute is signaled by the keyword ATTLIST, followed by the element to which it applies (Name), the attribute’s name (hq), its type (character data), and any appearance restrictions (this one is optional). Lines 16 to 18 show another example, which introduces two attributes of the element Photo. Line 17 states that the src attribute is required, while line 18 provides a default value (namely "A photo") for the desc attribute.

In addition to &lt; and &amp; XML incorporates definitions for &gt; &apos; and &quot;. These are called entities, and new ones can be added in the DTD using the syntax ENTITY name "value". For instance, although XML does not have a definition for à as HTML does, one can be defined by <!ENTITY agrave "&#224;">, which relies on the Unicode standard for the numeric value. Entities are not restricted to single characters, but can be used for any excerpt of text (even if it is marked up). For example, <!ENTITY howto "How to Build a Digital Library"> is a shorthand way of encoding the title of this book.
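To see entity declarations at work, the sketch below parses a tiny document with an in-line DTD using Python's standard xml.etree.ElementTree module, whose underlying parser expands entities declared in the internal DTD subset. The <Note> document is invented for illustration; the two entity declarations are the ones quoted above.

```python
import xml.etree.ElementTree as ET

# Hypothetical document with an in-line DTD that declares two entities.
doc = """<!DOCTYPE Note [
  <!ENTITY agrave "&#224;">
  <!ENTITY howto "How to Build a Digital Library">
]>
<Note>Voil&agrave;: &howto;</Note>"""

note = ET.fromstring(doc)
print(note.text)   # entity references have been replaced by their values
```

Without the declarations in the DTD, the same parser would reject &agrave; as an undefined entity, since XML predefines only the five entities listed above.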

If several elements shared exactly the same attributes, it would be tedious (and error-prone) to repeat the definitions in each element. This can be handled using a special type of entity known as a parameter entity. To illustrate it, Figure 4.8 shows a modified and slightly restructured version of the DTD for the document in Figure 4.7 that defines attributes ident and style under the name sharedattrib (lines 3-5), which is then used to bestow these attributes on the <Title>, <Abbrev>, and <Name> elements (lines 11-14). Parameter entities are signaled using the percent symbol (%) and provide a form of shorthand for use within a DTD.

Declaring the shared attribute style as NMTOKEN (line 4) restricts this attribute’s characters to alphanumeric characters plus period (.), colon (:), hyphen (-), and underscore (_), where the first character must be a letter. Its twin ident is defined as ID (line 5), which is the same as NMTOKEN with the additional constraint that no two such attributes in the document can have the same value. ID therefore provides a mechanism for uniquely identifying elements, which in fact HTML enforces for an attribute with the particular name id. In XML, uniqueness can be bestowed on any attribute, whatever its name—such as ident.

Figure 4.8: Sample DTD using a parameterized entity

DTDs also support enumerated types, although none are present in the example. Attribute types can also be lists of tokens separated by white space (NMTOKENS) or references to ID attributes (IDREF).

Parsing XML

A document that conforms to XML syntax but does not supply a DTD is said to be well formed. One that conforms to XML syntax and does supply a DTD is said to be valid—provided that the content does indeed abide by the syntactic constraints defined in the DTD. DTDs can be stored externally, replacing the bracketed section in Figure 4.7 lines 5-19 by a URL. This allows documents with the same structure to be shared within or between organizations.

XML allows you to define new languages, and makes it easy to develop parsers for them. Generic parsers are available that are capable of parsing any XML file, and—if a DTD is present—also check that the file is valid. However, merely parsing a document is of limited utility. The result of a parser is just a yes/no indication of whether the document conforms to the general rules of XML (and to the more specific DTD). Far more useful would be a way of specifying what the parser should do with the data it is processing. This is arranged by having it build a parse tree and provide a programming interface—commonly called an API or application program interface—that lets the user traverse the tree and retrieve the data it contains.

The result of parsing any XML file is a root node whose descendants reflect both textual content and nested tags. At each tag’s node are stored the values of the tag’s attributes. There is a cross-platform and cross-language API called the document object model (DOM) that allows you to write programs that access and modify the document’s content, structure, and style.
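As a small sketch of what working with the DOM looks like, the example below uses the minidom implementation in Python's standard library. The XML fragment is hypothetical: it borrows element and attribute names from the description of Figure 4.7, but its content is invented for illustration.

```python
from xml.dom import minidom

# Hypothetical fragment loosely modeled on the description of Figure 4.7.
doc = minidom.parseString(
    '<NGODoc><Body>'
    '<Agency><Name hq="Paris, France">United Nations Educational, '
    'Scientific and Cultural Organization</Name>'
    '<Abbrev>UNESCO</Abbrev></Agency>'
    '<Agency><Name hq="Rome, Italy">Food and Agriculture Organization</Name>'
    '<Abbrev>FAO</Abbrev></Agency>'
    '</Body></NGODoc>'
)

# Traverse the tree: text lives in child text nodes, attributes on the element node.
agencies = []
for agency in doc.getElementsByTagName("Agency"):
    name = agency.getElementsByTagName("Name")[0]
    abbrev = agency.getElementsByTagName("Abbrev")[0]
    agencies.append((abbrev.firstChild.data, name.getAttribute("hq")))
print(agencies)

# Modify the tree through the same API.
doc.documentElement.setAttribute("checked", "yes")
print(doc.documentElement.getAttribute("checked"))
```

The same method names (getElementsByTagName, getAttribute, setAttribute) are available in DOM implementations for other languages, which is what makes the API cross-platform.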

Using XML in a digital library

XML is a powerful tool. It allows file formats within an organization—or a digital library—to be rationalized and shared. Furthermore, organizations can provide an explanation of the structures used in the form of a published machine-readable document. By formulating appropriate DTDs, for instance, different organizations can develop comprehensive formats for sharing information.

A notable example is the Text Encoding Initiative (TEI), founded in 1987, which developed a set of DTDs for representing scholarly texts in the humanities and social sciences. SGML was the implementation backbone, but the work has since been reconciled with XML. These DTDs are widely used by universities, museums, and commercial organizations to represent museum and archival information, classical and medieval works, dictionaries and lexicographies, religious tracts, legal documents, and many other forms of writing.

Examples are legion. The Oxford Text Archive is a nonprofit group that has provided long-term storage and maintenance of electronic texts for scholars over the last quarter-century. Perseus is a pioneering digital library project, dating from 1985, which focuses upon the ancient Greek world. Der Junge Goethe in Seiner Zeit is a collection of early works—poems, essays, legal writings, and letters—by the great German writer Johann von Goethe (1749-1832). The Japanese Text Initiative is a collaborative project that makes available a steadily increasing set of Japanese literature, accompanied by English, French, and German translations.

Various related standards increase XML’s power and expand its applicability. Used on its own, XML provides a way of expressing a document’s structural information, and/or metadata. Indeed, whether information is metadata or not is really a matter of perspective. Combined with additional standards, XML goes much further: it supports document restructuring, querying, information extraction, and formatting. The next section expands on the formatting standards, which equip XML with display capabilities comparable with HTML. More details appear in the topic, entitled More on markup and XML, at the book’s Web site: www.nzdl.org/howto.
