Representing Textual Documents (Digital Library)

Way back in 1963, at the dawn of interactive computing, the American National Standards Institute (ANSI) began work on a character set that would standardize text representation across a range of computing equipment and printers. (At the time, a variety of codes were in use by different computer manufacturers, such as an extension of a binary-coded decimal punched card code to deal with letters—EBCDIC, the Extended Binary Coded Decimal Interchange Code—and the European Baudot code for teleprinters, which squeezed its character repertoire into five bits by switching between letter and figure modes.) In 1968, ANSI finally ratified the result, called ASCII: American Standard Code for Information Interchange. Until recently, ASCII dominated text representation in computing.

ASCII

Table 4.1 shows the ASCII character set, with code values in decimal, octal, and hexadecimal. Codes 65-90 (decimal) represent the uppercase letters of the Roman alphabet, while codes 97-122 are lowercase letters. Codes 48-57 give the digits zero through nine. Codes 0-31 and 127 are control characters that have no printed form, and code 32 is the space. Some control characters govern physical aspects of the printer—for instance, BEL rings the bell (now downgraded to an electronic beep), BS backspaces the print head (now the cursor position). Others indicate parts of a communication protocol: SOH starts the heading of a message, STX starts its text. Interspersed between these blocks are sequences of punctuation and other nonletter symbols (codes 33-47, 58-64, 91-96, 123-126). Each code is represented in seven bits, which fits into a computer byte with one bit (the top one) free. In the original vision for ASCII, this was earmarked for a parity check.
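
As a quick check, these code ranges and the seven-bit property can be verified with a few lines of Python:

    import string

    # Letter and digit blocks match the code ranges given above.
    print(ord("A"), ord("Z"))   # 65 90
    print(ord("a"), ord("z"))   # 97 122
    print(ord("0"), ord("9"))   # 48 57

    # Upper- and lowercase letters differ by exactly 32 (hex 20), a single bit.
    assert ord("a") - ord("A") == 0x20

    # Every ASCII character fits in seven bits, leaving the byte's top bit free.
    assert all(ord(c) < 2**7 for c in string.printable)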


ASCII was a great step forward. It helped computers evolve over the following decades from scientific number-crunchers and fixed-format card-image data processors to interactive information appliances that permeate all walks of life. However, it has proved a great source of frustration to speakers of other languages. Many different extensions have been made to the basic character set, using codes 128-255 to specify accented and non-Roman characters for particular languages. ISO 8859-1, from the International Organization for Standardization (the international counterpart of the American standards organization, ANSI), extends ASCII for Western European languages. For example, it represents é as the single decimal value 233 rather than the ASCII sequence "e followed by backspace followed by ´."

Table 4.1: The ASCII character set

Dec  Oct  Hex  Char        Dec  Oct  Hex  Char
  0  000  00   NUL          64  100  40   @
  1  001  01   SOH          65  101  41   A
  2  002  02   STX          66  102  42   B
  3  003  03   ETX          67  103  43   C
  4  004  04   EOT          68  104  44   D
  5  005  05   ENQ          69  105  45   E
  6  006  06   ACK          70  106  46   F
  7  007  07   BEL          71  107  47   G
  8  010  08   BS           72  110  48   H
  9  011  09   HT           73  111  49   I
 10  012  0A   LF           74  112  4A   J
 11  013  0B   VT           75  113  4B   K
 12  014  0C   FF           76  114  4C   L
 13  015  0D   CR           77  115  4D   M
 14  016  0E   SO           78  116  4E   N
 15  017  0F   SI           79  117  4F   O
 16  020  10   DLE          80  120  50   P
 17  021  11   DC1          81  121  51   Q
 18  022  12   DC2          82  122  52   R
 19  023  13   DC3          83  123  53   S
 20  024  14   DC4          84  124  54   T
 21  025  15   NAK          85  125  55   U
 22  026  16   SYN          86  126  56   V
 23  027  17   ETB          87  127  57   W
 24  030  18   CAN          88  130  58   X
 25  031  19   EM           89  131  59   Y
 26  032  1A   SUB          90  132  5A   Z
 27  033  1B   ESC          91  133  5B   [
 28  034  1C   FS           92  134  5C   \
 29  035  1D   GS           93  135  5D   ]
 30  036  1E   RS           94  136  5E   ^
 31  037  1F   US           95  137  5F   _
 32  040  20   SPACE        96  140  60   `
 33  041  21   !            97  141  61   a
 34  042  22   "            98  142  62   b
 35  043  23   #            99  143  63   c
 36  044  24   $           100  144  64   d
 37  045  25   %           101  145  65   e
 38  046  26   &           102  146  66   f
 39  047  27   '           103  147  67   g
 40  050  28   (           104  150  68   h
 41  051  29   )           105  151  69   i
 42  052  2A   *           106  152  6A   j
 43  053  2B   +           107  153  6B   k
 44  054  2C   ,           108  154  6C   l
 45  055  2D   -           109  155  6D   m
 46  056  2E   .           110  156  6E   n
 47  057  2F   /           111  157  6F   o
 48  060  30   0           112  160  70   p
 49  061  31   1           113  161  71   q
 50  062  32   2           114  162  72   r
 51  063  33   3           115  163  73   s
 52  064  34   4           116  164  74   t
 53  065  35   5           117  165  75   u
 54  066  36   6           118  166  76   v
 55  067  37   7           119  167  77   w
 56  070  38   8           120  170  78   x
 57  071  39   9           121  171  79   y
 58  072  3A   :           122  172  7A   z
 59  073  3B   ;           123  173  7B   {
 60  074  3C   <           124  174  7C   |
 61  075  3D   =           125  175  7D   }
 62  076  3E   >           126  176  7E   ~
 63  077  3F   ?           127  177  7F   DEL

The latter is alien to the French way of thinking, for é is really a single character, generated by a single keystroke on French keyboards. For non-European languages like Hebrew and Chinese, ASCII is irrelevant. For them, other schemes have arisen: for example, GB and Big-5 are competing standards for Chinese; the former is used in the People’s Republic of China and the latter in Taiwan and Hong Kong.
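
A short Python sketch makes the contrast concrete: Latin-1 stores é in a single byte, while pure ASCII cannot encode it at all.

    # ISO 8859-1 (Latin-1) encodes é as the single byte value 233; plain 7-bit
    # ASCII cannot represent it at all.
    print("é".encode("latin-1")[0])   # 233
    try:
        "é".encode("ascii")
    except UnicodeEncodeError:
        print("é is not representable in ASCII")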

As the Internet exploded into the World Wide Web and burst into all countries and all corners of our lives, the situation became untenable. The world needed a new way of representing text.

Unicode

In 1988, Apple and Xerox began work on Unicode, a successor to ASCII that aimed to represent all the characters used in all the world’s languages. As word spread, the Unicode Consortium, a group of international and multinational companies, government organizations, and other interested parties, was formed in 1991. The result was a new standard, ISO/IEC 10646, ratified by the International Organization for Standardization in 1993. In fact, the standard melded the Unicode Consortium’s specification with ISO’s own work in this area.

Unicode continues to evolve. The main goal of representing the scripts of languages in use around the world has been achieved. Current work is addressing historic languages, such as Egyptian hieroglyphics and early Indo-European languages, and notations like music. A steady stream of additions, clarifications, and amendments eventually leads to new published versions of the standard. Of course, backward compatibility with the existing standard is taken for granted.

A standard is sterile unless it is adopted by vendors and users. Programming languages like Java have built-in Unicode support. Earlier ones—C, Perl, Python, to name a few—have standard Unicode libraries. Today’s operating systems all support Unicode, and application programs, including Web browsers, have passed on the benefits to the end user. Unicode is the default encoding for HTML and XML. People of the world, rejoice!

Unicode is universal: any document in an existing character set can be mapped into it. But it also satisfies a stronger requirement: the resulting Unicode file can be mapped back to the original character set without any loss of information. This requirement is called round-trip compatibility with existing coding schemes, and it is central to Unicode’s design. If a letter with an accent is represented as a single character in some existing character set, then an equivalent must also be placed in the Unicode set, even though there might be another way to achieve the same visual effect. Because the ISO 8859-1 character set mentioned above includes é as a single character, it must be represented as a single Unicode character—even though an identical glyph can be generated using a sequence along the lines of "e followed by backspace followed by ´."

Round-trip compatibility is an attractive way to facilitate integration with existing software and was most likely motivated by the pressing need for a nascent standard to gain wide acceptance. You can safely convert any document to Unicode, knowing that it can always be converted back again if necessary to work with legacy software. This is indeed a useful property. However, multiple representations for the same character can cause complications. These issues are discussed further in topic 8 (Section 8.2), which gives a detailed account of the Unicode standard.
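
The issue of multiple representations can be seen directly with Python's unicodedata module: a precomposed é and the two-character combining sequence render identically but compare as different strings until they are normalized.

    import unicodedata

    composed = "\u00e9"       # é as a single precomposed character
    decomposed = "e\u0301"    # e followed by a combining acute accent

    print(composed == decomposed)    # False: different code sequences, same glyph
    print(unicodedata.normalize("NFC", decomposed) == composed)   # True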

In the meantime, it is enough to know that the most popular method of representing Unicode is called UTF-8. UTF stands for "UCS Transformation Format," which is a nested acronym: UCS is the Universal Character Set defined by the standard—Unicode characters are "transformed" into this encoding format. UTF-8 is a variable-length encoding scheme whose basic unit is the byte. ASCII characters are automatically 1-byte UTF-8 codes; existing ASCII files are valid UTF-8.
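
The variable-length property is easy to observe; the following Python sketch encodes a handful of characters and checks that an ASCII string comes out byte-for-byte identical under both encodings:

    # UTF-8 is variable length: ASCII characters occupy one byte each, while
    # other characters take two, three, or four bytes.
    for ch in ("A", "é", "€", "中", "𝄞"):
        print(ch, len(ch.encode("utf-8")), ch.encode("utf-8"))

    # An ASCII-only string encodes to exactly the same bytes under UTF-8,
    # which is why existing ASCII files are already valid UTF-8.
    text = "plain ASCII text"
    assert text.encode("utf-8") == text.encode("ascii")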

Plain text

Unicode provides an all-encompassing scheme for representing characters, one suited to manipulation, searching, storage, and transmission. Now we turn our attention to document representation. The lowest common denominator for documents on computers has traditionally been plain, simple, raw ASCII text. Although there is no formal standard for this, certain conventions have grown up.

A text document comprises a sequence of character values interpreted in ordinary reading order: left to right, top to bottom. There is no header to denote the character set used. While 7-bit ASCII is the baseline, the 8-bit ISO ASCII extensions are often used, particularly for non-English text. This works well when text is processed by just one application program on a single computer, but when transferring between different applications—perhaps through e-mail, news, the Web, or file transfer—the various programs involved may make different assumptions. An alphabet mismatch often means that characters in the range 128-255 are displayed incorrectly.
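
The effect of such a mismatch is easy to reproduce. In the Python sketch below, bytes written as UTF-8 are read under a Latin-1 assumption, and the characters above 127 come out mangled:

    data = "déjà vu".encode("utf-8")
    print(data.decode("latin-1"))   # dÃ©jÃ  vu  -- wrong assumption, mangled accents
    print(data.decode("utf-8"))     # déjà vu   -- right assumption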

Plain text documents are formatted in simple ways. Explicit line breaks are usually included. Paragraphs are separated by two consecutive line breaks, or else the first line is indented. Tabs are frequently used for indentation and alignment. A fixed-width font is assumed; tab stops usually occur at every eighth character position. Common typing conventions are adopted to represent characters like dashes (two hyphens in a row). Headings are underlined manually using rows of hyphens or equal signs. Emphasis is often indicated by surrounding text with a single underscore (_like this_), or flanking words with asterisks (*like* *this*).

Different operating systems have adopted conflicting conventions for specifying line breaks. Historically, teletypes were modeled after typewriters. The line-feed character (ASCII 10, LF in Table 4.1) moves the paper up one line but retains the position of the print head. The carriage-return character (ASCII 13, CR in Table 4.1) returns the print head to the left margin but does not move the paper. A new line is constructed by issuing carriage return followed by line feed (logically the reverse order could be used, but the carriage-return-then-line-feed sequence is conventional and universally relied upon). Microsoft Windows uses this teletype-oriented interpretation. However, Unix and the Apple Macintosh adopt a different convention: the ASCII line-feed character alone moves to the next line and returns the print head to the left margin. This difference in interpretation can produce a strange-looking control character at the end of every line: ^M, or "carriage return." (Observe that CR and M are in the same row of Table 4.1. Control characters in the first column are often made visible by prefixing the corresponding character in the second column with ^.) While the meaning of the message is not obscured, the effect is distracting.

People who use the standard Internet file transfer protocol (FTP) sometimes wonder why it has separate ASCII and binary modes. The difference is that in ASCII mode, new lines are correctly translated when copying files between different systems. It would be wrong to apply this transformation to binary files, however. Modern text-handling programs conceal the difference from users by automatically detecting which new-line convention is being used and behaving accordingly. Of course, this can lead to brittleness: if assumptions break down, all line breaks are messed up and users are mystified.
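
As a rough sketch of the translation involved (the function name is ours, and real tools detect the convention rather than assuming it, taking care never to touch binary files):

    def normalize_newlines(data: bytes) -> bytes:
        # Translate Windows (CR LF) and old Macintosh (CR) line breaks to LF.
        # This mirrors what FTP's ASCII mode and many editors do.
        return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")

    windows_text = b"first line\r\nsecond line\r\n"
    print(normalize_newlines(windows_text))   # b'first line\nsecond line\n'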

Plain text is a simple, straightforward, but impoverished representation of documents in a digital library. Metadata cannot be included explicitly (except, possibly, as part of the file name). However, automatic processing is sometimes used to extract title, author, date, and so on. Extraction methods rely on informal document structuring conventions. The more consistent the structure, the easier extraction becomes. Conversely, the simpler the extraction technique, the more seriously things break down when formatting quirks are encountered. Unfortunately you cannot normally expect complete consistency and accuracy in large plain text document collections.
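
A toy extractor of this kind might look as follows; the conventions it assumes (the first non-blank line is the title, a nearby "by ..." line names the author) are illustrative only and will certainly fail on many real documents:

    import re

    def guess_metadata(text):
        # Naive plain-text metadata extraction relying on informal conventions.
        lines = [line.strip() for line in text.splitlines() if line.strip()]
        metadata = {"title": lines[0] if lines else ""}
        for line in lines[1:5]:
            match = re.match(r"(?i)by\s+(.+)", line)
            if match:
                metadata["author"] = match.group(1)
                break
        return metadata

    print(guess_metadata("A Digital Library Primer\nby A. N. Author\n\nChapter 1 ..."))
    # {'title': 'A Digital Library Primer', 'author': 'A. N. Author'}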

Indexing

Rapid searching is a core function of digital libraries that distinguishes them from physical libraries. The ability to search full text adds great value when large document collections are used for study or reference. Search can be used for particular words, sets of words, or sequences of words. Obviously, it is less important in recreational reading, which normally takes place sequentially, in one pass.

Before computers, full-text searching was confined to highly valued—often sacred—works for which a concordance had already been prepared. For example, some 300,000 word appearances are indexed in Cruden’s concordance of the Bible, printed on 774 pages. They are arranged alphabetically, from Aaron to Zuzims, and any particular word can be located quickly using binary search. Each probe into the index halves the number of potential locations for the target, and the correct page for an entry can be located by looking at no more than 10 pages—fewer if the searcher interpolates the position of an entry from the position of its initial letter in the alphabet. A term can usually be located in a few seconds, which is not bad considering that simple manual technology is being employed. Once an entry has been located, the concordance gives a list of references that the searcher can follow up. Figure 4.1 shows some of Cruden’s concordance entries for the word search.

In digital libraries searching is done by a computer, rather than by a person, but essentially the same techniques are used. The difference is that things happen faster. Usually it is possible to keep a list of terms in the computer’s main memory, and the list can be searched in a matter of microseconds. When the computer’s equivalent of the concordance entry becomes too large to store in main memory, secondary storage (usually a disk) is accessed to obtain the list of references, which typically takes a few milliseconds.

Figure 4.1: Entries for the word search in a biblical concordance.
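
In code, the in-memory lookup amounts to a binary search over a sorted term list, for example (with an invented toy vocabulary):

    from bisect import bisect_left

    # The sorted in-memory term list; each probe of a binary search halves
    # the remaining range, just as with the printed concordance.
    vocabulary = ["aaron", "abba", "search", "searched", "searchest", "zuzims"]

    def lookup(term):
        pos = bisect_left(vocabulary, term)
        if pos < len(vocabulary) and vocabulary[pos] == term:
            return pos      # in a real system this leads to a postings list on disk
        return None

    print(lookup("search"))     # 2
    print(lookup("missing"))    # None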

A full-text index to a document collection gives, for each word, the position of every occurrence of that word in the collection’s text. A moment’s reflection shows that the size of the index is commensurate with the size of the text, because an occurrence position is likely to occupy roughly the same number of bytes as a word in the text. (Four-byte integers, which are convenient in practice, are able to specify word positions in a 4-billion-word corpus. Conversely, an average English word has five or six characters and so also occupies a few bytes, the exact number depending on how it is stored and whether it is compressed.) We have implicitly assumed a word-level index, where occurrence positions give actual word locations in the collection. Space will be saved if locations are recorded within a unit, such as a paragraph, topic, or document, yielding a coarser index—partly because pointers can be smaller, but chiefly because when a particular word occurs several times in the same unit, only one pointer is required for that unit.
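
The trade-off between a word-level and a coarser document-level index can be sketched in a few lines of Python over two invented toy documents:

    from collections import defaultdict

    documents = ["the digital library", "the library of the future"]

    # Word-level index: every occurrence position (document number, word number).
    word_level = defaultdict(list)
    for doc_id, doc in enumerate(documents):
        for word_pos, word in enumerate(doc.split()):
            word_level[word].append((doc_id, word_pos))

    # Coarser document-level index: one pointer per document, with a count.
    doc_level = defaultdict(dict)
    for doc_id, doc in enumerate(documents):
        for word in doc.split():
            doc_level[word][doc_id] = doc_level[word].get(doc_id, 0) + 1

    print(word_level["the"])   # [(0, 0), (1, 0), (1, 3)]
    print(doc_level["the"])    # {0: 1, 1: 2}  -- repeated occurrences share one pointer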

A comprehensive index, capable of rapidly accessing all documents that satisfy a particular query, is a large data structure. Size, as well as being a drawback in its own right, also affects retrieval time, for the computer must read and interpret appropriate parts of the index to locate the desired information. Fortunately, there are interesting data structures and algorithms that can be applied to solve these problems. They are beyond the scope of this book, but references can be found in the "Notes and sources" section at the end of the topic.

The basic function of a full-text index is to provide, for any particular query term, a list of all the units that contain it, along with (for reasons to be explained shortly) the number of times it occurs in each unit on the list. It’s simplest to think of the "units" as being documents, although the granularity of the index may use paragraphs or topics as the units instead—or even individual words, in which case what is returned is a list of the word numbers corresponding to the query term. And it’s simplest to think of the query term as a word, although if stemming or case-folding is in effect, the term may correspond to several different words. For example, with stemming, the term computer may correspond to the words computer, computers, computation, compute, and so on; and with case-folding it may correspond to computer, Computer, COMPUTER, and even a mixed-case form like CompUTER (an unusual word, but not completely unknown—for example, it appears in this book!).
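
For instance, a minimal sketch of case-folding at indexing time (stemming, which would merge still more forms, is omitted here):

    from collections import Counter

    postings = Counter()
    for word in "Computer science uses computers; COMPUTER programs run fast".split():
        term = word.strip(";.,").lower()    # case-fold before counting
        postings[term] += 1

    print(postings["computer"])    # 2 -- Computer and COMPUTER fold together
    print(postings["computers"])   # 1 -- would merge with the above under stemming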

When one indexes a large text, it rapidly becomes clear that just a few common words—such as of, the, and and—account for a large number of the entries in the index. People have argued that these words should be omitted, since they take up so much space and are not likely to be needed in queries, and for this reason they are often called stop words. However, some index compilers and users have observed that it is better to leave stop words in. Although a few dozen stop words may account for around 30 percent of the references that an index contains, it is possible to represent them in a way that consumes relatively little space.

A query to a full-text retrieval system usually contains several words. How they are interpreted depends on the type of query. Two common types, both explained in Section 3.4, are Boolean queries and ranked queries. In either case, the process of responding to the query involves looking up, for each term, the list of documents it appears in, and performing logical operations on these lists. In the case of Boolean queries, the result is a list of documents that satisfy the query, and this list (or the first part of it) is displayed to the user.
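
For example, with a small invented index, a two-term Boolean AND is just an intersection of the two document lists:

    # A toy index mapping each term to the (sorted) list of documents containing it.
    index = {
        "digital": [1, 3, 4, 7],
        "library": [2, 3, 7, 9],
    }

    # Boolean AND: keep only documents that appear in both lists. Production
    # systems merge the sorted lists directly; a set intersection shows the idea.
    result = sorted(set(index["digital"]) & set(index["library"]))
    print(result)   # [3, 7]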

In the case of ranked queries, the final list of documents is sorted according to the ranking heuristic that is in place. As Section 3.4 explains, these heuristics gauge the similarity of each document to the set of terms that constitute the query. For each term, they weigh the frequency with which it appears in the document being considered (the more it is mentioned, the greater the similarity) against its frequency in the document collection as a whole (common terms are less significant). This is why the index stores the number of times each word appears in each document. A great many documents—perhaps all documents in the collection—may satisfy a particular ranked query (if the query contains the word the, all English documents would probably qualify). Retrieval systems take great pains to work efficiently even on such queries; for example, they use techniques that avoid the need to sort the list fully in order to find the top few elements.
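
A minimal sketch of this process is shown below; the tf-idf style weighting used here is one simple variant rather than the particular heuristic of Section 3.4, and heapq.nlargest stands in for the partial-sorting techniques just mentioned:

    import heapq
    import math

    # Toy index: term -> {document: within-document frequency}, i.e. the counts
    # that the index stores for ranking purposes.
    index = {
        "digital": {1: 3, 2: 1, 3: 2},
        "library": {1: 2, 3: 5},
        "the":     {1: 9, 2: 7, 3: 8, 4: 6},
    }
    total_docs = 4

    def rank(query_terms, top_k=2):
        scores = {}
        for term in query_terms:
            postings = index.get(term, {})
            if not postings:
                continue
            # Terms that occur in many documents (like "the") carry little weight.
            idf = math.log(total_docs / len(postings))
            for doc, tf in postings.items():
                scores[doc] = scores.get(doc, 0.0) + tf * idf
        # Find the top few documents without sorting the whole list.
        return heapq.nlargest(top_k, scores.items(), key=lambda item: item[1])

    print(rank(["digital", "library", "the"]))   # document 3 first, then document 1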

In effect, the indexing process treats each document (or whatever the unit of granularity is) as a "bag of words." What matters is the words that appear in the document and (for ranked queries) the frequency with which they appear. The query is also treated as a bag of words. This representation provides the foundation for full-text indexing. Whenever documents are presented in forms other than word-delineated plain text, they must be reduced to this form so that the corresponding bag of words can be determined.

Word segmentation

Before an index is created, the text must first be divided into words. A word is a sequence of alphanumeric characters surrounded by white space or punctuation. Usually some large limit is placed on the length of words—perhaps 16 characters, or 256 characters. Another practical rule of thumb is to limit numbers that appear in the text to a far smaller size—perhaps four numeric characters, so that only numbers of 9,999 or less are indexed. Without this restriction, the size of the vocabulary might be artificially inflated—for example, a long document with numbered paragraphs could contain hundreds of thousands of different integers—which negatively affects certain technical aspects of the indexing procedure. Years, however, which have four digits, are still preserved as single words.
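
One plausible form such rules might take (the limits and the tokenizer itself are illustrative, not a prescribed implementation):

    import re

    MAX_WORD_LENGTH = 16     # illustrative limits; real systems choose their own
    MAX_NUMBER_LENGTH = 4    # years survive, long paragraph numbers do not

    def tokenize(text):
        # Index terms are runs of alphanumeric characters bounded by white space
        # or punctuation; over-long words and over-long numbers are discarded.
        words = []
        for token in re.findall(r"[A-Za-z0-9]+", text):
            if token.isdigit():
                if len(token) <= MAX_NUMBER_LENGTH:
                    words.append(token)
            elif len(token) <= MAX_WORD_LENGTH:
                words.append(token)
        return words

    print(tokenize("Paragraph 1048576: the library opened in 1998."))
    # ['Paragraph', 'the', 'library', 'opened', 'in', '1998']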

In some languages, plain text presents special problems. Languages like Chinese and Japanese are written without any spaces or other word delimiters (except for punctuation marks). This causes obvious problems in full-text indexing: to get a bag of words, we must be able to identify the words. One possibility is to treat each character as an individual word. However, this produces poor retrieval results. An extreme analogy is an English-language retrieval system that, instead of finding all documents containing the words digital library, found all documents containing the constituent letters a, b, d, g, i, l, r, t, and y. Of course the searcher will receive all sought-after documents, but they are diluted with countless others. And ranking would be based on letter frequencies, not word frequencies. The situation in Chinese is not so bad, for individual characters are far more numerous, and more meaningful, than individual letters in English. But they are less meaningful than words. Topic 8 discusses the problem of segmenting such languages into words.
