Representing Textual Documents (Digital Library)

Way back in 1963, at the dawn of interactive computing, the American National Standards Institute (ANSI) began work on a character set that would standardize text representation across a range of computing equipment and printers. (At the time, a variety of codes were in use by different computer manufacturers, such as an extension of a binary-coded decimal punched card code to deal with letters—EBCDIC, the Extended Binary Coded Decimal Interchange Code—and the European Baudot code for teleprinters, which squeezed its character repertoire into five bits by switching between letter and figure modes.) In 1968, ANSI finally ratified the result, called ASCII: American Standard Code for Information Interchange. Until recently, ASCII dominated text representation in computing.

ASCII

Table 4.1 shows the ASCII character set, with code values in decimal, octal, and hexadecimal. Codes 65-90 (decimal) represent the uppercase letters of the Roman alphabet, while codes 97-122 are lowercase letters. Codes 48-57 give the digits zero through nine. Codes 0-31 and 127 are control characters that have no printed form, and code 32 is the space. Some control characters govern physical aspects of the printer—for instance, BEL rings the bell (now downgraded to an electronic beep), BS backspaces the print head (now the cursor position). Others indicate parts of a communication protocol: SOH starts the heading of a message, STX starts its text. Interspersed between these blocks are sequences of punctuation and other nonletter symbols (codes 33-47, 58-64, 91-96, 123-126). Each code is represented in seven bits, which fits into a computer byte with one bit (the top one) free. In the original vision for ASCII, this was earmarked for a parity check.
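
As a quick check, these code ranges and the seven-bit property can be verified with a few lines of Python:

    import string

    # Letter and digit blocks match the code ranges given above.
    print(ord("A"), ord("Z"))   # 65 90
    print(ord("a"), ord("z"))   # 97 122
    print(ord("0"), ord("9"))   # 48 57

    # Upper- and lowercase letters differ by exactly 32 (hex 20), a single bit.
    assert ord("a") - ord("A") == 0x20

    # Every ASCII character fits in seven bits, leaving the byte's top bit free.
    assert all(ord(c) < 2**7 for c in string.printable)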


ASCII was a great step forward. It helped computers evolve over the following decades from scientific number-crunchers and fixed-format card-image data processors to interactive information appliances that permeate all walks of life. However, it has proved a great source of frustration to speakers of other languages. Many different extensions have been made to the basic character set, using codes 128-255 to specify accented and non-Roman characters for particular languages. ISO 8859-1, from the International Organization for Standardization (the international counterpart of the American standards organization, ANSI), extends ASCII for Western European languages. For example, it represents é as the single decimal value 233 rather than the ASCII sequence "e followed by backspace followed by ´."

Table 4.1: The ASCII character set

Dec  Oct  Hex  Char        Dec  Oct  Hex  Char
  0  000  00   NUL          64  100  40   @
  1  001  01   SOH          65  101  41   A
  2  002  02   STX          66  102  42   B
  3  003  03   ETX          67  103  43   C
  4  004  04   EOT          68  104  44   D
  5  005  05   ENQ          69  105  45   E
  6  006  06   ACK          70  106  46   F
  7  007  07   BEL          71  107  47   G
  8  010  08   BS           72  110  48   H
  9  011  09   HT           73  111  49   I
 10  012  0A   LF           74  112  4A   J
 11  013  0B   VT           75  113  4B   K
 12  014  0C   FF           76  114  4C   L
 13  015  0D   CR           77  115  4D   M
 14  016  0E   SO           78  116  4E   N
 15  017  0F   SI           79  117  4F   O
 16  020  10   DLE          80  120  50   P
 17  021  11   DC1          81  121  51   Q
 18  022  12   DC2          82  122  52   R
 19  023  13   DC3          83  123  53   S
 20  024  14   DC4          84  124  54   T
 21  025  15   NAK          85  125  55   U
 22  026  16   SYN          86  126  56   V
 23  027  17   ETB          87  127  57   W
 24  030  18   CAN          88  130  58   X
 25  031  19   EM           89  131  59   Y
 26  032  1A   SUB          90  132  5A   Z
 27  033  1B   ESC          91  133  5B   [
 28  034  1C   FS           92  134  5C   \
 29  035  1D   GS           93  135  5D   ]
 30  036  1E   RS           94  136  5E   ^
 31  037  1F   US           95  137  5F   _
 32  040  20   SPACE        96  140  60   `
 33  041  21   !            97  141  61   a
 34  042  22   "            98  142  62   b
 35  043  23   #            99  143  63   c
 36  044  24   $           100  144  64   d
 37  045  25   %           101  145  65   e
 38  046  26   &           102  146  66   f
 39  047  27   '           103  147  67   g
 40  050  28   (           104  150  68   h
 41  051  29   )           105  151  69   i
 42  052  2A   *           106  152  6A   j
 43  053  2B   +           107  153  6B   k
 44  054  2C   ,           108  154  6C   l
 45  055  2D   -           109  155  6D   m
 46  056  2E   .           110  156  6E   n
 47  057  2F   /           111  157  6F   o
 48  060  30   0           112  160  70   p
 49  061  31   1           113  161  71   q
 50  062  32   2           114  162  72   r
 51  063  33   3           115  163  73   s
 52  064  34   4           116  164  74   t
 53  065  35   5           117  165  75   u
 54  066  36   6           118  166  76   v
 55  067  37   7           119  167  77   w
 56  070  38   8           120  170  78   x
 57  071  39   9           121  171  79   y
 58  072  3A   :           122  172  7A   z
 59  073  3B   ;           123  173  7B   {
 60  074  3C   <           124  174  7C   |
 61  075  3D   =           125  175  7D   }
 62  076  3E   >           126  176  7E   ~
 63  077  3F   ?           127  177  7F   DEL

The latter is alien to the French way of thinking, for é is really a single character, generated by a single keystroke on French keyboards. For non-European languages like Hebrew and Chinese, ASCII is irrelevant. For them, other schemes have arisen: for example, GB and Big-5 are competing standards for Chinese; the former is used in the People’s Republic of China and the latter in Taiwan and Hong Kong.
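
A short Python sketch makes the contrast concrete: Latin-1 stores é in a single byte, while pure ASCII cannot encode it at all.

    # ISO 8859-1 (Latin-1) encodes é as the single byte value 233; plain 7-bit
    # ASCII cannot represent it at all.
    print("é".encode("latin-1")[0])   # 233
    try:
        "é".encode("ascii")
    except UnicodeEncodeError:
        print("é is not representable in ASCII")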

As the Internet exploded into the World Wide Web and burst into all countries and all corners of our lives, the situation became untenable. The world needed a new way of representing text.

Unicode

In 1988, Apple and Xerox began work on Unicode, a successor to ASCII that aimed to represent all the characters used in all the world’s languages. As word spread, the Unicode Consortium, a group of international and multinational companies, government organizations, and other interested parties, was formed in 1991. The result was a new standard, ISO/IEC 10646, ratified by the International Organization for Standardization in 1993. In fact, the standard melded the Unicode Consortium’s specification with ISO’s own work in this area.

Unicode continues to evolve. The main goal of representing the scripts of languages in use around the world has been achieved. Current work is addressing historic languages, such as Egyptian hieroglyphics and early Indo-European languages, and notations like music. A steady stream of additions, clarifications, and amendments eventually leads to new published versions of the standard. Of course, backward compatibility with the existing standard is taken for granted.

A standard is sterile unless it is adopted by vendors and users. Programming languages like Java have built-in Unicode support. Earlier ones—C, Perl, Python, to name a few—have standard Unicode libraries. Today’s operating systems all support Unicode, and application programs, including Web browsers, have passed on the benefits to the end user. Unicode is the default encoding for HTML and XML. People of the world, rejoice!

Unicode is universal: any document in an existing character set can be mapped into it. But it also satisfies a stronger requirement: the resulting Unicode file can be mapped back to the original character set without any loss of information. This requirement is called round-trip compatibility with existing coding schemes, and it is central to Unicode’s design. If a letter with an accent is represented as a single character in some existing character set, then an equivalent must also be placed in the Unicode set, even though there might be another way to achieve the same visual effect. Because the ISO 8859-1 character set mentioned above includes é as a single character, it must be represented as a single Unicode character—even though an identical glyph can be generated using a sequence along the lines of "e followed by backspace followed by ´."

Round-trip compatibility is an attractive way to facilitate integration with existing software and was most likely motivated by the pressing need for a nascent standard to gain wide acceptance. You can safely convert any document to Unicode, knowing that it can always be converted back again if necessary to work with legacy software. This is indeed a useful property. However, multiple representations for the same character can cause complications. These issues are discussed further in topic 8 (Section 8.2), which gives a detailed account of the Unicode standard.
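
The issue of multiple representations can be seen directly with Python's unicodedata module: a precomposed é and the two-character combining sequence render identically but compare as different strings until they are normalized.

    import unicodedata

    composed = "\u00e9"       # é as a single precomposed character
    decomposed = "e\u0301"    # e followed by a combining acute accent

    print(composed == decomposed)    # False: different code sequences, same glyph
    print(unicodedata.normalize("NFC", decomposed) == composed)   # True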

In the meantime, it is enough to know that the most popular method of representing Unicode is called UTF-8. UTF stands for "UCS Transformation Format," which is a nested acronym: UCS is the Universal Character Set defined by the standard—Unicode characters are "transformed" into this encoding format. UTF-8 is a variable-length encoding scheme whose basic unit is the byte. ASCII characters are automatically 1-byte UTF-8 codes; existing ASCII files are valid UTF-8.
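
The variable-length property is easy to observe; the following Python sketch encodes a handful of characters and checks that an ASCII string comes out byte-for-byte identical under both encodings:

    # UTF-8 is variable length: ASCII characters occupy one byte each, while
    # other characters take two, three, or four bytes.
    for ch in ("A", "é", "€", "中", "𝄞"):
        print(ch, len(ch.encode("utf-8")), ch.encode("utf-8"))

    # An ASCII-only string encodes to exactly the same bytes under UTF-8,
    # which is why existing ASCII files are already valid UTF-8.
    text = "plain ASCII text"
    assert text.encode("utf-8") == text.encode("ascii")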

Plain text

Unicode provides an all-encompassing scheme for representing characters, one suited to manipulation, searching, storage, and transmission. Now we turn our attention to document representation. The lowest common denominator for documents on computers has traditionally been plain, simple, raw ASCII text. Although there is no formal standard for this, certain conventions have grown up.

A text document comprises a sequence of character values interpreted in ordinary reading order: left to right, top to bottom. There is no header to denote the character set used. While 7-bit ASCII is the baseline, the 8-bit ISO ASCII extensions are often used, particularly for non-English text. This works well when text is processed by just one application program on a single computer, but when transferring between different applications—perhaps through e-mail, news, the Web, or file transfer—the various programs involved may make different assumptions. An alphabet mismatch often means that characters in the range 128-255 are displayed incorrectly.
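
The effect of such a mismatch is easy to reproduce. In the Python sketch below, bytes written as UTF-8 are read under a Latin-1 assumption, and the characters above 127 come out mangled:

    data = "déjà vu".encode("utf-8")
    print(data.decode("latin-1"))   # dÃ©jÃ  vu  -- wrong assumption, mangled accents
    print(data.decode("utf-8"))     # déjà vu   -- right assumption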

Plain text documents are formatted in simple ways. Explicit line breaks are usually included. Paragraphs are separated by two consecutive line breaks, or else the first line is indented. Tabs are frequently used for indentation and alignment. A fixed-width font is assumed; tab stops usually occur at every eighth character position. Common typing conventions are adopted to represent characters like dashes (two hyphens in a row). Headings are underlined manually using rows of hyphens or equal signs. Emphasis is often indicated by surrounding text with a single underscore (_like this_), or flanking words with asterisks (*like* *this*).

Different operating systems have adopted conflicting conventions for specifying line breaks. Historically, teletypes were modeled after typewriters. The line-feed character (ASCII 10, LF in Table 4.1) moves the paper up one line but retains the position of the print head. The carriage-return character (ASCII 13, CR in Table 4.1) returns the print head to the left margin but does not move the paper. A new line is constructed by issuing carriage return followed by line feed (logically the reverse order could be used, but the carriage-return-then-line-feed sequence is conventional and universally relied upon). Microsoft Windows uses this teletype-oriented interpretation. However, Unix and the Apple Macintosh adopt a different convention: the ASCII line-feed character alone moves to the next line and returns the print head to the left margin. This difference in interpretation can produce a strange-looking control character at the end of every line: ^M, or "carriage return." (Observe that CR and M are in the same row of Table 4.1. Control characters in the first column are often made visible by prefixing the corresponding character in the second column with ^.) While the meaning of the message is not obscured, the effect is distracting.

People who use the standard Internet file transfer protocol (FTP) sometimes wonder why it has separate ASCII and binary modes. The difference is that in ASCII mode, new lines are correctly translated when copying files between different systems. It would be wrong to apply this transformation to binary files, however. Modern text-handling programs conceal the difference from users by automatically detecting which new-line convention is being used and behaving accordingly. Of course, this can lead to brittleness: if assumptions break down, all line breaks are messed up and users are mystified.
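
As a rough sketch of the translation involved (the function name is ours, and real tools detect the convention rather than assuming it, taking care never to touch binary files):

    def normalize_newlines(data: bytes) -> bytes:
        # Translate Windows (CR LF) and old Macintosh (CR) line breaks to LF.
        # This mirrors what FTP's ASCII mode and many editors do.
        return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")

    windows_text = b"first line\r\nsecond line\r\n"
    print(normalize_newlines(windows_text))   # b'first line\nsecond line\n'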

Plain text is a simple, straightforward, but impoverished representation of documents in a digital library. Metadata cannot be included explicitly (except, possibly, as part of the file name). However, automatic processing is sometimes used to extract title, author, date, and so on. Extraction methods rely on informal document structuring conventions. The more consistent the structure, the easier extraction becomes. Conversely, the simpler the extraction technique, the more seriously things break down when formatting quirks are encountered. Unfortunately you cannot normally expect complete consistency and accuracy in large plain text document collections.
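
A toy extractor of this kind might look as follows; the conventions it assumes (the first non-blank line is the title, a nearby "by ..." line names the author) are illustrative only and will certainly fail on many real documents:

    import re

    def guess_metadata(text):
        # Naive plain-text metadata extraction relying on informal conventions.
        lines = [line.strip() for line in text.splitlines() if line.strip()]
        metadata = {"title": lines[0] if lines else ""}
        for line in lines[1:5]:
            match = re.match(r"(?i)by\s+(.+)", line)
            if match:
                metadata["author"] = match.group(1)
                break
        return metadata

    print(guess_metadata("A Digital Library Primer\nby A. N. Author\n\nChapter 1 ..."))
    # {'title': 'A Digital Library Primer', 'author': 'A. N. Author'}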

Indexing

Rapid searching is a core function of digital libraries that distinguishes them from physical libraries. The ability to search full text adds great value when large document collections are used for study or reference. Search can be used for particular words, sets of words, or sequences of words. Obviously, it is less important in recreational reading, which normally takes place sequentially, in one pass.

Before computers, full-text searching was confined to highly valued—often sacred—works for which a concordance had already been prepared. For example, some 300,000 word appearances are indexed in Cruden’s concordance of the Bible, printed on 774 pages. They are arranged alphabetically, from Aaron to Zuzims, and any particular word can be located quickly using binary search. Each probe into the index halves the number of potential locations for the target, and the correct page for an entry can be located by looking at no more than 10 pages—fewer if the searcher interpolates the position of an entry from the position of its initial letter in the alphabet. A term can usually be located in a few seconds, which is not bad considering that simple manual technology is being employed. Once an entry has been located, the concordance gives a list of references that the searcher can follow up. Figure 4.1 shows some of Cruden’s concordance entries for the word search.

In digital libraries searching is done by a computer, rather than by a person, but essentially the same techniques are used. The difference is that things happen faster. Usually it is possible to keep a list of terms in the computer’s main memory, and the list can be searched in a matter of microseconds. When the computer’s equivalent of the concordance entry becomes too large to store in main memory, secondary storage (usually a disk) is accessed to obtain the list of references, which typically takes a few milliseconds.

Figure 4.1: Entries for the word search in a biblical concordance.
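
In code, the in-memory lookup amounts to a binary search over a sorted term list, for example (with an invented toy vocabulary):

    from bisect import bisect_left

    # The sorted in-memory term list; each probe of a binary search halves
    # the remaining range, just as with the printed concordance.
    vocabulary = ["aaron", "abba", "search", "searched", "searchest", "zuzims"]

    def lookup(term):
        pos = bisect_left(vocabulary, term)
        if pos < len(vocabulary) and vocabulary[pos] == term:
            return pos      # in a real system this leads to a postings list on disk
        return None

    print(lookup("search"))     # 2
    print(lookup("missing"))    # None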

A full-text index to a document collection gives, for each word, the position of every occurrence of that word in the collection’s text. A moment’s reflection shows that the size of the index is commensurate with the size of the text, because an occurrence position is likely to occupy roughly the same number of bytes as a word in the text. (Four-byte integers, which are convenient in practice, are able to specify word positions in a 4-billion-word corpus. Conversely, an average English word has five or six characters and so also occupies a few bytes, the exact number depending on how it is stored and whether it is compressed.) We have implicitly assumed a word-level index, where occurrence positions give actual word locations in the collection. Space will be saved if locations are recorded within a unit, such as a paragraph, topic, or document, yielding a coarser index—partly because pointers can be smaller, but chiefly because when a particular word occurs several times in the same unit, only one pointer is required for that unit.
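
The trade-off between a word-level and a coarser document-level index can be sketched in a few lines of Python over two invented toy documents:

    from collections import defaultdict

    documents = ["the digital library", "the library of the future"]

    # Word-level index: every occurrence position (document number, word number).
    word_level = defaultdict(list)
    for doc_id, doc in enumerate(documents):
        for word_pos, word in enumerate(doc.split()):
            word_level[word].append((doc_id, word_pos))

    # Coarser document-level index: one pointer per document, with a count.
    doc_level = defaultdict(dict)
    for doc_id, doc in enumerate(documents):
        for word in doc.split():
            doc_level[word][doc_id] = doc_level[word].get(doc_id, 0) + 1

    print(word_level["the"])   # [(0, 0), (1, 0), (1, 3)]
    print(doc_level["the"])    # {0: 1, 1: 2}  -- repeated occurrences share one pointer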

A comprehensive index, capable of rapidly accessing all documents that satisfy a particular query, is a large data structure. Size, as well as being a drawback in its own right, also affects retrieval time, for the computer must read and interpret appropriate parts of the index to locate the desired information. Fortunately, there are interesting data structures and algorithms that can be applied to solve these problems. They are beyond the scope of this book, but references can be found in the "Notes and sources" section at the end of the topic.

The basic function of a full-text index is to provide, for any particular query term, a list of all the units that contain it, along with (for reasons to be explained shortly) the number of times it occurs in each unit on the list. It’s simplest to think of the "units" as being documents, although the granularity of the index may use paragraphs or topics as the units instead—or even individual words, in which case what is returned is a list of the word numbers corresponding to the query term. And it’s simplest to think of the query term as a word, although if stemming or case-folding is in effect, the term may correspond to several different words. For example, with stemming, the term computer may correspond to the words computer, computers, computation, compute, and so on; and with case-folding it may correspond to computer, Computer, COMPUTER, and even a mixed-case form like CompUTER (an unusual word, but not completely unknown—for example, it appears in this book!).
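
For instance, a minimal sketch of case-folding at indexing time (stemming, which would merge still more forms, is omitted here):

    from collections import Counter

    postings = Counter()
    for word in "Computer science uses computers; COMPUTER programs run fast".split():
        term = word.strip(";.,").lower()    # case-fold before counting
        postings[term] += 1

    print(postings["computer"])    # 2 -- Computer and COMPUTER fold together
    print(postings["computers"])   # 1 -- would merge with the above under stemming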

When one indexes a large text, it rapidly becomes clear that just a few common words—such as of, the, and and—account for a large number of the entries in the index. People have argued that these words should be omitted, since they take up so much space and are not likely to be needed in queries, and for this reason they are often called stop words. However, some index compilers and users have observed that it is better to leave stop words in. Although a few dozen stop words may account for around 30 percent of the references that an index contains, it is possible to represent them in a way that consumes relatively little space.

A query to a full-text retrieval system usually contains several words. How they are interpreted depends on the type of query. Two common types, both explained in Section 3.4, are Boolean queries and ranked queries. In either case, the process of responding to the query involves looking up, for each term, the list of documents it appears in, and performing logical operations on these lists. In the case of Boolean queries, the result is a list of documents that satisfy the query, and this list (or the first part of it) is displayed to the user.
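
For example, with a small invented index, a two-term Boolean AND is just an intersection of the two document lists:

    # A toy index mapping each term to the (sorted) list of documents containing it.
    index = {
        "digital": [1, 3, 4, 7],
        "library": [2, 3, 7, 9],
    }

    # Boolean AND: keep only documents that appear in both lists. Production
    # systems merge the sorted lists directly; a set intersection shows the idea.
    result = sorted(set(index["digital"]) & set(index["library"]))
    print(result)   # [3, 7]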

In the case of ranked queries, the final list of documents is sorted according to the ranking heuristic that is in place. As Section 3.4 explains, these heuristics gauge the similarity of each document to the set of terms that constitute the query. For each term, they weigh the frequency with which it appears in the document being considered (the more it is mentioned, the greater the similarity) against its frequency in the document collection as a whole (common terms are less significant). This is why the index stores the number of times each word appears in each document. A great many documents—perhaps all documents in the collection—may satisfy a particular ranked query (if the query contains the word the, all English documents would probably qualify). Retrieval systems take great pains to work efficiently even on such queries; for example, they use techniques that avoid the need to sort the list fully in order to find the top few elements.
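
A minimal sketch of this process is shown below; the tf-idf style weighting used here is one simple variant rather than the particular heuristic of Section 3.4, and heapq.nlargest stands in for the partial-sorting techniques just mentioned:

    import heapq
    import math

    # Toy index: term -> {document: within-document frequency}, i.e. the counts
    # that the index stores for ranking purposes.
    index = {
        "digital": {1: 3, 2: 1, 3: 2},
        "library": {1: 2, 3: 5},
        "the":     {1: 9, 2: 7, 3: 8, 4: 6},
    }
    total_docs = 4

    def rank(query_terms, top_k=2):
        scores = {}
        for term in query_terms:
            postings = index.get(term, {})
            if not postings:
                continue
            # Terms that occur in many documents (like "the") carry little weight.
            idf = math.log(total_docs / len(postings))
            for doc, tf in postings.items():
                scores[doc] = scores.get(doc, 0.0) + tf * idf
        # Find the top few documents without sorting the whole list.
        return heapq.nlargest(top_k, scores.items(), key=lambda item: item[1])

    print(rank(["digital", "library", "the"]))   # document 3 first, then document 1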

In effect, the indexing process treats each document (or whatever the unit of granularity is) as a "bag of words." What matters is the words that appear in the document and (for ranked queries) the frequency with which they appear. The query is also treated as a bag of words. This representation provides the foundation for full-text indexing. Whenever documents are presented in forms other than word-delineated plain text, they must be reduced to this form so that the corresponding bag of words can be determined.

Word segmentation

Before an index is created, the text must first be divided into words. A word is a sequence of alphanumeric characters surrounded by white space or punctuation. Usually some large limit is placed on the length of words—perhaps 16 characters, or 256 characters. Another practical rule of thumb is to limit numbers that appear in the text to a far smaller size—perhaps four numeric characters, so that only numbers of 9,999 or less are indexed. Without this restriction, the size of the vocabulary might be artificially inflated—for example, a long document with numbered paragraphs could contain hundreds of thousands of different integers—which negatively affects certain technical aspects of the indexing procedure. Years, however, which have four digits, are still preserved as single words.
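
One plausible form such rules might take (the limits and the tokenizer itself are illustrative, not a prescribed implementation):

    import re

    MAX_WORD_LENGTH = 16     # illustrative limits; real systems choose their own
    MAX_NUMBER_LENGTH = 4    # years survive, long paragraph numbers do not

    def tokenize(text):
        # Index terms are runs of alphanumeric characters bounded by white space
        # or punctuation; over-long words and over-long numbers are discarded.
        words = []
        for token in re.findall(r"[A-Za-z0-9]+", text):
            if token.isdigit():
                if len(token) <= MAX_NUMBER_LENGTH:
                    words.append(token)
            elif len(token) <= MAX_WORD_LENGTH:
                words.append(token)
        return words

    print(tokenize("Paragraph 1048576: the library opened in 1998."))
    # ['Paragraph', 'the', 'library', 'opened', 'in', '1998']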

In some languages, plain text presents special problems. Languages like Chinese and Japanese are written without any spaces or other word delimiters (except for punctuation marks). This causes obvious problems in full-text indexing: to get a bag of words, we must be able to identify the words. One possibility is to treat each character as an individual word. However, this produces poor retrieval results. An extreme analogy is an English-language retrieval system that, instead of finding all documents containing the words digital library, found all documents containing the constituent letters a, b, d, g, i, l, r, t, and y. Of course the searcher will receive all sought-after documents, but they are diluted with countless others. And ranking would be based on letter frequencies, not word frequencies. The situation in Chinese is not so bad, for individual characters are far more numerous, and more meaningful, than individual letters in English. But they are less meaningful than words. Topic 8 discusses the problem of segmenting such languages into words.
