Searching (Digital Library) Part 1

Electronic document delivery is the primary raison d’etre for most digital libraries. But searching comes a close second—in particular, searching the full text of documents. Conventional library catalog searches are restricted to metadata, but digital libraries have access to the full content of the objects they contain. This is a great advantage.

Figure 3.10: Document snippets used as surrogates

Figure 3.11 shows a request that seeks paragraphs in English containing both the word begin and the word beginning (in order to find the Alice in Wonderland quotation "begin at the beginning"), and the computer’s response: a list of documents that match. These pages provide a plain and simple search mechanism with rudimentary functionality. Digital libraries, particularly ones targeted to the general public, need simple facilities that fulfill what most people want most of the time, with options for more advanced users. To quote a famous saying, "Simple things should be simple, complex things should be possible."

In Figure 3.11 readers can specify the unit to be searched, the language, and the type of search. The selection varies from collection to collection. Searchable units typically include paragraphs, sections, and documents; as well as section titles, document titles, and authors. The first group involves the full text; the second involves metadata.

Full documents, not just the relevant paragraphs, are returned in the list of titles in Figure 3.11b— even though the search was for paragraphs. In this digital library system, documents are defined as what the user finally sees on the screen as the result of a search. If you want paragraphs to be presented individually, you must define paragraphs to be the "documents"—and arrange for them to link to the paragraphs before and after so that users can read the text in sequence. This is not hard to do.

Types of queries

In Figure 3.11 the option all of the words has been chosen; the alternative in this collection is some of the words. Queries comprise a list of terms—words to be sought. In a general Boolean query, terms are combined using the connectives AND, OR, and NOT. AND is the most common connective, and that is how the option all of the words is interpreted in Figure 3.11. A query such as digital AND library might be used to retrieve books on the same subject as this one. Both terms (or equivalent lexical variants) must occur somewhere in every answer. They need not be adjacent, nor in any particular order. Documents containing phrases like library management in the digital age will be returned, as will documents containing the text a software library for digital signal processing—perhaps not quite what is sought, but nonetheless a correct answer to the Boolean query. A document in which the word library appeared in the first paragraph and digital in the last also would be considered a correct result.

Retrieval systems inevitably return many irrelevant answers, which users must filter out manually. Users have to make a difficult choice between casting a broad query, to be sure of retrieving all relevant material (albeit diluted with many irrelevant answers), and a making a narrow query where most retrieved documents are of interest but others slip through the net because the query is too restrictive. A broad search that identifies nearly all the relevant documents is said to have high recall, while one in which nearly all retrieved documents are relevant is said to have high precision. Searchers must choose between high precision and high recall and formulate a query appropriately. In Web searches, for example, precision is generally preferred over recall. There is so much out there that you rarely want to find every relevant document (you probably couldn’t review them all anyway) and you certainly don’t want to have to check through a host of irrelevant ones.

Figure 3.11: Searching for a quotation: (a) query page; (b) query response

However, if you are counsel for the defense looking for precedents for a legal case, you probably care a great deal about recall— you want to be sure that you have examined every relevant precedent, because the last thing you want is for the prosecutor to spring a nasty surprise in court.

An enduring theme in information retrieval is the tension between recall and precision.

Another problem is that small variations can yield quite different results. You might think the query electronic AND document AND collection is similar to digital AND library, but they are likely to produce very different answers. To catch all desired documents, professional librarians become adept at adding extra terms, learning to pose queries such as (digital OR virtual OR electronic) AND (library OR (document AND collection)) where the parentheses indicate operation order.

Until around 1990, Boolean retrieval systems were the primary means of accessing information in commercial and scientific applications. However, Internet search engines have shown that they are not the only way a database can be queried.

Rather than seeking exact Boolean answers, it is often better simply to list words that are of interest and have the retrieval mechanism supply whatever documents seem most relevant. For example, to locate books on digital libraries, the list of terms digital, virtual, electronic, library, document, collection, large-scale, information, retrieval is, to a nonprofessional at least, probably a clearer encapsulation of the topic than the Boolean query cited earlier.

Yet, identifying relevant documents is not just a matter of converting a list of terms to a Boolean query. It would be fruitless to connect these particular terms with AND operators, since vanishingly few documents are likely to match. (We cannot say that no documents will match. This page certainly does.) It would be equally pointless to use OR connectives, since far too many documents will match and few are likely to be useful.

The solution is to use ranking. In ranking, an artificial measure is used to gauge the similarity of each document to the query, and a fixed number of the closest matching documents are returned as answers. If the measure is good, high precision is guaranteed if few documents are returned, and high recall is guaranteed if many are returned. In practice, there is a trade-off: low precision invariably accompanies high recall, since many irrelevant documents come to light before the last of the relevant ones appears in the ranking. Conversely, low recall accompanies high precision, because precision will be high only near the beginning of the ranked list, at which point only a few of the total set of relevant documents will have been encountered.

A simple ranking technique is to count the number of query terms that appear somewhere in the document: this is called coordinate matching. A document that contains five of the query terms will be ranked higher than one containing only three, and documents that match just one query term will be ranked lower still. Obviously, coordinate matching favors long documents over short ones, since by virtue of size alone long documents are more likely to contain a broader selection of the query terms. Furthermore, common terms in the query are treated just the same as highly specific ones—for example, the query the digital library might rank a document containing the digital age alongside or even ahead of one containing a digital library. Words such as the in the query should probably not be given the same importance as library. For this reason, most ranking techniques assign a weight to each term based on its frequency in the document collection, giving common terms low weight. They also compensate for the length of the document, so that long ones are not automatically favored.

It is difficult to describe ranking mechanisms in a few words. The simple interface in Figure 3.11 mentions only that answers should match some of the words, as the alternative to the all of the words that is shown in the figure (but does not give any indication as to how this is done). Experience indicates that readers are rarely confused by the responses they receive—they don’t even think about the ranking mechanism.

Professional information retrieval specialists like librarians want to understand exactly how their query will be interpreted and are prepared to issue complex queries provided that the semantics are clear. For most tasks they prefer precisely targeted Boolean queries, which are especially appropriate if metadata is being searched, and particularly if professional catalogers have entered it, because the terms that are used are relatively unambiguous and predictable. Casual users, on the other hand, may not be concerned about how queries are interpreted; they prefer to trust the system to do a good job and are prepared to scroll further down the ranked list if they want to expend more effort on their search. They like ranked queries, which are especially appropriate if full text is being searched, because term usage in full text is relatively unpredictable.

Case-folding and stemming

Two common operations in querying are case-folding and stemming. Figure 3.12 shows a Preferences page (which allows users to choose query options), and case-folding and stemming have been selected (near the bottom). The interface supplies three- or four-word descriptions of these operations so that their meaning can be grasped immediately by casual users.

Case-folding replaces all uppercase characters in the query with their lowercase equivalents, treating uppercase versions of words as equivalent to lowercase words (in our example queries, case-folding treats Digital and DIGITAL and Library and LIBRARY as equivalent to digital and library). Users seeking an exact match need to disable case-folding. For example, a user looking for documents containing Digital AND Equipment AND Corporation (a now-defunct computer manufacturer) would disable case-folding in order to avoid being flooded with answers about corporation policy on capital equipment for digital libraries. Thus, users must be able to specify whether case-folding applies.

Stemming reduces words by stripping off suffixes, converting them to neutral stems that are devoid of tense, number, and—in some languages—case and gender information. This relaxes the match between query terms and words in the documents so that, for example, libraries is deemed equivalent to library. Like case-folding, stemming is not appropriate for all queries, particularly those involving names and other very specific words.

Figure 3.12: Choosing search preferences

The addition of suffixes is governed by linguistic rules. Converting the y in library to the ies in libraries is one example; another is the doubling of a final consonant, as when stem is augmented to stemming. Reducing a word to its root form, or morphological reduction, requires language dictionaries that include information about which rules apply to which words. However, for the purposes of information retrieval, simpler solutions suffice. All that is necessary is for different variants of a word to be reduced to an unambiguous common form—the stem. The stem does not have to be the same as the morphological root form. The desired effect will be obtained so long as all variants reduce to the same stem, and no other words do. Also, only certain kinds of suffixes need be considered. Linguists use the word inflectional to describe suffixes that mark a word for grammatical categories. In contrast, derivational suffixes modify the major category of a word—for example, when -less is appended to words like hope or cloud it converts a noun to an adjective. Derivational suffixes should not be stripped because they alter the meaning—radically, in this example.

Stemming is language-dependent: an algorithm for English will not work well on French and vice versa. Indeed, the concept of stemming differs widely from one language to another. Many languages use prefixes as well as suffixes to indicate derivational forms; others contain complex compound words that should be split into constituent parts before being entered into a query. Case-folding, too, is not relevant to certain languages, such as Chinese and Arabic.

Furthermore, stemming and case-folding complicate the highlighting of search terms in retrieved documents. Simply finding the stem and highlighting all words that contain it could highlight the wrong word. For example, a particular system might make library and libraries match when searching by stemming them both to librar, but deliberately avoid stemming librarian because its derivational suffix changes the meaning. However, a simple textual match with the stem will result in the word librarian being incorrectly highlighted in the search results. And a simplistic method may fail to highlight correct terms—a different algorithm might stem libraries to the root form library when searching, but fail to highlight the retrieved search term.

The correct procedure is to stem each word in the retrieved document using the same algorithm and then to assess if the result matches the stemmed query word. However, this procedure can be prohibitively time-consuming. A more practical alternative is to record the stemmed form of each word and to expand the query by adding all unstemmed variants prior to the highlighting operation.

Phrase searching

Users often want to search for contiguous groups of words. Indeed, most of our examples—digital library, Digital Equipment Corporation—would be better posed as phrase searches. Phrases are generally indicated in a query by using quotation marks.

Although phrase searching is simple and natural for users, it is actually quite complex to implement. Users might think the computer looks through all the documents just as they would, but faster—a lot faster. Because the computer can find the individual words in all the documents, it seems natural to conclude that it can just as easily check to determine if they come together as a phrase. However, a computer search doesn’t scan through the text: that would take too long. (Computers are not all that fast, and for documents stored on disk, access is by no means instantaneous.)

Instead, computers first create an index that records, for each word, the documents that contain that word. Every word in the query is looked up in the index, and the computer compiles a list of retrieved document numbers for each word. Then the query is answered by manipulating the lists—in the case of a Boolean AND query, by checking which documents are in all the lists. (The process is described more in Section 4.1.)

Phrase searching changes everything. No longer can queries be answered simply by manipulating lists of document numbers. Instead, there are two options. Having treated the query as a set of individual words, one option is to look inside the documents themselves: checking through all documents that contain the query terms to see if they occur together as a phrase. This is a postretrieval scan. The other option is to record word numbers as well as document numbers in the index—the position of each occurrence of the word in the document as well as the documents it occurs in. We refer to this as a word-level index. Then, when each word in the query is looked up in the index, the computer can tell from the list of word positions if the words occur together in a phrase, because words in a phrase will be numbered consecutively.

How phrase queries are implemented greatly affects the resources required by the system. A postre-trieval scan can take a great deal of time because—depending on how common the terms in the phrase query are—many documents might have to be examined. Phrases containing unusual words can be processed quickly: few documents will include them and therefore few will need to be scanned for the occurrence of the phrase. But phrases containing common words, such as digital library, will require substantial computer time, and response will be slow.

With a word-level index, the exact position of each occurrence of every word is recorded, instead of just the documents in which it occurs. This makes the index significantly larger. Not only are word numbers larger than document numbers, and hence require more storage space, but any given word probably appears many times in each document, and every occurrence must be recorded.

Which mechanism is employed also affects how phrases can be used in queries. For example, people often want to specify proximity: the query terms must appear within so many words of each other, but not necessarily contiguously in a phrase. If word numbers are stored, responding to a proximity query is just a matter of checking that the positions do not differ by more than the specified amount. If phrase scanning is employed, proximity searching is far more difficult.

Users sometimes seek phrases that include punctuation and even white space. Most word-level indexes treat the documents as sequences of words, ignoring punctuation and spacing. Distinguishing between differently punctuated versions of a phrase requires a postretrieval scan even if a word-level index is available—unless the index also includes word separators.

Phrases complicate ranked searching. Here the frequency of a query term throughout the corpus is used to measure how influential that word should be in determining the ranking of each document— common words like the are less influential than rare ones like aspidistra. However, if the query contains a phrase, it should really be the frequency of the phrase, not the frequency of the individual words in it, that is used for ranking. For example, the English words civil, rights, and movement are used in many contexts, but the phrase civil rights movement has a specific meaning. The importance of this phrase relative to other words in a query should be judged according to the frequency of the phrase, not the constituent words.

Including phrases in queries complicates things technically. In practice, building digital libraries involves pragmatic decisions. Word-level indexes are recommended if phrase searching is likely to be common and if space is not a significant constraint. Simple systems use a postretrieval scan, which suffices when phrase searching is rare or if phrases contain punctuation. In either case, ranking is based on individual word frequencies, not phrase frequencies, for practical reasons.

How can we communicate these decisions to users? As we have seen, it is difficult to fully understand what is happening in phrase searching. An alternative is a more advanced search technique, like the one described in Section 3.6, which provides a natural interface whose workings are easy to grasp.

Searching (Digital Library) Part 1

Types of queries

Case-folding and stemming

Phrase searching

Related Links

:: Search WWH ::