Searching (Digital Library) Part 2

Query interfaces

The search pages we have seen have minuscule query boxes, implicitly encouraging users to type just one or two terms. In reality, most queries contain only a few words. In fact, studies have shown that the most common number of terms in queries to Web search engines is—zero! People just hit the search button, or the Enter key, without typing anything, presumably by accident. The second most common number of search terms is one. Note that for single-term queries there is no difference between AND and OR, although in some systems the "all of the words" option returns the documents in some predetermined order—say by date—whereas "some of the words" implies ranking. For single-term queries, ranking returns documents in order of how often they contain the query term (normalized by document length). The third most common number of search terms is two; after that, query frequency falls off rapidly with query length.
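To make "normalized by document length" concrete, here is a minimal Python sketch of length-normalized term frequency for a single-term query. It is illustrative only; real engines use more elaborate scoring, but the principle is the same: a term that makes up a larger share of a short document outranks the same number of occurrences buried in a long one.

    def rank_single_term(term, documents):
        """Rank documents for a one-word query: occurrences of the term divided
        by the document's length in words, highest score first."""
        term = term.lower()
        scores = []
        for doc_id, words in documents.items():
            count = sum(1 for w in words if w.lower() == term)
            if count:
                scores.append((doc_id, count / len(words)))
        return sorted(scores, key=lambda pair: pair[1], reverse=True)

    # The term occurs once in each document, but the shorter document ranks higher.
    docs = {
        "d1": "the four seasons by vivaldi".split(),
        "d2": "a long essay that mentions the seasons only once in passing".split(),
    }
    print(rank_single_term("seasons", docs))   # d1 scores 1/5, d2 scores 1/11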

Modern search engines easily deal with large queries: indeed, large queries can often be processed more efficiently than smaller queries because they are more likely to contain rare words that restrict the scope of the search. Figure 3.13 shows a large query box into which users can paste paragraph-sized chunks of text—and it is scrollable to facilitate even larger queries.
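One reason long queries can be cheap is that conjunctive matching can begin with the rarest term's posting list, so the set of candidate documents stays small from the outset. The following sketch shows that strategy under simple assumptions (sets of document identifiers standing in for posting lists); it is a generic illustration, not necessarily how the system pictured in Figure 3.13 is implemented.

    def conjunctive_search(terms, postings):
        """Find documents containing every query term, rarest term first.

        `postings` maps each term to the set of documents that contain it.
        Starting with the smallest posting list keeps the candidate set small,
        which is why a long query containing rare words can be cheaper to
        evaluate than a short query made up entirely of common words."""
        if any(t not in postings for t in terms):
            return set()                     # a term that occurs nowhere rules out every document
        ordered = sorted(set(terms), key=lambda t: len(postings[t]))
        result = set(postings[ordered[0]])
        for term in ordered[1:]:
            result &= postings[term]
            if not result:
                break                        # no point examining further lists
        return result

    postings = {"the": {1, 2, 3, 4, 5}, "four": {2, 4}, "seasons": {4, 5}, "vivaldi": {4}}
    print(conjunctive_search(["the", "four", "seasons", "vivaldi"], postings))   # {4}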



Figure 3.13: Large-query search interface

People often repeat searches, which is easy if they have access to their search history. "Those who ignore history," to adapt George Santayana’s famous dictum, "will be forced to retype it." New queries are often modifications of old ones—new terms are added if the query is too broad, to increase precision at the expense of recall, or terms are removed if the query is too restrictive, to increase recall. The interface in Figure 3.14 presents the last four queries issued by the user. The buttons on the left move the query into the search box, where it can be modified. For example, clicking on the button to the left of the first field will place the query "begin beginning" in the search box. The Preferences page in Figure 3.12 is used to select how much history to display.

What if users change search options, or even collections, between queries? The history display should make this explicit. Maybe users are experimenting with these options to test their effect on a query—Does the number of results change with stemming? Does one collection contain more potential answers than another? This is why details are given alongside the history item when such changes occur, as Figure 3.14 shows. Normally, these details don’t appear, because users rarely alter their query options. When the details do appear, they clarify the context within which the query history should be interpreted.

Particularly for academic study, searches on different fields often need to be combined. For example, a researcher might be seeking a book by a certain author with a particular word in the title or a particular phrase in the main text. Library catalog systems have a search form that supplies several fields into which specifications can be entered, like the one in Figure 3.15a.


Figure 3.14: Query with history

Users type each part of the query into a box and use the menu to the right of the box to select the field. Finally, they decide whether the documents should satisfy some or all of these conditions. If necessary, users can go to the Preferences page to request more boxes (Figure 3.12).

More complex fielded searches can be undertaken using the form in Figure 3.15b. Again, specifications are placed in the entry boxes and a field is selected for each one. Case-folding and stemming can be set for each individual field. The selection boxes that precede the fields allow the Boolean operators AND, OR, and AND NOT. This form cannot be used to specify completely general Boolean queries because users cannot control the order of operation—there is no place to insert parentheses. Instead, the specifications are processed in sequence from top to bottom, each operator combining the result accumulated so far with the matches for the next field.
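A rough sketch of this top-to-bottom evaluation follows, assuming left-to-right grouping with no operator precedence; the row contents and document identifiers are invented for illustration and are not taken from the actual system.

    from dataclasses import dataclass

    @dataclass
    class Row:
        operator: str      # "AND", "OR", or "AND NOT"; ignored for the first row
        matches: set       # documents matching this row's field specification

    def evaluate(rows):
        """Combine the rows strictly top to bottom, with no parentheses, so
        a OR b AND NOT c means ((a OR b) AND NOT c)."""
        result = set(rows[0].matches)
        for row in rows[1:]:
            if row.operator == "AND":
                result &= row.matches
            elif row.operator == "OR":
                result |= row.matches
            elif row.operator == "AND NOT":
                result -= row.matches
        return result

    # Title contains "seasons", OR creator is "vivaldi", AND NOT text mentions "winter"
    rows = [Row("", {1, 2, 3}), Row("OR", {3, 4}), Row("AND NOT", {2, 4})]
    print(evaluate(rows))      # {1, 3}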

The line near the top, just under the navigation bar, allows users to decide whether the results should be sorted into ranked or natural order (the latter is the order in which documents appear in the collection, usually by date). Users can also limit the search to certain date ranges and select which date field to use, because in some collections more than one date may be associated with each document, with different dates corresponding to different kinds of metadata. For example, in a historical collection there can be a big difference between the period a document covers and the date it was written.

This advanced interface is intended for expert users. However, experts often find it frustrating to have to fill out forms: they prefer to type textual queries rather than to click between fields. Behind the form-based interface is an underlying query language, and users may prefer to use this for query entry.

Searching multimedia

The oral history collection in Figure 3.6 is a searchable multimedia collection based on textual metadata painstakingly entered by hand. An intriguing alternative is to base retrieval directly on the multimedia by analyzing the content itself. For example, optical character recognition (OCR) and automatic speech recognition (ASR) turn digitized textual images and spoken audio into text. In principle, these technologies allow source content to be fed directly into full-text indexing engines and to be retrieved in just the same way as text. However, there are some caveats, because such systems are error prone.

Media like photographs and music, which have no textual representation, present even greater challenges than textual images and spoken audio. Here are two examples.

Searching music

In a digital music library (Figure 3.7) multiple representations of music can be derived and presented. Search might be based on textual metadata, making each item retrievable by title, composer, the year it was written, who performed a particular version, etc.

But text-based search does not always map well to what users can express, particularly when the underlying form is non-textual. To locate Vivaldi’s The Four Seasons, should you search by overall title or for one of the individual parts—spring, summer, autumn and winter? In actual fact, one of the authors tried this as a student many years ago.


Figure 3.15: Form search: (a) simple; (b) advanced

Entering the title query "four seasons" (with quotes) into the university library catalog returned nothing; removing the quotes produced a deluge of irrelevant matches—Vivaldi’s work certainly did not appear in the first three pages. Much time was wasted varying the composition of the query (keyword, Boolean, adding the names of seasons, etc.), to no avail. Eventually, he resorted to searching for Vivaldi and painstakingly working through several hundred matching items. When the sought-after work was finally located, it turned out to have been cataloged under its Italian name (Le quattro stagioni: La primavera, L'estate, L'autunno, L'inverno) and could never have been found by a title-based search formulated in English. If only it had been possible to sing a few bars of one of the themes and use that for searching!

Figure 3.16 illustrates a digital library of popular MIDI tunes where searching is based upon what has been called "query by humming." In Figure 3.16a a virtual piano keyboard is being used to tap out notes to form a symbolic query. An alternative is raw audio input, illustrated in Figure 3.16b, where the user presses a record button and starts to sing or hum a query—in this case, the first ten notes of Fields of Gold by Sting. Then the user presses search, which after a short delay brings up the view shown in Figure 3.16c.

During this time, the system analyzes the audio signal, segments it into individual notes and determines the pitch of each one. Then the collection is searched for matching songs. But musical themes recur in modified forms, and a user’s recollection will not necessarily match the version in the collection, so the machine is programmed to seek approximate matches. The amount by which the notes in the query have to change to match one of the songs in the collection (technically known as the edit distance) is used to order the results. Exact matches have an edit distance of zero and appear at the top of the list.
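For concreteness, here is a generic edit-distance sketch over pitch sequences (for example, MIDI note numbers). A real matcher is more sophisticated; many systems compare pitch intervals rather than absolute pitches so that the key the user happens to sing in does not matter, and they match against themes or phrases rather than whole songs. The ordering principle, however, is the same: the fewer changes needed, the higher the song appears in the list.

    def edit_distance(query, song):
        """Dynamic-programming edit distance between two pitch sequences:
        insertions, deletions, and substitutions each cost 1."""
        m, n = len(query), len(song)
        dist = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            dist[i][0] = i
        for j in range(n + 1):
            dist[0][j] = j
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                cost = 0 if query[i - 1] == song[j - 1] else 1
                dist[i][j] = min(dist[i - 1][j] + 1,        # drop a query note
                                 dist[i][j - 1] + 1,        # insert a missing note
                                 dist[i - 1][j - 1] + cost) # match or substitute
        return dist[m][n]

    def rank_songs(query, songs):
        """Order the collection by edit distance; exact matches (distance 0) come first."""
        return sorted(songs, key=lambda title: edit_distance(query, songs[title]))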

The result of converting the raw audio into symbolic notes is displayed in Figure 3.16c. Transcription errors may occur, depending on how clearly and accurately the user has sung, and the result is shown in music notation for the user to check. Alternatively, the query notes can be synthesized and played back. Below the query are the matching documents, with Fields of Gold at the top. Clicking the adjacent icon launches a MIDI player to play the song (Figure 3.16d).

Searching images

Figure 3.17 shows a system that supports text and image content queries, both individually and jointly, and illustrates the state of the art in automatic image content analysis. The user enters the query "flower", which brings up a grid of matching thumbnails (there are 16 pages of thumbnails in all). The second thumbnail shows a carnation with a butterfly on its stem, against a defocused background of foliage. The user selects it and switches to the Content Viewer tab, which shows the larger version in Figure 3.17b, including some metadata—width and height, filename and textual keywords (in this case, buds flora flower grass hawkweed insect insects moth nature on range). Part of the metadata (although not visible here) is a time-stamp for each image, and a temporal view appears along the bottom of the window. The current image is in the center, flanked by its neighbors in time order, shrinking into the distance in both directions.

The user returns to the first view and right-clicks the image. This brings up a menu with options that include augmenting the search with the image itself. Selecting this option and pressing the search button produces the new result set shown in Figure 3.17c.


Figure 3.16: A musical digital library: (a) symbolic audio query; (b) singing a query;


Figure 3.16, cont’d: (c) result set showing the transcribed query; (d) playing the top match, Fields of Gold

The sliders to the left of the screen in Figures 3.17a and c each represent an aspect of similarity. There are nine in all, including one for text (the last five are obtained by scrolling). They control how much influence each aspect has on the search results. For example, setting the text slider to 0% and the others to 100% restricts the influence to features derived from the image content. (In this case, the same effect could be achieved by clearing the text query box.) The slider labels are cryptic—Gabor 2-4, Convolution 2, etc.—because the system is still an experimental research tool and these are technical names for the underlying algorithms.


Figure 3.17: Searching image content: (a) text query based on metadata; (b) viewing an image;


Figure 3.17, cont’d: (c) augmenting the query through content analysis

More conventional labels would have to be found for a general audience.
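In effect, the sliders are weights in a linear combination of per-feature similarity scores, and the percentages reported in the pop-up described next are each feature's share of the combined score. Here is a minimal sketch under that interpretation, with feature names invented for illustration rather than taken from the system.

    def combined_score(similarities, weights):
        """Weighted combination of per-feature similarity scores.

        `similarities` maps feature name -> similarity in [0, 1] between the query
        and a candidate image; `weights` maps feature name -> slider setting in
        [0, 1]. Returns the overall score and each feature's percentage share."""
        weighted = {f: weights.get(f, 0.0) * s for f, s in similarities.items()}
        total = sum(weighted.values())
        shares = {f: 100.0 * w / total for f, w in weighted.items()} if total else {}
        return total, shares

    # Setting the text slider to 0% removes textual metadata from the decision entirely.
    score, shares = combined_score(
        {"text": 0.9, "colour_focus": 0.8, "uniformity": 0.6, "variance": 0.55},
        {"text": 0.0, "colour_focus": 1.0, "uniformity": 1.0, "variance": 1.0})
    print(round(score, 2), {f: round(p) for f, p in shares.items()})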

Returning to Figure 3.17c, combining the text query "flower" with images similar to the chosen one replaces the matching documents with ones whose hue is more reddish (although this is not apparent in the printed version of Figure 3.17c). The user’s mouse is hovering over the fourth image in the central row, and details pop up explaining which features were responsible for this item’s appearance so early in the result set. A feature called HSV-L-Colour Focus (HSV stands for hue, saturation and value) is dominant (at 24%); it is followed by features that gauge uniformity (19%), variance (18%), and convolution (11%).
