Introduction - Approximate String Processing

Database Reference

In-Depth Information

This work explains in detail the available choices for each primitive, in

an effort to delineate the application space related to every choice.

For example, consider a relevant document retrieval application that

uses cosine similarity and token frequency/inverse document frequency

weights 1 to retrieve the most relevant documents to a keyword query.

The application uses a set based similarity function, implying a word-

based, non-overlapping tokenization for keyword identification, a clear

focus on selection queries, and most probably an underlying inverted

index on keywords. Notice that this particular application is not related

to approximate matching of keywords. A misspelled keyword, either

in the query or the documents, will miss relevant answers. Clearly,

to support approximate matching of keywords, a relevant document

retrieval engine will have to use a combination of primitives.

As another example, consider an application that produces query

completion suggestions interactively, as the user is typing a query in

a text box. Usually, query completion is based on the most popular

queries present in the query logs. A simple way to enable query sugges-

tions based on approximate matching of keywords as the user is typing

(in order to account for spelling mistakes) is to use edit distance to

match what the user has typed so far as an approximate substring of

any string in the query logs. This application setting implies an edit

based similarity, possibly overlapping tokenization for enabling identi-

fication of errors on a per keyword level, focus on selection queries, and

either an inverted index structure built on string signatures tailored for

edit distance, or specialized trie structures.

The monograph is organized into eight sections. In the first four

sections we discuss in detail the fundamental primitives that charac-

terize any approximate string matching indexing technique. Section 2

presents in detail some of the most widely used similarity functions

for strings. Section 3 discusses string tokenization schemes. Section 4

gives a formal definition of the four primitive query types on strings.

Finally, Section 5 discusses the two basic types of data structures used

to answer approximate string matching queries. The next three sections

are dedicated to specialized indexing techniques and algorithms for

1 Token frequency is also referred to as term frequency.

Search WWH ::

Custom Search

Home