Database Reference
In-Depth Information
This work explains in detail the available choices for each primitive, in
an effort to delineate the application space related to every choice.
For example, consider a relevant document retrieval application that
uses cosine similarity and token frequency/inverse document frequency
weights 1 to retrieve the most relevant documents to a keyword query.
The application uses a set based similarity function, implying a word-
based, non-overlapping tokenization for keyword identification, a clear
focus on selection queries, and most probably an underlying inverted
index on keywords. Notice that this particular application is not related
to approximate matching of keywords. A misspelled keyword, either
in the query or the documents, will miss relevant answers. Clearly,
to support approximate matching of keywords, a relevant document
retrieval engine will have to use a combination of primitives.
As another example, consider an application that produces query
completion suggestions interactively, as the user is typing a query in
a text box. Usually, query completion is based on the most popular
queries present in the query logs. A simple way to enable query sugges-
tions based on approximate matching of keywords as the user is typing
(in order to account for spelling mistakes) is to use edit distance to
match what the user has typed so far as an approximate substring of
any string in the query logs. This application setting implies an edit
based similarity, possibly overlapping tokenization for enabling identi-
fication of errors on a per keyword level, focus on selection queries, and
either an inverted index structure built on string signatures tailored for
edit distance, or specialized trie structures.
The monograph is organized into eight sections. In the first four
sections we discuss in detail the fundamental primitives that charac-
terize any approximate string matching indexing technique. Section 2
presents in detail some of the most widely used similarity functions
for strings. Section 3 discusses string tokenization schemes. Section 4
gives a formal definition of the four primitive query types on strings.
Finally, Section 5 discusses the two basic types of data structures used
to answer approximate string matching queries. The next three sections
are dedicated to specialized indexing techniques and algorithms for
1 Token frequency is also referred to as term frequency.
Search WWH ::




Custom Search