Database Reference
In-Depth Information
provide answers at interactive speeds. Previous experience has shown
that for most complex problems there is almost never a one size fits all
solution. Given the importance of strings in a wide array of applica-
tions, it is safe to assume that different application domains will benefit
from specialized solutions.
There are four fundamental primitives that characterize an indexing
solution for approximate string matching:
The similarity function: As already discussed, there are two
types of similarity functions for strings, set based and edit
based.
String tokenization: Tokenization is the process of decompos-
ing a string into a set of primitive components, called tokens.
For example, in a particular application a primitive compo-
nent might refer to a word, while in some other application a
primitive component might refer to a whole sentence. There
are two fundamental tokenization schemes, overlapping and
non-overlapping tokenization.
The query type: There are two fundamental query types,
selections and joins. Selection queries retrieve strings sim-
ilar to a given query string. Join queries retrieve all simi-
lar pairs of strings between two sets of strings. There are
also two flavors of selection and join queries, all-match and
top- k queries. All-match queries retrieve all strings (or pairs
of strings) within a user specified similarity threshold. Top- k
queries retrieve the k most similar strings (or pairs of strings).
The underlying index structure: There are two fundamental
indexing schemes, inverted indexes and trees. An inverted
index consists of a set of lists, one list per token in the token
universe produced by the tokenization scheme. A tree orga-
nizes strings into a hierarchical structure specifically designed
to answer particular queries.
Every approximate string indexing technique falls within the space
of the above parametrization. Different parameters can be used to
solve a variety of problems, and the right choice of parameters — or
combination thereof — is dependent only on the application at hand.
Search WWH ::




Custom Search