Database Reference
In-Depth Information
similarity functions (e.g., Edit Distance, Hamming) evaluate the simi-
larity of strings as a function of the total number of edit operations that
are necessary to convert one string into the other. Edit operations can
be insertions, deletions, replacements, and transpositions of characters
or tokens.
Approximate text processing has two flavors, online and oine. In
the online version, the query can be pre-processed but the text can-
not, and the query is answered without using an index. A survey on
existing work for this problem was conducted by Navarro [54]. In the
oine version of the problem the text is pre-processed and the query
is answered using an index. A review of existing work for this problem
was conducted by Chan et al. [16].
Here, we focus on a special case of the fundamental approximate
text processing problem:
Definition 1.2 (Approximate String Matching). Given a set of
strings S and a query string v , one desires to identify all strings s ∈ S
that have a user specified degree of similarity to v .
The approximate string matching problem (which is also referred to as
the approximate dictionary matching problem in related literature) is
inherently simpler than the text matching problem, since the former
relates to retrieving strings that are similar to the query as a whole,
while the latter relates to retrieving strings that contain a substring that
is similar to the query. Clearly, a solution for the text matching problem
will yield a solution for the string matching problem. Nevertheless,
due to the simpler nature of approximate string matching, there is
a variety of specialized algorithms for solving the problem that are
faster, simpler, and with smaller space requirements than well-known
solutions for text matching. The purpose of this work is to provide an
overview of concepts, techniques and algorithms related specifically to
the approximate string matching problem.
To date, the field of approximate string matching has been devel-
oping at a very fast pace. There now exists a gamut of specialized data
structures and algorithms for a variety of string similarity functions
and application domains that can scale to millions of strings and can
 
Search WWH ::




Custom Search