Introduction - Approximate String Processing

Database Reference

In-Depth Information

similarity functions (e.g., Edit Distance, Hamming) evaluate the simi-

larity of strings as a function of the total number of edit operations that

are necessary to convert one string into the other. Edit operations can

be insertions, deletions, replacements, and transpositions of characters

or tokens.

Approximate text processing has two flavors, online and oine. In

the online version, the query can be pre-processed but the text can-

not, and the query is answered without using an index. A survey on

existing work for this problem was conducted by Navarro [54]. In the

oine version of the problem the text is pre-processed and the query

is answered using an index. A review of existing work for this problem

was conducted by Chan et al. [16].

Here, we focus on a special case of the fundamental approximate

text processing problem:

Definition 1.2 (Approximate String Matching). Given a set of

strings S and a query string v , one desires to identify all strings s ∈ S

that have a user specified degree of similarity to v .

The approximate string matching problem (which is also referred to as

the approximate dictionary matching problem in related literature) is

inherently simpler than the text matching problem, since the former

relates to retrieving strings that are similar to the query as a whole,

while the latter relates to retrieving strings that contain a substring that

is similar to the query. Clearly, a solution for the text matching problem

will yield a solution for the string matching problem. Nevertheless,

due to the simpler nature of approximate string matching, there is

a variety of specialized algorithms for solving the problem that are

faster, simpler, and with smaller space requirements than well-known

solutions for text matching. The purpose of this work is to provide an

overview of concepts, techniques and algorithms related specifically to

the approximate string matching problem.

To date, the field of approximate string matching has been devel-

oping at a very fast pace. There now exists a gamut of specialized data

structures and algorithms for a variety of string similarity functions

and application domains that can scale to millions of strings and can

Search WWH ::

Custom Search

Home