String Similarity Functions - Approximate String Processing

Database Reference

In-Depth Information

studied by Amir et al. [2] and Lee et al. [42]. More recently, Cormode

and Muthukrishnan [26] introduced a variation of edit distance allow-

ing sub-strings to be moved with smaller cost. Chaudhuri et al. [18]

introduced a weighted edit distance for sets of tokens, based on the

idf weights of tokens. Approximating edit distance has also been well

studied in the literature of computer algorithms. Bar-Yossef et al. [9]

showed that an O ( n 3 / 7 ) approximation of edit distance can be obtained

in linear time. Ostrovsky and Rabani [56] proved that edit distance can

be probabilistically well preserved by Hamming distance after project-

ing the strings onto randomly selected subspaces. Andoni and Onak [3]

extended this idea by reducing both the space and time complexities.

The problem of approximate string matching using set based similar-

ity functions has received wide attention from the information retrieval

and core database communities. A detailed analysis and history of set

based similarity in information retrieval was conducted by Baeza-Yates

and Ribeiro-Neto [8].

Search WWH ::

Custom Search

Home