Database Reference
In-Depth Information
edit distance between sets of tokens (rather than sequences of char-
acters). In this scenario, the goal is to compute the sum of weighted
primitive operations needed to convert one set of tokens into another.
A primitive operation can be a deletion, an insertion, or a replacement
of a token. The weight of each operation might depend on the actual
weight of the token being inserted or deleted, or the weight and the
actual edit distance between the token being replaced and the token
replacing it. For example consider the two strings 'Feed the Children'
and 'Food Child'. From a practical viewpoint, the two strings are very
similar. From a set based similarity perspective the two strings do not
have any words in common (if stemming is not performed). From an
edit distance perspective the edit distance between the two strings is
very large. A combination distance would consider inserting the popu-
lar word 'the' with a very small cost (e.g., proportional to the idf weight
of the word), and replacing words 'Food' and 'Child', that are within
edit distance 2 and 3 from the respective 'Feed' and 'Children' using a
cost proportional to the actual edit distance.
2.3 Related Work
The edit distance concept was originally introduced by Levenshtein [47].
Wagner and Fischer [70] introduced the first algorithm for computing
edit distance with time and space complexity of O (
and
|r| are the lengths of the two strings. Cormen et al. [25] presented a
space-ecient algorithm with space complexity O (max
|
s
||
r
|
), where
|
s
|
). The
fastest edit distance algorithm, proposed by Masek and Patterson [52],
requires O (
{|
s
|
,
|
r
|}
2 / log
|
s
|
|
s
|
) time using a split-and-merge strategy to parti-
tion the problem. The edit distance verification algorithm was proposed
by Ukkonen [66]. There is a very rich literature of algorithms for com-
puting variations of the basic edit distance measure. One of the first
dynamic programming algorithms for aligning biological sequences was
proposed by Needleman and Wunsch [55]. A variation of this algo-
rithm that uses a substitution matrix and gap-scoring was proposed by
Smith and Waterman [61]. Edit distance with block moves was pro-
posed by Tichy [65]. Edit distance with reversals was discussed by
Kececioglu and Sankoff [41]. Edit distance allowing swaps has been
Search WWH ::




Custom Search