Database Reference
In-Depth Information
that takes into account the actual length of the strings, since it is more
meaningful to allow larger edit distance thresholds for longer strings.
Another extension is to compute a weighted edit distance , where each
edit operation has its own weight. The weights can also depend on the
particular character being inserted, deleted, or substituted. One can
also compute a normalized version of weighted edit distance, which is
defined as the minimum fraction W ( P ) /
, where P is a sequence of
edit operations, W ( P ) is the total weight of P , and
|
P
|
|
P
|
is the length
of P . More involved variations of edit distance allow whole substrings
to be replaced or moved with a unit cost as a primitive operation. A
similar idea is to allow the introduction of gaps with a unit cost depen-
dent on the length of the gap, to enable alignment of strings. Finally,
Hamming distance can be considered a special case of edit distance,
where only replacement operations are allowed.
Each variation of edit distance is designed with a specific applica-
tion in mind. For example, computing similarity of biological sequences,
where mutations are more likely to appear as a variation of several con-
secutive, rather than individual, nucleotides, necessitates the introduc-
tion of primitive edit operations on tokens. Similarly, performing local
alignment of DNA sequences where there might exist long sequences
of low complexity or noise introduced by evolutionary mutation that
should be ignored, necessitates the introduction of gaps as a primitive
operation.
Edit distance has been shown to work very well for a variety of
applications. Nevertheless, it does have some drawbacks that make it
insucient in certain scenarios. First, edit distance does not inher-
ently discount mismatching substrings that might not be very impor-
tant from a practical perspective, in some data domains. For example,
consider very common words in the English language, like 'the'; the edit
distance between 'Feed the Children' and 'Feed Children' is four, even
though from a practical viewpoint the two strings are the same. The
solution here is to use customized weighing schemes to alleviate these
problems, but in this case the computation of edit distance becomes
more expensive (ecient approximate solutions though do exist; for
example see BLAST [1]). A second inherent drawback of edit distance
is that, by construction, it is not amenable to word re-ordering. For
Search WWH ::




Custom Search