Database Reference
In-Depth Information
EXAMPLE 3.14 The edit distance between the strings x = abcde and y = acfdeg is 3. To
convert x to y :
(1) Delete b .
(2) Insert f after c .
(3) Insert g after e .
No sequence of fewer than three insertions and/or deletions will convert x to y . Thus, d ( x,
y ) = 3.
Another way to define and calculate the edit distance d ( x, y ) is to compute a longest com-
mon subsequence (LCS) of x and y . An LCS of x and y is a string that is constructed by
deleting positions from x and y , and that is as long as any string that can be constructed that
way. The edit distance d ( x, y ) can be calculated as the length of x plus the length of y minus
twice the length of their LCS.
EXAMPLE 3.15 The strings x = abcde and y = acfdeg from Example 3.14 have a unique
LCS, which is acde . We can be sure it is the longest possible, because it contains every
symbol appearing in both x and y . Fortunately, these common symbols appear in the same
order in both strings, so we are able to use them all in an LCS. Note that the length of x is
5, the length of y is 6, and the length of their LCS is 4. The edit distance is thus 5 + 6 − 2 ×
4 = 3, which agrees with the direct calculation in Example 3.14 .
For another example, consider x = aba and y = bab . Their edit distance is 2. For ex-
ample, we can convert x to y by deleting the first a and then inserting b at the end. There
are two LCS's: ab and ba . Each can be obtained by deleting one symbol from each string.
As must be the case for multiple LCS's of the same pair of strings, both LCS's have the
same length. Therefore, we may compute the edit distance as 3 + 3 − 2 × 2 = 2.
Edit distance is a distance measure. Surely no edit distance can be negative, and only
two identical strings have an edit distance of 0. To see that edit distance is symmetric, note
that a sequence of insertions and deletions can be reversed, with each insertion becoming a
deletion, and vice-versa. The triangle inequality is also straightforward. One way to turn a
string s into a string t is to turn s into some string u and then turn u into t . Thus, the number
of edits made going from s to u , plus the number of edits made going from u to t cannot be
less than the smallest number of edits that will turn s into t .
Non-Euclidean Spaces
Notice that several of the distance measures introduced in this section are not Euclidean spaces. A property of Euc-
lidean spaces that we shall find important when we take up clustering in Chapter 7 is that the average of points in a
Search WWH ::




Custom Search