Database Reference
In-Depth Information
Proof. To prove the claim, notice that s consists of |s|−q +1 q -grams
and r of
|
r
|−
q +1 q -grams. The maximum number of q -grams is x =
max(
q + 1. Each edit operation can affect at most qq -grams,
thus θ edit operations can affect at most qθ q -grams. Hence, if E ( s, r )
θ , then with θ edit operations we can obtain s from r (or r from s )if
and only if s and r have at least x
|
s
|
,
|
r
|
)
qθ q -grams in common.
The intuition is that strings within a small edit distance must have
a large number of q -grams in common. Imagine that string s can be
obtained from string r with a replacement of a single character. Then,
intuitively, in the worst case s and r differ in at most qq -grams, since
one edit operation can affect at most qq -grams.
By using Lemma 8.1 an edit distance constraint is converted
into a simpler Intersection similarity constraint on q -grams. Thus, all
algorithms described in Section 6, suitable for computing Weighted
Intersection similarity with unit token weights, can now be applied to
evaluate edit distance (notice that the actual token weighing function
W is irrelevant when evaluating edit distance).
Given that strings with lengths shorter than q result in empty q -
gram sets, as already discussed, for implementation purposes we can
extend all strings with q
1 copies of a special beginning and ending
character # /
Σ to avoid special cases. The intersection constraint of
Lemma 8.1 now becomes
E ( s, r ) ≤ θ ⇒I ( s, r ) max( |s|,|r| )+ q − 1 − qθ,
given that the total number of q -grams per string increases to
|s| +
2( q
q +1.
For brevity, in the rest we refer to the q -gram intersection thresh-
old as
1)
τ = max(
|
s
|
,
|
r
|
)
q +1
(or τ = max(
|
s
|
,
|
r
|
)+ q
1
depending on whether extended
q -gram sets are used).
Notice that Lemma 8.1 results in a very loose intersection constraint
that does not take into account the positions of matching q -grams
within the strings. Thus, a variety of optimizations, based on q -gram
Search WWH ::




Custom Search