Database Reference
In-Depth Information
q
-grams we can count the minimum number of
common
q
-grams that
two strings need to have to be within
θ
edit operations.
An inverse argument is based on the number of
q
-grams that the
strings
do not have in common
(i.e., the mismatching
q
-grams). The
question now becomes what is the minimum number of edit operations
needed to cause the observed mismatching
q
-grams, which is equal to
the number of edit operations needed to fix these
q
-grams. This num-
ber is at least as large as the number of edit operations needed to
simply
destroy
all mismatching
q
-grams. For example, consider the two
strings
s
= 'One Laptop' and
r
= '1 Laptop'. The positional 2-gram
sets are
1
2
3
4
5
6
7
8
9
'On'
'ne'
'e '
' L'
'La'
'ap'
'pt'
'to'
'op'
'1 ' ' L' 'La' 'ap' 'pt' 'to' 'op'
Assuming an edit distance threshold
θ
= 3, the mismatching posi-
tional 2-grams belonging to
s
are ('On', 1), ('ne', 2), ('e', 3) (based on
Lemma 8.2). Notice that a replacement of 'n' with '1' will affect two
2-grams at the same time. Hence, only two edit operations are needed
in this case to destroy all three 2-grams, one for 2-grams ('On', 1), ('ne',
2) and one for 2-gram ('e ', 3).
Reasoning about mismatching
q
-grams leads to the following
statement:
Lemma 8.4(Mismatch Filtering).
Consider strings
s, r
and the set
of positional mismatching
q
-grams
Q
s→r
=
},λ
i
∈
s
. Let
φ
be the minimum number of edit operations needed to destroy
all
q
-grams in
Q
s→r
. Then,
(
λ
1
,p
1
)
,...,
(
λ
s
,p
s
)
{
E
(
s, r
)
≥
φ
.
Proof.
If we need exactly
φ
operations to simply destroy all mismatch-
ing
q
-grams, we need at least
φ
operations to correct the same
q
-grams,
in the best case scenario.
We call the minimum number of edit operations needed to destroy
a set of
q
-grams
Q
,
the minimum destroy operations
for
Q
. We can