Database Reference
In-Depth Information
q -grams we can count the minimum number of common q -grams that
two strings need to have to be within θ edit operations.
An inverse argument is based on the number of q -grams that the
strings do not have in common (i.e., the mismatching q -grams). The
question now becomes what is the minimum number of edit operations
needed to cause the observed mismatching q -grams, which is equal to
the number of edit operations needed to fix these q -grams. This num-
ber is at least as large as the number of edit operations needed to
simply destroy all mismatching q -grams. For example, consider the two
strings s = 'One Laptop' and r = '1 Laptop'. The positional 2-gram
sets are
1
2
3
4
5
6
7
8
9
'On'
'ne'
'e '
' L'
'La'
'ap'
'pt'
'to'
'op'
'1 ' ' L' 'La' 'ap' 'pt' 'to' 'op'
Assuming an edit distance threshold θ = 3, the mismatching posi-
tional 2-grams belonging to s are ('On', 1), ('ne', 2), ('e', 3) (based on
Lemma 8.2). Notice that a replacement of 'n' with '1' will affect two
2-grams at the same time. Hence, only two edit operations are needed
in this case to destroy all three 2-grams, one for 2-grams ('On', 1), ('ne',
2) and one for 2-gram ('e ', 3).
Reasoning about mismatching q -grams leads to the following
statement:
Lemma 8.4(Mismatch Filtering). Consider strings s, r and the set
of positional mismatching q -grams Q s→r =
},λ i
s . Let φ be the minimum number of edit operations needed to destroy
all q -grams in Q s→r . Then,
( λ 1 ,p 1 ) ,..., ( λ s ,p s )
{
E
( s, r )
φ .
Proof. If we need exactly φ operations to simply destroy all mismatch-
ing q -grams, we need at least φ operations to correct the same q -grams,
in the best case scenario.
We call the minimum number of edit operations needed to destroy
a set of q -grams Q , the minimum destroy operations for Q . We can
 
Search WWH ::




Custom Search