Database Reference
In-Depth Information
positions, can be applied over the existing inverted index based algo-
rithms. It holds that
s
=
{
(
λ
1
,p
1
)
,...,
(
λ
s
m
,p
s
m
)
}
r
=
Lemma 8.2.
Given
strings
and
(
λ
1
,p
1
)
,...,
(
λ
n
,p
n
)
{
}
, let the matching set of positional
q
-grams be
r
=
{
(
λ
i
,p
i
)
,
(
λ
j
,p
j
)
s
∩
}
. Then
λ
i
,λ
j
∈
p
i
p
j
|≤
E
(
s, r
)
≤
θ
⇒∀
s
∩
r
:
|
−
θ.
θ
and that a
q
-gram
λ
i
∈
Proof.
Assuming that
E
(
s, r
)
≤
s
matches
q
-gram
λ
j
∈
p
i
p
j
|
>θ
, clearly we need at least
θ
+ 1 edit
operations to transform the position of
λ
i
into that of
λ
j
, without
affecting any other
q
-grams, to obtain one
q
-gram set from the other.
This is a contradiction.
r
with
|
−
The claim states that if two strings are within edit distance
θ
, then
the positions of the matching
q
-grams between the two strings cannot
differ by more than
θ
.
We can use Lemma 8.2 when scanning inverted lists to prune
candidates based on positional information of
q
-grams, by comparing
against the positions of these
q
-grams in the query string. Positional
information can be used for all algorithms discussed in Section 6. In
the algorithms, while scanning a list corresponding to a query
q
-gram
with position
p
, all strings whose positions for that
q
-gram differ by
more than
θ
from
p
can be ignored. A string is considered a viable
candidate if and only if the resulting
q
-gram intersection with the
query exceeds threshold
τ
.
Finally, we can state a simple length filter for edit distance, which
can be used in all algorithms for pruning.
Lemma 8.3 (Edit Distance
L
0
-norm Filter).
Given strings
s, r
and edit distance threshold
θ
E
(
s, r
)
≤
θ
⇒||
s
|−|
r
|| ≤
θ.