Database Reference
In-Depth Information
8.2 Filtering Techniques
8.2.1
The Prefix, Prefix Enumeration, and Pigeonhole
Signatures
Based on Lemma 8.1 and the fact that we are dealing with unit token
weights, we can use all filtering techniques discussed in Section 7 with-
out modifications. We simply build string signatures for Intersection
similarity. Given that we are dealing with unit weights, we sort tokens
using idf weights. Recall that this ordering has the advantage that pre-
fixes contain the least frequent tokens, which decreases the probability
of collisions that lead to false positives. Given query string v and an
arbitrary data string s , Lemma 8.1 states that the intersection of v
and s need to be at least τ = max(
. According to
Definition 7.1, given that we have a maximum of max( |v|,|s| ) − q +1
tokens per string, we need at most τ
|
v
|
,
|
s
|
)
q +1
1 tokens in the string sux
to guarantee no false negatives. Hence, the prefix signature has size
equal to
[max(
|
v
|
,
|
s
|
)
q +1]
( τ
1) = +1 .
Thus, the prefixes of all strings have a fixed size of +1 q -grams,
which is independent of the query or data string length, and depends
only on edit distance threshold θ .
An additional optimization of the prefix filter for top- k join and
self-join queries is also possible, based on the fact that we are dealing
with unit token weights, as discussed in Section 7.1.5.
There is no known way of extending filtering algorithms for evalu-
ating weighted and normalized edit distance.
8.2.2
The Mismatch Filter
The mismatch filter is an extension of the prefix filter that takes into
account the positional information and the content of mismatching q -
grams between two prefix signatures, to perform tighter pruning.
The q -gram Intersection Filter for Edit Distance (see Lemma 8.1)
is based on the fact that since one edit operation can affect at most q
Search WWH ::




Custom Search