Algorithms for Edit Based Similarity - Approximate String Processing

Database Reference

In-Depth Information

8.2 Filtering Techniques

8.2.1

The Prefix, Prefix Enumeration, and Pigeonhole

Signatures

Based on Lemma 8.1 and the fact that we are dealing with unit token

weights, we can use all filtering techniques discussed in Section 7 with-

out modifications. We simply build string signatures for Intersection

similarity. Given that we are dealing with unit weights, we sort tokens

using idf weights. Recall that this ordering has the advantage that pre-

fixes contain the least frequent tokens, which decreases the probability

of collisions that lead to false positives. Given query string v and an

arbitrary data string s , Lemma 8.1 states that the intersection of v

and s need to be at least τ = max(

qθ . According to

Definition 7.1, given that we have a maximum of max( |v|,|s| ) − q +1

tokens per string, we need at most τ

|

v

|

,

|

s

|

)

−

q +1

−

1 tokens in the string sux

to guarantee no false negatives. Hence, the prefix signature has size

equal to

−

[max(

|

v

|

,

|

s

|

)

−

q +1]

−

( τ

−

1) = qθ +1 .

Thus, the prefixes of all strings have a fixed size of qθ +1 q -grams,

which is independent of the query or data string length, and depends

only on edit distance threshold θ .

An additional optimization of the prefix filter for top- k join and

self-join queries is also possible, based on the fact that we are dealing

with unit token weights, as discussed in Section 7.1.5.

There is no known way of extending filtering algorithms for evalu-

ating weighted and normalized edit distance.

8.2.2

The Mismatch Filter

The mismatch filter is an extension of the prefix filter that takes into

account the positional information and the content of mismatching q -

grams between two prefix signatures, to perform tighter pruning.

The q -gram Intersection Filter for Edit Distance (see Lemma 8.1)

is based on the fact that since one edit operation can affect at most q

Approximate String Processing

Search WWH ::

Custom Search

Home