Algorithms for Set Based Similarity Using Filtering Techniques - Approximate String Processing

Database Reference

In-Depth Information

token multiple times in order to compute the minimum prefix (even

though when computing the intersection of two prefixes multiplicity

is not important), since considering each token only once, irrespective

of its frequency, would allow extra tokens in the prefix which might

increase the number of false positives at query time.

Extending the prefix filter algorithm for Normalized Weighted

Intersection, Jaccard, Dice, and Cosine similarity is straightforward.

Consider Jaccard similarity first. The prefix signature of a string is

defined as

P θ ( s ) for Jaccard

Definition 7.2 (Jaccard Prefix). The prefix

similarity is defined with respect to:

l

W ( λ i )

π = arg max

1 ≤π≤l

≥ θs 1 .

i = π

The following is true:

P θ

Lemma 7.2. Given two string prefixes

( s ) ,

( r )

⇒P θ

∩P θ

J

( s, r )

≥

θ

( s )

( r )

=

∅

.

Proof. The proof is based on the fact that Jaccard similarity can be

expressed as a weighted intersection constraint. Given two strings s, r

J

( s, r )

≥

θ

⇒

s

∩

r

1

1 ≥

θ

⇒

s

1 +

r

1 −

s

∩

r

θ

1+ θ (

s

∩

r

≥

s

1 +

r

1 ) .

1

From Lemma 6.11,

J

( s, r )

≥

θ

⇒

r

≥

θ

s

1 . Hence,

1

J

( s, r )

≥

θ

⇒

s

∩

r

≥

θ

s

1 .

1

The proof follows directly from the proof of Lemma 7.1 where the

threshold for the weighted intersection similarity becomes θ

s

1 instead

of θ .

Approximate String Processing

Search WWH ::

Custom Search

Home