Algorithms for Set Based Similarity Using Filtering Techniques - Approximate String Processing

Database Reference

In-Depth Information

Careful observation reveals that given two strings s, r the condition

P θ ( s ) ∩P θ ( r ) = ∅ for Jaccard similarity allows false positives that

could be pruned by computing a tighter bound on the weighted inter-

section between two strings from the complete information provided in

their prefixes, as opposed to using only their intersection.

Claim 7.3. Given two strings s, r

1 ≤P θ ( s )

∩P θ ( r )

S θ ( s )

S θ ( r )

∩

1 + max(

1 ,

1 ) .

(7.1)

P θ

∩S θ

P θ

∩S θ

Proof. Let

( r )

( s )=

∅

(the case

( s )

( r )=

∅

is symmet-

ric). Then,

P θ ( r )

∩P θ ( s )

P θ ( r )

∩S θ ( s )

∩

1 =

+ S θ ( r ) ∩P θ ( s ) 1 + S θ ( r ) ∩S θ ( s ) 1

1 +

P θ ( r )

∩P θ ( s )

S θ ( r )

1 +

1 .

P θ

∩S θ

P θ

∩S θ

Since either

has to be true

(given that tokens are sorted in decreasing order of weights) the claim

follows.

( r )

( s )=

∅

( s )

( r )=

∅

Hence

P θ ( s ) ,

P θ ( r )

Lemma 7.4. Given two string prefixes

⇒P θ ( s )

∩P θ ( r )

S θ ( s )

S θ ( r )

( s, r )

≥

1 + max(

1 ,

1 )

1+ θ ( s 1 + r 1 ) .

≥

It is easy to see that similar conditions hold for Weighted Intersection,

Normalized Weighted Intersection, Dice, and Cosine similarity. One

drawback of using the tighter pruning conditions is that in order to

evaluate the condition we need to know both the norm of the prefix of

each data string and the norm of the sux (or alternatively, the norm

of the sux can be computed as the norm of the whole string minus

Approximate String Processing

Search WWH ::

Custom Search

Home