Algorithms for Set Based Similarity Using Inverted Indexes - Approximate String Processing

Nevertheless, we cannot use this fact for designing termination conditions, for the simple reason that the $\alpha_i$'s cannot be evaluated on a per token list basis, since $\|s \cup v\|_1$ is not known in advance (knowledge of $\|s \cup v\|_1$ implies knowledge of the whole string $s$ and hence knowledge of $\|s \cap v\|_1$, which is equivalent to directly computing the similarity). Recall that an alternative expression for Jaccard is

$$J(s, v) = \frac{\|s \cap v\|_1}{\|s\|_1 + \|v\|_1 - \|s \cap v\|_1}.$$

This expression cannot be decomposed into aggregate parts on a per token basis, and hence is not useful either. Nevertheless, we can still prove various properties of Jaccard that enable us to use all threshold algorithms and the optimized multiway merge algorithm. In particular:
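As a quick check of the alternative expression, the following minimal Python sketch computes Jaccard both from the union and from the two set norms. It assumes token sets are Python sets and that $\|\cdot\|_1$ is the sum of per-token weights (unit weights when none are supplied); the function names `l1_norm`, `jaccard_union` and `jaccard_alt` are illustrative and do not come from the text.

```python
def l1_norm(tokens, weights=None):
    """L1 norm of a token set: sum of token weights (unit weight by default)."""
    weights = weights or {}
    return sum(weights.get(t, 1.0) for t in tokens)


def jaccard_union(s, v, weights=None):
    """Jaccard as ||s ∩ v||_1 / ||s ∪ v||_1."""
    s, v = set(s), set(v)
    union = l1_norm(s | v, weights)
    return l1_norm(s & v, weights) / union if union else 0.0


def jaccard_alt(s, v, weights=None):
    """Jaccard as ||s ∩ v||_1 / (||s||_1 + ||v||_1 - ||s ∩ v||_1)."""
    s, v = set(s), set(v)
    inter = l1_norm(s & v, weights)
    denom = l1_norm(s, weights) + l1_norm(v, weights) - inter
    return inter / denom if denom else 0.0


if __name__ == "__main__":
    s = {"ab", "bc", "cd"}
    v = {"bc", "cd", "de"}
    print(jaccard_union(s, v), jaccard_alt(s, v))
```

Both forms need the full intersection weight, which is only known once every token list has been examined. That is why neither form yields a per-list termination test.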

Lemma 6.11 (Jaccard $L_1$-norm Filter). Given sets $s$, $r$ and Jaccard similarity threshold $0 < \theta \le 1$, the following holds:

$$J(s, r) \ge \theta \;\Rightarrow\; \theta \|r\|_1 \le \|s\|_1 \le \frac{\|r\|_1}{\theta}.$$

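Proof. For the lower bound: It holds that $\|s \cup r\|_1 \ge \|r\|_1$ and $\|s \cap r\|_1 \le \|s\|_1$. Hence,

$$J(s, r) \ge \theta \;\Rightarrow\; \frac{\|s \cap r\|_1}{\|s \cup r\|_1} \ge \theta \;\Rightarrow\; \frac{\|s\|_1}{\|r\|_1} \ge \theta \;\Rightarrow\; \theta \|r\|_1 \le \|s\|_1.$$

For the upper bound: It holds that $\|s \cup r\|_1 \ge \|s\|_1$ and $\|s \cap r\|_1 \le \|r\|_1$. Hence,

$$J(s, r) \ge \theta \;\Rightarrow\; \frac{\|s \cap r\|_1}{\|s \cup r\|_1} \ge \theta \;\Rightarrow\; \frac{\|r\|_1}{\|s\|_1} \ge \theta \;\Rightarrow\; \|s\|_1 \le \frac{\|r\|_1}{\theta}.$$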
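The norm bounds of Lemma 6.11 can be used to discard candidates before computing any intersection. The sketch below is one way this might look in Python, assuming unit token weights so that $\|s\|_1$ is simply the set size; the names `passes_length_filter` and `filter_then_verify` are hypothetical helpers, not part of the text.

```python
def jaccard(s, r):
    """Unit-weight Jaccard similarity of two sets."""
    s, r = set(s), set(r)
    union = len(s | r)
    return len(s & r) / union if union else 0.0


def passes_length_filter(s_norm, r_norm, theta):
    """Necessary condition for J(s, r) >= theta:
    theta * ||r||_1 <= ||s||_1 <= ||r||_1 / theta."""
    if not 0.0 < theta <= 1.0:
        raise ValueError("theta must satisfy 0 < theta <= 1")
    return theta * r_norm <= s_norm <= r_norm / theta


def filter_then_verify(query, collection, theta):
    """Prune by norm first, then verify survivors with the exact Jaccard."""
    q = set(query)
    results = []
    for cand in collection:
        c = set(cand)
        # Cheap test: only the two norms are needed.
        if not passes_length_filter(len(c), len(q), theta):
            continue
        # The filter is necessary, not sufficient, so verification is required.
        if jaccard(c, q) >= theta:
            results.append(cand)
    return results
```

Sets that pass the filter may still fall below the threshold (two disjoint singletons pass it for any $\theta$), which is why the exact similarity is still computed for the survivors.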

Lemma 6.12. Let $L_a \subseteq L_v$ be the set of active lists. Let

$$I_i = \sum_{L(\lambda_j) \in L_a \,:\, \|f_j\|_1 \le \|f_i\|_1} W(\lambda_j).$$

The terminating condition

$$f = \max_{L(\lambda_i) \in L_a} \frac{I_i}{\|f_i\|_1 + \|v\|_1 - I_i} < \theta$$

does not lead to any false dismissals.
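A rough Python sketch of the terminating condition follows. It rests on several assumptions that this excerpt does not spell out: $f_i$ is read as the frontier element of list $L(\lambda_i)$, each active list is summarized by the pair (frontier norm $\|f_i\|_1$, list weight $W(\lambda_i)$), and the second norm in the denominator is taken to be the query norm $\|v\|_1$. The names `termination_bound` and `can_terminate` are illustrative only.

```python
def termination_bound(active_lists, v_norm):
    """Largest Jaccard value the condition of Lemma 6.12 allows.

    active_lists: list of (frontier_norm, list_weight) pairs, one per active list.
    v_norm: L1 norm of the query v.
    """
    best = 0.0
    for f_norm, _ in active_lists:
        # I_i: total weight of active lists whose frontier norm is <= ||f_i||_1.
        i_i = sum(w for g_norm, w in active_lists if g_norm <= f_norm)
        denom = f_norm + v_norm - i_i
        # Defensive guard for inconsistent inputs: never terminate in that case.
        bound = i_i / denom if denom > 0 else float("inf")
        best = max(best, bound)
    return best


def can_terminate(active_lists, v_norm, theta):
    """True when the maximum bound over all active lists is below theta."""
    return termination_bound(active_lists, v_norm) < theta


if __name__ == "__main__":
    # Three active lists with unit weights; query norm 3 (hypothetical numbers).
    lists = [(5.0, 1.0), (6.0, 1.0), (8.0, 1.0)]
    print(termination_bound(lists, 3.0), can_terminate(lists, 3.0, 0.5))
```

Computing $I_i$ by rescanning all active lists is quadratic in the number of lists; sorting the frontiers by norm and keeping a running sum of weights would give the same values in one pass.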
