set whose best-case similarity is below the query threshold. The best-case similarity for every candidate is computed by taking the partial similarity score already computed for each candidate after list $L(\lambda_i)$ has been scanned, and assuming that the candidate exists in all subsequent lists $L(\lambda_{i+1}),\ldots,L(\lambda_m)$.
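The best-case pruning step above can be sketched as follows. This is a minimal illustration under assumed idf-style per-token weights and normalized partial scores; the helper names are invented and this is not the text's actual algorithm:

```python
# Sketch of best-case pruning (hypothetical helpers; assumes each
# candidate carries a partial similarity score accumulated while
# scanning lists L(lambda_1)..L(lambda_i)).

def best_case(partial_score, remaining_weights):
    # Upper-bound the final score by assuming the candidate appears
    # in every list that has not been scanned yet.
    return partial_score + sum(remaining_weights)

def prune(candidates, weights, i, theta):
    # candidates: candidate id -> partial score after list i
    # weights:    W(lambda_1) >= ... >= W(lambda_m), the query token weights
    remaining = weights[i:]  # weights of lists L(lambda_{i+1})..L(lambda_m)
    return {s: p for s, p in candidates.items()
            if best_case(p, remaining) >= theta}
```

A candidate is dropped as soon as even appearing in every remaining list cannot lift its score to the threshold.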
The important observation here is that we can compute a tighter $L_1$-norm filtering bound each time a new list is scanned. The idea is to treat the remaining lists as a new query $v' = \lambda_{i+1}\cdots\lambda_m$ and recompute the bounds using Lemma 6.1. The intuition is that the most promising new candidates in lists $L(\lambda_{i+1}),\ldots,L(\lambda_m)$ are the ones containing all tokens $\lambda_{i+1},\ldots,\lambda_m$, and hence potentially satisfying $s = v'$.
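The recomputation just described — treating the remaining lists as a new query — can be sketched as below, assuming Lemma 6.1's $L_1$-norm filter for normalized weighted intersection ($\theta\|v\|_1 \le \|s\|_1 \le \|v\|_1/\theta$); the function names are invented:

```python
def lemma_6_1_bounds(query_weights, theta):
    # Assumed L1-norm filter for normalized weighted intersection:
    # theta * ||v||_1 <= ||s||_1 <= ||v||_1 / theta
    v1 = sum(query_weights)
    return theta * v1, v1 / theta

def recomputed_bounds(weights, i, theta):
    # Re-apply the filter to the "new query" formed by the
    # remaining tokens lambda_{i+1}..lambda_m.
    return lemma_6_1_bounds(weights[i:], theta)
```

This naive recomputation is only the intuition; the lemma that follows keeps the lower bound in terms of the full query and tightens only the upper bound.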
Lemma 6.15 (Norm Tightening). Without loss of generality, given query string $v = \lambda_1 \cdots \lambda_m$ s.t. $W(\lambda_1) \ge \cdots \ge W(\lambda_m)$ and query threshold $0 < \theta \le 1$, for any viable candidate $s$ s.t. $s \notin L(\lambda_1),\ldots,L(\lambda_i)$ the following holds:

• Normalized Weighted Intersection: $N(v,s) \ge \theta \Rightarrow \theta\|v\|_1 \le \|s\|_1 \le \frac{1}{\theta}\,\|\lambda_{i+1}\cdots\lambda_m\|_1$.

• Jaccard: $J(v,s) \ge \theta \Rightarrow \theta\|v\|_1 \le \|s\|_1 \le \frac{1+\theta}{\theta}\,\|\lambda_{i+1}\cdots\lambda_m\|_1 - \|v\|_1$.

• Dice: $D(v,s) \ge \theta \Rightarrow \frac{\theta}{2-\theta}\,\|v\|_1 \le \|s\|_1 \le \frac{2}{\theta}\,\|\lambda_{i+1}\cdots\lambda_m\|_1 - \|v\|_1$.

• Cosine: $C(v,s) \ge \theta \Rightarrow \theta\|v\|_2 \le \|s\|_2 \le \frac{(\|\lambda_{i+1}\cdots\lambda_m\|_2)^2}{\theta\,\|v\|_2}$.
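Numerically, the four tightened filters can be sketched as follows (an illustrative implementation under assumed per-token idf weights; the function and variable names are invented):

```python
import math

def norm_tightening_bounds(weights, i, theta):
    # Bounds on the candidate's norm per Lemma 6.15, for a candidate
    # first encountered after lists L(lambda_1)..L(lambda_i) have
    # been scanned.
    # weights: W(lambda_1) >= ... >= W(lambda_m) of the query tokens.
    v1 = sum(weights)          # ||v||_1
    rest1 = sum(weights[i:])   # ||lambda_{i+1} ... lambda_m||_1
    v2 = math.sqrt(sum(w * w for w in weights))  # ||v||_2
    rest2_sq = sum(w * w for w in weights[i:])   # ||lambda_{i+1}...||_2^2
    return {
        # L1-norm bounds
        "intersection": (theta * v1, rest1 / theta),
        "jaccard": (theta * v1, (1 + theta) / theta * rest1 - v1),
        "dice": (theta / (2 - theta) * v1, 2 / theta * rest1 - v1),
        # L2-norm bounds
        "cosine": (theta * v2, rest2_sq / (theta * v2)),
    }
```

As $i$ grows, the norm of the remaining tokens shrinks, so the upper bounds tighten monotonically and each later list admits fewer viable new candidates.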
Care needs to be taken, though, in order to accurately complete the partial similarity scores of all candidates already inserted in the candidate set from previous lists, which can have $L_1$-norms that do not satisfy the recomputed bounds for $v$. For that purpose the algorithm