set whose best-case similarity is below the query threshold. The best-case similarity for every candidate is computed by taking the partial similarity score already computed for each candidate after list L(λ_i) has been scanned, and assuming that the candidate exists in all subsequent lists L(λ_{i+1}),...,L(λ_{v_m}).
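As a concrete illustration, a minimal sketch of this best-case estimate is given below, assuming Normalized Weighted Intersection as the similarity measure and assuming, for illustration only, that each inverted-list entry carries the candidate's L_1-norm; all function and parameter names are hypothetical.

```python
def best_case_nwi(partial_overlap: float,
                  remaining_query_weight: float,
                  v_norm1: float,
                  s_norm1: float) -> float:
    """Upper bound on N(v, s) for a candidate s seen so far only in the
    already-scanned lists L(lambda_1), ..., L(lambda_i).

    partial_overlap        -- total weight of query tokens already matched in s
    remaining_query_weight -- W(lambda_{i+1}) + ... + W(lambda_{v_m})
    v_norm1, s_norm1       -- L1-norms of the query and the candidate
    """
    # Best case: s also appears in every remaining list, i.e. it contains
    # all not-yet-scanned query tokens with their full weights.
    best_overlap = partial_overlap + remaining_query_weight
    return best_overlap / max(v_norm1, s_norm1)
```

A candidate can then be dropped from the candidate set as soon as this best-case score falls below the query threshold θ.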
The important observation here is that we can compute a tighter L_1-norm filtering bound each time a new list is scanned. The idea is to treat the remaining lists as a new query v′ = λ_{i+1} ··· λ_{v_m} and recompute the bounds using Lemma 6.1. The intuition is that the most promising new candidates in lists L(λ_{i+1}),...,L(λ_{v_m}) are the ones containing all tokens λ_{i+1},...,λ_{v_m}, and hence potentially satisfying s = v′.
Lemma 6.15 (Norm Tightening). Without loss of generality, given query string v = λ_1 ··· λ_{v_m} s.t. W(λ_1) ≥ ··· ≥ W(λ_{v_m}) and query threshold 0 < θ ≤ 1, for any viable candidate s s.t. s ∉ L(λ_1),...,L(λ_i) the following holds:

Normalized Weighted Intersection:

\[ N(v,s) \ge \theta \;\Rightarrow\; \theta\,\|v\|_1 \;\le\; \|s\|_1 \;\le\; \frac{\|\lambda_{i+1} \cdots \lambda_{v_m}\|_1}{\theta}. \]

Jaccard:

\[ J(v,s) \ge \theta \;\Rightarrow\; \theta\,\|v\|_1 \;\le\; \|s\|_1 \;\le\; \frac{1+\theta}{\theta}\,\|\lambda_{i+1} \cdots \lambda_{v_m}\|_1 - \|v\|_1. \]

Dice:

\[ D(v,s) \ge \theta \;\Rightarrow\; \frac{\theta}{2-\theta}\,\|v\|_1 \;\le\; \|s\|_1 \;\le\; \frac{2}{\theta}\,\|\lambda_{i+1} \cdots \lambda_{v_m}\|_1 - \|v\|_1. \]

Cosine:

\[ C(v,s) \ge \theta \;\Rightarrow\; \theta\,\|v\|_2 \;\le\; \|s\|_2 \;\le\; \frac{\left(\|\lambda_{i+1} \cdots \lambda_{v_m}\|_2\right)^2}{\theta\,\|v\|_2}. \]
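Translated directly into code, these recomputed bounds might look like the following minimal sketch; the function and parameter names (e.g., suffix_norm1 for ‖λ_{i+1} ··· λ_{v_m}‖_1) are illustrative and not taken from the text.

```python
def tightened_norm_bounds(theta: float,
                          v_norm1: float,
                          suffix_norm1: float,
                          measure: str) -> tuple[float, float]:
    """L1-norm bounds of Lemma 6.15 after lists L(lambda_1)..L(lambda_i) have
    been exhausted; suffix_norm1 is ||lambda_{i+1} ... lambda_{v_m}||_1.
    Any new viable candidate s must satisfy lower <= ||s||_1 <= upper."""
    if measure == "nwi":       # Normalized Weighted Intersection
        return theta * v_norm1, suffix_norm1 / theta
    if measure == "jaccard":
        return theta * v_norm1, (1 + theta) / theta * suffix_norm1 - v_norm1
    if measure == "dice":
        return theta / (2 - theta) * v_norm1, 2 / theta * suffix_norm1 - v_norm1
    raise ValueError(f"unknown measure: {measure}")


def tightened_norm_bounds_cosine(theta: float,
                                 v_norm2: float,
                                 suffix_norm2: float) -> tuple[float, float]:
    """Cosine uses L2-norms; suffix_norm2 is ||lambda_{i+1} ... lambda_{v_m}||_2."""
    return theta * v_norm2, suffix_norm2 ** 2 / (theta * v_norm2)
```

Because the suffix norms can only shrink as more lists are exhausted, the upper bounds become progressively tighter, which is what allows more new candidates to be rejected as the scan proceeds.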
Care needs to be taken, though, in order to accurately complete the partial similarity scores of all candidates already inserted in the candidate set from previous lists, which can have L_1-norms that do not satisfy the recomputed bounds for v′. For that purpose the algorithm
 