set whose best-case similarity is below the query threshold. The best-case similarity for every candidate is computed by taking the partial similarity score already computed for each candidate after list $L(\lambda_i)$ has been scanned, and assuming that the candidate exists in all subsequent lists $L(\lambda_{i+1}),\ldots,L(\lambda_m)$.
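The best-case pruning step above can be sketched as follows. This is a minimal illustration under assumed idf-style per-token weights and normalized partial scores; the helper names are invented and this is not the text's actual algorithm:

```python
# Sketch of best-case pruning (hypothetical helpers; assumes each
# candidate carries a partial similarity score accumulated while
# scanning lists L(lambda_1)..L(lambda_i)).

def best_case(partial_score, remaining_weights):
    # Upper-bound the final score by assuming the candidate appears
    # in every list that has not been scanned yet.
    return partial_score + sum(remaining_weights)

def prune(candidates, weights, i, theta):
    # candidates: candidate id -> partial score after list i
    # weights:    W(lambda_1) >= ... >= W(lambda_m), the query token weights
    remaining = weights[i:]  # weights of lists L(lambda_{i+1})..L(lambda_m)
    return {s: p for s, p in candidates.items()
            if best_case(p, remaining) >= theta}
```

A candidate is dropped as soon as even appearing in every remaining list cannot lift its score to the threshold.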
The important observation here is that we can compute a tighter $L_1$-norm filtering bound each time a new list is scanned. The idea is to treat the remaining lists as a new query $v' = \lambda_{i+1}\cdots\lambda_m$ and recompute the bounds using Lemma 6.1. The intuition is that the most promising new candidates in lists $L(\lambda_{i+1}),\ldots,L(\lambda_m)$ are the ones containing all tokens $\lambda_{i+1},\ldots,\lambda_m$, and hence potentially satisfying $s = v'$.
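The recomputation just described — treating the remaining lists as a new query — can be sketched as below, assuming Lemma 6.1's $L_1$-norm filter for normalized weighted intersection ($\theta\|v\|_1 \le \|s\|_1 \le \|v\|_1/\theta$); the function names are invented:

```python
def lemma_6_1_bounds(query_weights, theta):
    # Assumed L1-norm filter for normalized weighted intersection:
    # theta * ||v||_1 <= ||s||_1 <= ||v||_1 / theta
    v1 = sum(query_weights)
    return theta * v1, v1 / theta

def recomputed_bounds(weights, i, theta):
    # Re-apply the filter to the "new query" formed by the
    # remaining tokens lambda_{i+1}..lambda_m.
    return lemma_6_1_bounds(weights[i:], theta)
```

This naive recomputation is only the intuition; the lemma that follows keeps the lower bound in terms of the full query and tightens only the upper bound.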
Lemma 6.15 (Norm Tightening). Without loss of generality, given query string $v = \lambda_1 \cdots \lambda_m$ s.t. $W(\lambda_1) \ge \cdots \ge W(\lambda_m)$ and query threshold $0 < \theta \le 1$, for any viable candidate $s$ s.t. $s \notin L(\lambda_1),\ldots,L(\lambda_i)$ the following holds:

• Normalized Weighted Intersection: $N(v,s) \ge \theta \Rightarrow \theta\|v\|_1 \le \|s\|_1 \le \frac{1}{\theta}\,\|\lambda_{i+1}\cdots\lambda_m\|_1$.

• Jaccard: $J(v,s) \ge \theta \Rightarrow \theta\|v\|_1 \le \|s\|_1 \le \frac{1+\theta}{\theta}\,\|\lambda_{i+1}\cdots\lambda_m\|_1 - \|v\|_1$.

• Dice: $D(v,s) \ge \theta \Rightarrow \frac{\theta}{2-\theta}\,\|v\|_1 \le \|s\|_1 \le \frac{2}{\theta}\,\|\lambda_{i+1}\cdots\lambda_m\|_1 - \|v\|_1$.

• Cosine: $C(v,s) \ge \theta \Rightarrow \theta\|v\|_2 \le \|s\|_2 \le \frac{(\|\lambda_{i+1}\cdots\lambda_m\|_2)^2}{\theta\,\|v\|_2}$.
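Numerically, the four tightened filters can be sketched as follows (an illustrative implementation under assumed per-token idf weights; the function and variable names are invented):

```python
import math

def norm_tightening_bounds(weights, i, theta):
    # Bounds on the candidate's norm per Lemma 6.15, for a candidate
    # first encountered after lists L(lambda_1)..L(lambda_i) have
    # been scanned.
    # weights: W(lambda_1) >= ... >= W(lambda_m) of the query tokens.
    v1 = sum(weights)          # ||v||_1
    rest1 = sum(weights[i:])   # ||lambda_{i+1} ... lambda_m||_1
    v2 = math.sqrt(sum(w * w for w in weights))  # ||v||_2
    rest2_sq = sum(w * w for w in weights[i:])   # ||lambda_{i+1}...||_2^2
    return {
        # L1-norm bounds
        "intersection": (theta * v1, rest1 / theta),
        "jaccard": (theta * v1, (1 + theta) / theta * rest1 - v1),
        "dice": (theta / (2 - theta) * v1, 2 / theta * rest1 - v1),
        # L2-norm bounds
        "cosine": (theta * v2, rest2_sq / (theta * v2)),
    }
```

As $i$ grows, the norm of the remaining tokens shrinks, so the upper bounds tighten monotonically and each later list admits fewer viable new candidates.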
Care needs to be taken, though, in order to accurately complete the partial similarity scores of all candidates already inserted in the candidate set from previous lists, which can have $L_1$-norms that do not satisfy the recomputed bounds for $v$. For that purpose the algorithm