Database Reference
In-Depth Information
Nevertheless, we cannot use this fact for designing termination condi-
tions for the simple reason that
α
i
's cannot be evaluated on a per token
list basis since
s
∪
v
1
is not known in advance (knowledge of
s
∪
v
1
implies knowledge of the whole string
s
and hence knowledge of
1
which is equivalent to directly computing the similarity). Recall that
an alternative expression for Jaccard is
s
∩
v
s∩v
1
s
1
+
v
1
−s∩v
1
. This
expression cannot be decomposed into aggregate parts on a per token
basis, and hence is not useful either. Nevertheless, we can still prove
various properties of Jaccard that enable us to use all
threshold
algo-
rithms and the
optimized multiway merge
algorithm. In particular
J
(
s, v
)=
Lemma 6.11(Jaccard
L
1
-norm Filter).
Given sets
s, r
and Jaccard
similarity threshold 0
<θ≤
1 the following holds:
J
(
s, r
)
≥ θ ⇔ θr
1
≤s
1
≤
r
1
.
θ
Proof.
For the lower bound:
It holds that
s
∪
r
≥
r
1
and
s
∩
r
≤
s
1
. Hence,
1
1
⇒
s
∩
r
1
⇒
s
1
J
(
s, r
)
≥
θ
1
≥
θ
1
≥
θ
⇒
θ
r
≤
s
1
.
1
s
∪
r
r
For the upper bound:
It holds that
s
∪
r
1
≥
s
1
and
s
∩
r
1
≤
r
1
. Hence,
⇒
s
∩
r
1
⇒
r
1
≤
r
1
θ
J
(
s, r
)
≥
θ
1
≥
θ
1
≥
θ
⇒
s
.
1
s
∪
r
s
Lemma 6.12.
Let
L
a
⊆
L
v
be the set of active lists. Let
I
i
=
L
(
λ
j
)
∈L
a
:
f
j
1
≤f
i
1
W
(
λ
j
). The terminating condition
I
i
f
=
J
max
L
(
λ
i
<θ,
f
i
1
+
v
1
−
I
i
)
∈L
a
does not lead to any false dismissals.