Database Reference
In-Depth Information
Nevertheless, we cannot use this fact for designing termination condi-
tions for the simple reason that α i 's cannot be evaluated on a per token
list basis since
s
v
1 is not known in advance (knowledge of
s
v
1
implies knowledge of the whole string s and hence knowledge of
1
which is equivalent to directly computing the similarity). Recall that
an alternative expression for Jaccard is
s
v
s∩v 1
s 1 + v 1 −s∩v 1 . This
expression cannot be decomposed into aggregate parts on a per token
basis, and hence is not useful either. Nevertheless, we can still prove
various properties of Jaccard that enable us to use all threshold algo-
rithms and the optimized multiway merge algorithm. In particular
J
( s, v )=
Lemma 6.11(Jaccard L 1 -norm Filter). Given sets s, r and Jaccard
similarity threshold 0 <θ≤ 1 the following holds:
J ( s, r ) ≥ θ ⇔ θr 1 ≤s 1
r
1
.
θ
Proof. For the lower bound:
It holds that
s
r
r
1 and
s
r
s
1 . Hence,
1
1
s
r
1
s
1
J
( s, r )
θ
1
θ
1
θ
θ
r
s
1 .
1
s
r
r
For the upper bound:
It holds that
s
r
1
s
1 and
s
r
1
r
1 . Hence,
s r 1
r 1
r 1
θ
J
( s, r )
θ
1
θ
1
θ
s
.
1
s
r
s
Lemma 6.12.
Let L a
L v be the set of active lists. Let I i =
L ( λ j ) ∈L a : f j 1 ≤f i 1 W ( λ j ). The terminating condition
I i
f =
J
max
L ( λ i
<θ,
f i 1 +
v
1
I i
) ∈L a
does not lead to any false dismissals.
 
Search WWH ::




Custom Search