Database Reference
In-Depth Information
For the upper bound:
It holds that
s ∩ r
2
≤r
2
. Hence,
2
)
2
s
2
r
2
2
)
2
s
2
r
2
≥
s
∩
r
r
≤
r
(
(
1
C
(
s, r
)
≥
θ
⇒
≥
θ
⇒
θ
⇒
s
.
2
θ
In addition
Observation 6.2.
Given a string
s
with
L
2
-norm
s
2
and the
L
2
-
norms of all frontier elements
2
, we can immediately deduce
whether
s
potentially appears in list
L
(
λ
i
) or not by a simple com-
parison: If
f
i
2
and
s
has not been encountered in list
L
(
λ
i
)
yet, then
s
does not appear in
L
(
λ
i
) (given that lists are sorted in
increasing order of
L
2
-norms).
s
2
<
f
i
Hence
Lemma 6.9.
Let
L
a
⊆ L
v
be the set of active lists. The terminating
condition
L
(
λ
j
)
∈
L
a
:
f
j
2
≤
f
i
2
W
(
λ
j
)
2
f
=
C
max
L
(
λ
i
<θ,
f
i
v
)
∈L
a
2
2
does not lead to any false dismissals.
Lemma 6.10.
Given query
v
, candidate string
s
, and a subset of lists
L
v
⊆
L
v
potentially containing
s
, the best-case similarity score for
s
is
b
(
s, v
)=
(
v
2
)
2
C
s
2
v
2
.
The actual algorithms in principle remain the same.
6.1.2.4
Jaccard
Jaccard similarity presents some diculties. Notice that Jaccard is a
monotone aggregate function with partial similarity
α
i
(
s, v
)=
W
(
λ
i
)
1
.
s
∪
v