Database Reference
In-Depth Information
Lemma 6.6.
Let
L
a
⊆ L
v
be the set of active lists. The terminating
condition
L
(
λ
j
)
∈
L
a
:
f
j
1
≤
f
i
1
2
W
(
λ
j
)
f
=
D
max
L
(
λ
i
<θ,
f
i
v
1
+
)
∈L
a
1
does not lead to any false dismissals.
Lemma 6.7.
Given query
v
, candidate string
s
, and a subset of list
L
v
⊆ L
v
potentially containing
s
, the best-case similarity score for
s
is
v
2
1
b
(
s, v
)=
D
.
s
v
1
+
1
6.1.2.3
Cosine
Recall that cosine similarity is computed based on the
L
2
-norm of the
strings. Once again, an inverted index is built where this time each list
element contains tuples (string-identifier/token-position/
L
2
-norm) for
sequences, (string-identifier/token-frequency/
L
2
-norm) for frequency-
sets and (string-identifier/
L
2
-norm) for sets. Cosine similarity is a
monotone aggregate function with partial similarity
α
i
(
s, v
)=
W
(
λ
i
)
2
s
2
v
2
.
An
L
2
-norm filter also holds.
Lemma 6.8 (Cosine similarity
L
2
-norm Filter).
Given sets
s, r
and Cosine similarity threshold 0
<θ≤
1 the following holds:
≤
r
2
θ
C
(
s, r
)
≥
θ
⇔
θ
r
≤
s
.
2
2
Proof.
For the lower bound:
It holds that
s
∩
r
≤
s
2
. Hence,
2
2
)
2
2
)
2
(
s
∩
r
(
s
C
(
s, r
)
≥
θ
⇒
≥
θ
⇒
2
≥
θ
⇒
θ
r
≤
s
2
.
2
s
2
r
2
s
2
r