Database Reference
In-Depth Information
Lemma 6.6. Let L a ⊆ L v be the set of active lists. The terminating
condition
L ( λ j ) L a : f j 1 f i 1 2 W ( λ j )
f =
D
max
L ( λ i
<θ,
f i
v
1 +
) ∈L a
1
does not lead to any false dismissals.
Lemma 6.7. Given query v , candidate string s , and a subset of list
L v ⊆ L v potentially containing s , the best-case similarity score for s is
v
2
1
b ( s, v )=
D
.
s
v
1 +
1
6.1.2.3
Cosine
Recall that cosine similarity is computed based on the L 2 -norm of the
strings. Once again, an inverted index is built where this time each list
element contains tuples (string-identifier/token-position/ L 2 -norm) for
sequences, (string-identifier/token-frequency/ L 2 -norm) for frequency-
sets and (string-identifier/ L 2 -norm) for sets. Cosine similarity is a
monotone aggregate function with partial similarity α i ( s, v )= W ( λ i ) 2
s 2 v 2 .
An L 2 -norm filter also holds.
Lemma 6.8 (Cosine similarity L 2 -norm Filter). Given sets s, r
and Cosine similarity threshold 0 <θ≤ 1 the following holds:
r 2
θ
C
( s, r )
θ
θ
r
s
.
2
2
Proof. For the lower bound:
It holds that
s
r
s
2 . Hence,
2
2 ) 2
2 ) 2
(
s
r
(
s
C
( s, r )
θ
θ
2
θ
θ
r
s
2 .
2
s
2
r
2
s
2
r
 
Search WWH ::




Custom Search