Database Reference
In-Depth Information
token multiple times in order to compute the minimum prefix (even
though when computing the intersection of two prefixes multiplicity
is not important), since considering each token only once, irrespective
of its frequency, would allow extra tokens in the prefix which might
increase the number of false positives at query time.
Extending the prefix filter algorithm for Normalized Weighted
Intersection, Jaccard, Dice, and Cosine similarity is straightforward.
Consider Jaccard similarity first. The prefix signature of a string is
defined as
P θ ( s ) for Jaccard
Definition 7.2 (Jaccard Prefix). The prefix
similarity is defined with respect to:
l
W ( λ i )
π = arg max
1 ≤π≤l
≥ θs 1 .
i = π
The following is true:
P θ
P θ
Lemma 7.2. Given two string prefixes
( s ) ,
( r )
⇒P θ
∩P θ
J
( s, r )
θ
( s )
( r )
=
.
Proof. The proof is based on the fact that Jaccard similarity can be
expressed as a weighted intersection constraint. Given two strings s, r
J
( s, r )
θ
s
r
1
1
θ
s
1 +
r
1
s
r
θ
1+ θ (
s
r
s
1 +
r
1 ) .
1
From Lemma 6.11,
J
( s, r )
θ
r
θ
s
1 . Hence,
1
J
( s, r )
θ
s
r
θ
s
1 .
1
The proof follows directly from the proof of Lemma 7.1 where the
threshold for the weighted intersection similarity becomes θ
s
1 instead
of θ .
 
Search WWH ::




Custom Search