Database Reference
In-Depth Information
Careful observation reveals that given two strings s, r the condition
P θ ( s ) ∩P θ ( r ) = for Jaccard similarity allows false positives that
could be pruned by computing a tighter bound on the weighted inter-
section between two strings from the complete information provided in
their prefixes, as opposed to using only their intersection.
Claim 7.3. Given two strings s, r
1 ≤P θ ( s )
∩P θ ( r )
S θ ( s )
S θ ( r )
s
r
1 + max(
1 ,
1 ) .
(7.1)
P θ
∩S θ
P θ
∩S θ
Proof. Let
( r )
( s )=
(the case
( s )
( r )=
is symmet-
ric). Then,
P θ ( r )
∩P θ ( s )
P θ ( r )
∩S θ ( s )
s
r
1 =
1
+ S θ ( r ) ∩P θ ( s ) 1 + S θ ( r ) ∩S θ ( s ) 1
1 +
P θ ( r )
∩P θ ( s )
S θ ( r )
<
1 +
1 .
P θ
∩S θ
P θ
∩S θ
Since either
has to be true
(given that tokens are sorted in decreasing order of weights) the claim
follows.
( r )
( s )=
or
( s )
( r )=
Hence
P θ ( s ) ,
P θ ( r )
Lemma 7.4. Given two string prefixes
⇒P θ ( s )
∩P θ ( r )
S θ ( s )
S θ ( r )
J
( s, r )
θ
1 + max(
1 ,
1 )
θ
1+ θ ( s 1 + r 1 ) .
It is easy to see that similar conditions hold for Weighted Intersection,
Normalized Weighted Intersection, Dice, and Cosine similarity. One
drawback of using the tighter pruning conditions is that in order to
evaluate the condition we need to know both the norm of the prefix of
each data string and the norm of the sux (or alternatively, the norm
of the sux can be computed as the norm of the whole string minus
 
Search WWH ::




Custom Search