Database Reference
In-Depth Information
Fig. 7.2 If two prefixes have empty intersection then, at best, even if a sux matches
completely, by construction the total intersection weight cannot be larger than θ .
since tokens are sorted in decreasing order of weights one of the follow-
ing holds:
P θ ( s )
∩S θ ( r )
P θ ( r )
∩S θ ( s )=
(1) If
=
, then
. By defini-
1 = λ i ∈S θ ( r ) W ( λ i ) . Hence,
S θ ( r )
tion,
P θ ( r )
∩P θ ( s )
P θ ( r )
∩S θ ( s )
I
( s, r )=
1
+ S θ ( r ) ∩P θ ( s ) 1 + S θ ( r ) ∩S θ ( s ) 1
1 +
S θ ( r )
∩P θ ( s )
S θ ( r )
∩S θ ( s )
=
1 +
1
≤S θ ( r )
1 <θ.
P θ ( r )
∩S θ ( s )
P θ ( s )
∩S θ ( r )=
(2) If
=
,
then
.
Hence,
similarly
≤S θ ( s )
I
( s, r )
1 <θ.
Both cases lead to a contradiction.
The prefix filter reduces the problem of computing the intersection
between the token sets of two strings to that of computing whether the
prefixes of those strings have a non-empty intersection.
Notice that the prefix signature was defined in terms of strings rep-
resented as sets of tokens. The definition for sequences and frequency-
sets is similar. For sequences, each token/position pair is considered as
one element of the prefix. For frequency-sets, each token/frequency
pair is considered as many times as the frequency of the token in
the prefix (i.e., the prefix becomes a bag of tokens with a compressed
token/frequency pair representation). It is important to consider each
Search WWH ::




Custom Search