Database Reference
In-Depth Information
Fig. 7.2 If two prefixes have empty intersection then, at best, even if a sux matches
completely, by construction the total intersection weight cannot be larger than
θ
.
since tokens are sorted in decreasing order of weights one of the follow-
ing holds:
P
θ
(
s
)
∩S
θ
(
r
)
P
θ
(
r
)
∩S
θ
(
s
)=
(1) If
=
∅
, then
∅
. By defini-
1
=
λ
i
∈S
θ
(
r
)
W
(
λ
i
)
<θ
. Hence,
S
θ
(
r
)
tion,
P
θ
(
r
)
∩P
θ
(
s
)
P
θ
(
r
)
∩S
θ
(
s
)
I
(
s, r
)=
1
+
S
θ
(
r
)
∩P
θ
(
s
)
1
+
S
θ
(
r
)
∩S
θ
(
s
)
1
1
+
S
θ
(
r
)
∩P
θ
(
s
)
S
θ
(
r
)
∩S
θ
(
s
)
=
1
+
1
≤S
θ
(
r
)
1
<θ.
P
θ
(
r
)
∩S
θ
(
s
)
P
θ
(
s
)
∩S
θ
(
r
)=
(2) If
=
∅
,
then
∅
.
Hence,
similarly
≤S
θ
(
s
)
I
(
s, r
)
1
<θ.
Both cases lead to a contradiction.
The prefix filter reduces the problem of computing the intersection
between the token sets of two strings to that of computing whether the
prefixes of those strings have a non-empty intersection.
Notice that the prefix signature was defined in terms of strings rep-
resented as sets of tokens. The definition for sequences and frequency-
sets is similar. For sequences, each token/position pair is considered as
one element of the prefix. For frequency-sets, each token/frequency
pair is considered as many times as the frequency of the token in
the prefix (i.e., the prefix becomes a bag of tokens with a compressed
token/frequency pair representation). It is important to consider each