Algorithms for Set Based Similarity Using Filtering Techniques - Approximate String Processing - page 74

Database Reference

In-Depth Information

Fig. 7.2 If two prefixes have empty intersection then, at best, even if a sux matches

completely, by construction the total intersection weight cannot be larger than θ .

since tokens are sorted in decreasing order of weights one of the follow-

ing holds:

P θ ( s )

∩S θ ( r )

P θ ( r )

∩S θ ( s )=

(1) If

=

∅

, then

∅

. By defini-

1 = λ i ∈S θ ( r ) W ( λ i ) <θ . Hence,

S θ ( r )

tion,

P θ ( r )

∩P θ ( s )

P θ ( r )

∩S θ ( s )

I

( s, r )=

1

+ S θ ( r ) ∩P θ ( s ) 1 + S θ ( r ) ∩S θ ( s ) 1

1 +

S θ ( r )

∩P θ ( s )

S θ ( r )

∩S θ ( s )

=

1 +

1

≤S θ ( r )

1 <θ.

P θ ( r )

∩S θ ( s )

P θ ( s )

∩S θ ( r )=

(2) If

=

∅

,

then

∅

.

Hence,

similarly

≤S θ ( s )

I

( s, r )

1 <θ.

Both cases lead to a contradiction.

The prefix filter reduces the problem of computing the intersection

between the token sets of two strings to that of computing whether the

prefixes of those strings have a non-empty intersection.

Notice that the prefix signature was defined in terms of strings rep-

resented as sets of tokens. The definition for sequences and frequency-

sets is similar. For sequences, each token/position pair is considered as

one element of the prefix. For frequency-sets, each token/frequency

pair is considered as many times as the frequency of the token in

the prefix (i.e., the prefix becomes a bag of tokens with a compressed

token/frequency pair representation). It is important to consider each

Next Page

Approximate String Processing

Search WWH ::

Custom Search

Home