Algorithms for Set Based Similarity Using Filtering Techniques - Approximate String Processing

Database Reference

In-Depth Information

in any of the already indexed tokens λ 1 ,...,λ π . Then, it has to insert

a new event s, π +1 ,J

π + s is

larger than the k -th similarity in the result heap. The algorithm stops

once there are no more events to process or the k -th similarity in the

result heap is larger than the similarity upper bound at the top of the

event heap.

We can further improve the pruning eciency of this algorithm

by using the fact that the token lists are created incrementally in

decreasing order of similarity upper bounds. Let string s with cur-

rent indexed prefix

π +1

in the event heap, if and only if J

π ( s ) and last prefix token λ π . Let string r with

π ( r ) whose first common token with s is token

current indexed prefix

π ( r )). By definition, the intersection of the current

λ π (hence λ π ∈S

π ( r )=

. Since λ π is the first token in com-

mon between s and r , and tokens are processed in sorted weight order,

it holds that

π ( s )

prefixes is empty

∩P

∅

π ( r )

π ( s )

∩

1 =

∩S

≤

min(

1 ,

1 ) .

There are two cases to consider

π ( s )

s ,

1 =

1 J

≤

1 J

(1)

∩

≤

π ( r )

π r ,

s .

1 =

≤

(2)

For case (1)

∪

≥

1 +

−

1 J

≥

1 +

−

1 J

⇒

( s, r )=

∩

s ∪ r 1

s 1 J

≤

s 1 + s 1 J s

−s 1 J

J π

≤

s +

−J

Approximate String Processing

Search WWH ::

Custom Search

Home