Algorithms for Set Based Similarity Using Filtering Techniques - Approximate String Processing

Database Reference

In-Depth Information

Algorithm 7.1.1: Top- k Join( S,k )

R is a min-heap initialized with the first k pairs in S

E is a max-heap initialized with one entry

s, 1 , 1 . 0

per s

∈

L ( λ i ) are empty token lists

Let

k denote the similarity of the top element in R

while E is not empty

s, π,J

←E .pop

≤J

then Break

for all ( r,r 1 ) ∈ L ( λ π )

≤

1 /

then R .insert(

( s, r ) , ( s, r ))

R .pop

= s 1 − λ 1 ··· λ π− 1 1

s 1

π +1

if J

J k

then L ( λ π )

←

( s,

1 )

π +1

E .push(

s, π +1 ,

)

algorithm probes the existing index (as it is being built incrementally)

to identify candidate pairs s, r , immediately computes the exact simi-

larity

( s, r ) for all r , and updates the top- k result heap as necessary.

Then, the algorithm inserts s in token list L ( λ π ) if necessary, and also

computes a hypothetical maximum similarity threshold between s and

any unseen string that does not match with s on the already indexed

tokens. The algorithm continues iteratively until there is no string s

whose hypothetical maximum similarity threshold is larger than the

k -th similarity in the result set.

In more detail, first the algorithm initializes a min-heap with k arbi-

trary pairs (e.g., the first k pairs in S ) sorted according to their sim-

ilarity and creates an event max-heap containing one entry per string

in S . Each event

is a triple consisting of a string identifier s ,

a prefix length π , and a similarity upper bound

s, π,

s . The event heap is

Approximate String Processing

Search WWH ::

Custom Search

Home