Database Reference
In-Depth Information
Algorithm 7.1.1: Top- k Join( S,k )
R is a min-heap initialized with the first k pairs in S
E is a max-heap initialized with one entry
s, 1 , 1 . 0
per s
S
L ( λ i ) are empty token lists
Let
k denote the similarity of the top element in R
while E is not empty
J
s
s, π,J
←E .pop
s
if
J
≤J
k
then Break
for all ( r,r 1 ) ∈ L ( λ π )
if
J
r
s
r
1 /
J
k
1
1
k
then R .insert(
do
J
( s, r ) , ( s, r ))
R .pop
= s 1 λ 1 ··· λ π− 1 1
s 1
π +1
s
J
π +1
s
if J
>
J k
then L ( λ π )
( s,
s
1 )
π +1
s
E .push(
s, π +1 ,
J
)
algorithm probes the existing index (as it is being built incrementally)
to identify candidate pairs s, r , immediately computes the exact simi-
larity
( s, r ) for all r , and updates the top- k result heap as necessary.
Then, the algorithm inserts s in token list L ( λ π ) if necessary, and also
computes a hypothetical maximum similarity threshold between s and
any unseen string that does not match with s on the already indexed
tokens. The algorithm continues iteratively until there is no string s
whose hypothetical maximum similarity threshold is larger than the
k -th similarity in the result set.
In more detail, first the algorithm initializes a min-heap with k arbi-
trary pairs (e.g., the first k pairs in S ) sorted according to their sim-
ilarity and creates an event max-heap containing one entry per string
in S . Each event
J
s
is a triple consisting of a string identifier s ,
a prefix length π , and a similarity upper bound
s, π,
J
s . The event heap is
J
 
Search WWH ::




Custom Search