Database Reference
In-Depth Information
Algorithm 7.1.1:
Top-
k
Join(
S,k
)
R
is a min-heap initialized with the first
k
pairs in
S
E
is a max-heap initialized with one entry
s,
1
,
1
.
0
per
s
∈
S
L
(
λ
i
) are empty token lists
Let
k
denote the similarity of the top element in
R
while
E
is not empty
J
s
s, π,J
←E
.pop
s
if
J
≤J
k
then Break
for all
(
r,r
1
)
∈ L
(
λ
π
)
if
J
r
≤
s
≤
r
1
/
J
k
1
1
k
then
R
.insert(
do
J
(
s, r
)
,
(
s, r
))
R
.pop
=
s
1
−
λ
1
···
λ
π−
1
1
s
1
π
+1
s
J
π
+1
s
if
J
>
J
k
then
L
(
λ
π
)
←
(
s,
s
1
)
π
+1
s
E
.push(
s, π
+1
,
J
)
algorithm probes the existing index (as it is being built incrementally)
to identify candidate pairs
s, r
, immediately computes the exact simi-
larity
(
s, r
) for all
r
, and updates the top-
k
result heap as necessary.
Then, the algorithm inserts
s
in token list
L
(
λ
π
) if necessary, and also
computes a hypothetical maximum similarity threshold between
s
and
any unseen string that does not match with
s
on the already indexed
tokens. The algorithm continues iteratively until there is no string
s
whose hypothetical maximum similarity threshold is larger than the
k
-th similarity in the result set.
In more detail, first the algorithm initializes a min-heap with
k
arbi-
trary pairs (e.g., the first
k
pairs in
S
) sorted according to their sim-
ilarity and creates an event max-heap containing one entry per string
in
S
. Each event
J
s
is a triple consisting of a string identifier
s
,
a prefix length
π
, and a similarity upper bound
s, π,
J
s
. The event heap is
J