answers as efficiently as possible. Clearly, given a query, it is impossible to know a priori what the k-th similarity will be. Hence, we can only resort to heuristics. A simple idea is to start merging lists until the similarity of k elements has been computed. Then, we populate a heap using those k candidates and use the k-th similarity as the query threshold. The algorithm proceeds by maintaining the top k elements in the heap as the similarity of new candidates is computed. The current, always increasing, k-th similarity can be used by the optimized multiway merge algorithm as a threshold for skipping elements from the lists and terminating early.
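As a sketch, the heap maintenance described above might look like the following Python fragment (the similarity function and candidate stream are placeholders for whatever the query actually uses, not part of the original pseudocode):

```python
import heapq

def top_k_merge(candidates, similarity, k):
    """Maintain the k largest similarities seen so far in a min-heap.

    `candidates` yields strings in the order the list merge produces
    them; `similarity` is a placeholder for the query's similarity
    function. The heap root is the current k-th similarity, a
    monotonically non-decreasing threshold that the optimized multiway
    merge can use to skip list entries and terminate early.
    """
    heap = []  # min-heap of (score, string); root holds the k-th best score
    for s in candidates:
        score = similarity(s)
        if len(heap) < k:
            heapq.heappush(heap, (score, s))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, s))
        # once the heap holds k entries, heap[0][0] is the query threshold
    return sorted(heap, reverse=True)
```

Using string length as a stand-in similarity, `top_k_merge(["a", "abc", "ab", "abcd"], len, 2)` keeps only the two longest strings, and the threshold rises from 1 to 3 as candidates stream in.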
6.2.2 Lists Sorted in L_p-norm Order
As we argued for all-match selection queries, Weighted Intersection similarity cannot benefit from L_p-norm ordering, since the function itself does not depend on any string norm.
Consider Normalized Weighted Intersection. Let token lists be sorted in increasing order of L_1-norms. We modify the nra algorithm (see Algorithm 6.1.2) to maintain a heap H containing the k strings with the largest partial similarity computed so far from all strings appearing in candidate set M. The partial similarity of a string is clearly a lower bound on the true similarity (not to be confused with the best-case similarity computed using Lemma 6.4, which is an upper bound on the similarity). Recall that nra maintains a best-case similarity score f^N for a conceptual frontier string. Clearly, f^N is an upper bound on the similarity of any string that has not been encountered in any of the lists yet. Let N_k be the k-th smallest similarity score in heap H. Straightforwardly, when f^N < N_k there are no strings s ∉ M s.t. N(s, v) ≥ N_k, hence the algorithm can stop inserting new strings in M. The algorithm can also evict from M strings whose best-case similarity is smaller than N_k, as both N_k and the best-case similarities of strings in M keep getting tighter. After the frontier condition has been met, the algorithm simply needs to complete the partial similarity scores of strings already in M in order to determine the final top-k answers.
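A simplified sketch of this bookkeeping, assuming Normalized Weighted Intersection of the form N(s, v) = W(s ∩ v) / max(||s||_1, ||v||_1), where W sums token weights; the best-case bound below optimistically credits all not-yet-scanned query token weight and is an illustrative stand-in for Lemma 6.4, not the lemma itself:

```python
import heapq

def nwi_bounds(partial_weight, remaining_weight, norm_s, norm_v):
    """Lower and upper bounds on N(s, v) for a candidate in M.

    partial_weight: common token weight accumulated from the lists
    scanned so far (gives the lower-bound, partial similarity).
    remaining_weight: total weight of query tokens whose lists have
    not yet been scanned for this string (credited in full here; an
    illustrative stand-in for the Lemma 6.4 best-case bound).
    """
    denom = max(norm_s, norm_v)
    lower = partial_weight / denom
    upper = min(partial_weight + remaining_weight, denom) / denom
    return lower, upper

def frontier_prune(M, H, k, f_N):
    """Apply the stopping and eviction rules from the text.

    H is a min-heap of the k largest partial similarities, so H[0]
    is N_k. M maps candidate strings to their best-case (upper-bound)
    scores; f_N is the best-case score of the conceptual frontier
    string. Returns (stop_inserting, surviving candidates).
    """
    N_k = H[0] if len(H) >= k else float("-inf")
    stop_inserting = f_N < N_k  # no unseen string can reach N_k
    survivors = {s: ub for s, ub in M.items() if ub >= N_k}
    return stop_inserting, survivors
```

Once `stop_inserting` holds, no new string enters M, and any candidate whose upper bound falls below N_k can never enter the answer; only the surviving entries of M need their partial scores completed.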
The important question is how to seed the algorithm with a good set of k initial candidates. The following observation is essential.