answers as efficiently as possible. Clearly, given a query, it is impossible to know a priori what the k-th similarity will be. Hence, we can only resort to heuristics. A simple idea is to start merging lists until the similarity of k elements has been computed. Then, we populate a heap using those k candidates and use the k-th similarity as the query threshold. The algorithm proceeds by maintaining the top k elements in the heap as the similarity of new candidates is computed. The current, always increasing, k-th similarity can be used by the optimized multiway merge algorithm as a threshold for skipping elements from the lists and terminating early.
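The heap-based thresholding idea can be sketched as follows. This is a minimal illustration, not the text's algorithm: it assumes candidate similarities arrive as a stream of (string, score) pairs, and `topk_stream` is a hypothetical helper name.

```python
import heapq

def topk_stream(scored_candidates, k):
    """Maintain the top-k similarities seen so far in a min-heap.

    The heap root is the current k-th similarity: it only increases,
    so it can serve as an ever-tightening query threshold for
    skipping list elements and terminating the merge early.
    """
    heap = []  # min-heap of (score, string); root = current k-th score
    for s, score in scored_candidates:
        if len(heap) < k:
            heapq.heappush(heap, (score, s))
        elif score > heap[0][0]:
            # New candidate beats the current k-th similarity.
            heapq.heapreplace(heap, (score, s))
        # Once the heap is full, heap[0][0] is the threshold that
        # would be handed to the optimized multiway merge here.
    return sorted(heap, reverse=True)

# Example: top-2 out of four scored candidates.
print(topk_stream([("a", 0.3), ("b", 0.9), ("c", 0.5), ("d", 0.7)], 2))
# → [(0.9, 'b'), (0.7, 'd')]
```

Because the threshold only ever rises, every skip it justifies remains valid for the rest of the merge.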
6.2.2 Lists Sorted in L_p-norm Order
As we argued for all-match selection queries, Weighted Intersection similarity cannot benefit from L_p-norm ordering, since the function itself does not depend on any string norm.

Consider Normalized Weighted Intersection. Let token lists be sorted in increasing order of L_1-norms. We modify the nra algorithm (see Algorithm 6.1.2) to maintain a heap H containing the k strings with the largest partial similarity computed so far from all strings appearing in candidate set M. The partial similarity of a string is clearly a lower bound on the true similarity (not to be confused with the best-case similarity computed using Lemma 6.4, which is an upper bound on the similarity).
bound on the similarity). Recall that nra maintains a best-case similar-
ity score
f is an upper
bound on the similarity of any string that has not been encountered in
any of the lists yet. Let
f
N
for a conceptual frontier string. Clearly
N
k be the k -th smallest similarity score in heap
H . Straightforwardly, when
N
f <
k
N
N
there are no strings s
M s.t.
k , hence the algorithm can stop inserting new strings in M .
The algorithm can also evict from M strings whose best-case similarity
is smaller than
N
( s, v )
≥N
k and the best-case similarity of strings
in M keep getting tighter. After the frontier condition has been met,
the algorithm needs to simply complete the partial similarity scores of
strings already in M in order to determine the final top-k answers.
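The frontier test and the eviction rule can be sketched as follows. This is an illustrative fragment under stated assumptions, not the book's pseudocode: it assumes the k largest partial (lower-bound) similarities are kept in a min-heap `H`, and the function names are hypothetical.

```python
import heapq

def frontier_reached(H, k, frontier_score):
    """nra stopping condition for new candidates.

    H holds the k largest *partial* similarities (lower bounds) seen
    so far; H[0] is N_k, the k-th smallest score in the heap. When the
    frontier's best-case score N^f drops below N_k, no unseen string
    can make the top-k, so insertion into M can stop.
    """
    return len(H) == k and frontier_score < H[0]

def evict(M, best_case, nk):
    """Drop candidates whose best-case similarity cannot reach N_k."""
    return {s for s in M if best_case[s] >= nk}

# Example with k = 3: N_k = 0.6, so a frontier best-case of 0.5
# satisfies the condition, while 0.65 does not.
H = [0.6, 0.7, 0.9]
heapq.heapify(H)
print(frontier_reached(H, 3, 0.5))                      # → True
print(frontier_reached(H, 3, 0.65))                     # → False
print(evict({"a", "b"}, {"a": 0.55, "b": 0.8}, H[0]))   # → {'b'}
```

Since N_k only increases and the per-candidate best-case bounds only decrease as more list entries are read, both tests become strictly easier to satisfy over time.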
The important question is how to seed the algorithm with a good
set of k initial candidates. The following observation is essential.