Database Reference
In-Depth Information
or might have to reduce the threshold to a very small value before
identifying the k -th answer.
Similar arguments hold for the optimized mutliway merge algo-
rithm based on sorting lists primarily by norms and secondarily by
string identifiers.
6.2.3
Prioritization Across Lists
Consider weighted intersection similarity first. As was shown for all-
match selection queries, the threshold based algorithms do not help
prune any strings when considering simple weighted intersection, hence
they do not work for top-k queries either. An alternative is to use the
heaviest first algorithm based on prefix and sux lists. Assume
that lists are sorted in increasing order of string identifiers and that
the algorithm processes lists in decreasing order of the corresponding
token weights. Notice that for top-k queries there is no query thresh-
old specified, which is necessary for determining the lists in the pre-
fix/sux sets. Nevertheless, we can assume that the initial threshold
is zero (which implies that all query lists belong to the prefix set) and
start scanning the first list. Each time an element is read, we probe the
remaining lists and complete its similarity. Given that lists are sorted in
increasing order of string identifiers we can use binary search to locate
the strings in these lists (alternatively, an index on string identifiers can
be built for each token list). Once we have retrieved k strings, we can
use the k -th smallest similarity as a new threshold based on which the
algorithm determines a new set of prefix/sux lists. As the k -th simi-
larity keeps improving, lists move from the prefix to the sux set. The
algorithm terminates once the similarity of all viable candidates in the
prefix set has been computed, and the top-k answers have been found.
Notice that once the k -th string has been retrieved the algorithm can
stop doing random accesses to compute the actual similarity of strings,
and it can revert to sequential accesses only. The algorithm terminates
once all lists in the current prefix set have been fully traversed and the
scores of all valid candidates have been computed. A better alternative
here might be to perform random accesses for completing the score of
all remaining viable candidates once the prefix lists have been fully
traversed, or to simply retrieve the candidate strings and compute the
Search WWH ::




Custom Search