Algorithms for Set Based Similarity Using Inverted Indexes - Approximate String Processing

Database Reference

In-Depth Information

or might have to reduce the threshold to a very small value before

identifying the k -th answer.

Similar arguments hold for the optimized mutliway merge algo-

rithm based on sorting lists primarily by norms and secondarily by

string identifiers.

6.2.3

Prioritization Across Lists

Consider weighted intersection similarity first. As was shown for all-

match selection queries, the threshold based algorithms do not help

prune any strings when considering simple weighted intersection, hence

they do not work for top-k queries either. An alternative is to use the

heaviest first algorithm based on prefix and sux lists. Assume

that lists are sorted in increasing order of string identifiers and that

the algorithm processes lists in decreasing order of the corresponding

token weights. Notice that for top-k queries there is no query thresh-

old specified, which is necessary for determining the lists in the pre-

fix/sux sets. Nevertheless, we can assume that the initial threshold

is zero (which implies that all query lists belong to the prefix set) and

start scanning the first list. Each time an element is read, we probe the

remaining lists and complete its similarity. Given that lists are sorted in

increasing order of string identifiers we can use binary search to locate

the strings in these lists (alternatively, an index on string identifiers can

be built for each token list). Once we have retrieved k strings, we can

use the k -th smallest similarity as a new threshold based on which the

algorithm determines a new set of prefix/sux lists. As the k -th simi-

larity keeps improving, lists move from the prefix to the sux set. The

algorithm terminates once the similarity of all viable candidates in the

prefix set has been computed, and the top-k answers have been found.

Notice that once the k -th string has been retrieved the algorithm can

stop doing random accesses to compute the actual similarity of strings,

and it can revert to sequential accesses only. The algorithm terminates

once all lists in the current prefix set have been fully traversed and the

scores of all valid candidates have been computed. A better alternative

here might be to perform random accesses for completing the score of

all remaining viable candidates once the prefix lists have been fully

traversed, or to simply retrieve the candidate strings and compute the

Search WWH ::

Custom Search

Home