Database Reference
In-Depth Information
to the heap, and repeat. An important observation here is that an ele-
ment that was popped from the heap might end up subsequently being
re-inserted to the heap, if it happens to be equal to the top element
of the heap after the θ
1 total removals. This is possible given that
a skip operation might result in skipping backwards on some of the
lists. The correctness of the algorithm follows from the fact that if an
element appears in at least θ lists, eventually it will have to appear
at the top of the heap θ times (after potentially multiple reinsertions
to the heap). The advantage of the algorithm is that it will result in
many list elements that appear in fewer than θ lists to be instantly
skipped. Skipping might be beneficial for main memory processing or
lists stored on solid state drives, but not so for disk resident inverted
lists due to the cost of the random seeks required in order to skip over
elements.
A generalization of this algorithm is to utilize the fact that some
token lists are short and others are long. We divide the lists into two
groups, one with the τ longest lists and one with the rest of the lists.
We use the aforementioned optimized multiway merge algorithm on
the short lists to identify candidate strings that appear at least θ − τ
times. Then, we probe the τ long lists to verify the candidates. Depend-
ing on the query and the data at hand we can find an appropriate value
τ that will result in a good tradeoff between the cost of running the
multiway merge algorithm on the short lists and the subsequent prob-
ing of the long lists. An alternative approach that can be beneficial
in certain scenarios is to avoid probing the long lists altogether, and
simply retrieve all candidate strings produced by the multiway merge
algorithm on the short lists and evaluate their similarity with the query
directly. This algorithm can result in faster processing whenever the
total number of candidates produced is suciently small.
The optimized multiway merge algorithm can be generalized for all
similarity functions with unit token weights. It cannot be generalized
for arbitrary weights.
Nevertheless, there is another simple optimization for skipping list
elements in the case of arbitrary token weights. Given k elements in the
heap and the fact that each list has a fixed weight, if we keep track of the
list from which each element has been popped off from, we can directly
Search WWH ::




Custom Search