Algorithms for Set Based Similarity Using Inverted Indexes - Approximate String Processing

Database Reference

In-Depth Information

to the heap, and repeat. An important observation here is that an ele-

ment that was popped from the heap might end up subsequently being

re-inserted to the heap, if it happens to be equal to the top element

of the heap after the θ

1 total removals. This is possible given that

a skip operation might result in skipping backwards on some of the

lists. The correctness of the algorithm follows from the fact that if an

element appears in at least θ lists, eventually it will have to appear

at the top of the heap θ times (after potentially multiple reinsertions

to the heap). The advantage of the algorithm is that it will result in

many list elements that appear in fewer than θ lists to be instantly

skipped. Skipping might be beneficial for main memory processing or

lists stored on solid state drives, but not so for disk resident inverted

lists due to the cost of the random seeks required in order to skip over

elements.

A generalization of this algorithm is to utilize the fact that some

token lists are short and others are long. We divide the lists into two

groups, one with the τ longest lists and one with the rest of the lists.

We use the aforementioned optimized multiway merge algorithm on

the short lists to identify candidate strings that appear at least θ − τ

times. Then, we probe the τ long lists to verify the candidates. Depend-

ing on the query and the data at hand we can find an appropriate value

τ that will result in a good tradeoff between the cost of running the

multiway merge algorithm on the short lists and the subsequent prob-

ing of the long lists. An alternative approach that can be beneficial

in certain scenarios is to avoid probing the long lists altogether, and

simply retrieve all candidate strings produced by the multiway merge

algorithm on the short lists and evaluate their similarity with the query

directly. This algorithm can result in faster processing whenever the

total number of candidates produced is suciently small.

The optimized multiway merge algorithm can be generalized for all

similarity functions with unit token weights. It cannot be generalized

for arbitrary weights.

Nevertheless, there is another simple optimization for skipping list

elements in the case of arbitrary token weights. Given k elements in the

heap and the fact that each list has a fixed weight, if we keep track of the

list from which each element has been popped off from, we can directly

−

Search WWH ::

Custom Search

Home