Algorithms for Set Based Similarity Using Filtering Techniques - Approximate String Processing

Database Reference

In-Depth Information

is straightforward, since the inverted index for minimum threshold θ

indexes more information than that needed for answering queries with

threshold θ >θ . The index cannot be used to answer queries with

threshold θ <θ .

A more intuitive way of understanding the prefix inverted index is

to see it as a special case of a full inverted index, where each token list

is only partially populated by only those strings that contain that token

specifically in their prefix. Now, we can use the heaviest first algorithm

presented in Section 6 to answer queries, assuming that the algorithm

traverses all lists up to L ( λ π ) only. Recall that heaviest first processes

token lists in decreasing order of token weights (implicitly using the

concept of prefix filtering). Straightforwardly, even if the algorithm

traversed all query lists, the maximum possible remaining potential of

unseen candidates after list L ( λ π ) is processed (refer to Lemmata 6.4,

6.7, 6.10, 6.14) would be smaller than the query threshold, instruct-

ing the algorithm to stop adding new candidates in the candidate set.

Notice that for the prefix based inverted index, it is not beneficial to

actually scan lists beyond L ( λ π ) since lists are only partially popu-

lated, hence not useful for computing tighter similarity upper bounds,

for already discovered candidates (the entries of the candidates might

not be present in those partially populated lists).

The fundamental difference between a prefix inverted index and a

full inverted index using the heaviest first strategy is that the latter

prunes strings incrementally using exactly the same concept as the

prefix filter, but for always increasing prefix lengths (every new list

processed essentially increases the length of the prefixes examined, by

one token), and at the same time computes the actual similarity of

candidate strings incrementally, ultimately eliminating the need for a

refinement step. In addition, the full index can answer queries with

arbitrary thresholds θ . These advantages come at the cost of increased

index size and potentially slower query evaluation since the fully popu-

lated lists might contain strings that share a token with the query but

not within the prefix, which would have otherwise not been considered.

On the other hand, the pruning power of the full inverted index is at

least as good or better than that of the prefix inverted index, since by

virtue of fully populated lists, the algorithm computes much tighter

Search WWH ::

Custom Search

Home