Database Reference
In-Depth Information
is straightforward, since the inverted index for minimum threshold θ
indexes more information than that needed for answering queries with
threshold θ . The index cannot be used to answer queries with
threshold θ .
A more intuitive way of understanding the prefix inverted index is
to see it as a special case of a full inverted index, where each token list
is only partially populated by only those strings that contain that token
specifically in their prefix. Now, we can use the heaviest first algorithm
presented in Section 6 to answer queries, assuming that the algorithm
traverses all lists up to L ( λ π ) only. Recall that heaviest first processes
token lists in decreasing order of token weights (implicitly using the
concept of prefix filtering). Straightforwardly, even if the algorithm
traversed all query lists, the maximum possible remaining potential of
unseen candidates after list L ( λ π ) is processed (refer to Lemmata 6.4,
6.7, 6.10, 6.14) would be smaller than the query threshold, instruct-
ing the algorithm to stop adding new candidates in the candidate set.
Notice that for the prefix based inverted index, it is not beneficial to
actually scan lists beyond L ( λ π ) since lists are only partially popu-
lated, hence not useful for computing tighter similarity upper bounds,
for already discovered candidates (the entries of the candidates might
not be present in those partially populated lists).
The fundamental difference between a prefix inverted index and a
full inverted index using the heaviest first strategy is that the latter
prunes strings incrementally using exactly the same concept as the
prefix filter, but for always increasing prefix lengths (every new list
processed essentially increases the length of the prefixes examined, by
one token), and at the same time computes the actual similarity of
candidate strings incrementally, ultimately eliminating the need for a
refinement step. In addition, the full index can answer queries with
arbitrary thresholds θ . These advantages come at the cost of increased
index size and potentially slower query evaluation since the fully popu-
lated lists might contain strings that share a token with the query but
not within the prefix, which would have otherwise not been considered.
On the other hand, the pruning power of the full inverted index is at
least as good or better than that of the prefix inverted index, since by
virtue of fully populated lists, the algorithm computes much tighter
Search WWH ::




Custom Search