Database Reference
In-Depth Information
similarity upper bounds every time a new list is processed. The prefix
inverted index might have a performance advantage over the fully pop-
ulated inverted index if the former fits in main memory while the latter
does not, but this is data and query dependent and hard to quantify
since the refinement step of the prefix filter would still have to access
strings from secondary storage using random accesses.
Based on these observations, one could advocate for a hybrid
approach that creates both a full and a prefix inverted index. Then,
given a query, we scan the prefix index to find candidates using the
tokens in the prefix of the query and also scan the full inverted index
using the sux of the query to compute tighter upper bounds. Notice
that this approach cannot compute the actual similarity scores with-
out also accessing the full lists of the tokens in the query prefix; some
candidates identified in the first phase using the prefix lists might have
a token in common with the query prefix that nevertheless appears in
the sux of the data string, and hence is not contained in the prefix
inverted index.
7.1.2
Top-
k
Selection Queries
Prefix signatures cannot be used for answering top- k selection queries
due to the fact that a minimum pre-determined threshold θ needs to
be decided in advance in order to build the prefix inverted index. Since
the k -th similarity between any data string and a given query in the
worst case could be 0, for θ = 0 the signature index will degenerate to
a full inverted index.
7.1.3
All-Match Join Queries
Filtering techniques excel at processing join queries. Assume that we
are performing a join between two string datasets and neither dataset
is already indexed. If the full inverted index of either dataset cannot
fit in main memory, answering the query requires partitioning the data
appropriately. Instead, we can build an inverted index of significantly
reduced size by using prefix signatures. More importantly, since we
build the prefix inverted indexes on the fly, we can build signatures
using θ exactly equal to the query threshold, which has the immediate
Search WWH ::




Custom Search