Database Reference
In-Depth Information
Let P θ ( v )= 1 ,...,λ π } be the prefix of v and S θ ( v )= π +1 ,...,λ v m }
be the sux of v . Then, for any string s s.t. s
∩P
θ ( v )=
,
I
( s, v ) .
Proof. The proof appears in Section 7.1.
Hence, the only viable candidates have to appear in at least one of
the lists in the prefix L ( λ 1 ) ,...,L ( λ π ). The heaviest first algorithm
exhaustively scans the prefix lists to find all viable candidates and then
probes the sux lists to complete their scores using random access.
6.2 Top-
k
Selection Queries
A top-k selection query has to return the k data strings most similar to
the query (assuming no ties; if ties exist then algorithms either return
k strings by resolving ties arbitrarily, or return all ties).
6.2.1
Lists Sorted in String Identifier Order
It is easy to see that the multiway merge algorithm cannot be used to
answer top-k queries eciently. This is mainly because the algorithm
identifies answers in order of their string identifier rather than in order
of their similarity with the query. One could use the multiway merge
algorithm in combination with a heap data structure to keep track of
the current top-k answers, but in order to guarantee that the true top-
k answers are returned, all token lists corresponding to query tokens
have to be exhaustively examined. In other words, there is no guarantee
that the answer with the largest similarity is not also the one with the
largest string identifier.
Nevertheless, one can still use the optimized multiway merge
algorithm described in Section 6.1.1 in order to potentially skip a large
number of entries from every list, during the list merging step.
It is easy to see that the algorithms designed for all-match selection
queries can be adapted for top-k queries, based on the observation that
top-k queries are essentially all-match selection queries with an adap-
tively increasing similarity threshold θ . An ecient top-k algorithm has
to identify a good approximation of the similarity of the k -th most sim-
ilar string as fast as possible, in order to converge to the correct top-k
Search WWH ::




Custom Search