Database Reference
In-Depth Information
7.1.4
All-Match Self-join Queries
As with join queries, the algorithms for self-join queries are similar to
the ones discussed in Section 6.4, where the full inverted index con-
structed during the incremental indexing phase is simply replaced by
a prefix signature inverted index. Similarly, the prefix reduction prin-
ciple discussed in Section 7.1.3 can be employed for self-join queries by
sorting the input dataset in L p -norm order and incrementally indexing
only the reduced prefixes of strings (see Algorithm 6.4.1).
7.1.5
Top-
k
Join and Self-join Queries
Once again top- k join and self-join queries can be answered by using
the all-match join and self-join algorithms with prefix inverted indexes
instead of full inverted indexes. The algorithms simply identify any k
candidates during initialization, and use the k -th smallest similarity as
the initial threshold θ . Notice that in the worst case the initial k -th
similarity can be 0, and the prefix inverted index will degenerate to a
full inverted index (each prefix is the complete string). In fact, there is
no better solution known when the input datasets do not fit in main
memory.
Nevertheless, there exists another effective strategy when there is
enough available memory to fit one dataset and its prefix inverted
index in main memory. The strategy uses an optimistic approach that
assumes that the k -th smallest similarity is the maximum possible sim-
ilarity (e.g., one for Normalized Weighted Intersection, Jaccard, Dice,
and Cosine similarity), and starts indexing prefixes incrementally, one
token at a time. This strategy is effective only in main memory, since
the algorithm has to repeatedly access each string when it is time to
index its next prefix token. The top- k self-join algorithm for Jaccard
similarity is shown as Algorithm 7.1.1. The same algorithm, with slight
modifications, can be used to answer top- k join queries between two sets
S,R as well (e.g., a simple way to do this is to run a slightly modified
self-join algorithm on the union S
R ). It can work for Weighted Inter-
section, Normalized Weighted Intersection, Dice and Cosine similarity.
In the beginning the algorithm indexes only the first token of each
string. When a new token s π from a given string s is indexed, the
Search WWH ::




Custom Search