In the rare case that the dataset is too large for the inverted index
to fit in main memory even with the index size reduction optimization
mentioned above (this could happen when a very large number
of strings have equal norms), a simple modification of the algorithm is
to index strings until no more main memory is available and keep a
pointer to the last string indexed, similarly to Algorithm 6.3.1. Then
the algorithm switches to a probe-only phase and scans the rest of the
dataset within the appropriate norm bounds, producing all matching
pairs. Finally, the algorithm discards the existing index and continues
indexing strings from where it left off. The drawback of this algorithm
is that it might have to scan parts of the dataset multiple times. One
can also modify the partitioning strategy described in Section 6.3 to do
the self-join (with some possible optimizations) in the case of weighted
intersection similarity.
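The index, probe-only, discard cycle described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the algorithm of the text: it uses Jaccard similarity over whitespace tokens instead of weighted intersection, string length stands in for the norm, and the hypothetical parameter max_indexed (a count of strings) stands in for "no more main memory is available". The names batched_self_join, probe and max_indexed are invented for this example.

```python
from collections import defaultdict


def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 if both are empty)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def batched_self_join(strings, theta, max_indexed):
    """All index pairs (i, j), i < j, with Jaccard >= theta (theta > 0).

    max_indexed is a hypothetical stand-in for the memory limit: at most
    that many strings are held in the inverted index at any time.
    """
    if max_indexed < 1:
        raise ValueError("max_indexed must be at least 1")
    sets = [frozenset(s.split()) for s in strings]
    # Process strings in increasing order of length (the "norm" here).
    order = sorted(range(len(sets)), key=lambda i: len(sets[i]))
    results = set()

    def probe(index, i):
        # Collect indexed strings sharing at least one token, then verify.
        candidates = {j for tok in sets[i] for j in index.get(tok, ())}
        for j in candidates:
            if jaccard(sets[i], sets[j]) >= theta:
                results.add((min(i, j), max(i, j)))

    start = 0  # pointer to the first string not yet indexed
    while start < len(order):
        end = min(start + max_indexed, len(order))
        index = defaultdict(list)
        # Phase 1: probe, then index, each string of the current batch.
        for i in order[start:end]:
            probe(index, i)
            for tok in sets[i]:
                index[tok].append(i)
        # Phase 2: probe only, scanning the rest within the length bound.
        longest = len(sets[order[end - 1]])
        for i in order[end:]:
            if theta * len(sets[i]) > longest:
                break  # later strings are even longer; none can match
            probe(index, i)
        start = end  # discard the index and resume where indexing stopped
    return sorted(results)
```

With a small max_indexed the second phase rereads the tail of the dataset once per batch, which is the repeated scanning cost the text mentions; the length bound only limits how far each of those scans has to go.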
6.5 Top-k Join and Self-join Queries
A top-k join query between two datasets S, R returns the k pairs (s, r)
with similarity larger than any other pair in the cross-product S × R. A
top-k self-join query for a dataset S returns the k pairs (s, s') (s.t. s ≠ s')
with similarity larger than any other pair in the cross-product S × S.
The difficulty with top-k queries is that the similarity of the k-th
answer is not known in advance. A simple strategy for evaluating top-k
join (self-join) queries is to quickly identify k candidates and use the
k-th similarity as a threshold θ in order to answer the query using an
all-match join (self-join) algorithm. As the algorithm proceeds and the
similarity of the current k-th candidate converges toward the similarity
of the k-th most similar string, the search becomes more effective. All
the algorithms discussed for all-match join and self-join queries can be
used, with slight modifications, to answer top-k queries.
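The threshold strategy can be sketched for the self-join case as follows. This is an illustrative sketch under assumptions that are not in the text: similarity is Jaccard over whitespace tokens, the all-match algorithm is a plain inverted-index self-join, and only pairs sharing at least one token are considered (so fewer than k pairs may be returned). The name topk_self_join is invented for this example. A min-heap holds the current k best pairs, and its smallest similarity plays the role of θ, which can only rise as better pairs are found.

```python
import heapq
from collections import defaultdict


def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 if both are empty)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def topk_self_join(strings, k):
    """Up to k tuples (similarity, i, j), i < j, most similar first."""
    if k < 1:
        return []
    sets = [frozenset(s.split()) for s in strings]
    # Shorter strings first, so every indexed string is no longer
    # than the string currently probing the index.
    order = sorted(range(len(sets)), key=lambda i: len(sets[i]))
    index = defaultdict(list)  # token -> ids of strings indexed so far
    heap = []                  # min-heap; heap[0] is the current k-th best
    for i in order:
        candidates = {j for tok in sets[i] for j in index.get(tok, ())}
        for j in candidates:
            # theta is 0 until k candidates exist, then the k-th similarity.
            theta = heap[0][0] if len(heap) == k else 0.0
            # Jaccard is at most |shorter| / |longer|: skip hopeless pairs.
            if len(sets[j]) < theta * len(sets[i]):
                continue
            sim = jaccard(sets[i], sets[j])
            entry = (sim, min(i, j), max(i, j))
            if len(heap) < k:
                heapq.heappush(heap, entry)
            elif sim > theta:
                heapq.heapreplace(heap, entry)  # raises the threshold
        for tok in sets[i]:
            index[tok].append(i)
    return sorted(heap, key=lambda e: (-e[0], e[1], e[2]))
```

For example, topk_self_join(["a b c", "a b c d", "a b c"], 1) returns [(1.0, 0, 2)], since the first and third strings are identical. Ties at the k-th similarity are broken arbitrarily in this sketch.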
6.6 Index Construction
Inverted index construction might require several linear scans of the
data in some scenarios. The first step for constructing the inverted
index is to sort the data either in increasing/decreasing order of string