Algorithms for Set Based Similarity Using Inverted Indexes - Approximate String Processing


In the rare case that the dataset is too large for the inverted index

to fit in main memory even with the index size reduction optimization

mentioned above (this could happen when a very large number

of strings have equal norms), a simple modification of the algorithm is

to index strings until no more main memory is available and keep a

pointer to the last string indexed, similarly to Algorithm 6.3.1. Then

the algorithm switches to a probe-only phase and scans the rest of the

dataset within the appropriate norm bounds, producing all matching

pairs. Finally, the algorithm discards the existing index and continues

indexing strings from where it left off. The drawback of this algorithm

is that it might have to scan parts of the dataset multiple times. One

can also modify the partitioning strategy described in Section 6.3 to do

the self-join (with some possible optimizations) in the case of weighted

intersection similarity.
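
The index-until-full, probe-only, discard-and-resume cycle described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the book's algorithm: it uses Jaccard similarity on token sets with the set size as the norm, and the names `blockwise_self_join` and `max_postings` (a stand-in for "no more main memory is available") are hypothetical.

```python
from collections import defaultdict


def jaccard(a, b):
    """Jaccard similarity of two non-empty token sets."""
    return len(a & b) / len(a | b)


def blockwise_self_join(sets, theta, max_postings):
    """Return all pairs (i, j), i < j, with jaccard(sets[i], sets[j]) >= theta.

    max_postings is a hypothetical memory budget: the index counts as full
    once it would hold more than that many postings.
    """
    assert 0 < theta <= 1
    # Process strings in increasing order of norm (here: set size).
    order = sorted(range(len(sets)), key=lambda i: len(sets[i]))
    results = set()

    def probe(index, sid):
        # Collect every indexed string sharing a token, then verify it.
        s = sets[sid]
        candidates = {cid for tok in s for cid in index.get(tok, ())}
        for cid in candidates:
            if jaccard(sets[cid], s) >= theta:
                results.add((min(cid, sid), max(cid, sid)))

    start = 0
    while start < len(order):
        index, postings, pos = defaultdict(list), 0, start
        # Phase 1: probe, then index, until the budget is exhausted.
        # The first string of a block is always indexed so progress is made.
        while pos < len(order):
            sid = order[pos]
            if pos > start and postings + len(sets[sid]) > max_postings:
                break
            probe(index, sid)
            for tok in sets[sid]:
                index[tok].append(sid)
            postings += len(sets[sid])
            pos += 1
        # pos is the pointer to the first string that was not indexed.
        max_norm = len(sets[order[pos - 1]])
        # Phase 2: probe only, scanning the rest within the norm bound.
        # A string with norm above max_norm / theta cannot match any
        # indexed string (small epsilon guards against float rounding).
        for q in range(pos, len(order)):
            sid = order[q]
            if len(sets[sid]) * theta > max_norm + 1e-9:
                break
            probe(index, sid)
        # Phase 3: discard the index and resume from the pointer.
        start = pos
    return results
```

With a small `max_postings` the tail of the dataset is rescanned once per block, which is the multiple-scan drawback noted above; with a budget large enough for the whole dataset the loop degenerates to a single index-and-probe pass.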

6.5 Top-k Join and Self-join Queries

A top-k join query between two datasets S, R returns the k pairs (s, r)

with similarity larger than any other pair in the cross-product S × R. A

top-k self-join query for a dataset S returns the k pairs (s, s′) (s.t. s ≠ s′)

with similarity larger than any other pair in the cross-product S × S.

The difficulty with top-k queries is that the similarity of the k-th

answer is not known in advance. A simple strategy for evaluating top-k

join (self-join) queries is to quickly identify k candidates and use the

k-th similarity as a threshold θ in order to answer the query using an

all-match join (self-join) algorithm. As the algorithm proceeds and the

similarity of the current k-th candidate converges toward the similarity

of the k-th most similar string, the search becomes more effective. All

the algorithms discussed for all-match join and self-join queries can be

used, with slight modifications, to answer top-k queries.
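
The rising-threshold strategy above can be illustrated with a short sketch. It is one possible reading, not the book's algorithm: it assumes Jaccard similarity on token sets, keeps the current k best pairs in a min-heap, and uses the smallest similarity in the heap as θ for a size-based filter. The function name `topk_self_join` is hypothetical, and pairs that share no token are never generated, so fewer than k pairs may be returned.

```python
import heapq
from collections import defaultdict


def jaccard(a, b):
    """Jaccard similarity of two non-empty token sets."""
    return len(a & b) / len(a | b)


def topk_self_join(sets, k):
    """Return up to k tuples (similarity, i, j), i < j, best first."""
    # Process strings in increasing order of norm (here: set size).
    order = sorted(range(len(sets)), key=lambda i: len(sets[i]))
    index = defaultdict(list)  # token -> ids of strings indexed so far
    heap = []                  # min-heap holding the current k best pairs
    theta = 0.0                # similarity of the current k-th candidate

    for sid in order:
        s = sets[sid]
        candidates = {cid for tok in s for cid in index.get(tok, ())}
        for cid in candidates:
            c = sets[cid]
            # Size filter: since |c| <= |s|, jaccard(c, s) <= |c| / |s|,
            # so a pair below the current threshold cannot enter the top k.
            if len(heap) == k and len(c) < theta * len(s):
                continue
            entry = (jaccard(c, s), min(cid, sid), max(cid, sid))
            if len(heap) < k:
                heapq.heappush(heap, entry)
            elif entry[0] > heap[0][0]:
                heapq.heapreplace(heap, entry)
            if len(heap) == k:
                theta = heap[0][0]  # threshold only ever increases
        for tok in s:
            index[tok].append(sid)
    return sorted(heap, reverse=True)
```

Early on θ is 0 and nothing is pruned; once k candidates have been collected, every improvement to the k-th similarity tightens the filter, which is why the search becomes more effective as it proceeds.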

6.6 Index Construction

Inverted index construction might require several linear scans of the

data in some scenarios. The first step for constructing the inverted

index is to sort the data either in increasing/decreasing order of string
