Algorithms for Set Based Similarity Using Inverted Indexes - Approximate String Processing


Hadjieleftheriou et al. [33] introduced further optimizations for the
special case of Cosine similarity. Lp-norm based filtering has also been
used by Li et al. [48] and Xiao et al. [75].

A detailed analysis of threshold-based algorithms is conducted by
Fagin et al. [29]. Improved termination conditions for these algorithms
are discussed by Sarawagi and Kirpal [60], Bayardo et al. [10], and
Hadjieleftheriou et al. [33]. The heaviest first algorithm for weighted
intersection based on prefix and suffix lists builds on ideas introduced
by Sarawagi and Kirpal [60] and Chaudhuri et al. [19]. The same
algorithm, assuming unit token weights, was extended for arbitrary prefix
lengths by Li et al. [48]. The heaviest first algorithm for arbitrary
token weights was introduced by Hadjieleftheriou et al. [33].

The partitioning strategy for all-match join queries with memory
constraints was proposed by Sarawagi and Kirpal [60]. Incremental
indexing for self-join queries was first proposed by Sarawagi and
Kirpal [60]. The improved algorithm for Jaccard, Dice, and Cosine
similarity based on deleting elements from the top of token lists was
proposed by Bayardo et al. [10]. The block nested loop self-join
algorithm for the case of memory constraints was also proposed by Bayardo
et al. [10]. Various techniques for answering top-k queries using inverted
indexes and the multiway merge strategy were discussed by Vernica
and Li [69].

Efficient online updates for inverted indexes have been studied
extensively by Lester et al. [46]. Update propagation for inverted
indexes stored in a relational database was addressed by Koudas
et al. [45]. Index construction and update related issues with regard
to Lp-norm computation are discussed in detail by Hadjieleftheriou
et al. [34].
