Algorithms for Set Based Similarity Using Inverted Indexes - Approximate String Processing

Database Reference

In-Depth Information

real valued attributes cannot be compressed as eciently as lists con-

taining only natural numbers. Of course one could still use well-known

compression techniques (e.g., Golomb coding and run length encoding).

Another list compression approach is to completely discard certain

inverted lists based on the size of lists, their contribution to overall

string similarity given the weight of the tokens they correspond to, the

effect of discarded lists on list merging eciency given a query work-

load, etc. Another intuitive idea is to combine token lists that are very

similar to each other, using a single inverted list to store the union of

the initial lists. Finally, one can use variable length grams (as opposed

to fixed length q -grams) to better capture the distribution of short sub-

strings within the data strings, which can potentially yield significant

space and performance improvements, by eliminating very frequent,

short grams and replacing them with longer, less frequent grams.

6.9 Related Work

Baeza-Yates and Ribeiro-Neto [8] and Witten et al. [71] discuss selec-

tion queries for various similarity measures using multiway merge

and inverted indexes. The authors also discuss various compression

algorithms for inverted lists sorted according to string identifiers. Behm

et al. [11] discuss specialized lossy compression algorithms, while Li

et al. [49] and Yang et al. [76] present compression algorithms based on

alternative string tokenizations that take advantage of variable length

grams. Li et al. [48] introduced the optimized version of the multiway

merge algorithm for the special case of unit token weights. Linder-

man [50] proposed the optimized version and early termination condi-

tion of the multiway merge algorithm for general weights. Knuth [43]

gives a more general discussion of the multiway merge algorithm in

the context of external sorting. Linderman [50] also proposed the opti-

mized version of the multiway merge algorithm based on sorting both

by norm and string identifiers.

Koudas et al. [44] introduced an SQL based framework for identi-

fying similar strings based on Cosine similarity and L p -norms. Various

strategies for evaluating set intersection queries based on sorting sets

according to their L p -norms have been made by Bayardo et al. [10].

Search WWH ::

Custom Search

Home