Database Reference
In-Depth Information
real valued attributes cannot be compressed as eciently as lists con-
taining only natural numbers. Of course one could still use well-known
compression techniques (e.g., Golomb coding and run length encoding).
Another list compression approach is to completely discard certain
inverted lists based on the size of lists, their contribution to overall
string similarity given the weight of the tokens they correspond to, the
effect of discarded lists on list merging eciency given a query work-
load, etc. Another intuitive idea is to combine token lists that are very
similar to each other, using a single inverted list to store the union of
the initial lists. Finally, one can use variable length grams (as opposed
to fixed length q -grams) to better capture the distribution of short sub-
strings within the data strings, which can potentially yield significant
space and performance improvements, by eliminating very frequent,
short grams and replacing them with longer, less frequent grams.
6.9 Related Work
Baeza-Yates and Ribeiro-Neto [8] and Witten et al. [71] discuss selec-
tion queries for various similarity measures using multiway merge
and inverted indexes. The authors also discuss various compression
algorithms for inverted lists sorted according to string identifiers. Behm
et al. [11] discuss specialized lossy compression algorithms, while Li
et al. [49] and Yang et al. [76] present compression algorithms based on
alternative string tokenizations that take advantage of variable length
grams. Li et al. [48] introduced the optimized version of the multiway
merge algorithm for the special case of unit token weights. Linder-
man [50] proposed the optimized version and early termination condi-
tion of the multiway merge algorithm for general weights. Knuth [43]
gives a more general discussion of the multiway merge algorithm in
the context of external sorting. Linderman [50] also proposed the opti-
mized version of the multiway merge algorithm based on sorting both
by norm and string identifiers.
Koudas et al. [44] introduced an SQL based framework for identi-
fying similar strings based on Cosine similarity and L p -norms. Various
strategies for evaluating set intersection queries based on sorting sets
according to their L p -norms have been made by Bayardo et al. [10].
Search WWH ::




Custom Search