Algorithms for Set Based Similarity Using Filtering Techniques - Approximate String Processing

Database Reference

In-Depth Information

weights, it has larger size on average than the prefix signature, but it

results in tighter pruning. The filter tree has proved to be very effective

for answering selection queries, but its performance depends heavily on

the order in which filters are composed. It is dicult to compare fil-

tering techniques with algorithms based on full inverted indexes. The

relative performance depends on whether the indexes can fit in main

memory and also on the query and data characteristics. There exists

an inherent tradeoff between using the index to compute the exact sim-

ilarity between the query and data strings, over using the index as a

filtering step that requires a subsequent verification step.

7.6 Related Work

A first instantiation of the prefix principle was proposed by Sarawagi

and Kirpal [60]. The prefix filter was later formalized by Chaudhuri

et al. [19]. Xiao et al. [75] proposed improved pruning conditions for

Jaccard, Dice, and Cosine similarity for the prefix filter, assuming unit

token weights. The extensions to arbitrary weights discussed here are

straightforward. Chang and Lawler [17] used the idea of partitioning

the strings using the pigeonhole principle for computing both Hamming

distance and edit distance. The same partitioning strategy was used by

Wu and Manber [72] for the agrep utility. The prefix enumeration and

partenum signatures were proposed by Arasu et al. [6].

Bayardo et al. [10] proposed various join algorithms and optimiza-

tions related to building and indexing prefix signatures, and using

advanced list merging algorithms to evaluate join queries, similar to

the ones described in Section 6.3. Xiao et al. [75] extended the algo-

rithms by Bayardo et al. to access parts of the sux of strings as

well in order to improve filtering eciency, assuming that datasets fit

in main memory. Top- k join queries were extensively studied by Xiao

et al. [74], assuming that the dataset fits in main memory. In addition,

Xiao et al. [74] discussed various optimizations specifically tailored for

unit weights. The filter tree was proposed by Li et al. [48].

Search WWH ::

Custom Search

Home