Algorithms for Set Based Similarity Using Filtering Techniques - Approximate String Processing

Database Reference

In-Depth Information

7

Algorithms for Set Based Similarity Using

Filtering Techniques

So far we have discussed several algorithms based on building inverted

indexes on all the tokens contained in the data strings. A fundamen-

tally different indexing approach is based on the observation that for

a variety of similarity measures one can easily derive upper bounds on

the similarity by examining only a small number of tokens from each

string. We call algorithms for computing upper bounds on the simi-

larity filtering techniques . Given a query string, once a candidate set

of answers has been produced using a filtering technique, a refinement

step is performed to compute the exact similarity and return the correct

answers. This filter and refine approach, as it is commonly called, builds

an inverted index of substantially reduced size over a small subset of

tokens per string. The advantage of filtering techniques is that they can

evaluate join queries very eciently. The drawback is that the filtering

phase might produce a large number of false positives that will have to

be pruned by verification during the refinement step. Verification might

be expensive, depending on the similarity function used. Here, we focus

only on filtering techniques that do not result in any false negatives.

We do not cover synopses data structures and probabilistic algorithms,

like min-wise independent permutations and locality sensitive hashing

that can discard valid answers.

339

Search WWH ::

Custom Search

Home