Database Reference
In-Depth Information
7
Algorithms for Set Based Similarity Using
Filtering Techniques
So far we have discussed several algorithms based on building inverted
indexes on all the tokens contained in the data strings. A fundamen-
tally different indexing approach is based on the observation that for
a variety of similarity measures one can easily derive upper bounds on
the similarity by examining only a small number of tokens from each
string. We call algorithms for computing upper bounds on the simi-
larity filtering techniques . Given a query string, once a candidate set
of answers has been produced using a filtering technique, a refinement
step is performed to compute the exact similarity and return the correct
answers. This filter and refine approach, as it is commonly called, builds
an inverted index of substantially reduced size over a small subset of
tokens per string. The advantage of filtering techniques is that they can
evaluate join queries very eciently. The drawback is that the filtering
phase might produce a large number of false positives that will have to
be pruned by verification during the refinement step. Verification might
be expensive, depending on the similarity function used. Here, we focus
only on filtering techniques that do not result in any false negatives.
We do not cover synopses data structures and probabilistic algorithms,
like min-wise independent permutations and locality sensitive hashing
that can discard valid answers.
339
 
Search WWH ::




Custom Search