Database Reference
In-Depth Information
6
Algorithms for Set Based Similarity Using
Inverted Indexes
In this section we will cover evaluation of set based similarity functions.
We will cover algorithms for selection and join queries. The brute-force
approach for evaluating selection queries has linear cost (computes all
pairwise similarities between the query string and the data strings),
while the brute-force approach for join queries has quadratic cost (com-
putes all pairwise similarities in the cross product of the two datasets).
Straightforwardly, big savings in computation cost can be achieved by
using an index structure (resulting in a typical computation cost versus
storage cost tradeoff).
6.1 All-Match Selection Queries
An all-match selection query returns all the strings in the dataset with
similarity to the query string that is larger than or equal to a user
specified threshold θ .
All of the set based similarity functions introduced in Section 2.2
can be computed easily using an inverted index. First, the data strings
are tokenized and an inverted index is build on the resulting tokens,
one list per token (an example using word tokens is shown in Tables 5.1
299
 
Search WWH ::




Custom Search