Database Reference
In-Depth Information
double estimate = 0.0;
for(int i=0;i<M.length/3;i+=3)
estimate += range/(double)M[i+2];
return (int)(2.0*estimate);
}
Computing Set Similarity
Min-Count sketches can also be used to compute the approximate similarity
between two sets using a technique called Min-wise Hashing . The basic
idea is to find an approximation to the Jaccard Similarity metric used in
the “Real-Time Identification of Fraudulent Websites” sidebar earlier in the
chapter.
The key observation is that the probability that the minimum of the hash of
one set is equal to the minimum of the hash of another set is the ratio of the
size of the intersection of the set over the size of the union of the set. This is
exactly the Jaccard Similarity metric from the “Real-Time Identification of
Fraudulent Websites” sidebar.
Like most sketches, the approximation using only a single hash function
is a very rough approximation. The original variant of the algorithm used
multiple hash functions to overcome this limitation, much like the Bloom
Filter. Of course, this is a very expensive operation, requiring at least 400
hash functions to achieve an error of 5 percent. Later variations use a single
hash function and then subsample the hashes into various streams, just
like the Min-Count sketch. In either case, the Jaccard Similarity is then the
proportion of matching registers between two sets that have been hashed
using the same mechanism:
public double distance(MinCount other) {
double x = 0.0;
for(int i=0;i<M.length;i++) {
if(other.M[i] == M[i]) x += 1.0;
}
return x/(double)M.length;
}
Search WWH ::




Custom Search