(b) How do these rates compare with what we get if we organize the same
2048 functions into a 2-way AND of members of F1, as was discussed at
the end of Section 3.8.5?
Exercise 3.8.3: Suppose fingerprints have the same statistics outlined in Ex-
ercise 3.8.2, but we use a base family of functions F′ defined like F, but using
only two randomly chosen grid squares. Construct another set of functions F1′
from F′ by taking the n-way OR of functions from F′. What, as a function of
n, are the false positive and false negative rates for F1′?
Exercise 3.8.4: Suppose we use the functions F1 from Example 3.22, but we
want to solve the many-many problem.
(a) If two fingerprints are from the same finger, what is the probability that
they will not be compared (i.e., what is the false negative rate)?
(b) What fraction of the fingerprints from different fingers will be compared
(i.e., what is the false positive rate)?
! Exercise 3.8.5: Assume we have the set of functions F as in Exercise 3.8.2,
and we construct a new set of functions F3 by an n-way OR of functions in
F. For what value of n is the sum of the false positive and false negative rates
minimized?
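The arithmetic these exercises share: an n-way OR declares a pair a candidate
if at least one of the n base functions does. So if a base function fires with
probability p on a pair from the same finger and probability q on a pair from
different fingers, the false negative rate of the OR is (1 - p)^n and the false
positive rate is 1 - (1 - q)^n. A minimal Python sketch of this calculation
follows; the values of p and q are placeholders, not the rates derived in
Exercise 3.8.2.

    def or_construction_rates(p, q, n):
        # p: chance a base function makes a same-finger pair a candidate.
        # q: chance it makes a different-finger pair a candidate.
        # The n-way OR misses a true pair only if all n functions miss it.
        false_negative = (1 - p) ** n
        false_positive = 1 - (1 - q) ** n
        return false_positive, false_negative

    # Scan n for the minimum of the summed rates, as Exercise 3.8.5 asks.
    # p and q below are illustrative placeholders only.
    p, q = 0.01, 0.0001
    best_n = min(range(1, 2001),
                 key=lambda n: sum(or_construction_rates(p, q, n)))
    print(best_n, or_construction_rates(p, q, best_n))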
3.9 Methods for High Degrees of Similarity
LSH-based methods appear most effective when the degree of similarity we
accept is relatively low. When we want to find sets that are almost identical,
there are other methods that can be faster. Moreover, these methods are exact,
in that they find every pair of items with the desired degree of similarity. There
are no false negatives, as there can be with LSH.
3.9.1 Finding Identical Items
The extreme case is finding identical items, for example, Web pages that are
identical, character-for-character. It is straightforward to compare two docu-
ments and tell whether they are identical, but we still must avoid having to
compare every pair of documents. Our first thought would be to hash docu-
ments based on their first few characters, and compare only those documents
that fell into the same bucket. That scheme should work well, unless all the
documents begin with the same characters, such as an HTML header.
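A minimal sketch of this first idea (the function name and the document
strings are made up for illustration): bucket documents by a hash of a short
prefix, then compare only documents that share a bucket.

    from collections import defaultdict

    def bucket_by_prefix(docs, prefix_len=20):
        # Hash only the first prefix_len characters of each document.
        buckets = defaultdict(list)
        for doc in docs:
            buckets[hash(doc[:prefix_len])].append(doc)
        return buckets

    # Only documents in the same bucket need a full pairwise comparison.
    # The failure mode noted above: documents that all begin with the same
    # HTML header land in one giant bucket, and the scheme degenerates.
    docs = ["<html><head>... page A", "<html><head>... page B", "a plain page"]
    for bucket in bucket_by_prefix(docs).values():
        print(len(bucket), "document(s) share this bucket")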
Our second thought would be to use a hash function that examines the
entire document. That would work, and if we use enough buckets, it would be
very rare that two documents went into the same bucket, yet were not identical.
The downside of this approach is that we must examine every character of every
document. If we limit our examination to a small number of characters, then
we never have to examine a document that is unique and falls into a bucket
of its own.
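A corresponding sketch of the second idea, using a cryptographic digest of
the whole document (the helper name is ours, not the text's):

    import hashlib
    from collections import defaultdict

    def bucket_by_full_hash(docs):
        # Reads every character of every document; collisions between
        # non-identical documents are then vanishingly rare.
        buckets = defaultdict(list)
        for doc in docs:
            digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
            buckets[digest].append(doc)
        return buckets

    # Documents sharing a bucket are almost surely identical; a final
    # character-by-character check within each bucket makes the result exact.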