(b) How do these rates compare with what we get if we organize the same
2048 functions into a 2-way AND of members of F1, as was discussed at
the end of Section 3.8.5?
Exercise 3.8.3: Suppose fingerprints have the same statistics outlined in Ex-
ercise 3.8.2, but we use a base family of functions F′ defined like F, but using
only two randomly chosen grid squares. Construct another set of functions F1′
from F′ by taking the n-way OR of functions from F′. What, as a function of
n, are the false positive and false negative rates for F1′?
Exercise 3.8.4: Suppose we use the functions F1 from Example 3.22, but we
want to solve the many-many problem.
(a) If two fingerprints are from the same finger, what is the probability that
they will not be compared (i.e., what is the false negative rate)?
(b) What fraction of the fingerprints from different fingers will be compared
(i.e., what is the false positive rate)?
! Exercise 3.8.5: Assume we have the set of functions F as in Exercise 3.8.2,
and we construct a new set of functions F3 by an n-way OR of functions in
F. For what value of n is the sum of the false positive and false negative rates
minimized?
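The arithmetic these exercises share: an n-way OR declares a pair a candidate
if at least one of the n base functions does. So if a base function fires with
probability p on a pair from the same finger and probability q on a pair from
different fingers, the false negative rate of the OR is (1 - p)^n and the false
positive rate is 1 - (1 - q)^n. A minimal Python sketch of this calculation
follows; the values of p and q are placeholders, not the rates derived in
Exercise 3.8.2.

    def or_construction_rates(p, q, n):
        # p: chance a base function makes a same-finger pair a candidate.
        # q: chance it makes a different-finger pair a candidate.
        # The n-way OR misses a true pair only if all n functions miss it.
        false_negative = (1 - p) ** n
        false_positive = 1 - (1 - q) ** n
        return false_positive, false_negative

    # Scan n for the minimum of the summed rates, as Exercise 3.8.5 asks.
    # p and q below are illustrative placeholders only.
    p, q = 0.01, 0.0001
    best_n = min(range(1, 2001),
                 key=lambda n: sum(or_construction_rates(p, q, n)))
    print(best_n, or_construction_rates(p, q, best_n))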
3.9 Methods for High Degrees of Similarity
LSH-based methods appear most effective when the degree of similarity we
accept is relatively low. When we want to find sets that are almost identical,
there are other methods that can be faster. Moreover, these methods are exact,
in that they find every pair of items with the desired degree of similarity. There
are no false negatives, as there can be with LSH.
3.9.1 Finding Identical Items
The extreme case is finding identical items, for example, Web pages that are
identical, character-for-character. It is straightforward to compare two docu-
ments and tell whether they are identical, but we still must avoid having to
compare every pair of documents. Our first thought would be to hash docu-
ments based on their first few characters, and compare only those documents
that fell into the same bucket. That scheme should work well, unless all the
documents begin with the same characters, such as an HTML header.
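A minimal sketch of this first idea (the function name and the document
strings are made up for illustration): bucket documents by a hash of a short
prefix, then compare only documents that share a bucket.

    from collections import defaultdict

    def bucket_by_prefix(docs, prefix_len=20):
        # Hash only the first prefix_len characters of each document.
        buckets = defaultdict(list)
        for doc in docs:
            buckets[hash(doc[:prefix_len])].append(doc)
        return buckets

    # Only documents in the same bucket need a full pairwise comparison.
    # The failure mode noted above: documents that all begin with the same
    # HTML header land in one giant bucket, and the scheme degenerates.
    docs = ["<html><head>... page A", "<html><head>... page B", "a plain page"]
    for bucket in bucket_by_prefix(docs).values():
        print(len(bucket), "document(s) share this bucket")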
Our second thought would be to use a hash function that examines the
entire document. That would work, and if we use enough buckets, it would be
very rare that two documents went into the same bucket, yet were not identical.
The downside of this approach is that we must examine every character of every
document. If we limit our examination to a small number of characters, then
we never have to examine a document that is unique and falls into a bucket
of its own.
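A corresponding sketch of the second idea, using a cryptographic digest of
the whole document (the helper name is ours, not the text's):

    import hashlib
    from collections import defaultdict

    def bucket_by_full_hash(docs):
        # Reads every character of every document; collisions between
        # non-identical documents are then vanishingly rare.
        buckets = defaultdict(list)
        for doc in docs:
            digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
            buckets[digest].append(doc)
        return buckets

    # Documents sharing a bucket are almost surely identical; a final
    # character-by-character check within each bucket makes the result exact.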