Finding Similar Items - Mining of Massive Datasets

Database Reference

In-Depth Information

tance between vectors x and y . If we take any one position of the vectors, say the i th pos-

ition, we can define the function f i ( x ) to be the i th bit of vector x . Then f i ( x ) = f i ( y ) if and

only if vectors x and y agree in the i th position. Then the probability that f i ( x ) = f i ( y ) for a

randomly chosen i is exactly 1 − h ( x, y )/ d ; i.e., it is the fraction of positions in which x and

y agree.

This situation is almost exactly like the one we encountered for minhashing. Thus, the

family F consisting of the functions { f 1 , f 2 , . . . , f d } is a

( d 1 , d 2 , 1 − d 1 / d, 1 − d 2 / d )-sensitive

family of hash functions, for any d 1 < d 2 . There are only two differences between this fam-

ily and the family of minhash functions.

(1) While Jaccard distance runs from 0 to 1, the Hamming distance on a vector space of

dimension d runs from 0 to d . It is therefore necessary to scale the distances by divid-

ing by d , to turn them into probabilities.

(2) While there is essentially an unlimited supply of minhash functions, the size of the

family F for Hamming distance is only d .

The first point is of no consequence; it only requires that we divide by d at appropriate

times. The second point is more serious. If d is relatively small, then we are limited in the

number of functions that can be composed using the AND and OR constructions, thereby

limiting how steep we can make the S-curve be.

3.7.2

Random Hyperplanes and the Cosine Distance

Recall from Section 3.5.4 that the cosine distance between two vectors is the angle between

the vectors. For instance, we see in Fig. 3.12 two vectors x and y that make an angle θ

between them. Note that these vectors may be in a space of many dimensions, but they al-

ways define a plane, and the angle between them is measured in this plane. Figure 3.12 is a

“top-view” of the plane containing x and y .

Figure 3.12 Two vectors make an angle θ

Search WWH ::

Custom Search

Home