Finding Similar Items - Mining of Massive Datasets

Database Reference

In-Depth Information

Suppose we pick a hyperplane through the origin. This hyperplane intersects the plane of

x and y in a line. Figure 3.12 suggests two possible hyperplanes, one whose intersection is

the dashed line and the other's intersection is the dotted line. To pick a random hyperplane,

we actually pick the normal vector to the hyperplane, say v . The hyperplane is then the set

of points whose dot product with v is 0.

First, consider a vector v that is normal to the hyperplane whose projection is represented

by the dashed line in Fig. 3.12 ; that is, x and y are on different sides of the hyperplane. Then

the dot products v.x and v.y will have different signs. If we assume, for instance, that v is

a vector whose projection onto the plane of x and y is above the dashed line in Fig. 3.12 ,

then v.x is positive, while v.y is negative. The normal vector v instead might extend in the

opposite direction, below the dashed line. In that case v.x is negative and v.y is positive, but

the signs are still different.

On the other hand, the randomly chosen vector v could be normal to a hyperplane like

the dotted line in Fig. 3.12 . In that case, both v.x and v.y have the same sign. If the projec-

tion of v extends to the right, then both dot products are positive, while if v extends to the

left, then both are negative.

What is the probability that the randomly chosen vector is normal to a hyperplane that

looks like the dashed line rather than the dotted line? All angles for the line that is the in-

tersection of the random hyperplane and the plane of x and y are equally likely. Thus, the

hyperplane will look like the dashed line with probability θ /180 and will look like the dot-

ted line otherwise.

Thus, each hash function f in our locality-sensitive family F is built from a randomly

chosen vector v f . Given two vectors x and y , say f ( x ) = f ( y ) if and only if the dot products

v f .x and v f .y have the same sign. Then F is a locality-sensitive family for the cosine dis-

tance. The parameters are essentially the same as for the Jaccard-distance family described

in Section 3.6.2 , except the scale of distances is 0-180 rather than 0-1. That is, F is a

( d 1 , d 2 , (180 − d 1 )/180 , (180 − d 2 )/180)-sensitive

family of hash functions. From this basis, we can amplify the family as we wish, just as for

the minhash-based family.

3.7.3

Sketches

Instead of chosing a random vector from all possible vectors, it turns out to be sufficiently

random if we restrict our choice to vectors whose components are +1 and −1. The dot

product of any vector x with a vector v of +1's and −1's is formed by adding the compon-

Search WWH ::

Custom Search

Home