Database Reference
In-Depth Information
tance between vectors x and y . If we take any one position of the vectors, say the i th pos-
ition, we can define the function f i ( x ) to be the i th bit of vector x . Then f i ( x ) = f i ( y ) if and
only if vectors x and y agree in the i th position. Then the probability that f i ( x ) = f i ( y ) for a
randomly chosen i is exactly 1 − h ( x, y )/ d ; i.e., it is the fraction of positions in which x and
y agree.
This situation is almost exactly like the one we encountered for minhashing. Thus, the
family F consisting of the functions { f 1 , f 2 , . . . , f d } is a
( d 1 , d 2 , 1 − d 1 / d, 1 − d 2 / d )-sensitive
family of hash functions, for any d 1 < d 2 . There are only two differences between this fam-
ily and the family of minhash functions.
(1) While Jaccard distance runs from 0 to 1, the Hamming distance on a vector space of
dimension d runs from 0 to d . It is therefore necessary to scale the distances by divid-
ing by d , to turn them into probabilities.
(2) While there is essentially an unlimited supply of minhash functions, the size of the
family F for Hamming distance is only d .
The first point is of no consequence; it only requires that we divide by d at appropriate
times. The second point is more serious. If d is relatively small, then we are limited in the
number of functions that can be composed using the AND and OR constructions, thereby
limiting how steep we can make the S-curve be.
3.7.2
Random Hyperplanes and the Cosine Distance
Recall from Section 3.5.4 that the cosine distance between two vectors is the angle between
the vectors. For instance, we see in Fig. 3.12 two vectors x and y that make an angle θ
between them. Note that these vectors may be in a space of many dimensions, but they al-
ways define a plane, and the angle between them is measured in this plane. Figure 3.12 is a
“top-view” of the plane containing x and y .
Figure 3.12 Two vectors make an angle θ
Search WWH ::




Custom Search