Database Reference
In-Depth Information
As d grows, the denominator grows linearly in d , but the numerator is a sum of random
values, which are as likely to be positive as negative. Thus, the expected value of the nu-
merator is 0, and as d grows, its standard deviation grows only as Thus, for large d , the
cosine of the angle between any two vectors is almost certain to be close to 0, which means
the angle is close to 90 degrees.
An important consequence of random vectors being orthogonal is that if we have three
random points A , B , and C , and we know the distance from A to B is d 1 , while the distance
from B to C is d 2 , we can assume the distance from A to C is approximately That rule
does not hold, even approximately, if the number of dimensions is small. As an extreme
case, if d = 1, then the distance from A to C would necessarily be d 1 + d 2 if A and C were
on opposite sides of B , or | d 1 d 2 | if they were on the same side.
7.1.4
Exercises for Section 7.1
! EXERCISE 7.1.1 Prove that if you choose two points uniformly and independently on a
line of length 1, then the expected distance between the points is 1/3.
!! EXERCISE 7.1.2 If you choose two points uniformly in the unit square, what is their ex-
pected Euclidean distance?
! EXERCISE 7.1.3 Suppose we have a d -dimensional Euclidean space. Consider vectors
whose components are only +1 or −1 in each dimension. Note that each vector has length
so the product of their lengths (denominator in the formula for the cosine of the angle
between them) is d . If we chose each component independently, and a component is as
likely to be +1 as −1, what is the distribution of the value of the numerator of the formula
(i.e., the sum of the products of the corresponding components from each vector)? What
can you say about the expected value of the cosine of the angle between the vectors, as d
grows large?
7.2 Hierarchical Clustering
We begin by considering hierarchical clustering in a Euclidean space. This algorithm can
only be used for relatively small datasets, but even so, there are some efficiencies we can
make by careful implementation. When the space is non-Euclidean, there are additional
Search WWH ::




Custom Search