Databases Reference
In-Depth Information
7.1.3 The Curse of Dimensionality
High-dimensional Euclidean spaces have a number of unintuitive properties that
are sometimes referred to as the “curse of dimensionality.” Non-Euclidean
spaces usually share these anomalies as well. One manifestation of the “curse”
is that in high dimensions, almost all pairs of points are equally far away from
one another. Another manifestation is that almost any two vectors are almost
orthogonal. We shall explore each of these in turn.
The Distribution of Distances in a High-Dimensional Space
Let us consider a d-dimensional Euclidean space. Suppose we choose n random
points in the unit cube, i.e., points [x 1 , x 2 , . . . , x d ], where each x i is in the
range 0 to 1. If d = 1, we are placing random points on a line of length 1. We
expect that some pairs of points will be very close, e.g., consecutive points on
the line. We also expect that some points will be very far away - those at or
near opposite ends of the line. The average distance between a pair of points
is 1/3. 1
Suppose that d is very large. The Euclidean distance between two random
points [x 1 , x 2 , . . . , x d ] and [y 1 , y 2 , . . . , y d ] is
d
(x i −y i ) 2
i=1
Here, each x i and y i is a random variable chosen uniformly in the range 0 to
1. Since d is large, we can expect that for some i,|x i
|will be close to 1.
That puts a lower bound of 1 on the distance between almost any two random
points. In fact, a more careful argument can put a stronger lower bound on
the distance between all but a vanishingly small fraction of the pairs of points.
However, the maximum distance between two points is
−y i
d, and one can argue
that all but a vanishingly small fraction of the pairs do not have a distance
close to this upper limit. In fact, almost all points will have a distance close to
the average distance.
If there are essentially no pairs of points that are close, it is hard to build
clusters at all. There is little justification for grouping one pair of points and
not another. Of course, the data may not be random, and there may be useful
clusters, even in very high-dimensional spaces. However, the argument about
random data suggests that it will be hard to find these clusters among so many
pairs that are all at approximately the same distance.
Angles Between Vectors
Suppose again that we have three random points A, B, and C in a d-dimensional
space, where d is large. Here, we do not assume points are in the unit cube;
1 You can prove this fact by evaluating a double integral, but we shall not do the math
here, as it is not central to the discussion.
Search WWH ::




Custom Search