Clustering - Mining of Massive Datasets

Databases Reference

In-Depth Information

7.1.3 The Curse of Dimensionality

High-dimensional Euclidean spaces have a number of unintuitive properties that

are sometimes referred to as the “curse of dimensionality.” Non-Euclidean

spaces usually share these anomalies as well. One manifestation of the “curse”

is that in high dimensions, almost all pairs of points are equally far away from

one another. Another manifestation is that almost any two vectors are almost

orthogonal. We shall explore each of these in turn.

The Distribution of Distances in a High-Dimensional Space

Let us consider a d-dimensional Euclidean space. Suppose we choose n random

points in the unit cube, i.e., points [x 1 , x 2 , . . . , x d ], where each x i is in the

range 0 to 1. If d = 1, we are placing random points on a line of length 1. We

expect that some pairs of points will be very close, e.g., consecutive points on

the line. We also expect that some points will be very far away - those at or

near opposite ends of the line. The average distance between a pair of points

is 1/3. 1

Suppose that d is very large. The Euclidean distance between two random

points [x 1 , x 2 , . . . , x d ] and [y 1 , y 2 , . . . , y d ] is

d

(x i −y i ) 2

i=1

Here, each x i and y i is a random variable chosen uniformly in the range 0 to

1. Since d is large, we can expect that for some i,|x i

|will be close to 1.

That puts a lower bound of 1 on the distance between almost any two random

points. In fact, a more careful argument can put a stronger lower bound on

the distance between all but a vanishingly small fraction of the pairs of points.

However, the maximum distance between two points is

−y i

√

d, and one can argue

that all but a vanishingly small fraction of the pairs do not have a distance

close to this upper limit. In fact, almost all points will have a distance close to

the average distance.

If there are essentially no pairs of points that are close, it is hard to build

clusters at all. There is little justification for grouping one pair of points and

not another. Of course, the data may not be random, and there may be useful

clusters, even in very high-dimensional spaces. However, the argument about

random data suggests that it will be hard to find these clusters among so many

pairs that are all at approximately the same distance.

Angles Between Vectors

Suppose again that we have three random points A, B, and C in a d-dimensional

space, where d is large. Here, we do not assume points are in the unit cube;

1 You can prove this fact by evaluating a double integral, but we shall not do the math

here, as it is not central to the discussion.

Mining of Massive Datasets

Search WWH ::

Custom Search

Home