(3) Distance measures obey the triangle inequality; the distance from x to y to z is never less than the distance from x to z directly. That is, d(x, z) ≤ d(x, y) + d(y, z).
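As a small sanity check, we can verify the triangle inequality for the ordinary Euclidean distance on a few concrete points (the function name and points are illustrative):

```python
import math

def euclidean(x, y):
    """Euclidean distance between two points given as coordinate tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

x, y, z = (0.0, 0.0), (3.0, 4.0), (6.0, 0.0)

# Going from x through y to z is never shorter than going from x to z directly.
assert euclidean(x, z) <= euclidean(x, y) + euclidean(y, z)
```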
7.1.2 Clustering Strategies
We can divide (cluster!) clustering algorithms into two groups that follow two fundamentally different strategies.
(1) Hierarchical or agglomerative algorithms start with each point in its own cluster. Clusters are combined based on their “closeness,” using one of many possible definitions of “close.” Combination stops when further combination leads to clusters that are undesirable for one of several reasons. For example, we may stop when we have a predetermined number of clusters, or we may use a measure of compactness for clusters, and refuse to construct a cluster by combining two smaller clusters if the resulting cluster has points that are spread out over too large a region.
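The merge loop of a hierarchical algorithm can be sketched as follows. This is a minimal illustration, not an efficient implementation: it uses centroid distance as the definition of “close” and a fixed target number of clusters as the stopping rule, both of which are just one of the choices described above.

```python
import math

def centroid(cluster):
    """Coordinate-wise average of the points in a cluster (points are tuples)."""
    n = len(cluster)
    return tuple(sum(p[i] for p in cluster) / n for i in range(len(cluster[0])))

def dist(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

def agglomerate(points, k):
    """Start with each point in its own cluster; repeatedly merge the two
    clusters whose centroids are closest, until only k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        # Find the pair of clusters with the closest centroids.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: dist(centroid(clusters[ij[0]]), centroid(clusters[ij[1]])),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

pts = [(0, 0), (0, 1), (5, 5), (5, 6)]
clusters = agglomerate(pts, 2)  # the two nearby pairs end up together
```

Note the cost: each iteration examines all pairs of clusters, which is why large-scale versions of this strategy must take shortcuts.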
(2) The other class of algorithms involves point assignment. Points are considered in some order, and each one is assigned to the cluster into which it best fits. This process is normally preceded by a short phase in which initial clusters are estimated. Variations allow occasional combining or splitting of clusters, or may allow points to be unassigned if they are outliers (points too far from any of the current clusters).
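A minimal sketch of the point-assignment strategy, assuming initial cluster centers have already been estimated in a preceding phase. The `outlier_threshold` parameter and the function names are illustrative assumptions, not part of any specific algorithm in the text:

```python
import math

def dist(a, b):
    """Euclidean distance between two points given as coordinate tuples."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

def assign_points(points, centers, outlier_threshold=None):
    """Assign each point to the nearest center, in the order given.
    If outlier_threshold is set, points farther than that from every
    center are left unassigned, as outliers."""
    clusters = {c: [] for c in centers}
    outliers = []
    for p in points:
        nearest = min(centers, key=lambda c: dist(p, c))
        if outlier_threshold is not None and dist(p, nearest) > outlier_threshold:
            outliers.append(p)
        else:
            clusters[nearest].append(p)
    return clusters, outliers

centers = [(0, 0), (10, 10)]
clusters, outliers = assign_points([(1, 0), (9, 10), (50, 50)], centers,
                                   outlier_threshold=5.0)
```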
Algorithms for clustering can also be distinguished by:
(a) Whether the algorithm assumes a Euclidean space, or whether the algorithm works for an arbitrary distance measure. We shall see that a key distinction is that in a Euclidean space it is possible to summarize a collection of points by their centroid, the average of the points. In a non-Euclidean space, there is no notion of a centroid, and we are forced to develop another way to summarize clusters.
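To make the distinction concrete: the centroid is the coordinate-wise average, and in general it is not one of the data points. When only a distance measure is available, one possible substitute, shown here as a medoid, is the member point with the smallest total distance to the others; this particular choice is an illustrative assumption, not the specific cluster summary the text develops.

```python
import math

def dist(a, b):
    """Euclidean distance, used here only to drive the medoid example."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

def centroid(points):
    """Euclidean summary: coordinate-wise average (may not be a data point)."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def medoid(points):
    """Distance-only summary: the member point minimizing total distance
    to all other members. Works for any distance measure."""
    return min(points, key=lambda p: sum(dist(p, q) for q in points))

pts = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0)]
c = centroid(pts)  # (2/3, 2/3): not itself one of the three points
m = medoid(pts)    # always one of the input points
```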
(b) Whether the algorithm assumes that the data is small enough to fit in main memory, or whether the data must reside primarily in secondary memory. Algorithms for large amounts of data often must take shortcuts, since it is infeasible, for example, to look at all pairs of points. It is also necessary to summarize clusters in main memory, since we cannot hold all the points of all the clusters in main memory at the same time.
7.1.3 The Curse of Dimensionality
High-dimensional Euclidean spaces have a number of unintuitive properties that are sometimes referred to as the “curse of dimensionality.” Non-Euclidean spaces usually share