(3) Distance measures obey the triangle inequality; the distance from x to y to z is never less than the distance from x to z directly. That is, d(x, z) ≤ d(x, y) + d(y, z).
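As a small sanity check, we can verify the triangle inequality for the ordinary Euclidean distance on a few concrete points (the function name and points are illustrative):

```python
import math

def euclidean(x, y):
    """Euclidean distance between two points given as coordinate tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

x, y, z = (0.0, 0.0), (3.0, 4.0), (6.0, 0.0)

# Going from x through y to z is never shorter than going from x to z directly.
assert euclidean(x, z) <= euclidean(x, y) + euclidean(y, z)
```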
7.1.2 Clustering Strategies
We can divide (cluster!) clustering algorithms into two groups that follow two fundamentally different strategies.
(1) Hierarchical or agglomerative algorithms start with each point in its own cluster. Clusters are combined based on their “closeness,” using one of many possible definitions of “close.” Combination stops when further combination leads to clusters that are undesirable for one of several reasons. For example, we may stop when we have a predetermined number of clusters, or we may use a measure of compactness for clusters, and refuse to construct a cluster by combining two smaller clusters if the resulting cluster has points that are spread out over too large a region.
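The merge loop of a hierarchical algorithm can be sketched as follows. This is a minimal illustration, not an efficient implementation: it uses centroid distance as the definition of “close” and a fixed target number of clusters as the stopping rule, both of which are just one of the choices described above.

```python
import math

def centroid(cluster):
    """Coordinate-wise average of the points in a cluster (points are tuples)."""
    n = len(cluster)
    return tuple(sum(p[i] for p in cluster) / n for i in range(len(cluster[0])))

def dist(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

def agglomerate(points, k):
    """Start with each point in its own cluster; repeatedly merge the two
    clusters whose centroids are closest, until only k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        # Find the pair of clusters with the closest centroids.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: dist(centroid(clusters[ij[0]]), centroid(clusters[ij[1]])),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

pts = [(0, 0), (0, 1), (5, 5), (5, 6)]
clusters = agglomerate(pts, 2)  # the two nearby pairs end up together
```

Note the cost: each iteration examines all pairs of clusters, which is why large-scale versions of this strategy must take shortcuts.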
(2) The other class of algorithms involves point assignment. Points are considered in some order, and each one is assigned to the cluster into which it best fits. This process is normally preceded by a short phase in which initial clusters are estimated. Variations allow occasional combining or splitting of clusters, or may allow points to be unassigned if they are outliers (points too far from any of the current clusters).
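A minimal sketch of the point-assignment strategy, assuming initial cluster centers have already been estimated in a preceding phase. The `outlier_threshold` parameter and the function names are illustrative assumptions, not part of any specific algorithm in the text:

```python
import math

def dist(a, b):
    """Euclidean distance between two points given as coordinate tuples."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

def assign_points(points, centers, outlier_threshold=None):
    """Assign each point to the nearest center, in the order given.
    If outlier_threshold is set, points farther than that from every
    center are left unassigned, as outliers."""
    clusters = {c: [] for c in centers}
    outliers = []
    for p in points:
        nearest = min(centers, key=lambda c: dist(p, c))
        if outlier_threshold is not None and dist(p, nearest) > outlier_threshold:
            outliers.append(p)
        else:
            clusters[nearest].append(p)
    return clusters, outliers

centers = [(0, 0), (10, 10)]
clusters, outliers = assign_points([(1, 0), (9, 10), (50, 50)], centers,
                                   outlier_threshold=5.0)
```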
Algorithms for clustering can also be distinguished by:
(a) Whether the algorithm assumes a Euclidean space, or whether the algorithm works for an arbitrary distance measure. We shall see that a key distinction is that in a Euclidean space it is possible to summarize a collection of points by their centroid, the average of the points. In a non-Euclidean space, there is no notion of a centroid, and we are forced to develop another way to summarize clusters.
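To make the distinction concrete: the centroid is the coordinate-wise average, and in general it is not one of the data points. When only a distance measure is available, one possible substitute, shown here as a medoid, is the member point with the smallest total distance to the others; this particular choice is an illustrative assumption, not the specific cluster summary the text develops.

```python
import math

def dist(a, b):
    """Euclidean distance, used here only to drive the medoid example."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

def centroid(points):
    """Euclidean summary: coordinate-wise average (may not be a data point)."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def medoid(points):
    """Distance-only summary: the member point minimizing total distance
    to all other members. Works for any distance measure."""
    return min(points, key=lambda p: sum(dist(p, q) for q in points))

pts = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0)]
c = centroid(pts)  # (2/3, 2/3): not itself one of the three points
m = medoid(pts)    # always one of the input points
```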
(b) Whether the algorithm assumes that the data is small enough to fit in main memory, or whether the data must reside primarily in secondary memory. Algorithms for large amounts of data often must take shortcuts, since it is infeasible, for example, to look at all pairs of points. It is also necessary to summarize clusters in main memory, since we cannot hold all the points of all the clusters in main memory at the same time.
7.1.3 The Curse of Dimensionality
High-dimensional Euclidean spaces have a number of unintuitive properties that are sometimes referred to as the “curse of dimensionality.” Non-Euclidean spaces usually share