Clustering - Mining of Massive Datasets

Databases Reference

In-Depth Information

and edit distance. Recall that the requirements for a function on pairs of points

to be a distance measure are that

1. Distances are always nonnegative, and only the distance between a point

and itself is 0.

2. Distance is symmetric; it doesn't matter in which order you consider the

points when computing their distance.

3. Distance measures obey the triangle inequality; the distance from x to y

to z is never less than the distance going from x to z directly.

7.1.2 Clustering Strategies

We can divide (cluster!) clustering algorithms into two groups that follow two

fundamentally different strategies.

1. Hierarchical or agglomerative algorithms start with each point in its own

cluster. Clusters are combined based on their “closeness,” using one of

many possible definitions of “close.” Combination stops when further

combination leads to clusters that are undesirable for one of several rea-

sons. For example, we may stop when we have a predetermined number of

clusters, or we may use a measure of compactness for clusters, and refuse

to construct a cluster by combining two smaller clusters if the resulting

cluster has points that are spread out over too large a region.

2. The other class of algorithms involve point assignment. Points are con-

sidered in some order, and each one is assigned to the cluster into which

it best fits. This process is normally preceded by a short phase in which

initial clusters are estimated. Variations allow occasional combining or

splitting of clusters, or may allow points to be unassigned if they are

outliers (points too far from any of the current clusters).

Algorithms for clustering can also be distinguished by:

(a) Whether the algorithm assumes a Euclidean space, or whether the al-

gorithm works for an arbitrary distance measure. We shall see that a

key distinction is that in a Euclidean space it is possible to summarize

a collection of points by their centroid - the average of the points. In a

non-Euclidean space, there is no notion of a centroid, and we are forced

to develop another way to summarize clusters.

(b) Whether the algorithm assumes that the data is small enough to fit in

main memory, or whether data must reside in secondary memory, pri-

marily. Algorithms for large amounts of data often must take shortcuts,

since it is infeasible to look at all pairs of points, for example. It is also

necessary to summarize clusters in main memory, since we cannot hold

all the points of all the clusters in main memory at the same time.

Search WWH ::

Custom Search

Home