Databases Reference
In-Depth Information
and edit distance. Recall that the requirements for a function on pairs of points
to be a distance measure are that
1. Distances are always nonnegative, and only the distance between a point
and itself is 0.
2. Distance is symmetric; it doesn't matter in which order you consider the
points when computing their distance.
3. Distance measures obey the triangle inequality; the distance from x to y
to z is never less than the distance going from x to z directly.
7.1.2 Clustering Strategies
We can divide (cluster!) clustering algorithms into two groups that follow two
fundamentally different strategies.
1. Hierarchical or agglomerative algorithms start with each point in its own
cluster. Clusters are combined based on their “closeness,” using one of
many possible definitions of “close.” Combination stops when further
combination leads to clusters that are undesirable for one of several rea-
sons. For example, we may stop when we have a predetermined number of
clusters, or we may use a measure of compactness for clusters, and refuse
to construct a cluster by combining two smaller clusters if the resulting
cluster has points that are spread out over too large a region.
2. The other class of algorithms involve point assignment. Points are con-
sidered in some order, and each one is assigned to the cluster into which
it best fits. This process is normally preceded by a short phase in which
initial clusters are estimated. Variations allow occasional combining or
splitting of clusters, or may allow points to be unassigned if they are
outliers (points too far from any of the current clusters).
Algorithms for clustering can also be distinguished by:
(a) Whether the algorithm assumes a Euclidean space, or whether the al-
gorithm works for an arbitrary distance measure. We shall see that a
key distinction is that in a Euclidean space it is possible to summarize
a collection of points by their centroid - the average of the points. In a
non-Euclidean space, there is no notion of a centroid, and we are forced
to develop another way to summarize clusters.
(b) Whether the algorithm assumes that the data is small enough to fit in
main memory, or whether data must reside in secondary memory, pri-
marily. Algorithms for large amounts of data often must take shortcuts,
since it is infeasible to look at all pairs of points, for example. It is also
necessary to summarize clusters in main memory, since we cannot hold
all the points of all the clusters in main memory at the same time.
Search WWH ::




Custom Search