are some efficiencies we can make by careful implementation. When the space
is non-Euclidean, there are additional problems associated with hierarchical
clustering. We therefore consider “clustroids” and the way we can represent a
cluster when there is no centroid or average point in a cluster.
7.2.1 Hierarchical Clustering in a Euclidean Space
Any hierarchical clustering algorithm works as follows. We begin with every
point in its own cluster. As time goes on, larger clusters will be constructed by
combining two smaller clusters, and we have to decide in advance:
1. How will clusters be represented?
2. How will we choose which two clusters to merge?
3. When will we stop combining clusters?
Once we have answers to these questions, the algorithm can be described succinctly as:
WHILE it is not time to stop DO
    pick the best two clusters to merge;
    combine those two clusters into one cluster;
END;
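As a concrete illustration, here is a minimal Python sketch of this skeleton. The callables closest_pair and should_stop are hypothetical stand-ins for decisions 2 and 3 above; representing a cluster as a list of its points stands in for decision 1. None of these names come from the text.

    # Minimal sketch of the agglomerative skeleton above.
    # closest_pair and should_stop are stand-ins for decisions 2 and 3;
    # a cluster is represented simply as a list of its points (decision 1).
    def hierarchical_cluster(points, closest_pair, should_stop):
        clusters = [[p] for p in points]        # every point in its own cluster
        while len(clusters) > 1 and not should_stop(clusters):
            i, j = closest_pair(clusters)       # pick the best two clusters
            merged = clusters[i] + clusters[j]  # combine them into one
            clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
            clusters.append(merged)
        return clusters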
To begin, we shall assume the space is Euclidean. That allows us to represent
a cluster by its centroid, the average of the points in the cluster. Note that
in a cluster of one point, that point is the centroid, so we can initialize the
clusters straightforwardly. We can then use the merging rule that the distance
between any two clusters is the Euclidean distance between their centroids,
and we should pick the two clusters at the shortest distance. Other ways to
define intercluster distance are possible, and we can also pick the best pair of
clusters on a basis other than their distance. We shall discuss some options in
Section 7.2.3.
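For the Euclidean case just described, the centroid-based merge rule can be sketched as follows. The helpers centroid and closest_centroid_pair are illustrative names, not part of the text; the latter could serve as the closest_pair argument of the skeleton above.

    from math import dist   # Euclidean distance (Python 3.8+)

    def centroid(cluster):
        # The centroid is the coordinate-wise average of the points.
        n = len(cluster)
        return tuple(sum(xs) / n for xs in zip(*cluster))

    def closest_centroid_pair(clusters):
        # Merge rule: the distance between two clusters is the Euclidean
        # distance between their centroids; return the nearest pair's indices.
        pairs = [(i, j) for i in range(len(clusters))
                        for j in range(i + 1, len(clusters))]
        return min(pairs, key=lambda p: dist(centroid(clusters[p[0]]),
                                             centroid(clusters[p[1]])))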
Example 7.2: Let us see how the basic hierarchical clustering would work on
the data of Fig. 7.2. These points live in a 2-dimensional Euclidean space, and
each point is named by its (x, y) coordinates. Initially, each point is in a cluster
by itself and is the centroid of that cluster. Among all the pairs of points, there
are two pairs that are closest: (10,5) and (11,4) or (11,4) and (12,3). Each is
at distance √2. Let us break ties arbitrarily and decide to combine (11,4) with
(12,3). The result is shown in Fig. 7.3, including the centroid of the new cluster,
which is at (11.5, 3.5).
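The distances driving this example, including the one that matters in the next step, are easy to verify with Python's standard math.dist:

    from math import dist
    print(dist((10, 5), (11, 4)))      # 1.414... = sqrt(2)
    print(dist((11, 4), (12, 3)))      # 1.414... = sqrt(2), hence the tie
    print(dist((10, 5), (11.5, 3.5)))  # 2.121... = 1.5*sqrt(2)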
You might think that (10,5) gets combined with the new cluster next, since it
is so close to (11,4). But our distance rule requires us to compare only cluster
centroids, and the distance from (10,5) to the centroid of the new cluster is
1.5√2, which is slightly greater than 2. Thus, now the two closest clusters are