Figure 7.3: Combining the first two points into a cluster
We shall discuss various tests for the adequacy of a cluster in Section 7.2.3, but for an example, we could insist that any cluster have an average distance between the centroid and its points no greater than some limit (a short Python sketch of this test appears after Example 7.3). This approach is only sensible if we have a reason to believe that no cluster extends over too much of the space.
3. We could continue clustering until there is only one cluster. However, it is meaningless to return a single cluster consisting of all the points. Rather, we return the tree representing the way in which all the points were combined. This form of answer makes good sense in some applications, such as one in which the points are genomes of different species, and the distance measure reflects the difference in the genome.² Then, the tree represents the evolution of these species, that is, the likely order in which two species branched from a common ancestor.
Example 7.3: If we complete the clustering of the data of Fig. 7.2, the tree describing how clusters were grouped is the tree shown in Fig. 7.6.
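As a concrete illustration of the adequacy test suggested in point (2) above, here is a minimal Python sketch. The function name cluster_is_adequate and the parameter limit are our own labels, not from the text; the test simply computes the centroid of a cluster and checks that the average distance from the centroid to the cluster's points does not exceed the chosen limit.

    from math import dist  # Euclidean distance between two points (Python 3.8+)

    def cluster_is_adequate(points, limit):
        """Hypothetical adequacy test: a cluster passes only if the average
        distance between its centroid and its points is at most `limit`."""
        k = len(points[0])                      # dimensionality of the space
        centroid = tuple(sum(p[i] for p in points) / len(points) for i in range(k))
        avg = sum(dist(p, centroid) for p in points) / len(points)
        return avg <= limit

    # Example: two points that are 3 apart are each 1.5 from their centroid,
    # so they form an adequate cluster for, say, limit = 2.0.
    print(cluster_is_adequate([(4, 10), (7, 10)], limit=2.0))   # True

Under such a rule, the hierarchical algorithm would stop as soon as the best available merger produces a cluster that fails the test.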
7.2.2 Efficiency of Hierarchical Clustering
The basic algorithm for hierarchical clustering is not very efficient. At each step, we must compute the distances between each pair of clusters, in order to find the best merger. The initial step takes O(n²) time, but subsequent steps take time proportional to (n−1)², (n−2)², . . . . The sum of squares up to n is O(n³), so this algorithm is cubic. Thus, it cannot be run except for fairly small numbers of points.
² This space would not be Euclidean, of course, but the principles regarding hierarchical clustering carry over, with some modifications, to non-Euclidean clustering.
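To make the cost analysis concrete, here is a minimal Python sketch of the basic agglomerative algorithm described above, assuming Euclidean points and using the distance between centroids as the intercluster distance (other choices are possible). The names hierarchical_cluster and merges are our own. Each pass scans all pairs of the current clusters, which is the source of the (n−1)², (n−2)², . . . terms, and the sequence of merges is recorded so that the result can be returned as a tree of combinations rather than a single all-inclusive cluster.

    from math import dist

    def centroid(points):
        """Componentwise average of a list of points."""
        k = len(points[0])
        return tuple(sum(p[i] for p in points) / len(points) for i in range(k))

    def hierarchical_cluster(points):
        """Naive agglomerative clustering: repeatedly merge the two clusters
        whose centroids are closest, recording each merge."""
        clusters = [([p], p) for p in points]      # (member points, centroid)
        merges = []                                # merge history = the tree
        while len(clusters) > 1:
            best = None
            # scan every pair of the m current clusters: O(m^2) per pass
            for i in range(len(clusters)):
                for j in range(i + 1, len(clusters)):
                    d = dist(clusters[i][1], clusters[j][1])
                    if best is None or d < best[0]:
                        best = (d, i, j)
            _, i, j = best
            merged = clusters[i][0] + clusters[j][0]
            merges.append((clusters[i][0], clusters[j][0]))
            clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
            clusters.append((merged, centroid(merged)))
        return merges

Run on the twelve points of Fig. 7.2, the first pass would merge the two closest points, as in Fig. 7.3, and the n−1 recorded merges describe a tree like the one in Fig. 7.6. Since there are n−1 passes and the first alone examines roughly n²/2 pairs, the total work is O(n³), matching the analysis above.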