Figure 7.3: Combining the first two points into a cluster
We shall discuss various tests for the adequacy of a cluster in Section 7.2.3, but for an example, we could insist that any cluster have an average distance between the centroid and its points no greater than some limit (a short Python sketch of this test appears after Example 7.3). This approach is only sensible if we have a reason to believe that no cluster extends over too much of the space.
3. We could continue clustering until there is only one cluster. However, it is meaningless to return a single cluster consisting of all the points. Rather, we return the tree representing the way in which all the points were combined. This form of answer makes good sense in some applications, such as one in which the points are genomes of different species, and the distance measure reflects the difference in the genome.² Then, the tree represents the evolution of these species, that is, the likely order in which two species branched from a common ancestor.
Example 7.3: If we complete the clustering of the data of Fig. 7.2, the tree describing how clusters were grouped is the tree shown in Fig. 7.6.
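As a concrete illustration of the adequacy test suggested in point (2) above, here is a minimal Python sketch. The function name cluster_is_adequate and the parameter limit are our own labels, not from the text; the test simply computes the centroid of a cluster and checks that the average distance from the centroid to the cluster's points does not exceed the chosen limit.

    from math import dist  # Euclidean distance between two points (Python 3.8+)

    def cluster_is_adequate(points, limit):
        """Hypothetical adequacy test: a cluster passes only if the average
        distance between its centroid and its points is at most `limit`."""
        k = len(points[0])                      # dimensionality of the space
        centroid = tuple(sum(p[i] for p in points) / len(points) for i in range(k))
        avg = sum(dist(p, centroid) for p in points) / len(points)
        return avg <= limit

    # Example: two points that are 3 apart are each 1.5 from their centroid,
    # so they form an adequate cluster for, say, limit = 2.0.
    print(cluster_is_adequate([(4, 10), (7, 10)], limit=2.0))   # True

Under such a rule, the hierarchical algorithm would stop as soon as the best available merger produces a cluster that fails the test.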
7.2.2 Efficiency of Hierarchical Clustering
The basic algorithm for hierarchical clustering is not very efficient. At each step, we must compute the distances between each pair of clusters, in order to find the best merger. The initial step takes O(n²) time, but subsequent steps take time proportional to (n−1)², (n−2)², . . . . The sum of squares up to n is O(n³), so this algorithm is cubic. Thus, it cannot be run except for fairly small numbers of points.
² This space would not be Euclidean, of course, but the principles regarding hierarchical clustering carry over, with some modifications, to non-Euclidean clustering.
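To make the cost analysis concrete, here is a minimal Python sketch of the basic agglomerative algorithm described above, assuming Euclidean points and using the distance between centroids as the intercluster distance (other choices are possible). The names hierarchical_cluster and merges are our own. Each pass scans all pairs of the current clusters, which is the source of the (n−1)², (n−2)², . . . terms, and the sequence of merges is recorded so that the result can be returned as a tree of combinations rather than a single all-inclusive cluster.

    from math import dist

    def centroid(points):
        """Componentwise average of a list of points."""
        k = len(points[0])
        return tuple(sum(p[i] for p in points) / len(points) for i in range(k))

    def hierarchical_cluster(points):
        """Naive agglomerative clustering: repeatedly merge the two clusters
        whose centroids are closest, recording each merge."""
        clusters = [([p], p) for p in points]      # (member points, centroid)
        merges = []                                # merge history = the tree
        while len(clusters) > 1:
            best = None
            # scan every pair of the m current clusters: O(m^2) per pass
            for i in range(len(clusters)):
                for j in range(i + 1, len(clusters)):
                    d = dist(clusters[i][1], clusters[j][1])
                    if best is None or d < best[0]:
                        best = (d, i, j)
            _, i, j = best
            merged = clusters[i][0] + clusters[j][0]
            merges.append((clusters[i][0], clusters[j][0]))
            clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
            clusters.append((merged, centroid(merged)))
        return merges

Run on the twelve points of Fig. 7.2, the first pass would merge the two closest points, as in Fig. 7.3, and the n−1 recorded merges describe a tree like the one in Fig. 7.6. Since there are n−1 passes and the first alone examines roughly n²/2 pairs, the total work is O(n³), matching the analysis above.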