7.2.2 Efficiency of Hierarchical Clustering
The basic algorithm for hierarchical clustering is not very efficient. At each step, we must compute the distances between each pair of clusters, in order to find the best merger. The initial step takes O(n²) time, but subsequent steps take time proportional to (n−1)², (n−2)², . . . . The sum of squares up to n is O(n³), so this algorithm is cubic. Thus, it cannot be run except for fairly small numbers of points.
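To see where the cubic bound comes from, note that these per-step costs are bounded by the sum of squares

    1² + 2² + · · · + n² = n(n+1)(2n+1)/6,

which grows as n³.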
However, there is a somewhat more efficient implementation of which we should be
aware.
(1) We start, as we must, by computing the distances between all pairs of points, and this step is O(n²).
(2) Form the pairs and their distances into a priority queue, so we can always find the smallest distance in one step. This operation is also O(n²).
(3) When we decide to merge two clusters C and D, we remove all entries in the priority queue involving one of these two clusters; that requires work O(n log n) since there are at most 2n deletions to be performed, and priority-queue deletion can be performed in O(log n) time.
(4) We then compute all the distances between the new cluster and the remaining clusters. This work is also O(n log n), as there are at most n entries to be inserted into the priority queue, and insertion into a priority queue can also be done in O(log n) time.
Since the last two steps are executed at most n times, and the first two steps are executed only once, the overall running time of this algorithm is O(n² log n). That is better than O(n³), but it still puts a strong limit on how large n can be before it becomes infeasible to use this clustering approach.
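As a concrete illustration, here is a minimal Python sketch of this priority-queue approach, using distance between centroids as the merging criterion. The function name, the use of Python's heapq module, and the lazy skipping of stale heap entries (substituted here for the explicit deletions of step (3), since heapq offers no delete operation) are illustrative assumptions, not part of the text.

    import heapq
    import itertools

    def hierarchical_clustering(points, k):
        """Merge clusters by smallest centroid distance until k clusters remain.

        Only centroids and cluster sizes are tracked here; a fuller version
        would also record which points belong to each cluster.
        """
        # Each cluster id maps to (centroid, number of points in the cluster).
        clusters = {i: (tuple(p), 1) for i, p in enumerate(points)}
        next_id = itertools.count(len(points))

        def dist(c1, c2):
            return sum((a - b) ** 2 for a, b in zip(c1, c2)) ** 0.5

        # Steps (1)-(2): all pairwise distances go into a priority queue, O(n^2).
        heap = [(dist(clusters[i][0], clusters[j][0]), i, j)
                for i in clusters for j in clusters if i < j]
        heapq.heapify(heap)

        while len(clusters) > k and heap:
            d, i, j = heapq.heappop(heap)   # smallest remaining distance, O(log n)
            if i not in clusters or j not in clusters:
                continue                    # stale entry for an already-merged cluster
            # Steps (3)-(4): merge i and j, then push the distances from the new
            # cluster to every remaining cluster, O(n log n) per merge.
            (ci, ni), (cj, nj) = clusters.pop(i), clusters.pop(j)
            centroid = tuple((a * ni + b * nj) / (ni + nj) for a, b in zip(ci, cj))
            new = next(next_id)
            for other, (co, _) in clusters.items():
                heapq.heappush(heap, (dist(centroid, co), new, other))
            clusters[new] = (centroid, ni + nj)

        return clusters

Calling hierarchical_clustering(points, 1) on a list of coordinate tuples merges everything into a single cluster; recording the order of the merges along the way would recover the full tree of clusters.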
7.2.3 Alternative Rules for Controlling Hierarchical Clustering
We have seen one rule for picking the best clusters to merge: find the pair with the smallest distance between their centroids. Some other options are:
(1) Take the distance between two clusters to be the minimum of the distances between any two points, one chosen from each cluster. For example, in Fig. 7.3 we would next choose to cluster the point (10,5) with the cluster of two points, since (10,5) is at distance √2 from that cluster, and no other pair of unclustered points is that close. Note that in Example 7.2,