7.3.2 Initializing Clusters for K-Means
We want to pick points that have a good chance of lying in different clusters. There are two
approaches.
(1) Pick points that are as far away from one another as possible.
(2) Cluster a sample of the data, perhaps hierarchically, so there are k clusters. Pick a point
from each cluster, perhaps that point closest to the centroid of the cluster.
The second approach requires little elaboration; a brief sketch of it appears at the end of
this section. For the first approach, there are variations. One good choice is:
Pick the first point at random;
WHILE there are fewer than k points DO
    Add the point whose minimum distance from the selected
        points is as large as possible;
END;
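As a concrete illustration, here is a minimal Python sketch of this procedure, assuming
points are given as coordinate tuples and distances are Euclidean; the function name
init_farthest is ours, not from the text.

    import math
    import random

    def init_farthest(points, k, seed=None):
        """Pick k initial points: the first at random, then repeatedly
        the point whose minimum distance to the already-selected points
        is as large as possible."""
        rng = random.Random(seed)
        chosen = [rng.choice(points)]
        while len(chosen) < k:
            # Each candidate's "score" is its distance to the nearest
            # already-chosen point; take the candidate with the top score.
            best = max(
                (p for p in points if p not in chosen),
                key=lambda p: min(math.dist(p, c) for c in chosen),
            )
            chosen.append(best)
        return chosen

Run on the twelve points of Fig. 7.8 with (6,8) as the random first pick, this sketch
reproduces the selection traced in Example 7.8 below.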
EXAMPLE 7.8 Let us consider the twelve points of Fig. 7.2, which we reproduce here as
Fig. 7.8. In the worst case, our initial choice of a point is near the center, say (6,8). The
furthest point from (6,8) is (12,3), so that point is chosen next.
Figure 7.8 Repeat of Fig. 7.2
Among the remaining ten points, the one whose minimum distance to either (6,8) or
(12,3) is a maximum is (2,2). That point has distance √52 = 7.21 from (6,8) and distance
√101 = 10.05 from (12,3); thus its “score” is 7.21. You can check easily that any other
point is less than distance 7.21 from at least one of (6,8) and (12,3). Our selection of three
starting points is thus (6,8), (12,3), and (2,2). Notice that these three belong to different
clusters.
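The arithmetic can be checked with a short snippet using only Python's standard library:

    import math

    d1 = math.dist((2, 2), (6, 8))    # sqrt(52)  = 7.21...
    d2 = math.dist((2, 2), (12, 3))   # sqrt(101) = 10.05...
    print(round(min(d1, d2), 2))      # 7.21 -- the "score" of (2,2)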
Had we started with a different point, say (10,5), we would get a different set of three
initial points. In this case, the starting points would be (10,5), (2,2), and (4,10). Again,
these points belong to the three different clusters.
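For completeness, here is a minimal sketch of the second initialization approach as well:
hierarchically cluster a random sample into k clusters, then take the sample point nearest
each cluster's centroid. The use of scikit-learn's AgglomerativeClustering and the name
init_from_sample are our assumptions, not code from the text.

    import numpy as np
    from sklearn.cluster import AgglomerativeClustering

    def init_from_sample(points, k, sample_size, seed=None):
        """Hierarchically cluster a random sample into k clusters, then
        return the sample point closest to each cluster's centroid."""
        rng = np.random.default_rng(seed)
        pts = np.asarray(points, dtype=float)
        idx = rng.choice(len(pts), size=min(sample_size, len(pts)),
                         replace=False)
        sample = pts[idx]
        labels = AgglomerativeClustering(n_clusters=k).fit_predict(sample)
        chosen = []
        for c in range(k):
            members = sample[labels == c]
            centroid = members.mean(axis=0)
            # Representative: the cluster member nearest its centroid.
            nearest = members[np.argmin(
                np.linalg.norm(members - centroid, axis=1))]
            chosen.append(tuple(nearest))
        return chosen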