Pick the first point at random;
WHILE there are fewer than k points DO
    Add the point whose minimum distance from the selected
        points is as large as possible;
END;
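As an illustration (ours, not the book's), this selection loop might be realized in Python as the sketch below; the function name pick_initial_points and the use of Euclidean distance are assumptions for this example.

    import math
    import random

    def pick_initial_points(points, k):
        # Pick the first point at random, then repeatedly add the point
        # whose minimum distance to the already-selected points is largest.
        chosen = [random.choice(points)]
        while len(chosen) < k:
            best = max((p for p in points if p not in chosen),
                       key=lambda p: min(math.dist(p, c) for c in chosen))
            chosen.append(best)
        return chosen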
Example 7.8: Let us consider the twelve points of Fig. 7.2, which we reproduce here as Fig. 7.8. In the worst case, our initial choice of a point is near the center, say (6,8). The furthest point from (6,8) is (12,3), so that point is chosen next.
[Figure 7.8: Repeat of Fig. 7.2, showing the twelve points (2,2), (3,4), (4,8), (4,10), (5,2), (6,8), (7,10), (9,3), (10,5), (11,4), (12,3), and (12,6).]
Among the remaining ten points, the one whose minimum distance to either (6,8) or (12,3) is a maximum is (2,2). That point has distance √52 = 7.21 from (6,8) and distance √101 = 10.05 from (12,3); thus its "score" is 7.21. You can check easily that any other point is less than distance 7.21 from at least one of (6,8) and (12,3). Our selection of three starting points is thus (6,8), (12,3), and (2,2). Notice that these three belong to different clusters.
Had we started with a different point, say (10,5), we would get a different
set of three initial points. In this case, the starting points would be (10,5),
(2,2), and (4,10). Again, these points belong to the three different clusters.
□
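To check the example's arithmetic, one can force the worst-case first choice (6,8) rather than picking at random; the snippet below is a sketch under that assumption, reusing the selection loop from the earlier sketch.

    import math

    points = [(2,2), (3,4), (4,8), (4,10), (5,2), (6,8),
              (7,10), (9,3), (10,5), (11,4), (12,3), (12,6)]
    chosen = [(6, 8)]                  # force the worst-case first choice
    while len(chosen) < 3:
        best = max((p for p in points if p not in chosen),
                   key=lambda p: min(math.dist(p, c) for c in chosen))
        chosen.append(best)
    print(chosen)                      # [(6, 8), (12, 3), (2, 2)]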
7.3.3 Picking the Right Value of k
We may not know the correct value of k to use in a k-means clustering. However, if we can measure the quality of the clustering for various values of k, we can usually guess what the right value of k is. Recall the discussion in Section 7.2.3, especially Example 7.5, where we observed that if we take a measure of appropriateness for clusters, such as average radius or diameter, that value will grow slowly, as long as the number of clusters we assume remains at or above the true number of clusters. However, as soon as we try to form fewer clusters than there really are, the measure will rise precipitously.
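As a sketch of how such a measure might be computed (our illustration, not the book's code), the average diameter of a clustering can be evaluated for each candidate k; the clustering routine cluster(points, k) in the commented driver below is hypothetical and stands in for any k-means run.

    import itertools
    import math

    def average_diameter(clusters):
        # Diameter of a cluster: the largest distance between any two of
        # its points; the measure is then averaged over all clusters.
        diameters = []
        for members in clusters:
            if len(members) < 2:
                diameters.append(0.0)
            else:
                diameters.append(max(math.dist(p, q) for p, q in
                                     itertools.combinations(members, 2)))
        return sum(diameters) / len(diameters)

    # Hypothetical driver: cluster(points, k) is assumed, not defined here.
    # for k in range(1, 8):
    #     print(k, average_diameter(cluster(points, k)))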