Clustering - Mining of Massive Datasets

Database Reference

In-Depth Information

7.3.3

Picking the Right Value of k

We may not know the correct value of k to use in a k -means clustering. However, if we can

measure the quality of the clustering for various values of k , we can usually guess what the

right value of k is. Recall the discussion in Section 7.2.3 , especially Example 7.5 , where

we observed that if we take a measure of appropriateness for clusters, such as average ra-

dius or diameter, that value will grow slowly, as long as the number of clusters we assume

remains at or above the true number of clusters. However, as soon as we try to form fewer

clusters than there really are, the measure will rise precipitously. The idea is expressed by

the diagram of Fig. 7.9 .

Figure 7.9 Average diameter or another measure of diffuseness rises quickly as soon as the number of clusters falls be-

low the true number present in the data

If we have no idea what the correct value of k is, we can find a good value in a number of

clustering operations that grows only logarithmically with the true number. Begin by run-

ning the k -means algorithm for k = 1 , 2 , 4 , 8 , . . . . Eventually, you will find two values v and

2 v between which there is very little decrease in the average diameter, or whatever measure

of cluster cohesion you are using. We may conclude that the value of k that is justified by

the data lies between v/ 2 and v . If you use a binary search (discussed below) in that range,

you can find the best value for k in another log 2 v clustering operations, for a total of 2 log 2

v clusterings. Since the true value of k is at least v/ 2, we have used a number of clusterings

that is logarithmic in k .

Since the notion of “not much change” is imprecise, we cannot say exactly how much

change is too much. However, the binary search can be conducted as follows, assuming the

notion of “not much change” is made precise by some formula. We know that there is too

much change between v/ 2 and v , or else we would not have gone on to run a clustering for

2 v clusters. Suppose at some point we have narrowed the range of k to between x and y .

Let z = ( x + y ) / 2. Run a clustering with z as the target number of clusters. If there is not too

much change between z and y , then the true value of k lies between x and z . So recursively

narrow that range to find the correct value of k . On the other hand, if there is too much

change between z and y , then use binary search in the range between z and y instead.

Search WWH ::

Custom Search

Home