Database Reference
In-Depth Information
7.3.3
Picking the Right Value of k
We may not know the correct value of k to use in a k -means clustering. However, if we can
measure the quality of the clustering for various values of k , we can usually guess what the
right value of k is. Recall the discussion in Section 7.2.3 , especially Example 7.5 , where
we observed that if we take a measure of appropriateness for clusters, such as average ra-
dius or diameter, that value will grow slowly, as long as the number of clusters we assume
remains at or above the true number of clusters. However, as soon as we try to form fewer
clusters than there really are, the measure will rise precipitously. The idea is expressed by
the diagram of Fig. 7.9 .
Figure 7.9 Average diameter or another measure of diffuseness rises quickly as soon as the number of clusters falls be-
low the true number present in the data
If we have no idea what the correct value of k is, we can find a good value in a number of
clustering operations that grows only logarithmically with the true number. Begin by run-
ning the k -means algorithm for k = 1 , 2 , 4 , 8 , . . . . Eventually, you will find two values v and
2 v between which there is very little decrease in the average diameter, or whatever measure
of cluster cohesion you are using. We may conclude that the value of k that is justified by
the data lies between v/ 2 and v . If you use a binary search (discussed below) in that range,
you can find the best value for k in another log 2 v clustering operations, for a total of 2 log 2
v clusterings. Since the true value of k is at least v/ 2, we have used a number of clusterings
that is logarithmic in k .
Since the notion of “not much change” is imprecise, we cannot say exactly how much
change is too much. However, the binary search can be conducted as follows, assuming the
notion of “not much change” is made precise by some formula. We know that there is too
much change between v/ 2 and v , or else we would not have gone on to run a clustering for
2 v clusters. Suppose at some point we have narrowed the range of k to between x and y .
Let z = ( x + y ) / 2. Run a clustering with z as the target number of clusters. If there is not too
much change between z and y , then the true value of k lies between x and z . So recursively
narrow that range to find the correct value of k . On the other hand, if there is too much
change between z and y , then use binary search in the range between z and y instead.
Search WWH ::




Custom Search