Biology Reference
In-Depth Information
2.7. Determining the “Right” Number of Clusters
An important issue in clustering is to determine the “right” number K of clusters
that fits a data set with N points. In dichotomous situations the unambiguous
answer is K =2, but in general, the answer lies between the two extremes of
K =1(one cluster fits all), and K = N (each point is a cluster.)
The joint distance function (given in (2.31) for K =2, and analogously for
general K ) helps resolve this issue. Indeed, the value of the JDF decreases mono-
tonically with K , the number of clusters, and the decrease is precipitous until the
“right” number is reached, and after that the rate of decrease is small.
This is
illustrated in Example 2.10 and Figures 2.9-2.11 below.
This approach is practical because the PDQ algorithm is fast, and it generates
the criterion needed (the value of the JDF), unlike other approaches that require
external criteria [11]. See also the survey in [4].
Example 2.10. Figure 2.9(a) shows a data set with 2 clusters. The PDQ algorithm
was applied to this data set, and the values of the JDF are computed for values of
K from 1 to 10, the results are plotted in Fig. 2.9(b). Note the change of slope of
the JDF at K =2, the correct number of clusters.
Figures 2.10(a) and 2.11(a) similarly show data sets with K =3and K =
4 clusters, respectively. The JDF, computed by the PDQ algorithm, shown in
Figs. 2.10(b) and 2.11(b), reveal the correct number of clusters.
350
0.8
j.d.f value
0.6
300
0.4
0.2
250
0
200
−0.2
−0.4
150
−0.6
−0.8
100
1
2
3
4
5
6
7
8
9
10
−0.5
0
0.5
1
1.5
number of clusters
(a) A data set with K =2clusters
(b) The JDF as a function of K
Fig. 2.9.
Results for Example 2.10
 
Search WWH ::




Custom Search