Probabilistic Distance Clustering: Algorithm and Applications - Clustering Challenges in Biological Network - page 50

Biology Reference

In-Depth Information

2.7. Determining the “Right” Number of Clusters

An important issue in clustering is to determine the “right” number K of clusters

that fits a data set with N points. In dichotomous situations the unambiguous

answer is K =2, but in general, the answer lies between the two extremes of

K =1(one cluster fits all), and K = N (each point is a cluster.)

The joint distance function (given in (2.31) for K =2, and analogously for

general K ) helps resolve this issue. Indeed, the value of the JDF decreases mono-

tonically with K , the number of clusters, and the decrease is precipitous until the

“right” number is reached, and after that the rate of decrease is small.

This is

illustrated in Example 2.10 and Figures 2.9-2.11 below.

This approach is practical because the PDQ algorithm is fast, and it generates

the criterion needed (the value of the JDF), unlike other approaches that require

external criteria [11]. See also the survey in [4].

Example 2.10. Figure 2.9(a) shows a data set with 2 clusters. The PDQ algorithm

was applied to this data set, and the values of the JDF are computed for values of

K from 1 to 10, the results are plotted in Fig. 2.9(b). Note the change of slope of

the JDF at K =2, the correct number of clusters.

Figures 2.10(a) and 2.11(a) similarly show data sets with K =3and K =

4 clusters, respectively. The JDF, computed by the PDQ algorithm, shown in

Figs. 2.10(b) and 2.11(b), reveal the correct number of clusters.

350

0.8

j.d.f value

0.6

300

0.4

0.2

250

0

200

−0.2

−0.4

150

−0.6

−0.8

100

1

2

3

4

5

6

7

8

9

10

−0.5

0

0.5

1

1.5

number of clusters

(a) A data set with K =2clusters

(b) The JDF as a function of K

Fig. 2.9.

Results for Example 2.10

Next Page

Clustering Challenges in Biological Network

Search WWH ::

Custom Search

Home