Number of clusters
The total number of clusters c is the most important parameter, in the sense that
the remaining parameters have less influence on the resulting partition. When
clustering real data without any prior information about the structures in the data,
one usually has to make assumptions about the number of underlying clusters. The clustering
algorithm chosen then searches for c clusters regardless of whether they are really
present in the data or not. Two main approaches to determining the appropriate
number of clusters in the data can be distinguished:
A. Validity measures
Validity measures are scalar indices that assess the goodness of the partition
obtained. Clustering algorithms generally aim at locating well-separated and
compact clusters. When the number of clusters is chosen equal to the number of
groups that are actually present in the data, it is expected that the clustering
algorithm will identify them correctly. When this is not the case, misclassifications
appear, and the clusters are not likely to be well-separated and compact. Hence,
most cluster validity measures quantify the separation and the compactness of the
clusters. These notions are open to interpretation, however, and can be formulated in
different ways. Consequently, many validity measures have been introduced in the
literature (Bezdek, 1981; Gath and Geva, 1989; Pal and Bezdek, 1995). For the
FCM algorithm, the Xie-Beni index (Xie and Beni, 1991)
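$$\chi(Z; U, V) = \frac{\sum_{g=1}^{c} \sum_{s=1}^{N} \mu_{gs}^{m} \, \lVert \mathbf{z}_s - \mathbf{v}_g \rVert^{2}}{c \cdot \min_{g \neq h} \lVert \mathbf{v}_g - \mathbf{v}_h \rVert^{2}} \qquad (4.23)$$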
has been found to perform well in practice. This index can be interpreted as the
ratio of the total within-group variance and the separation of the cluster centers.
The best partition minimizes the value of $\chi(Z; U, V)$.
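As a rough illustration, the index in (4.23) can be computed directly from the data, the membership matrix and the cluster centers. The sketch below is plain Python and assumes the form in which the denominator is c times the smallest squared distance between two cluster centers. The function names `sq_dist` and `xie_beni`, the layout `U[g][s]` for the membership of point s in cluster g, and the toy data are assumptions made for this example, not part of the source.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def xie_beni(Z, U, V, m=2.0):
    """Xie-Beni index, assuming the form with c * min center separation.

    Z: list of N data points
    U: c x N membership matrix, U[g][s] = membership of point s in cluster g
    V: list of c cluster centers
    m: fuzziness exponent
    """
    c = len(V)
    if c < 2:
        raise ValueError("the index needs at least two cluster centers")
    # Numerator: membership-weighted within-cluster squared distances.
    compactness = sum(
        U[g][s] ** m * sq_dist(Z[s], V[g])
        for g in range(c)
        for s in range(len(Z))
    )
    # Denominator: c times the smallest squared distance between two centers.
    separation = min(
        sq_dist(V[g], V[h]) for g in range(c) for h in range(g + 1, c)
    )
    return compactness / (c * separation)


# Toy data (hypothetical): two tight groups on a line, crisp memberships.
Z = [(0.0,), (1.0,), (10.0,), (11.0,)]
V = [(0.5,), (10.5,)]
U = [[1.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 1.0]]
index_value = xie_beni(Z, U, V)
```

To choose c with this index, one would typically run the clustering algorithm for several candidate values of c and keep the partition with the smallest index value.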
B. Iterative merging
In iterative cluster merging, one starts with a sufficiently large number of
clusters and successively reduces this number by merging clusters that are similar
(compatible) with respect to some well-defined criteria (Krishnapuram and Freg,
1992; Kaymak and Babuška, 1995). One can also adopt the opposite approach, i.e.
start with a small number of clusters and iteratively insert clusters in the regions
where the data points have a low degree of membership in the existing
clusters (Gath and Geva, 1989).
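The merging idea can be sketched with a deliberately simple compatibility criterion. In the example below, two clusters count as compatible when their centers lie closer than a threshold, and a merged pair is replaced by its midpoint. Both choices are assumptions made for illustration and are not the criteria used in the cited papers. The function name `merge_close_centers` and the sample centers are likewise hypothetical.

```python
import math


def merge_close_centers(centers, threshold):
    """Greedily merge cluster centers that lie closer than `threshold`.

    Hypothetical compatibility criterion: Euclidean distance between centers.
    Returns a new list of centers; the input is not modified.
    """
    centers = [tuple(v) for v in centers]
    while len(centers) > 1:
        # Find the closest pair of centers.
        d, g, h = min(
            (math.dist(centers[g], centers[h]), g, h)
            for g in range(len(centers))
            for h in range(g + 1, len(centers))
        )
        if d >= threshold:
            break  # no remaining pair is compatible
        # Replace the pair by its unweighted midpoint (an assumption).
        merged = tuple((a + b) / 2 for a, b in zip(centers[g], centers[h]))
        centers = [v for k, v in enumerate(centers) if k not in (g, h)]
        centers.append(merged)
    return centers


# Start with deliberately too many centers (hypothetical values).
initial = [(0.0, 0.0), (0.2, 0.0), (5.0, 5.0), (5.0, 5.4), (9.0, 0.0)]
reduced = merge_close_centers(initial, threshold=1.0)
```

In a full procedure, the clustering algorithm would presumably be run again with the reduced number of clusters after each merging step, so that memberships and centers are re-estimated.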
Fuzziness parameter
The fuzziness exponent (or weighting exponent) m is also an important parameter
and must be selected with care, because it significantly influences the
fuzziness of the resulting partition. As m approaches one from above, the partition
becomes hard ($\mu_{gs} \in \{0, 1\}$) and the $\mathbf{v}_g$ are ordinary means of the
clusters. On the other hand, as $m \to \infty$, the partition becomes completely fuzzy
($\mu_{gs} = 1/c$) and the