Biology Reference
In-Depth Information
clustering and the results obtained for the original dataset are compared, for each
value of k , with the quality measures computed from the m randomizations. A
good clustering is one whose quality measure lies significantly above the range of
the quality measures computed from the randomized results, for the same value
of k .
The silhouette coefficient used here as a measure of cluster quality is based on
the idea that a good clustering should consist of cohesive, well-separated clusters.
Given a partitioning
of N objects into k clusters, consider any fixed object i
and let C i denote the set of indices for all objects clustered together with object i .
A useful measure of cohesion for this cluster is:
a ( i )= 1
n i
P
d ij ,
(15.6)
j
C i
where n i is the number of objects in cluster C i and d ij is the dissimilarity between
objects i and j . Note that for a good clustering, a ( i ) should be small for all i .To
characterize the separation between clusters, let K denote the th neighboring
cluster, distinct from C i ,for =1 , 2 ,...,k
1.Define b ( i ) as the average dis-
similarity between object i in cluster C i and the objects in the closest neighboring
cluster, given by:
!
"
1
n
b ( i )=min
d ij
.
(15.7)
#
j∈K
Here, for a good clustering, b ( i ) should be large for all i . The silhouette coefficient
s ( i ) for object i is then defined as the following normalized difference between
these two quantities:
b ( i )
a ( i )
s ( i )=
.
(15.8)
max
{
a ( i ) ,b ( i )
}
For s ( i ) to be well-defined, the partitioning
must contain at least two clusters,
and every cluster must contain at least two objects. Under these conditions, it is
easily shown that
P
1 for all i . From the previous observations, a
good clustering should have a ( i ) << b ( i ) for all i , implying s ( i )
1
s ( i )
1 for all i .A
useful measure of the overall quality of the partitioning
P
is therefore the average
silhouette coefficient over all objects:
N
S = 1
N
s ( i ) .
(15.9)
i =1
Given this cluster quality measure, let
D 0 denote the original dataset and let S 0 ( k )
denote the value of S computed for the k -cluster partitioning of
D 0 obtained by a
Search WWH ::




Custom Search