where tr represents the trace of a matrix, B(k) is the between cluster sum of squares with k clusters and
W(k) is the within cluster sum of squares with k clusters (Mardia et al., 1979). The number of clusters
for a dataset is given by
$$k = \arg\max_{k \ge 2} CH(k).$$
Krzanowski and Lai (1985) defined the following indices for estimating k for a dataset:
$$\mathrm{diff}(k) = (k-1)^{2/m}\,\mathrm{tr}W(k-1) - k^{2/m}\,\mathrm{tr}W(k) \tag{2}$$
$$KL(k) = \frac{|\mathrm{diff}(k)|}{|\mathrm{diff}(k+1)|} \tag{3}$$
where m is the number of features for each data point. The number of clusters for a dataset is estimated to be
$$k = \arg\max_{k \ge 2} KL(k).$$
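The Krzanowski and Lai estimate can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names `kl_index` and `estimate_k_kl` and the dictionary input are assumptions made here, and the values of trW(k) are taken as already computed by whatever clustering algorithm is in use.

```python
def kl_index(tr_w, m):
    """Return {k: KL(k)} following equations (2) and (3).

    tr_w maps a cluster count k to trW(k) (hypothetical input format);
    m is the number of features per data point.
    KL(k) needs trW(k-1), trW(k) and trW(k+1), so only those k for which
    all three values are available receive a score.
    """
    def diff(k):
        # diff(k) = (k-1)^(2/m) trW(k-1) - k^(2/m) trW(k)
        return (k - 1) ** (2.0 / m) * tr_w[k - 1] - k ** (2.0 / m) * tr_w[k]

    scores = {}
    for k in sorted(tr_w):
        if k < 2 or (k - 1) not in tr_w or (k + 1) not in tr_w:
            continue
        denom = abs(diff(k + 1))
        if denom == 0:
            continue  # KL(k) is undefined when diff(k+1) vanishes
        scores[k] = abs(diff(k)) / denom
    return scores


def estimate_k_kl(tr_w, m):
    """Return the k (k >= 2) with the largest KL(k)."""
    scores = kl_index(tr_w, m)
    return max(scores, key=scores.get)
```

As a worked example with made-up numbers, take m = 2 and trW(1), ..., trW(5) = 100, 40, 10, 8, 7. Then diff(2) = 20, diff(3) = 50, diff(4) = -2 and diff(5) = -3, so KL(2) = 0.4, KL(3) = 25 and KL(4) ≈ 0.67, and k = 3 is selected.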
The Silhouette width is defined (Kaufman & Rousseeuw, 1990) to be a criterion for estimating k in
a dataset as follows:
$$\mathrm{sil}(i) = \frac{b(i) - a(i)}{\max(a(i),\, b(i))} \tag{4}$$
where sil(i) is the Silhouette width of data point i, a(i) denotes the average distance between i and all other data points in the cluster to which i belongs, and b(i) represents the smallest average distance between i and all data points in any other cluster. Data points with a large sil(i) are well clustered. The overall average silhouette width is defined by $\overline{sil} = \sum_i \mathrm{sil}(i)/n$ (where n is the number of data points in the dataset). Each k (k ≥ 2) is associated with a $\overline{sil}_k$, and the k with the largest $\overline{sil}_k$ is selected as the right number of clusters for the dataset (i.e., $k = \arg\max_{k \ge 2} \overline{sil}_k$).
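The Silhouette criterion can be illustrated with a short, dependency-free Python sketch. Several choices here are assumptions rather than part of the definition above: Euclidean distance is used as the distance measure, a point that is alone in its cluster is given width 0, and the helper names (`silhouette_widths`, `average_silhouette`, `best_k_by_silhouette`) are hypothetical.

```python
import math


def silhouette_widths(points, labels):
    """Return sil(i) for every point, per equation (4).

    Assumptions: Euclidean distance; singleton clusters get width 0.
    """
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette width needs at least two clusters")

    def mean_dist(i, members):
        return sum(math.dist(points[i], points[j]) for j in members) / len(members)

    widths = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            widths.append(0.0)
            continue
        a = mean_dist(i, own)  # average distance within own cluster
        b = min(mean_dist(i, members)  # nearest other cluster, on average
                for other, members in clusters.items() if other != lab)
        denom = max(a, b)
        widths.append((b - a) / denom if denom > 0 else 0.0)
    return widths


def average_silhouette(points, labels):
    """Overall average silhouette width for one clustering."""
    widths = silhouette_widths(points, labels)
    return sum(widths) / len(widths)


def best_k_by_silhouette(points, labelings):
    """labelings maps each candidate k to a list of cluster labels."""
    return max(labelings, key=lambda k: average_silhouette(points, labelings[k]))
```

Given one labeling per candidate k (produced by any clustering algorithm), `best_k_by_silhouette` returns the k whose labeling has the largest average width.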
Similarly, Strehl (2002) defined the following indices:
$$\mathrm{avgInter}(k) = \sum_{i=1}^{k} \frac{n_i}{n - n_i} \sum_{j \in \{1,\ldots,i-1,\,i+1,\ldots,k\}} n_j\, \mathrm{Inter}(C_i, C_j) \tag{5}$$
$$\mathrm{avgIntra}(k) = \sum_{i=1}^{k} n_i\, \mathrm{Intra}(C_i) \tag{6}$$
$$\phi(k) = 1 - \frac{\mathrm{avgInter}(k)}{\mathrm{avgIntra}(k)} \tag{7}$$
where avgInter(k) denotes the weighted average inter-cluster similarity, avgIntra(k) denotes the weighted average intra-cluster similarity, $\mathrm{Inter}(C_i, C_j)$ is the inter-cluster similarity between cluster $C_i$ with $n_i$ data points and cluster $C_j$ with $n_j$ data points, $\mathrm{Intra}(C_i)$ is the intra-cluster similarity within cluster $C_i$, and $\phi(k)$ is the criterion designed to measure the quality of a clustering solution. $\mathrm{Inter}(C_i, C_j)$ and $\mathrm{Intra}(C_i)$ are given by (Strehl, 2002)
$$\mathrm{Inter}(C_i, C_j) = \frac{1}{n_i n_j} \sum_{d_a \in C_i,\; d_b \in C_j} \mathrm{sim}(d_a, d_b) \tag{8}$$
$$\mathrm{Intra}(C_i) = \frac{2}{(n_i - 1)\, n_i} \sum_{d_a,\, d_b \in C_i} \mathrm{sim}(d_a, d_b) \tag{9}$$
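Equations (5) through (9) combine into a single quality score, which the following Python sketch computes from a precomputed similarity matrix. It is an illustration under stated assumptions: the function name `strehl_quality` and the nested-list matrix format are hypothetical, the sum in (9) is read as running over unordered pairs of distinct points (which is what the normalization 2/((n_i - 1) n_i) suggests), and every cluster is assumed to contain at least two points.

```python
def strehl_quality(sim, labels):
    """Return phi(k) = 1 - avgInter(k) / avgIntra(k), equations (5)-(9).

    sim is a symmetric n x n similarity matrix (list of lists);
    labels[i] is the cluster label of data point i.
    """
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2 or any(len(c) < 2 for c in clusters.values()):
        raise ValueError("need >= 2 clusters, each with >= 2 points")
    n = len(labels)

    def inter(ci, cj):
        # Equation (8): mean similarity over all cross-cluster pairs.
        return sum(sim[a][b] for a in ci for b in cj) / (len(ci) * len(cj))

    def intra(ci):
        # Equation (9), assuming the sum runs over unordered pairs a < b.
        n_i = len(ci)
        total = sum(sim[ci[x]][ci[y]]
                    for x in range(n_i) for y in range(x + 1, n_i))
        return 2.0 * total / ((n_i - 1) * n_i)

    keys = list(clusters)
    # Equation (5): weighted average inter-cluster similarity.
    avg_inter = sum(
        len(clusters[i]) / (n - len(clusters[i]))
        * sum(len(clusters[j]) * inter(clusters[i], clusters[j])
              for j in keys if j != i)
        for i in keys
    )
    # Equation (6): weighted average intra-cluster similarity.
    avg_intra = sum(len(c) * intra(c) for c in clusters.values())
    return 1.0 - avg_inter / avg_intra
```

For a toy matrix in which points 0 and 1 are similar (0.9), points 2 and 3 are similar (0.9), and all cross similarities are 0.1, the labeling [0, 0, 1, 1] gives avgInter = 0.4 and avgIntra = 3.6, so φ = 8/9, whereas the mismatched labeling [0, 1, 0, 1] gives φ = -4.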
In equations (8) and (9), $d_a$ and $d_b$ represent data points. To obtain high quality with a small number of clusters, Strehl (2002) also designed a penalized quality $\phi^{T}(k)$, which is defined as
 