where $\operatorname{tr}$ represents the trace of a matrix, $B(k)$ is the between-cluster sum of squares with $k$ clusters, and $W(k)$ is the within-cluster sum of squares with $k$ clusters (Mardia et al., 1979). The number of clusters for a dataset is given by $\arg\max_{k \ge 2} CH(k)$.
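As a rough illustration of how these quantities could be computed, the sketch below evaluates $\operatorname{tr} B(k)$ and $\operatorname{tr} W(k)$ for a labelled dataset. The index itself is not repeated in this passage, so the code assumes the usual Calinski-Harabasz form, $CH(k) = [\operatorname{tr} B(k)/(k-1)] / [\operatorname{tr} W(k)/(n-k)]$ with $n$ data points. The function names `trace_scatter` and `ch_index` and the toy data are hypothetical.

```python
def trace_scatter(points, labels):
    """Return (tr B, tr W) for points given as equal-length lists of features."""
    n, m = len(points), len(points[0])
    overall = [sum(p[f] for p in points) / n for f in range(m)]
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    tr_b = tr_w = 0.0
    for members in clusters.values():
        centroid = [sum(p[f] for p in members) / len(members) for f in range(m)]
        # Between-cluster part: squared centroid-to-overall-mean distance, weighted by size.
        tr_b += len(members) * sum((c - o) ** 2 for c, o in zip(centroid, overall))
        # Within-cluster part: squared distances of members to their own centroid.
        tr_w += sum((x - c) ** 2 for p in members for x, c in zip(p, centroid))
    return tr_b, tr_w


def ch_index(points, labels):
    """Assumed standard form: [tr B / (k - 1)] / [tr W / (n - k)]; needs 2 <= k < n."""
    n, k = len(points), len(set(labels))
    tr_b, tr_w = trace_scatter(points, labels)
    return (tr_b / (k - 1)) / (tr_w / (n - k))


# Toy data (made up): two tight groups far apart along the first feature.
points = [[0, 0], [0, 1], [10, 0], [10, 1]]
good_labels = [0, 0, 1, 1]   # matches the two groups
bad_labels = [0, 1, 0, 1]    # mixes the two groups
```

With a clustering routine of one's choice, `ch_index` would be evaluated once per candidate $k \ge 2$ and the $k$ with the largest value kept.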
Krzanowski and Lai (1985) defined the following indices for estimating $k$ for a dataset:

$$ \mathrm{diff}(k) = (k-1)^{2/m}\,\operatorname{tr} W_{k-1} - k^{2/m}\,\operatorname{tr} W_{k} \qquad (2) $$

$$ KL(k) = \frac{|\mathrm{diff}(k)|}{|\mathrm{diff}(k+1)|} \qquad (3) $$

where $m$ is the number of features for each data point and $W_k$ denotes $W(k)$. The number of clusters for a dataset is estimated to be $\arg\max_{k \ge 2} KL(k)$.
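A minimal sketch of equations (2) and (3), assuming the values of $\operatorname{tr} W$ have already been obtained for consecutive numbers of clusters. The function name `kl_index` and the numbers in `tr_w` are hypothetical and serve only to show the arithmetic.

```python
def kl_index(tr_w, m):
    """Return {k: KL(k)} for every k >= 2 whose neighbours k-1 and k+1 are in tr_w.

    tr_w maps a number of clusters k to tr W for that k; m is the number of features.
    """
    def diff(k):
        # Equation (2): (k-1)^(2/m) tr W_{k-1} - k^(2/m) tr W_k
        return (k - 1) ** (2 / m) * tr_w[k - 1] - k ** (2 / m) * tr_w[k]

    scores = {}
    for k in sorted(tr_w):
        if k >= 2 and (k - 1) in tr_w and (k + 1) in tr_w:
            d_next = diff(k + 1)
            if d_next != 0:  # KL(k) is undefined when diff(k+1) is zero
                scores[k] = abs(diff(k)) / abs(d_next)  # equation (3)
    return scores


# Made-up within-cluster traces for k = 1..5 on data with m = 2 features.
tr_w = {1: 100.0, 2: 40.0, 3: 10.0, 4: 7.0, 5: 5.5}
scores = kl_index(tr_w, m=2)
best_k = max(scores, key=scores.get)  # 3 for these made-up values
```

Note that $KL(k)$ needs $\operatorname{tr} W$ for $k-1$ and $k+1$ as well, so the largest candidate $k$ cannot be scored without running the clustering once more.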
The Silhouette width is defined (Kaufman & Rousseeuw, 1990) to be a criterion for estimating $k$ in a dataset as follows:

$$ sil(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))} \qquad (4) $$

where $sil(i)$ means the Silhouette width of data point $i$, $a(i)$ denotes the average distance between $i$ and all other data in the cluster which $i$ belongs to, and $b(i)$ represents the smallest average distance between $i$ and all data points in any other cluster. Data with a large $sil(i)$ are well clustered. The overall average silhouette width is defined by $\overline{sil} = \sum_i sil(i)/n$ (where $n$ is the number of data in a dataset). Each $k$ ($k \ge 2$) is associated with a $\overline{sil}_k$, and the $k$ with the largest $\overline{sil}_k$ is selected as the right number of clusters for the dataset (i.e., $k = \arg\max_{k \ge 2} \overline{sil}_k$).
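The silhouette computation can be sketched directly from equation (4). The version below is an illustration under stated assumptions: Euclidean distance is used as the distance measure, a point that is alone in its cluster is given $sil(i) = 0$, and at least two clusters are present. The function names and the toy data are hypothetical.

```python
import math


def silhouette_widths(points, labels):
    """Return sil(i) for every point, following equation (4)."""
    by_cluster = {}
    for idx, lab in enumerate(labels):
        by_cluster.setdefault(lab, []).append(idx)
    widths = []
    for i, lab in enumerate(labels):
        own = [j for j in by_cluster[lab] if j != i]
        if not own:
            widths.append(0.0)  # assumed convention for a singleton cluster
            continue
        # a(i): average distance to the other members of i's own cluster.
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b(i): smallest average distance to the members of any other cluster.
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in by_cluster.items()
            if other != lab
        )
        denom = max(a, b)
        widths.append((b - a) / denom if denom > 0 else 0.0)
    return widths


def average_silhouette(points, labels):
    """Overall average silhouette width: the mean of sil(i) over all n points."""
    widths = silhouette_widths(points, labels)
    return sum(widths) / len(widths)


# Toy data (made up): two tight groups far apart along the first feature.
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good_labels = [0, 0, 1, 1]   # matches the two groups
bad_labels = [0, 1, 0, 1]    # mixes the two groups
```

On this toy data the labelling that matches the two groups yields a clearly higher average width than the mixed labelling, which is the behaviour the selection rule relies on when comparing candidate values of $k$.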
Similarly, Strehl (2002) defined the following indices:

$$ avgInter(k) = \sum_{i=1}^{k} \frac{n_i}{n - n_i} \sum_{j \in \{1, \ldots, i-1, i+1, \ldots, k\}} n_j \cdot Inter(C_i, C_j) \qquad (5) $$

$$ avgIntra(k) = \sum_{i=1}^{k} n_i \, Intra(C_i) \qquad (6) $$

$$ \varphi(k) = 1 - \frac{avgInter(k)}{avgIntra(k)} \qquad (7) $$
where $avgInter(k)$ denotes the weighted average inter-cluster similarity, $avgIntra(k)$ denotes the weighted average intra-cluster similarity, $Inter(C_i, C_j)$ means the inter-cluster similarity between cluster $C_i$ with $n_i$ data points and cluster $C_j$ with $n_j$ data points, $Intra(C_i)$ means the intra-cluster similarity within cluster $C_i$, and $\varphi(k)$ is the criterion designed to measure the quality of the clustering solution. The $Inter(C_i, C_j)$ and $Intra(C_i)$ are given by (Strehl, 2002)
$$ Inter(C_i, C_j) = \frac{1}{n_i n_j} \sum_{d_a \in C_i,\, d_b \in C_j} sim(d_a, d_b) \qquad (8) $$

$$ Intra(C_i) = \frac{2}{(n_i - 1)\, n_i} \sum_{d_a, d_b \in C_i} sim(d_a, d_b) \qquad (9) $$
where $d_a$ and $d_b$ represent data points. To obtain high quality with a small number of clusters, Strehl (2002) also designed a penalized quality $\varphi^{T}(k)$, which is defined as