Biology Reference
In-Depth Information
(3) Update the centers of the
K
clusters by
N
t
C
j
i
=1
I
(
CL
t
i
=
j
)
x
i
,
j
=1
,
2
,...,
K
,
x
t
+1
j
1
=
where
I
(
X
) is an identity function such that
I
(
X
)=1if
X
is true, and
I
(
X
)=0
otherwise. The denominator,
N
t
C
j
, is the number of objects assigned to cluster
j
at the
t
th
iteration; Let
t
=
t
+1;
(4) If the convergence criterion is satisfied, algorithm stops. Otherwise go to step
(2).
K
-means algorithm has several variants [1]. For instance, in step (1), the ini-
tial centers of these
K
clusters can be randomly generated points which are evenly
distributed in the area occupied by the objects in the dataset. Or we can just
randomly pick up
K
objects in the dataset as the initial centers. In different appli-
cations, the candidates for the similarity or dissimilarity measure in step (2) can
include
d
2
,
d
m
,
d
M
and
s
C
. Or, if users are clustering variables, correlation and
mutual information can be the candidated similarity measures. If
d
M
is selected as
the dissimilarity measure, in step (1),
K
variance-covariance matrices Σ
C
j
,
j
=1, 2,
...,
K
, are needed to be initialized. Usually, we choose Σ
C
j
=
I
,where
I
is a
p
p
identity matrix. In step (3), the variance-covariance matrix of cluster
j
at iteration
t
Σ
t
C
j
×
should also be updated by
N
t
C
j
−
1
i
=1
I
(
CL
i
=
j
)(
x
i
−
Σ
t
+1
C
j
1
x
j
)(
x
i
−
x
j
)
=
i.e., the sample variance-covariance matrix of objects assigned to cluster
j
.
Because of the randomness of the initialization of
K
cluster centers in step
(1),
K
-means cluster algorithm may give different clustering results on the same
dataset in different runs. One can run the
K
-means clustering algorithm on the
dataset for
D
duplicates,
D
>
1. For duplicate
k
(
k
=1, 2, ...,
D
), we calculate
the sum of the intra-cluster dissimilarities (
SICD
) from each object to its cluster
center, denoted as
SICD
k
:
K
N
SICD
k
=
d
2
(
x
i
,
x
kj
)
I
(
CL
ki
=
j
)
(5.14)
j
=1
i
=1
where
x
kj
is the center of the
j
th
cluster in the
k
th
duplicate,
CL
ki
is the cluster
label of object
i
in the
k
th
duplicate. The clustering result with the smallest
SICD
k
is the final clustering result, i.e.,
k
∗
=argmin
k
(
SICD
k
,
k
=1
,
2
,...,
D
)
,
CL
i
=
CL
k
∗
,
i
,
i
=1
,
2
,...,
N
(5.15)
Search WWH ::
Custom Search