Biology Reference
In-Depth Information
(3) Update the centers of the K clusters by
N t C j i =1 I ( CL t i = j ) x i , j =1 , 2 ,..., K ,
x t +1
j
1
=
where I ( X ) is an identity function such that I ( X )=1if X is true, and I ( X )=0
otherwise. The denominator, N t C j , is the number of objects assigned to cluster
j at the t th iteration; Let t = t +1;
(4) If the convergence criterion is satisfied, algorithm stops. Otherwise go to step
(2).
K -means algorithm has several variants [1]. For instance, in step (1), the ini-
tial centers of these K clusters can be randomly generated points which are evenly
distributed in the area occupied by the objects in the dataset. Or we can just
randomly pick up K objects in the dataset as the initial centers. In different appli-
cations, the candidates for the similarity or dissimilarity measure in step (2) can
include d 2 , d m , d M and s C . Or, if users are clustering variables, correlation and
mutual information can be the candidated similarity measures. If d M is selected as
the dissimilarity measure, in step (1), K variance-covariance matrices Σ C j , j =1, 2,
..., K , are needed to be initialized. Usually, we choose Σ C j = I ,where I is a p
p
identity matrix. In step (3), the variance-covariance matrix of cluster j at iteration
t Σ t C j
×
should also be updated by
N t C j 1 i =1 I ( CL i = j )( x i
Σ t +1
C j
1
x j )( x i
x j )
=
i.e., the sample variance-covariance matrix of objects assigned to cluster j .
Because of the randomness of the initialization of K cluster centers in step
(1), K -means cluster algorithm may give different clustering results on the same
dataset in different runs. One can run the K -means clustering algorithm on the
dataset for D duplicates, D > 1. For duplicate k ( k =1, 2, ..., D ), we calculate
the sum of the intra-cluster dissimilarities ( SICD ) from each object to its cluster
center, denoted as SICD k :
K
N
SICD k =
d 2 ( x i , x kj ) I ( CL ki = j )
(5.14)
j =1
i =1
where x kj is the center of the j th cluster in the k th duplicate, CL ki is the cluster
label of object i in the k th duplicate. The clustering result with the smallest SICD k
is the final clustering result, i.e.,
k =argmin
k
( SICD k , k =1 , 2 ,..., D ) , CL i = CL k , i , i =1 , 2 ,..., N
(5.15)
Search WWH ::




Custom Search