Statistical Clustering Analysis: An Introduction - Clustering Challenges in Biological Network

Biology Reference

In-Depth Information

(3) Update the centers of the K clusters by

N t C j i =1 I ( CL t i = j ) x i , j =1 , 2 ,..., K ,

x t +1

where I ( X ) is an identity function such that I ( X )=1if X is true, and I ( X )=0

otherwise. The denominator, N t C j , is the number of objects assigned to cluster

j at the t th iteration; Let t = t +1;

(4) If the convergence criterion is satisfied, algorithm stops. Otherwise go to step

(2).

K -means algorithm has several variants [1]. For instance, in step (1), the ini-

tial centers of these K clusters can be randomly generated points which are evenly

distributed in the area occupied by the objects in the dataset. Or we can just

randomly pick up K objects in the dataset as the initial centers. In different appli-

cations, the candidates for the similarity or dissimilarity measure in step (2) can

include d 2 , d m , d M and s C . Or, if users are clustering variables, correlation and

mutual information can be the candidated similarity measures. If d M is selected as

the dissimilarity measure, in step (1), K variance-covariance matrices Σ C j , j =1, 2,

..., K , are needed to be initialized. Usually, we choose Σ C j = I ,where I is a p

identity matrix. In step (3), the variance-covariance matrix of cluster j at iteration

t Σ t C j

should also be updated by

N t C j − 1 i =1 I ( CL i = j )( x i −

Σ t +1

C j

x j )( x i −

x j )

i.e., the sample variance-covariance matrix of objects assigned to cluster j .

Because of the randomness of the initialization of K cluster centers in step

(1), K -means cluster algorithm may give different clustering results on the same

dataset in different runs. One can run the K -means clustering algorithm on the

dataset for D duplicates, D > 1. For duplicate k ( k =1, 2, ..., D ), we calculate

the sum of the intra-cluster dissimilarities ( SICD ) from each object to its cluster

center, denoted as SICD k :

SICD k =

d 2 ( x i , x kj ) I ( CL ki = j )

(5.14)

j =1

i =1

where x kj is the center of the j th cluster in the k th duplicate, CL ki is the cluster

label of object i in the k th duplicate. The clustering result with the smallest SICD k

is the final clustering result, i.e.,

k ∗ =argmin

( SICD k , k =1 , 2 ,..., D ) , CL i = CL k ∗ , i , i =1 , 2 ,..., N

(5.15)

Clustering Challenges in Biological Network

Search WWH ::

Custom Search

Home