The K-means algorithm is a hard clustering algorithm, i.e., each subject is assigned to a single cluster with probability 1. Its counterpart, a soft clustering algorithm, assigns to each subject a set of probabilities, each representing the probability that the subject belongs to a given cluster. The probabilities assigned to a single subject are all non-negative and sum to 1. The soft clustering counterpart of K-means is the fuzzy K-means algorithm. Generally speaking, fuzzy K-means differs from K-means in that each subject is assigned a set of membership probabilities, and in each iteration the cluster centers are updated as weighted averages of all subjects in the dataset, where the weight of each subject is the probability that it belongs to the cluster [9]. Other examples of soft clustering algorithms include fuzzy clustering by local approximation of memberships (FLAME) [8] and neuro-fuzzy [25] algorithms.
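As a concrete illustration, the following Python sketch implements one common formulation of fuzzy K-means (often called fuzzy c-means). The Euclidean distance, the fuzzifier m, and the inverse-distance-ratio membership update are standard choices assumed here, not details specified in the text; the center update is the membership-weighted average described above.

import numpy as np

def fuzzy_kmeans(X, k, m=2.0, max_iter=100, tol=1e-6, seed=0):
    """Fuzzy K-means sketch: each row of U holds one subject's
    cluster-membership probabilities (non-negative, summing to 1)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    U = rng.random((n, k))
    U /= U.sum(axis=1, keepdims=True)              # rows are probabilities
    for _ in range(max_iter):
        W = U ** m                                 # fuzzified weights
        centers = (W.T @ X) / W.sum(axis=0)[:, None]  # weighted averages of all subjects
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        d = np.maximum(d, 1e-12)                   # guard against zero distances
        # standard membership update: u_ij = 1 / sum_l (d_ij / d_il)^(2/(m-1))
        U_new = 1.0 / ((d[:, :, None] / d[:, None, :]) ** (2 / (m - 1))).sum(axis=2)
        if np.abs(U_new - U).max() < tol:          # memberships have stabilized
            U = U_new
            break
        U = U_new
    return centers, U

# toy usage: two well-separated Gaussian blobs
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
centers, U = fuzzy_kmeans(X, k=2)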
5.3.2. E-M Algorithm
The E-M algorithm stands for the Expectation-Maximization algorithm. It has two steps: an Expectation step and a Maximization step. Under the assumption that the dataset is generated from a mixture of K multivariate normal distributions, the E-M algorithm works as follows in clustering analysis.
Expectation step (E-step): Estimate the mean vector and variance-covariance matrix of cluster j, denoted µ_j and Σ_j respectively, using the objects assigned to cluster j, j = 1, 2, ..., K. In the first iteration, randomly select K different vectors µ_j and K p × p positive-definite, symmetric matrices Σ_j as the initial estimates of the mean vectors and variance-covariance matrices.
Maximization step (M-step): For object i, calculate the likelihood L_ij that it is generated by the j-th multivariate normal distribution with mean µ_j and variance-covariance matrix Σ_j:

$$L_{ij} = \frac{1}{(2\pi)^{p/2}\,|\Sigma_j|^{1/2}} \exp\!\left(-\frac{1}{2}(x_i - \mu_j)^{\top} \Sigma_j^{-1} (x_i - \mu_j)\right) \qquad (5.16)$$

Assign cluster label CL_i to object i by maximum likelihood, i.e., CL_i = arg max_j L_ij.
If some stopping criterion is satisfied, the algorithm stops; otherwise it returns to the E-step. A frequently used stopping criterion is that the change in the µ_j's from the previous iteration does not exceed a threshold value.
In the first run of the E-step, the Σ_j's are usually chosen to be identity matrices.
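A minimal Python sketch of this procedure, under the hard-assignment reading of the steps above, is given below. It uses identity matrices as the initial covariances, computes the likelihoods of Eq. (5.16) with scipy, assigns labels by maximum likelihood, and stops when the µ_j's change little. Equal mixing proportions, the covariance regularization term, and the guard against near-empty clusters are our assumptions for numerical stability, not details from the text.

import numpy as np
from scipy.stats import multivariate_normal

def em_cluster(X, k, max_iter=100, tol=1e-4, seed=0):
    """E-M clustering sketch: hard assignment by maximum likelihood
    under a mixture of K multivariate normal distributions."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    mu = X[rng.choice(n, size=k, replace=False)].astype(float)  # K random mean vectors
    sigma = np.stack([np.eye(p)] * k)          # initial covariances: identity matrices
    for _ in range(max_iter):
        # L[i, j]: likelihood of object i under component j, Eq. (5.16)
        L = np.column_stack([multivariate_normal(mu[j], sigma[j]).pdf(X)
                             for j in range(k)])
        labels = L.argmax(axis=1)              # CL_i = argmax_j L_ij
        mu_new = mu.copy()
        for j in range(k):
            pts = X[labels == j]
            if len(pts) > p:                   # skip near-empty clusters (our guard)
                mu_new[j] = pts.mean(axis=0)
                sigma[j] = np.cov(pts.T) + 1e-6 * np.eye(p)  # small ridge for stability
        if np.abs(mu_new - mu).max() < tol:    # stopping rule on the mu updates
            mu = mu_new
            break
        mu = mu_new
    return labels, mu, sigma

# toy usage: two Gaussian blobs
X = np.vstack([np.random.randn(60, 2), np.random.randn(60, 2) + 4])
labels, mu, sigma = em_cluster(X, k=2)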
One disadvantage of the E-M algorithm is that it rests on the assumption that the data come from a mixture of distributions with unknown parameters, and this assumption is difficult to justify. There is a dilemma here: to justify the assumption, we have to know the clusters first and then run a test to see whether the distribution of each cluster satisfies the distributional assumption. This means that we have to obtain the clusters before we can verify the very assumption used to produce them.