The K-means algorithm is a hard clustering algorithm, i.e., each subject is assigned to a single cluster with probability 1. Its counterpart, a soft clustering algorithm, assigns to each subject a set of probabilities, each representing the probability that the subject belongs to a given cluster. The probabilities assigned to a single subject are all non-negative and sum to 1. The soft clustering counterpart of K-means is the fuzzy K-means algorithm. Generally speaking, fuzzy K-means differs from K-means in that each subject is assigned a set of membership probabilities, and in each iteration the cluster centers are updated as weighted averages of all subjects in the dataset, where the weight of each subject is the probability that it belongs to the cluster [9]. Other examples of soft clustering algorithms include fuzzy clustering by local approximation of memberships (FLAME) [8] and neuro-fuzzy [25] algorithms.
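As a concrete illustration, the following Python sketch implements one common formulation of fuzzy K-means (often called fuzzy c-means). The Euclidean distance, the fuzzifier m, and the inverse-distance-ratio membership update are standard choices assumed here, not details specified in the text; the center update is the membership-weighted average described above.

import numpy as np

def fuzzy_kmeans(X, k, m=2.0, max_iter=100, tol=1e-6, seed=0):
    """Fuzzy K-means sketch: each row of U holds one subject's
    cluster-membership probabilities (non-negative, summing to 1)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    U = rng.random((n, k))
    U /= U.sum(axis=1, keepdims=True)              # rows are probabilities
    for _ in range(max_iter):
        W = U ** m                                 # fuzzified weights
        centers = (W.T @ X) / W.sum(axis=0)[:, None]  # weighted averages of all subjects
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        d = np.maximum(d, 1e-12)                   # guard against zero distances
        # standard membership update: u_ij = 1 / sum_l (d_ij / d_il)^(2/(m-1))
        U_new = 1.0 / ((d[:, :, None] / d[:, None, :]) ** (2 / (m - 1))).sum(axis=2)
        if np.abs(U_new - U).max() < tol:          # memberships have stabilized
            U = U_new
            break
        U = U_new
    return centers, U

# toy usage: two well-separated Gaussian blobs
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
centers, U = fuzzy_kmeans(X, k=2)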
5.3.2. E-M Algorithm
The E-M algorithm stands for the Expectation-Maximization algorithm. It has two steps: an Expectation step and a Maximization step. Under the assumption that the dataset is generated from a mixture of K multivariate normal distributions, the E-M algorithm works as follows in clustering analysis.
Expectation step (E-step): Estimate the mean vector and variance-covariance matrix of cluster j, denoted µ_j and Σ_j respectively, using the objects assigned to cluster j, j = 1, 2, ..., K. In the first iteration, randomly select K different vectors µ_j and K p × p positive-definite, symmetric matrices Σ_j as the initial estimates of the mean vectors and variance-covariance matrices.
Maximization step (M-step): For object i, calculate the likelihood L_ij that it is generated by the j-th multivariate normal distribution with mean µ_j and variance-covariance matrix Σ_j:

$$L_{ij} = \frac{1}{(2\pi)^{p/2}\,|\Sigma_j|^{1/2}} \exp\!\left(-\frac{1}{2}(x_i - \mu_j)^{\top} \Sigma_j^{-1} (x_i - \mu_j)\right) \qquad (5.16)$$

Assign cluster label CL_i to object i by maximum likelihood, i.e., CL_i = arg max_j L_ij.
If some stopping criterion is satisfied, the algorithm stops; otherwise it returns to the E-step. A frequently used stopping criterion is that the change in the µ_j's from the previous iteration does not exceed a threshold value.
In the first run of the E-step, the Σ_j's are usually chosen to be identity matrices.
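A minimal Python sketch of this procedure, under the hard-assignment reading of the steps above, is given below. It uses identity matrices as the initial covariances, computes the likelihoods of Eq. (5.16) with scipy, assigns labels by maximum likelihood, and stops when the µ_j's change little. Equal mixing proportions, the covariance regularization term, and the guard against near-empty clusters are our assumptions for numerical stability, not details from the text.

import numpy as np
from scipy.stats import multivariate_normal

def em_cluster(X, k, max_iter=100, tol=1e-4, seed=0):
    """E-M clustering sketch: hard assignment by maximum likelihood
    under a mixture of K multivariate normal distributions."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    mu = X[rng.choice(n, size=k, replace=False)].astype(float)  # K random mean vectors
    sigma = np.stack([np.eye(p)] * k)          # initial covariances: identity matrices
    for _ in range(max_iter):
        # L[i, j]: likelihood of object i under component j, Eq. (5.16)
        L = np.column_stack([multivariate_normal(mu[j], sigma[j]).pdf(X)
                             for j in range(k)])
        labels = L.argmax(axis=1)              # CL_i = argmax_j L_ij
        mu_new = mu.copy()
        for j in range(k):
            pts = X[labels == j]
            if len(pts) > p:                   # skip near-empty clusters (our guard)
                mu_new[j] = pts.mean(axis=0)
                sigma[j] = np.cov(pts.T) + 1e-6 * np.eye(p)  # small ridge for stability
        if np.abs(mu_new - mu).max() < tol:    # stopping rule on the mu updates
            mu = mu_new
            break
        mu = mu_new
    return labels, mu, sigma

# toy usage: two Gaussian blobs
X = np.vstack([np.random.randn(60, 2), np.random.randn(60, 2) + 4])
labels, mu, sigma = em_cluster(X, k=2)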
One disadvantage of the E-M algorithm is that it rests on the assumption that the data come from a mixture of distributions with unknown parameters, and this assumption is difficult to justify. There is a dilemma here: to justify the assumption, we have to know the clusters first and then run a test to see whether the distribution of each cluster satisfies the distributional assumption. This means that we have to obtain the clusters before we can verify the very assumption used to produce them.