5.4.1. Model-based Method
The model-based method assumes that the dataset consists of samples from a
mixture of populations, and that each population follows a probability
distribution of a determined form with unknown parameters. The pdf of the
mixture model is:
$$f(x) = \sum_{j=1}^{K} f(x \mid x \in C_j)\, p(x \in C_j) \tag{5.23}$$

where $f(x \mid x \in C_j)$ is the pdf of vector $x$ conditional on $x$ being
generated by class $j$, denoted by $C_j$; $p(x \in C_j)$ is the probability
that vector $x$ is generated by $C_j$; and $K$ is the number of assumed mixed
probability populations.
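To make Eq. 5.23 concrete, the following is a minimal sketch that evaluates a mixture pdf for scalar data. It assumes univariate Gaussian components, which the text does not require; the function names and the example parameter values are hypothetical and chosen only for illustration.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a univariate normal distribution at x (assumed component form)."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def mixture_pdf(x, means, stds, weights):
    """Evaluate f(x) = sum_j f(x | x in C_j) * p(x in C_j), as in Eq. 5.23."""
    return sum(w * gaussian_pdf(x, m, s) for m, s, w in zip(means, stds, weights))


# Illustrative two-population mixture; the weights p(x in C_j) must sum to 1.
means = [0.0, 5.0]
stds = [1.0, 2.0]
weights = [0.3, 0.7]
density_at_zero = mixture_pdf(0.0, means, stds, weights)
```

Each component contributes its own density scaled by its mixing probability, so a point near the first mean is dominated by the first term even though that population has the smaller weight.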
The model-based method for determining the number of clusters works as follows:
(1) Assume the form of the pdf $f(x \mid x \in C_j)$. Usually, we take the same
form of function for $f(x \mid x \in C_j)$, $j = 1, 2, \ldots, K$. We denote the
model parameters of $f(x \mid x \in C_j)$ as $\theta_j$. For
$f(x \mid x \in C_i)$ and $f(x \mid x \in C_j)$, if $i \neq j$, then
$\theta_i \neq \theta_j$.
(2) Let $K$ take values from 1 to $K_{\max}$, where $K_{\max}$ is a large
integer. For each $K$, we cluster the dataset into $K$ clusters. For cluster
$j$, $j = 1$ to $K$, we calculate the maximum likelihood estimates of
$\theta_j$, denoted as $\hat{\theta}_j$. The probability $p(x \in C_j)$ is
calculated by:

$$p(x \in C_j) = \frac{\sum_{i=1}^{N} I(x_i \in C_j)}{N} \tag{5.24}$$

where $I(\cdot)$ is the indicator function and $N$ is the number of
observations. It is just the fraction of the observations assigned to cluster $j$.
(3) For each $K$, calculate the adjusted log-likelihood value by:

$$l(K) = 2 \sum_{i=1}^{N} \log\left( \sum_{j=1}^{K} f(x_i \mid x_i \in C_j, \hat{\theta}_j)\, p(x_i \in C_j) \right) - g(K) \tag{5.25}$$

where $f(x_i \mid x_i \in C_j, \hat{\theta}_j)$ is the pdf of $x_i$ given that
$x_i$ is generated by class $C_j$ and the model parameters are $\hat{\theta}_j$.
Function $g(K)$ is a penalty function. It is a monotonically increasing function
of $K$. Fraley and Raftery [10] suggest $g(K) = m_K \log(N)$, where $m_K$ is the
number of independent parameters to be estimated in the mixture model defined in
Eq. 5.23.
(4) Choose the $K$ corresponding to the largest $l(K)$ as the number of
clusters, denoted as $\hat{K}$, i.e.,

$$\hat{K} = \arg\max_{K} \left( l(K),\ K = 1, 2, \ldots, K_{\max} \right) \tag{5.26}$$
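The four steps can be sketched end to end for scalar data. This is an illustration under stated assumptions, not the authors' implementation: the components are taken to be univariate Gaussians, the clustering step is a simple 1-D k-means (the text does not say which clustering algorithm to use), and the names `kmeans_1d`, `adjusted_loglik`, `choose_k`, `k_max`, and `min_std` are hypothetical. The parameter count 3K - 1 is specific to this Gaussian assumption.

```python
import numpy as np


def kmeans_1d(x, k, n_iter=50):
    """Hard-assign 1-D points to k clusters (assumed clustering step)."""
    # Deterministic start: spread the initial centers over the data quantiles.
    centers = np.quantile(x, (np.arange(k) + 0.5) / k)
    for _ in range(n_iter):
        labels = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = x[labels == j].mean()
    return labels


def adjusted_loglik(x, labels, k, min_std=1e-3):
    """l(K) of Eq. 5.25 with the penalty g(K) = m_K * log(N)."""
    n = x.size
    mix = np.zeros(n)
    for j in range(k):
        members = x[labels == j]
        if members.size == 0:
            continue  # an empty cluster has weight 0 and adds nothing
        weight = members.size / n          # Eq. 5.24
        mean = members.mean()              # MLE of the mean
        std = max(members.std(), min_std)  # MLE of the std, floored to avoid 0
        z = (x - mean) / std
        mix += weight * np.exp(-0.5 * z ** 2) / (std * np.sqrt(2.0 * np.pi))
    m_k = 3 * k - 1  # k means + k variances + (k - 1) free mixing weights
    return 2.0 * np.log(mix).sum() - m_k * np.log(n)


def choose_k(x, k_max):
    """Return the K in 1..k_max with the largest l(K), plus all scores (Eq. 5.26)."""
    scores = {k: adjusted_loglik(x, kmeans_1d(x, k), k) for k in range(1, k_max + 1)}
    return max(scores, key=scores.get), scores


# Illustrative data: two well-separated groups drawn with a fixed seed.
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(0.0, 1.0, 200), rng.normal(10.0, 1.0, 200)])
best_k, all_scores = choose_k(data, k_max=5)
```

Because the penalty grows with K while the likelihood gain from splitting an already well-fitted cluster is small, the score tends to peak at a moderate K rather than at `k_max`. The floor on the standard deviation is a practical guard added here, since a cluster with a single point would otherwise give an unbounded likelihood.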
Let us review Eq. 5.25 in step (3). We can find that Eq. 5.25 is very similar
to the Bayesian Information Criterion (BIC) when we determine the best model