Each component g(x; μ_i, σ_i²) is a normal distribution, either univariate or multivariate.
Although GMMs are the most widely used mixture models, mixtures of exponential families such as Gamma, Beta, or Rayleigh distributions are common in some fields [13, 11].
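As a concrete illustration, the short sketch below evaluates a univariate mixture density f(x) = Σ ω_i g(x; μ_i, σ_i). It is only a sketch, assuming NumPy and SciPy are available; the function name gmm_density is illustrative and not taken from any library.

```python
import numpy as np
from scipy.stats import norm

def gmm_density(x, weights, means, sigmas):
    """Evaluate a univariate Gaussian mixture density at the points x.

    Illustrative sketch: each component is a normal pdf g(x; mu_i, sigma_i)
    weighted by its mixing proportion omega_i.
    """
    x = np.asarray(x, dtype=float)
    return sum(w * norm.pdf(x, loc=mu, scale=s)
               for w, mu, s in zip(weights, means, sigmas))

# Example: a two-component mixture evaluated at two points.
values = gmm_density([0.0, 1.5], weights=[0.3, 0.7],
                     means=[0.0, 2.0], sigmas=[1.0, 0.5])
```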
16.3.2 Getting Mixtures
We present here some well-known algorithms for building mixtures. For more details, please refer to the references cited in the following paragraphs.
Expectation-Maximization. The most common tool for estimating the parameters of a mixture model is the Expectation-Maximization (EM) algorithm [7]. It maximizes the likelihood of the density estimate by iteratively computing the expectation of the log-likelihood under the current estimate of the parameters (E step) and then updating the parameters to maximize this expected log-likelihood (M step). Although originally considered for Mixtures of Gaussians (MoGs), Expectation-Maximization has been extended by Banerjee et al. [1] to learn mixtures of arbitrary exponential families.
The pitfall is that this method converges only to a local maximum of the log-likelihood. Moreover, the number of components is difficult to choose.
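To make the E and M steps concrete, here is a minimal sketch of EM for a univariate MoG, assuming NumPy and SciPy; the initialization and stopping rule are deliberately simplistic, and the function name em_gmm_1d is hypothetical.

```python
import numpy as np
from scipy.stats import norm

def em_gmm_1d(data, k, n_iter=100):
    """Fit a k-component univariate Gaussian mixture with EM (sketch)."""
    data = np.asarray(data, dtype=float)
    rng = np.random.default_rng(0)
    weights = np.full(k, 1.0 / k)
    means = rng.choice(data, size=k, replace=False)   # naive initialization
    sigmas = np.full(k, data.std())
    for _ in range(n_iter):
        # E step: responsibility of each component for each point,
        # computed with the current parameter estimates.
        resp = weights * norm.pdf(data[:, None], means, sigmas)
        resp /= resp.sum(axis=1, keepdims=True)
        # M step: update the parameters to maximize the expected log-likelihood.
        nk = resp.sum(axis=0)
        weights = nk / data.size
        means = (resp * data[:, None]).sum(axis=0) / nk
        sigmas = np.sqrt((resp * (data[:, None] - means) ** 2).sum(axis=0) / nk)
    return weights, means, sigmas
```

In practice, the local-maximum issue mentioned above is usually mitigated by running EM from several random initializations and keeping the solution with the best log-likelihood.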
Dirichlet Process Mixtures. To avoid the problem of choosing the number of components, it has been proposed to use a mixture model with an infinite number of components. This can be done with a Dirichlet process mixture (DPM) [20], which uses a Dirichlet process to build priors for the mixing proportions of the components. If a finite mixture is needed, it is easy to sort the components according to their weights ω_i and to keep only the components above some threshold. The main drawback is that building the model requires evaluating a Dirichlet process using Markov chain Monte Carlo (for example, with the Metropolis algorithm), which is computationally costly.
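The truncation step described above is simple to implement. The following sketch (Python/NumPy, with a hypothetical helper truncate_mixture) sorts the components by weight, keeps those above a threshold, and renormalizes the remaining weights; it does not implement the MCMC inference itself.

```python
import numpy as np

def truncate_mixture(weights, params, threshold=1e-3):
    """Keep the components whose mixing weight exceeds the threshold,
    then renormalize the remaining weights (illustrative sketch)."""
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(weights)[::-1]                 # sort by decreasing weight
    keep = [i for i in order if weights[i] > threshold]
    new_weights = weights[keep] / weights[keep].sum() # renormalize to sum to 1
    new_params = [params[i] for i in keep]
    return new_weights, new_params
```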
Kernel Density Estimation. The kernel density estimator (KDE) [18] (also known as the Parzen windows method) avoids the problem of choosing the number of components by using one component (a Gaussian kernel) centered on each point of the dataset. All the components share the same weight, and since the μ_i parameters come directly from the data points, the only remaining parameters are the σ_i, which are all set to a constant called the bandwidth. The critical part of the algorithm is the choice of the bandwidth: many studies have addressed the automatic tuning of this parameter (see [25] for a comprehensive survey), but it can also be chosen by hand depending on the dataset. Since there is one Gaussian component per point in the dataset, a mixture built with a kernel density estimator is difficult to manipulate: its size is large, and common operations (evaluation of the density, random sampling, etc.) are slow since they require looping over all the components of the mixture.
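A minimal KDE sketch (Python with NumPy and SciPy; the function name kde is illustrative) makes this cost explicit: evaluating the density at a point requires summing over all n data points.

```python
import numpy as np
from scipy.stats import norm

def kde(x, data, bandwidth):
    """Kernel density estimate at the points x.

    One Gaussian kernel is centered on each data point; all kernels share
    the same weight 1/n and the same standard deviation (the bandwidth).
    """
    x = np.asarray(x, dtype=float)[:, None]        # shape (m, 1)
    data = np.asarray(data, dtype=float)[None, :]  # shape (1, n)
    return norm.pdf(x, loc=data, scale=bandwidth).mean(axis=1)
```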
Pros and Cons. The main drawbacks of the EM algorithm are the risk of converging to a local optimum and the number of iterations needed to reach that optimum. While this may be costly, the time is only spent during the learning step. On the other hand,