Each component g(x; μ_i, σ_i²) is a normal distribution, either univariate or multivariate.
Although GMMs are the most widely used mixture models, mixtures of exponential families such as Gamma, Beta, or Rayleigh distributions are common in some fields [13, 11].
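As a concrete illustration, the short sketch below evaluates a univariate mixture density f(x) = Σ ω_i g(x; μ_i, σ_i). It is only a sketch, assuming NumPy and SciPy are available; the function name gmm_density is illustrative and not taken from any library.

```python
import numpy as np
from scipy.stats import norm

def gmm_density(x, weights, means, sigmas):
    """Evaluate a univariate Gaussian mixture density at the points x.

    Illustrative sketch: each component is a normal pdf g(x; mu_i, sigma_i)
    weighted by its mixing proportion omega_i.
    """
    x = np.asarray(x, dtype=float)
    return sum(w * norm.pdf(x, loc=mu, scale=s)
               for w, mu, s in zip(weights, means, sigmas))

# Example: a two-component mixture evaluated at two points.
values = gmm_density([0.0, 1.5], weights=[0.3, 0.7],
                     means=[0.0, 2.0], sigmas=[1.0, 0.5])
```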
16.3.2 Getting Mixtures
We present here some well-known algorithms for building mixtures. For more details, please refer to the references cited in the following paragraphs.
Expectation-Maximization. The most common tool for estimating the parameters of a mixture model is the Expectation-Maximization (EM) algorithm [7]. It maximizes the likelihood of the density estimate by iteratively computing the expectation of the log-likelihood under the current estimate of the parameters (E step) and then updating the parameters to maximize this expected log-likelihood (M step). Although originally considered for Mixtures of Gaussians (MoGs), Expectation-Maximization has been extended by Banerjee et al. [1] to learn mixtures of arbitrary exponential families.
The pitfall is that this method converges only to a local maximum of the log-likelihood. Moreover, the number of components is difficult to choose.
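To make the E and M steps concrete, here is a minimal sketch of EM for a univariate MoG, assuming NumPy and SciPy; the initialization and stopping rule are deliberately simplistic, and the function name em_gmm_1d is hypothetical.

```python
import numpy as np
from scipy.stats import norm

def em_gmm_1d(data, k, n_iter=100):
    """Fit a k-component univariate Gaussian mixture with EM (sketch)."""
    data = np.asarray(data, dtype=float)
    rng = np.random.default_rng(0)
    weights = np.full(k, 1.0 / k)
    means = rng.choice(data, size=k, replace=False)   # naive initialization
    sigmas = np.full(k, data.std())
    for _ in range(n_iter):
        # E step: responsibility of each component for each point,
        # computed with the current parameter estimates.
        resp = weights * norm.pdf(data[:, None], means, sigmas)
        resp /= resp.sum(axis=1, keepdims=True)
        # M step: update the parameters to maximize the expected log-likelihood.
        nk = resp.sum(axis=0)
        weights = nk / data.size
        means = (resp * data[:, None]).sum(axis=0) / nk
        sigmas = np.sqrt((resp * (data[:, None] - means) ** 2).sum(axis=0) / nk)
    return weights, means, sigmas
```

In practice, the local-maximum issue mentioned above is usually mitigated by running EM from several random initializations and keeping the solution with the best log-likelihood.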
Dirichlet Process Mixtures. To avoid the problem of choosing the number of components, it has been proposed to use a mixture model with an infinite number of components. This can be done with a Dirichlet process mixture (DPM) [20], which uses a Dirichlet process to build priors for the mixing proportions of the components. If a finite mixture is needed, it is easy to sort the components according to their weights ω_i and to keep only the components above some threshold. The main drawback is that building the model requires evaluating a Dirichlet process using Markov chain Monte Carlo (for example, with the Metropolis algorithm), which is computationally costly.
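The truncation step described above is simple to implement. The following sketch (Python/NumPy, with a hypothetical helper truncate_mixture) sorts the components by weight, keeps those above a threshold, and renormalizes the remaining weights; it does not implement the MCMC inference itself.

```python
import numpy as np

def truncate_mixture(weights, params, threshold=1e-3):
    """Keep the components whose mixing weight exceeds the threshold,
    then renormalize the remaining weights (illustrative sketch)."""
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(weights)[::-1]                 # sort by decreasing weight
    keep = [i for i in order if weights[i] > threshold]
    new_weights = weights[keep] / weights[keep].sum() # renormalize to sum to 1
    new_params = [params[i] for i in keep]
    return new_weights, new_params
```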
Kernel Density Estimation. The kernel density estimator (KDE) [18] (also known as the Parzen windows method) avoids the problem of choosing the number of components by using one component (a Gaussian kernel) centered on each point of the dataset. All the components share the same weight, and since the μ_i parameters come directly from the data points, the only remaining parameters are the σ_i, which are all set to a constant called the bandwidth. The critical part of the algorithm is the choice of the bandwidth: many studies have addressed the automatic tuning of this parameter (see [25] for a comprehensive survey), but it can also be chosen by hand depending on the dataset. Since there is one Gaussian component per point in the dataset, a mixture built with a kernel density estimator is difficult to manipulate: its size is large, and common operations (evaluation of the density, random sampling, etc.) are slow since they require looping over all the components of the mixture.
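A minimal KDE sketch (Python with NumPy and SciPy; the function name kde is illustrative) makes this cost explicit: evaluating the density at a point requires summing over all n data points.

```python
import numpy as np
from scipy.stats import norm

def kde(x, data, bandwidth):
    """Kernel density estimate at the points x.

    One Gaussian kernel is centered on each data point; all kernels share
    the same weight 1/n and the same standard deviation (the bandwidth).
    """
    x = np.asarray(x, dtype=float)[:, None]        # shape (m, 1)
    data = np.asarray(data, dtype=float)[None, :]  # shape (1, n)
    return norm.pdf(x, loc=data, scale=bandwidth).mean(axis=1)
```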
Pros and Cons. The main drawbacks of the EM algorithm are the risk of converging to a local optimum and the number of iterations needed to reach that optimum. While this may be costly, the time is only spent during the learning step. On the other hand,