It is the probability of generating x from any of the classes. The marginal probabilities therefore
account for the fact that we know which unlabeled instances are present, but not which classes they
belong to. Semi-supervised learning in mixture models amounts to finding the MLE of (3.13).
The only difference between supervised and semi-supervised learning (for mixture models) is the
objective function being maximized. Intuitively, a semi-supervised MLE θ will need to fit both the
labeled and unlabeled instances. Therefore, we can expect it to be different from the supervised
MLE.
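For reference, equation (3.13) itself is not reproduced in this excerpt; under the standard generative mixture-model assumptions, the semi-supervised log likelihood it refers to has the generic form

    log p(D | θ) = Σ_{i=1}^{l} log p(y_i | θ) p(x_i | y_i, θ) + Σ_{i=l+1}^{l+u} log Σ_{y=1}^{C} p(y | θ) p(x_i | y, θ),

where C is the number of classes: the labeled instances contribute joint terms with their observed labels, while the unlabeled instances contribute marginal terms summed over all possible labels.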
The unobserved labels y_{l+1}, ..., y_{l+u} are called hidden variables. Unfortunately, hidden variables can make the log likelihood (3.13) non-concave and hard to optimize.³ Fortunately, there are several well-studied optimization methods which attempt to find a locally optimal θ. We will present a particularly important method, the Expectation Maximization (EM) algorithm, in the next section. For Gaussian Mixture Models (GMMs), Multinomial Mixture Models, HMMs, etc., the EM algorithm has been the de facto standard optimization technique to find an MLE when unlabeled data is present. The EM algorithm for HMMs even has its own name: the Baum-Welch algorithm.
3.3 OPTIMIZATION WITH THE EM ALGORITHM
Recall the observed data is D = {(x_1, y_1), ..., (x_l, y_l), x_{l+1}, ..., x_{l+u}}, and let the hidden data be H = {y_{l+1}, ..., y_{l+u}}. Let the model parameter be θ. The EM algorithm is an iterative procedure to find a θ that locally maximizes p(D | θ).
Algorithm 3.3. The Expectation Maximization (EM) Algorithm in General.
Input: observed data D, hidden data H, initial parameter θ^(0)
1. Initialize t = 0.
2. Repeat the following steps until p(D | θ^(t)) converges:
3.   E-step: compute q^(t)(H) ≡ p(H | D, θ^(t))
4.   M-step: find θ^(t+1) that maximizes Σ_H q^(t)(H) log p(D, H | θ^(t+1))
5.   t = t + 1
Output: θ^(t)
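As a concrete illustration, the following Python sketch specializes Algorithm 3.3 to a semi-supervised GMM with one Gaussian per class. The function name em_semisup_gmm, its interface, and the labeled-data initialization are illustrative assumptions rather than part of the text; the E-step computes the soft labels q^(t)(H) for the unlabeled data, and the M-step re-estimates θ in closed form.

import numpy as np
from scipy.stats import multivariate_normal

def em_semisup_gmm(X_l, y_l, X_u, n_classes, mu_init=None, n_iter=100, tol=1e-6):
    """EM for a semi-supervised GMM (one Gaussian per class); illustrative sketch.

    X_l: (l, d) labeled instances, y_l: (l,) integer labels in {0, ..., n_classes-1},
    X_u: (u, d) unlabeled instances. Returns (weights, means, covariances, log likelihood).
    """
    l, d = X_l.shape
    X = np.vstack([X_l, X_u])

    # theta^(0): estimate each class from the labeled data alone (one common heuristic).
    w = np.array([(y_l == c).mean() for c in range(n_classes)])
    mu = (np.array([X_l[y_l == c].mean(axis=0) for c in range(n_classes)])
          if mu_init is None else np.asarray(mu_init, dtype=float))
    Sigma = np.array([np.cov(X_l[y_l == c].T) + 1e-6 * np.eye(d)
                      for c in range(n_classes)])

    R_l = np.eye(n_classes)[y_l]  # labeled responsibilities are fixed one-hot vectors
    ll_old = -np.inf
    for t in range(n_iter):
        # p(y = c | theta) * p(x | y = c, theta) for every instance and every class
        joint = np.column_stack([
            w[c] * multivariate_normal.pdf(X, mean=mu[c], cov=Sigma[c])
            for c in range(n_classes)
        ])

        # log p(D | theta): labeled terms use the observed label,
        # unlabeled terms marginalize over the hidden label.
        ll = (np.log(joint[np.arange(l), y_l]).sum()
              + np.log(joint[l:].sum(axis=1)).sum())
        if ll - ll_old < tol:
            break
        ll_old = ll

        # E-step: soft labels q^(t)(H) = p(H | D, theta^(t)) for the unlabeled data.
        R_u = joint[l:] / joint[l:].sum(axis=1, keepdims=True)
        R = np.vstack([R_l, R_u])

        # M-step: closed-form weighted updates of the mixture parameters.
        Nc = R.sum(axis=0)
        w = Nc / Nc.sum()
        mu = (R.T @ X) / Nc[:, None]
        for c in range(n_classes):
            diff = X - mu[c]
            Sigma[c] = (R[:, c, None] * diff).T @ diff / Nc[c] + 1e-6 * np.eye(d)

    return w, mu, Sigma, ll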
We comment on a few important aspects of the EM algorithm:
• q^(t)(H) is the hidden label distribution. It can be thought of as assigning “soft labels” to the unlabeled data according to the current model θ^(t).
• It can be proven that EM improves the log likelihood log p(D | θ) at each iteration. However, EM only converges to a local optimum. That is, the θ it finds is only guaranteed to be the best within a neighborhood in the parameter space; θ may not be a global optimum (the desired MLE).
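Because of this local-optimum issue, a common practical remedy is to run EM from several different initializations and keep the run with the highest log likelihood. A minimal sketch, reusing the illustrative em_semisup_gmm function above (not from the text) and perturbing only its initial means:

import numpy as np

def em_with_restarts(X_l, y_l, X_u, n_classes, n_restarts=5, scale=0.5, seed=0):
    # Each restart perturbs the labeled-class means used as theta^(0); the run
    # with the highest final log likelihood log p(D | theta) is kept.
    rng = np.random.default_rng(seed)
    mu0 = np.array([X_l[y_l == c].mean(axis=0) for c in range(n_classes)])
    best = None
    for _ in range(n_restarts):
        mu_init = mu0 + scale * rng.standard_normal(mu0.shape)
        w, mu, Sigma, ll = em_semisup_gmm(X_l, y_l, X_u, n_classes, mu_init=mu_init)
        if best is None or ll > best[3]:
            best = (w, mu, Sigma, ll)
    return best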
³ The reader is invited to form the Lagrangian and solve the zero partial derivative equations. The parameters will end up on both sides of the equation, and there will be no analytical solution for the MLE θ.