The M-step updates the model parameters using the current γ values as weights on the unlabeled
instances. If we think of the E-step as creating fractional labeled instances split between the classes,
then the M-step simply computes new MLE parameters using these fractional instances and the
labeled data. The algorithm stops when the log likelihood (3.13) converges (i.e., stops changing
from one iteration to the next). The data log likelihood in the case of a mixture of two Gaussians is
\[
\log p(\mathcal{D} \mid \theta) \;=\; \sum_{i=1}^{l} \log \pi_{y_i}\, N(\mathbf{x}_i;\, \mu_{y_i}, \Sigma_{y_i}) \;+\; \sum_{i=l+1}^{l+u} \log \sum_{j=1}^{2} \pi_j\, N(\mathbf{x}_i;\, \mu_j, \Sigma_j), \qquad (3.20)
\]
where we have marginalized over the two classes for the unlabeled data.
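To make the E-step/M-step alternation concrete, the following is a minimal sketch of this EM procedure for a two-class Gaussian mixture using both labeled and unlabeled data, stopping when the log likelihood (3.20) converges. The function name, the initialization from the labeled data alone, and the small covariance regularization are illustrative assumptions, not choices prescribed by the text.

    import numpy as np
    from scipy.stats import multivariate_normal

    def semi_supervised_em(X_l, y_l, X_u, max_iter=100, tol=1e-6):
        # EM for a two-class Gaussian mixture with labeled data (X_l, y_l)
        # and unlabeled data X_u. Names and initialization are illustrative.
        classes, d = np.array([0, 1]), X_l.shape[1]
        # Initialize with the MLE on the labeled data alone.
        pi = np.array([(y_l == c).mean() for c in classes])
        mu = np.array([X_l[y_l == c].mean(axis=0) for c in classes])
        Sigma = np.array([np.cov(X_l[y_l == c].T) + 1e-6 * np.eye(d) for c in classes])
        X = np.vstack([X_l, X_u])
        prev_ll = -np.inf
        for _ in range(max_iter):
            # E-step: responsibilities gamma_ij = p(y = j | x_i, theta) for each
            # unlabeled x_i, i.e., fractional instances split between the classes.
            dens = np.column_stack(
                [pi[j] * multivariate_normal.pdf(X_u, mu[j], Sigma[j]) for j in classes])
            gamma = dens / dens.sum(axis=1, keepdims=True)
            # M-step: weighted MLE over the labeled data plus the fractional
            # unlabeled instances, using the gamma values as weights.
            for j in classes:
                w = np.concatenate([(y_l == j).astype(float), gamma[:, j]])
                pi[j] = w.mean()
                mu[j] = (w[:, None] * X).sum(axis=0) / w.sum()
                diff = X - mu[j]
                Sigma[j] = (w[:, None, None] *
                            np.einsum('ni,nj->nij', diff, diff)).sum(axis=0) / w.sum()
                Sigma[j] += 1e-6 * np.eye(d)
            # Data log likelihood (3.20): labeled term plus marginalized unlabeled term.
            lab = np.sum([np.log(pi[y] * multivariate_normal.pdf(x, mu[y], Sigma[y]))
                          for x, y in zip(X_l, y_l)])
            unl = np.sum(np.log(np.column_stack(
                [pi[j] * multivariate_normal.pdf(X_u, mu[j], Sigma[j]) for j in classes]
            ).sum(axis=1)))
            ll = lab + unl
            if ll - prev_ll < tol:   # stop when the log likelihood stops changing
                break
            prev_ll = ll
        return pi, mu, Sigma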
It is instructive to note the similarity between EM and self-training. EM can be viewed as a
special form of self-training, where the current classifier θ would label the unlabeled instances with
all possible labels, but each with fractional weights p(H | D, θ). Then all these augmented unlabeled data, instead of the top few most confident ones, are used to update the classifier.
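The contrast lies in how the classifier's posteriors over the unlabeled instances are turned into training weights. The short sketch below (the helper names and the cutoff k are hypothetical, chosen only for illustration) keeps every unlabeled instance with soft fractional weights in the EM style, versus keeping only the few most confident instances with hard labels in the self-training style.

    import numpy as np

    def em_style_weights(gamma):
        # EM: every unlabeled instance is kept for every class, weighted by its
        # fractional responsibility p(y = j | x_i, theta).
        return gamma

    def self_training_style_weights(gamma, k=10):
        # Self-training: keep only the k most confident unlabeled instances
        # (k is an illustrative parameter) and give each a hard 0/1 label.
        top = np.argsort(-gamma.max(axis=1))[:k]
        hard = np.zeros_like(gamma)
        hard[top, gamma[top].argmax(axis=1)] = 1.0
        return hard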
3.4 THE ASSUMPTIONS OF MIXTURE MODELS
Mixture models provide a framework for semi-supervised learning in which the role of unlabeled
data is clear. In practice, this form of semi-supervised learning can be highly effective if the generative
model is (nearly) correct. It is worth noting the assumption made here:
Remark 3.6. Mixture Model Assumption The data actually comes from the mixture model, where the number of components, the prior p(y), and the conditional p(x | y) are all correct.
Unfortunately, it can be difficult to assess model correctness, since we do not have much labeled data. In practice, one often chooses a generative model based on domain knowledge and/or mathematical convenience. However, if the model is wrong, semi-supervised learning can actually hurt performance. In that case, one may be better off using only the labeled data and performing supervised learning instead. The following example shows the effect of an incorrect model.
Example 3.7. An Incorrect Generative Model Suppose a dataset contains four clusters of data, two from each class. This dataset is shown in Figure 3.2. The correct decision boundary is a horizontal line along the x-axis. Clearly, the data is not generated from two Gaussians. If we insist on modeling each class with a single Gaussian, the results may be poor. Figure 3.3 illustrates this point by comparing two possible GMMs fit to this data. In panel (a), the learned model fits the unlabeled data quite well (it has high log likelihood), but predictions made with this model will incur approximately 50% error. In contrast, the model shown in panel (b) leads to much better accuracy. However, (b) would not be favored by the EM algorithm, since it has a lower log likelihood.
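A small simulation can reproduce this effect. The four-cluster layout below is hypothetical, chosen in the spirit of Figure 3.2 (the exact cluster positions, spreads, and sample sizes are assumptions); the point is only that the higher-likelihood two-Gaussian fit, analogous to panel (a), ends up near 50% error.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)

    # Hypothetical four-cluster layout: the positive class has clusters near
    # (-3, 1.5) and (3, 1.5); the negative class near (-3, -1.5) and (3, -1.5).
    # Clusters that share an x-position are closer together than same-class clusters.
    def cluster(center, n=200):
        return rng.normal(center, scale=[0.5, 0.6], size=(n, 2))

    X_pos = np.vstack([cluster([-3, 1.5]), cluster([3, 1.5])])
    X_neg = np.vstack([cluster([-3, -1.5]), cluster([3, -1.5])])
    X = np.vstack([X_pos, X_neg])
    y = np.array([1] * len(X_pos) + [0] * len(X_neg))

    # One Gaussian per mixture component: EM tends to pair the two left clusters
    # and the two right clusters, since that fit has higher likelihood (panel (a)).
    gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
    pred = gmm.predict(X)

    # Align component ids with classes as favorably as possible, then measure error.
    err = min(np.mean(pred != y), np.mean(pred != 1 - y))
    print(f"error of the high-likelihood two-Gaussian fit: {err:.2f}")  # roughly 0.5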
As mentioned above, we may be better off using only labeled data and supervised learning
in this case. If we have labeled data in the bottom left cluster and top right cluster, the supervised