hood (3.13) by a small positive weight λ < 1:

\[
\sum_{i=1}^{l} \log p(y_i \mid \theta)\, p(x_i \mid y_i, \theta) \;+\; \lambda \sum_{i=l+1}^{l+u} \log p(x_i \mid \theta). \qquad (3.21)
\]
As λ → 0, the influence of unlabeled data vanishes and one recovers the supervised learning objective.
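To make the down-weighted objective concrete, here is a minimal sketch of (3.21) for a two-class, one-dimensional Gaussian mixture. The parameterization theta = (pi, mu, sigma), the function name, and the default weight lam=0.1 are illustrative assumptions, not part of the text.

import numpy as np
from scipy.stats import norm

def weighted_log_likelihood(x_l, y_l, x_u, theta, lam=0.1):
    """Down-weighted semi-supervised objective (3.21) for a 1-D, two-class GMM.

    theta = (pi, mu, sigma): class priors, means, and standard deviations,
    indexed by class 0 and 1 (hypothetical parameterization).
    """
    pi, mu, sigma = theta
    # Labeled term: sum_{i=1}^{l} log p(y_i | theta) p(x_i | y_i, theta)
    labeled = np.sum(np.log(pi[y_l]) +
                     norm.logpdf(x_l, loc=mu[y_l], scale=sigma[y_l]))
    # Unlabeled term: sum_{i=l+1}^{l+u} log sum_y p(y | theta) p(x_i | y, theta)
    comp = np.stack([np.log(pi[k]) + norm.logpdf(x_u, mu[k], sigma[k])
                     for k in range(len(pi))])
    unlabeled = np.sum(np.logaddexp.reduce(comp, axis=0))
    # The unlabeled contribution is down-weighted by lam < 1.
    return labeled + lam * unlabeled

Setting lam=1 recovers the full log likelihood (3.13), while lam=0 discards the unlabeled data entirely, matching the limiting behavior described above.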
3.5 OTHER ISSUES IN GENERATIVE MODELS
When defining a generative model, identifiability is a desirable property. A model is identifiable if
p(x | θ₁) = p(x | θ₂) ⇐⇒ θ₁ = θ₂, up to a permutation of mixture component indices. That is, two models are considered equivalent if they differ only by which component is called component one, which is called component two, and so on. In other words, there is a unique (up to permutation) model θ that explains the observed unlabeled data. Therefore, as the size of the unlabeled data grows, one can hope to accurately recover the mixing components. For instance, GMMs are identifiable, while some other models are not. The following example shows an unidentifiable model and why it is not suitable for semi-supervised learning.
Example 3.8. An Unidentifiable Generative Model
Assume the component model p(x | y) is uniform for y ∈ {+1, −1}. Let us try to use semi-supervised learning to learn the mixture of uniform distributions. We are given a large amount of unlabeled data, such that we know p(x) is uniform in [0, 1]. We also have 2 labeled data points (0.1, −1), (0.9, +1). Can we determine the label for x = 0.5?
The answer turns out to be no. With our assumptions, we cannot distinguish the following
two models (and infinitely many others):
\[
p(y = -1) = 0.2, \quad p(x \mid y = -1) = \mathrm{unif}(0, 0.2), \quad p(x \mid y = 1) = \mathrm{unif}(0.2, 1) \qquad (3.22)
\]
\[
p(y = -1) = 0.6, \quad p(x \mid y = -1) = \mathrm{unif}(0, 0.6), \quad p(x \mid y = 1) = \mathrm{unif}(0.6, 1) \qquad (3.23)
\]
Both models are consistent with the unlabeled data and the labeled data, but the first model predicts
label y = 1 at x = 0.5, while the second model predicts y = −1. This is illustrated by Figure 3.4.
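As a quick numerical check of Example 3.8, the sketch below evaluates both candidate models (3.22) and (3.23): each yields the same flat marginal p(x) on [0, 1] and the same likelihood for the two labeled points, yet the two disagree on the label at x = 0.5. The variable names and the dictionary of models are illustrative only.

from scipy.stats import uniform

# Candidate models (3.22) and (3.23): each is (p(y=-1), split point s),
# with p(x | y=-1) = unif(0, s) and p(x | y=+1) = unif(s, 1).
models = {"(3.22)": (0.2, 0.2), "(3.23)": (0.6, 0.6)}
labeled = [(0.1, -1), (0.9, +1)]

for name, (p_neg, s) in models.items():
    neg, pos = uniform(0, s), uniform(s, 1 - s)   # scipy convention: uniform(loc, scale)
    px = lambda x: p_neg * neg.pdf(x) + (1 - p_neg) * pos.pdf(x)
    # Marginal p(x) equals 1 on [0, 1] for both models, as assumed.
    print(name, "p(x) at 0.1, 0.5, 0.9:", [float(px(v)) for v in (0.1, 0.5, 0.9)])
    # Joint likelihood of the labeled points is also identical across models.
    joint = 1.0
    for x, y in labeled:
        joint *= (p_neg * neg.pdf(x)) if y == -1 else ((1 - p_neg) * pos.pdf(x))
    print(name, "likelihood of labeled data:", float(joint))
    # The posterior p(y=-1 | x=0.5) determines the predicted label at x = 0.5.
    post_neg = p_neg * neg.pdf(0.5) / px(0.5)
    print(name, "predicts y =", -1 if post_neg > 0.5 else +1, "at x = 0.5")

Both models print identical marginals and labeled-data likelihoods but opposite predictions at x = 0.5, which is exactly the ambiguity that makes the model unidentifiable.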
Another issue with generative models is local optima. Even if the model is correct and iden-
tifiable, the log likelihood (3.13) as a function of model parameters θ is, in general, non-concave.
That is, there might be multiple “bumps” on the surface. The highest bump corresponds to the global
optimum , i.e., the desired MLE. The other bumps are local optima. The EM algorithm is prone to
being trapped in a local optimum. Such local optima might lead to inferior performance. A standard
practice against local optima is random restart , in which the EM algorithm is run multiple times.
Each time, EM starts from a different random initial parameter θ^(0). Finally, the log likelihood that
EM converges to in each run is compared, and the θ that corresponds to the best log likelihood is selected.
It is worth noting that random restart does not solve the local optima problem; it only alleviates it.
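A minimal sketch of the random-restart heuristic is given below. Here run_em stands in for any EM routine that returns converged parameters together with their log likelihood, and the initialization heuristic (means sampled from the data, equal weights, pooled variance) is an illustrative assumption rather than a prescription from the text.

import numpy as np

def em_with_random_restarts(x, run_em, n_components=2, n_restarts=10, seed=0):
    """Run EM from several random initial parameters theta^(0) and keep the
    run whose converged log likelihood is highest.

    run_em(x, theta0) is a placeholder EM implementation returning
    (theta_hat, log_likelihood).
    """
    rng = np.random.default_rng(seed)
    best_theta, best_ll = None, -np.inf
    for _ in range(n_restarts):
        # Random initialization theta^(0): one common heuristic draws the means
        # from the data and uses equal weights and a pooled variance.
        theta0 = {
            "weights": np.full(n_components, 1.0 / n_components),
            "means": rng.choice(x, size=n_components, replace=False),
            "vars": np.full(n_components, np.var(x)),
        }
        theta_hat, ll = run_em(x, theta0)
        if ll > best_ll:   # keep the best local optimum found so far
            best_theta, best_ll = theta_hat, ll
    return best_theta, best_ll

As noted above, this only raises the chance of finding the global optimum; none of the restarts is guaranteed to reach it.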