hood (3.13) by a small positive weight λ < 1:

\[
\sum_{i=1}^{l} \log p(y_i \mid \theta)\, p(x_i \mid y_i, \theta) \;+\; \lambda \sum_{i=l+1}^{l+u} \log p(x_i \mid \theta). \qquad (3.21)
\]
As λ → 0, the influence of unlabeled data vanishes and one recovers the supervised learning objective.
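To make the down-weighted objective concrete, here is a minimal sketch of (3.21) for a two-class, one-dimensional Gaussian mixture. The parameterization theta = (pi, mu, sigma), the function name, and the default weight lam=0.1 are illustrative assumptions, not part of the text.

import numpy as np
from scipy.stats import norm

def weighted_log_likelihood(x_l, y_l, x_u, theta, lam=0.1):
    """Down-weighted semi-supervised objective (3.21) for a 1-D, two-class GMM.

    theta = (pi, mu, sigma): class priors, means, and standard deviations,
    indexed by class 0 and 1 (hypothetical parameterization).
    """
    pi, mu, sigma = theta
    # Labeled term: sum_{i=1}^{l} log p(y_i | theta) p(x_i | y_i, theta)
    labeled = np.sum(np.log(pi[y_l]) +
                     norm.logpdf(x_l, loc=mu[y_l], scale=sigma[y_l]))
    # Unlabeled term: sum_{i=l+1}^{l+u} log sum_y p(y | theta) p(x_i | y, theta)
    comp = np.stack([np.log(pi[k]) + norm.logpdf(x_u, mu[k], sigma[k])
                     for k in range(len(pi))])
    unlabeled = np.sum(np.logaddexp.reduce(comp, axis=0))
    # The unlabeled contribution is down-weighted by lam < 1.
    return labeled + lam * unlabeled

Setting lam=1 recovers the full log likelihood (3.13), while lam=0 discards the unlabeled data entirely, matching the limiting behavior described above.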
3.5 OTHER ISSUES IN GENERATIVE MODELS
When defining a generative model, identifiability is a desirable property. A model is identifiable if
p(x | θ₁) = p(x | θ₂) ⇐⇒ θ₁ = θ₂, up to a permutation of mixture component indices. That is, two models are considered equivalent if they differ only by which component is called component one, which is called component two, and so on. In other words, there is a unique (up to permutation) model θ that explains the observed unlabeled data. Therefore, as the size of the unlabeled data grows, one can hope to accurately recover the mixing components. For instance, GMMs are identifiable, while some other models are not. The following example shows an unidentifiable model and why it is not suitable for semi-supervised learning.
Example 3.8. An Unidentifiable Generative Model
Assume the component model p(x | y) is uniform for y ∈ {+1, −1}. Let us try to use semi-supervised learning to learn the mixture of uniform distributions. We are given a large amount of unlabeled data, such that we know p(x) is uniform in [0, 1]. We also have 2 labeled data points (0.1, −1), (0.9, +1). Can we determine the label for x = 0.5?
The answer turns out to be no. With our assumptions, we cannot distinguish the following
two models (and infinitely many others):
\[
p(y = -1) = 0.2, \quad p(x \mid y = -1) = \mathrm{unif}(0, 0.2), \quad p(x \mid y = 1) = \mathrm{unif}(0.2, 1) \qquad (3.22)
\]
\[
p(y = -1) = 0.6, \quad p(x \mid y = -1) = \mathrm{unif}(0, 0.6), \quad p(x \mid y = 1) = \mathrm{unif}(0.6, 1) \qquad (3.23)
\]
Both models are consistent with the unlabeled data and the labeled data, but the first model predicts
label y = 1 at x = 0.5, while the second model predicts y = −1. This is illustrated by Figure 3.4.
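As a quick numerical check of Example 3.8, the sketch below evaluates both candidate models (3.22) and (3.23): each yields the same flat marginal p(x) on [0, 1] and the same likelihood for the two labeled points, yet the two disagree on the label at x = 0.5. The variable names and the dictionary of models are illustrative only.

from scipy.stats import uniform

# Candidate models (3.22) and (3.23): each is (p(y=-1), split point s),
# with p(x | y=-1) = unif(0, s) and p(x | y=+1) = unif(s, 1).
models = {"(3.22)": (0.2, 0.2), "(3.23)": (0.6, 0.6)}
labeled = [(0.1, -1), (0.9, +1)]

for name, (p_neg, s) in models.items():
    neg, pos = uniform(0, s), uniform(s, 1 - s)   # scipy convention: uniform(loc, scale)
    px = lambda x: p_neg * neg.pdf(x) + (1 - p_neg) * pos.pdf(x)
    # Marginal p(x) equals 1 on [0, 1] for both models, as assumed.
    print(name, "p(x) at 0.1, 0.5, 0.9:", [float(px(v)) for v in (0.1, 0.5, 0.9)])
    # Joint likelihood of the labeled points is also identical across models.
    joint = 1.0
    for x, y in labeled:
        joint *= (p_neg * neg.pdf(x)) if y == -1 else ((1 - p_neg) * pos.pdf(x))
    print(name, "likelihood of labeled data:", float(joint))
    # The posterior p(y=-1 | x=0.5) determines the predicted label at x = 0.5.
    post_neg = p_neg * neg.pdf(0.5) / px(0.5)
    print(name, "predicts y =", -1 if post_neg > 0.5 else +1, "at x = 0.5")

Both models print identical marginals and labeled-data likelihoods but opposite predictions at x = 0.5, which is exactly the ambiguity that makes the model unidentifiable.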
Another issue with generative models is local optima. Even if the model is correct and iden-
tifiable, the log likelihood (3.13) as a function of model parameters θ is, in general, non-concave.
That is, there might be multiple “bumps” on the surface. The highest bump corresponds to the global
optimum , i.e., the desired MLE. The other bumps are local optima. The EM algorithm is prone to
being trapped in a local optimum. Such local optima might lead to inferior performance. A standard
practice against local optima is random restart , in which the EM algorithm is run multiple times.
Each time, EM starts from a different random initial parameter θ^(0). Finally, the log likelihood that
EM converges to in each run is compared, and the θ that corresponds to the best log likelihood is selected.
It is worth noting that random restart does not solve the local optima problem; it only alleviates it.
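A minimal sketch of the random-restart heuristic is given below. Here run_em stands in for any EM routine that returns converged parameters together with their log likelihood, and the initialization heuristic (means sampled from the data, equal weights, pooled variance) is an illustrative assumption rather than a prescription from the text.

import numpy as np

def em_with_random_restarts(x, run_em, n_components=2, n_restarts=10, seed=0):
    """Run EM from several random initial parameters theta^(0) and keep the
    run whose converged log likelihood is highest.

    run_em(x, theta0) is a placeholder EM implementation returning
    (theta_hat, log_likelihood).
    """
    rng = np.random.default_rng(seed)
    best_theta, best_ll = None, -np.inf
    for _ in range(n_restarts):
        # Random initialization theta^(0): one common heuristic draws the means
        # from the data and uses equal weights and a pooled variance.
        theta0 = {
            "weights": np.full(n_components, 1.0 / n_components),
            "means": rng.choice(x, size=n_components, replace=False),
            "vars": np.full(n_components, np.var(x)),
        }
        theta_hat, ll = run_em(x, theta0)
        if ll > best_ll:   # keep the best local optimum found so far
            best_theta, best_ll = theta_hat, ll
    return best_theta, best_ll

As noted above, this only raises the chance of finding the global optimum; none of the restarts is guaranteed to reach it.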