It is the probability of generating x from any of the classes. The marginal probabilities therefore
account for the fact that we know which unlabeled instances are present, but not which classes they
belong to. Semi-supervised learning in mixture models amounts to finding the MLE of (3.13).
The only difference between supervised and semi-supervised learning (for mixture models) is the
objective function being maximized. Intuitively, a semi-supervised MLE θ will need to fit both the
labeled and unlabeled instances. Therefore, we can expect it to be different from the supervised
MLE.
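For reference, equation (3.13) itself is not reproduced in this excerpt; under the standard generative mixture-model assumptions, the semi-supervised log likelihood it refers to has the generic form

    log p(D | θ) = Σ_{i=1}^{l} log p(y_i | θ) p(x_i | y_i, θ) + Σ_{i=l+1}^{l+u} log Σ_{y=1}^{C} p(y | θ) p(x_i | y, θ),

where C is the number of classes: the labeled instances contribute joint terms with their observed labels, while the unlabeled instances contribute marginal terms summed over all possible labels.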
The unobserved labels y_{l+1}, ..., y_{l+u} are called hidden variables. Unfortunately, hidden variables can make the log likelihood (3.13) non-concave and hard to optimize.³ Fortunately, there are several well-studied optimization methods which attempt to find a locally optimal θ. We will present a particularly important method, the Expectation Maximization (EM) algorithm, in the next section. For Gaussian Mixture Models (GMMs), Multinomial Mixture Models, HMMs, etc., the EM algorithm has been the de facto standard optimization technique to find an MLE when unlabeled data is present. The EM algorithm for HMMs even has its own name: the Baum-Welch algorithm.
3.3 OPTIMIZATION WITH THE EM ALGORITHM
Recall the observed data is D = {(x_1, y_1), ..., (x_l, y_l), x_{l+1}, ..., x_{l+u}}, and let the hidden data be H = {y_{l+1}, ..., y_{l+u}}. Let the model parameter be θ. The EM algorithm is an iterative procedure to find a θ that locally maximizes p(D | θ).
Algorithm 3.3. The Expectation Maximization (EM) Algorithm in General.
Input: observed data D, hidden data H, initial parameter θ^(0)
1. Initialize t = 0.
2. Repeat the following steps until p(D | θ^(t)) converges:
3.   E-step: compute q^(t)(H) ≡ p(H | D, θ^(t))
4.   M-step: find θ^(t+1) that maximizes Σ_H q^(t)(H) log p(D, H | θ^(t+1))
5.   t = t + 1
Output: θ^(t)
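As a concrete illustration, the following Python sketch specializes Algorithm 3.3 to a semi-supervised GMM with one Gaussian per class. The function name em_semisup_gmm, its interface, and the labeled-data initialization are illustrative assumptions rather than part of the text; the E-step computes the soft labels q^(t)(H) for the unlabeled data, and the M-step re-estimates θ in closed form.

import numpy as np
from scipy.stats import multivariate_normal

def em_semisup_gmm(X_l, y_l, X_u, n_classes, mu_init=None, n_iter=100, tol=1e-6):
    """EM for a semi-supervised GMM (one Gaussian per class); illustrative sketch.

    X_l: (l, d) labeled instances, y_l: (l,) integer labels in {0, ..., n_classes-1},
    X_u: (u, d) unlabeled instances. Returns (weights, means, covariances, log likelihood).
    """
    l, d = X_l.shape
    X = np.vstack([X_l, X_u])

    # theta^(0): estimate each class from the labeled data alone (one common heuristic).
    w = np.array([(y_l == c).mean() for c in range(n_classes)])
    mu = (np.array([X_l[y_l == c].mean(axis=0) for c in range(n_classes)])
          if mu_init is None else np.asarray(mu_init, dtype=float))
    Sigma = np.array([np.cov(X_l[y_l == c].T) + 1e-6 * np.eye(d)
                      for c in range(n_classes)])

    R_l = np.eye(n_classes)[y_l]  # labeled responsibilities are fixed one-hot vectors
    ll_old = -np.inf
    for t in range(n_iter):
        # p(y = c | theta) * p(x | y = c, theta) for every instance and every class
        joint = np.column_stack([
            w[c] * multivariate_normal.pdf(X, mean=mu[c], cov=Sigma[c])
            for c in range(n_classes)
        ])

        # log p(D | theta): labeled terms use the observed label,
        # unlabeled terms marginalize over the hidden label.
        ll = (np.log(joint[np.arange(l), y_l]).sum()
              + np.log(joint[l:].sum(axis=1)).sum())
        if ll - ll_old < tol:
            break
        ll_old = ll

        # E-step: soft labels q^(t)(H) = p(H | D, theta^(t)) for the unlabeled data.
        R_u = joint[l:] / joint[l:].sum(axis=1, keepdims=True)
        R = np.vstack([R_l, R_u])

        # M-step: closed-form weighted updates of the mixture parameters.
        Nc = R.sum(axis=0)
        w = Nc / Nc.sum()
        mu = (R.T @ X) / Nc[:, None]
        for c in range(n_classes):
            diff = X - mu[c]
            Sigma[c] = (R[:, c, None] * diff).T @ diff / Nc[c] + 1e-6 * np.eye(d)

    return w, mu, Sigma, ll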
We comment on a few important aspects of the EM algorithm:
• q^(t)(H) is the hidden label distribution. It can be thought of as assigning “soft labels” to the unlabeled data according to the current model θ^(t).
• It can be proven that EM improves the log likelihood log p(D | θ) at each iteration. However, EM only converges to a local optimum. That is, the θ it finds is only guaranteed to be the best within a neighborhood in the parameter space; θ may not be a global optimum (the desired MLE).
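Because of this local-optimum issue, a common practical remedy is to run EM from several different initializations and keep the run with the highest log likelihood. A minimal sketch, reusing the illustrative em_semisup_gmm function above (not from the text) and perturbing only its initial means:

import numpy as np

def em_with_restarts(X_l, y_l, X_u, n_classes, n_restarts=5, scale=0.5, seed=0):
    # Each restart perturbs the labeled-class means used as theta^(0); the run
    # with the highest final log likelihood log p(D | theta) is kept.
    rng = np.random.default_rng(seed)
    mu0 = np.array([X_l[y_l == c].mean(axis=0) for c in range(n_classes)])
    best = None
    for _ in range(n_restarts):
        mu_init = mu0 + scale * rng.standard_normal(mu0.shape)
        w, mu, Sigma, ll = em_semisup_gmm(X_l, y_l, X_u, n_classes, mu_init=mu_init)
        if best is None or ll > best[3]:
            best = (w, mu, Sigma, ll)
    return best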
³ The reader is invited to form the Lagrangian and solve the zero partial derivative equations. The parameters will end up on both sides of the equation, and there will be no analytical solution for the MLE θ.