algorithm consists of two steps that are repeated until convergence: the expectation
(E-step) and the maximization (M-step) steps.
The E-step computes the expectation of log P_θ(b, a), where a denotes the observed variable and b the unobserved one:

    Q(\theta, \theta') = \sum_{b} P_{\theta'}(b \mid a) \log P_{\theta}(b, a),    (4.5)
where θ are the new distribution parameters and θ′ the previous ones. Please note that we are using the log. The reason for this is that we need to multiply the probabilities of each observed value for a specific set of parameters, but multiplying several probabilities soon yields a very small number and thus produces a loss of precision on a computer due to its limited numerical accuracy. A typical solution is to use the log of these probabilities and to look for the maximum log likelihood. As the logs will be negative, we are looking for the set of parameters whose log likelihood is as close to 0 as possible. In the M-step we try to find the θ that maximizes Q.
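As a quick illustration of this precision problem (our example, not from the original text), the sketch below multiplies many small probabilities directly and then repeats the computation in log space; the probability value 0.01 and the count 1000 are arbitrary choices:

    import math

    # 1000 hypothetical i.i.d. observations, each with probability 0.01
    # under some fixed set of parameters.
    probs = [0.01] * 1000

    # Direct product underflows: 0.01 ** 1000 is far below the smallest
    # positive double (about 5e-324), so the result collapses to 0.0.
    likelihood = math.prod(probs)
    print(likelihood)        # 0.0

    # Summing logs keeps full precision; the log likelihood is a large
    # negative number, and better parameters move it closer to 0.
    log_likelihood = sum(math.log(p) for p in probs)
    print(log_likelihood)    # about -4605.17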
How can we find the θ that maximizes Q? Let us review conditional expectation, where A and B are random variables drawn from distributions P(a) and P(b), to resolve the M-step. The conditional distribution is given by P(b | a) = P(b, a)/P(a), and then E[B] = Σ_b P(b) b. For a function h(B) depending on B, the expectation is trivially obtained by E[h(B)] = Σ_b P(b) h(b). For a particular value of A (A = a), the expectation is E[h(B) | a] = Σ_b P(b | a) h(b).
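To make these definitions concrete, here is a small sketch (our example; the joint distribution and the function h are invented for illustration) that evaluates E[h(B)] and E[h(B) | A = a] for a toy discrete distribution:

    # Toy joint distribution P(b, a) over B in {0, 1, 2} and A in {0, 1};
    # the entries sum to 1.
    P_ba = {
        (0, 0): 0.10, (0, 1): 0.20,
        (1, 0): 0.15, (1, 1): 0.25,
        (2, 0): 0.05, (2, 1): 0.25,
    }

    def h(b):
        # Arbitrary function of B used inside the expectation.
        return b * b

    # Marginals P(b) and P(a), obtained by summing out the other variable.
    P_b = {b: sum(p for (bb, _), p in P_ba.items() if bb == b) for b in (0, 1, 2)}
    P_a = {a: sum(p for (_, aa), p in P_ba.items() if aa == a) for a in (0, 1)}

    # E[h(B)] = sum_b P(b) h(b)
    E_h = sum(P_b[b] * h(b) for b in P_b)

    # E[h(B) | A = a] = sum_b P(b | a) h(b), with P(b | a) = P(b, a) / P(a)
    a = 1
    E_h_given_a = sum((P_ba[(b, a)] / P_a[a]) * h(b) for b in (0, 1, 2))

    print(E_h, E_h_given_a)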
Remember that we want to pick a θ that maximizes the log likelihood of the observed (a) and the unobserved (b) variables, given the observed variable a and the previous parameters θ′. The conditional expectation of log P_θ(b, a) given a and θ′ is

    E[\log P(b, a \mid \theta) \mid a, \theta'] = \sum_{b} P(b \mid a, \theta') \log P(b, a \mid \theta)    (4.6)
                                                = \sum_{b} P_{\theta'}(b \mid a) \log P_{\theta}(b, a).    (4.7)
The key is that if

    \sum_{b} P_{\theta'}(b \mid a) \log P_{\theta}(b, a) > \sum_{b} P_{\theta'}(b \mid a) \log P_{\theta'}(b, a),

then P_θ(a) > P_θ′(a). If we can improve the expectation of the log likelihood, EM is improving the model of the observed variable a.
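The text states this implication without proof; the following short derivation (our addition, using the standard decomposition of the complete-data log likelihood and an auxiliary quantity H that does not appear in the original text) sketches why it holds:

    % Since \sum_b P_{\theta'}(b \mid a) = 1 and
    % \log P_\theta(b, a) = \log P_\theta(b \mid a) + \log P_\theta(a), we have
    \log P_{\theta}(a)
      = \sum_{b} P_{\theta'}(b \mid a)\,\log P_{\theta}(b, a)
        - \sum_{b} P_{\theta'}(b \mid a)\,\log P_{\theta}(b \mid a)
      = Q(\theta, \theta') - H(\theta, \theta').

    % Subtracting the same identity evaluated at \theta = \theta' gives
    \log P_{\theta}(a) - \log P_{\theta'}(a)
      = \bigl[ Q(\theta, \theta') - Q(\theta', \theta') \bigr]
        + \mathrm{KL}\bigl( P_{\theta'}(\cdot \mid a) \,\|\, P_{\theta}(\cdot \mid a) \bigr),

    % and the KL divergence is non-negative, so any \theta that increases Q
    % also increases \log P_\theta(a).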
In any real-world problem we do not have a single point but a series of attributes x_1, ..., x_n. Assuming the data are i.i.d., we can sum over all points to compute the expectation:

    Q(\theta, \theta') = \sum_{i=1}^{n} \sum_{b} P_{\theta'}(b \mid x_i) \log P_{\theta}(b, x_i).    (4.8)
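To show how (4.8) turns into an iterative algorithm, here is a minimal EM sketch for a two-component one-dimensional Gaussian mixture (our example: the hidden variable b is the component index, and the synthetic data, initial values, and iteration count are assumptions, not taken from the text):

    import numpy as np

    rng = np.random.default_rng(0)

    # Synthetic observations x_1, ..., x_n drawn from two Gaussian components.
    x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 0.5, 200)])

    # Initial parameters theta' = (weights, means, standard deviations).
    w = np.array([0.5, 0.5])
    mu = np.array([-1.0, 1.0])
    sigma = np.array([1.0, 1.0])

    def log_gauss(x, mu, sigma):
        # Elementwise log N(x; mu, sigma^2).
        return -0.5 * np.log(2.0 * np.pi * sigma**2) - (x - mu) ** 2 / (2.0 * sigma**2)

    for _ in range(50):
        # E-step: responsibilities r[i, b] = P_theta'(b | x_i), computed in
        # log space to avoid the underflow problem discussed earlier.
        log_joint = np.log(w) + log_gauss(x[:, None], mu, sigma)   # log P_theta'(b, x_i)
        log_evidence = np.logaddexp.reduce(log_joint, axis=1, keepdims=True)
        r = np.exp(log_joint - log_evidence)

        # M-step: pick the theta that maximizes
        # Q(theta, theta') = sum_i sum_b r[i, b] * log P_theta(b, x_i);
        # for a Gaussian mixture this maximization has a closed form.
        n_b = r.sum(axis=0)
        w = n_b / len(x)
        mu = (r * x[:, None]).sum(axis=0) / n_b
        sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / n_b)

    print(w, mu, sigma)   # should approach the generating parameters

Each iteration maximizes Q with the responsibilities held fixed, so by the argument above the observed-data log likelihood never decreases; whether it reaches the global maximum still depends on the initial parameters, as noted next.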
The EM algorithm is not perfect: it can get stuck in local maxima and it also depends on the initial θ value. The latter is usually resolved by using a bootstrap process in order to choose a correct initial θ. Also, the reader may have noticed that we have
 