maximizing the marginal likelihood, we describe an algorithm that maximizes a
quantity called the average data likelihood to obtain estimates for the hyperparame-
ters. This algorithm is called the expectation maximization (EM) algorithm, and is
described in the following.
B.5.2 Average Data Likelihood
The EM algorithm computes a quantity called the average data likelihood. Computing the average data likelihood is much easier than computing the marginal likelihood. To define the average data likelihood, let us first define the complete data likelihood, such that
$$\log p(y, x \mid \Phi, \Lambda) = \log p(y \mid x, \Lambda) + \log p(x \mid \Phi). \tag{B.33}$$
If we observed not only $y$ but also $x$, we could estimate $\Phi$ and $\Lambda$ by maximizing $\log p(y, x \mid \Phi, \Lambda)$ with respect to these hyperparameters. However, since we do not observe $x$, we must substitute some "reasonable" value for the unknown $x$ in $\log p(y, x \mid \Phi, \Lambda)$.
Having observed $y$, we actually know which values of $x$ are reasonable: our best knowledge of the unknown $x$ is represented by the posterior distribution $p(x \mid y)$. Thus, the "reasonable" value would be the one that maximizes the posterior probability, and one solution would be to use the MAP estimate of $x$ in $\log p(y, x \mid \Phi, \Lambda)$. A better solution is to use all possible values of $x$ in the complete data likelihood and average over them with the posterior probability. This results in the average data likelihood, $\Theta(\Phi, \Lambda)$:
$$
\begin{aligned}
\Theta(\Phi, \Lambda) &= \int p(x \mid y) \log p(y, x \mid \Phi, \Lambda)\, dx = E\!\left[\log p(y, x \mid \Phi, \Lambda)\right] \\
&= E\!\left[\log p(y \mid x, \Lambda)\right] + E\!\left[\log p(x \mid \Phi)\right],
\end{aligned} \tag{B.34}
$$
where the expectation $E[\cdot]$ is taken with respect to the posterior probability $p(x \mid y)$.
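As a concrete check of this definition, the sketch below evaluates the average data likelihood for a hypothetical scalar instance of a Gaussian model, $x \sim N(0, \Phi)$ and $y \mid x \sim N(x, \Lambda)$ (the scalar setup and all numeric values here are illustrative assumptions, not the specific model of Sect. B.3), both by Monte Carlo averaging over the posterior and in closed form:

```python
import numpy as np

def log_gauss(z, mean, var):
    """Log density of N(mean, var) evaluated at z."""
    return -0.5 * np.log(2 * np.pi * var) - (z - mean) ** 2 / (2 * var)

# Hypothetical scalar instance: x ~ N(0, phi), y | x ~ N(x, lam)
phi, lam, y = 4.0, 1.0, 2.5

# Exact Gaussian posterior p(x | y): variance v and mean m
v = 1.0 / (1.0 / phi + 1.0 / lam)
m = v * y / lam

# Monte Carlo version of the average data likelihood: average the
# complete data log likelihood log p(y, x) over posterior samples of x
rng = np.random.default_rng(0)
xs = rng.normal(m, np.sqrt(v), 500_000)
theta_mc = np.mean(log_gauss(y, xs, lam) + log_gauss(xs, 0.0, phi))

# Closed form, using E[(y - x)^2] = (y - m)^2 + v and E[x^2] = m^2 + v
theta = (log_gauss(y, m, lam) - v / (2 * lam)
         + log_gauss(m, 0.0, phi) - v / (2 * phi))

print(theta_mc, theta)  # the two values agree to Monte Carlo accuracy
```

The closed form exists here because both expectations reduce to second moments of the Gaussian posterior; this is what makes the average data likelihood "much easier" to handle than the marginal likelihood.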
The estimates of the hyperparameters, $\Phi$ and $\Lambda$, are obtained using

$$\Lambda = \operatorname*{argmax}_{\Lambda}\, \Theta(\Phi, \Lambda), \tag{B.35}$$

$$\Phi = \operatorname*{argmax}_{\Phi}\, \Theta(\Phi, \Lambda). \tag{B.36}$$
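To make these alternating maximizations concrete, here is a minimal EM sketch for an assumed random-effects Gaussian model, in which each latent $x_i \sim N(0, \Phi)$ is observed $K$ times with noise variance $\Lambda$ (this model and the variable names are illustrative assumptions, not the model of Sect. B.3). Both M-step updates have closed forms because each expectation in the average data likelihood is a Gaussian posterior moment:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from the assumed model: x_i ~ N(0, phi_true), and each
# x_i is observed K times with additive noise of variance lam_true
phi_true, lam_true = 4.0, 1.0
N, K = 2000, 5
x = rng.normal(0.0, np.sqrt(phi_true), N)
y = x[:, None] + rng.normal(0.0, np.sqrt(lam_true), (N, K))

phi, lam = 1.0, 1.0  # initial guesses for the hyperparameters
for _ in range(100):
    # E-step: posterior p(x_i | y_i) is Gaussian with variance v, mean m_i
    v = 1.0 / (1.0 / phi + K / lam)
    m = v * y.sum(axis=1) / lam
    # M-step: maximize Theta(phi, lam); each expectation term has a
    # closed-form maximizer built from the posterior moments
    phi = np.mean(m**2) + v                 # from E[log p(x | phi)]
    lam = np.mean((y - m[:, None])**2) + v  # from E[log p(y | x, lam)]

print(phi, lam)  # should approach phi_true = 4 and lam_true = 1
```

In richer models $\Phi$ and $\Lambda$ are matrices and the updates involve full posterior covariances, but the structure is the same: the E-step computes posterior moments of $x$, and the M-step maximizes $\Theta(\Phi, \Lambda)$ in closed form.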
In the Gaussian model discussed in Sect. B.3, $p(x \mid \Phi)$ and $p(y \mid x, \Lambda)$ are expressed in Eqs. (B.17) and (B.15), respectively. Substituting Eqs. (B.17) and (B.15) into (B.33), the complete data likelihood is expressed as²

² The constant terms containing $2\pi$ are ignored here.