when maximising with respect to $U_i$, where the expectation is taken with respect to all hidden variables except for $U_i$, and the constant term is the logarithm of the normalisation constant of $q_i$ [19, 118]. In our case, we group the variables according to their priors by $\{W, \tau\}$, $\{\alpha\}$, $\{V\}$, $\{\beta\}$, $\{Z\}$.
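For reference, the grouping above corresponds to a factorised variational distribution with one factor per group. The following is only a sketch in the notation of this section: it assumes that (7.24) takes the usual mean-field form, and the subscript $j \neq i$ on the expectation is an assumed notation that may differ from the original statement.

```latex
q(U) = q_{W,\tau}(W, \tau)\, q_\alpha(\alpha)\, q_V(V)\, q_\beta(\beta)\, q_Z(Z),
\qquad
\ln q_i^*(U_i) = E_{j \neq i}\big(\ln p(U, Y)\big) + \text{const.}
```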
Handling the Softmax Function
If the model has a conjugate-exponential structure, (7.24) gives an analytical solution whose distribution has the same form as the prior of the corresponding hidden variable. In our case, however, the generalised softmax function (7.10) does not conform to this conjugate-exponential structure and needs to be dealt with separately. One possible approach is to replace the softmax function by an exponential lower bound on it, which introduces additional variational variables with respect to which $\mathcal{L}(q)$ also needs to be maximised. This approach was followed by Bishop and Svensen [20] and by Jaakkola and Jordan [119] for the logistic sigmoid function, but currently there is no known exponential lower bound on the softmax besides a conjectured one by Gibbs [93] (see footnote 4). Alternatively, we can follow the approach taken by Waterhouse et al. [227, 226], where $q_V(V)$ is approximated by a Laplace approximation. Due to the lack of better alternatives, this approach is chosen, even though such an approximation invalidates the lower-bound nature of $\mathcal{L}(q)$.
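To illustrate what a Laplace approximation does in general, the following minimal sketch fits a Gaussian to a density that is known only up to a constant: it locates the mode of the log-density by Newton-Raphson steps and takes the inverse of the negated Hessian at the mode as the covariance. It is not the derivation of $q_V(V)$ used in this chapter; the function name `laplace_approximation`, its parameters, and the toy Gamma-shaped log-density are assumptions made purely for illustration.

```python
import numpy as np

def laplace_approximation(grad, hess, x0, n_iter=100, tol=1e-10):
    """Gaussian approximation N(mode, cov) to an unnormalised density.

    grad, hess: callables returning the gradient and the Hessian of the
    log-density at a point x (hypothetical interface for this sketch).
    """
    x = np.atleast_1d(np.asarray(x0, dtype=float))
    for _ in range(n_iter):
        # Newton-Raphson step towards the mode of the log-density
        step = np.linalg.solve(hess(x), grad(x))
        x = x - step
        if np.linalg.norm(step) < tol:
            break
    # Covariance is the inverse of the negated Hessian at the mode
    cov = np.linalg.inv(-hess(x))
    return x, cov

# Toy example (an assumption): ln f(x) = a ln x - b x for x > 0,
# whose mode is a / b and whose curvature at the mode is -b**2 / a.
a, b = 4.0, 2.0
grad = lambda x: a / x - b
hess = lambda x: np.diag(-a / x**2)
mode, cov = laplace_approximation(grad, hess, x0=[1.0])
```

For this toy density the approximation is centred at `a / b` with variance `a / b**2`. Because the Gaussian only matches the local curvature at the mode, it gives no guarantee of bounding the true density, which is why the lower-bound property is lost.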
Update Equations and Model Posterior
To get the update equations for the parameters of the variational distribution, we need to evaluate (7.24) for each group of hidden variables in $U$ separately, similar to the derivations by Waterhouse et al. [226] and Ueda and Ghahramani [216]. This provides us with an approximation for the posterior $p(U | Y)$ and will be shown in the following sections.
Approximating the model evidence $p(Y)$ requires a closed-form expression for $\mathcal{L}(q)$ by evaluating (7.21), where many terms of the variational update equations can be reused, as will be shown after having derived the update equations.
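The reason $\mathcal{L}(q)$ can stand in for the model evidence is the standard variational decomposition of the log-evidence, sketched below. This assumes that (7.21) defines $\mathcal{L}(q)$ as the usual variational bound; the Kullback-Leibler term $\mathrm{KL}$ is notation introduced here for illustration.

```latex
\ln p(Y) = \mathcal{L}(q) + \mathrm{KL}\big(q(U) \,\|\, p(U | Y)\big),
\qquad
\mathcal{L}(q) = \int q(U) \ln \frac{p(U, Y)}{q(U)} \, \mathrm{d}U .
```

Since the Kullback-Leibler divergence is non-negative, $\mathcal{L}(q)$ is a lower bound on $\ln p(Y)$ whenever $q$ is a proper variational distribution, and the bound becomes tighter as $q(U)$ approaches the posterior $p(U | Y)$.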
7.3.2 Classifier Model $q^*_{W,\tau}(W, \tau)$
The maximum of $\mathcal{L}(q)$ with respect to $W$ and $\tau$ is given by evaluating (7.24) for $q_{W,\tau}$, which, by using (7.15), (7.16) and (7.6), results in

$$
\begin{aligned}
\ln q^*_{W,\tau}(W, \tau) &= E_Z\big(\ln p(Y | W, \tau, Z)\big) + E_\alpha\big(\ln p(W, \tau | \alpha)\big) + \text{const.} \\
&= \sum_k \sum_n E_Z\big(z_{nk} \ln p(y_n | W_k, \tau_k)\big) + \sum_k E_\alpha\big(\ln p(W_k, \tau_k | \alpha_k)\big) + \text{const.},
\end{aligned}
\tag{7.25}
$$
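One step that follows directly from (7.25) can be sketched as below. Because $\ln p(y_n | W_k, \tau_k)$ does not depend on $Z$, linearity of expectation lets the expectation act on $z_{nk}$ alone. The shorthand $r_{nk}$ for this expectation is an assumed symbol introduced here for illustration.

```latex
E_Z\big(z_{nk} \ln p(y_n | W_k, \tau_k)\big)
  = E_Z(z_{nk}) \, \ln p(y_n | W_k, \tau_k)
  = r_{nk} \ln p(y_n | W_k, \tau_k),
\qquad r_{nk} \equiv E_Z(z_{nk}).
```

Under this reading, the first sum in (7.25) is a log-likelihood in which each observation $n$ contributes to classifier $k$ with weight $r_{nk}$, and each term of the outer sum over $k$ involves only $W_k$, $\tau_k$ and $\alpha_k$.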
Footnote 4: A more general bound was recently developed by Wainwright, Jaakkola and Willsky [225], but its applicability still needs to be evaluated.