To get the variational bound of the whole model structure, and with it the lower bound on the logarithm of the model evidence $\ln p(\mathbf{Y})$, we need to compute

$$\mathcal{L}(q) = \sum_k \mathcal{L}_k(q) + \mathcal{L}_M(q), \qquad (7.96)$$

where $\mathcal{L}_k(q)$ and $\mathcal{L}_M(q)$ are given by (7.91) and (7.95), respectively.

Training the model means maximising $\mathcal{L}(q)$ (7.96) with respect to its parameters $\{\mathbf{W}_k, \boldsymbol{\Lambda}_k, a_{\tau_k}, b_{\tau_k}, a_{\alpha_k}, b_{\alpha_k}, \mathbf{V}, \boldsymbol{\Lambda}_V, a_{\beta_k}, b_{\beta_k}\}$. In fact, deriving the maximum of $\mathcal{L}(q)$ with respect to each of these parameters separately while keeping the others constant results in the variational update equations that were derived in the previous sections [19].
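Schematically, this parameter-wise maximisation is a coordinate ascent on the bound. The following is a minimal Python sketch only; the update and bound callables are hypothetical stand-ins for the update equations of the previous sections, not part of the text:

```python
import numpy as np

def train(X, Y, classifier_updates, mixing_update, bound, max_iter=100, tol=1e-8):
    """Coordinate ascent on L(q): each step maximises the bound with respect
    to one parameter group while the others are held fixed, so L(q) can only
    increase and convergence is monotone.

    classifier_updates: one callable per classifier k, updating
        W_k, Lambda_k, a/b_tau_k, a/b_alpha_k (hypothetical stand-ins).
    mixing_update: callable updating V, Lambda_V, a/b_beta_k.
    bound: callable evaluating L(q) = sum_k L_k(q) + L_M(q) from (7.96).
    """
    L_old = -np.inf
    for _ in range(max_iter):
        for update_k in classifier_updates:
            update_k(X, Y)
        mixing_update(X, Y)
        L_new = bound(X, Y)
        if L_new - L_old < tol:   # the bound can only increase
            return L_new
        L_old = L_new
    return L_old
```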
7.3.9 Independent Classifier Training
As we can see from (7.91), we need to know the responsibilities $r_{nk}$ to train each of the classifiers. The mixing model, on the other hand, relies on the goodness-of-fit of the classifiers, as embedded in $g_k$ in (7.95). Therefore, classifiers and mixing model need to be trained in combination to maximise (7.96).
Taking this approach, however, introduces local optima in the training process,
as already discussed for the non-Bayesian MoE model in Sect. 4.1.5. Such local
optima make evaluating the model evidence for a single model structure too
costly to perform efficient model structure search, and so the training process
needs to be modified to remove these local optima. Following the same approach
as in Sect. 4.4, we train the classifiers independently of the mixing model.
More specifically, the classifiers are fully trained on all observations that they match, independently of other classifiers, and then combined by the mixing model. Formally, this is achieved by replacing the responsibilities $r_{nk}$ by the matching functions $m_k(\mathbf{x}_n)$.
The only required modification to the variational update equations is to change the classifier model updates from (7.30)-(7.33) to

$$\boldsymbol{\Lambda}_k = \mathbb{E}_\alpha(\alpha_k)\mathbf{I} + \sum_n m_k(\mathbf{x}_n)\,\mathbf{x}_n\mathbf{x}_n^T, \qquad (7.97)$$

$$\mathbf{w}_{kj} = \boldsymbol{\Lambda}_k^{-1}\sum_n m_k(\mathbf{x}_n)\,\mathbf{x}_n y_{nj}, \qquad (7.98)$$

$$a_{\tau_k} = a_\tau + \frac{1}{2}\sum_n m_k(\mathbf{x}_n), \qquad (7.99)$$

$$b_{\tau_k} = b_\tau + \frac{1}{2 D_Y}\sum_j \left( \sum_n m_k(\mathbf{x}_n)\, y_{nj}^2 - \mathbf{w}_{kj}^T\boldsymbol{\Lambda}_k\mathbf{w}_{kj} \right). \qquad (7.100)$$
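To make these updates concrete, here is a minimal NumPy sketch for a single classifier; the function name classifier_update and its argument layout are my own assumptions, with E_alpha_k standing for the expectation $\mathbb{E}_\alpha(\alpha_k)$ under the current $q(\alpha_k)$:

```python
import numpy as np

def classifier_update(X, Y, m_k, E_alpha_k, a_tau, b_tau):
    """Matching-weighted updates (7.97)-(7.100) for one classifier k.

    X: (N, D_X) inputs, Y: (N, D_Y) outputs,
    m_k: (N,) matching values m_k(x_n),
    E_alpha_k: scalar expectation E_alpha(alpha_k).
    """
    D_X, D_Y = X.shape[1], Y.shape[1]
    Xm = m_k[:, None] * X  # rows of X weighted by m_k(x_n)
    # (7.97): Lambda_k = E_alpha(alpha_k) I + sum_n m_k(x_n) x_n x_n^T
    Lambda_k = E_alpha_k * np.eye(D_X) + X.T @ Xm
    # (7.98): w_kj = Lambda_k^{-1} sum_n m_k(x_n) x_n y_nj, all outputs j
    # at once; column j of W_k is w_kj
    W_k = np.linalg.solve(Lambda_k, Xm.T @ Y)
    # (7.99): a_tau_k = a_tau + (1/2) sum_n m_k(x_n)
    a_tau_k = a_tau + 0.5 * m_k.sum()
    # (7.100): b_tau_k = b_tau + (1/(2 D_Y)) sum_j ( sum_n m_k(x_n) y_nj^2
    #                                                - w_kj^T Lambda_k w_kj )
    y2 = (m_k[:, None] * Y ** 2).sum(axis=0)             # sum_n m_k(x_n) y_nj^2
    quad = np.einsum('ij,ik,kj->j', W_k, Lambda_k, W_k)  # w_kj^T Lambda_k w_kj
    b_tau_k = b_tau + (y2 - quad).sum() / (2.0 * D_Y)
    return Lambda_k, W_k, a_tau_k, b_tau_k
```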
Thus, we are now effectively finding a $\mathbf{w}_{kj}$ that minimises

$$\|\mathbf{y}_j - \mathbf{X}\mathbf{w}_{kj}\|^2_{\mathbf{M}_k} + \mathbb{E}_\alpha(\alpha_k)\|\mathbf{w}_{kj}\|^2, \qquad (7.101)$$

where $\|\mathbf{v}\|^2_{\mathbf{M}_k} = \mathbf{v}^T\mathbf{M}_k\mathbf{v}$ is the norm weighted by the diagonal matching matrix $\mathbf{M}_k = \operatorname{diag}(m_k(\mathbf{x}_1), \dots, m_k(\mathbf{x}_N))$.
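In other words, the update is a matching-weighted ridge regression: setting the gradient of (7.101) to zero recovers exactly (7.97) and (7.98). The following is a quick numerical check of this equivalence, my own illustration on arbitrary toy data rather than anything from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D_X = 50, 3
X = rng.normal(size=(N, D_X))
y_j = rng.normal(size=N)
m_k = rng.uniform(size=N)   # matching values m_k(x_n)
E_alpha_k = 2.0             # stand-in for E_alpha(alpha_k)

M_k = np.diag(m_k)
Lambda_k = E_alpha_k * np.eye(D_X) + X.T @ M_k @ X   # (7.97)
w_kj = np.linalg.solve(Lambda_k, X.T @ (m_k * y_j))  # (7.98)

# Gradient of (7.101) at w_kj:
#   -2 X^T M_k (y_j - X w_kj) + 2 E_alpha(alpha_k) w_kj
grad = -2 * X.T @ (m_k * (y_j - X @ w_kj)) + 2 * E_alpha_k * w_kj
assert np.allclose(grad, 0)  # w_kj is indeed the minimiser of (7.101)
```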