where $\mathcal{N}$ stands for a Gaussian, and the model parameters $\theta_k = \{w_k, \tau_k\}$ are the $D_X$-dimensional weight vector $w_k$ and the noise precision (that is, inverse variance) $\tau_k$. The distribution is centred on the inner product $w_k^T x$, and its spread is inversely proportional to $\tau_k$ and independent of the input.

As we give a detailed discussion of the implications of assuming this expert model and various forms of its incremental training in Chap. 5, let us here only consider how it specifies the maximisation step of the EM-algorithm for training the MoE model, in particular with respect to the weight vector $w_k$: combining (4.14) and (4.15), the term to maximise becomes

$$
\sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \ln p(y_n \mid x_n, w_k, \tau_k)
= \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \left( \frac{1}{2} \ln \tau_k - \frac{\tau_k}{2} \left( w_k^T x_n - y_n \right)^2 - \frac{1}{2} \ln 2\pi \right)
= - \sum_{k=1}^{K} \frac{\tau_k}{2} \sum_{n=1}^{N} r_{nk} \left( w_k^T x_n - y_n \right)^2 + \mathrm{const.},
$$
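To make the structure of this objective concrete, here is a minimal sketch in NumPy that evaluates the responsibility-weighted Gaussian log-likelihood above for all experts at once; the function name, argument layout, and array shapes are illustrative assumptions, not part of the text.

```python
import numpy as np

def weighted_gaussian_loglik(X, y, W, tau, R):
    """Evaluate sum_n sum_k r_nk ln N(y_n | w_k^T x_n, 1/tau_k).

    X   : (N, D_X) inputs, rows x_n^T       (shapes are assumptions)
    y   : (N,)     scalar outputs y_n
    W   : (K, D_X) expert weight vectors w_k as rows
    tau : (K,)     noise precisions tau_k
    R   : (N, K)   responsibilities r_nk
    """
    sq_err = (X @ W.T - y[:, None]) ** 2            # (N, K): (w_k^T x_n - y_n)^2
    loglik = 0.5 * np.log(tau) - 0.5 * tau * sq_err - 0.5 * np.log(2 * np.pi)
    return float(np.sum(R * loglik))                # weight each term by r_nk
```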
where the constant term absorbs all terms that are independent of the weight vectors. Considering the experts separately, the aim for expert $k$ is to find

$$
\min_{w_k} \sum_{n=1}^{N} r_{nk} \left( w_k^T x_n - y_n \right)^2, \tag{4.16}
$$

which is a weighted linear least squares problem. This shows how the assumption of Gaussian noise locally leads to minimising the empirical risk with the $L_2$ loss function.
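Since (4.16) is a weighted linear least squares problem, it admits a closed-form solution. A minimal sketch, assuming NumPy (the function name and shapes are illustrative): scaling row $n$ of $X$ and the output $y_n$ by $\sqrt{r_{nk}}$ reduces the weighted problem to ordinary least squares.

```python
import numpy as np

def m_step_expert_weights(X, y, r_k):
    """Solve (4.16): min_w sum_n r_nk (w^T x_n - y_n)^2 for one expert.

    Scaling row n of X and y by sqrt(r_nk) turns the weighted problem
    into an ordinary least squares problem, solved via np.linalg.lstsq.
    """
    s = np.sqrt(r_k)                                  # (N,) row weights sqrt(r_nk)
    w_k, *_ = np.linalg.lstsq(s[:, None] * X, s * y, rcond=None)
    return w_k                                        # (D_X,) new weight vector
```

Equivalently, $w_k$ solves the weighted normal equations $(X^T R_k X)\, w_k = X^T R_k\, y$ with $R_k = \operatorname{diag}(r_{1k}, \ldots, r_{Nk})$; the square-root rescaling avoids forming these matrix products explicitly and is numerically better behaved.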
4.2.2 Experts for Classification
For classification, assume that the number of classes is $D_Y$, and the outputs are the vectors $y = (y_1, \ldots, y_{D_Y})^T$ with all elements equal to 0 except for the $j$th element, $y_j = 1$, where $j$ is the class associated with this output vector. Thus, similarly to the latent vector $z$, the different $y$'s obey a 1-of-$D_Y$ structure.
The expert model $p(y \mid x, \theta_k)$ gives the probability of the expert having generated an observation of the class specified by $y$. Analogous to the gating network (4.4), this model could assume a log-linear relationship between this probability and the input $x$, which implies that $p(y \mid x, \theta_k)$ is assumed to vary with $x$. However, to simplify interpretation of the expert model, it will be assumed that this probability remains constant over all inputs that the expert is responsible for, that is,

$$
p(y \mid x, w_k) = \prod_{j=1}^{D_Y} w_{kj}^{y_j}, \qquad \text{with} \quad \sum_{j=1}^{D_Y} w_{kj} = 1. \tag{4.17}
$$

Thus, $p(y \mid x, w_k)$ is independent of the input $x$ and parametrised by $\theta_k = w_k$, and for any given $y$ representing class $j$, the model's probability is given by $w_{kj}$, the $j$th element of $w_k$.
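As a small illustration of (4.17), the following hypothetical sketch (function and variable names are assumptions) shows that for a 1-of-$D_Y$ coded $y$ the product collapses to the single component of $w_k$ at the active class:

```python
import numpy as np

def expert_class_prob(y, w_k):
    """p(y | x, w_k) = prod_j w_kj^{y_j} from (4.17), independent of x.

    For 1-of-D_Y coded y, every factor with y_j = 0 equals 1, so the
    product reduces to the element of w_k at the active class.
    """
    return float(np.prod(w_k ** y))

w_k = np.array([0.7, 0.2, 0.1])    # expert's class probabilities, sum to 1
y = np.array([0, 1, 0])            # 1-of-3 coding for the second class
print(expert_class_prob(y, w_k))   # 0.2, i.e. the second element of w_k
```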
 