such that the gating network best fits the previously calculated responsibilities.
Equation (4.14) causes each expert to be trained only on the areas that it is
assigned to by the responsibilities. The next expectation step re-evaluates the
responsibilities according to the new fit of the experts, and the maximisation step
adapts the gating network and the experts again. Hence, iterating the expectation
and maximisation steps causes the experts to be distributed according
to their best fit to the data.
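
The expectation step described above can be sketched in a few lines of Python. This is a minimal illustration rather than the author's implementation: it assumes the standard MoE form in which the responsibility of expert k for observation n is proportional to the gating weight times the expert's likelihood. The function name `responsibilities` and the list-of-lists layout are hypothetical.

```python
def responsibilities(gating, densities):
    """Expectation step of a mixture of experts (illustrative sketch).

    gating[n][k]    : gating weight g_k(x_n), each row sums to 1
    densities[n][k] : expert likelihood p(y_n | x_n, theta_k)
    Returns r[n][k], proportional to g_k(x_n) * p(y_n | x_n, theta_k).
    """
    result = []
    for g_row, p_row in zip(gating, densities):
        joint = [g * p for g, p in zip(g_row, p_row)]
        total = sum(joint)
        if total == 0.0:
            # Assumption of this sketch: if no expert explains the
            # observation at all, fall back to the gating weights.
            result.append(list(g_row))
        else:
            result.append([j / total for j in joint])
    return result


# One observation, two experts with equal gating weight: the expert that
# explains the observation better receives the larger responsibility.
r = responsibilities([[0.5, 0.5]], [[0.2, 0.6]])
```

In the maximisation step these values would then act as per-observation weights when each expert is refitted, which is how an expert ends up being trained only on the region assigned to it.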
The pattern of localisation is determined by the form of the gating model. As
previously demonstrated, the softmax function causes a soft linear partition of
the input space. Thus, the underlying assumption of the model is that the data
was generated by some processes that are linearly separated in the input space.
The model structure becomes richer by adding hierarchies to the gating network
[121]. That would move MoE too far away from LCS, which is why it will not
be discussed any further.
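
The soft linear partition induced by the softmax gating can be illustrated with a small sketch. It assumes the usual softmax form, in which the weight of expert k is exp(v_k . x) normalised over all experts. The gating vectors `V` and the function name `softmax_gating` are hypothetical names chosen for this example.

```python
import math


def softmax_gating(x, V):
    """Softmax gating weights for input x (illustrative sketch).

    x : input vector (list of floats)
    V : one gating parameter vector per expert (assumed notation)
    """
    scores = [sum(v_i * x_i for v_i, x_i in zip(v, x)) for v in V]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    return [e / norm for e in exps]


V = [[1.0, 0.0], [1.0, 5.0]]
# Both experts score equally wherever (v_1 - v_2) . x = 0, here x[1] = 0,
# so this input lies exactly on the linear boundary between them.
on_boundary = softmax_gating([1.0, 0.0], V)
# Moving away from the boundary shifts the weight smoothly to one expert.
off_boundary = softmax_gating([1.0, 1.0], V)
```

Two experts have equal weight exactly where their scores coincide, which is a hyperplane in the input space. The transition across that hyperplane is gradual rather than abrupt, hence "soft" linear partition.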
4.1.5 Training Issues
The likelihood function of MoE is neither convex nor unimodal [20]. Hence,
training it by using a hill-climbing procedure such as the EM-algorithm will
not guarantee that we find the global maximum. Several approaches have been
developed to deal with this problem (for example, [20, 4]), all of which are either
based on random restart or stochastic global optimisers. Hence, they require
several training epochs and/or a long training time. While this is not an issue
for MoE where the global optimum only needs to be found once, it is not an
option for LCS where the model needs to be (at least partially) re-trained for
each change in the model structure. A potential LCS-related solution will be
presented in Sect. 4.4.
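
The effect of local maxima, and the cost of the random-restart remedy, can be seen in a toy sketch. It does not train an MoE: a one-dimensional objective with two maxima stands in for the likelihood, and a simple hill climber stands in for EM. All names (`objective`, `hill_climb`, `best_of_restarts`) are hypothetical.

```python
import random


def objective(x):
    # Toy stand-in for a multimodal likelihood: a local maximum
    # near x = -1 and the global maximum near x = +1.
    return -(x * x - 1.0) ** 2 + 0.3 * x


def hill_climb(f, x, step=0.01, max_iter=10_000):
    """Move to the better neighbour until no neighbour improves f."""
    for _ in range(max_iter):
        best = max((x, x - step, x + step), key=f)
        if best == x:
            break
        x = best
    return x, f(x)


def best_of_restarts(f, starts):
    """Run the hill climber from every start and keep the best result."""
    return max((hill_climb(f, x0) for x0 in starts), key=lambda r: r[1])


rng = random.Random(0)
starts = [rng.uniform(-2.0, 2.0) for _ in range(10)]
x_best, score_best = best_of_restarts(objective, starts)
```

A single run started at a negative x stops at the inferior maximum. The restarts recover the better one only by paying for one full optimisation per start, which is the cost that becomes prohibitive when the model has to be retrained after every structural change.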
4.2 Expert Models
So far, $p(y \mid \mathbf{x}, \boldsymbol{\theta}_k)$ has been left unspecified. Its form depends on the task that is
to be solved, and differs for regression and classification tasks. Here, we only deal
with the LCS-related univariate regression task and the multiclass classification
task, for which the expert models are introduced in the following sections.
4.2.1 Experts for Linear Regression
For each expert $k$, the linear univariate regression model (that is, $D_Y = 1$) is
characterised by a linear relation of the input $\mathbf{x}$ and the adjustable parameter $\mathbf{w}_k$,
which is a vector of the same size as the input. Hence, the relation between the
input $\mathbf{x}$ and the output $y$ is modelled by a hyper-plane $\mathbf{w}_k^T \mathbf{x} - y = 0$. Additionally,
the stochasticity and measurement noise are modelled by a Gaussian. Overall,
the probabilistic model for expert $k$ is given by
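$$p(y \mid \mathbf{x}, \mathbf{w}_k, \tau_k) = \mathcal{N}\left(y \mid \mathbf{w}_k^T \mathbf{x}, \tau_k^{-1}\right) = \left(\frac{\tau_k}{2\pi}\right)^{1/2} \exp\left(-\frac{\tau_k}{2}\left(\mathbf{w}_k^T \mathbf{x} - y\right)^2\right), \tag{4.15}$$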
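
As a minimal sketch of how the density in (4.15) might be evaluated, the following function treats it as a Gaussian over y whose mean is the inner product of the weight vector and the input, and whose variance is the inverse of the precision tau. The function name `expert_density` and the plain-list representation of vectors are assumptions of this example.

```python
import math


def expert_density(y, x, w, tau):
    """Gaussian density of output y under a linear expert (sketch).

    y   : observed scalar output
    x   : input vector (list of floats)
    w   : weight vector of the expert, same length as x
    tau : noise precision (inverse variance), must be positive
    """
    mean = sum(w_i * x_i for w_i, x_i in zip(w, x))
    return math.sqrt(tau / (2.0 * math.pi)) * math.exp(-0.5 * tau * (mean - y) ** 2)


# An output that lies exactly on the hyper-plane w . x - y = 0 attains the
# peak density sqrt(tau / (2 * pi)); the density decays with squared distance.
peak = expert_density(1.0, [1.0, 2.0], [1.0, 0.0], tau=1.0)
off_plane = expert_density(3.0, [1.0, 2.0], [1.0, 0.0], tau=1.0)
```

A larger precision narrows the density around the hyper-plane, so the same deviation of y from the expert's prediction is penalised more strongly.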
 