Firstly, the standard MoE model [121] is introduced, and its training and
expert localisation are discussed. This is followed in Sect. 4.2 by a discussion of
expert models for both regression and classification. To relate MoE to LCS,
the MoE model is generalised in Sect. 4.3, together with a description of how its
training has to be modified to accommodate these generalisations. Identifying
difficulties with the latter, a modified training scheme is introduced in Sect. 4.4
that makes the introduced model more similar to XCS.
4.1 The Mixtures-of-Experts Model
The MoE model is probably best explained from the generative point-of-view:
given a set of K experts, each observation in the training set is assumed to be
generated by one and only one of these experts. Let $\mathbf{z} = (z_1, \ldots, z_K)^T$ be a
random binary vector, where each of its elements $z_k$ is associated with an expert
and indicates whether that expert generated the given observation $(\mathbf{x}, \mathbf{y})$. Given
that expert $k$ generated the observation, then $z_j = 1$ for $j = k$, and $z_j = 0$
otherwise, resulting in a 1-of-$K$ structure of $\mathbf{z}$. The introduced random vector
is a latent variable, as its values cannot be directly observed. Each observation
$(\mathbf{x}_n, \mathbf{y}_n)$ in the training set has such a random vector $\mathbf{z}_n$ associated with it,
and $\mathbf{Z} = \{\mathbf{z}_n\}$ denotes the set of latent variables corresponding to each of the
observations in the training set.

Each expert provides a probabilistic mapping $\mathcal{X} \to \mathcal{Y}$ that is given by the
conditional probability density $p(\mathbf{y} \mid \mathbf{x}, \boldsymbol{\theta}_k)$, that is, the probability of the output
being vector $\mathbf{y}$, given the input vector $\mathbf{x}$ and the model parameters $\boldsymbol{\theta}_k$ of expert $k$.
Depending on whether we deal with regression or classification tasks, experts can
represent different parametric models. The expert models are left unspecified for
now; linear regression and classification models will be introduced in Sect. 4.2.
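To make the generative view concrete, the following is a minimal Python sketch of this process. It is not part of the model specification in [121]: it assumes scalar outputs, hypothetical Gaussian linear-regression experts (such expert models are only introduced in Sect. 4.2), and a fixed, input-independent distribution g over experts as a stand-in for the expert-selection mechanism discussed later. All identifiers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

K, D_x, N = 3, 2, 100              # number of experts, input dimension, observations
W = rng.normal(size=(K, D_x))      # hypothetical expert parameters theta_k
sigma2 = 0.1                       # hypothetical expert noise variance
g = np.array([0.5, 0.3, 0.2])      # hypothetical fixed probability of each expert

def generate_observation(x):
    """Pick exactly one expert at random, then let that expert generate y."""
    k = rng.choice(K, p=g)                     # expert responsible for this observation
    z = np.zeros(K)
    z[k] = 1.0                                 # 1-of-K latent vector z_n
    y = W[k] @ x + rng.normal(scale=np.sqrt(sigma2))   # expert k's noisy output
    return y, z

X = rng.normal(size=(N, D_x))                  # inputs x_n
Y, Z = map(np.array, zip(*(generate_observation(x) for x in X)))
```

Each observation is thus generated by one and only one expert, and the matrix Z collects the associated latent 1-of-K vectors.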
4.1.1 Likelihood for Known Gating
A common approach to training probabilistic models is to maximise the likelihood
of the outputs given the inputs and the model parameters, a principle known as
maximum likelihood. As will be shown later, maximum likelihood training is
equivalent to minimising the empirical risk, with a loss function depending on the
probabilistic formulation of the model.
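As a brief preview of that equivalence (the full argument follows later), consider a single expert with Gaussian noise of fixed variance $\sigma^2$ around a prediction $f_{\boldsymbol{\theta}}(\mathbf{x})$; this is an illustrative assumption, not part of the model as defined so far. The negative log-likelihood of one observation is then
\[
-\ln p(\mathbf{y} \mid \mathbf{x}, \boldsymbol{\theta}) = \frac{1}{2\sigma^2}\,\bigl\| \mathbf{y} - f_{\boldsymbol{\theta}}(\mathbf{x}) \bigr\|^2 + \mathrm{const},
\]
so that maximising the likelihood over $\boldsymbol{\theta}$ amounts to minimising the summed squared error, that is, the empirical risk under the squared-error loss.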
Following the standard assumptions of independent observations, and additionally
assuming knowledge of the values of the latent variables $\mathbf{Z}$, the likelihood
of the training set is given by
\[
p(\mathbf{Y} \mid \mathbf{X}, \mathbf{Z}, \boldsymbol{\theta}) = \prod_{n=1}^{N} p(\mathbf{y}_n \mid \mathbf{x}_n, \mathbf{z}_n, \boldsymbol{\theta}), \qquad (4.1)
\]
where $\boldsymbol{\theta}$ stands for the model parameters. Due to the 1-of-$K$ structure of each
$\mathbf{z}_n$, the likelihood for the $n$th observation is given by
\[
p(\mathbf{y}_n \mid \mathbf{x}_n, \mathbf{z}_n, \boldsymbol{\theta}) = \prod_{k=1}^{K} p(\mathbf{y}_n \mid \mathbf{x}_n, \boldsymbol{\theta}_k)^{z_{nk}}, \qquad (4.2)
\]
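As an illustration of (4.1) and (4.2), the sketch below evaluates the complete-data log-likelihood $\ln p(\mathbf{Y} \mid \mathbf{X}, \mathbf{Z}, \boldsymbol{\theta}) = \sum_n \sum_k z_{nk} \ln p(\mathbf{y}_n \mid \mathbf{x}_n, \boldsymbol{\theta}_k)$ for the hypothetical Gaussian linear-regression experts of the earlier generative sketch; the function name and the Gaussian expert densities are assumptions for illustration, not part of the model as specified so far.

```python
import numpy as np
from scipy.stats import norm

def complete_data_log_likelihood(X, Y, Z, W, sigma2):
    """ln p(Y | X, Z, theta) = sum_n sum_k z_nk * ln p(y_n | x_n, theta_k)."""
    means = X @ W.T                                         # (N, K): expert k's prediction for x_n
    log_p = norm.logpdf(Y[:, None], loc=means, scale=np.sqrt(sigma2))   # (N, K) log-densities
    # The 1-of-K structure of z_n picks out the responsible expert's log-density,
    # which is exactly the logarithm of the products in (4.1) and (4.2).
    return float(np.sum(Z * log_p))
```

With the arrays X, Y, Z, W and sigma2 from the generative sketch above, complete_data_log_likelihood(X, Y, Z, W, sigma2) returns the logarithm of the likelihood (4.1).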