by the minimisation of the mean squared error of the global prediction with
respect to the target function, given a fixed set of fully trained classifiers. As
will be discussed in Sect. 6.4, this aim does not completely conform to the LCS
model that was introduced in Chap. 4.
Rather than using the mean squared error as a measure of the quality of a
mixing model, this chapter pragmatically follows the approach that was introduced with the probabilistic LCS model: each classifier k provides a localised probabilistic input/output mapping p(y | x, θ_k), and the value of a binary latent random variable z_nk determines whether classifier k generated the n-th observation.
Each observation is generated by one and only one matching classifier, and so
the vector z_n = (z_n1, ..., z_nK)^T has a single element with value 1, with all other
elements being 0. As the values of the latent variables are unknown, they are
modelled by the probabilistic model g_k(x) ≡ p(z_nk = 1 | x_n, v_k), which is the mixing model. The aim is to find a mixing model that is sufficiently easy to train and maximises the data likelihood (4.9), given by
\[
l(\theta; \mathcal{D}) = \sum_{n=1}^{N} \ln \sum_{k=1}^{K} g_k(x_n) \, p(y_n \mid x_n, \theta_k) . \tag{6.1}
\]
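As a concrete illustration, the following sketch evaluates this log-likelihood for given mixing weights and classifier densities. The array names and the use of NumPy are assumptions made for illustration, not part of the model itself.

```python
import numpy as np

def data_log_likelihood(G, P):
    """Evaluate the log-likelihood (6.1).

    G : (N, K) array, G[n, k] = g_k(x_n), the mixing weight that the
        mixing model assigns to classifier k for observation n
        (each row sums to 1 over the matching classifiers).
    P : (N, K) array, P[n, k] = p(y_n | x_n, theta_k), the density
        that classifier k assigns to the n-th observation.
    """
    # sum over classifiers inside the logarithm, then over observations
    return np.sum(np.log(np.sum(G * P, axis=1)))
```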
One possibility for such a mixing model was already introduced in Chap. 4 as
a generalisation of the gating network used in the Mixtures-of-Experts model,
and is given by the matching-augmented softmax function (4.22). Further alter-
natives will be introduced in this chapter.
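To make the form of such a mixing model concrete, the sketch below implements a generalised softmax of the kind described by (4.22): each classifier receives a softmax weight based on a linear score v_k^T x, restricted to the classifiers that match the input. Computing the score directly on the input vector and the function name are illustrative assumptions; the exact feature vector used is defined in Chap. 4.

```python
import numpy as np

def generalised_softmax(x, V, m):
    """Matching-augmented softmax mixing weights for one input.

    x : (D,) input (or feature) vector.
    V : (K, D) mixing model parameters, one row v_k per classifier.
    m : (K,) matching values m_k(x) in {0, 1}.

    Returns g(x), a length-K vector that is zero for non-matching
    classifiers and sums to 1 over the matching ones (assumes at
    least one classifier matches x).
    """
    scores = V @ x
    scores -= scores.max()            # for numerical stability
    weights = m * np.exp(scores)      # non-matching classifiers get weight 0
    return weights / weights.sum()
```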
The approach is called “pragmatic” because maximising the data likelihood ignores the problem of overfitting, as well as the identification of a good model structure, which is essential to LCS. Nonetheless, the methods introduced here will reappear in only slightly modified form once these issues are dealt with, and discussing them here provides a better understanding of later chapters.
Additionally, XCS implicitly uses an approach similar to maximum likelihood
to train its classifiers and mixing models, and deals with overfitting only at the
level of localising the classifiers in the input space (see App. B). Therefore, the
methods and approaches discussed here can be used as a drop-in replacement
for the XCS mixing model and for related LCS.
To summarise, we assume that we have a set of K fully trained classifiers, each of which provides a localised probabilistic model p(y | x, θ_k). The aim is to find a mixing model that provides the generative probability p(z_nk = 1 | x_n, v_k), that is, the probability that classifier k generated observation n, given the input x_n and the mixing model parameters v_k, and that maximises the data likelihood (6.1). Additional requirements are sufficiently easy training and good scaling of the method with the number of classifiers.
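Putting the two ingredients together, a minimal sketch of the objective that such a training method would maximise is given below: the mixing weights are computed by a matching-augmented softmax and plugged into the likelihood (6.1). The names and the NumPy representation are illustrative assumptions; note that a single evaluation of the objective scales linearly in both the number of observations and the number of classifiers.

```python
import numpy as np

def mixing_objective(V, X, M, P):
    """Log-likelihood (6.1) as a function of the mixing parameters V.

    V : (K, D) mixing parameters v_k.
    X : (N, D) inputs x_n.
    M : (N, K) matching values m_k(x_n) in {0, 1}.
    P : (N, K) classifier densities p(y_n | x_n, theta_k).
    """
    scores = X @ V.T                            # (N, K) scores v_k^T x_n
    scores -= scores.max(axis=1, keepdims=True) # numerical stability
    W = M * np.exp(scores)                      # zero weight if not matching
    G = W / W.sum(axis=1, keepdims=True)        # mixing weights g_k(x_n)
    return np.sum(np.log(np.sum(G * P, axis=1)))
```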
We will first concentrate on the model that was introduced in Chap. 4, and provide two approaches to training it. Due to the weaknesses of these training procedures, discussed thereafter, a set of formally inspired and computationally cheap heuristics is introduced. Some empirical studies show that these heuristics perform competitively when compared to the optimum. The chapter concludes by comparing the approach of maximising the likelihood to a closely