by the minimisation of the mean squared error of the global prediction with
respect to the target function, given a fixed set of fully trained classifiers. As
will be discussed in Sect. 6.4, this aim does not completely conform to the LCS
model that was introduced in Chap. 4.
Rather than using the mean squared error as a measure of the quality of a
mixing model, this chapter pragmatically follows the approach that was introduced with the probabilistic LCS model: each classifier k provides a localised probabilistic input/output mapping p(y | x, θ_k), and the value of a binary latent random variable z_nk determines whether classifier k generated the n-th observation.
Each observation is generated by one and only one matching classifier, and so
the vector z_n = (z_n1, ..., z_nK)^T has a single element with value 1, with all other
elements being 0. As the values of the latent variables are unknown, they are
modelled by the probabilistic model g_k(x) ≡ p(z_nk = 1 | x_n, v_k), which is the mixing model. The aim is to find a mixing model that is sufficiently easy to train and maximises the data likelihood (4.9), given by
\[
l(\theta; \mathcal{D}) = \sum_{n=1}^{N} \ln \sum_{k=1}^{K} g_k(x_n) \, p(y_n \mid x_n, \theta_k) . \tag{6.1}
\]
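As a concrete illustration, the following sketch evaluates this log-likelihood for given mixing weights and classifier densities. The array names and the use of NumPy are assumptions made for illustration, not part of the model itself.

```python
import numpy as np

def data_log_likelihood(G, P):
    """Evaluate the log-likelihood (6.1).

    G : (N, K) array, G[n, k] = g_k(x_n), the mixing weight that the
        mixing model assigns to classifier k for observation n
        (each row sums to 1 over the matching classifiers).
    P : (N, K) array, P[n, k] = p(y_n | x_n, theta_k), the density
        that classifier k assigns to the n-th observation.
    """
    # sum over classifiers inside the logarithm, then over observations
    return np.sum(np.log(np.sum(G * P, axis=1)))
```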
One possibility for such a mixing model was already introduced in Chap. 4 as
a generalisation of the gating network used in the Mixtures-of-Experts model,
and is given by the matching-augmented softmax function (4.22). Further alter-
natives will be introduced in this chapter.
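To make the form of such a mixing model concrete, the sketch below implements a generalised softmax of the kind described by (4.22): each classifier receives a softmax weight based on a linear score v_k^T x, restricted to the classifiers that match the input. Computing the score directly on the input vector and the function name are illustrative assumptions; the exact feature vector used is defined in Chap. 4.

```python
import numpy as np

def generalised_softmax(x, V, m):
    """Matching-augmented softmax mixing weights for one input.

    x : (D,) input (or feature) vector.
    V : (K, D) mixing model parameters, one row v_k per classifier.
    m : (K,) matching values m_k(x) in {0, 1}.

    Returns g(x), a length-K vector that is zero for non-matching
    classifiers and sums to 1 over the matching ones (assumes at
    least one classifier matches x).
    """
    scores = V @ x
    scores -= scores.max()            # for numerical stability
    weights = m * np.exp(scores)      # non-matching classifiers get weight 0
    return weights / weights.sum()
```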
The approach is called “pragmatic” because maximising the data likelihood ignores the problem of overfitting, as well as the identification of a good model structure, which is essential to LCS. Nonetheless, the methods introduced here will reappear in only slightly modified form once these issues are dealt with, and discussing them here provides a better understanding of later chapters.
Additionally, XCS implicitly uses an approach similar to maximum likelihood
to train its classifiers and mixing models, and deals with overfitting only at the
level of localising the classifiers in the input space (see App. B). Therefore, the
methods and approaches discussed here can be used as a drop-in replacement
for the XCS mixing model and for related LCS.
To summarise, we assume that we have a set of K fully trained classifiers, each of which provides a localised probabilistic model p(y | x, θ_k). The aim is to find a mixing model that provides the generative probability p(z_nk = 1 | x_n, v_k), that is, the probability that classifier k generated observation n, given the input x_n and the mixing model parameters v_k, and that maximises the data likelihood (6.1). Additional requirements are sufficiently easy training and good scaling of the method with the number of classifiers.
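Putting the two ingredients together, a minimal sketch of the objective that such a training method would maximise is given below: the mixing weights are computed by a matching-augmented softmax and plugged into the likelihood (6.1). The names and the NumPy representation are illustrative assumptions; note that a single evaluation of the objective scales linearly in both the number of observations and the number of classifiers.

```python
import numpy as np

def mixing_objective(V, X, M, P):
    """Log-likelihood (6.1) as a function of the mixing parameters V.

    V : (K, D) mixing parameters v_k.
    X : (N, D) inputs x_n.
    M : (N, K) matching values m_k(x_n) in {0, 1}.
    P : (N, K) classifier densities p(y_n | x_n, theta_k).
    """
    scores = X @ V.T                            # (N, K) scores v_k^T x_n
    scores -= scores.max(axis=1, keepdims=True) # numerical stability
    W = M * np.exp(scores)                      # zero weight if not matching
    G = W / W.sum(axis=1, keepdims=True)        # mixing weights g_k(x_n)
    return np.sum(np.log(np.sum(G * P, axis=1)))
```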
We will first concentrate on the model that was introduced in Chap. 4, and provide two approaches to training it. Due to the weaknesses of these training procedures, discussed thereafter, a set of formally inspired and computationally cheap heuristics is introduced. Some empirical studies show that these heuristics perform competitively when compared to the optimum. The chapter concludes by comparing the approach of maximising the likelihood to a closely