model but it is less clear how to separate the global model into local classifier
models. Maximising the likelihood for such a model results in the least-squares
problem (6.35) with $f(\mathbf{x}; \theta) = \sum_k g_k(\mathbf{x}) \mathbf{w}_k^T \mathbf{x}$, the solution
to which has been discussed in the previous chapter.
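As a rough illustration of this global least-squares view, the following Python sketch (shapes, names, and the `mixing` callable are assumed for illustration, not taken from the text) computes the combined prediction $f(\mathbf{x}; \theta) = \sum_k g_k(\mathbf{x}) \mathbf{w}_k^T \mathbf{x}$ and the squared-error objective that the likelihood maximisation reduces to.

```python
import numpy as np

# Minimal sketch (assumed shapes and names): the global model predicts
# f(x; theta) = sum_k g_k(x) w_k^T x, and maximising the likelihood reduces
# to minimising the squared error of this combined prediction.

def global_prediction(x, W, G):
    """x: (D,) input; W: (K, D) classifier weight vectors; G: (K,) mixing weights g_k(x)."""
    return float(G @ (W @ x))  # sum_k g_k(x) * (w_k^T x)

def least_squares_objective(X, y, W, mixing):
    """Sum of squared residuals of the global model; mixing(x) returns the (K,) vector g(x)."""
    preds = np.array([global_prediction(x, W, mixing(x)) for x in X])
    return float(np.sum((np.asarray(y) - preds) ** 2))
```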
To the other extreme, one could from the start assume that the classifiers
are trained independently, such that each of them provides the model $c_k$ with
predictive density $p(y \,|\, \mathbf{x}, c_k)$. The global model is formed by marginalising over
the local models,
$$
p(y \,|\, \mathbf{x}) = \sum_{k=1}^{K} p(y \,|\, \mathbf{x}, c_k)\, p(c_k \,|\, \mathbf{x}),
\qquad (6.37)
$$
where $p(c_k \,|\, \mathbf{x})$ is the probability of the model of classifier $k$ being the "true"
model, given a certain input $\mathbf{x}$. This term can be used to introduce matching by
setting $p(c_k \,|\, \mathbf{x}) = 0$ if $m_k(\mathbf{x}) = 0$. Averaging over models by their probability
is known as Bayesian Model Averaging [107], which might initially appear to result
in the same formulation as the model derived from the generalised MoE model. The
essential difference, however, is that $p(y \,|\, \mathbf{x}, c_k)$ is independent of the model
parameters $\theta_k$, as it marginalises over them,
$$
p(y \,|\, \mathbf{x}, c_k) = \int p(y \,|\, \mathbf{x}, \theta_k, c_k)\, p(\theta_k \,|\, c_k)\, \mathrm{d}\theta_k.
\qquad (6.38)
$$
Therefore, it cannot be directly compared to the mixing models introduced in
this chapter, and should be treated as a different LCS model, closely related to
ensemble learning. Further research is required to see if such an approach leads
to viable LCS formulations.
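To make the distinction concrete, a minimal sketch of the Bayesian Model Averaging view in (6.37) and (6.38) could look as follows, assuming each independently trained classifier supplies a predictive density $p(y \,|\, \mathbf{x}, c_k)$ that is already marginalised over its parameters $\theta_k$; the function and argument names (`predictives`, `matching`, `model_probs`) are illustrative placeholders, not taken from the text.

```python
import numpy as np

# Sketch of the Bayesian Model Averaging view in (6.37)/(6.38), assuming each
# independently trained classifier k supplies a predictive density p(y | x, c_k)
# that has already been marginalised over its parameters theta_k.

def bma_predictive_density(y, x, predictives, matching, model_probs):
    """p(y | x) = sum_k p(y | x, c_k) p(c_k | x), with p(c_k | x) = 0 if m_k(x) = 0.

    predictives: list of callables p_k(y, x) returning the marginal density p(y | x, c_k)
    matching:    list of callables m_k(x) returning 0 or 1
    model_probs: prior model probabilities, renormalised over the matching classifiers
    """
    m = np.array([m_k(x) for m_k in matching], dtype=float)
    post = m * np.asarray(model_probs, dtype=float)  # zero weight for non-matching models
    if post.sum() == 0.0:
        return 0.0                                   # no classifier matches x
    post /= post.sum()                               # p(c_k | x), summing to one
    return float(sum(p_k(y, x) * w for p_k, w in zip(predictives, post)))
```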
6.5 Summary and Outlook
This chapter dealt with an essential LCS component that emerges directly from
the introduced LCS model and is largely ignored by LCS research: how to
combine the set of localised models provided by the classifiers into a global
prediction. The aim of this "mixing problem" was defined as maximising the
data likelihood (6.1) of the previously introduced LCS model.
As was shown, the IRLS algorithm is a possible approach to finding the globally
optimal mixing parameters V of the generalised softmax mixing model, but it
suffers from high complexity and can therefore act as nothing more than a
benchmark against which other approaches are compared. The least squares
approximation, on the other hand, scales well but lacks the desired performance,
as shown in experiments.
As an alternative, heuristics that are inspired by formal properties of mixing
by weighted average have been introduced. Not only do they scale well with the
number of classifiers, as they do not have any adjustable parameters other than
the classifier parameters, but they also perform better than mixing by the least
squares approximation. In particular, mixing by inverse variance makes the fewest
assumptions of the introduced heuristics, and is also the best-performing one.
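For concreteness, a minimal sketch of such an inverse-variance heuristic, under the assumption that each classifier $k$ reports a matching function $m_k$ and an estimate of its noise variance, could assign each matching classifier a mixing weight proportional to the inverse of that variance; the interface below is hypothetical.

```python
import numpy as np

def inverse_variance_mixing(x, matching, variances):
    """Mixing weights g_k(x) proportional to m_k(x) / sigma_k^2, normalised to sum to one.

    matching:  list of callables m_k(x) returning 0 or 1
    variances: per-classifier noise variance estimates sigma_k^2
    """
    m = np.array([m_k(x) for m_k in matching], dtype=float)
    g = m / np.asarray(variances, dtype=float)
    total = g.sum()
    return g / total if total > 0.0 else g  # all-zero weights if no classifier matches
```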
 