as will be demonstrated in Sect. 6.3. As an alternative, this section introduces
some heuristic mixing models that scale linearly with the number of classifiers,
just like the least squares approximation, and feature better performance.
Before discussing different heuristics, let us define the requirements on g_k: to
preserve their probabilistic interpretation, we require g_k(x) ≥ 0 for all k and x,
and Σ_k g_k(x) = 1 for all x. In addition, we need to honour matching, which
means that if m_k(x) = 0, we need to have g_k(x) = 0. These requirements are
met if we define

g_k(x) = \frac{m_k(x)\,\gamma_k(x)}{\sum_{j=1}^K m_j(x)\,\gamma_j(x)} ,    (6.18)
where {γ_k : X → R+} is a set of K functions returning positive scalars that
implicitly rely on the mixing model parameters V. Thus, the mixing model
defines a weighted average, where the weights are specified on one hand by the
matching functions, and on the other hand by the functions γ_k. The heuristics
differ from one another only in how they define the γ_k's.
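As an illustration of how (6.18) turns matching functions and the γ_k's into mixing weights, the following sketch computes g_k(x) for a batch of inputs. It is not part of the original text; the function name mixing_weights and the array layout are assumptions made for this example.

```python
import numpy as np

def mixing_weights(m, gamma):
    """Compute g_k(x) = m_k(x) * gamma_k(x) / sum_j m_j(x) * gamma_j(x), as in (6.18).

    m     : array of shape (K, N) with matching values m_k(x_n)
    gamma : array of shape (K, N) with positive scalars gamma_k(x_n)
    Returns an array of shape (K, N); each column is non-negative and sums to 1
    wherever at least one classifier matches.
    """
    weighted = m * gamma
    norm = weighted.sum(axis=0, keepdims=True)
    # Where no classifier matches, the weights are undefined; return zeros there.
    return np.divide(weighted, norm, out=np.zeros_like(weighted), where=norm > 0)
```

Any positive choice of γ_k plugged into this scheme yields weights that are non-negative, sum to one wherever some classifier matches, and vanish for non-matching classifiers, which is exactly the set of requirements stated above.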
Note that the generalised softmax function (6.2) also performs mixing by
weighted average, as it conforms to (6.18) with γ_k(x) = exp(v_k^T x) and mixing
model parameters V = {v_k}. The weights it assigns to each classifier are deter-
mined by the log-linear model exp(v_k^T x), which needs to be trained separately,
depending on the responsibilities that express the goodness-of-fit of the classifier
models for the different inputs. In contrast, all heuristic models that are introdu-
ced here rely on measures that are part of the classifiers' linear regression models
and do not need to be fitted separately. As they do not have any adjustable pa-
rameters, they all have V = ∅. The heuristics assume classifiers to use regression
rather than classification models. For the classification case, similar heuristics
are easily found by using the observations of the following section, which are valid
for any form of classifier model, to guide the design of these heuristics.
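For comparison, here is a minimal sketch of the generalised softmax case, where γ_k(x) = exp(v_k^T x); it only evaluates the mixing weights, with the separate training of V not shown. The function name softmax_mixing and the array shapes are assumptions for this example.

```python
import numpy as np

def softmax_mixing(m, V, X):
    """Generalised softmax mixing: plug gamma_k(x) = exp(v_k^T x) into (6.18).

    m : (K, N) matching values m_k(x_n)
    V : (K, D) mixing model parameters, one vector v_k per classifier
    X : (N, D) inputs
    """
    logits = V @ X.T                              # (K, N) entries v_k^T x_n
    logits -= logits.max(axis=0, keepdims=True)   # numerical stability; common factor cancels
    gamma = np.exp(logits)
    weighted = m * gamma
    norm = weighted.sum(axis=0, keepdims=True)
    return np.divide(weighted, norm, out=np.zeros_like(weighted), where=norm > 0)
```

Subtracting the per-input maximum from the logits does not change the result, since the common factor cancels between numerator and denominator of (6.18), but it avoids overflow for large v_k^T x.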
6.2.1 Properties of Weighted Averaging Mixing
Let f̂_k : X → R be given by f̂_k(x) = E(y | x, θ_k), that is, the estimator of
classifier k defined by the mean of the conditional distribution of the output given
the input and the classifier parameters. Equally, let f̂ : X → R be the global
model estimator, given by f̂(x) = E(y | x). As by (4.8) we have p(y | x) =
Σ_k g_k(x) p(y | x, θ_k), the global estimator is related to the local estimators by

\hat{f}(x) = \int_{\mathcal{Y}} y \sum_k g_k(x)\, p(y \,|\, x, \theta_k)\, \mathrm{d}y = \sum_k g_k(x)\, \hat{f}_k(x) ,    (6.19)
and, thus, is also a weighted average of the local estimators. From this it follows
that f̂ is bounded from below and above by the lowest and highest estimate of
the local models, respectively, that is,

\min_k \hat{f}_k(x) \;\leq\; \hat{f}(x) \;\leq\; \max_k \hat{f}_k(x) , \qquad \forall x \in \mathcal{X} .    (6.20)
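The short numerical check below illustrates (6.19) and (6.20); it is an illustrative sketch with made-up example values, not part of the original text. The global estimate is formed as the g_k-weighted average of local estimates and always lies between their minimum and maximum.

```python
import numpy as np

rng = np.random.default_rng(0)
K, N = 4, 10
f_local = rng.normal(size=(K, N))                   # example local estimates f_k(x_n)
m = rng.integers(0, 2, size=(K, N)).astype(float)   # example binary matching m_k(x_n)
m[:, m.sum(axis=0) == 0] = 1.0                      # ensure every input is matched by someone
gamma = rng.uniform(0.1, 1.0, size=(K, N))          # positive weighting functions gamma_k(x_n)

g = m * gamma / (m * gamma).sum(axis=0, keepdims=True)   # mixing weights (6.18)
f_global = (g * f_local).sum(axis=0)                     # global estimate (6.19)

# Bound (6.20): the global estimate lies between the lowest and highest local estimates.
assert np.all(f_global >= f_local.min(axis=0) - 1e-12)
assert np.all(f_global <= f_local.max(axis=0) + 1e-12)
```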
In general, we aim at minimising the deviation of the global estimator f̂ from
the target function f that describes the data-generating process. If we measure
 