global prediction by $s = \sum_k g_k(\mathbf{x})\, s_k$, which has the sample variance 0.00190. This conforms to - and thus empirically validates - the variance after (6.24), which results in $\mathrm{var}(y \,|\, \mathbf{x}, \boldsymbol{\theta}) = 0.00191 < 0.0055$.
6.2.2 Inverse Variance
The unbiased noise variance estimate of a linear regression classifier k is, after
(5.13), given by
$$\hat{\tau}_k^{-1} = (c_k - D_X)^{-1} \sum_{n=1}^{N} m_k(\mathbf{x}_n)\left(\hat{\mathbf{w}}_k^T \mathbf{x}_n - y_n\right)^2, \qquad (6.26)$$
and is therefore approximately the mean of the squared prediction errors over the matched observations. If this estimate is small, the squared prediction error is, on average, known to be small, and we can expect the predictions to have a low error. Hence, inverse variance mixing is defined by using mixing weights that are inversely proportional to the noise variance estimates of the corresponding classifiers. More formally, $\gamma_k(\mathbf{x}) = \hat{\tau}_k$ in (6.18) for all $\mathbf{x}$. The previous chapter has shown how to estimate the noise variance of a classifier by batch or incremental learning.
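To make this concrete, the following Python sketch computes the unbiased noise variance estimate (6.26) for a single classifier and then forms inverse variance mixing weights. The function names, the NumPy interface, and the assumption that (6.18) normalises $m_k(\mathbf{x})\gamma_k(\mathbf{x})$ over all classifiers are illustrative and not part of the original text.

```python
import numpy as np

def noise_variance_estimate(X, y, w_hat, m):
    """Unbiased noise variance estimate of one classifier, after (6.26).

    X     : (N, D_X) matrix of inputs x_n
    y     : (N,) vector of outputs y_n
    w_hat : (D_X,) estimated weight vector of the classifier
    m     : (N,) matching values m_k(x_n)
    """
    c_k = m.sum()                        # match count c_k
    D_X = X.shape[1]                     # input dimensionality
    sq_err = (X @ w_hat - y) ** 2        # squared prediction errors (w_k^T x_n - y_n)^2
    return float((m * sq_err).sum() / (c_k - D_X))

def inverse_variance_mixing(tau_inv, m_x):
    """Mixing weights g_k(x) with gamma_k(x) = tau_k, i.e. inverse variance mixing.

    tau_inv : (K,) noise variance estimates tau_k^{-1} of the K classifiers
    m_x     : (K,) matching values m_k(x) at the query input
    """
    gamma = 1.0 / tau_inv                # gamma_k(x) = tau_k
    g = m_x * gamma
    return g / g.sum()                   # normalise, assuming (6.18)-style weighting
```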
6.2.3 Prediction Confidence
If the classifier model is probabilistic, its prediction can be given by a probability density. Knowing this density allows for the specification of an interval
on the output into which 95% of the observations are likely to fall, known as
the 95% confidence interval. The width of this interval therefore gives a measure
of how certain we are about the prediction made by this classifier. This is the
underlying idea of mixing by prediction confidence.
More formally, the predictive density of the linear classifier model is given for classifier $k$ by marginalising $p(y, \boldsymbol{\theta}_k \,|\, \mathbf{x}) = p(y \,|\, \mathbf{x}, \boldsymbol{\theta}_k)\, p(\boldsymbol{\theta}_k)$ over the parameters $\boldsymbol{\theta}_k$, and results in
$$p(y \,|\, \mathbf{x}) = \mathcal{N}\!\left(y \,\middle|\, \hat{\mathbf{w}}_k^T \mathbf{x},\ \hat{\tau}_k^{-1}\left(\mathbf{x}^T \boldsymbol{\Lambda}_k^{-1} \mathbf{x} + 1\right)\right), \qquad (6.27)$$
as already introduced in Sect. 5.3.6. The width of the 95% confidence interval - indeed that of any percentage - is directly proportional to the standard deviation of this
density, which is the square root of its variance. Thus, to assign higher weights
to classifiers with a higher confidence prediction, that is, a prediction with a
smaller confidence interval, $\gamma_k(\mathbf{x})$ is set to
$$\gamma_k(\mathbf{x}) = \left(\hat{\tau}_k^{-1}\left(\mathbf{x}^T \boldsymbol{\Lambda}_k^{-1} \mathbf{x} + 1\right)\right)^{-1/2}. \qquad (6.28)$$
Compared to mixing by inverse variance, this measure additionally takes the
uncertainty of the weight vector estimate into account and is consequently de-
pendent on the input. Additionally, it relies on the assumption of Gaussian noise
and a Gaussian weight vector model, which might not hold - in particular when
the number of observations that the classifier is trained on is small. Therefore,
despite using more information than mixing by inverse variance, it cannot be
guaranteed to perform better.
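As a complementary sketch, and under the same illustrative assumptions as above (hypothetical function names, NumPy arrays, and the normalisation of $m_k(\mathbf{x})\gamma_k(\mathbf{x})$ over all classifiers), mixing by prediction confidence could be implemented as follows: each classifier's predictive variance from (6.27) is evaluated at the query input, $\gamma_k(\mathbf{x})$ is taken as the inverse predictive standard deviation as in (6.28), and the resulting weights mix the individual predictions $\hat{\mathbf{w}}_k^T \mathbf{x}$.

```python
import numpy as np

def confidence_mixed_prediction(x, w_hats, Lambdas, tau_invs, m_x):
    """Global prediction mixed by prediction confidence, after (6.27) and (6.28).

    x        : (D_X,) query input
    w_hats   : (K, D_X) estimated weight vectors of the K classifiers
    Lambdas  : (K, D_X, D_X) precision matrices Lambda_k
    tau_invs : (K,) noise variance estimates tau_k^{-1}
    m_x      : (K,) matching values m_k(x)
    """
    K = len(tau_invs)
    gamma = np.empty(K)
    for k in range(K):
        # predictive variance tau_k^{-1} (x^T Lambda_k^{-1} x + 1), after (6.27)
        pred_var = tau_invs[k] * (x @ np.linalg.solve(Lambdas[k], x) + 1.0)
        # gamma_k(x) is the inverse predictive standard deviation, after (6.28)
        gamma[k] = pred_var ** -0.5
    g = m_x * gamma
    g /= g.sum()                          # mixing weights g_k(x)
    return float(g @ (w_hats @ x))        # mixed prediction sum_k g_k(x) w_k^T x
```

Note that, unlike inverse variance mixing, the resulting weights depend on the query input through the term $\mathbf{x}^T \boldsymbol{\Lambda}_k^{-1} \mathbf{x}$, reflecting the uncertainty of the weight vector estimate discussed above.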
 