all x, the variance of y is bounded from above by the weighted average of the variance of the local models for y, that is

\[
\operatorname{var}(y \mid \mathbf{x}, \theta) = \sum_k g_k(\mathbf{x})^2 \operatorname{var}(y \mid \mathbf{x}, \theta_k) \;\le\; \sum_k g_k(\mathbf{x}) \operatorname{var}(y \mid \mathbf{x}, \theta_k), \qquad \mathbf{x} \in \mathcal{X}. \tag{6.24}
\]
Proof. To show the above, we again take the view that each observation was generated by one and only one classifier, and introduce the indicator variable I as a conceptual tool that takes the value k if classifier k generated the observation, giving g_k(x) ≡ p(I = k | x), where we are omitting the parameters of the mixing models implicit in g_k. We also use p(y | x, θ_k) ≡ p(y | x, I = k) to denote the model provided by classifier k. Thus, we have p(y | x, θ) = Σ_k p(I = k | x) p(y | x, I = k), and, analogously, E(y | x, θ) = Σ_k p(I = k | x) E(y | x, I = k). However, similarly to the basic relation var(aX + bY) = a² var(X) + b² var(Y) + 2ab cov(X, Y), we have for the variance

\[
\operatorname{var}(y \mid \mathbf{x}, \theta) = \sum_k p(I = k \mid \mathbf{x})^2 \operatorname{var}(y \mid \mathbf{x}, I = k) + 0, \tag{6.25}
\]

where the covariance terms are zero as the classifier models are conditionally independent given I. This confirms the equality in (6.24). The inequality is justified by observing that the variance is non-negative, and 0 ≤ g_k(x) ≤ 1, and so g_k(x)² ≤ g_k(x).
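One way to spell out the step from the two-variable relation to (6.25) — this reading is a gloss, not stated explicitly in the text — is to write the combined prediction as the weighted sum Σ_k g_k(x) y_k of per-classifier predictions y_k distributed according to p(y | x, I = k). The basic relation then generalises to

\[
\operatorname{var}\Big(\sum_k g_k(\mathbf{x})\, y_k\Big) = \sum_k g_k(\mathbf{x})^2 \operatorname{var}(y_k) + \sum_{k \neq l} g_k(\mathbf{x})\, g_l(\mathbf{x}) \operatorname{cov}(y_k, y_l),
\]

and every covariance term vanishes because the y_k are conditionally independent given I, which leaves exactly (6.25).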
Here, not only a bound but also an exact expression for the variance of the
combined prediction is provided. This results in a different view on the design
criteria for possible heuristics: we want to assign weights that are in some way
inversely proportional to the classifier prediction variance. As the prediction
variance indicates the expected prediction error, this design criterion conforms
to the one that is based on Theorem 6.1.
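As a small illustration of this design criterion, the following sketch (Python with NumPy) computes mixing weights inversely proportional to each classifier's prediction variance and mixes the classifier means in the spirit of (6.19). The normalisation to weights summing to one, and the use of the numbers from Example 6.3 below, are choices made here rather than prescribed by the text.

```python
import numpy as np

# Predictive means and variances of three classifiers for some input x
# (the same numbers as in Example 6.3 below).
means = np.array([0.2, 0.5, 0.7])
variances = np.array([0.1, 0.05, 0.2]) ** 2

# Mixing weights inversely proportional to the prediction variances,
# normalised so that they sum to one (an assumed normalisation).
g = (1.0 / variances) / np.sum(1.0 / variances)
print(g.round(3))       # [0.19  0.762 0.048]; Example 6.3 rounds these to 0.20, 0.76, 0.04

# Combined prediction as the weighted average of the classifier means.
f_hat = float(np.sum(g * means))
print(round(f_hat, 3))  # about 0.452 (0.448 when using the rounded weights of the example)
```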
Neither Theorem 6.1 nor Theorem 6.2 assumes that the local models are linear. In fact, they apply to any case where a global model results from a weighted average of a set of local models. Thus, they can also be used in LCS when the classifier models are classification models or non-linear models (for example, [156, 175]).
Example 6.3 (Mean and Variance of a Mixture of Gaussians). Consider 3 classifiers that, for some input x, provide the predictions p(y | x, θ_1) = N(y | 0.2, 0.1²), p(y | x, θ_2) = N(y | 0.5, 0.05²), and p(y | x, θ_3) = N(y | 0.7, 0.2²). Using the mixing weights inversely proportional to their variance, that is g_1(x) = 0.20, g_2(x) = 0.76, and g_3(x) = 0.04, our global estimator f̂(x), determined by (6.19), results in f̂(x) = 0.448. Let us assume that the target function value is given by f(x) = 0.5, resulting in the squared prediction error (f(x) − f̂(x))² = 0.002704. This error is correctly upper-bounded by (6.22), which results in (f(x) − f̂(x))² ≤ 0.0196. The correctness of (6.24) is demonstrated by taking 10⁶ samples from the predictive distributions of the different classifiers, resulting in the sample vectors s_1, s_2, and s_3, each of size 10⁶. Thus, we can produce a sample vector of the
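The passage breaks off here; a minimal sketch of the sampling check it describes could look as follows (Python with NumPy). The construction of the combined sample vector as the element-wise weighted sum g_1(x) s_1 + g_2(x) s_2 + g_3(x) s_3 is an assumption made here, chosen to match the weighted-sum view taken in the proof of Theorem 6.2.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10**6

# Predictive distributions and mixing weights from Example 6.3.
means = np.array([0.2, 0.5, 0.7])
stddevs = np.array([0.1, 0.05, 0.2])
g = np.array([0.20, 0.76, 0.04])

# Sample vectors s_1, s_2, s_3 from the classifiers' predictive densities.
s1, s2, s3 = (rng.normal(m, sd, n) for m, sd in zip(means, stddevs))

# Assumed combined sample vector: element-wise weighted sum of the samples.
s = g[0] * s1 + g[1] * s2 + g[2] * s3

print(s.mean())                    # close to 0.448, the combined estimate from the example
print(s.var())                     # close to sum_k g_k^2 var_k, about 0.0019
print(np.sum(g**2 * stddevs**2))   # equality side of (6.24): 0.001908
print(np.sum(g * stddevs**2))      # weighted-average bound of (6.24): 0.0055
```

The empirical variance of s agrees with the equality in (6.24) and stays clearly below the weighted-average bound, which is the relation the example sets out to demonstrate.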
 