all x, the variance of y is bounded from above by the weighted average of the variance of the local models for y, that is

\[
\operatorname{var}(y \mid \mathbf{x}, \theta) = \sum_k g_k(\mathbf{x})^2 \operatorname{var}(y \mid \mathbf{x}, \theta_k) \;\le\; \sum_k g_k(\mathbf{x}) \operatorname{var}(y \mid \mathbf{x}, \theta_k), \qquad \mathbf{x} \in \mathcal{X}. \tag{6.24}
\]
Proof. To show the above, we again take the view that each observation was generated by one and only one classifier, and introduce the indicator variable I as a conceptual tool that takes the value k if classifier k generated the observation, giving g_k(x) ≡ p(I = k | x), where we are omitting the parameters of the mixing models implicit in g_k. We also use p(y | x, θ_k) ≡ p(y | x, I = k) to denote the model provided by classifier k. Thus, we have p(y | x, θ) = Σ_k p(I = k | x) p(y | x, I = k), and, analogously, E(y | x, θ) = Σ_k p(I = k | x) E(y | x, I = k). However, similarly to the basic relation var(aX + bY) = a² var(X) + b² var(Y) + 2ab cov(X, Y), we have for the variance

\[
\operatorname{var}(y \mid \mathbf{x}, \theta) = \sum_k p(I = k \mid \mathbf{x})^2 \operatorname{var}(y \mid \mathbf{x}, I = k) + 0, \tag{6.25}
\]

where the covariance terms are zero as the classifier models are conditionally independent given I. This confirms the equality in (6.24). The inequality is justified by observing that the variance is non-negative, and 0 ≤ g_k(x) ≤ 1, and so g_k(x)² ≤ g_k(x).
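One way to spell out the step from the two-variable relation to (6.25) — this reading is a gloss, not stated explicitly in the text — is to write the combined prediction as the weighted sum Σ_k g_k(x) y_k of per-classifier predictions y_k distributed according to p(y | x, I = k). The basic relation then generalises to

\[
\operatorname{var}\Big(\sum_k g_k(\mathbf{x})\, y_k\Big) = \sum_k g_k(\mathbf{x})^2 \operatorname{var}(y_k) + \sum_{k \neq l} g_k(\mathbf{x})\, g_l(\mathbf{x}) \operatorname{cov}(y_k, y_l),
\]

and every covariance term vanishes because the y_k are conditionally independent given I, which leaves exactly (6.25).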
Here, not only a bound but also an exact expression for the variance of the
combined prediction is provided. This results in a different view on the design
criteria for possible heuristics: we want to assign weights that are in some way
inversely proportional to the classifier prediction variance. As the prediction
variance indicates the expected prediction error, this design criterion conforms
to the one that is based on Theorem 6.1.
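As a small illustration of this design criterion, the following sketch (Python with NumPy) computes mixing weights inversely proportional to each classifier's prediction variance and mixes the classifier means in the spirit of (6.19). The normalisation to weights summing to one, and the use of the numbers from Example 6.3 below, are choices made here rather than prescribed by the text.

```python
import numpy as np

# Predictive means and variances of three classifiers for some input x
# (the same numbers as in Example 6.3 below).
means = np.array([0.2, 0.5, 0.7])
variances = np.array([0.1, 0.05, 0.2]) ** 2

# Mixing weights inversely proportional to the prediction variances,
# normalised so that they sum to one (an assumed normalisation).
g = (1.0 / variances) / np.sum(1.0 / variances)
print(g.round(3))       # [0.19  0.762 0.048]; Example 6.3 rounds these to 0.20, 0.76, 0.04

# Combined prediction as the weighted average of the classifier means.
f_hat = float(np.sum(g * means))
print(round(f_hat, 3))  # about 0.452 (0.448 when using the rounded weights of the example)
```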
Neither Theorem 6.1 nor Theorem 6.2 assumes that the local models are linear. In fact, they apply to any case where a global model results from a weighted average of a set of local models. Thus, they can also be used in LCS when the classifier models are classification models or non-linear models (for example, [156, 175]).
Example 6.3 (Mean and Variance of a Mixture of Gaussians). Consider 3 classifiers that, for some input x, provide the predictions p(y | x, θ_1) = N(y | 0.2, 0.1²), p(y | x, θ_2) = N(y | 0.5, 0.05²), and p(y | x, θ_3) = N(y | 0.7, 0.2²). Using the mixing weights inversely proportional to their variance, that is g_1(x) = 0.20, g_2(x) = 0.76, and g_3(x) = 0.04, our global estimator f̂(x), determined by (6.19), results in f̂(x) = 0.448. Let us assume that the target function value is given by f(x) = 0.5, resulting in the squared prediction error (f(x) − f̂(x))² = 0.002704. This error is correctly upper-bounded by (6.22), which results in (f(x) − f̂(x))² ≤ 0.0196. The correctness of (6.24) is demonstrated by taking 10⁶ samples from the predictive distributions of the different classifiers, resulting in the sample vectors s_1, s_2, and s_3, each of size 10⁶. Thus, we can produce a sample vector of the
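The passage breaks off here; a minimal sketch of the sampling check it describes could look as follows (Python with NumPy). The construction of the combined sample vector as the element-wise weighted sum g_1(x) s_1 + g_2(x) s_2 + g_3(x) s_3 is an assumption made here, chosen to match the weighted-sum view taken in the proof of Theorem 6.2.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10**6

# Predictive distributions and mixing weights from Example 6.3.
means = np.array([0.2, 0.5, 0.7])
stddevs = np.array([0.1, 0.05, 0.2])
g = np.array([0.20, 0.76, 0.04])

# Sample vectors s_1, s_2, s_3 from the classifiers' predictive densities.
s1, s2, s3 = (rng.normal(m, sd, n) for m, sd in zip(means, stddevs))

# Assumed combined sample vector: element-wise weighted sum of the samples.
s = g[0] * s1 + g[1] * s2 + g[2] * s3

print(s.mean())                    # close to 0.448, the combined estimate from the example
print(s.var())                     # close to sum_k g_k^2 var_k, about 0.0019
print(np.sum(g**2 * stddevs**2))   # equality side of (6.24): 0.001908
print(np.sum(g * stddevs**2))      # weighted-average bound of (6.24): 0.0055
```

The empirical variance of s agrees with the equality in (6.24) and stays clearly below the weighted-average bound, which is the relation the example sets out to demonstrate.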
 