global prediction by $s = \sum_k g_k(\mathbf{x})\, s_k$, which has the sample variance 0.00190. This conforms to - and thus empirically validates - the variance after (6.24), which results in $\mathrm{var}(y \,|\, \mathbf{x}, \boldsymbol{\theta}) = 0.00191 < 0.0055$.
6.2.2 Inverse Variance
The unbiased noise variance estimate of a linear regression classifier k is, after
(5.13), given by
$$\hat{\tau}_k^{-1} = (c_k - D_X)^{-1} \sum_{n=1}^{N} m_k(\mathbf{x}_n)\left(\hat{\mathbf{w}}_k^T \mathbf{x}_n - y_n\right)^2, \qquad (6.26)$$
and is therefore approximately the mean of the squared prediction errors over the matched observations. If this estimate is small, the squared prediction error is, on average, known to be small, and we can expect the predictions to have a low error. Hence, inverse variance mixing is defined by using mixing weights that are inversely proportional to the noise variance estimates of the corresponding classifiers. More formally, $\gamma_k(\mathbf{x}) = \hat{\tau}_k$ in (6.18) for all $\mathbf{x}$. The previous chapter has shown how to estimate the noise variance of a classifier by batch or incremental learning.
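To make this concrete, the following Python sketch computes the unbiased noise variance estimate (6.26) for a single classifier and then forms inverse variance mixing weights. The function names, the NumPy interface, and the assumption that (6.18) normalises $m_k(\mathbf{x})\gamma_k(\mathbf{x})$ over all classifiers are illustrative and not part of the original text.

```python
import numpy as np

def noise_variance_estimate(X, y, w_hat, m):
    """Unbiased noise variance estimate of one classifier, after (6.26).

    X     : (N, D_X) matrix of inputs x_n
    y     : (N,) vector of outputs y_n
    w_hat : (D_X,) estimated weight vector of the classifier
    m     : (N,) matching values m_k(x_n)
    """
    c_k = m.sum()                        # match count c_k
    D_X = X.shape[1]                     # input dimensionality
    sq_err = (X @ w_hat - y) ** 2        # squared prediction errors (w_k^T x_n - y_n)^2
    return float((m * sq_err).sum() / (c_k - D_X))

def inverse_variance_mixing(tau_inv, m_x):
    """Mixing weights g_k(x) with gamma_k(x) = tau_k, i.e. inverse variance mixing.

    tau_inv : (K,) noise variance estimates tau_k^{-1} of the K classifiers
    m_x     : (K,) matching values m_k(x) at the query input
    """
    gamma = 1.0 / tau_inv                # gamma_k(x) = tau_k
    g = m_x * gamma
    return g / g.sum()                   # normalise, assuming (6.18)-style weighting
```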
6.2.3 Prediction Confidence
If the classifier model is probabilistic, its prediction can be given by a probability density. Knowing this density allows for the specification of an interval
on the output into which 95% of the observations are likely to fall, known as
the 95% confidence interval. The width of this interval therefore gives a measure
of how certain we are about the prediction made by this classifier. This is the
underlying idea of mixing by prediction confidence.
More formally, the predictive density of the linear classifier model is given for classifier $k$ by marginalising $p(y, \boldsymbol{\theta}_k \,|\, \mathbf{x}) = p(y \,|\, \mathbf{x}, \boldsymbol{\theta}_k)\, p(\boldsymbol{\theta}_k)$ over the parameters $\boldsymbol{\theta}_k$, and results in
$$p(y \,|\, \mathbf{x}) = \mathcal{N}\!\left(y \,\middle|\, \hat{\mathbf{w}}_k^T \mathbf{x},\ \hat{\tau}_k^{-1}\left(\mathbf{x}^T \boldsymbol{\Lambda}_k^{-1} \mathbf{x} + 1\right)\right), \qquad (6.27)$$
as already introduced in Sect. 5.3.6. The width of the 95% confidence interval - indeed that of any percentage - is directly proportional to the standard deviation of this
density, which is the square root of its variance. Thus, to assign higher weights
to classifiers with a higher confidence prediction, that is, a prediction with a
smaller confidence interval, $\gamma_k(\mathbf{x})$ is set to
$$\gamma_k(\mathbf{x}) = \left(\hat{\tau}_k^{-1}\left(\mathbf{x}^T \boldsymbol{\Lambda}_k^{-1} \mathbf{x} + 1\right)\right)^{-1/2}. \qquad (6.28)$$
Compared to mixing by inverse variance, this measure additionally takes the
uncertainty of the weight vector estimate into account and is consequently de-
pendent on the input. Additionally, it relies on the assumption of Gaussian noise
and a Gaussian weight vector model, which might not hold - in particular when
the number of observations that the classifier is trained on is small. Therefore,
despite using more information than mixing by inverse variance, it cannot be
guaranteed to perform better.
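As a complementary sketch, and under the same illustrative assumptions as above (hypothetical function names, NumPy arrays, and the normalisation of $m_k(\mathbf{x})\gamma_k(\mathbf{x})$ over all classifiers), mixing by prediction confidence could be implemented as follows: each classifier's predictive variance from (6.27) is evaluated at the query input, $\gamma_k(\mathbf{x})$ is taken as the inverse predictive standard deviation as in (6.28), and the resulting weights mix the individual predictions $\hat{\mathbf{w}}_k^T \mathbf{x}$.

```python
import numpy as np

def confidence_mixed_prediction(x, w_hats, Lambdas, tau_invs, m_x):
    """Global prediction mixed by prediction confidence, after (6.27) and (6.28).

    x        : (D_X,) query input
    w_hats   : (K, D_X) estimated weight vectors of the K classifiers
    Lambdas  : (K, D_X, D_X) precision matrices Lambda_k
    tau_invs : (K,) noise variance estimates tau_k^{-1}
    m_x      : (K,) matching values m_k(x)
    """
    K = len(tau_invs)
    gamma = np.empty(K)
    for k in range(K):
        # predictive variance tau_k^{-1} (x^T Lambda_k^{-1} x + 1), after (6.27)
        pred_var = tau_invs[k] * (x @ np.linalg.solve(Lambdas[k], x) + 1.0)
        # gamma_k(x) is the inverse predictive standard deviation, after (6.28)
        gamma[k] = pred_var ** -0.5
    g = m_x * gamma
    g /= g.sum()                          # mixing weights g_k(x)
    return float(g @ (w_hats @ x))        # mixed prediction sum_k g_k(x) w_k^T x
```

Note that, unlike inverse variance mixing, the resulting weights depend on the query input through the term $\mathbf{x}^T \boldsymbol{\Lambda}_k^{-1} \mathbf{x}$, reflecting the uncertainty of the weight vector estimate discussed above.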
 