The Gaussian prior on $\omega$ provides a different interpretation of the ridge
complexity $\lambda$ in ridge regression: recalling that $\lambda$ corresponds to initialising
RLS with $\Lambda_0^{-1} = \lambda^{-1} I$, it is also equivalent to using the Kalman filter with
the prior $\omega_0 \sim \mathcal{N}(0, (\lambda\tau)^{-1} I)$. Hence, ridge regression assumes the weight
vector to be centred on $0$, with an independent variance of $(\lambda\tau)^{-1}$ for each
element of this vector. As the prior covariance is proportional to the real
noise variance $\tau^{-1}$, a smaller noise variance causes stronger shrinkage due to a
more informative prior.
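To make this correspondence concrete, the following short derivation (a sketch in
standard Bayesian linear-model notation, shown for the fully matched case) confirms
that the posterior mean under this prior is exactly the ridge regression estimate.
The prior and likelihood

    \omega_0 \sim \mathcal{N}\left(0, (\lambda\tau)^{-1} I\right),
    \qquad
    y \mid \omega \sim \mathcal{N}\left(X\omega, \tau^{-1} I\right)

yield a Gaussian posterior over $\omega$ with mean

    \left(\lambda\tau I + \tau X^\top X\right)^{-1} \tau X^\top y
    = \left(X^\top X + \lambda I\right)^{-1} X^\top y,

which is the ridge regression solution; $\tau$ cancels precisely because the prior
covariance $(\lambda\tau)^{-1} I$ scales with $\tau^{-1}$.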
What if the noise distribution is not Gaussian? Would that invalidate the
approach taken by RLS and the Kalman filter? Fortunately, the Gauss-Markov
Theorem (for example, [97]) states that the least squares estimate is optimal
independent of the shape of the noise distribution, as long as its variance is
constant over all observations. Nonetheless, adding the assumption of Gaussian
noise and acquiring a Gaussian model for the weight vector allows us to specify
the predictive density. Without these assumptions, we would be unable to make
any statements about this density, and would consequently also be unable to
provide a measure of the prediction confidence.
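For completeness: with a Gaussian posterior $\omega \sim \mathcal{N}(w_N, \Sigma_N)$ and Gaussian
noise of variance $\tau^{-1}$ (generic notation, not necessarily the symbols used
elsewhere in this chapter), the predictive density at a new input $x$ takes the
standard form

    y \mid x \sim \mathcal{N}\left(w_N^\top x, \; \tau^{-1} + x^\top \Sigma_N x\right),

whose variance separates into the irreducible noise $\tau^{-1}$ and the parameter
uncertainty $x^\top \Sigma_N x$; it is this second term that quantifies the
prediction confidence.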
In summary, demonstrating the formal equivalence between the RLS algo-
rithm and the Kalman filter for a stationary system state has significantly
deepened our understanding of the assumptions underlying the RLS method,
and provides intuitive interpretations of matching and recency-weighting by
relating them to an increased uncertainty about the observations.
5.3.7 Incremental Noise Precision Estimation
So far, the discussion of the incremental methods has focused on estimating the
weight vector that solves (5.5). Let us now consider how we can estimate the
noise precision by incrementally solving (5.6).
For batch learning it was already demonstrated that (5.11) and (5.13) pro-
vide a biased and an unbiased noise precision estimate, respectively, that solve
(5.6). The same solutions remain valid when using an incremental approach,
and thus, after N observations,
    \tau_N^{-1} = c_N^{-1} \left\| X_N w_N - y_N \right\|_{M_N}^2
    \qquad (5.62)
provides a biased estimate of the noise precision, and
    \tau_N^{-1} = \left(c_N - D_{\mathcal{X}}\right)^{-1} \left\| X_N w_N - y_N \right\|_{M_N}^2
    \qquad (5.63)
is the unbiased estimate. Ideally, $w_N$ is the weight vector that satisfies the
Principle of Orthogonality, but if gradient-based methods are utilised, we are
forced to rely on the current (possibly quite wrong) estimate.
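As a concrete illustration, the following minimal NumPy sketch evaluates both
estimators directly from their definitions. It assumes that m holds the matching
values m(x_n), so that c_N is their sum and the norm is the matching-weighted
squared norm; the variable and function names are this sketch's own, not the
book's:

import numpy as np

def noise_precision_estimates(X, y, w, m):
    """Evaluate the biased (5.62) and unbiased (5.63) estimates.

    X : (N, D_X) input matrix; y : (N,) outputs; w : (D_X,) weight
    vector; m : (N,) matching values m(x_n). Both returned values
    estimate the noise variance tau^{-1}.
    """
    residuals = X @ w - y              # per-observation prediction errors
    weighted_sq = m @ residuals**2     # ||X_N w_N - y_N||^2 weighted by M_N
    c_N = m.sum()                      # match count c_N
    D_X = X.shape[1]                   # dimensionality of the input space
    return weighted_sq / c_N, weighted_sq / (c_N - D_X)

# Example: fully matched linear model with true noise variance 0.25.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=100)])
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.5, size=100)
m = np.ones(100)
s = np.sqrt(m)                         # match-weighted least squares via sqrt(m)
w, *_ = np.linalg.lstsq(X * s[:, None], y * s, rcond=None)
print(noise_precision_estimates(X, y, w, m))   # both close to 0.25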
Let us first derive a gradient-based method for estimating the noise preci-
sion, which is the one applied in XCS. Following that, a much more accurate
approach is introduced that can be used alongside the RLS algorithm to track
the exact noise precision estimate (5.63) for each additional observation.
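The precise form of the gradient-based update is developed in what follows;
purely to fix the idea, an LMS-style exponential-averaging sketch might look
as follows. The update rule, the step size gamma, and the function name are
assumptions of this illustration, not the text's definitions:

def lms_noise_variance_update(tau_inv, x, y, w, m_x, gamma=0.05):
    """One hypothetical LMS-style step for the noise variance estimate.

    Moves tau_inv a fraction gamma * m_x towards the current squared
    residual (w^T x - y)^2, so the estimate tracks a recency-weighted
    average of squared prediction errors.
    """
    prediction = sum(w_i * x_i for w_i, x_i in zip(w, x))  # w^T x
    sq_error = (prediction - y) ** 2
    return tau_inv + gamma * m_x * (sq_error - tau_inv)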