Training the Classifiers - Design and Analysis of Learning Classifier Systems


corresponds to the RLS algorithm that uses (5.35), and the inverse covariance form is equivalent to using (5.31). They also share the same characteristics: while (5.35) is computationally cheaper, it cannot be used with a non-informative prior, just like the covariance form. Conversely, using (5.31) allows the use of non-informative priors, but requires a matrix inversion with every additional update, as does the inverse covariance form to recover $w$ by $w = \Lambda^{-1}(\Lambda w)$, making it computationally more expensive.
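The trade-off between the two forms can be sketched in code. The following is a minimal illustration, not the book's implementation: it assumes that (5.35) is a rank-one (Sherman-Morrison) update of $\Lambda^{-1}$ and that (5.31) accumulates $\Lambda$ directly, neither of which is shown in this section. The function names, the prior scale `delta`, and the synthetic data are hypothetical.

```python
import numpy as np

def rls_covariance_form(X, y, m, delta=1e3):
    """Track P = Lambda^{-1} directly (assumed to correspond to (5.35))."""
    d = X.shape[1]
    w = np.zeros(d)
    P = delta * np.eye(d)  # prior covariance must be finite: no non-informative prior
    for x_n, y_n, m_n in zip(X, y, m):
        Px = P @ x_n
        gain = m_n * Px / (1.0 + m_n * (x_n @ Px))
        w = w + gain * (y_n - w @ x_n)   # correct w by the weighted prediction error
        P = P - np.outer(gain, Px)       # Sherman-Morrison update, no inversion needed
    return w, P

def rls_information_form(X, y, m, delta=1e3):
    """Track Lambda and Lambda w (assumed to correspond to (5.31))."""
    d = X.shape[1]
    Lam = np.eye(d) / delta  # prior precision; zeros would give a non-informative prior
    eta = np.zeros(d)        # eta stores Lambda w
    for x_n, y_n, m_n in zip(X, y, m):
        Lam = Lam + m_n * np.outer(x_n, x_n)
        eta = eta + m_n * y_n * x_n
    # Recovering w = Lambda^{-1} (Lambda w) needs a linear solve; doing this
    # after every update is what makes this form more expensive.
    w = np.linalg.solve(Lam, eta)
    return w, Lam

# Hypothetical synthetic data with matching values in (0, 1].
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)
m = rng.uniform(0.2, 1.0, size=50)

w_cov, P = rls_covariance_form(X, y, m)
w_inf, Lam = rls_information_form(X, y, m)
```

With the same finite prior, both routines return the same weight vector up to rounding. Only the second would still work with the prior precision set to zero, provided enough observations have made $\Lambda$ invertible.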

The information gain from this relation is manifold:

• The weight vector of the linear model corresponds to the system state of the Kalman filter. Hence, it can be modelled by a multivariate Gaussian that, in the notation of the RLS algorithm, is given by $\omega \sim \mathcal{N}\left(w_N, (\tau \Lambda_N)^{-1}\right)$. As $\tau$ is unknown, it needs to be substituted by its estimate $\hat{\tau}$.

• Acquiring this model for $\omega$ causes the output random variable $\upsilon$ to become Gaussian as well. Hence, predictions made with the model are also Gaussian. More specifically, given a new input $x$, the predictive density is

$$y \sim \mathcal{N}\left(w^T x,\ \tau^{-1}\left(x^T \Lambda^{-1} x + m(x)^{-1}\right)\right) \tag{5.60}$$

and is thus centred on $w^T x$. Its spread is determined on the one hand by the estimated noise variance $(m(x)\tau)^{-1}$ and on the other by the uncertainty of the weight vector estimate, $x^T (\tau \Lambda)^{-1} x$. The $\Lambda$ in the above equations refers to the one estimated by the RLS algorithm.

Following Hastie et al. [102, Chap. 8.2.1], the two-sided 95% confidence interval of the standard normal distribution is given by considering its 97.5% point (as $100\% - 2 \times 2.5\% = 95\%$), which is 1.96. Therefore, the 95% confidence interval of the classifier predictions is centred on the mean of (5.60) and extends 1.96 times the square root of the prediction's variance to either side of the mean.

• In deriving the Kalman filter update equations, matching was embedded as a modifier to the measurement noise variance, that is, $\epsilon_n \sim \mathcal{N}\left(0, (m(x_n)\tau)^{-1}\right)$, which gives us a new interpretation for matching: a matching value between 0 and 1 for a certain input can be interpreted as reducing the amount of information that the model acquires about the associated observation by increasing the noise of the observation and hence reducing its certainty.

• A similar interpretation can be given for RLS with recency-weighting: the decay factor $\lambda$ acts as a multiplier to the noise precision of past observations and hence reduces their certainty. This causes the model to put more emphasis on more recent observations due to their lower noise variance. Formally, modelling the noise for the $n$th observation after $N$ observations by

$$\epsilon_n \sim \mathcal{N}\left(0,\ \left(m(x_n)\,\tau\,\lambda^{\sum_{j=n+1}^{N} m(x_j)}\right)^{-1}\right) \tag{5.61}$$

causes the Kalman filter to perform the same recency weighting as the recency-weighted RLS variant.
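The predictive density (5.60), its 95% interval, and the noise precisions implied by (5.61) can be sketched as follows. This is a minimal illustration under stated assumptions: `w`, `Lam` and `tau_hat` stand for the RLS estimates of the weight vector, $\Lambda$ and the noise precision, and all function and parameter names are hypothetical rather than taken from the book.

```python
import numpy as np

def predictive_interval(x, w, Lam, tau_hat, m_x=1.0, z=1.96):
    """Mean, variance and two-sided 95% interval of the predictive density (5.60)."""
    mean = w @ x
    weight_var = x @ np.linalg.solve(Lam, x) / tau_hat  # x^T (tau Lambda)^{-1} x
    noise_var = 1.0 / (m_x * tau_hat)                   # (m(x) tau)^{-1}
    var = weight_var + noise_var
    half_width = z * np.sqrt(var)                       # 1.96 standard deviations
    return mean, var, (mean - half_width, mean + half_width)

def recency_noise_precisions(m, tau, lam):
    """Noise precision of each observation n after N observations, as in (5.61)."""
    m = np.asarray(m, dtype=float)
    # later_match[n] is the sum of m(x_j) over j = n+1, ..., N
    later_match = np.cumsum(m[::-1])[::-1] - m
    return m * tau * lam ** later_match
```

A matching value below 1 enlarges the noise term and therefore widens the interval, while a decay factor $\lambda < 1$ lowers the precision of older observations relative to newer ones.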
