Relation to Ridge Regression
It is easy to show that the solution $\mathbf{w}_N$ to minimising
\[
\|\mathbf{X}_N \mathbf{w} - \mathbf{y}_N\|^2_{\mathbf{M}_N} + \lambda \|\mathbf{w}\|^2
\tag{5.36}
\]
($\lambda$ is the positive scalar ridge complexity) with respect to $\mathbf{w}$ requires
\[
\left(\mathbf{X}_N^T \mathbf{M}_N \mathbf{X}_N + \lambda \mathbf{I}\right) \mathbf{w}_N = \mathbf{X}_N^T \mathbf{M}_N \mathbf{y}_N
\tag{5.37}
\]
to hold. The above is similar to (5.30) with the additional term $\lambda \mathbf{I}$. Hence, (5.31) still holds when initialised with $\mathbf{\Lambda}_0 = \lambda \mathbf{I}$, and consequently so does (5.34). Therefore, initialising $\mathbf{\Lambda}_0^{-1} = \delta \mathbf{I}$ to apply (5.35) and operate on $\mathbf{\Lambda}^{-1}$ rather than $\mathbf{\Lambda}$ is equivalent to minimising (5.36) with $\lambda = \delta^{-1}$.
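The equivalence between (5.36) and the normal equations (5.37) can be checked numerically. The following sketch (assuming NumPy, with randomly drawn inputs, outputs and matching values standing in for a concrete classifier) solves (5.37) directly and compares the result against an ordinary least-squares solve of the $\lambda$-augmented system, which minimises exactly the objective (5.36):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 20, 3
X = rng.normal(size=(N, D))        # input matrix X_N
y = rng.normal(size=N)             # output vector y_N
m = rng.uniform(0.1, 1.0, size=N)  # matching values m(x_n), diagonal of M_N
lam = 0.5                          # ridge complexity lambda

# Solve the normal equations (5.37): (X^T M X + lam I) w = X^T M y
M = np.diag(m)
w = np.linalg.solve(X.T @ M @ X + lam * np.eye(D), X.T @ M @ y)

# Cross-check: (5.36) is an ordinary least-squares problem on the
# augmented system [sqrt(M) X; sqrt(lam) I] w = [sqrt(M) y; 0]
A = np.vstack([np.sqrt(m)[:, None] * X, np.sqrt(lam) * np.eye(D)])
b = np.concatenate([np.sqrt(m) * y, np.zeros(D)])
w_ls, *_ = np.linalg.lstsq(A, b, rcond=None)

assert np.allclose(w, w_ls)
```

The augmented-system view also explains why (5.36) always has a unique minimiser: the stacked matrix has full column rank for any $\lambda > 0$.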
In addition to the matching-weighted squared error, (5.36) penalises the size of $\mathbf{w}$. This approach is known as ridge regression and was initially introduced to work around the problem of an initially singular $\mathbf{X}_N^T \mathbf{M}_N \mathbf{X}_N$ for small $N$, which prohibited the solution of (5.30). However, minimising (5.36) rather than (5.7) is also advantageous if the input vectors suffer from a high noise variance, which results in a large $\mathbf{w}$ and a bad model for the real data-generating process. Essentially, ridge regression assumes that the size of $\mathbf{w}$ is small and hence computes better model parameters for noisy data, given that the inputs are normalised [102, Chap. 3].
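The singularity problem is easy to reproduce: with fewer observations than input dimensions, $\mathbf{X}_N^T \mathbf{M}_N \mathbf{X}_N$ has rank at most $N$ and cannot be inverted, whereas adding $\lambda \mathbf{I}$ restores full rank for any $\lambda > 0$. A minimal sketch, assuming NumPy and that all inputs are matched:

```python
import numpy as np

rng = np.random.default_rng(1)
N, D = 2, 5                   # fewer observations than input dimensions
X = rng.normal(size=(N, D))
M = np.eye(N)                 # all inputs matched: m(x_n) = 1

A = X.T @ M @ X               # rank at most N < D, hence singular
assert np.linalg.matrix_rank(A) < D

lam = 1e-3                    # any positive ridge complexity suffices
assert np.linalg.matrix_rank(A + lam * np.eye(D)) == D

# With the ridge term, the system (5.37) has a unique solution
y = rng.normal(size=N)
w = np.linalg.solve(A + lam * np.eye(D), X.T @ M @ y)
```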
To summarise, using the RLS algorithm (5.34) and (5.35) with $\mathbf{\Lambda}_0^{-1} = \delta \mathbf{I}$, a classifier performs ridge regression with ridge complexity $\lambda = \delta^{-1}$. As, by (5.36), the contribution of $\lambda \|\mathbf{w}\|^2$ is independent of the number of observations $N$, its influence decreases exponentially with $N$.
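This correspondence can be verified numerically. The following sketch (assuming NumPy, with the RLS updates written in the standard matching-weighted form of (5.34) and (5.35), and random data standing in for real observations) runs the recursive algorithm initialised with $\mathbf{\Lambda}_0^{-1} = \delta \mathbf{I}$, $\mathbf{w}_0 = \mathbf{0}$, and compares the result with the batch ridge solution of (5.37) for $\lambda = \delta^{-1}$:

```python
import numpy as np

rng = np.random.default_rng(2)
N, D = 30, 4
X = rng.normal(size=(N, D))
y = rng.normal(size=N)
m = rng.uniform(0.1, 1.0, size=N)   # matching values m(x_n)

delta = 100.0
P = delta * np.eye(D)               # P tracks Lambda^{-1}, initialised delta * I
w = np.zeros(D)

# Recursive updates: covariance form (5.35), then weight update (5.34)
for n in range(N):
    x = X[n]
    P = P - m[n] * np.outer(P @ x, x @ P) / (1.0 + m[n] * x @ P @ x)
    w = w + m[n] * (P @ x) * (y[n] - w @ x)

# Batch ridge solution of (5.37) with lambda = 1 / delta
lam = 1.0 / delta
M = np.diag(m)
w_ridge = np.linalg.solve(X.T @ M @ X + lam * np.eye(D), X.T @ M @ y)

assert np.allclose(w, w_ridge)
```

The check works because the recursion maintains $\mathbf{\Lambda}_N = \delta^{-1} \mathbf{I} + \mathbf{X}_N^T \mathbf{M}_N \mathbf{X}_N$, which is exactly the left-hand side matrix of (5.37).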
A Recency-Weighted Variant
While the RLS algorithm provides a recursive solution such that (5.16) holds,
it weights all observations equally. Nonetheless, we might sometimes require
recency-weighting, such as when using LCS in combination with reinforcement
learning. Hence, let us derive a variant of RLS that applies a scalar decay factor $0 \le \lambda \le 1$ to past observations.
More formally, after $N$ observations, we aim at minimising
\[
\sum_{n=1}^{N} m(\mathbf{x}_n) \, \lambda^{\sum_{j=n+1}^{N} m(\mathbf{x}_j)} \left(\mathbf{w}^T \mathbf{x}_n - y_n\right)^2 = \|\mathbf{X}_N \mathbf{w} - \mathbf{y}_N\|^2_{\mathbf{M}_N}
\tag{5.38}
\]
with respect to $\mathbf{w}$, where the $\lambda$-augmented diagonal matching matrix $\mathbf{M}_N$ is given by
\[
\mathbf{M}_N =
\begin{pmatrix}
m(\mathbf{x}_1)\,\lambda^{\sum_{j=2}^{N} m(\mathbf{x}_j)} & & & 0 \\
& m(\mathbf{x}_2)\,\lambda^{\sum_{j=3}^{N} m(\mathbf{x}_j)} & & \\
& & \ddots & \\
0 & & & m(\mathbf{x}_N)
\end{pmatrix}
\tag{5.39}
\]
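The structure of (5.39) can be made concrete in a short sketch (assuming NumPy and, for illustration, binary matching values): it builds the $\lambda$-augmented diagonal of $\mathbf{M}_N$ and confirms that the term-by-term sum on the left of (5.38) equals the matching-weighted squared norm on the right:

```python
import numpy as np

rng = np.random.default_rng(3)
N, D = 10, 3
X = rng.normal(size=(N, D))
y = rng.normal(size=N)
m = rng.integers(0, 2, size=N).astype(float)   # binary matching m(x_n)
lam = 0.9                                       # decay factor, 0 <= lam <= 1
w = rng.normal(size=D)                          # arbitrary weight vector

# Diagonal of (5.39): n-th entry is m(x_n) * lam ** sum_{j=n+1}^{N} m(x_j),
# so the most recent matched observation carries full weight
diag = np.array([m[n] * lam ** m[n + 1:].sum() for n in range(N)])
M = np.diag(diag)

# Left-hand side of (5.38), summed observation by observation
lhs = sum(m[n] * lam ** m[n + 1:].sum() * (w @ X[n] - y[n]) ** 2
          for n in range(N))

# Right-hand side: ||X w - y||^2 weighted by the augmented M_N
r = X @ w - y
rhs = r @ M @ r

assert np.isclose(lhs, rhs)
```

Note that the decay exponent counts only *matched* past observations, so unmatched inputs do not age a classifier's data.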