Towards Reinforcement Learning with LCS - Design and Analysis of Learning Classifier Systems

Using the normalised least mean squared (NLMS) algorithm as described in Sect. 5.3.4, the weight vector estimate update for classifier k is given by

$$w_{k,t+1} = w_{k,t} + \alpha\, m_k(x_t, a_t)\, \frac{x_t}{\|x_t\|^2} \left( Q_{t+1}(x_t, a_t) - w_{k,t}^\top x_t \right), \tag{9.28}$$

where $\alpha$ denotes the step size, and $Q_{t+1}(x_t, a_t)$ is given by (9.26). As discussed in more detail in Sect. 9.3.6, this is the weight vector update of XCSF.
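As an illustration, the NLMS step (9.28) can be sketched in a few lines of plain Python. This is a minimal sketch rather than a reference implementation: the function name `nlms_update` and its argument names are hypothetical, vectors are represented as lists of floats, and the target $Q_{t+1}(x_t, a_t)$ is assumed to have been computed beforehand according to (9.26).

```python
def nlms_update(w, x, q_target, m, alpha):
    """One NLMS step for a single classifier, following (9.28).

    w        -- current weight vector w_{k,t} (list of floats)
    x        -- input vector x_t (list of floats, must not be all zero)
    q_target -- target Q_{t+1}(x_t, a_t), assumed to be computed via (9.26)
    m        -- matching m_k(x_t, a_t)
    alpha    -- step size
    """
    prediction = sum(wi * xi for wi, xi in zip(w, x))  # w_{k,t}^T x_t
    sq_norm = sum(xi * xi for xi in x)                 # ||x_t||^2
    scale = alpha * m * (q_target - prediction) / sq_norm
    return [wi + scale * xi for wi, xi in zip(w, x)]


# Hypothetical usage: a single step towards the target 2.0 with x = (1, 1).
w = nlms_update([0.0, 0.0], [1.0, 1.0], q_target=2.0, m=1.0, alpha=0.5)
```

With a step size of 1 and full matching, a single step makes the classifier's prediction for $x_t$ equal to the target, and with a matching of 0 the weights are left unchanged.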

The noise variance of the model can be estimated by the LMS algorithm, as described in Sect. 5.3.7. This results in the update equation

$$\tau^{-1}_{k,t+1} = \tau^{-1}_{k,t} + \alpha\, m_k(x_t, a_t) \left( \left( w_{k,t+1}^\top x_t - Q_{t+1}(x_t, a_t) \right)^2 - \tau^{-1}_{k,t} \right), \tag{9.29}$$

where $\alpha$ is again the scalar step size, and $Q_{t+1}(x_t, a_t)$ is given by (9.26).
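The noise variance update (9.29) can be sketched in the same way. Again this is only a sketch under stated assumptions: the function name `lms_noise_variance_update` and its argument names are hypothetical, and the weight vector passed in is assumed to be the already updated $w_{k,t+1}$, as in (9.29).

```python
def lms_noise_variance_update(tau_inv, w_next, x, q_target, m, alpha):
    """One LMS step for the noise variance estimate, following (9.29).

    tau_inv  -- current noise variance estimate tau^{-1}_{k,t}
    w_next   -- updated weight vector w_{k,t+1} (list of floats)
    x        -- input vector x_t (list of floats)
    q_target -- target Q_{t+1}(x_t, a_t), assumed to be computed via (9.26)
    m        -- matching m_k(x_t, a_t)
    alpha    -- step size
    """
    prediction = sum(wi * xi for wi, xi in zip(w_next, x))  # w_{k,t+1}^T x_t
    sq_error = (prediction - q_target) ** 2
    # Move the estimate towards the latest squared error.
    return tau_inv + alpha * m * (sq_error - tau_inv)


# Hypothetical usage: prediction 1.0 against target 2.0, starting from 0.0.
tau_inv = lms_noise_variance_update(0.0, [0.5, 0.5], [1.0, 1.0],
                                    q_target=2.0, m=1.0, alpha=0.5)
```

The estimate moves a fraction $\alpha\, m_k(x_t, a_t)$ of the way from its previous value towards the current squared error, so unmatched states leave it untouched.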

9.3.5 Q-Learning by Recursive Least Squares

As shown in Chap. 5, incremental methods based on gradient descent might suffer from slow convergence rates. Thus, despite their higher computational and space complexity, methods based on directly tracking the least squares solution are to be preferred. Consequently, rather than using NLMS, this section shows how to apply recursive least squares (RLS) and direct noise precision tracking to Q-Learning with LCS.

The non-stationarity of the action-value function estimate needs to be taken into account by using a recency-weighted RLS variant that puts more weight on recent observations. This was not an issue for the NLMS algorithm, as it performs recency-weighting implicitly.

Minimising the recency-weighted variant of the sum of squared errors (9.27), the update equations are, according to Sect. 5.3.5, given by

$$w_{k,t+1} = \lambda^{m_k(x_t, a_t)}\, w_{k,t} + m_k(x_t, a_t)\, \Lambda^{-1}_{k,t+1}\, x_t \left( Q_{t+1}(x_t, a_t) - w_{k,t}^\top x_t \right), \tag{9.30}$$

$$\Lambda^{-1}_{k,t+1} = \lambda^{-m_k(x_t, a_t)}\, \Lambda^{-1}_{k,t} - m_k(x_t, a_t)\, \lambda^{-m_k(x_t, a_t)}\, \frac{\Lambda^{-1}_{k,t}\, x_t x_t^\top\, \Lambda^{-1}_{k,t}}{\lambda^{m_k(x_t, a_t)} + m_k(x_t, a_t)\, x_t^\top \Lambda^{-1}_{k,t}\, x_t}, \tag{9.31}$$

where $Q_{t+1}(x_t, a_t)$ is given by (9.26), and $w_{k,0}$ and $\Lambda^{-1}_{k,0}$ are initialised by $w_{k,0} = 0$ and $\Lambda^{-1}_{k,0} = \delta I$, where $\delta$ is a large scalar. $\lambda$ determines the recency weighting, which is strongest for $\lambda = 0$, where only the last observation is considered, and deactivated when $\lambda = 1$.
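The RLS updates (9.30) and (9.31) can be sketched in plain Python as follows. This is a minimal sketch, not a reference implementation: the function name `rls_update` and its argument names are hypothetical, vectors and matrices are represented as (nested) lists of floats, the target is assumed to be computed via (9.26), and the code assumes $0 < \lambda \le 1$ so that $\lambda^{-m_k(x_t, a_t)}$ is well defined. It also relies on $\Lambda^{-1}_{k,t}$ being symmetric, which holds when it is initialised to $\delta I$.

```python
def rls_update(w, lam_inv, x, q_target, m, lam=1.0):
    """One recency-weighted RLS step for one classifier, as in (9.30)-(9.31).

    w        -- weight vector w_{k,t} (list of floats)
    lam_inv  -- symmetric matrix Lambda^{-1}_{k,t} (list of rows)
    x        -- input vector x_t (list of floats)
    q_target -- target Q_{t+1}(x_t, a_t), assumed to be computed via (9.26)
    m        -- matching m_k(x_t, a_t)
    lam      -- recency weighting lambda, assumed here to satisfy 0 < lam <= 1
    """
    n = len(x)
    decay = lam ** m                                   # lambda^{m_k(x_t, a_t)}
    # v = Lambda^{-1}_{k,t} x_t; by symmetry, x_t^T Lambda^{-1}_{k,t} = v^T.
    v = [sum(lam_inv[i][j] * x[j] for j in range(n)) for i in range(n)]
    denom = decay + m * sum(x[i] * v[i] for i in range(n))
    # (9.31): update of Lambda^{-1}.
    new_inv = [[(lam_inv[i][j] - m * v[i] * v[j] / denom) / decay
                for j in range(n)] for i in range(n)]
    # (9.30): weight update, using the updated Lambda^{-1}_{k,t+1}.
    error = q_target - sum(w[i] * x[i] for i in range(n))
    gain = [sum(new_inv[i][j] * x[j] for j in range(n)) for i in range(n)]
    new_w = [decay * w[i] + m * gain[i] * error for i in range(n)]
    return new_w, new_inv


# Hypothetical usage: synthetic targets 2 + 3 s for inputs x = (1, s),
# with w_{k,0} = 0, Lambda^{-1}_{k,0} = delta * I, delta = 1e6, lambda = 1.
w = [0.0, 0.0]
lam_inv = [[1e6, 0.0], [0.0, 1e6]]
for s in [0.0, 1.0, 2.0, 3.0]:
    w, lam_inv = rls_update(w, lam_inv, [1.0, s], q_target=2.0 + 3.0 * s, m=1.0)
```

With $\lambda = 1$ and noise-free synthetic targets, the weights in this example approach the generating coefficients $(2, 3)$ after a handful of observations, which illustrates the faster convergence that motivates RLS over gradient-based updates.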

Using the RLS algorithm to track the least squares approximation of the action-values for each classifier allows us to directly track the classifier's model noise variance, as described in Sect. 5.3.7. More precisely, we track the sum of
