squared errors, denoted by $s_{k,t}$ for classifier $k$ at time $t$, and can then compute the noise precision by (5.63). By (5.69), the sum of squared errors is updated by
$$
s_{k,t+1} = \lambda^{m_k(\mathbf{x}_t, a_t)} s_{k,t} + m_k(\mathbf{x}_t, a_t)\left(\mathbf{w}_{k,t}^\top \mathbf{x}_t - Q_{t+1}(\mathbf{x}_t, a_t)\right)\left(\mathbf{w}_{k,t+1}^\top \mathbf{x}_t - Q_{t+1}(\mathbf{x}_t, a_t)\right),
\qquad (9.32)
$$
starting with $s_{k,0} = 0$.
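As an illustration, a minimal Python sketch of the update (9.32) for a single classifier is given below; the names (update_squared_error, lam, match, w_old, w_new, q_target) are chosen for this sketch only, and the subsequent computation of the noise precision by (5.63) is omitted.

import numpy as np

def update_squared_error(s_k, lam, match, w_old, w_new, x, q_target):
    # s_k:      current sum of squared errors s_{k,t}
    # lam:      recency-weighting factor lambda
    # match:    matching degree m_k(x_t, a_t)
    # w_old:    weight vector w_{k,t} before the RLS weight update
    # w_new:    weight vector w_{k,t+1} after the RLS weight update
    # x:        input vector x_t
    # q_target: target value Q_{t+1}(x_t, a_t)
    e_old = w_old @ x - q_target  # prediction error before the weight update
    e_new = w_new @ x - q_target  # prediction error after the weight update
    return (lam ** match) * s_k + match * e_old * e_new

# Example: a single update, starting from s_{k,0} = 0
x = np.array([1.0, 0.4])
s_k = update_squared_error(0.0, lam=0.99, match=1.0,
                           w_old=np.array([0.2, 0.1]),
                           w_new=np.array([0.3, 0.15]),
                           x=x, q_target=0.5)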
Even though Lanzi et al. [142, 143] have already used XCS with RLS, they have never applied it to sequential decision tasks. We have already investigated the incremental noise precision update as given in this chapter for simple regression tasks [155], but its empirical evaluation on sequential decision tasks is still pending.
9.3.6 XCS with Gradient Descent
Some recent work by Butz et al. [47, 45] has caused confusion about how XCS performs Q-Learning, and how this can be enhanced by the use of gradient descent [223, 224, 142, 140, 139]. This section clarifies that XCS(F) in its current form already performs gradient descent and does not need to be modified, as it updates the classifiers' weight vectors by (9.28), which is a gradient-based update algorithm. As XCSF is (besides the MAM update) equivalent to XCS if $D_\mathcal{X} = 1$, the following discussion only considers XCSF.
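To see why (9.28) is gradient-based, recall that it is an NLMS-style (modified delta rule) update; written in that generic form (a sketch, as the exact constants of (9.28) are not repeated here),
$$
\mathbf{w}_{k,t+1} = \mathbf{w}_{k,t} + \frac{\alpha\, m_k(\mathbf{x}_t, a_t)}{\|\mathbf{x}_t\|^2}\left(Q_{t+1}(\mathbf{x}_t, a_t) - \mathbf{w}_{k,t}^\top \mathbf{x}_t\right)\mathbf{x}_t,
$$
it is a step of size $\alpha/\|\mathbf{x}_t\|^2$ along the negative gradient of the matched squared prediction error $\frac{1}{2} m_k(\mathbf{x}_t, a_t)\bigl(Q_{t+1}(\mathbf{x}_t, a_t) - \mathbf{w}^\top \mathbf{x}_t\bigr)^2$ with respect to $\mathbf{w}$, evaluated at $\mathbf{w}_{k,t}$.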
To show the equivalence between XCSF and (9.28), let us have a closer look at the weight vector update of XCSF: upon arriving at state $\mathbf{x}_t$, XCSF forms a match set that contains all classifiers for which $m_k(\mathbf{x}_t, a) > 0$, independent of the action $a$. The match set is then partitioned into one subset per possible action, resulting in $|\mathcal{A}|$ subsets. The subset associated with action $a$ contains all classifiers for which $m_k(\mathbf{x}_t, a) > 0$, and for each of these subsets the action-value estimate $Q_t(\mathbf{x}_t, a) = \sum_k g_k(\mathbf{x}_t, a) Q_{k,t}(\mathbf{x}_t, a)$ is calculated, resulting in the prediction vector $(Q_t(\mathbf{x}_t, a_1), \dots, Q_t(\mathbf{x}_t, a_{|\mathcal{A}|}))$ that predicts the expected return for the current state $\mathbf{x}_t$ and each possible action that can be performed. Based on this prediction vector, an action $a_t$ is chosen and performed, leading to the next state $\mathbf{x}_{t+1}$ and reward $r_{\mathbf{x}_t \mathbf{x}_{t+1}}(a_t)$. The subset of the match set that promoted the chosen action becomes the action set, which contains all classifiers such that $m_k(\mathbf{x}_t, a_t) > 0$. At the same time, a new prediction vector $(Q_t(\mathbf{x}_{t+1}, a_1), \dots, Q_t(\mathbf{x}_{t+1}, a_{|\mathcal{A}|}))$ for state $\mathbf{x}_{t+1}$ is formed, and its largest element is chosen, giving $\max_{a \in \mathcal{A}} Q_t(\mathbf{x}_{t+1}, a)$. Then, all classifiers in the action set are updated by the modified delta rule (which is equivalent to the NLMS algorithm) with the target value $r_{\mathbf{x}_t \mathbf{x}_{t+1}}(a_t) + \gamma \max_{a \in \mathcal{A}} Q_t(\mathbf{x}_{t+1}, a)$. The update in (9.28) uses exactly this target value, as given by (9.26), and updates the weight vector of each classifier for which $m_k(\mathbf{x}_t, a_t) > 0$, which are the classifiers in the action set. This shows that (9.28) describes the weight vector update as it is performed in XCSF, and therefore XCS(F) performs gradient descent without any additional modification.
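To tie these steps together, the following minimal Python sketch performs one such XCSF step under stated assumptions: the NLMS form of the classifier update is used, the mixing weights $g_k$ are reduced to matching-normalised weights without fitness, the condition part of matching is simplified so that every classifier matches every state for its own action, and all names (Classifier, prediction_vector, xcsf_q_learning_step, alpha, gamma) are illustrative rather than taken from any particular XCSF implementation.

import numpy as np

class Classifier:
    def __init__(self, w, action):
        self.w = w            # weight vector w_k of the linear model
        self.action = action  # action advocated by the classifier

    def match(self, x, a):
        # matching function m_k(x, a); simplified to match every state
        return 1.0 if a == self.action else 0.0

    def predict(self, x):
        return self.w @ x     # classifier prediction w_k^T x

def prediction_vector(classifiers, x, actions):
    # Q_t(x, a) = sum_k g_k(x, a) Q_{k,t}(x, a), with g_k here a
    # matching-normalised mixing weight (fitness weighting omitted)
    q = {}
    for a in actions:
        matched = [(c.match(x, a), c.predict(x)) for c in classifiers]
        total = sum(m for m, _ in matched)
        q[a] = sum(m * p for m, p in matched) / total if total > 0 else 0.0
    return q

def xcsf_q_learning_step(classifiers, x_t, a_t, r, x_next, actions,
                         alpha=0.2, gamma=0.9):
    # Q-learning target value: r + gamma * max_a Q_t(x_{t+1}, a)
    q_next = prediction_vector(classifiers, x_next, actions)
    target = r + gamma * max(q_next.values())
    # Update all classifiers in the action set, i.e. those with
    # m_k(x_t, a_t) > 0, by the modified delta rule (NLMS)
    for c in classifiers:
        m = c.match(x_t, a_t)
        if m > 0.0:
            error = target - c.predict(x_t)
            c.w = c.w + alpha * m * error * x_t / (x_t @ x_t)

# Example: two classifiers, one per action, over a 2-dimensional state;
# action selection from the prediction vector is assumed to happen elsewhere
actions = [0, 1]
cls = [Classifier(np.array([0.1, 0.0]), 0), Classifier(np.array([0.0, 0.2]), 1)]
xcsf_q_learning_step(cls, x_t=np.array([1.0, 0.5]), a_t=0, r=1.0,
                     x_next=np.array([0.8, 0.6]), actions=actions)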
 