on $V_{t+1}$ and its matching function, as described in Sect. 5.2. This completes one
iteration of LCS approximate value iteration.
The only modification introduced by the asynchronous variant is that rather
than updating the value function for all states at once, a single state is picked per
iteration, and the value function is updated for this state only, as already described in
Sect. 9.2.2. Let $\{x_{i_1}, x_{i_2}, \dots\}$ be the sequence of states that determines which state
is updated at which iteration. Thus, in the $t$-th iteration we compute $V_t(x_{i_t})$ by
(9.24), which results in the sequence $\{V_1(x_{i_1}), V_2(x_{i_2}), \dots\}$ that can be used to
incrementally train the classifiers by a method of choice from Sect. 5.3. For the
asynchronous variant we cannot use batch learning anymore, as not all elements
of $V_{t+1}$ are available at once.
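As a rough illustration, the following sketch performs one iteration of the asynchronous variant for a single state: it forms a standard value iteration target for that state from the global (mixed) value estimate and then lets every matching classifier take an incremental LMS-style step towards it. The classifier interface (`matches`, `features`, `w`, `step_size`), the matching-based mixing, and the dictionary layout of `transitions` and `rewards` are illustrative assumptions, not the book's notation.

```python
def mixed_value(x, classifiers):
    """Global value estimate for state x as a mixture of the matching
    classifiers' local linear estimates (mixing weights assumed to be
    proportional to matching and normalised over classifiers)."""
    num, den = 0.0, 0.0
    for cl in classifiers:
        g = cl.matches(x)                        # stands in for the mixing weight g_k(x)
        num += g * float(cl.w @ cl.features(x))  # local estimate of V(x)
        den += g
    return num / den if den > 0.0 else 0.0


def async_value_iteration_step(x, classifiers, transitions, rewards, gamma):
    """One iteration of the asynchronous variant for the single state x:
    compute a value iteration target for x and train every matching
    classifier incrementally (one LMS step) on that target."""
    # Standard value iteration target: maximise over actions the expected
    # reward plus the discounted mixed value estimate of the successor.
    target = max(
        sum(p * (rewards[x][a][x2] + gamma * mixed_value(x2, classifiers))
            for x2, p in transitions[x][a].items())
        for a in transitions[x]
    )
    # Incremental update of every classifier that matches x.
    for cl in classifiers:
        m = cl.matches(x)                        # matching function m_k(x)
        if m > 0.0:
            phi = cl.features(x)                 # feature vector for state x
            err = target - float(cl.w @ phi)     # prediction error
            cl.w += cl.step_size * m * err * phi
    return target
```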
9.3.4 Q-Learning by Least Mean Squares
So far it was assumed that the transition and reward functions of the given
problem are known. To perform model-free RL, the LCS model needs to be
adjusted to handle action-value function estimates rather than value function
estimates, by redefining the input space to be the space of all state/action pairs.
Thus, given state $x$ and action $a$, the matching probability of classifier $k$ is given
by $m_k(x, a)$, and the approximation of its action-value by $Q_k(x, a)$. Mixing is
also based on state and action, where the mixing coefficient for classifier $k$ is
given by $g_k(x, a)$. This results in the global approximation of the action-value
for state $x$ and action $a$ being given by

$$Q(x, a) = \sum_k g_k(x, a) \, Q_k(x, a). \qquad (9.25)$$
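A minimal sketch of (9.25) might look as follows: the global action-value estimate mixes the local estimates of all matching classifiers. The method names `mixing_weight` and `predict` are assumptions for illustration; the weights are normalised here so that they sum to one over classifiers.

```python
def global_action_value(x, a, classifiers):
    """Global action-value estimate Q(x, a) as in (9.25): a mixture of the
    local estimates Q_k(x, a) of the matching classifiers, weighted by
    mixing coefficients g_k(x, a) normalised over classifiers."""
    raw = [(cl.mixing_weight(x, a), cl.predict(x, a)) for cl in classifiers]
    total = sum(g for g, _ in raw)
    if total == 0.0:                    # no classifier matches (x, a)
        return 0.0
    return sum((g / total) * q for g, q in raw)
```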
As described in Sect. 9.2.7, approximate Q-Learning is based on minimising
(9.18). For independently trained classifiers, each classifier minimises this cost
independently, but with the value function estimate $V$ of the next state based
on the global estimate. Thus the target for $Q_k$ for the transition from $x_t$ to $x_{t+1}$
under action $a_t$ is

$$Q_{t+1}(x_t, a_t) = r_{x_t x_{t+1}}(a_t) + \gamma \max_{a \in \mathcal{A}} Q_t(x_{t+1}, a), \qquad (9.26)$$
given that classifier $k$ matches $(x_t, a_t)$. Using the linear classifier model $Q_k(x, a) = w_k^T x$,
each classifier $k$ aims at finding the $w_k$ that, by (9.18) after $t$ steps, minimises

$$\sum_{m=0}^{t} m_k(x_m, a_m) \left( w_k^T x_m - Q_{m+1}(x_m, a_m) \right)^2, \qquad (9.27)$$

where $m_k(x_m, a_m)$ was introduced to only consider matched state/action pairs.
This standard linear least squares problem can be handled incrementally with
any of the methods discussed in Chap. 5. It cannot be trained by batch learning
as the target function relies on previous updates and is thus non-stationary.
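Putting (9.26) and (9.27) together, a sketch of one model-free Q-Learning step by least mean squares could look as follows: the target is formed from the observed reward and the global estimate of the next state (reusing `global_action_value` from the sketch above), and every classifier that matches $(x, a)$ takes a single LMS gradient step towards it. The classifier interface and the `step_size` parameter are assumptions for illustration.

```python
def q_learning_lms_step(x, a, r, x_next, actions, classifiers, gamma):
    """One Q-Learning update by least mean squares: form the target (9.26)
    from the global action-value estimate of the next state, then let each
    matching classifier take an LMS gradient step on the cost (9.27)."""
    # Target: observed reward plus discounted greedy value of the successor,
    # evaluated with the global (mixed) action-value estimate.
    target = r + gamma * max(global_action_value(x_next, a2, classifiers)
                             for a2 in actions)
    # LMS step for every classifier that matches the state/action pair.
    for cl in classifiers:
        m = cl.matches(x, a)                     # m_k(x, a)
        if m > 0.0:
            phi = cl.features(x, a)              # input vector for (x, a)
            err = target - float(cl.w @ phi)     # prediction error
            cl.w += cl.step_size * m * err * phi
```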