For a sequence of states {x_1, x_2, ...} and actions {a_1, a_2, ...}, the Q-values are updated by

Q_{t+1}(x_t, a_t) = Q_t(x_t, a_t) + α_t ( r_{x_t x_{t+1}}(a_t) + γ max_{a ∈ A} Q_t(x_{t+1}, a) − Q_t(x_t, a_t) ),   (9.17)

where α_t denotes the step size at time t. As before, the explore/exploit dilemma applies when selecting actions based on the current Q.
A variant of Q-Learning, called Q(λ), uses eligibility traces like TD(λ) for as long as on-policy actions are performed [229]. As soon as an off-policy action is chosen, all traces are reset to zero, as the off-policy action breaks the temporal sequence of predictions. Hence, the performance increase due to traces depends significantly on the policy that is used, but is usually marginal. A study by Drugowitsch and Barry [77] showed that, when used in XCS, Q(λ) performs even worse than standard Q-Learning.
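The trace-resetting behaviour described above can be sketched as a single Watkins-style Q(λ) step. This is a minimal illustration, not the exact algorithm of [229]; the accumulating-trace choice and the array layout are assumptions for the sketch.

```python
import numpy as np

def q_lambda_step(Q, E, x, a, r, x_next, a_next, alpha, gamma, lam):
    # One Watkins-style Q(lambda) step: traces E accumulate while the
    # chosen actions stay greedy w.r.t. Q, and are reset to zero as soon
    # as an off-policy (non-greedy) action is taken.
    greedy_next = int(np.argmax(Q[x_next]))
    td_error = r + gamma * np.max(Q[x_next]) - Q[x, a]
    E[x, a] += 1.0                  # accumulating trace for the visited pair
    Q += alpha * td_error * E       # update all pairs with non-zero traces
    if a_next == greedy_next:
        E *= gamma * lam            # on-policy action: decay traces
    else:
        E[:] = 0.0                  # off-policy action: reset all traces
    return Q, E
```

Because any exploratory action wipes the traces, a strongly exploratory policy leaves most updates identical to plain Q-Learning, which matches the "usually marginal" benefit noted above.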
9.2.7 Approximate Reinforcement Learning
Analogous to approximate DP, RL can handle large state spaces by approximating the action-value function. Given some estimator Q that approximates the action-value function, this estimator is, as before, to be updated after receiving reward r_t for a transition from x_t to x_{t+1} when performing action a_t. The estimator's target value is Q(x_t, a_t) = r_{x_t x_{t+1}}(a_t) + γ V(x_{t+1}), where V(x_{t+1}) is the currently best estimate of the value of state x_{t+1}. Thus, at time t, the aim is to find the estimator Q that minimises some distance between itself and all previous target values, which, when assuming the squared distance, results in minimising
Σ_{m=1}^{t} ( Q(x_m, a_m) − ( r_{x_m x_{m+1}}(a_m) + γ V_t(x_{m+1}) ) )^2 .   (9.18)
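As an illustration, objective (9.18) can be evaluated for any estimator over a recorded trajectory. The function below is a sketch under the assumption that the estimator Q and the value estimate V_t are supplied as plain callables, and that the trajectory is stored as (x_m, a_m, r_m, x_{m+1}) tuples.

```python
def squared_error_918(Q, V, transitions, gamma):
    # Evaluates (9.18): sum over all observed transitions of the squared
    # distance between the estimate Q(x_m, a_m) and its bootstrapped
    # target r_{x_m x_{m+1}}(a_m) + gamma * V_t(x_{m+1}).
    return sum((Q(x, a) - (r + gamma * V(x_next))) ** 2
               for (x, a, r, x_next) in transitions)
```

Note that the targets themselves depend on V_t and thus change as learning proceeds, so this is a moving-target least-squares problem rather than ordinary regression.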
As previously shown, Q-Learning uses V_t(x) = max_a Q_t(x, a), and SARSA(0) relies on V_t(x) = Q_t(x, a), where, in case of the latter, a is the action performed in state x.
Tabular Q-Learning and SARSA(0) are easily extracted from the above problem formulation by assuming that each state/action pair is estimated separately by Q(x, a) = θ_{x,a}. Under this assumption, applying the LMS algorithm to minimising (9.18) directly results in (9.16) or (9.17), depending on how V_t is estimated.
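This equivalence is easy to see in code: with the tabular parameterisation Q(x, a) = θ_{x,a}, a single LMS step towards the bootstrapped target is exactly the tabular update, and the choice of target distinguishes Q-Learning from SARSA(0). The function names below are illustrative, not from the source.

```python
import numpy as np

def lms_step(theta, x, a, target, alpha):
    # LMS gradient step on one parameter of Q(x, a) = theta[x, a];
    # since each state/action pair has its own parameter, only the
    # visited entry moves towards the target.
    theta[x, a] += alpha * (target - theta[x, a])
    return theta

def q_learning_target(theta, r, x_next, gamma):
    # V_t(x) = max_a Q_t(x, a) yields the Q-Learning target of (9.17).
    return r + gamma * np.max(theta[x_next])

def sarsa_target(theta, r, x_next, a_next, gamma):
    # V_t(x) = Q_t(x, a) with the performed action yields the SARSA(0) target.
    return r + gamma * theta[x_next, a_next]
```

Substituting `q_learning_target` into `lms_step` reproduces update (9.17) term by term, with `alpha` playing the role of the step size α_t.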
The next section shows from first principles how the same approach can be applied to performing RL with LCS, that is, when Q is an estimator that is given by an LCS.
9.3 Reinforcement Learning with LCS
Performing RL with LCS means using LCS to approximate the action-value function estimate. RL methods update this estimate incrementally, and we can