about $T_\mu$. For more information on temporal-difference learning, the interested reader is referred to the work of Bertsekas and Tsitsiklis [17] and Drugowitsch and Barry [79].
The temporal-difference learning operator $T_\mu^{(\lambda)}$ is parametrised by $0 \le \lambda \le 1$ and, when applied to $V$, results in [215]
$$
\left(T_\mu^{(\lambda)} V\right)_i = (1 - \lambda) \sum_{m=0}^{\infty} \lambda^m \, \mathrm{E}\!\left( \sum_{t=0}^{m} \gamma^t r_{x_t x_{t+1}} + \gamma^{m+1} V_{x_{m+1}} \,\Big|\, x_0 = x_i \right) \qquad (9.14)
$$
for $\lambda < 1$. The definition for $\lambda = 1$ is given in [79]. The expectation in the above expression is equivalent to the $n$-step return $V_n$ (9.1), which shows that the temporal-difference update is based on mixing returns of various lengths, where the mixing coefficients are controlled by λ. To implement the above update incrementally, Sutton uses eligibility traces that propagate current temporal differences to previously visited states [207].
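To make the incremental implementation concrete, the following is a minimal sketch of tabular TD(λ) with accumulating eligibility traces for evaluating a fixed policy μ. The episode generator `sample_episode`, the parameter values, and the choice of accumulating (rather than replacing) traces are illustrative assumptions, not prescriptions from the text.

```python
import numpy as np

def td_lambda(sample_episode, n_states, gamma=0.9, lam=0.8,
              alpha=0.05, n_episodes=1000):
    """Tabular TD(lambda) policy evaluation with accumulating traces.

    sample_episode() is assumed to yield (x_t, r_t, x_{t+1}) transitions
    generated by following the fixed policy mu.
    """
    V = np.zeros(n_states)
    for _ in range(n_episodes):
        e = np.zeros(n_states)                    # eligibility traces
        for x, r, x_next in sample_episode():
            delta = r + gamma * V[x_next] - V[x]  # temporal difference
            e[x] += 1.0                           # mark the visited state
            V += alpha * delta * e                # credit all traced states
            e *= gamma * lam                      # decay the traces
    return V
```

Each temporal difference is credited to previously visited states in proportion to their decayed traces, which realises the mixing of multi-step returns expressed by (9.14).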
Its most interesting property for our purpose is that $T_\mu^{(\lambda)}$ forms a contraction mapping with respect to the weighted norm $\|\cdot\|_D$, defined as in Sect. 5.2, where the diagonal weight matrix $D$ is given by the steady-state distribution of the Markov chain $P_\mu$ that corresponds to policy $\mu$ [215, 17].
More formally, we have for any $V$, $V'$,
$$
\left\| T_\mu^{(\lambda)} V - T_\mu^{(\lambda)} V' \right\|_D \;\le\; \frac{\gamma (1 - \lambda)}{1 - \gamma \lambda} \left\| V - V' \right\|_D \;\le\; \gamma \left\| V - V' \right\|_D. \qquad (9.15)
$$
Note that $T_\mu \equiv T_\mu^{(0)}$, and therefore $T_\mu$ also forms a contraction mapping with respect to $\|\cdot\|_D$. This property can be used to analyse both convergence and stability of the method, as shown in Sect. 9.4.
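The contraction property can also be checked numerically. The sketch below is not from the text: it evaluates $T_\mu^{(\lambda)}$ through a closed matrix form that follows from (9.14) for a finite chain, computes the steady-state weighting $D$, and verifies the bound (9.15) for randomly drawn value vectors; the chain, rewards, and parameter choices are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, gamma, lam = 5, 0.9, 0.7

# Arbitrary ergodic Markov chain P_mu and expected rewards r_bar (illustrative).
P = rng.random((n, n)) + 0.1
P /= P.sum(axis=1, keepdims=True)
r_bar = rng.random(n)

# Steady-state distribution d of P_mu (left eigenvector for eigenvalue 1).
evals, evecs = np.linalg.eig(P.T)
d = np.real(evecs[:, np.argmax(np.real(evals))])
d /= d.sum()

def norm_D(v):
    """Weighted norm ||v||_D with D = diag(d)."""
    return np.sqrt(np.sum(d * v ** 2))

def T_lam(V):
    """TD(lambda) operator in matrix form, obtained by summing the
    geometric series in (9.14) for a finite chain."""
    return V + np.linalg.solve(np.eye(n) - gamma * lam * P,
                               r_bar + gamma * P @ V - V)

# Empirical check of the contraction bound (9.15).
kappa = gamma * (1 - lam) / (1 - gamma * lam)
for _ in range(1000):
    V, W = rng.normal(size=n), rng.normal(size=n)
    assert norm_D(T_lam(V) - T_lam(W)) <= kappa * norm_D(V - W) + 1e-10
```

With these settings the modulus γ(1 − λ)/(1 − γλ) is about 0.73, smaller than the plain γ = 0.9, reflecting the second inequality in (9.15).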
9.2.5 SARSA(λ)
Coming to the first reinforcement learning algorithm, SARSA stands for State-Action-Reward-State-Action, as SARSA(0) requires only information on the current and next state/action pair and the reward that was received for the transition. Its name was coined by Sutton [208] for an algorithm that was developed in its approximate form by Rummery and Niranjan [192], and which is very similar to Wilson's ZCS [236], as discussed by Sutton and Barto [209, Chap. 6.10].
It conceptually performs policy iteration and uses TD(λ) to update its action-value function $Q$. More specifically, it performs optimistic policy iteration, where, in contrast to standard policy iteration, the policy improvement step is based on an incompletely evaluated policy.
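As a concrete illustration of this optimistic policy iteration, here is a minimal tabular SARSA(λ) sketch with accumulating traces and an ε-greedy improvement step. The environment interface (`env.reset()`, `env.step()`) and all parameter values are assumptions made for the example rather than part of the text.

```python
import numpy as np

def sarsa_lambda(env, n_states, n_actions, gamma=0.9, lam=0.8,
                 alpha=0.1, epsilon=0.1, n_episodes=500, seed=0):
    """Tabular SARSA(lambda): TD(lambda) applied to the action values Q,
    with epsilon-greedy improvement on the current estimates."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))

    def policy(x):
        # Optimistic improvement: act (mostly) greedily on the
        # not-yet-converged estimates Q.
        if rng.random() < epsilon:
            return int(rng.integers(n_actions))
        return int(np.argmax(Q[x]))

    for _ in range(n_episodes):
        e = np.zeros_like(Q)                   # traces over (state, action)
        x = env.reset()                        # assumed interface
        a = policy(x)
        done = False
        while not done:
            x_next, r, done = env.step(a)      # assumed interface
            a_next = policy(x_next)
            target = r if done else r + gamma * Q[x_next, a_next]
            delta = target - Q[x, a]           # temporal difference
            e[x, a] += 1.0
            Q += alpha * delta * e
            e *= gamma * lam
            x, a = x_next, a_next
    return Q
```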
Consider being in state $x_t$ at time $t$ and performing action $a_t$, leading to the next state $x_{t+1}$ and reward $r_t$. The current action-value function estimates are given by $Q_t$. These estimates are to be updated for $(x_t, a_t)$ to reflect the newly observed reward. The basis of policy iteration, as described by $T_\mu$ (9.10), is to update the estimate of the value function of one particular state by relating it to all potential next states and the expected reward for these