At every step t, the TD(0) algorithm described above modifies the action-value function only in the current state s_t. Since, however, the action values of subsequent states enter the update through discounting, the action values of all preceding states can conversely also be updated at each step, which significantly increases the speed of learning. This is achieved using algorithms of the TD(λ) family. By this means, at every step, all action values are updated as follows:
q(s, a) := q(s, a) + \alpha_t \, d_t(s_t, a_t, s_{t+1}, a_{t+1}) \, z_t(s, a),    (3.13)
where the weighting function z_t(s, a) describes the relevance of the temporal difference at time point t for the state-action pair (s, a). The weighting function z_t(s, a) is called the eligibility trace and assigns recently visited states a higher weighting than states that have not been visited for some time. The usual definition of eligibility traces is
z_t(s, a) = \begin{cases} \gamma \lambda \, z_{t-1}(s, a) + 1, & \text{if } (s, a) = (s_t, a_t) \\ \gamma \lambda \, z_{t-1}(s, a), & \text{if } (s, a) \neq (s_t, a_t) \end{cases}    (3.14)
where 0 ≤ λ ≤ 1 is the eligibility trace parameter. For this, z_t(s, a) is initialized as zero for all state-action pairs. Obviously, z_t(s, a) is relatively high if (s, a) is visited often and (s_t, a_t) can be reached from (s, a) in only a few steps. The weighting z_t(s, a) decreases exponentially with the number of steps since the last visit of the pair (s, a).
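As a concrete illustration of (3.14), the trace update in a tabular setting could be sketched roughly as follows. This is only a minimal sketch, assuming a dictionary-based representation; the names update_traces, z, gamma_, and lambda_ are illustrative choices and not taken from the text.

# Minimal sketch of the eligibility-trace update (3.14), assuming a tabular
# representation in which z maps state-action pairs (s, a) to trace values.

def update_traces(z, s_t, a_t, gamma_, lambda_):
    """Decay every trace by gamma*lambda, then add 1 to the trace of (s_t, a_t)."""
    for key in z:
        z[key] *= gamma_ * lambda_
    z[(s_t, a_t)] = z.get((s_t, a_t), 0.0) + 1.0
    return z

With this representation, only pairs that have actually been visited carry an entry; all other pairs implicitly have a trace of zero, consistent with the initialization described above.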
For computational purposes, z_t(s, a) can be set to zero if it falls below an epsilon barrier specified beforehand, so that it has local support and can thus be implemented asymptotically optimally with respect to computation time and memory. In practice, this means that after the update (3.12) has been performed for the current time step t, that is, for (s_t, a_t), the update (3.13) is performed for all preceding time steps in the current episode t-1, t-2, ..., t-m, where t-m is the last preceding time step at which z_t(s, a) still lies above the epsilon barrier. At the same time, z_t(s, a) must be updated in accordance with (3.14) for all affected time steps.
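Combining (3.13) with this epsilon barrier, one possible realization of the value update could look as sketched below. This is a sketch under the assumption that the traces z have already been advanced to time t according to (3.14), for instance with the update_traces sketch above; the names sarsa_lambda_q_update, q, alpha, and eps_barrier are illustrative and not part of the original text.

# Sketch of the value update (3.13) over all eligible pairs, where traces
# below a pre-specified epsilon barrier are dropped to keep local support.

def sarsa_lambda_q_update(q, z, s_t, a_t, r_t1, s_t1, a_t1,
                          alpha, gamma_, eps_barrier=1e-8):
    """Update q(s, a) for every pair whose trace exceeds the epsilon barrier."""
    # Temporal difference for the quintuple (s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}).
    d_t = r_t1 + gamma_ * q.get((s_t1, a_t1), 0.0) - q.get((s_t, a_t), 0.0)
    for key in list(z):
        if z[key] < eps_barrier:
            del z[key]  # trace has decayed below the barrier; forget this pair
        else:
            q[key] = q.get(key, 0.0) + alpha * d_t * z[key]
    return q, z

In this sketch, looping only over the entries of z corresponds to updating just those time steps t-1, t-2, ..., t-m whose traces still lie above the barrier.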
In this way, the TD(λ) algorithm described above performs a continual adaptation of the action-value function q(s, a) in an intuitive fashion. It can be shown that the TD(λ) algorithm converges in a certain probabilistic sense, which is technically referred to as almost sure convergence, to the optimal policy π* and action-value function q*. Of course, it can also be interpreted as a general policy iteration: at every step t, a policy evaluation is performed in accordance with (3.13), and after that, based on the updated action-value function, a further policy improvement is performed, and so on.
Notice that there exist different versions of the TD(λ) algorithm; the one we have described here is called Sarsa(λ) because it works with the quintuple (s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}). Further important versions are concerned with Q-learning, for example, Watkins's Q(λ). We do not explain these here; for details, refer to [SB98].
We have shown that RL can also learn in an online fashion, thus solving Problem
3 in Chap. 2 .