At every step t, the TD(0) algorithm described above modifies the action-value function only in the current state s_t. Since, however, the action values of subsequent states enter the update through discounting, the action values of all preceding states can conversely also be updated at each step, which significantly increases the speed of learning. This is achieved using algorithms of the TD(λ) family. By this means, at every step, all action values are updated as follows:
q(s, a) := q(s, a) + \alpha_t \, d_t(s_t, a_t, s_{t+1}, a_{t+1}) \, z_t(s, a),    (3.13)
where the weighting function z_t(s, a) describes the relevance of the temporal difference at time point t for the state-action pair (s, a). The weighting function z_t(s, a) is called the eligibility trace and assigns recently visited states a higher weighting than states that have not been visited for some time. The usual definition of eligibility traces is
z_t(s, a) = \begin{cases} \gamma \lambda \, z_{t-1}(s, a) + 1, & \text{if } (s, a) = (s_t, a_t) \\ \gamma \lambda \, z_{t-1}(s, a), & \text{if } (s, a) \neq (s_t, a_t) \end{cases}    (3.14)
where 0 ≤ λ ≤ 1 is the eligibility trace parameter. For this, z_t(s, a) is initialized as zero for all state-action pairs. Obviously, z_t(s, a) is relatively high if (s, a) is visited often and (s_t, a_t) can be reached from (s, a) in only a few steps. The weighting z_t(s, a) decreases exponentially with the number of steps since the last visit of the pair (s, a).
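As a concrete illustration of (3.14), the trace update in a tabular setting could be sketched roughly as follows. This is only a minimal sketch, assuming a dictionary-based representation; the names update_traces, z, gamma_, and lambda_ are illustrative choices and not taken from the text.

# Minimal sketch of the eligibility-trace update (3.14), assuming a tabular
# representation in which z maps state-action pairs (s, a) to trace values.

def update_traces(z, s_t, a_t, gamma_, lambda_):
    """Decay every trace by gamma*lambda, then add 1 to the trace of (s_t, a_t)."""
    for key in z:
        z[key] *= gamma_ * lambda_
    z[(s_t, a_t)] = z.get((s_t, a_t), 0.0) + 1.0
    return z

With this representation, only pairs that have actually been visited carry an entry; all other pairs implicitly have a trace of zero, consistent with the initialization described above.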
For computational purposes, z_t(s, a) can be set to zero if it falls below an epsilon barrier specified beforehand, so that it has local support and can thus be implemented asymptotically optimally with respect to computation time and memory. In practice, this means that after the update (3.12) has been performed for the current time step t, that is, for (s_t, a_t), the update (3.13) is performed for all preceding time steps in the current episode t-1, t-2, ..., t-m, where t-m is the last preceding time step at which z_t(s, a) still lies above the epsilon barrier. At the same time, z_t(s, a) must be updated in accordance with (3.14) for all affected time steps.
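Combining (3.13) with this epsilon barrier, one possible realization of the value update could look as sketched below. This is a sketch under the assumption that the traces z have already been advanced to time t according to (3.14), for instance with the update_traces sketch above; the names sarsa_lambda_q_update, q, alpha, and eps_barrier are illustrative and not part of the original text.

# Sketch of the value update (3.13) over all eligible pairs, where traces
# below a pre-specified epsilon barrier are dropped to keep local support.

def sarsa_lambda_q_update(q, z, s_t, a_t, r_t1, s_t1, a_t1,
                          alpha, gamma_, eps_barrier=1e-8):
    """Update q(s, a) for every pair whose trace exceeds the epsilon barrier."""
    # Temporal difference for the quintuple (s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}).
    d_t = r_t1 + gamma_ * q.get((s_t1, a_t1), 0.0) - q.get((s_t, a_t), 0.0)
    for key in list(z):
        if z[key] < eps_barrier:
            del z[key]  # trace has decayed below the barrier; forget this pair
        else:
            q[key] = q.get(key, 0.0) + alpha * d_t * z[key]
    return q, z

In this sketch, looping only over the entries of z corresponds to updating just those time steps t-1, t-2, ..., t-m whose traces still lie above the barrier.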
In this way, the TD(λ) algorithm described above performs a continual adaptation of the action-value function q(s, a) in an intuitive fashion. It can be shown that the TD(λ) algorithm converges in a certain probabilistic sense, which is technically referred to as almost sure convergence, to the optimal policy π* and action-value function q*. Of course, it can also be interpreted as a general policy iteration: at every step t, a policy evaluation is performed in accordance with (3.13), and after that, based on the updated action-value function, a further policy improvement is performed, and so on.
Notice that there exist different versions of the TD(λ) algorithm; the one we have described here is called Sarsa(λ) because it works with the quintuple (s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}). Further important versions are concerned with Q-learning, for example, Watkins's Q(λ). We do not explain these here; for details, refer to [SB98].
We have shown that RL can also learn in an online fashion, thus solving Problem
3 in Chap. 2 .