Fig. 3.7 A sequence of an episode of 2 steps (states s_t, s_{t+1}, s_{t+2} and actions a_t, a_{t+1})
So while t indicates the step within the episode, that is, along the chain

$$ (s_1, a_1) \to (s_2, a_2) \to (s_3, a_3) \to \ldots $$

k is the index of the update for a fixed pair (s_t, a_t) throughout all episodes. In order not to overburden the notation, we leave out the index k and in its place use the assignment symbol ":=".
Before we come to the explanation, the first question immediately arises: since, to carry out an update of the action value q(s_t, a_t) at step t in real time, we need the action value q(s_{t+1}, a_{t+1}) of the next step t+1, how is this supposed to work in practice? Doesn't this remind you of Baron Münchhausen, who escapes from the swamp by pulling himself up by the hair?
There are simple solutions to this: we can, for instance, wait until step t+1 and then perform the update (3.10), that is, always learn with a delay of one step. Or we may exploit the fact that we determine the actions ourselves via the policy (provided we are not learning from historical data): at step t, we already know our next action a_{t+1} and can thus work with the current q(s_{t+1}, a_{t+1}) (Fig. 3.7).
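To make the timing concrete, here is a minimal sketch of the "learn with a delay of one step" variant, assuming a tabular action-value function stored in a Python dictionary. The environment, its interface (reset, step), the epsilon-greedy policy and all names are illustrative assumptions, not code from the text.

from collections import defaultdict
import random

q = defaultdict(float)      # tabular action values q[(state, action)]
alpha, gamma = 0.1, 0.9     # learning parameter and discount factor (illustrative values)

def policy(state, actions, epsilon=0.1):
    """Epsilon-greedy choice based on the current action values."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: q[(state, a)])

class ChainEnv:
    """Hypothetical toy environment: a 3-state chain, terminal at state 2."""
    def reset(self):
        self.s = 0
        return self.s
    def step(self, action):
        self.s = min(self.s + 1, 2) if action == 1 else self.s
        done = (self.s == 2)
        return self.s, (1.0 if done else 0.0), done

def run_episode(env, actions=(0, 1)):
    s = env.reset()
    a = policy(s, actions)
    done = False
    while not done:
        s_next, r, done = env.step(a)                      # observe r_{t+1}, s_{t+1}
        a_next = policy(s_next, actions) if not done else None
        # temporal difference (3.11); the bootstrap term vanishes at episode end
        d = r + (gamma * q[(s_next, a_next)] if not done else 0.0) - q[(s, a)]
        q[(s, a)] += alpha * d                             # update (3.12), one step delayed
        s, a = s_next, a_next                              # step t becomes step t+1

for _ in range(200):
    run_episode(ChainEnv())

The point of the sketch is only the placement of the update: q(s_t, a_t) is changed at the moment when r_{t+1}, s_{t+1} and a_{t+1} are available, exactly as described above.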
To continue with the explanation, α_t is the learning parameter at step t. The higher it is, the faster the algorithm learns. Thus, the current temporal difference d_t is

$$ d_t(s_t, a_t, s_{t+1}, a_{t+1}) = r_{t+1} + \gamma \, q(s_{t+1}, a_{t+1}) - q(s_t, a_t) \qquad (3.11) $$
and (3.10) takes the following form:

$$ q(s_t, a_t) := q(s_t, a_t) + \alpha_t \, d_t(s_t, a_t, s_{t+1}, a_{t+1}). \qquad (3.12) $$
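As a small worked example with illustrative numbers (not taken from the text): suppose γ = 0.9, α_t = 0.1, the observed reward is r_{t+1} = 1, and the current estimates are q(s_t, a_t) = 1.5 and q(s_{t+1}, a_{t+1}) = 2. Then (3.11) and (3.12) give

$$ d_t = 1 + 0.9 \cdot 2 - 1.5 = 1.3, \qquad q(s_t, a_t) := 1.5 + 0.1 \cdot 1.3 = 1.63. $$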
This means that we compute the new estimate

$$ \tilde{q}(s_t, a_t) := r_{t+1} + \gamma \, q(s_{t+1}, a_{t+1}) $$
and subtract the previous iterate q(s_t, a_t) from it. If q̃(s_t, a_t) is greater than q(s_t, a_t), then the latter is increased in accordance with (3.11); if q̃(s_t, a_t) is less than q(s_t, a_t), then the latter is decreased in accordance with (3.11).
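Equivalently, since d_t = q̃(s_t, a_t) − q(s_t, a_t), the update (3.12) can be rewritten as a weighted average of the old estimate and the new one, which makes this increase/decrease behavior explicit:

$$ q(s_t, a_t) := (1 - \alpha_t)\, q(s_t, a_t) + \alpha_t\, \tilde{q}(s_t, a_t). $$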
So what does q̃(s_t, a_t) mean? We know that q(s_t, a_t) is the expected return taken across the remainder of the episode. The first term r_{t+1} is the direct reward of the recommendation a_t. The second term, γ q(s_{t+1}, a_{t+1}), is the discounted expected return from the new state s_{t+1}. It follows that there are once again two possibilities for the reason why q̃(s_t, a_t) may be higher than q(s_t, a_t): either the direct reward r_{t+1} is high or the action a_t has led to a valuable state s_{t+1} with a high action value q(s_{t+1}, a_{t+1}) (or both).