about $T_\mu$. For more information on temporal-difference learning, the interested reader is referred to the work of Bertsekas and Tsitsiklis [17] and Drugowitsch and Barry [79].
The temporal-difference learning operator $T_\mu^{(\lambda)}$ is parametrised by $0 \le \lambda \le 1$ and, when applied to $V$, results in [215]
$$
\left(T_\mu^{(\lambda)} V\right)_i = (1 - \lambda) \sum_{m=0}^{\infty} \lambda^m \, \mathrm{E}\!\left( \sum_{t=0}^{m} \gamma^t r_{x_t x_{t+1}} + \gamma^{m+1} V_{x_{m+1}} \,\Big|\, x_0 = x_i \right) \qquad (9.14)
$$
for $\lambda < 1$. The definition for $\lambda = 1$ is given in [79]. The expectation in the above expression is equivalent to the $n$-step return $V_n$ (9.1), which shows that the temporal-difference update is based on mixing returns of various lengths, where the mixing coefficients are controlled by λ. To implement the above update incrementally, Sutton uses eligibility traces that propagate current temporal differences to previously visited states [207].
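To make the incremental implementation concrete, the following is a minimal sketch of tabular TD(λ) with accumulating eligibility traces for evaluating a fixed policy μ. The episode generator `sample_episode`, the parameter values, and the choice of accumulating (rather than replacing) traces are illustrative assumptions, not prescriptions from the text.

```python
import numpy as np

def td_lambda(sample_episode, n_states, gamma=0.9, lam=0.8,
              alpha=0.05, n_episodes=1000):
    """Tabular TD(lambda) policy evaluation with accumulating traces.

    sample_episode() is assumed to yield (x_t, r_t, x_{t+1}) transitions
    generated by following the fixed policy mu.
    """
    V = np.zeros(n_states)
    for _ in range(n_episodes):
        e = np.zeros(n_states)                    # eligibility traces
        for x, r, x_next in sample_episode():
            delta = r + gamma * V[x_next] - V[x]  # temporal difference
            e[x] += 1.0                           # mark the visited state
            V += alpha * delta * e                # credit all traced states
            e *= gamma * lam                      # decay the traces
    return V
```

Each temporal difference is credited to previously visited states in proportion to their decayed traces, which realises the mixing of multi-step returns expressed by (9.14).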
Its most interesting property for our purpose is that $T_\mu^{(\lambda)}$ forms a contraction mapping with respect to the weighted norm $\|\cdot\|_D$, defined as in Sect. 5.2, where the diagonal weight matrix $D$ is given by the steady-state distribution of the Markov chain $P_\mu$ that corresponds to policy $\mu$ [215, 17].
More formally, we have for any $V$, $V'$,
$$
\left\| T_\mu^{(\lambda)} V - T_\mu^{(\lambda)} V' \right\|_D \;\le\; \frac{\gamma (1 - \lambda)}{1 - \gamma \lambda} \left\| V - V' \right\|_D \;\le\; \gamma \left\| V - V' \right\|_D. \qquad (9.15)
$$
Note that $T_\mu \equiv T_\mu^{(0)}$, and therefore $T_\mu$ also forms a contraction mapping with respect to $\|\cdot\|_D$. This property can be used to analyse both convergence and stability of the method, as shown in Sect. 9.4.
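The contraction property can also be checked numerically. The sketch below is not from the text: it evaluates $T_\mu^{(\lambda)}$ through a closed matrix form that follows from (9.14) for a finite chain, computes the steady-state weighting $D$, and verifies the bound (9.15) for randomly drawn value vectors; the chain, rewards, and parameter choices are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, gamma, lam = 5, 0.9, 0.7

# Arbitrary ergodic Markov chain P_mu and expected rewards r_bar (illustrative).
P = rng.random((n, n)) + 0.1
P /= P.sum(axis=1, keepdims=True)
r_bar = rng.random(n)

# Steady-state distribution d of P_mu (left eigenvector for eigenvalue 1).
evals, evecs = np.linalg.eig(P.T)
d = np.real(evecs[:, np.argmax(np.real(evals))])
d /= d.sum()

def norm_D(v):
    """Weighted norm ||v||_D with D = diag(d)."""
    return np.sqrt(np.sum(d * v ** 2))

def T_lam(V):
    """TD(lambda) operator in matrix form, obtained by summing the
    geometric series in (9.14) for a finite chain."""
    return V + np.linalg.solve(np.eye(n) - gamma * lam * P,
                               r_bar + gamma * P @ V - V)

# Empirical check of the contraction bound (9.15).
kappa = gamma * (1 - lam) / (1 - gamma * lam)
for _ in range(1000):
    V, W = rng.normal(size=n), rng.normal(size=n)
    assert norm_D(T_lam(V) - T_lam(W)) <= kappa * norm_D(V - W) + 1e-10
```

With these settings the modulus γ(1 − λ)/(1 − γλ) is about 0.73, smaller than the plain γ = 0.9, reflecting the second inequality in (9.15).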
9.2.5 SARSA(λ)
Coming to the first reinforcement learning algorithm, SARSA stands for State-Action-Reward-State-Action, as SARSA(0) requires only information on the current and next state/action pair and the reward that was received for the transition. Its name was coined by Sutton [208] for an algorithm that was developed in its approximate form by Rummery and Niranjan [192], and which is very similar to Wilson's ZCS [236], as discussed by Sutton and Barto [209, Chap. 6.10].
It conceptually performs policy iteration and uses TD(λ) to update its action-value function $Q$. More specifically, it performs optimistic policy iteration, where, in contrast to standard policy iteration, the policy improvement step is based on an incompletely evaluated policy.
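As a concrete illustration of this optimistic policy iteration, here is a minimal tabular SARSA(λ) sketch with accumulating traces and an ε-greedy improvement step. The environment interface (`env.reset()`, `env.step()`) and all parameter values are assumptions made for the example rather than part of the text.

```python
import numpy as np

def sarsa_lambda(env, n_states, n_actions, gamma=0.9, lam=0.8,
                 alpha=0.1, epsilon=0.1, n_episodes=500, seed=0):
    """Tabular SARSA(lambda): TD(lambda) applied to the action values Q,
    with epsilon-greedy improvement on the current estimates."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))

    def policy(x):
        # Optimistic improvement: act (mostly) greedily on the
        # not-yet-converged estimates Q.
        if rng.random() < epsilon:
            return int(rng.integers(n_actions))
        return int(np.argmax(Q[x]))

    for _ in range(n_episodes):
        e = np.zeros_like(Q)                   # traces over (state, action)
        x = env.reset()                        # assumed interface
        a = policy(x)
        done = False
        while not done:
            x_next, r, done = env.step(a)      # assumed interface
            a_next = policy(x_next)
            target = r if done else r + gamma * Q[x_next, a_next]
            delta = target - Q[x, a]           # temporal difference
            e[x, a] += 1.0
            Q += alpha * delta * e
            e *= gamma * lam
            x, a = x_next, a_next
    return Q
```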
Consider being in state $x_t$ at time $t$ and performing action $a_t$, leading to the next state $x_{t+1}$ and reward $r_t$. The current action-value function estimates are given by $Q_t$. These estimates are to be updated for $(x_t, a_t)$ to reflect the newly observed reward. The basis of policy iteration, as described by $T_\mu$ (9.10), is to update the estimate of the value function of one particular state by relating it to all potential next states and the expected reward for these