about $T_\mu$. For more information on temporal-difference learning, the interested reader is referred to the work of Bertsekas and Tsitsiklis [17] and Drugowitsch and Barry [79].
The temporal-difference learning operator $T_\mu^{(\lambda)}$ is parametrised by $0 \le \lambda \le 1$ and, when applied to $V$, results in [215]
$$
\big(T_\mu^{(\lambda)} V\big)_i = (1 - \lambda) \sum_{m=0}^{\infty} \lambda^m \, \mathrm{E}\!\left( \sum_{t=0}^{m} \gamma^t r_{x_t x_{t+1}} + \gamma^{m+1} V(x_{m+1}) \;\middle|\; x_0 = x_i \right) \qquad (9.14)
$$
for $\lambda < 1$. The definition for $\lambda = 1$ is given in [79]. The expectation in the above expression is equivalent to the $n$-step return $V^n$ (9.1), which shows that the temporal-difference update is based on mixing returns of various lengths, where the mixing coefficients are controlled by $\lambda$. To implement the above update incrementally, Sutton uses eligibility traces that propagate current temporal differences to previously visited states [207].
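To make the incremental, eligibility-trace-based implementation concrete, the following is a minimal sketch of tabular TD(λ) policy evaluation with accumulating traces. It assumes states are indexed by integers, that episodes end in an absorbing state whose value remains zero, and that the step size is supplied by the caller; all names and the accumulating-trace variant are illustrative choices, not prescribed by the text.

import numpy as np

def td_lambda_episode(V, episode, gamma, lam, alpha):
    # One TD(lambda) pass over an episode given as a list of
    # (state, reward, next_state) transitions under the fixed policy mu.
    z = np.zeros_like(V)                      # eligibility trace per state
    for x, r, x_next in episode:
        delta = r + gamma * V[x_next] - V[x]  # temporal-difference error
        z[x] += 1.0                           # accumulate trace of the visited state
        V += alpha * delta * z                # propagate delta to all traced states
        z *= gamma * lam                      # decay traces of older states
    return V

For λ = 0 the trace vector reduces to an indicator of the current state, and the update collapses to the one-step temporal-difference rule.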
Its most interesting property for our purpose is that $T_\mu^{(\lambda)}$ forms a contraction mapping with respect to the weighted norm $\|\cdot\|_D$, which is defined as given in Sect. 5.2, and the diagonal weight matrix $D$ is given by the steady-state distribution of the Markov chain $P_\mu$ that corresponds to policy $\mu$ [215, 17].
More formally, we have for any $V$, $V'$,
$$
\big\| T_\mu^{(\lambda)} V - T_\mu^{(\lambda)} V' \big\|_D \le \frac{\gamma (1 - \lambda)}{1 - \gamma \lambda} \, \big\| V - V' \big\|_D. \qquad (9.15)
$$
Note that $T_\mu^{(0)} = T_\mu$, and therefore $T_\mu$ also forms a contraction mapping with respect to $\|\cdot\|_D$. This property can be used to analyse both convergence and stability of the method, as shown in Sect. 9.4.
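To see the bound (9.15) at work, the following sketch checks it numerically on a small, randomly generated Markov chain. It uses the fact that the value-dependent part of $T_\mu^{(\lambda)}$ is the linear map $(1-\lambda)\gamma (I - \gamma\lambda P_\mu)^{-1} P_\mu$, which follows from summing the geometric series in (9.14); the chain, its size, and all parameter values are arbitrary illustrative choices.

import numpy as np

rng = np.random.default_rng(0)
n, gamma, lam = 5, 0.9, 0.7

# Random ergodic transition matrix P_mu and its steady-state distribution.
P = rng.random((n, n)) + 0.1
P /= P.sum(axis=1, keepdims=True)
pi = np.ones(n) / n
for _ in range(500):
    pi = pi @ P                      # power iteration towards the stationary distribution
D = np.diag(pi)

def d_norm(v):
    return np.sqrt(v @ D @ v)        # weighted norm ||.||_D

# Linear part of T_mu^(lambda): T V - T V' = (1-lam) * gamma * (I - gamma*lam*P)^-1 P (V - V').
M = (1 - lam) * gamma * np.linalg.solve(np.eye(n) - gamma * lam * P, P)

bound = gamma * (1 - lam) / (1 - gamma * lam)
for _ in range(1000):
    dv = rng.standard_normal(n)
    assert d_norm(M @ dv) <= bound * d_norm(dv) + 1e-12
print("contraction modulus is at most", bound)

Setting λ = 0 recovers the modulus γ of $T_\mu$, and the modulus shrinks towards zero as λ approaches 1.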
9.2.5 SARSA(λ)
Coming to the first reinforcement learning algorithm, SARSA stands for State-Action-Reward-State-Action, as SARSA(0) requires only information on the current and next state/action pair and the reward that was received for the transition. Its name was coined by Sutton [208] for an algorithm that was developed by Rummery and Niranjan [192] in its approximate form, which is very similar to Wilson's ZCS [236], as discussed by Sutton and Barto [209, Chap. 6.10]. It conceptually performs policy iteration and uses TD($\lambda$) to update its action-value function $Q$. More specifically, it performs optimistic policy iteration, where, in contrast to standard policy iteration, the policy improvement step is based on an incompletely evaluated policy.
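As a concrete reference for the discussion that follows, here is a minimal sketch of the tabular SARSA(0) update driven by an ε-greedy policy; the dictionary-backed Q-table, the helper names, and the parameter values are illustrative assumptions rather than the formulation used in the text.

import random
from collections import defaultdict

def epsilon_greedy(Q, x, actions, eps=0.1):
    # Exploit the current Q estimates with probability 1 - eps, otherwise explore.
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(x, a)])

def sarsa0_update(Q, x, a, r, x_next, a_next, alpha=0.1, gamma=0.9):
    # Tabular SARSA(0): move Q(x, a) towards the one-step estimate
    # r + gamma * Q(x_next, a_next) observed under the current policy.
    td_error = r + gamma * Q[(x_next, a_next)] - Q[(x, a)]
    Q[(x, a)] += alpha * td_error

Q = defaultdict(float)   # unseen state/action pairs default to a value of 0

SARSA(λ) with λ > 0 additionally maintains eligibility traces over state-action pairs, in direct analogy to the TD(λ) sketch above.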
Consider being in state $x_t$ at time $t$ and performing action $a_t$, leading to the next state $x_{t+1}$ and reward $r_t$. The current action-value function estimates are given by $Q_t$. These estimates are to be updated for $(x_t, a_t)$ to reflect the newly observed reward. The basis of policy iteration, as described by $T_\mu$ (9.10), is to update the estimate of the value function of one particular state by relating it to all potential next states and the expected reward for these