Algorithm 3.3: (continued)
3: initialize s, a
4: repeat for each step of episode
5:    take action a, observe r, s'
6:    update p^a_{ss'}, r^a_{ss'}, e.g. using (3.8)
7:    v(s) := max_a Σ_{s'} p^a_{ss'} [r^a_{ss'} + γ v(s')]
8: until s is terminal
9: v := VALUE_ITERATION(v, P, R, γ, θ)    ▹ update v over all states
10: until stop
Of course, in step 9, the policy iteration of Algorithm 3.1 may also be applied. Moreover, step 9 does not have to be performed after every episode; the call may also occur less often. Nor is it necessary to compute v exactly: the iteration process may, for example, be stopped before the termination criterion is fulfilled, and one iteration sweep through all states is enough. We only need to ensure that no states are permanently left out in the ADP.
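To make the loop concrete, here is a minimal Python sketch of this ADP scheme, under the assumption that the model update referenced as (3.8) is a plain empirical average of observed transitions and rewards; the environment interface (reset, step), the policy argument, and all identifiers are illustrative choices for this sketch, not the book's code.

from collections import defaultdict

gamma = 0.9               # discount factor γ
theta = 1e-6              # termination threshold θ for value iteration

n   = defaultdict(int)    # n[(s, a)]         visit counts
nsp = defaultdict(int)    # nsp[(s, a, s')]   transition counts
rs  = defaultdict(float)  # rs[(s, a, s')]    accumulated rewards
v   = defaultdict(float)  # state values v(s)

def p_hat(s, a, sp):
    return nsp[(s, a, sp)] / n[(s, a)] if n[(s, a)] else 0.0

def r_hat(s, a, sp):
    return rs[(s, a, sp)] / nsp[(s, a, sp)] if nsp[(s, a, sp)] else 0.0

def backup(s, states, actions):
    # step 7: v(s) := max_a Σ_{s'} p^a_{ss'} [r^a_{ss'} + γ v(s')]
    return max(sum(p_hat(s, a, sp) * (r_hat(s, a, sp) + gamma * v[sp])
                   for sp in states)
               for a in actions)

def value_iteration(states, actions):
    # step 9: sweep over all states until the largest change falls below θ
    while True:
        delta = 0.0
        for s in states:
            new = backup(s, states, actions)
            delta = max(delta, abs(new - v[s]))
            v[s] = new
        if delta < theta:
            break

def run_episode(env, states, actions, policy):
    s = env.reset()                          # step 3: initialize s, a
    a = policy(s)
    while True:                              # step 4: repeat for each step
        r, sp, done = env.step(a)            # step 5: take a, observe r, s'
        n[(s, a)] += 1                       # step 6: update model estimates
        nsp[(s, a, sp)] += 1
        rs[(s, a, sp)] += r
        v[s] = backup(s, states, actions)    # step 7: greedy one-state backup
        if done:                             # step 8: until s is terminal
            break
        s, a = sp, policy(sp)
    value_iteration(states, actions)         # step 9: update v over all states

As noted above, the call to value_iteration may be replaced by a single sweep, or by a policy-iteration step, without affecting the overall idea.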
Now we switch to the convergence of the TD(λ) method, which is a more complex topic. First, we rewrite the TD(λ) method (3.13) in vector notation and include the iteration index k:
w_k := w_k + α_t z_t d_t,     (3.20)
where w_k, in accordance with (3.17), represents the vector of action values q_k(s, a), and

z_t = [z_t(s_1, a_1)  z_t(s_1, a_2)  ...  z_t(s_N, a_M)]^T.
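As a small illustration of the vector form (3.20), the following Python/NumPy sketch performs one TD(λ) update on the stacked vector of action values; the accumulating trace with decay γλ, the indexing helper, and the example sizes are assumptions made for this sketch, not definitions taken from the text.

import numpy as np

N, M = 5, 3                      # example numbers of states and actions
gamma, lam, alpha = 0.9, 0.8, 0.1

w = np.zeros(N * M)              # action values q(s, a), stacked as in (3.17)
z = np.zeros(N * M)              # eligibility traces z_t(s, a)

def idx(s, a):
    # position of the pair (s, a) in the stacked vector (assumed ordering)
    return s * M + a

def td_lambda_step(s, a, r, s_next, a_next, terminal):
    global w, z
    z *= gamma * lam                          # decay all traces
    z[idx(s, a)] += 1.0                       # accumulate trace of visited pair
    target = 0.0 if terminal else w[idx(s_next, a_next)]
    d = r + gamma * target - w[idx(s, a)]     # TD error d_t
    w = w + alpha * z * d                     # the update (3.20)

Every component of w receives a correction proportional to its current trace, which is exactly what the product z_t d_t in (3.20) expresses.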
The following theorem ensures the convergence.
Theorem 3.2 Let the following assumptions hold:
1. π is a policy for an MDP M such that M^π is irreducible and aperiodic.
2. (i_t)_{t ∈ ℕ_0} is a trajectory generated by M^π.
3. λ ∈ [0, 1].
4. The sequence of step sizes (α_t)_{t ∈ ℕ_0} satisfies (3.9).
Then, for any initial guess w_0, the sequence (w_k)_{k ∈ ℕ_0} generated by the iteration (3.20) converges almost surely to w^π.
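The condition (3.9) is not restated on this page; in this context it is presumably the usual stochastic-approximation requirement that the step sizes sum to infinity while their squares remain summable, i.e. Σ_t α_t = ∞ and Σ_t α_t² < ∞. Under that reading, a standard example is α_t = 1/(t + 1): the harmonic series diverges, whereas Σ_t 1/(t + 1)² converges.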
Now, following the reasoning of Sect. 3.6, by selecting a proper policy in each iteration step, we ensure the convergence of the TD(λ) method to the optimal policy π*. For details, we refer to [BT96].
Finally, we present the TD(λ) algorithm, namely the Sarsa(λ) version, in Algorithm 3.4.
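Algorithm 3.4 itself does not appear on this page; as an orientation, the following is a minimal tabular Sarsa(λ) sketch in Python with an ε-greedy policy and accumulating traces, applying the update (3.20) to every state-action pair at each step. The environment interface and all identifiers are illustrative assumptions, not the book's listing.

import numpy as np

def sarsa_lambda(env, n_states, n_actions, episodes,
                 gamma=0.9, lam=0.8, alpha=0.1, eps=0.1):
    # env is assumed to offer reset() -> s and step(a) -> (s', r, done)
    q = np.zeros((n_states, n_actions))

    def eps_greedy(s):
        if np.random.rand() < eps:
            return np.random.randint(n_actions)
        return int(np.argmax(q[s]))

    for _ in range(episodes):
        z = np.zeros_like(q)                   # reset traces at episode start
        s = env.reset()
        a = eps_greedy(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = eps_greedy(s_next)
            target = 0.0 if done else q[s_next, a_next]
            d = r + gamma * target - q[s, a]   # TD error d_t
            z[s, a] += 1.0                     # accumulating trace
            q += alpha * d * z                 # update all pairs, cf. (3.20)
            z *= gamma * lam                   # decay traces
            s, a = s_next, a_next
    return q

Sarsa(λ) is on-policy: the ε-greedy policy is derived from the current action values, so the policy changes as q is updated.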