$$w_{k+1} = T^{\pi} w_k, \qquad T^{\pi}(w) = \hat{r}^{\pi} + \gamma P^{\pi} w.$$
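To make the fixed-point view concrete, here is a minimal NumPy sketch of one application of the operator T^π and of the resulting Richardson iteration; the names P_pi, r_pi, and gamma are illustrative assumptions, not notation taken from the text.

import numpy as np

def bellman_operator(w, P_pi, r_pi, gamma):
    """One application of T^pi(w) = r^pi + gamma * P^pi w.

    P_pi : (n, n) transition matrix of the fixed policy pi (assumed layout)
    r_pi : (n,)  expected one-step reward per state under pi
    w    : (n,)  current value estimate
    """
    return r_pi + gamma * P_pi @ w

def richardson_policy_evaluation(P_pi, r_pi, gamma, theta=1e-8):
    """Repeat w_{k+1} = T^pi(w_k) until the largest update is below theta."""
    w = np.zeros(len(r_pi))
    while True:
        w_new = bellman_operator(w, P_pi, r_pi, gamma)
        if np.max(np.abs(w_new - w)) < theta:
            return w_new
        w = w_new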
Therefore, we can now state the complete policy iteration algorithm (Algorithm 3.1). To keep the description simple, we again use the state-value function, keeping in mind that the procedure looks similar for the action-value function.
Algorithm 3.1: Policy iteration
Input: transition probabilities P and rewards R, discount rate γ, small inner bound θ ∈ ℝ, θ > 0
Output: optimal state-value function v*

1.  procedure POLICY_ITERATION(v, P, R, γ, θ)
2.    repeat                                        ▹ block of policy evaluation
3.      Δ := 0
4.      for each s ∈ S do
5.        v := v(s)
6.        v(s) := Σ_{s'} p^{π(s)}_{ss'} [ r^{π(s)}_{ss'} + γ v(s') ]
7.        Δ := max(Δ, |v − v(s)|)
8.      end for
9.    until Δ < θ
10.   policy-stable := true                         ▹ block of policy improvement
11.   for each s ∈ S do
12.     b := π(s)
13.     π(s) := arg max_a Σ_{s'} p^a_{ss'} [ r^a_{ss'} + γ v(s') ]
14.     if b ≠ π(s) then
15.       policy-stable := false
16.     end if
17.   end for
18.   if policy-stable then
19.     stop
20.   else
21.     goto 2
22.   end if
23.   return v
24. end procedure
25. initialize v(s), π(s) arbitrarily ∀ s ∈ S
26. v := POLICY_ITERATION(v, P, R, γ, θ)
27. return v
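As a complement to the pseudocode, the following is a compact Python sketch of Algorithm 3.1. It assumes the model is supplied as arrays P[a, s, s'] and R[a, s, s'] of transition probabilities and expected rewards; these array conventions and variable names are assumptions made for the sketch, not prescribed by the text.

import numpy as np

def policy_iteration(P, R, gamma, theta):
    """Policy iteration for a finite MDP.

    P[a, s, s2] : probability of moving from s to s2 under action a
    R[a, s, s2] : expected reward for that transition
    Returns an (approximately) optimal value function v and policy pi.
    """
    n_actions, n_states, _ = P.shape
    v = np.zeros(n_states)
    pi = np.zeros(n_states, dtype=int)

    while True:
        # --- policy evaluation: in-place sweeps until the change is below theta ---
        while True:
            delta = 0.0
            for s in range(n_states):
                v_old = v[s]
                a = pi[s]
                v[s] = np.sum(P[a, s] * (R[a, s] + gamma * v))
                delta = max(delta, abs(v_old - v[s]))
            if delta < theta:
                break

        # --- policy improvement: greedy update with respect to v ---
        policy_stable = True
        for s in range(n_states):
            b = pi[s]
            q = np.array([np.sum(P[a, s] * (R[a, s] + gamma * v))
                          for a in range(n_actions)])
            pi[s] = int(np.argmax(q))
            if b != pi[s]:
                policy_stable = False

        if policy_stable:
            return v, pi

The inner while loop corresponds to the policy-evaluation block (steps 2–9), and the greedy update corresponds to the policy-improvement block (steps 10–17).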
Of course, there are numerous variants of policy iteration for solving the Bellman equation. First, instead of directly applying (3.19) in the policy evaluation to solve (3.15), which may be regarded as a simple Richardson iteration, we may employ more sophisticated iteration procedures. We shall address this in Chap. 6, in particular with respect to its connection with hierarchical methods for convergence acceleration.
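For a finite state space, one standard alternative to the Richardson sweeps is to solve the linear Bellman system (I − γ P^π) v = r^π directly; the sketch below illustrates only this direct solve, not the hierarchical methods of Chap. 6, and its variable names are assumptions.

import numpy as np

def evaluate_policy_directly(P_pi, r_pi, gamma):
    """Exact policy evaluation: solve (I - gamma * P^pi) v = r^pi.

    For a fixed policy pi the Bellman equation is linear in v, so a
    direct linear solve can replace the iterative sweeps entirely.
    """
    n = P_pi.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)

The direct solve is exact up to rounding but costs O(n^3) for n states, whereas each Richardson sweep is cheap and contracts the error by a factor of γ, which is the trade-off that motivates accelerated iterative schemes.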