$$w_{k+1} = T_\pi w_k, \qquad T_\pi(w) = \hat{r}_\pi + \gamma P_\pi w.$$
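This fixed-point iteration for policy evaluation can be sketched directly in matrix form. The following is a minimal illustration; the transition matrix `P_pi`, reward vector `r_pi`, and discount rate used here are assumed example values, not taken from the text.

```python
import numpy as np

# Illustrative 2-state example (P_pi, r_pi, gamma are assumptions).
P_pi = np.array([[0.9, 0.1],
                 [0.2, 0.8]])   # row-stochastic transition matrix under pi
r_pi = np.array([1.0, 0.0])     # expected one-step reward under pi
gamma = 0.9

# Iterate w_{k+1} = T_pi(w_k) = r_pi + gamma * P_pi @ w_k until it settles.
w = np.zeros(2)
for _ in range(1000):
    w_next = r_pi + gamma * P_pi @ w
    if np.max(np.abs(w_next - w)) < 1e-10:
        break
    w = w_next

# The fixed point also solves the linear system (I - gamma * P_pi) w = r_pi.
w_exact = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)
```

Since the spectral radius of $\gamma P_\pi$ is below one, the iterates contract geometrically toward the unique fixed point, which coincides with the direct linear solve.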
Therefore, we can now state the complete policy iteration algorithm (Algorithm 3.1). To keep the description simple, we again use the state-value function, keeping in mind that the algorithm looks similar for the action-value function.
Algorithm 3.1: Policy iteration
Input: transition probabilities P and rewards R, discount rate γ, small inner bound θ ∈ ℝ, θ > 0
Output: optimal state-value function v*
1.  procedure POLICY_ITERATION(v, P, R, γ, θ)
2.    repeat                                        ⊳ block of policy evaluation
3.      Δ := 0
4.      for each s ∈ S do
5.        v := v(s)
6.        v(s) := Σ_{s′} p_{ss′}^{π(s)} [r_{ss′}^{π(s)} + γ v(s′)]
7.        Δ := max(Δ, |v − v(s)|)
8.      end for
9.    until Δ < θ
10.   policy-stable := true                         ⊳ block of policy improvement
11.   for each s ∈ S do
12.     b := π(s)
13.     π(s) := arg max_a Σ_{s′} p_{ss′}^a [r_{ss′}^a + γ v(s′)]
14.     if b ≠ π(s) then
15.       policy-stable := false
16.     end if
17.   end for
18.   if policy-stable then
19.     stop
20.   else
21.     goto 2
22.   end if
23.   return v
24. end procedure
25. initialize v(s), π(s) ∀ s ∈ S arbitrarily
26. v := POLICY_ITERATION(v, P, R, γ, θ)
27. return v
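The steps of Algorithm 3.1 can be sketched in code as follows. This is a minimal illustration, not the definitive implementation: the array layout `P[a, s, s']`, `R[a, s, s']` and the tiny two-state MDP at the bottom are assumptions chosen for the example.

```python
import numpy as np

def policy_iteration(v, pi, P, R, gamma, theta):
    """Sketch of Algorithm 3.1. P[a, s, s'] holds transition probabilities,
    R[a, s, s'] the rewards; pi maps each state to an action (assumed layout)."""
    n_actions, n_states = P.shape[0], P.shape[1]
    while True:
        # Block of policy evaluation: sweep states until the change falls below theta.
        while True:
            delta = 0.0
            for s in range(n_states):
                v_old = v[s]
                a = pi[s]
                v[s] = np.sum(P[a, s] * (R[a, s] + gamma * v))
                delta = max(delta, abs(v_old - v[s]))
            if delta < theta:
                break
        # Block of policy improvement: act greedily w.r.t. the current v.
        policy_stable = True
        for s in range(n_states):
            b = pi[s]
            q = [np.sum(P[a, s] * (R[a, s] + gamma * v)) for a in range(n_actions)]
            pi[s] = int(np.argmax(q))
            if b != pi[s]:
                policy_stable = False
        if policy_stable:
            return v, pi

# Tiny illustrative MDP (two states, two actions); all values are assumptions.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],    # action 0: stay in current state
              [[0.0, 1.0], [1.0, 0.0]]])   # action 1: switch state
R = np.array([[[0.0, 0.0], [0.0, 1.0]],    # staying in state 1 yields reward 1
              [[0.0, 1.0], [0.0, 0.0]]])   # moving to state 1 yields reward 1
v0 = np.zeros(2)
pi0 = np.zeros(2, dtype=int)
v_star, pi_star = policy_iteration(v0, pi0, P, R, gamma=0.9, theta=1e-8)
```

In this toy MDP the greedy policy learns to reach and stay in state 1, and both state values converge to 1/(1 − γ) = 10.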
Of course, there are numerous variants of policy iteration for solving the Bellman equation. First, instead of directly applying (3.19) in the policy evaluation to solve (3.15), which may be regarded as a simple Richardson iteration, we may employ more sophisticated iterative procedures. We shall address methods for convergence acceleration.