$$w^\pi = \hat{r}^\pi + \gamma \hat{P}^\pi w^\pi, \qquad (3.17)$$

where the vector of action values $w^\pi$ and the vector of rewards $\hat{r}^\pi$, together with the matrix of transition probabilities $\hat{P}^\pi$, are defined as follows:
$$w^\pi = \bigl[\, q^\pi(s_1, a_1),\; q^\pi(s_1, a_2),\; \ldots,\; q^\pi(s_N, a_M) \,\bigr]^T,$$
$$\hat{r}^\pi = \bigl[\, \hat{r}^\pi(s_1, a_1),\; \hat{r}^\pi(s_1, a_2),\; \ldots,\; \hat{r}^\pi(s_N, a_M) \,\bigr]^T,$$
$$\hat{r}^\pi(s, a) = \sum_{s'} p^a_{ss'}\, r^a_{ss'}, \qquad \hat{p}^\pi(s, a, s', a') = p^a_{ss'}\, \pi(s', a'). \qquad (3.18)$$
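As a concrete illustration of the definitions in (3.18), the following sketch builds the reward vector $\hat{r}^\pi$ and the transition matrix $\hat{P}^\pi$ from tabular MDP data. The array names and their layout are assumptions for illustration, not notation from the text.

```python
import numpy as np

# Minimal sketch of (3.18), assuming the MDP is given as NumPy arrays
# (illustrative names): p[s, a, s2] = transition probability p^a_{ss'},
# r[s, a, s2] = expected reward r^a_{ss'}, pi[s, a] = probability of a in s.
def build_action_value_system(p, r, pi):
    N, M, _ = p.shape                                   # N states, M actions
    # r_hat(s, a) = sum_{s'} p^a_{ss'} r^a_{ss'}, flattened in the order used in (3.18)
    r_hat = np.einsum("sak,sak->sa", p, r).reshape(N * M)
    # P_hat[(s, a), (s', a')] = p^a_{ss'} * pi(s', a')
    P_hat = np.einsum("sak,kb->sakb", p, pi).reshape(N * M, N * M)
    return r_hat, P_hat
```

With these two quantities, (3.17) can also be solved directly as $w^\pi = (I - \gamma \hat{P}^\pi)^{-1} \hat{r}^\pi$, since $I - \gamma \hat{P}^\pi$ is invertible for $\gamma < 1$.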
By way of example, we now consider the state-value function and define the Bellman operator $T^\pi$ for a policy $\pi$ as

$$T^\pi(v) = r^\pi + \gamma P^\pi v$$

and consider the iteration

$$v_{k+1} = T^\pi v_k. \qquad (3.19)$$
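A minimal sketch of iteration (3.19) in code, assuming $r^\pi$ and $P^\pi$ are already available as a vector and a matrix (the function and argument names are illustrative):

```python
import numpy as np

# Iterative policy evaluation via (3.19): v_{k+1} = T^pi v_k = r^pi + gamma * P^pi v_k.
def evaluate_policy(r_pi, P_pi, gamma, tol=1e-10, max_iter=100_000):
    v = np.zeros_like(r_pi)                    # any initial guess works
    for _ in range(max_iter):
        v_next = r_pi + gamma * P_pi @ v       # apply the Bellman operator T^pi
        if np.max(np.abs(v_next - v)) < tol:   # sup-norm stopping criterion
            return v_next
        v = v_next
    return v
```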
The following proposition can be easily shown.
Proposition 3.1 For all MDPs, each policy has a unique and finite state-value function, which is the unique fixed point of the iteration defined in Equation (3.19). The iteration converges to this fixed point at asymptotic rate $\gamma$ for any initial guess; the asymptotic rate is attained in the $\ell_\infty$ norm.
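The rate $\gamma$ comes from the standard contraction argument (not spelled out above): $T^\pi$ is a $\gamma$-contraction in the sup norm,

$$\| T^\pi v - T^\pi v' \|_\infty = \gamma \, \| P^\pi (v - v') \|_\infty \le \gamma \, \| v - v' \|_\infty,$$

since each row of $P^\pi$ is nonnegative and sums to one, so Banach's fixed-point theorem yields the unique fixed point and geometric convergence.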
Thus, Proposition 3.1 ensures that in each policy evaluation step of the policy
iteration, we will find a unique solution of the Bellman equation.
We now turn our attention to the existence and uniqueness of an optimal policy $\pi^*$, that is, a policy fulfilling $v^{\pi^*} \geq v^\pi$ for all $\pi \in \Pi_M$, where $\Pi_M$ is the space of all policies of the Markov decision process $M$. The following theorem provides the desired result.
Theorem 3.1 The policy iteration terminates after a finite number of iterations with the tuple $(\pi^*, v^{\pi^*})$.
Thus, it turns out that the convergence of the policy iteration does not require
further specific assumptions.
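To make the statement of Theorem 3.1 concrete, here is a sketch of policy iteration for a finite MDP in tabular form. The array layout and names are assumptions carried over from the sketch after (3.18), and the evaluation step solves the linear system directly rather than iterating (3.19).

```python
import numpy as np

# Sketch of policy iteration for a finite MDP. There are only finitely many
# deterministic policies and each improvement step is monotone, so the loop
# terminates with (pi*, v^{pi*}) as in Theorem 3.1. Array names are illustrative:
# p[s, a, s2] = p^a_{ss'}, r[s, a, s2] = r^a_{ss'}.
def policy_iteration(p, r, gamma):
    N, M, _ = p.shape
    policy = np.zeros(N, dtype=int)                  # arbitrary deterministic start policy
    while True:
        # Policy evaluation: solve v = r^pi + gamma * P^pi v directly (Proposition 3.1)
        P_pi = p[np.arange(N), policy]                               # (N, N)
        r_pi = np.einsum("sk,sk->s", P_pi, r[np.arange(N), policy])  # expected rewards
        v = np.linalg.solve(np.eye(N) - gamma * P_pi, r_pi)
        # Policy improvement: act greedily with respect to the resulting action values
        q = np.einsum("sak,sak->sa", p, r) + gamma * np.einsum("sak,k->sa", p, v)
        new_policy = np.argmax(q, axis=1)
        if np.array_equal(new_policy, policy):
            return policy, v                                         # (pi*, v^{pi*})
        policy = new_policy
```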
At the same time, iteration (3.19) answers the open question of Sect. 3.4 regarding a solution method of the Bellman equation for a given policy $\pi$. In case we are working with the action-value function instead of the state-value function, as, for example, in Sect. 3.6, everything described here carries over to the action values $w^\pi$, and instead of (3.19) we obtain the corresponding fixed-point equation.
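Read off from (3.17), this fixed-point iteration takes the form

$$w_{k+1} = \hat{r}^\pi + \gamma \hat{P}^\pi w_k,$$

with the same convergence behaviour as (3.19), since $\hat{P}^\pi$ is again a stochastic matrix.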