w_\pi = \hat{r}_\pi + \gamma \hat{P}_\pi w_\pi,    (3.17)

where the vectors of the action values w_\pi and rewards \hat{r}_\pi together with the matrix of transition probabilities \hat{P}_\pi are defined as follows:

w_\pi = [ q_\pi(s_1, a_1), q_\pi(s_1, a_2), \ldots, q_\pi(s_N, a_M) ]^T,

\hat{r}_\pi = [ \hat{r}_\pi(s_1, a_1), \hat{r}_\pi(s_1, a_2), \ldots, \hat{r}_\pi(s_N, a_M) ]^T,

\hat{r}_\pi(s, a) = \sum_{s'} p^a_{ss'} \, r^a_{ss'}, \qquad \hat{p}_\pi(s, a, s', a') = p^a_{ss'} \, \pi(s', a').    (3.18)
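To make the construction concrete, the following sketch (in Python with NumPy) assembles \hat{r}_\pi and \hat{P}_\pi as in (3.18) for a small, made-up MDP and solves the linear fixed point equation (3.17) directly; the transition probabilities p, rewards r, policy pi, and discount factor gamma are illustrative assumptions, not data from the text.

```python
# Sketch: assembling (3.17)/(3.18) for a small, made-up MDP and solving
# w_pi = r_hat_pi + gamma * P_hat_pi @ w_pi directly.  All quantities
# (p[s, a, s'], r[s, a, s'], pi[s, a], gamma) are illustrative assumptions.
import numpy as np

N, M = 3, 2                      # number of states s_1..s_N and actions a_1..a_M
gamma = 0.9

rng = np.random.default_rng(0)
p = rng.random((N, M, N)); p /= p.sum(axis=2, keepdims=True)   # p^a_{ss'}
r = rng.random((N, M, N))                                      # r^a_{ss'}
pi = rng.random((N, M)); pi /= pi.sum(axis=1, keepdims=True)   # pi(s, a)

# r_hat_pi(s, a) = sum_{s'} p^a_{ss'} r^a_{ss'}   -> vector of length N*M
r_hat = (p * r).sum(axis=2).reshape(N * M)

# p_hat_pi(s, a, s', a') = p^a_{ss'} pi(s', a')    -> (N*M) x (N*M) matrix
P_hat = np.einsum("sat,tb->satb", p, pi).reshape(N * M, N * M)

# Solve the linear fixed point equation (3.17): (I - gamma * P_hat) w = r_hat
w = np.linalg.solve(np.eye(N * M) - gamma * P_hat, r_hat)
print(w.reshape(N, M))           # q_pi(s_i, a_j) arranged by state and action
```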
As an example, we now consider the state-value function and define the Bellman operator T_\pi for a policy \pi as

T_\pi(v) = r_\pi + \gamma P_\pi v

and study the iteration

v_{k+1} = T_\pi v_k.    (3.19)
The following proposition can be easily shown.
Proposition 3.1 For all MDPs, each policy has a unique and finite state-value function, which is the unique fixed point of the iteration defined in Equation (3.19). The iteration converges to its unique fixed point at the asymptotic rate \gamma for any initial guess. The asymptotic rate is attained in the l_\infty norm.
Thus, Proposition 3.1 ensures that in each policy evaluation step of the policy
iteration, we will find a unique solution of the Bellman equation.
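As an illustration of the iteration (3.19) and of the geometric rate stated in Proposition 3.1, a minimal sketch can apply T_\pi repeatedly to an arbitrary initial guess and compare the result with the exact fixed point; the matrix P_pi, the reward vector r_pi, and gamma below are made-up assumptions.

```python
# Sketch of the fixed point iteration (3.19), v_{k+1} = T_pi v_k, with
# T_pi(v) = r_pi + gamma * P_pi @ v.  P_pi, r_pi and gamma are illustrative
# assumptions; any row-stochastic P_pi and finite r_pi will do.
import numpy as np

gamma = 0.9
P_pi = np.array([[0.5, 0.5, 0.0],
                 [0.1, 0.6, 0.3],
                 [0.0, 0.2, 0.8]])      # row-stochastic transition matrix under pi
r_pi = np.array([1.0, 0.0, 2.0])        # expected one-step rewards under pi

v_star = np.linalg.solve(np.eye(3) - gamma * P_pi, r_pi)   # exact fixed point

v = np.zeros(3)                          # arbitrary initial guess
for k in range(60):
    v = r_pi + gamma * P_pi @ v          # one application of T_pi
err = np.max(np.abs(v - v_star))
# the sup-norm error is bounded by gamma**k * ||v_0 - v_star||_inf
print(err, gamma ** 60)
```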
We now turn our attention to the existence and uniqueness of an optimal policy \pi^*, that is, a policy fulfilling

v_{\pi^*} \geq v_\pi \quad \text{for all } \pi \in \Pi_M,

where \Pi_M is the space of all policies of the Markov decision process M. The following theorem provides the desired result.
Theorem 3.1 The policy iteration terminates after a finite number of iterations with the tuple (\pi^*, v_{\pi^*}).
Thus, it turns out that the convergence of the policy iteration does not require
further specific assumptions.
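A compact sketch of the full policy iteration for a finite MDP, alternating exact policy evaluation (whose unique solvability is guaranteed by Proposition 3.1) with greedy policy improvement until the policy is stable, could look as follows; the MDP data and gamma are again illustrative assumptions.

```python
# Sketch of policy iteration for a finite MDP: exact policy evaluation
# alternating with greedy improvement until the policy no longer changes,
# returning the tuple (pi*, v_pi*) as in Theorem 3.1.
# The MDP data p[s, a, s'], r[s, a, s'] and gamma are illustrative assumptions.
import numpy as np

def policy_iteration(p, r, gamma):
    N, M, _ = p.shape
    r_sa = (p * r).sum(axis=2)                     # expected reward r(s, a)
    pi = np.zeros(N, dtype=int)                    # deterministic policy: state -> action
    while True:
        # policy evaluation: solve (I - gamma * P_pi) v = r_pi exactly
        P_pi = p[np.arange(N), pi]                 # N x N matrix with rows p^{pi(s)}_{s,.}
        r_pi = r_sa[np.arange(N), pi]
        v = np.linalg.solve(np.eye(N) - gamma * P_pi, r_pi)
        # policy improvement: greedy with respect to the action values
        q = r_sa + gamma * p @ v                   # q(s, a) for all state-action pairs
        pi_new = q.argmax(axis=1)
        if np.array_equal(pi_new, pi):             # stable policy -> (pi*, v_pi*)
            return pi, v
        pi = pi_new

rng = np.random.default_rng(1)
p = rng.random((4, 3, 4)); p /= p.sum(axis=2, keepdims=True)
r = rng.random((4, 3, 4))
print(policy_iteration(p, r, gamma=0.95))
```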
At the same time, iteration (3.19) answers the open question of Sect. 3.4 regarding a solution method for the Bellman equation for a given policy \pi. If we work with the action-value function instead of the state-value function, as, for example, in Sect. 3.6, everything described here carries over to the action values w_\pi, and instead of (3.19) we obtain the corresponding fixed point equation for the action values.
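Presumably that equation is the action-value counterpart of (3.19), i.e. the iteration w_{k+1} = \hat{r}_\pi + \gamma \hat{P}_\pi w_k suggested by (3.17); this is an assumption about its form, and the short sketch below (with the same made-up data as in the first sketch) simply runs that iteration.

```python
# Sketch of the presumed action-value counterpart of (3.19):
# w_{k+1} = r_hat_pi + gamma * P_hat_pi @ w_k.  The MDP data are the same
# illustrative assumptions used in the first sketch above.
import numpy as np

N, M, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
p = rng.random((N, M, N)); p /= p.sum(axis=2, keepdims=True)   # p^a_{ss'}
r = rng.random((N, M, N))                                      # r^a_{ss'}
pi = rng.random((N, M)); pi /= pi.sum(axis=1, keepdims=True)   # pi(s, a)
r_hat = (p * r).sum(axis=2).reshape(N * M)
P_hat = np.einsum("sat,tb->satb", p, pi).reshape(N * M, N * M)

w = np.zeros(N * M)                      # arbitrary initial guess
for _ in range(200):
    w = r_hat + gamma * P_hat @ w        # one application of the action-value operator
# w now approximates the unique solution of (3.17), i.e. the action values q_pi
print(w.reshape(N, M))
```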