$$w^\pi = \hat{r}^\pi + \gamma \hat{P}^\pi w^\pi, \qquad (3.17)$$

where the vector of action values $w^\pi$ and the vector of rewards $\hat{r}^\pi$, together with the matrix of transition probabilities $\hat{P}^\pi$, are defined as follows:
$$w^\pi = \bigl[\, q^\pi(s_1, a_1),\; q^\pi(s_1, a_2),\; \ldots,\; q^\pi(s_N, a_M) \,\bigr]^T,$$
$$\hat{r}^\pi = \bigl[\, \hat{r}^\pi(s_1, a_1),\; \hat{r}^\pi(s_1, a_2),\; \ldots,\; \hat{r}^\pi(s_N, a_M) \,\bigr]^T,$$
$$\hat{r}^\pi(s, a) = \sum_{s'} p^a_{ss'}\, r^a_{ss'}, \qquad \hat{p}^\pi(s, a, s', a') = p^a_{ss'}\, \pi(s', a'). \qquad (3.18)$$
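As a concrete illustration of the definitions in (3.18), the following sketch builds the reward vector $\hat{r}^\pi$ and the transition matrix $\hat{P}^\pi$ from tabular MDP data. The array names and their layout are assumptions for illustration, not notation from the text.

```python
import numpy as np

# Minimal sketch of (3.18), assuming the MDP is given as NumPy arrays
# (illustrative names): p[s, a, s2] = transition probability p^a_{ss'},
# r[s, a, s2] = expected reward r^a_{ss'}, pi[s, a] = probability of a in s.
def build_action_value_system(p, r, pi):
    N, M, _ = p.shape                                   # N states, M actions
    # r_hat(s, a) = sum_{s'} p^a_{ss'} r^a_{ss'}, flattened in the order used in (3.18)
    r_hat = np.einsum("sak,sak->sa", p, r).reshape(N * M)
    # P_hat[(s, a), (s', a')] = p^a_{ss'} * pi(s', a')
    P_hat = np.einsum("sak,kb->sakb", p, pi).reshape(N * M, N * M)
    return r_hat, P_hat
```

With these two quantities, (3.17) can also be solved directly as $w^\pi = (I - \gamma \hat{P}^\pi)^{-1} \hat{r}^\pi$, since $I - \gamma \hat{P}^\pi$ is invertible for $\gamma < 1$.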
By way of example, we now consider the state-value function and define the Bellman operator $T^\pi$ for a policy $\pi$ as

$$T^\pi(v) = r^\pi + \gamma P^\pi v$$

and consider the iteration

$$v_{k+1} = T^\pi v_k. \qquad (3.19)$$
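A minimal sketch of iteration (3.19) in code, assuming $r^\pi$ and $P^\pi$ are already available as a vector and a matrix (the function and argument names are illustrative):

```python
import numpy as np

# Iterative policy evaluation via (3.19): v_{k+1} = T^pi v_k = r^pi + gamma * P^pi v_k.
def evaluate_policy(r_pi, P_pi, gamma, tol=1e-10, max_iter=100_000):
    v = np.zeros_like(r_pi)                    # any initial guess works
    for _ in range(max_iter):
        v_next = r_pi + gamma * P_pi @ v       # apply the Bellman operator T^pi
        if np.max(np.abs(v_next - v)) < tol:   # sup-norm stopping criterion
            return v_next
        v = v_next
    return v
```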
The following proposition can be easily shown.
Proposition 3.1 For all MDPs, each policy has a unique and finite state-value function, which is the unique fixed point of the iteration defined in Equation (3.19). The iteration converges to this fixed point at asymptotic rate $\gamma$ for any initial guess; the asymptotic rate is attained in the $\ell_\infty$ norm.
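The rate $\gamma$ comes from the standard contraction argument (not spelled out above): $T^\pi$ is a $\gamma$-contraction in the sup norm,

$$\| T^\pi v - T^\pi v' \|_\infty = \gamma \, \| P^\pi (v - v') \|_\infty \le \gamma \, \| v - v' \|_\infty,$$

since each row of $P^\pi$ is nonnegative and sums to one, so Banach's fixed-point theorem yields the unique fixed point and geometric convergence.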
Thus, Proposition 3.1 ensures that in each policy evaluation step of the policy
iteration, we will find a unique solution of the Bellman equation.
We now turn our attention to the existence and uniqueness of an optimal policy $\pi^*$, that is, a policy fulfilling $v^{\pi^*} \geq v^\pi$ for all $\pi \in \Pi_M$, where $\Pi_M$ is the space of all policies of the Markov decision process $M$. The following theorem provides the desired result.
Theorem 3.1 The policy iteration terminates after a finite number of iterations with the tuple $(\pi^*, v^{\pi^*})$.
Thus, it turns out that the convergence of the policy iteration does not require
further specific assumptions.
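To make the statement of Theorem 3.1 concrete, here is a sketch of policy iteration for a finite MDP in tabular form. The array layout and names are assumptions carried over from the sketch after (3.18), and the evaluation step solves the linear system directly rather than iterating (3.19).

```python
import numpy as np

# Sketch of policy iteration for a finite MDP. There are only finitely many
# deterministic policies and each improvement step is monotone, so the loop
# terminates with (pi*, v^{pi*}) as in Theorem 3.1. Array names are illustrative:
# p[s, a, s2] = p^a_{ss'}, r[s, a, s2] = r^a_{ss'}.
def policy_iteration(p, r, gamma):
    N, M, _ = p.shape
    policy = np.zeros(N, dtype=int)                  # arbitrary deterministic start policy
    while True:
        # Policy evaluation: solve v = r^pi + gamma * P^pi v directly (Proposition 3.1)
        P_pi = p[np.arange(N), policy]                               # (N, N)
        r_pi = np.einsum("sk,sk->s", P_pi, r[np.arange(N), policy])  # expected rewards
        v = np.linalg.solve(np.eye(N) - gamma * P_pi, r_pi)
        # Policy improvement: act greedily with respect to the resulting action values
        q = np.einsum("sak,sak->sa", p, r) + gamma * np.einsum("sak,k->sa", p, v)
        new_policy = np.argmax(q, axis=1)
        if np.array_equal(new_policy, policy):
            return policy, v                                         # (pi*, v^{pi*})
        policy = new_policy
```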
At the same time, iteration (3.19) answers the open question of Sect. 3.4 regarding a solution method of the Bellman equation for a given policy $\pi$. In case we are working with the action-value function instead of the state-value function, as, for example, in Sect. 3.6, everything described here carries over to the action values $w^\pi$, and instead of (3.19) we obtain the corresponding fixed-point equation.
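Read off from (3.17), this fixed-point iteration takes the form

$$w_{k+1} = \hat{r}^\pi + \gamma \hat{P}^\pi w_k,$$

with the same convergence behaviour as (3.19), since $\hat{P}^\pi$ is again a stochastic matrix.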