$$w_{k+1} = T_\pi w_k, \qquad T_\pi(w) = \hat{r}_\pi + \gamma P_\pi w.$$
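This fixed-point iteration for policy evaluation can be sketched directly in matrix form. The following is a minimal illustration; the transition matrix `P_pi`, reward vector `r_pi`, and discount rate used here are assumed example values, not taken from the text.

```python
import numpy as np

# Illustrative 2-state example (P_pi, r_pi, gamma are assumptions).
P_pi = np.array([[0.9, 0.1],
                 [0.2, 0.8]])   # row-stochastic transition matrix under pi
r_pi = np.array([1.0, 0.0])     # expected one-step reward under pi
gamma = 0.9

# Iterate w_{k+1} = T_pi(w_k) = r_pi + gamma * P_pi @ w_k until it settles.
w = np.zeros(2)
for _ in range(1000):
    w_next = r_pi + gamma * P_pi @ w
    if np.max(np.abs(w_next - w)) < 1e-10:
        break
    w = w_next

# The fixed point also solves the linear system (I - gamma * P_pi) w = r_pi.
w_exact = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)
```

Since the spectral radius of $\gamma P_\pi$ is below one, the iterates contract geometrically toward the unique fixed point, which coincides with the direct linear solve.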
Therefore, we can now state the complete policy iteration algorithm (Algorithm 3.1). To keep the description simple, we again use the state-value function, keeping in mind that the algorithm looks similar for the action-value function.
Algorithm 3.1: Policy iteration
Input: transition probabilities P and rewards R, discount rate γ, small inner bound θ ∈ ℝ, θ > 0
Output: optimal state-value function v*
1.  procedure POLICY_ITERATION(v, P, R, γ, θ)
2.    repeat                                        ⊳ block of policy evaluation
3.      Δ := 0
4.      for each s ∈ S do
5.        v := v(s)
6.        v(s) := Σ_{s′} p_{ss′}^{π(s)} [r_{ss′}^{π(s)} + γ v(s′)]
7.        Δ := max(Δ, |v − v(s)|)
8.      end for
9.    until Δ < θ
10.   policy-stable := true                         ⊳ block of policy improvement
11.   for each s ∈ S do
12.     b := π(s)
13.     π(s) := arg max_a Σ_{s′} p_{ss′}^a [r_{ss′}^a + γ v(s′)]
14.     if b ≠ π(s) then
15.       policy-stable := false
16.     end if
17.   end for
18.   if policy-stable then
19.     stop
20.   else
21.     goto 2
22.   end if
23.   return v
24. end procedure
25. initialize v(s), π(s) ∀ s ∈ S arbitrarily
26. v := POLICY_ITERATION(v, P, R, γ, θ)
27. return v
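The steps of Algorithm 3.1 can be sketched in code as follows. This is a minimal illustration, not the definitive implementation: the array layout `P[a, s, s']`, `R[a, s, s']` and the tiny two-state MDP at the bottom are assumptions chosen for the example.

```python
import numpy as np

def policy_iteration(v, pi, P, R, gamma, theta):
    """Sketch of Algorithm 3.1. P[a, s, s'] holds transition probabilities,
    R[a, s, s'] the rewards; pi maps each state to an action (assumed layout)."""
    n_actions, n_states = P.shape[0], P.shape[1]
    while True:
        # Block of policy evaluation: sweep states until the change falls below theta.
        while True:
            delta = 0.0
            for s in range(n_states):
                v_old = v[s]
                a = pi[s]
                v[s] = np.sum(P[a, s] * (R[a, s] + gamma * v))
                delta = max(delta, abs(v_old - v[s]))
            if delta < theta:
                break
        # Block of policy improvement: act greedily w.r.t. the current v.
        policy_stable = True
        for s in range(n_states):
            b = pi[s]
            q = [np.sum(P[a, s] * (R[a, s] + gamma * v)) for a in range(n_actions)]
            pi[s] = int(np.argmax(q))
            if b != pi[s]:
                policy_stable = False
        if policy_stable:
            return v, pi

# Tiny illustrative MDP (two states, two actions); all values are assumptions.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],    # action 0: stay in current state
              [[0.0, 1.0], [1.0, 0.0]]])   # action 1: switch state
R = np.array([[[0.0, 0.0], [0.0, 1.0]],    # staying in state 1 yields reward 1
              [[0.0, 1.0], [0.0, 0.0]]])   # moving to state 1 yields reward 1
v0 = np.zeros(2)
pi0 = np.zeros(2, dtype=int)
v_star, pi_star = policy_iteration(v0, pi0, P, R, gamma=0.9, theta=1e-8)
```

In this toy MDP the greedy policy learns to reach and stay in state 1, and both state values converge to 1/(1 − γ) = 10.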
Of course, there are numerous variants of policy iteration for solving the Bellman equation. First, instead of directly applying (3.19) in the policy evaluation to solve (3.15), which may be regarded as a simple Richardson iteration, we may employ more sophisticated iterative procedures. We shall address methods for convergence acceleration.