Algorithm 3.3: (continued)
3: initialize s, a
4: repeat for each step of episode
5:    take action a, observe r, s'
6:    update p^a_{ss'}, r^a_{ss'}, e.g. using (3.8)
7:    v(s) := max_a Σ_{s'} p^a_{ss'} [r^a_{ss'} + γ v(s')]
8: until s is terminal
9: v := VALUE_ITERATION(v, P, R, γ, θ)    ▹ update v over all states
10: until stop
Of course, in step 9, the policy iteration of Algorithm 3.1 may also be applied. Moreover, step 9 does not have to be performed after every episode; the call may also occur less often. Nor is it necessary to compute v exactly: the iteration process may, for example, be stopped before the termination criterion is fulfilled, and one iteration sweep through all states is enough. We only need to ensure that no states are permanently left out in the ADP.
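To make the loop concrete, here is a minimal Python sketch of this ADP scheme, under the assumption that the model update referenced as (3.8) is a plain empirical average of observed transitions and rewards; the environment interface (reset, step), the policy argument, and all identifiers are illustrative choices for this sketch, not the book's code.

from collections import defaultdict

gamma = 0.9               # discount factor γ
theta = 1e-6              # termination threshold θ for value iteration

n   = defaultdict(int)    # n[(s, a)]         visit counts
nsp = defaultdict(int)    # nsp[(s, a, s')]   transition counts
rs  = defaultdict(float)  # rs[(s, a, s')]    accumulated rewards
v   = defaultdict(float)  # state values v(s)

def p_hat(s, a, sp):
    return nsp[(s, a, sp)] / n[(s, a)] if n[(s, a)] else 0.0

def r_hat(s, a, sp):
    return rs[(s, a, sp)] / nsp[(s, a, sp)] if nsp[(s, a, sp)] else 0.0

def backup(s, states, actions):
    # step 7: v(s) := max_a Σ_{s'} p^a_{ss'} [r^a_{ss'} + γ v(s')]
    return max(sum(p_hat(s, a, sp) * (r_hat(s, a, sp) + gamma * v[sp])
                   for sp in states)
               for a in actions)

def value_iteration(states, actions):
    # step 9: sweep over all states until the largest change falls below θ
    while True:
        delta = 0.0
        for s in states:
            new = backup(s, states, actions)
            delta = max(delta, abs(new - v[s]))
            v[s] = new
        if delta < theta:
            break

def run_episode(env, states, actions, policy):
    s = env.reset()                          # step 3: initialize s, a
    a = policy(s)
    while True:                              # step 4: repeat for each step
        r, sp, done = env.step(a)            # step 5: take a, observe r, s'
        n[(s, a)] += 1                       # step 6: update model estimates
        nsp[(s, a, sp)] += 1
        rs[(s, a, sp)] += r
        v[s] = backup(s, states, actions)    # step 7: greedy one-state backup
        if done:                             # step 8: until s is terminal
            break
        s, a = sp, policy(sp)
    value_iteration(states, actions)         # step 9: update v over all states

As noted above, the call to value_iteration may be replaced by a single sweep, or by a policy-iteration step, without affecting the overall idea.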
Now we switch to the convergence of the TD(λ) method, which is a more complex topic. First, we rewrite the TD(λ) method (3.13) in vector notation and include the iteration index k:
w_k := w_k + α_t z_t d_t,     (3.20)
where w_k, in accordance with (3.17), represents the vector of action values q_k(s, a), and

z_t = [z_t(s_1, a_1)  z_t(s_1, a_2)  ...  z_t(s_N, a_M)]^T.
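As a small illustration of the vector form (3.20), the following Python/NumPy sketch performs one TD(λ) update on the stacked vector of action values; the accumulating trace with decay γλ, the indexing helper, and the example sizes are assumptions made for this sketch, not definitions taken from the text.

import numpy as np

N, M = 5, 3                      # example numbers of states and actions
gamma, lam, alpha = 0.9, 0.8, 0.1

w = np.zeros(N * M)              # action values q(s, a), stacked as in (3.17)
z = np.zeros(N * M)              # eligibility traces z_t(s, a)

def idx(s, a):
    # position of the pair (s, a) in the stacked vector (assumed ordering)
    return s * M + a

def td_lambda_step(s, a, r, s_next, a_next, terminal):
    global w, z
    z *= gamma * lam                          # decay all traces
    z[idx(s, a)] += 1.0                       # accumulate trace of visited pair
    target = 0.0 if terminal else w[idx(s_next, a_next)]
    d = r + gamma * target - w[idx(s, a)]     # TD error d_t
    w = w + alpha * z * d                     # the update (3.20)

Every component of w receives a correction proportional to its current trace, which is exactly what the product z_t d_t in (3.20) expresses.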
The following theorem ensures the convergence.
Theorem 3.2 Let the following assumptions hold:
1. π is a policy for an MDP M such that M^π is irreducible and aperiodic.
2. (i_t)_{t ∈ ℕ_0} is a trajectory generated by M^π.
3. λ ∈ [0, 1].
4. The sequence of step sizes (α_t)_{t ∈ ℕ_0} satisfies (3.9).
Then, for any initial guess w_0, the sequence (w_k)_{k ∈ ℕ_0} generated by the iteration (3.20) converges almost surely to w^π.
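The condition (3.9) is not restated on this page; in this context it is presumably the usual stochastic-approximation requirement that the step sizes sum to infinity while their squares remain summable, i.e. Σ_t α_t = ∞ and Σ_t α_t² < ∞. Under that reading, a standard example is α_t = 1/(t + 1): the harmonic series diverges, whereas Σ_t 1/(t + 1)² converges.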
Now, following the reasoning of Sect. 3.6, by selecting a proper policy in each iteration step, we ensure the convergence of the TD(λ) method to the optimal policy π*. For details, we refer to [BT96].
Finally, we present the TD(λ) algorithm, namely the Sarsa(λ) version, in Algorithm 3.4.
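Algorithm 3.4 itself does not appear on this page; as an orientation, the following is a minimal tabular Sarsa(λ) sketch in Python with an ε-greedy policy and accumulating traces, applying the update (3.20) to every state-action pair at each step. The environment interface and all identifiers are illustrative assumptions, not the book's listing.

import numpy as np

def sarsa_lambda(env, n_states, n_actions, episodes,
                 gamma=0.9, lam=0.8, alpha=0.1, eps=0.1):
    # env is assumed to offer reset() -> s and step(a) -> (s', r, done)
    q = np.zeros((n_states, n_actions))

    def eps_greedy(s):
        if np.random.rand() < eps:
            return np.random.randint(n_actions)
        return int(np.argmax(q[s]))

    for _ in range(episodes):
        z = np.zeros_like(q)                   # reset traces at episode start
        s = env.reset()
        a = eps_greedy(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = eps_greedy(s_next)
            target = 0.0 if done else q[s_next, a_next]
            d = r + gamma * target - q[s, a]   # TD error d_t
            z[s, a] += 1.0                     # accumulating trace
            q += alpha * d * z                 # update all pairs, cf. (3.20)
            z *= gamma * lam                   # decay traces
            s, a = s_next, a_next
    return q

Sarsa(λ) is on-policy: the ε-greedy policy is derived from the current action values, so the policy changes as q is updated.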