the fact that with no acceleration, we still remain in the same state or, in rare cases, even pass into the next-highest state. The result is therefore plausible.
3.6 Determining an Optimal Solution
The question remains how to determine an optimal solution to the Bellman equation, since we know neither its action-value function $q_\pi(s, a)$ nor its policy $\pi(s, a)$. A solution is provided by the policy iteration method from dynamic programming, which, in a generalized form, serves as a central tool in RL and for REs in general.
Policy iteration is based on the following approach: starting with an arbitrary initial policy $\pi_0$, in every step $i = 0, \ldots, n$ the action-value function $q_i$ corresponding to the current policy $\pi_i$ is computed by solving the Bellman equation (3.6) (the solution method will be described in Sect. 3.9.4). After this, we determine a greedy policy corresponding to the action-value function $q_i$, that is,
$$\pi_{i+1}(s) = \arg\max_{a \in A(s)} q_i(s, a).$$
In plain English, $\pi_{i+1}$ is taken to be a policy which in every state $s$ selects one of the actions $a$ so as to maximize $q_i(s, a)$. For $\pi_{i+1}$, the action-value function $q_{i+1}$
is then calculated in turn, and so on. This then yields a sequence of policies and
action-value functions:
$$\pi_0 \rightarrow q_0 \rightarrow \pi_1 \rightarrow q_1 \rightarrow \pi_2 \rightarrow q_2 \rightarrow \ldots$$
It can be shown that after a finite number of iterations, this process terminates with the optimal policy $\pi^*$ and corresponding action-value function $q^*$, which satisfy
$$q^*(s, a) = \max_\pi q_\pi(s, a), \quad \forall s \in S, \; \forall a \in A(s).$$
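To make the procedure concrete, the following is a minimal sketch of policy iteration for a finite MDP whose transition probabilities and rewards are known. The function name policy_iteration, the array layout P[s, a, s'] and R[s, a, s'], and the use of a direct linear solve for the evaluation step are illustrative assumptions; they are not the book's notation, and the actual solution method for the Bellman equation is the subject of Sect. 3.9.4.

```python
import numpy as np

def policy_iteration(P, R, gamma):
    """Policy iteration sketch for a finite MDP.
    P[s, a, s2] = transition probability, R[s, a, s2] = reward of that transition.
    Returns a deterministic policy and its action-value function q."""
    n_states, n_actions, _ = P.shape
    policy = np.zeros(n_states, dtype=int)           # arbitrary initial policy pi_0
    while True:
        # Policy evaluation: solve the linear Bellman system (I - gamma * P_pi) v = r_pi
        P_pi = P[np.arange(n_states), policy]        # (S, S) transitions under pi
        r_pi = np.einsum("ij,ij->i", P_pi, R[np.arange(n_states), policy])
        v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
        # Action values q_i(s, a) = sum_s' P(s'|s,a) * (r(s,a,s') + gamma * v(s'))
        q = np.einsum("saj,saj->sa", P, R + gamma * v[None, None, :])
        # Policy improvement: greedy policy pi_{i+1}(s) = argmax_a q_i(s, a)
        new_policy = np.argmax(q, axis=1)
        if np.array_equal(new_policy, policy):       # greedy step no longer changes pi
            return policy, q
        policy = new_policy
```

The loop stops as soon as the greedy improvement no longer changes the policy, mirroring the finite termination property stated above.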
Example 3.6 For our example of the car with $\gamma = 0.5$ as above, policy iteration yields, not surprisingly, the optimal policy $\pi^*$, which stipulates that in each of the three states the action $a_1$ be performed, that is, to always accelerate. The associated action values are
$$q^*(s_1, a_1) = 182, \quad q^*(s_2, a_1) = 194, \quad q^*(s_3, a_1) = 198,$$
and are thus all greater than those of the non-acceleration policy considered in the
last example, which is incidentally the least successful among all policies.
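As a purely illustrative usage of the sketch above, one can run it on a small three-state, two-action MDP in the spirit of the car example. The transition probabilities and rewards below are invented placeholders rather than the book's actual values, so the resulting action values will not reproduce 182, 194, and 198; the qualitative outcome, that accelerating is chosen in every state, is what the snippet is meant to show.

```python
import numpy as np

# Invented 3-state, 2-action MDP loosely in the spirit of the car example;
# these probabilities and rewards are placeholders, not the book's values.
P = np.zeros((3, 2, 3))
P[0, 0] = [0.1, 0.9, 0.0]    # a1 = accelerate: tend to move up one state
P[1, 0] = [0.0, 0.1, 0.9]
P[2, 0] = [0.0, 0.05, 0.95]
P[0, 1] = [0.9, 0.1, 0.0]    # a2 = do not accelerate: tend to stay
P[1, 1] = [0.1, 0.8, 0.1]
P[2, 1] = [0.0, 0.2, 0.8]

R = np.zeros((3, 2, 3))
R[:, :, 0], R[:, :, 1], R[:, :, 2] = 20, 60, 100   # reward depends on the state entered

policy, q = policy_iteration(P, R, gamma=0.5)
print(policy)                    # with these numbers: action a1 (index 0) in every state
print(q[np.arange(3), policy])   # the corresponding action values
```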
What will happen if we decrease the reward of the third state? We might, for instance, be pulled over by the police for exceeding a speed limit. The result for different $r_{ss_3}$ values is shown in Fig. 3.5, where (a) shows the case $r_{ss_3} = 100$ under consideration. The lower the value now assigned to this reward, the more unattractive