Changing Not Just Analyzing: Control Theory and Reinforcement Learning - Realtime Data Mining


This is an even clearer form, since the state-value function $v^\pi(s)$ depends only on the state $s$, unlike the action-value function $q^\pi(s,a)$, which additionally depends on the action $a$. We will nevertheless work mainly with the action-value function, because we need it for the model-free case, which is of practical importance and which we have yet to reach. There the action-value function cannot be converted directly into the state-value function, since $p_{ss'}$ and $r_{ss'}$ are not explicitly known.

After so many abstract explanations, we shall seek to illustrate the Bellman equation using our simple example of a car.

Example 3.5 Let us now return to our example of a car and work through the Bellman equation with the discount parameter $\gamma = 0.5$ for the policy that performs the action $a_0$ in each of the three states, that is, the policy under which the accelerator is never pressed.
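Because this policy always chooses $a_0$, only one action contributes in each successor state. As a sketch, assuming (3.6) has the usual action-value form of the Bellman equation, the equation for each state $s$ then reduces to a sum over the successor states $s'$ alone:

$$
q^\pi(s, a_0) = \sum_{s'} p^{a_0}_{ss'} \left[ r^{a_0}_{ss'} + \gamma\, q^\pi(s', a_0) \right], \qquad \gamma = 0.5 .
$$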

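From (3.6), we then obtain for the first state $s_1$:

$$
\begin{aligned}
q^\pi(s_1, a_0) &= p^{a_0}_{s_1 s_1}\left[ r^{a_0}_{s_1 s_1} + 0.5\, q^\pi(s_1, a_0) \right] + p^{a_0}_{s_1 s_2}\left[ r^{a_0}_{s_1 s_2} + 0.5\, q^\pi(s_2, a_0) \right] \\
&= 0.7\left[ 80 + 0.5\, q^\pi(s_1, a_0) \right] + 0.3\left[ 90 + 0.5\, q^\pi(s_2, a_0) \right]
\end{aligned}
$$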

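Similarly, we obtain for the second state $s_2$:

$$
\begin{aligned}
q^\pi(s_2, a_0) &= p^{a_0}_{s_2 s_1}\left[ r^{a_0}_{s_2 s_1} + 0.5\, q^\pi(s_1, a_0) \right] + p^{a_0}_{s_2 s_2}\left[ r^{a_0}_{s_2 s_2} + 0.5\, q^\pi(s_2, a_0) \right] + p^{a_0}_{s_2 s_3}\left[ r^{a_0}_{s_2 s_3} + 0.5\, q^\pi(s_3, a_0) \right] \\
&= 0.75\left[ 80 + 0.5\, q^\pi(s_1, a_0) \right] + 0.2\left[ 90 + 0.5\, q^\pi(s_2, a_0) \right] + 0.05\left[ 100 + 0.5\, q^\pi(s_3, a_0) \right]
\end{aligned}
$$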

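and for $s_3$:

$$
\begin{aligned}
q^\pi(s_3, a_0) &= p^{a_0}_{s_3 s_2}\left[ r^{a_0}_{s_3 s_2} + 0.5\, q^\pi(s_2, a_0) \right] + p^{a_0}_{s_3 s_3}\left[ r^{a_0}_{s_3 s_3} + 0.5\, q^\pi(s_3, a_0) \right] \\
&= 0.9\left[ 90 + 0.5\, q^\pi(s_2, a_0) \right] + 0.1\left[ 100 + 0.5\, q^\pi(s_3, a_0) \right]
\end{aligned}
$$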
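Moving all unknowns to the left-hand side puts the three equations into standard linear form. The following is a sketch that writes $q_i$ as shorthand for $q^\pi(s_i, a_0)$ (a notation introduced here only for brevity) and assumes the transition probabilities 0.7 for $s_1 \to s_1$ and 0.05 for $s_2 \to s_3$, which follow from the probabilities out of each state summing to one:

$$
\begin{aligned}
0.65\, q_1 - 0.15\, q_2 &= 83, \\
-0.375\, q_1 + 0.9\, q_2 - 0.025\, q_3 &= 83, \\
-0.45\, q_2 + 0.95\, q_3 &= 91 .
\end{aligned}
$$

The right-hand sides are the expected immediate rewards, for example $0.7 \cdot 80 + 0.3 \cdot 90 = 83$ for $s_1$.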

We thus have a system of three equations with three unknowns, the action values. Its solution yields

$$
q^\pi(s_1, a_0) \approx 166, \qquad q^\pi(s_2, a_0) \approx 167, \qquad q^\pi(s_3, a_0) \approx 174 .
$$
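The system can also be solved numerically. The following is a minimal sketch in plain Python, not the authors' implementation. The names `P`, `R`, `build_system`, `solve_linear` and `evaluate_iteratively` are hypothetical, and the probabilities 0.7 ($s_1 \to s_1$) and 0.05 ($s_2 \to s_3$) are assumptions inferred from the probabilities out of each state summing to one. It solves the linear system directly and, as a cross-check, applies the Bellman equation repeatedly as a fixed-point iteration.

```python
GAMMA = 0.5  # discount parameter from Example 3.5

# P[i][j]: probability of moving from state s_(i+1) to s_(j+1) under action a0.
# Assumption: 0.70 and 0.05 are inferred so that each row sums to one.
P = [
    [0.70, 0.30, 0.00],
    [0.75, 0.20, 0.05],
    [0.00, 0.90, 0.10],
]

# R[i][j]: reward for that transition (0.0 where the transition cannot occur).
R = [
    [80.0, 90.0, 0.0],
    [80.0, 90.0, 100.0],
    [0.0, 90.0, 100.0],
]


def build_system(P, R, gamma):
    """Return (A, b) with A = I - gamma * P and b[i] = sum_j P[i][j] * R[i][j]."""
    n = len(P)
    A = [[(1.0 if i == j else 0.0) - gamma * P[i][j] for j in range(n)]
         for i in range(n)]
    b = [sum(P[i][j] * R[i][j] for j in range(n)) for i in range(n)]
    return A, b


def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def evaluate_iteratively(P, R, gamma, sweeps=100):
    """Apply the Bellman equation repeatedly, starting from q = 0."""
    n = len(P)
    q = [0.0] * n
    for _ in range(sweeps):
        q = [sum(P[i][j] * (R[i][j] + gamma * q[j]) for j in range(n))
             for i in range(n)]
    return q


A, b = build_system(P, R, GAMMA)
q_direct = solve_linear(A, b)
q_iter = evaluate_iteratively(P, R, GAMMA)

for state, value in zip(("s1", "s2", "s3"), q_direct):
    print(f"q({state}, a0) = {value:.2f}")
```

Under these assumed coefficients both methods agree, and the computed action values lie within about one unit of the rounded values quoted above. With NumPy available, `numpy.linalg.solve` could replace the hand-written elimination.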

So far, this is sensible: since, with no acceleration, the states $s_1$ and $s_2$ almost always lead to the state $s_1$, they also obtain largely the same expected return. Without acceleration, the state $s_3$ almost always leads to the state $s_2$ and therefore has a higher expected return. The fact that $q^\pi(s_2, a_0)$ is somewhat higher than $q^\pi(s_1, a_0)$ is due to
