This is an even clearer form, since the state-value function $v^\pi(s)$ depends only on
the state $s$, unlike the action-value function $q^\pi(s, a)$, which additionally depends on
the action $a$. We will, however, mainly work with the action-value function, since we
need it for the model-free case, which is of practical importance (and which we have
yet to discuss). There it cannot be converted directly into the state-value
function, since in the model-free case $p_{ss'}$ and $r_{ss'}$ are not explicitly known.
After so many abstract explanations, we shall seek to illustrate the Bellman
equation using our simple example of a car.
Example 3.5 Let us now return to our example of a car and evaluate the Bellman
equation with the discount parameter $\gamma = 0.5$ for the policy that performs
the action $a_0$ in each of the three states, that is, the policy under which the
accelerator is never pressed.
From (3.6), we then obtain for the first state $s_1$:

$$
\begin{aligned}
q^\pi(s_1, a_0) &= p^{a_0}_{s_1 s_1}\left[ r^{a_0}_{s_1 s_1} + 0.5\, q^\pi(s_1, a_0) \right] + p^{a_0}_{s_1 s_2}\left[ r^{a_0}_{s_1 s_2} + 0.5\, q^\pi(s_2, a_0) \right] \\
&= 0.7 \left[ 80 + 0.5\, q^\pi(s_1, a_0) \right] + 0.3 \left[ 90 + 0.5\, q^\pi(s_2, a_0) \right].
\end{aligned}
$$
Similarly, we obtain for the second state $s_2$:

$$
\begin{aligned}
q^\pi(s_2, a_0) &= p^{a_0}_{s_2 s_1}\left[ r^{a_0}_{s_2 s_1} + 0.5\, q^\pi(s_1, a_0) \right] + p^{a_0}_{s_2 s_2}\left[ r^{a_0}_{s_2 s_2} + 0.5\, q^\pi(s_2, a_0) \right] \\
&\quad + p^{a_0}_{s_2 s_3}\left[ r^{a_0}_{s_2 s_3} + 0.5\, q^\pi(s_3, a_0) \right] \\
&= 0.75 \left[ 80 + 0.5\, q^\pi(s_1, a_0) \right] + 0.2 \left[ 90 + 0.5\, q^\pi(s_2, a_0) \right] + 0.05 \left[ 100 + 0.5\, q^\pi(s_3, a_0) \right]
\end{aligned}
$$
and for $s_3$:

$$
\begin{aligned}
q^\pi(s_3, a_0) &= p^{a_0}_{s_3 s_2}\left[ r^{a_0}_{s_3 s_2} + 0.5\, q^\pi(s_2, a_0) \right] + p^{a_0}_{s_3 s_3}\left[ r^{a_0}_{s_3 s_3} + 0.5\, q^\pi(s_3, a_0) \right] \\
&= 0.9 \left[ 90 + 0.5\, q^\pi(s_2, a_0) \right] + 0.1 \left[ 100 + 0.5\, q^\pi(s_3, a_0) \right].
\end{aligned}
$$
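Multiplying out the brackets and collecting the unknowns on the left-hand side may make the linear structure easier to see. With $q_i$ as a shorthand for $q^\pi(s_i, a_0)$ (a notation introduced here for brevity, not taken from the text), the three equations for $s_1$, $s_2$ and $s_3$ can be rearranged as:

$$
\begin{aligned}
0.65\, q_1 - 0.15\, q_2 &= 83, \\
-0.375\, q_1 + 0.9\, q_2 - 0.025\, q_3 &= 83, \\
-0.45\, q_2 + 0.95\, q_3 &= 91.
\end{aligned}
$$

The right-hand sides are the expected immediate rewards, for instance $0.7 \cdot 80 + 0.3 \cdot 90 = 83$ for $s_1$.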
We thus have a system of three equations with three unknowns, the action
values. Its solution yields
$$
q^\pi(s_1, a_0) \approx 166, \quad q^\pi(s_2, a_0) \approx 167, \quad q^\pi(s_3, a_0) \approx 174.
$$
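The system can also be checked numerically. The following is a minimal sketch, not the authors' procedure: it assumes the transition probabilities and rewards exactly as they appear in the three equations of this example, and the names `P`, `R`, `GAMMA`, `solve_linear` and `evaluate_policy` are hypothetical helpers introduced for illustration. It builds the system $(I - \gamma P)\, q = \bar{r}$ and solves it with a small Gauss-Jordan elimination in plain Python.

```python
# Transition probabilities and rewards under action a0, read off the three
# equations of Example 3.5 (rows: current state, columns: next state).
# Entries for transitions that do not occur are set to 0 (an assumption).
P = [
    [0.70, 0.30, 0.00],
    [0.75, 0.20, 0.05],
    [0.00, 0.90, 0.10],
]
R = [
    [80.0, 90.0, 0.0],
    [80.0, 90.0, 100.0],
    [0.0, 90.0, 100.0],
]
GAMMA = 0.5  # discount parameter used in the example


def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs stay untouched.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        # Pick the row with the largest entry in this column as pivot row.
        pivot = max(range(col, n), key=lambda k: abs(m[k][col]))
        m[col], m[pivot] = m[pivot], m[col]
        # Eliminate this column from every other row.
        for k in range(n):
            if k != col:
                factor = m[k][col] / m[col][col]
                m[k] = [x - factor * y for x, y in zip(m[k], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def evaluate_policy(p, r, gamma):
    """Action values of the fixed policy 'always a0' from (I - gamma P) q = r_bar."""
    n = len(p)
    a = [[(1.0 if i == j else 0.0) - gamma * p[i][j] for j in range(n)]
         for i in range(n)]
    # Expected immediate reward per state: sum over s' of p_ss' * r_ss'.
    r_bar = [sum(p[i][j] * r[i][j] for j in range(n)) for i in range(n)]
    return solve_linear(a, r_bar)


q = evaluate_policy(P, R, GAMMA)
print([round(v, 2) for v in q])
```

Under these assumptions the computed action values come out at roughly 166.1, 166.3 and 174.5, which is within about one unit of the rounded values quoted in the text and reproduces the ordering $q^\pi(s_1, a_0) < q^\pi(s_2, a_0) < q^\pi(s_3, a_0)$.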
So far, this is sensible: since, with no acceleration, the states $s_1$ and $s_2$ almost
always lead to the state $s_1$, they also obtain largely the same expected return. Without
acceleration, the state $s_3$ almost always leads to the state $s_2$ and therefore has a higher
expected return. The fact that $q^\pi(s_2, a_0)$ is somewhat higher than $q^\pi(s_1, a_0)$ is due to