This is an even clearer form, since the state-value function $v^\pi(s)$ depends only on the state $s$, unlike the action-value function $q^\pi(s, a)$, which additionally depends on the action $a$.
We will, however, mainly work with the action-value function, since we need it for the model-free case, which is of practical importance and which we have yet to come to. There it cannot be converted directly into the state-value function, since in the model-free case $p_{ss'}$ and $r_{ss'}$ are not explicitly known.
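For orientation, the usual relations between the two functions can be sketched as follows. This is the standard textbook form, not a formula quoted from this passage: the policy probabilities $\pi(s, a)$ are a symbol assumed here, and the action dependence of the transition probabilities and rewards is written out explicitly as $p^a_{ss'}$ and $r^a_{ss'}$.

$$
v^\pi(s) = \sum_a \pi(s, a)\, q^\pi(s, a), \qquad
q^\pi(s, a) = \sum_{s'} p^a_{ss'} \left[ r^a_{ss'} + \gamma\, v^\pi(s') \right].
$$

Only the second relation involves the model quantities, which is why the state values alone do not suffice to compare actions when those quantities are unknown.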
After so many abstract explanations, we shall seek to illustrate the Bellman
equation using our simple example of a car.
Example 3.5
Let us now return to our example of a car and work through the Bellman equation with the discount parameter $\gamma = 0.5$ for the policy which performs the action $a_0$ in each of the three states, that is, the one where the accelerator is never pressed.
From (3.6), we then obtain for the first state $s_1$:

$$
\begin{aligned}
q^\pi(s_1, a_0) &= p^{a_0}_{s_1 s_1} \left[ r^{a_0}_{s_1 s_1} + 0.5\, q^\pi(s_1, a_0) \right] + p^{a_0}_{s_1 s_2} \left[ r^{a_0}_{s_1 s_2} + 0.5\, q^\pi(s_2, a_0) \right] \\
&= 0.7 \left[ 80 + 0.5\, q^\pi(s_1, a_0) \right] + 0.3 \left[ 90 + 0.5\, q^\pi(s_2, a_0) \right].
\end{aligned}
$$
Similarly, we obtain for the second state $s_2$:

$$
\begin{aligned}
q^\pi(s_2, a_0) &= p^{a_0}_{s_2 s_1} \left[ r^{a_0}_{s_2 s_1} + 0.5\, q^\pi(s_1, a_0) \right] + p^{a_0}_{s_2 s_2} \left[ r^{a_0}_{s_2 s_2} + 0.5\, q^\pi(s_2, a_0) \right] + p^{a_0}_{s_2 s_3} \left[ r^{a_0}_{s_2 s_3} + 0.5\, q^\pi(s_3, a_0) \right] \\
&= 0.75 \left[ 80 + 0.5\, q^\pi(s_1, a_0) \right] + 0.2 \left[ 90 + 0.5\, q^\pi(s_2, a_0) \right] + 0.05 \left[ 100 + 0.5\, q^\pi(s_3, a_0) \right]
\end{aligned}
$$
and for $s_3$:

$$
\begin{aligned}
q^\pi(s_3, a_0) &= p^{a_0}_{s_3 s_2} \left[ r^{a_0}_{s_3 s_2} + 0.5\, q^\pi(s_2, a_0) \right] + p^{a_0}_{s_3 s_3} \left[ r^{a_0}_{s_3 s_3} + 0.5\, q^\pi(s_3, a_0) \right] \\
&= 0.9 \left[ 90 + 0.5\, q^\pi(s_2, a_0) \right] + 0.1 \left[ 100 + 0.5\, q^\pi(s_3, a_0) \right].
\end{aligned}
$$
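Collecting the unknowns on the left-hand side makes the linear structure explicit. The following is a sketch of that rearrangement, using the shorthand $q_i = q^\pi(s_i, a_0)$ (an abbreviation introduced here, not in the text) and the coefficients that appear in the three equations for $s_1$, $s_2$ and $s_3$:

$$
\begin{aligned}
0.65\, q_1 - 0.15\, q_2 &= 83, \\
-0.375\, q_1 + 0.9\, q_2 - 0.025\, q_3 &= 83, \\
-0.45\, q_2 + 0.95\, q_3 &= 91.
\end{aligned}
$$

In the first row, for instance, $83 = 0.7 \cdot 80 + 0.3 \cdot 90$ is the expected immediate reward in $s_1$, and $0.65 = 1 - 0.5 \cdot 0.7$ is what remains of the $q_1$ coefficient after the discounted self-transition has been moved to the left.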
We thus have a system of three equations with three unknowns, the action values. Its solution yields

$$
q^\pi(s_1, a_0) \approx 166, \qquad q^\pi(s_2, a_0) \approx 167, \qquad q^\pi(s_3, a_0) \approx 174.
$$
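The system can also be checked numerically. The snippet below is a minimal sketch, assuming NumPy is available; the names `P`, `R`, `GAMMA` and `solve_action_values` are chosen here for illustration, and the numbers are the probabilities and rewards that appear in the three equations of the example. Rewards for transitions that do not occur under $a_0$ are set to zero as placeholders, since they are multiplied by a zero probability.

```python
import numpy as np

GAMMA = 0.5  # discount parameter from Example 3.5

# Transition probabilities under a_0, as they appear in the three equations
# (row = current state s_1..s_3, column = next state s_1..s_3).
P = np.array([
    [0.70, 0.30, 0.00],
    [0.75, 0.20, 0.05],
    [0.00, 0.90, 0.10],
])

# Transition rewards from the same equations; entries for transitions with
# probability zero are placeholders and do not affect the result.
R = np.array([
    [80.0, 90.0, 0.0],
    [80.0, 90.0, 100.0],
    [0.0, 90.0, 100.0],
])


def solve_action_values(P, R, gamma):
    """Solve q[s] = sum over s' of P[s, s'] * (R[s, s'] + gamma * q[s'])."""
    expected_reward = (P * R).sum(axis=1)   # expected immediate reward per state
    A = np.eye(P.shape[0]) - gamma * P      # (I - gamma * P) q = expected_reward
    return np.linalg.solve(A, expected_reward)


q = solve_action_values(P, R, GAMMA)
for i, value in enumerate(q, start=1):
    print(f"q(s_{i}, a_0) = {value:.2f}")
```

Under these assumptions the solve gives roughly 166.1, 166.3 and 174.5, which agrees with the rounded values quoted in the text to within about one unit and preserves the ordering of the three states.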
So far, this is sensible: since, with no acceleration, the states $s_1$ and $s_2$ almost always lead to the state $s_1$, they also obtain largely the same expected return. Without acceleration, the state $s_3$ almost always leads to the state $s_2$ and therefore has a higher expected return. The fact that $q^\pi(s_2, a_0)$ is somewhat higher than $q^\pi(s_1, a_0)$ is due to