The action “no acceleration” generally leads to reduced speed; however, on level
or downhill stretches, it can lead to constant or even increased speed. For instance,
for $s_2$, we can specify

$$p^{a_0}_{s_2 s_1} = 0.75, \qquad p^{a_0}_{s_2 s_2} = 0.2, \qquad p^{a_0}_{s_2 s_3} = 0.05.$$
So if we drive at 90 km/h and do not accelerate, the probability that the speed
will reduce to 80 km/h is 75 %, that it will remain at 90 km/h is 20 %, and that it
will increase to 100 km/h is 5 %. Remember that in accordance with (3.2), the
probabilities must add up to 100 %. Similarly, for the remaining states $s_1$ and $s_3$, we
can define

$$p^{a_0}_{s_1 s_1} = 0.7, \qquad p^{a_0}_{s_1 s_2} = 0.3, \qquad p^{a_0}_{s_3 s_2} = 0.9, \qquad p^{a_0}_{s_3 s_3} = 0.1.$$
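To see the structure at a glance, we can gather these values for the action "no acceleration" ($a_0$) in a small transition table. The following Python/NumPy sketch is purely illustrative (the array and index names are ours, not part of the text); it also checks that every row sums to 1, as demanded by (3.2):

import numpy as np

# States: s1 = 80 km/h, s2 = 90 km/h, s3 = 100 km/h (row/column indices 0, 1, 2).
# P_a0[i, j] = probability of moving from state i to state j under "no acceleration".
P_a0 = np.array([
    [0.70, 0.30, 0.00],   # from s1
    [0.75, 0.20, 0.05],   # from s2
    [0.00, 0.90, 0.10],   # from s3
])

# In accordance with (3.2), every row must sum to 1.
assert np.allclose(P_a0.sum(axis=1), 1.0)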
The action “acceleration” of course has precisely the inverse effect. We start
once again with the specification for
$s_2$:

$$p^{a_1}_{s_2 s_1} = 0.1, \qquad p^{a_1}_{s_2 s_2} = 0.2, \qquad p^{a_1}_{s_2 s_3} = 0.7.$$
So if we drive at 90 km/h and accelerate, the probability that the speed will
increase to 100 km/h is 70 %, that it will remain at 90 km/h is 20 %, and that it
will decrease to 80 km/h is 10 %. Similarly, for the remaining states
$s_1$ and $s_3$, we
can define

$$p^{a_1}_{s_1 s_1} = 0.3, \qquad p^{a_1}_{s_1 s_2} = 0.7, \qquad p^{a_1}_{s_3 s_2} = 0.1, \qquad p^{a_1}_{s_3 s_3} = 0.9.$$
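Adding the values for "acceleration" ($a_1$) gives the complete transition model of our small driving environment. Again a purely illustrative sketch with our own names, not code from the text; as a usage example, one successor state is sampled from the model:

import numpy as np

# P[a][s, s'] for both actions; rows and columns are ordered s1, s2, s3.
P = {
    "a0": np.array([[0.70, 0.30, 0.00],    # no acceleration
                    [0.75, 0.20, 0.05],
                    [0.00, 0.90, 0.10]]),
    "a1": np.array([[0.30, 0.70, 0.00],    # acceleration
                    [0.10, 0.20, 0.70],
                    [0.00, 0.10, 0.90]]),
}

rng = np.random.default_rng(0)
state = 1                                     # start in s2 (90 km/h)
next_state = rng.choice(3, p=P["a1"][state])  # sample the successor under "acceleration"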
In so doing, we have adequately described our environment.
■
3.5 The Bellman Equation
We first define an MDP as a quadruple $M := (S, A, P, R)$ of the state and action
spaces $S$ and $A$, the transition probabilities $P$, and the rewards $R$. Please note that the
Markov property need not be explicitly stipulated to hold, since it implicitly follows
from the given representations of $P$ and $R$.
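Purely as an illustration (the class and field names are our own assumptions, not notation from the text), such a quadruple can be mirrored one-to-one in code:

from dataclasses import dataclass
from typing import Dict, List
import numpy as np

@dataclass
class MDP:
    """M := (S, A, P, R): state space, action space, transition probabilities, rewards."""
    states: List[str]          # S, e.g. ["s1", "s2", "s3"]
    actions: List[str]         # A, e.g. ["a0", "a1"]
    P: Dict[str, np.ndarray]   # P[a][s, s'] = transition probability under action a
    R: Dict[str, np.ndarray]   # R[a][s, s'] = reward for the transition (assumed form)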
Each policy $\pi(s, a)$ induces a Markov chain (MC), which is characterized by the
tuple $M_\pi := (S, P_\pi)$, where $P_\pi = \left(p^{\pi}_{s s'}\right)_{s, s' \in S}$ denotes the transition probabilities
that result from following the policy $\pi(s, a)$:
$$p^{\pi}_{s s'} = \sum_{a \in A(s)} \pi(s, a)\, p^{a}_{s s'} \qquad (3.3)$$
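Equation (3.3) is nothing more than an action-weighted average of the action-dependent transition probabilities. A small illustrative sketch (function and variable names are ours) computes the induced matrix $P_\pi$ for a stochastic policy in our driving example:

import numpy as np

def induced_transition_matrix(P_a, pi, actions):
    """Compute P_pi[s, s'] = sum_a pi[s, a] * P_a[a][s, s']  (cf. Eq. 3.3)."""
    P_pi = np.zeros_like(next(iter(P_a.values())))
    for k, a in enumerate(actions):
        P_pi += pi[:, [k]] * P_a[a]   # weight each row of P_a[a] by pi(s, a)
    return P_pi

# Example policy: always accelerate in s1, 50/50 in s2, never accelerate in s3.
actions = ["a0", "a1"]
pi = np.array([[0.0, 1.0],
               [0.5, 0.5],
               [1.0, 0.0]])
P_a = {
    "a0": np.array([[0.70, 0.30, 0.00],
                    [0.75, 0.20, 0.05],
                    [0.00, 0.90, 0.10]]),
    "a1": np.array([[0.30, 0.70, 0.00],
                    [0.10, 0.20, 0.70],
                    [0.00, 0.10, 0.90]]),
}
P_pi = induced_transition_matrix(P_a, pi, actions)
assert np.allclose(P_pi.sum(axis=1), 1.0)   # rows of the induced Markov chain sum to 1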