The action “no acceleration” generally leads to reduced speed; however, on level or downhill stretches, it can lead to constant or even increased speed. For instance, for $s_2$, we can specify

$$p^{a_0}_{s_2 s_1} = 0.75, \quad p^{a_0}_{s_2 s_2} = 0.2, \quad p^{a_0}_{s_2 s_3} = 0.05.$$
So if we drive at 90 km/h and do not accelerate, the probability that the speed will reduce to 80 km/h is 75 %, that it will remain at 90 km/h is 20 %, and that it will increase to 100 km/h is 5 %. Remember that in accordance with (3.2), the probabilities must add up to 100 %. Similarly, for the remaining states $s_1$ and $s_3$, we can define
$$p^{a_0}_{s_1 s_1} = 0.7, \quad p^{a_0}_{s_1 s_2} = 0.3, \qquad p^{a_0}_{s_3 s_2} = 0.9, \quad p^{a_0}_{s_3 s_3} = 0.1.$$
The action “acceleration” of course has precisely the inverse effect. We start once again with the specification for $s_2$:
$$p^{a_1}_{s_2 s_1} = 0.1, \quad p^{a_1}_{s_2 s_2} = 0.2, \quad p^{a_1}_{s_2 s_3} = 0.7.$$
So if we drive at 90 km/h and accelerate, the probability that the speed will increase to 100 km/h is 70 %, that it will remain at 90 km/h is 20 %, and that it will decrease to 80 km/h is 10 %. Similarly, for the remaining states $s_1$ and $s_3$, we can define
$$p^{a_1}_{s_1 s_1} = 0.3, \quad p^{a_1}_{s_1 s_2} = 0.7, \qquad p^{a_1}_{s_3 s_2} = 0.1, \quad p^{a_1}_{s_3 s_3} = 0.9.$$
In so doing, we have adequately described our environment.
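For concreteness, the complete transition model can be collected into a nested lookup table and checked against the normalization condition (3.2). The following Python sketch is only an illustration: the dictionary layout and the string labels s1, s2, s3, a0, a1 are our own encoding of the states and actions introduced above, not notation prescribed by the text.

```python
# Illustrative encoding of the transition model described above.
# P[action][state] maps each successor state to p^a_{s s'};
# the labels s1/s2/s3 (80/90/100 km/h) and a0/a1 are assumed names.
P = {
    "a0": {  # no acceleration
        "s1": {"s1": 0.7, "s2": 0.3},
        "s2": {"s1": 0.75, "s2": 0.2, "s3": 0.05},
        "s3": {"s2": 0.9, "s3": 0.1},
    },
    "a1": {  # acceleration
        "s1": {"s1": 0.3, "s2": 0.7},
        "s2": {"s1": 0.1, "s2": 0.2, "s3": 0.7},
        "s3": {"s2": 0.1, "s3": 0.9},
    },
}

# In accordance with (3.2), the outgoing probabilities of every
# state-action pair must add up to 1.
for action, table in P.items():
    for state, successors in table.items():
        assert abs(sum(successors.values()) - 1.0) < 1e-9, (action, state)
```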
3.5 The Bellman Equation
We first define an MDP as a quadruplet $M := (S, A, P, R)$ of the state and action spaces $S$ and $A$, the transition probabilities $P$, and the rewards $R$. Please note that the Markov property need not be explicitly stipulated to hold, since it implicitly follows from the given representations of $P$ and $R$.
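This quadruplet can be mirrored directly as a small data structure. The sketch below is only an illustration of the definition: the class and field names, and the choice to index the rewards in the same way as the transition probabilities, are our own assumptions rather than notation from the text.

```python
from dataclasses import dataclass
from typing import Dict, List

State = str
Action = str

@dataclass
class MDP:
    """Markov decision process M := (S, A, P, R); field names are illustrative."""
    states: List[State]                                          # S
    actions: Dict[State, List[Action]]                           # A(s): actions available in s
    transitions: Dict[Action, Dict[State, Dict[State, float]]]   # P: p^a_{s s'}
    rewards: Dict[Action, Dict[State, Dict[State, float]]]       # R, indexed like P here (an assumption)
```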
Each policy $\pi(s, a)$ induces a Markov chain (MC), which is characterized by the tuple $M_\pi := (S, P_\pi)$, where $P_\pi = (p^\pi_{s, s'})_{s, s' \in S}$ denotes the transition probabilities that result from following the policy $\pi(s, a)$:

$$p^\pi_{s s'} = \sum_{a \in A(s)} \pi(s, a) \, p^a_{s s'}. \qquad (3.3)$$
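Given a stochastic policy stored as probabilities π(s, a), equation (3.3) can be applied directly to obtain the induced transition probabilities. The sketch below is an assumption-laden illustration: it reuses the dictionary layout of P from the earlier sketch and a hypothetical pi[s][a] mapping for the policy.

```python
def induced_transition_probabilities(P, pi):
    """Compute p^pi_{s s'} = sum over a in A(s) of pi(s, a) * p^a_{s s'}, per (3.3).

    P  : P[a][s][s']  -> transition probability under action a
    pi : pi[s][a]     -> probability of choosing action a in state s
    """
    P_pi = {}
    for s, action_probs in pi.items():
        P_pi[s] = {}
        for a, prob_a in action_probs.items():
            for s_next, p in P[a].get(s, {}).items():
                P_pi[s][s_next] = P_pi[s].get(s_next, 0.0) + prob_a * p
    return P_pi

# Example: a policy that accelerates with probability 0.5 in every state.
pi = {s: {"a0": 0.5, "a1": 0.5} for s in ("s1", "s2", "s3")}
# P_pi = induced_transition_probabilities(P, pi)  # each row again sums to 1
```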