make what we think are the best moves. From time to time, however, we try out a
new move in a known position - even Kasparov does that. In doing so, we also
solve the problem of self-reinforcing recommendations (Chap. 2, Problem 2) suffered by conventional recommendation engines.
3.4 Model of the Environment
Before finally addressing the Bellman equation, we still need a model of the environment, which is given by the transition probabilities and rewards. Let $p^a_{ss'}$ be the transition probability from state $s$ into state $s'$ as a result of the action $a$, and $r^a_{ss'}$ the corresponding transition reward. Put another way, if in state $s$ the action $a$ is performed, $p^a_{ss'}$ gives the probability of passing into state $s'$, and $r^a_{ss'}$ the reward obtained as a result of the transition to $s'$. In particular,

$$\sum_{s'} p^a_{ss'} = 1, \qquad (3.2)$$

that is, when performing the action $a$ in state $s$, the sum of the transition probabilities over all possible subsequent states $s'$ equals 1, because we must of necessity pass into one of those states.
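To make the notation concrete, the model of the environment can be stored as a nested mapping from state and action to a probability distribution over successor states, and the normalization condition (3.2) can then be checked directly. The following is only a sketch; the state names, action names, and probability values are invented for illustration and do not come from the text.

```python
# A minimal sketch of an environment model as nested dictionaries:
# P[s][a][s_next] is the transition probability p_{ss'}^a.
# All names and values below are illustrative assumptions.

P = {
    "s1": {"a0": {"s1": 0.7, "s2": 0.3},
           "a1": {"s1": 0.2, "s2": 0.8}},
    "s2": {"a0": {"s1": 0.4, "s2": 0.6},
           "a1": {"s2": 1.0}},
}

def check_normalization(P, tol=1e-9):
    """Verify Eq. (3.2): for every state s and action a, the transition
    probabilities over all successor states s' sum to 1."""
    for s, actions in P.items():
        for a, successors in actions.items():
            total = sum(successors.values())
            assert abs(total - 1.0) < tol, f"p_{{{s}s'}}^{a} sums to {total}, not 1"

check_normalization(P)  # raises AssertionError if the model is inconsistent
```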
Example 3.4 Let us consider a fictional car which is being driven along a mountainous road, mostly uphill. Let the states be the speeds $s_1 = 80\,\mathrm{km/h}$, $s_2 = 90\,\mathrm{km/h}$, $s_3 = 100\,\mathrm{km/h}$, the actions $a_0$ = no accelerator, $a_1$ = accelerator, and the rewards always the speed values in the subsequent state, that is, we want to get to our goal as quickly as possible (Fig. 3.2).

This gives us for the rewards:

$$r^a_{s s_i} = v_i,$$

that is, the value of the speed $v_i$ in the subsequent state $s_i$, independent of the state $s$ and the action $a$. If, for instance, the driver in state $s_2$ presses the accelerator, that is, action $a_1$, and passes into state $s_3$, then the reward is $r^{a_1}_{s_2 s_3} = v_3 = 100$.
Fig. 3.2 A car with three speed states and two control actions
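The car example can be written in the same dictionary form as above. The rewards are exactly those of the example ($r^a_{ss_i} = v_i$, the speed reached in the successor state); the transition probabilities, however, are illustrative assumptions, since the text fixes only the structure of Fig. 3.2 and not the numerical values.

```python
# The three speed states of Example 3.4 (values in km/h) and the two actions.
STATES = {"s1": 80, "s2": 90, "s3": 100}
ACTIONS = ["a0", "a1"]  # a0 = no accelerator, a1 = accelerator

# Illustrative transition probabilities (assumed values, not given in the text):
# on the mostly uphill road, accelerating tends to move the car up one speed
# state, while not accelerating tends to let it drop back down.
P = {
    "s1": {"a0": {"s1": 1.0},
           "a1": {"s1": 0.3, "s2": 0.7}},
    "s2": {"a0": {"s1": 0.6, "s2": 0.4},
           "a1": {"s2": 0.3, "s3": 0.7}},
    "s3": {"a0": {"s2": 0.6, "s3": 0.4},
           "a1": {"s3": 1.0}},
}

def reward(s, a, s_next):
    """Reward as defined in the example: r_{s s_i}^a = v_i, the speed of the
    successor state, independent of the current state s and the action a."""
    return STATES[s_next]

# Pressing the accelerator in s2 and passing into s3 yields reward v_3 = 100.
print(reward("s2", "a1", "s3"))   # 100
print(P["s2"]["a1"]["s3"])        # assumed probability of that transition
```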
 