make what we think are the best moves. From time to time, however, we try out a
new move in a known position - even Kasparov does that. In doing so, we also
solve the problem of self-reinforcing recommendations (Chap. 2, Problem 2) suffered by conventional recommendation engines.
3.4 Model of the Environment
Before finally addressing the Bellman equation, we still need a model of the environment, which is given by the transition probabilities and rewards. Let $p^a_{ss'}$ be the transition probability from state $s$ into state $s'$ as a result of the action $a$, and $r^a_{ss'}$ the corresponding transition reward. Put another way, if in state $s$ the action $a$ is performed, $p^a_{ss'}$ gives the probability of passing into state $s'$, and $r^a_{ss'}$ the reward obtained as a result of the transition to $s'$. In particular,

$$\sum_{s'} p^a_{ss'} = 1, \qquad (3.2)$$

that is, when performing the action $a$ in state $s$, the sum of the transition probabilities over all possible subsequent states $s'$ equals 1, because we must of necessity pass into one of those states.
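To make the notation concrete, the model of the environment can be stored as a nested mapping from state and action to a probability distribution over successor states, and the normalization condition (3.2) can then be checked directly. The following is only a sketch; the state names, action names, and probability values are invented for illustration and do not come from the text.

```python
# A minimal sketch of an environment model as nested dictionaries:
# P[s][a][s_next] is the transition probability p_{ss'}^a.
# All names and values below are illustrative assumptions.

P = {
    "s1": {"a0": {"s1": 0.7, "s2": 0.3},
           "a1": {"s1": 0.2, "s2": 0.8}},
    "s2": {"a0": {"s1": 0.4, "s2": 0.6},
           "a1": {"s2": 1.0}},
}

def check_normalization(P, tol=1e-9):
    """Verify Eq. (3.2): for every state s and action a, the transition
    probabilities over all successor states s' sum to 1."""
    for s, actions in P.items():
        for a, successors in actions.items():
            total = sum(successors.values())
            assert abs(total - 1.0) < tol, f"p_{{{s}s'}}^{a} sums to {total}, not 1"

check_normalization(P)  # raises AssertionError if the model is inconsistent
```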
Example 3.4 Let us consider a fictional car which is being driven along a mountainous road, mostly uphill. Let the states be the speeds $s_1 = 80\,\mathrm{km/h}$, $s_2 = 90\,\mathrm{km/h}$, $s_3 = 100\,\mathrm{km/h}$, the actions $a_0$ = no accelerator, $a_1$ = accelerator, and the rewards always the speed values in the subsequent state, that is, we want to get to our goal as quickly as possible (Fig. 3.2).

This gives us for the rewards:

$$r^a_{s s_i} = v_i,$$

that is, the value of the speed $v_i$ in the subsequent state $s_i$, independent of the state $s$ and the action $a$. If, for instance, the driver in state $s_2$ presses the accelerator, that is, action $a_1$, and passes into state $s_3$, then the reward is $r^{a_1}_{s_2 s_3} = v_3 = 100$.
Fig. 3.2 A car with three speed states and two control actions
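The car example can be written in the same dictionary form as above. The rewards are exactly those of the example ($r^a_{ss_i} = v_i$, the speed reached in the successor state); the transition probabilities, however, are illustrative assumptions, since the text fixes only the structure of Fig. 3.2 and not the numerical values.

```python
# The three speed states of Example 3.4 (values in km/h) and the two actions.
STATES = {"s1": 80, "s2": 90, "s3": 100}
ACTIONS = ["a0", "a1"]  # a0 = no accelerator, a1 = accelerator

# Illustrative transition probabilities (assumed values, not given in the text):
# on the mostly uphill road, accelerating tends to move the car up one speed
# state, while not accelerating tends to let it drop back down.
P = {
    "s1": {"a0": {"s1": 1.0},
           "a1": {"s1": 0.3, "s2": 0.7}},
    "s2": {"a0": {"s1": 0.6, "s2": 0.4},
           "a1": {"s2": 0.3, "s3": 0.7}},
    "s3": {"a0": {"s2": 0.6, "s3": 0.4},
           "a1": {"s3": 1.0}},
}

def reward(s, a, s_next):
    """Reward as defined in the example: r_{s s_i}^a = v_i, the speed of the
    successor state, independent of the current state s and the action a."""
    return STATES[s_next]

# Pressing the accelerator in s2 and passing into s3 yields reward v_3 = 100.
print(reward("s2", "a1", "s3"))   # 100
print(P["s2"]["a1"]["s3"])        # assumed probability of that transition
```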
 