In other words, by following a policy, we obtain a sequence of states that is generated by a Markov chain. The latter follows from the simple fact that the probability of a transition from one state to another under a given policy $\pi(s, a)$ depends exclusively on the current state s and not on its predecessors.
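To make this concrete, the following is a minimal sketch (Python/NumPy, with hypothetical transition probabilities and policy values that are not taken from the text) of how a fixed policy induces a Markov chain over the states:

```python
import numpy as np

# Hypothetical 2-state, 2-action model: p[s, a, s'] is the transition probability,
# pi[s, a] the probability that the policy selects action a in state s.
p = np.array([[[0.8, 0.2],
               [0.1, 0.9]],
              [[0.5, 0.5],
               [0.0, 1.0]]])
pi = np.array([[0.5, 0.5],
               [0.3, 0.7]])

# Under the policy, the state sequence follows the Markov chain
# P_pi(s' | s) = sum_a pi(s, a) * p(s' | s, a), which depends only on s.
P_pi = np.einsum("sa,sax->sx", pi, p)
print(P_pi)   # each row sums to 1: a proper stochastic transition matrix
```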
We thus arrive at the Bellman equation. For the discrete case, the action-value function for each state s and each action $a \in A(s)$ satisfies

$$ q_\pi(s, a) = \sum_{s'} p^a_{ss'} \Big[ r^a_{ss'} + \gamma\, v_\pi(s') \Big], \qquad v_\pi(s) = \sum_{a} \pi(s, a)\, q_\pi(s, a) \qquad (3.4) $$
$v_\pi(s)$ is what we call the state-value function, which assigns to each state s the expected cumulative reward, that is, the expected return. The state-value function and action-value function are thus related and can be converted from one into the other (provided the model of the environment is known).
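As a rough illustration of this conversion, the following sketch implements the two halves of (3.4) on a hypothetical tabular model (the arrays p, r, pi and all numbers are made up, not taken from the text):

```python
import numpy as np

# Hypothetical 2-state, 2-action model: p[s, a, s'] = p_ss', r[s, a, s'] = r_ss',
# pi[s, a] = probability that the policy selects a in s.
gamma = 0.9
p = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
r = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.5, 0.5], [0.0, 1.0]]])
pi = np.array([[0.5, 0.5],
               [0.3, 0.7]])

def q_from_v(v):
    # First relation in (3.4): q(s,a) = sum_s' p_ss' * (r_ss' + gamma * v(s'))
    return (p * (r + gamma * v)).sum(axis=2)

def v_from_q(q):
    # Second relation in (3.4): v(s) = sum_a pi(s,a) * q(s,a)
    return (pi * q).sum(axis=1)

# (3.4) is a fixed-point equation: iterating the two conversions from an arbitrary
# starting value converges (for gamma < 1) to a consistent pair (v_pi, q_pi).
v = np.zeros(2)
for _ in range(200):
    v = v_from_q(q_from_v(v))
print(v, q_from_v(v))
```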
At this point, we should further mention that the Bellman equation (3.4) represents the discrete counterpart of the Hamilton-Jacobi-Bellman (HJB) differential equation, a fact which will become significant in Chap. 6 on hierarchical methods. For a detailed discussion of the HJB equation and its relation to other formulations, we refer to [Mun00].
At first glance, the Bellman equation appears rather complex, but it is not so difficult to understand. Let us first consider the case $\gamma = 0$, that is, taking into account only the immediate reward. The Bellman equation (3.4) then takes the following simplified form:
$$ q_\pi(s, a) = \sum_{s'} p^a_{ss'}\, r^a_{ss'} \qquad (3.5) $$
The expected return in the state s on taking the action a therefore equals the sum (over all possible subsequent states s') of the products of the probability $p^a_{ss'}$ of passing into the subsequent state s' and the reward $r^a_{ss'}$ obtained by doing so.
For $\gamma > 0$, in addition to the immediate reward $r^a_{ss'}$, the discounted expected return over all subsequent transitions, $\gamma\, v_\pi(s')$, must now be added for the transition to the subsequent state s' ("chain optimization"); see Fig. 3.3a. In general, there are always two possibilities of reward in RL: the immediate reward, or the indirect reward via the fact that the action leads to an attractive subsequent state (or both).
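As a small numerical illustration (with hypothetical probabilities, rewards, and state values, not taken from the text), compare the value from (3.5) with the $\gamma > 0$ value from (3.4) for a single state-action pair:

```python
# One state-action pair with two possible successor states s' (made-up numbers):
p_ss = [0.7, 0.3]      # p_ss' for s' = 0, 1
r_ss = [1.0, 0.0]      # r_ss'
v_next = [2.0, 5.0]    # assumed v_pi(s')
gamma = 0.9

q_gamma0 = sum(p * r for p, r in zip(p_ss, r_ss))                     # (3.5): 0.7
q = sum(p * (r + gamma * v) for p, r, v in zip(p_ss, r_ss, v_next))   # (3.4): 3.31
print(q_gamma0, q)   # the difference 2.61 is the discounted expected future return
```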
The state-value function can in turn be determined using (3.4) from the action-value function, namely, as the sum (over all actions a permissible in s) of the product of the probability that the existing policy selects the action a and its expected action value (Fig. 3.3b). By substituting the state-value function $v_\pi$ into (3.4), we obtain a Bellman equation expressed entirely in terms of the action-value function (Fig. 3.4a):
$$ q_\pi(s, a) = \sum_{s'} p^a_{ss'} \left[ r^a_{ss'} + \gamma \sum_{a'} \pi(s', a')\, q_\pi(s', a') \right] \qquad (3.6) $$
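A quick numerical sanity check (again with made-up arrays, not from the text) that (3.6) is exactly (3.4) with $v_\pi(s')$ expanded as the policy-weighted sum of action values:

```python
import numpy as np

# Random but valid hypothetical model: each p[s, a, :] sums to 1 over s'.
rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(2), size=(2, 2))   # p[s, a, s']
r = rng.normal(size=(2, 2, 2))               # r[s, a, s']
pi = rng.dirichlet(np.ones(2), size=2)       # pi(s', a')
q_next = rng.normal(size=(2, 2))             # some given q_pi(s', a')
gamma = 0.9

# (3.4): first form v_pi(s') = sum_a' pi(s',a') q_pi(s',a'), then the one-step backup.
v_next = (pi * q_next).sum(axis=1)
q_via_v = (p * (r + gamma * v_next)).sum(axis=2)

# (3.6): the same backup written out with the double sum over s' and a'.
q_direct = (p * r).sum(axis=2) + gamma * np.einsum("xas,sb,sb->xa", p, pi, q_next)

print(np.allclose(q_via_v, q_direct))   # True: both forms agree
```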