3.2 Markov Property
In order to keep the complexity of determining a good (as nearly optimal as possible) policy
within bounds, it is assumed in most cases that the RL problem satisfies what is
called the Markov property:
Assumption 3.1 (Markov property): In every state, the selection of the best
action depends only on the current state and not on the transitions preceding it.
A good example of a problem which satisfies the Markov property is once again
the game of chess. In order to make the best move in any position, from a
mathematical point of view, it is totally irrelevant how the position on the board
was reached (though when actually playing the game, knowing this is generally helpful). On
the other hand, it is important to think through all possible subsequent transitions
for every move (which of course in practice can be performed only to a certain
depth of analysis) in order to find the optimal move.
Put simply, we have to work out the future from where we are, irrespective of
how we got here. This allows us to reduce drastically the complexity of the
calculations. At the same time, we must of course check each model to determine
whether the Markov property is adequately satisfied. Where this is not the case, a
possible remedy is to record a certain limited number of preceding transitions
(generalized Markov property; see Chap. 10) and to extend the definition of the
states in a general sense.
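As a rough illustration, the following Python sketch (the window length k and the observation values are made up for demonstration, not taken from the text) shows one way to extend the state definition by a limited history of preceding observations:

from collections import deque

class HistoryState:
    """Extends the state definition by a limited number of preceding
    observations so that the (generalized) Markov property can hold."""

    def __init__(self, k):
        self.k = k                      # how many preceding observations to keep
        self.window = deque(maxlen=k)   # oldest entries are dropped automatically

    def update(self, observation):
        """Record the newest observation and return the augmented state."""
        self.window.append(observation)
        # The augmented state is the tuple of the last k observations;
        # a policy conditioned on it no longer needs the full history.
        return tuple(self.window)

state = HistoryState(k=3)
for obs in ["o1", "o2", "o3", "o4"]:
    s = state.update(obs)
print(s)   # ('o2', 'o3', 'o4')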
Provided the Markov property is now satisfied (Markov decision process, MDP), the
policy π depends solely on the current state, that is, a = π(s). RL is now based
directly on the DP methods for solving the Bellman equation. This involves assigning
to each policy π an action-value function q_π(s, a), which assigns to each state s
and to every permissible action a in that state the expected value of the cumulative
rewards over the remainder of the episode. We shall refer to this quantity as the
expected return R:

R_t := r_{t+1} + γ r_{t+2} + γ² r_{t+3} + … = Σ_{k=0}^{∞} γ^k r_{t+k+1}     (3.1)
where t denotes the current time step and γ ∈ [0, 1] is the rate for discounting future
rewards. For γ = 0, the agent acts myopically in that it seeks to maximize only the
immediate reward r_{t+1}. With increasing γ, the agent acts in a more and more
long-term-oriented fashion, as future rewards make a larger contribution. If γ < 1,
the infinite series always converges to a finite value, provided that the sequence
{r_k} is bounded.
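For illustration, here is a small Python sketch of Eq. (3.1) for a finite episode with a made-up reward sequence; it shows how the choice of γ shifts the weight between immediate and future rewards:

def discounted_return(rewards, gamma):
    """Finite-horizon version of Eq. (3.1):
    R_t = sum over k of gamma**k * r_{t+k+1}, where rewards[k]
    plays the role of r_{t+k+1} for the remainder of the episode."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

rewards = [1.0, 1.0, 1.0, 10.0]          # hypothetical rewards r_{t+1}, ..., r_{t+4}

print(discounted_return(rewards, 0.0))   # 1.0   -> myopic: only r_{t+1} counts
print(discounted_return(rewards, 0.9))   # ~10.0 -> far-sighted: the late reward of 10 dominates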
If now for any two actions a, b ∈ A(s) we have q_π(s, a) < q_π(s, b), then b yields a
higher expected return than a. Therefore, the policy π(s) should prefer the action b
to the action a, but we will come to that in a minute.
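To anticipate this point with a minimal Python sketch (the action values below are invented for demonstration), such a policy simply picks the action with the larger action value in the current state:

# Hypothetical action values q_pi(s, a) for one state s with actions a and b
q_pi = {"a": 0.4, "b": 0.7}

def preferred_action(q_values):
    """Return the action with the largest action value, i.e. prefer b
    over a whenever q_pi(s, a) < q_pi(s, b)."""
    return max(q_values, key=q_values.get)

print(preferred_action(q_pi))   # 'b'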