3.2 Markov Property
In order to keep the complexity of determining a good (as nearly optimal as possible) policy
within bounds, it is assumed in most cases that the RL problem satisfies what is
called the Markov property:
Assumption 3.1 (Markov property): In every state, the selection of the best
action depends only on the current state and not on the transitions preceding it.
A good example of a problem which satisfies the Markov property is once again
the game of chess. In order to make the best move in any position, from a
mathematical point of view, it is totally irrelevant how the position on the board
was reached (though when actually playing the game, knowing this is generally helpful). On
the other hand, it is important to think through all possible subsequent transitions
for every move (which of course in practice can be performed only to a certain
depth of analysis) in order to find the optimal move.
Put simply, we have to work out the future from where we are, irrespective of
how we got here. This allows us to reduce drastically the complexity of the
calculations. At the same time, we must of course check each model to determine
whether the Markov property is adequately satisfied. Where this is not the case, a
possible remedy is to record a certain limited number of preceding transitions
(generalized Markov property; see Chap. 10) and to extend the definition of the
states in a general sense.
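As a rough illustration, the following Python sketch (the window length k and the observation values are made up for demonstration, not taken from the text) shows one way to extend the state definition by a limited history of preceding observations:

from collections import deque

class HistoryState:
    """Extends the state definition by a limited number of preceding
    observations so that the (generalized) Markov property can hold."""

    def __init__(self, k):
        self.k = k                      # how many preceding observations to keep
        self.window = deque(maxlen=k)   # oldest entries are dropped automatically

    def update(self, observation):
        """Record the newest observation and return the augmented state."""
        self.window.append(observation)
        # The augmented state is the tuple of the last k observations;
        # a policy conditioned on it no longer needs the full history.
        return tuple(self.window)

state = HistoryState(k=3)
for obs in ["o1", "o2", "o3", "o4"]:
    s = state.update(obs)
print(s)   # ('o2', 'o3', 'o4')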
Provided the Markov property is now satisfied (Markov decision process, MDP), the
policy π depends solely on the current state, that is, a = π(s). RL is now based
directly on the DP methods for solving the Bellman equation. This involves assigning
to each policy π an action-value function q_π(s, a), which assigns to each state s
and to every permissible action a in that state the expected value of the cumulative
rewards over the remainder of the episode. We shall refer to this quantity as the
expected return R:

R_t := r_{t+1} + γ r_{t+2} + γ² r_{t+3} + … = Σ_{k=0}^{∞} γ^k r_{t+k+1}     (3.1)
where t denotes the current time step and γ ∈ [0, 1] is the rate for discounting future
rewards. For γ = 0, the agent acts myopically in that it seeks to maximize only the
immediate reward r_{t+1}. With increasing γ, the agent acts in a more and more
long-term-oriented fashion, as future rewards make a larger contribution. If γ < 1,
the infinite series always converges to a finite value, provided that the sequence
{r_k} is bounded.
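For illustration, here is a small Python sketch of Eq. (3.1) for a finite episode with a made-up reward sequence; it shows how the choice of γ shifts the weight between immediate and future rewards:

def discounted_return(rewards, gamma):
    """Finite-horizon version of Eq. (3.1):
    R_t = sum over k of gamma**k * r_{t+k+1}, where rewards[k]
    plays the role of r_{t+k+1} for the remainder of the episode."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

rewards = [1.0, 1.0, 1.0, 10.0]          # hypothetical rewards r_{t+1}, ..., r_{t+4}

print(discounted_return(rewards, 0.0))   # 1.0   -> myopic: only r_{t+1} counts
print(discounted_return(rewards, 0.9))   # ~10.0 -> far-sighted: the late reward of 10 dominates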
If now for any two actions a, b ∈ A(s) we have q_π(s, a) < q_π(s, b), then b yields a
higher expected return than a. Therefore, the policy π(s) should prefer the action b
to the action a, but we will come to that in a minute.
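To anticipate this point with a minimal Python sketch (the action values below are invented for demonstration), such a policy simply picks the action with the larger action value in the current state:

# Hypothetical action values q_pi(s, a) for one state s with actions a and b
q_pi = {"a": 0.4, "b": 0.7}

def preferred_action(q_values):
    """Return the action with the largest action value, i.e. prefer b
    over a whenever q_pi(s, a) < q_pi(s, b)."""
    return max(q_values, key=q_values.get)

print(preferred_action(q_pi))   # 'b'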