In other words, by following a policy, we obtain a sequence of states that is generated by a Markov chain. The latter follows from the simple fact that the probability of a transition from one state to another under a given policy $\pi(s, a)$ depends exclusively on the current state s and not on its predecessors.
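To make this concrete, the following is a minimal sketch (Python/NumPy, with hypothetical transition probabilities and policy values that are not taken from the text) of how a fixed policy induces a Markov chain over the states:

```python
import numpy as np

# Hypothetical 2-state, 2-action model: p[s, a, s'] is the transition probability,
# pi[s, a] the probability that the policy selects action a in state s.
p = np.array([[[0.8, 0.2],
               [0.1, 0.9]],
              [[0.5, 0.5],
               [0.0, 1.0]]])
pi = np.array([[0.5, 0.5],
               [0.3, 0.7]])

# Under the policy, the state sequence follows the Markov chain
# P_pi(s' | s) = sum_a pi(s, a) * p(s' | s, a), which depends only on s.
P_pi = np.einsum("sa,sax->sx", pi, p)
print(P_pi)   # each row sums to 1: a proper stochastic transition matrix
```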
We thus arrive at the Bellman equation. For the discrete case, the action-value function for each state s and each action $a \in A(s)$ satisfies

$$ q_\pi(s, a) = \sum_{s'} p^a_{ss'} \Big[ r^a_{ss'} + \gamma\, v_\pi(s') \Big], \qquad v_\pi(s) = \sum_{a} \pi(s, a)\, q_\pi(s, a) \qquad (3.4) $$
$v_\pi(s)$ is what we call the state-value function, which assigns to each state s the expected cumulative reward, that is, the expected return. The state-value function and action-value function are thus related and can be converted from one into the other (provided the model of the environment is known).
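As a rough illustration of this conversion, the following sketch implements the two halves of (3.4) on a hypothetical tabular model (the arrays p, r, pi and all numbers are made up, not taken from the text):

```python
import numpy as np

# Hypothetical 2-state, 2-action model: p[s, a, s'] = p_ss', r[s, a, s'] = r_ss',
# pi[s, a] = probability that the policy selects a in s.
gamma = 0.9
p = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
r = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.5, 0.5], [0.0, 1.0]]])
pi = np.array([[0.5, 0.5],
               [0.3, 0.7]])

def q_from_v(v):
    # First relation in (3.4): q(s,a) = sum_s' p_ss' * (r_ss' + gamma * v(s'))
    return (p * (r + gamma * v)).sum(axis=2)

def v_from_q(q):
    # Second relation in (3.4): v(s) = sum_a pi(s,a) * q(s,a)
    return (pi * q).sum(axis=1)

# (3.4) is a fixed-point equation: iterating the two conversions from an arbitrary
# starting value converges (for gamma < 1) to a consistent pair (v_pi, q_pi).
v = np.zeros(2)
for _ in range(200):
    v = v_from_q(q_from_v(v))
print(v, q_from_v(v))
```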
At this point, we should further mention that the Bellman equation (3.4) represents the discrete counterpart of the Hamilton-Jacobi-Bellman (HJB) differential equation, a fact which will become significant in Chap. 6 on hierarchical methods. For a detailed discussion of the HJB equation and its relation to other formulations, we refer to [Mun00].
At first glance, the Bellman equation appears rather complex, but it is not so difficult to understand. Let us first consider the case $\gamma = 0$, that is, taking into account only the immediate reward. The Bellman equation (3.4) then takes the following simplified form:
$$ q_\pi(s, a) = \sum_{s'} p^a_{ss'}\, r^a_{ss'} \qquad (3.5) $$
The expected return in the state s on taking the action a therefore equals the sum (over all possible subsequent states s') of the products of the probability $p^a_{ss'}$ of passing into the subsequent state s' and the reward $r^a_{ss'}$ obtained by doing so.
For $\gamma > 0$, in addition to the immediate reward $r^a_{ss'}$, the discounted expected return over all subsequent transitions, $\gamma\, v_\pi(s')$, must now be added for the transition to the subsequent state s' ("chain optimization"); see Fig. 3.3a. In general, there are always two possibilities of reward in RL: the immediate reward, or the indirect reward via the fact that the action leads to an attractive subsequent state (or both).
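As a small numerical illustration (with hypothetical probabilities, rewards, and state values, not taken from the text), compare the value from (3.5) with the $\gamma > 0$ value from (3.4) for a single state-action pair:

```python
# One state-action pair with two possible successor states s' (made-up numbers):
p_ss = [0.7, 0.3]      # p_ss' for s' = 0, 1
r_ss = [1.0, 0.0]      # r_ss'
v_next = [2.0, 5.0]    # assumed v_pi(s')
gamma = 0.9

q_gamma0 = sum(p * r for p, r in zip(p_ss, r_ss))                     # (3.5): 0.7
q = sum(p * (r + gamma * v) for p, r, v in zip(p_ss, r_ss, v_next))   # (3.4): 3.31
print(q_gamma0, q)   # the difference 2.61 is the discounted expected future return
```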
The state-value function can in turn be determined using (3.4) from the action-value function, namely, as the sum (over all actions a permissible in s) of the product of the probability that the existing policy selects the action a and its expected action value (Fig. 3.3b). By substituting the state-value function $v_\pi$ into (3.4), we obtain a Bellman equation expressed entirely in terms of the action-value function (Fig. 3.4a):
$$ q_\pi(s, a) = \sum_{s'} p^a_{ss'} \left[ r^a_{ss'} + \gamma \sum_{a'} \pi(s', a')\, q_\pi(s', a') \right] \qquad (3.6) $$
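A quick numerical sanity check (again with made-up arrays, not from the text) that (3.6) is exactly (3.4) with $v_\pi(s')$ expanded as the policy-weighted sum of action values:

```python
import numpy as np

# Random but valid hypothetical model: each p[s, a, :] sums to 1 over s'.
rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(2), size=(2, 2))   # p[s, a, s']
r = rng.normal(size=(2, 2, 2))               # r[s, a, s']
pi = rng.dirichlet(np.ones(2), size=2)       # pi(s', a')
q_next = rng.normal(size=(2, 2))             # some given q_pi(s', a')
gamma = 0.9

# (3.4): first form v_pi(s') = sum_a' pi(s',a') q_pi(s',a'), then the one-step backup.
v_next = (pi * q_next).sum(axis=1)
q_via_v = (p * (r + gamma * v_next)).sum(axis=2)

# (3.6): the same backup written out with the double sum over s' and a'.
q_direct = (p * r).sum(axis=2) + gamma * np.einsum("xas,sb,sb->xa", p, pi, q_next)

print(np.allclose(q_via_v, q_direct))   # True: both forms agree
```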