Fig. 9.1. The variables of an MDP involved in a single transition from state $x_t$ to state $x_{t+1}$ after the agent performed action $a_t$ and received reward $r_t$.
rewards. A possible solution is represented by a policy $\mu : \mathcal{X} \to \mathcal{A}$, which returns the chosen action $a = \mu(x)$ for any state $x \in \mathcal{X}$. With a fixed policy $\mu$, the MDP is reduced to a Markov chain with transition probabilities $p_\mu(x_j \mid x_i) = p(x_j \mid x_i, a = \mu(x_i))$ and rewards $r_{x_i x_j} = r_{x_i x_j}(\mu(x_i))$. In such cases it is common to operate with the expected reward $r_{x_i} = \sum_j p_\mu(x_j \mid x_i)\, r_{x_i x_j}$. This reward is the one expected to be received in state $x_i$ when actions are chosen according to policy $\mu$.
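To make this reduction concrete, the following sketch computes the induced transition matrix $p_\mu$ and the expected rewards $r_{x_i}$ for a fixed deterministic policy. All names and the tabular representation (arrays p, r, policy) are assumptions for illustration, not notation from the text.

```python
import numpy as np

def induced_markov_chain(p, r, policy):
    """Reduce a tabular MDP to a Markov chain under a fixed deterministic policy.

    p[i, a, j]  -- transition probability p(x_j | x_i, a)
    r[i, a, j]  -- reward r_{x_i x_j}(a)
    policy[i]   -- action mu(x_i) chosen in state x_i

    Returns (p_mu, r_bar), where
      p_mu[i, j] = p(x_j | x_i, mu(x_i))            (chain transition matrix)
      r_bar[i]   = sum_j p_mu[i, j] * r[i, mu(i), j] (expected reward in x_i)
    """
    n_states = p.shape[0]
    idx = np.arange(n_states)
    p_mu = p[idx, policy, :]                                  # rows selected by mu
    r_bar = np.einsum('ij,ij->i', p_mu, r[idx, policy, :])    # expected rewards
    return p_mu, r_bar
```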
9.1.2 The Value Function, the Action-Value Function and Bellman's Equation
The approach taken by dynamic programming (DP) and reinforcement learning (RL) is to define a value function $V : \mathcal{X} \to \mathbb{R}$ that expresses for each state how much reward we can expect to receive in the long run. While $V$ was previously used to denote the mixing weight vectors, those will not be referred to in this chapter, and hence any ambiguity is avoided. Let $\mu = \{\mu_0, \mu_1, \dots\}$ be a sequence of policies where we use policy $\mu_t$ at time $t$, starting at time $t = 0$. Then, the reward that is accumulated after $n$ steps when starting at state $x$, called the n-step return $V_n^\mu$ for state $x$, is given by

$$
V_n^\mu(x) = \mathbb{E}\left( \gamma^n R(x_n) + \sum_{t=0}^{n-1} \gamma^t r^{\mu_t}_{x_t x_{t+1}} \;\Big|\; x_0 = x \right), \tag{9.1}
$$

where $\{x_0, x_1, \dots\}$ is the sequence of states, and $R(x_n)$ denotes the expected return that will be received when starting from state $x_n$. The return differs from the reward in that it implicitly considers future reward.
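As a sketch of how (9.1) can be evaluated without sampling trajectories, note that for a stationary policy ($\mu_t = \mu$ for all $t$) the n-step return satisfies the backward recursion $V_0^\mu = R$ and $V_{k+1}^\mu(x_i) = \sum_j p_\mu(x_j \mid x_i)\,(r_{x_i x_j} + \gamma V_k^\mu(x_j))$. The snippet below, which reuses the hypothetical tabular quantities from the previous sketch, implements that recursion; all names are assumptions.

```python
import numpy as np

def n_step_return(p_mu, r_mu, R_final, gamma, n):
    """Evaluate the n-step return (9.1) for a stationary policy via the
    backward recursion V_0 = R, V_{k+1} = r_bar + gamma * P_mu @ V_k.

    p_mu[i, j]  -- p_mu(x_j | x_i) of the induced Markov chain
    r_mu[i, j]  -- reward r_{x_i x_j} under mu
    R_final[i]  -- expected return R(x_n) assigned to the final state
    """
    r_bar = (p_mu * r_mu).sum(axis=1)     # expected one-step reward per state
    V = R_final.astype(float)
    for _ in range(n):
        V = r_bar + gamma * p_mu @ V      # one backward step of the recursion
    return V
```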
In finite horizon cases, where $n < \infty$, the optimal policy $\mu^*$ is the one that maximises the expected return for each state $x \in \mathcal{X}$, giving the optimal n-step return $V_n^*(x) = \max_\mu V_n^\mu(x)$. Finite horizon cases can be seen as a special case of infinite horizon cases with zero-reward absorbing states [17]. For infinite horizon cases, the expected return when starting at state $x$ is analogously to (9.1) given by

$$
V^\mu(x) = \lim_{n \to \infty} \mathbb{E}\left( \sum_{t=0}^{n-1} \gamma^t r^{\mu_t}_{x_t x_{t+1}} \;\Big|\; x_0 = x \right). \tag{9.2}
$$
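For a finite, discounted MDP with $\gamma < 1$ and a stationary policy, one way to obtain the infinite-horizon value (9.2) is to solve the linear system $V^\mu = \bar{r}_\mu + \gamma P_\mu V^\mu$ directly, which anticipates the fixed-point view developed in this section. A minimal sketch, again assuming the tabular quantities introduced above:

```python
import numpy as np

def policy_value(p_mu, r_mu, gamma):
    """Infinite-horizon value V_mu of a fixed stationary policy, cf. (9.2),
    for gamma < 1, obtained by solving (I - gamma * P_mu) V = r_bar."""
    n = p_mu.shape[0]
    r_bar = (p_mu * r_mu).sum(axis=1)     # expected one-step reward per state
    return np.linalg.solve(np.eye(n) - gamma * p_mu, r_bar)
```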