Fig. 9.1. The variables of an MDP involved in a single transition from state $x_t$ to state $x_{t+1}$ after the agent performed action $a_t$ and received reward $r_t$.
rewards. A possible solution is represented by a policy $\mu : \mathcal{X} \to \mathcal{A}$, which returns the chosen action $a = \mu(x)$ for any state $x$. With a fixed policy $\mu$, the MDP is reduced to a Markov chain with transition probabilities $p_\mu(x_j \mid x_i) = p(x_j \mid x_i, a = \mu(x_i))$, and rewards $r_{x_i x_j} = r_{x_i x_j}(\mu(x_i))$. In such cases it is common to operate with the expected reward $r_{x_i} = \sum_{x_j \in \mathcal{X}} p_\mu(x_j \mid x_i)\, r_{x_i x_j}$. This reward is the one expected to be received in state $x_i$ when actions are chosen according to policy $\mu$.
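As a concrete illustration, the following Python sketch (our own construction, with hypothetical toy matrices rather than anything from the text) derives the induced transition matrix $p_\mu$ and the expected rewards $r_{x_i}$ for a fixed policy:

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP; the numbers are illustrative only.
# p[a, i, j] = p(x_j | x_i, a) and r[a, i, j] = r_{x_i x_j}(a).
p = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.6, 0.4]]])
r = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.5, 0.5], [1.5, 0.0]]])
mu = np.array([0, 1])          # fixed policy: take action mu[i] in state x_i

states = np.arange(p.shape[1])
p_mu = p[mu, states, :]        # p_mu(x_j | x_i) = p(x_j | x_i, mu(x_i))
r_mu = r[mu, states, :]        # r_{x_i x_j} = r_{x_i x_j}(mu(x_i))

# Expected reward per state: r_{x_i} = sum_j p_mu(x_j | x_i) r_{x_i x_j}
r_exp = (p_mu * r_mu).sum(axis=1)
```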
9.1.2 The Value Function, the Action-Value Function and Bellman's Equation
The approach taken by dynamic programming (DP) and reinforcement learning (RL) is to define a value function $V : \mathcal{X} \to \mathbb{R}$ that expresses for each state how much reward we can expect to receive in the long run. While $V$ was previously used to denote the mixing weight vectors, those will not be referred to in this chapter, and hence any ambiguity is avoided. Let $\mu = \{\mu_0, \mu_1, \ldots\}$ be a sequence of policies where we use policy $\mu_t$ at time $t$, starting at time $t = 0$. Then, the reward that is accumulated after $n$ steps when starting at state $x$, called the n-step return $V_n^\mu$ for state $x$, is given by

$$V_n^\mu(x) = \mathbb{E}\left( \sum_{t=0}^{n-1} \gamma^t r^{\mu_t}_{x_t x_{t+1}} + \gamma^n R(x_n) \;\middle|\; x_0 = x \right), \qquad (9.1)$$
where $\{x_0, x_1, \ldots\}$ is the sequence of states, and $R(x_n)$ denotes the expected return that will be received when starting from state $x_n$. The return differs from the reward in that it implicitly considers future reward.
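To make (9.1) concrete, here is a minimal Monte Carlo sketch (our own, reusing the hypothetical arrays from the previous snippet) that estimates the n-step return for a single stationary policy, under the simplifying assumption that the terminal term $R(x_n)$ is zero:

```python
import numpy as np

rng = np.random.default_rng(0)

def n_step_return(x0, mu, p, r, gamma, n, episodes=10_000):
    """Monte Carlo estimate of V_n(x0) per (9.1) for a stationary
    policy mu, assuming R(x_n) = 0 (a simplification of ours)."""
    total = 0.0
    for _ in range(episodes):
        x, ret = x0, 0.0
        for t in range(n):
            a = mu[x]
            x_next = rng.choice(p.shape[2], p=p[a, x])  # sample x_{t+1}
            ret += gamma**t * r[a, x, x_next]           # discounted reward
            x = x_next
        total += ret
    return total / episodes
```

For example, `n_step_return(0, mu, p, r, gamma=0.9, n=5)` estimates the 5-step return when starting in the first state.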
In finite horizon cases, where $n < \infty$, the optimal policy $\mu^*$ is the one that maximises the expected return for each state $x \in \mathcal{X}$, giving the optimal n-step return $V_n^*(x) = \max_\mu V_n^\mu(x)$. Finite horizon cases can be seen as a special case of infinite horizon cases with zero-reward absorbing states [17]. For infinite horizon cases, the expected return when starting at state $x$ is analogously to (9.1) given by
$$V^\mu(x) = \lim_{n \to \infty} \mathbb{E}\left( \sum_{t=0}^{n-1} \gamma^t r^{\mu_t}_{x_t x_{t+1}} \;\middle|\; x_0 = x \right). \qquad (9.2)$$
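For a single stationary policy, $V^\mu$ of (9.2) satisfies the fixed-point relation $V^\mu(x_i) = r_{x_i} + \gamma \sum_{x_j} p_\mu(x_j \mid x_i)\, V^\mu(x_j)$, which suggests the following iterative policy-evaluation sketch (again our own construction, reusing `p_mu` and `r_exp` from the first snippet):

```python
import numpy as np

def evaluate_policy(p_mu, r_exp, gamma=0.9, tol=1e-8):
    """Approximate V^mu from (9.2) by iterating the fixed-point
    update V <- r_exp + gamma * P_mu V until convergence."""
    v = np.zeros(len(r_exp))
    while True:
        v_new = r_exp + gamma * p_mu @ v   # one Bellman backup
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new
```

Since $\gamma < 1$ makes this update a contraction, the iteration converges to the unique fixed point $V^\mu$.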