The optimal policy is the one that maximises this expected return for each state $x \in \mathcal{X}$, and results in the optimal value function $V(x) = \max_\mu V^\mu(x)$. Therefore, knowing $V$, we can infer the optimal policy by
\[
\mu(x) = \arg\max_{a \in \mathcal{A}} \mathrm{E}\big( r_{xx'}(a) + \gamma V(x') \,\big|\, x, a \big) . \qquad (9.3)
\]
Thus, the optimal policy is given by choosing the action that maximises the
expected sum of immediate reward and the discounted expected optimal return
of the next state. This reduces the goal of finding the policy that maximises
the reward in the long run to learning the optimal value function, which is the
approach taken by DP and RL. In fact, Sutton conjectures that
“All efficient methods for solving sequential decision problems determine
(learn or compute) value functions as an intermediate step.”
which he calls the “Value-Function Hypothesis” [206].
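To make (9.3) concrete, the following is a minimal sketch of greedy policy extraction from a given value function for a small finite MDP. The tabular arrays P and R, the discount factor gamma, and the function name are illustrative assumptions rather than notation taken from the text.

```python
import numpy as np

def greedy_policy(V, P, R, gamma):
    """Greedy policy extraction from a value function V, in the spirit of (9.3).

    Assumed (illustrative) tabular layout of a finite MDP:
      V[x]        -- value estimate for state x
      P[x, a, y]  -- transition probability Pr(x' = y | x, a)
      R[x, a, y]  -- immediate reward r_xx'(a) for the transition x -> y under a
      gamma       -- discount factor in [0, 1)
    """
    n_states = P.shape[0]
    policy = np.zeros(n_states, dtype=int)
    for x in range(n_states):
        # Expected immediate reward plus discounted value of the next state,
        # E( r_xx'(a) + gamma * V(x') | x, a ), evaluated for every action a.
        q = np.einsum('ay,ay->a', P[x], R[x] + gamma * V)
        policy[x] = np.argmax(q)   # mu(x) = argmax_a (...)
    return policy
```

Note that this step requires the transition model P, which is exactly what motivates the action-value function introduced next.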
In some cases, for example when one does not have a model of the transition function, the expectation in (9.3) cannot be evaluated. It is then easier to work with the action-value function $Q : \mathcal{X} \times \mathcal{A} \to \mathbb{R}$, which estimates the expected return $Q(x, a)$ when taking action $a$ in state $x$, and which is for some policy $\mu$ defined by
\[
\begin{aligned}
Q^\mu(x, a) &= \lim_{n \to \infty} \mathrm{E}\Big( r_{x_0 x_1}(a) + \gamma \sum_{t=1}^{n-1} \gamma^{t-1} r_{x_t x_{t+1}} \,\Big|\, x_0 = x, a \Big) \\
&= \mathrm{E}\big( r_{xx'}(a) + \gamma V^\mu(x') \,\big|\, x, a \big) . \qquad (9.4)
\end{aligned}
\]
$V^\mu$ is recovered from $Q^\mu$ by $V^\mu(x) = Q^\mu(x, \mu(x))$. Given that the optimal action-value function $Q$ is known, obtaining the optimal policy $\mu$ simplifies from (9.3) to
\[
\mu(x) = \arg\max_{a \in \mathcal{A}} Q(x, a) , \qquad (9.5)
\]
that is, by choosing the action $a$ in state $x$ that maximises the expected return given by $Q(x, a)$.
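As a companion to (9.5), here is a minimal sketch of greedy action selection from a tabular action-value function; the array layout Q[x, a] is an assumption made for illustration, not something prescribed by the text.

```python
import numpy as np

def greedy_policy_from_q(Q):
    """Greedy policy extraction following (9.5).

    Q is assumed, for illustration, to be a table with Q[x, a] holding the
    estimated expected return of taking action a in state x.  No transition
    model is needed here, which is the practical advantage of working with Q.
    """
    return np.argmax(Q, axis=1)   # mu(x) = argmax_a Q(x, a), for every state x
```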
Note that $V$ and $Q$ are related by $V(x) = Q(x, \mu(x)) = \max_{a \in \mathcal{A}} Q(x, a)$.
Combining this relation with (9.4) gives us Bellman's Equation
\[
V(x) = \max_{a \in \mathcal{A}} \mathrm{E}\big( r_{xx'}(a) + \gamma V(x') \,\big|\, x, a \big) , \qquad (9.6)
\]
which relates the optimal values of different states to each other, and whose solution forms the core of DP. Similarly, Bellman's equation for a fixed policy $\mu$ is given by
\[
V^\mu(x) = \mathrm{E}\big( r_{xx'} + \gamma V^\mu(x') \,\big|\, x \big) . \qquad (9.7)
\]
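To illustrate what solving (9.6) and (9.7) involves in the tabular case, the following sketch applies both equations as fixed-point updates. It reuses the illustrative P, R and gamma layout from the earlier sketch; the function names, stopping tolerance and iteration limit are assumptions, and this is only one of several possible DP schemes.

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8, max_iter=10_000):
    """Approximate the solution of Bellman's equation (9.6) by fixed-point iteration.

    Repeatedly applies V(x) <- max_a E( r_xx'(a) + gamma * V(x') | x, a ),
    using the same illustrative P[x, a, y], R[x, a, y] layout as before.
    """
    V = np.zeros(P.shape[0])
    for _ in range(max_iter):
        # Q[x, a] = sum_y P[x, a, y] * (R[x, a, y] + gamma * V[y])
        Q = np.einsum('xay,xay->xa', P, R + gamma * V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    return V_new

def policy_evaluation(P, R, gamma, policy, tol=1e-8, max_iter=10_000):
    """Iterate the fixed-policy Bellman equation (9.7) to approximate V^mu."""
    n_states = P.shape[0]
    V = np.zeros(n_states)
    x_idx = np.arange(n_states)
    P_mu = P[x_idx, policy]   # P_mu[x, y] = Pr(y | x, mu(x))
    R_mu = R[x_idx, policy]   # R_mu[x, y] = reward for the transition x -> y under mu
    for _ in range(max_iter):
        # V(x) <- E( r_xx' + gamma * V^mu(x') | x ) under the fixed policy mu
        V_new = np.einsum('xy,xy->x', P_mu, R_mu + gamma * V)
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    return V_new
```

Running value_iteration and then the greedy_policy sketch from above recovers an (approximately) optimal policy, mirroring the reduction described in the text: the long-run decision problem is solved by first computing the value function.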
An example of a problem that can be described by an MDP, together with its optimal value function and one of its optimal policies, is shown in Fig. 2.2.