Information Technology Reference
In-Depth Information
5.3.5.2 Policy Iteration Method
This algorithm focuses on displaying a sequence of stationary policies, which
undergo step-by-step improvements. Hereafter, we describe an iteration
n
+1
from the policy
π
n
that was obtained at iteration
n
:
J
n
is computed as the expected cost of policy
π
n
,
Q
n
is computed as the associated value function:
π
n
+1
(
x
) = Arg min
u/
(
x,u
)
∈
A
Q
n
(
x,u
)
.
That algorithm provides an explicit sequence of feasible policies that improve
monotonically step by step. Their cost may be controlled, so that the algo-
rithm can be stopped as soon as an acceptable cost is reached. In “actor-
critics” methods, a policy is first applied (computation of the cost function
J
n
) and then criticized (here by minimization of the value function) to obtain
a new policy. Of course, here, the policy application is rather heavy since it
needs the exhaustive computation of the values of the cost function. Moreover
it is a virtual application, which is just simulated.
That computation is performed by solving the following linear system:
E, J
n
(
x
)=
y∈E
∀
x
∈
P
π
n
(
x
)
(
x,y
)[
c
(
x,π
(
x
)
,y
)+
αJ
n
(
y
)]
.
That algorithm can be shown to converge “linearly” towards the optimal
policy
π
∗
. In other words, the deviation between the current cost and the
optimal cost vanishes and is bounded by a geometric sequence, whose ratio is
strictly smaller than 1. For some classical problems, the algorithm may end
up within a finite number of iterations.
5.3.5.3 Value-Function Iteration Method
This algorithm focuses on generating a sequence of cost functions that are not
necessarily associated to feasible policies, but which converges to the optimal
cost function. We will also write down the iteration
n
+ 1 of this algorithm
from the cost function
J
n
, which is the algorithm output at iteration
n
:
Q
n
is computed as the value function from cost function
J
n
J
n
+1
(
x
)=
min
u/
(
x,u
)
∈
A
Q
n
(
x,u
)
.
It can be also shown, by a contraction method, that this algorithm converges
linearly (i.e. at exponential speed) towards the value function that is asso-
ciated to the optimal policy. The optimal policy is recovered from this limit
value function by the classical variational Bellman equation:
π
∗
(
x
) = Arg min
u/
(
x,u
)
∈
A
Q
∗
(
x,u
)
.
Search WWH ::
Custom Search