5.3.5.2 Policy Iteration Method
This algorithm constructs a sequence of stationary policies that are improved step by step. Hereafter, we describe iteration n + 1, starting from the policy $\pi_n$ obtained at iteration n:
- $J_n$ is computed as the expected cost of policy $\pi_n$;
- $Q_n$ is computed as the associated value function;
- the improved policy is given by
\[
\pi_{n+1}(x) = \arg\min_{u \,/\, (x,u)\in A} Q_n(x,u).
\]
This algorithm provides an explicit sequence of feasible policies that improve monotonically, step by step. Their cost can be monitored, so that the algorithm may be stopped as soon as an acceptable cost is reached. In "actor-critic" methods, a policy is first applied (computation of the cost function $J_n$) and then criticized (here, by minimization of the value function) to obtain a new policy. Of course, applying the policy is computationally heavy here, since it requires the exhaustive computation of the values of the cost function. Moreover, the application is only virtual: the policy is simulated rather than actually executed.
That computation is performed by solving the following linear system:
\[
\forall x \in E, \quad J_n(x) = \sum_{y \in E} P_{\pi_n(x)}(x,y)\,\bigl[\,c(x,\pi_n(x),y) + \alpha J_n(y)\,\bigr].
\]
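For concreteness, here is a minimal Python sketch of one policy-iteration step on a small finite MDP: evaluation solves the linear system above, and improvement takes the arg min of $Q_n$. The state and action counts, the transition tensor P, the cost tensor c, the discount factor alpha and the initial policy are all hypothetical placeholders, not data from the text.

```python
# One policy-iteration step on a finite MDP (all numerical data are placeholders).
import numpy as np

n_states, n_actions, alpha = 4, 2, 0.9
rng = np.random.default_rng(0)
P = rng.random((n_actions, n_states, n_states))   # P[u, x, y]: transition probabilities
P /= P.sum(axis=2, keepdims=True)                 # normalize each row to sum to 1
c = rng.random((n_actions, n_states, n_states))   # c[u, x, y]: transition costs
pi = np.zeros(n_states, dtype=int)                # current policy pi_n

# Policy evaluation: solve J_n(x) = sum_y P_{pi_n(x)}(x,y) [c(x, pi_n(x), y) + alpha J_n(y)]
P_pi = P[pi, np.arange(n_states), :]              # row x is P_{pi_n(x)}(x, .)
c_pi = np.einsum("xy,xy->x", P_pi, c[pi, np.arange(n_states), :])
J = np.linalg.solve(np.eye(n_states) - alpha * P_pi, c_pi)

# Policy improvement: Q_n(x, u), then pi_{n+1}(x) = argmin_u Q_n(x, u)
Q = np.einsum("uxy,uxy->ux", P, c) + alpha * P @ J   # shape (n_actions, n_states)
pi_next = Q.argmin(axis=0)
```

Repeating evaluation and improvement until the policy no longer changes gives the full algorithm.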
Policy iteration can be shown to converge "linearly" to the optimal policy $\pi^*$: the deviation between the current cost and the optimal cost vanishes, being bounded by a geometric sequence whose ratio is strictly smaller than 1. For some classical problems, the algorithm even terminates after a finite number of iterations.
5.3.5.3 Value-Function Iteration Method
This algorithm generates a sequence of cost functions that are not necessarily associated with feasible policies, but that converges to the optimal cost function. As before, we write down iteration n + 1 of this algorithm, starting from the cost function $J_n$ produced at iteration n:
- $Q_n$ is computed as the value function associated with the cost function $J_n$;
- the updated cost function is
\[
J_{n+1}(x) = \min_{u \,/\, (x,u)\in A} Q_n(x,u).
\]
It can also be shown, by a contraction argument, that this algorithm converges linearly (i.e., at exponential speed) to the value function associated with the optimal policy. The optimal policy is then recovered from this limit value function through the classical variational Bellman equation:
\[
\pi^*(x) = \arg\min_{u \,/\, (x,u)\in A} Q^*(x,u).
\]
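As an illustration, the sketch below runs the value-function iteration on the same kind of small, randomly generated MDP, then reads the greedy policy off the limit value function as in the equation above; P, c, alpha and the stopping tolerance are again hypothetical placeholders.

```python
# Value-function iteration on a finite MDP (all numerical data are placeholders).
import numpy as np

n_states, n_actions, alpha = 4, 2, 0.9
rng = np.random.default_rng(0)
P = rng.random((n_actions, n_states, n_states))   # P[u, x, y]: transition probabilities
P /= P.sum(axis=2, keepdims=True)                 # normalize each row to sum to 1
c = rng.random((n_actions, n_states, n_states))   # c[u, x, y]: transition costs

J = np.zeros(n_states)                            # arbitrary initial cost function
for _ in range(1000):
    # Q_n(x, u) = sum_y P_u(x, y) [c(x, u, y) + alpha J_n(y)]
    Q = np.einsum("uxy,uxy->ux", P, c) + alpha * P @ J
    J_new = Q.min(axis=0)                         # J_{n+1}(x) = min_u Q_n(x, u)
    if np.max(np.abs(J_new - J)) < 1e-10:         # stop once the update is negligible
        J = J_new
        break
    J = J_new

pi_star = Q.argmin(axis=0)                        # pi*(x) = argmin_u Q*(x, u)
```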