5.3.5.2 Policy Iteration Method
This algorithm constructs a sequence of stationary policies that are improved step by step. Hereafter, we describe iteration n + 1, starting from the policy $\pi_n$ obtained at iteration n:
- $J_n$ is computed as the expected cost of policy $\pi_n$;
- $Q_n$ is computed as the associated value function;
- the improved policy is given by
\[
\pi_{n+1}(x) = \arg\min_{u \,/\, (x,u)\in A} Q_n(x,u).
\]
This algorithm provides an explicit sequence of feasible policies that improve monotonically, step by step. Their cost can be monitored, so that the algorithm may be stopped as soon as an acceptable cost is reached. In "actor-critic" methods, a policy is first applied (computation of the cost function $J_n$) and then criticized (here, by minimization of the value function) to obtain a new policy. Of course, applying the policy is computationally heavy here, since it requires the exhaustive computation of the values of the cost function. Moreover, the application is only virtual: the policy is simulated rather than actually executed.
That computation is performed by solving the following linear system:
\[
\forall x \in E, \quad J_n(x) = \sum_{y \in E} P_{\pi_n(x)}(x,y)\,\bigl[\,c(x,\pi_n(x),y) + \alpha J_n(y)\,\bigr].
\]
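For concreteness, here is a minimal Python sketch of one policy-iteration step on a small finite MDP: evaluation solves the linear system above, and improvement takes the arg min of $Q_n$. The state and action counts, the transition tensor P, the cost tensor c, the discount factor alpha and the initial policy are all hypothetical placeholders, not data from the text.

```python
# One policy-iteration step on a finite MDP (all numerical data are placeholders).
import numpy as np

n_states, n_actions, alpha = 4, 2, 0.9
rng = np.random.default_rng(0)
P = rng.random((n_actions, n_states, n_states))   # P[u, x, y]: transition probabilities
P /= P.sum(axis=2, keepdims=True)                 # normalize each row to sum to 1
c = rng.random((n_actions, n_states, n_states))   # c[u, x, y]: transition costs
pi = np.zeros(n_states, dtype=int)                # current policy pi_n

# Policy evaluation: solve J_n(x) = sum_y P_{pi_n(x)}(x,y) [c(x, pi_n(x), y) + alpha J_n(y)]
P_pi = P[pi, np.arange(n_states), :]              # row x is P_{pi_n(x)}(x, .)
c_pi = np.einsum("xy,xy->x", P_pi, c[pi, np.arange(n_states), :])
J = np.linalg.solve(np.eye(n_states) - alpha * P_pi, c_pi)

# Policy improvement: Q_n(x, u), then pi_{n+1}(x) = argmin_u Q_n(x, u)
Q = np.einsum("uxy,uxy->ux", P, c) + alpha * P @ J   # shape (n_actions, n_states)
pi_next = Q.argmin(axis=0)
```

Repeating evaluation and improvement until the policy no longer changes gives the full algorithm.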
Policy iteration can be shown to converge "linearly" to the optimal policy $\pi^*$: the deviation between the current cost and the optimal cost vanishes, being bounded by a geometric sequence whose ratio is strictly smaller than 1. For some classical problems, the algorithm even terminates after a finite number of iterations.
5.3.5.3 Value-Function Iteration Method
This algorithm generates a sequence of cost functions that are not necessarily associated with feasible policies, but that converges to the optimal cost function. As before, we write down iteration n + 1 of this algorithm, starting from the cost function $J_n$ produced at iteration n:
- $Q_n$ is computed as the value function associated with the cost function $J_n$;
- the updated cost function is
\[
J_{n+1}(x) = \min_{u \,/\, (x,u)\in A} Q_n(x,u).
\]
It can also be shown, by a contraction argument, that this algorithm converges linearly (i.e., at exponential speed) to the value function associated with the optimal policy. The optimal policy is then recovered from this limit value function through the classical variational Bellman equation:
\[
\pi^*(x) = \arg\min_{u \,/\, (x,u)\in A} Q^*(x,u).
\]
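As an illustration, the sketch below runs the value-function iteration on the same kind of small, randomly generated MDP, then reads the greedy policy off the limit value function as in the equation above; P, c, alpha and the stopping tolerance are again hypothetical placeholders.

```python
# Value-function iteration on a finite MDP (all numerical data are placeholders).
import numpy as np

n_states, n_actions, alpha = 4, 2, 0.9
rng = np.random.default_rng(0)
P = rng.random((n_actions, n_states, n_states))   # P[u, x, y]: transition probabilities
P /= P.sum(axis=2, keepdims=True)                 # normalize each row to sum to 1
c = rng.random((n_actions, n_states, n_states))   # c[u, x, y]: transition costs

J = np.zeros(n_states)                            # arbitrary initial cost function
for _ in range(1000):
    # Q_n(x, u) = sum_y P_u(x, y) [c(x, u, y) + alpha J_n(y)]
    Q = np.einsum("uxy,uxy->ux", P, c) + alpha * P @ J
    J_new = Q.min(axis=0)                         # J_{n+1}(x) = min_u Q_n(x, u)
    if np.max(np.abs(J_new - J)) < 1e-10:         # stop once the update is negligible
        J = J_new
        break
    J = J_new

pi_star = Q.argmin(axis=0)                        # pi*(x) = argmin_u Q*(x, u)
```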