\[ J^{\pi}_{0,N}(x) \;=\; E_{P_{\pi(x,0)}(x,\cdot)}\bigl[\, c\bigl(x,\pi(x,0),X_1\bigr) + J^{\pi}_{1,N}(X_1) \,\bigr]. \]
This identity is a very simple consequence of the summation of the elementary costs step by step along a trajectory. It shows that the optimal policy $\pi^*$, which minimizes $J^{\pi}_{0,N}$, also minimizes $J^{\pi}_{k,N}$. Thus, the following basic statement is proven:
\[ J^{\pi^*}_{0,N}(x) \;=\; \min_{u \,/\, (x,u)\in A} \; E_{P_u(x,\cdot)}\bigl[\, c(x,u,X_1) + J^{\pi^*}_{1,N}(X_1) \,\bigr]. \]
That equation, which is satisfied by the optimal policy, is called Bellman's optimality principle.
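To see what the minimization in Bellman's optimality principle computes at a single state, here is a minimal Python sketch. The two states, the actions, the transition probabilities and the costs are hypothetical placeholders, not values from the text.

```python
# One-step Bellman minimization at a single state x.
# All numerical data below is hypothetical, for illustration only.
def bellman_backup(x, actions, P, c, J_next):
    """Return min over feasible u of E_{P_u(x,.)}[ c(x, u, X1) + J_next(X1) ]."""
    return min(
        sum(p * (c(x, u, y) + J_next[y]) for y, p in P[(x, u)].items())
        for u in actions
    )

# Two-state illustration: from state "x", action "a" is random, "b" stays put.
P = {("x", "a"): {"x": 0.5, "y": 0.5}, ("x", "b"): {"x": 1.0}}
J_next = {"x": 2.0, "y": 0.0}            # cost-to-go from the next instant on
cost = lambda x, u, y: 1.0 if u == "a" else 1.5
print(bellman_backup("x", ["a", "b"], P, cost, J_next))  # 2.0, attained by "a"
```

Action "a" costs 0.5(1.0 + 2.0) + 0.5(1.0 + 0.0) = 2.0 in expectation, action "b" costs 1.5 + 2.0 = 3.5, so the minimization retains "a".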
5.3.4.2 Dynamic Programming Algorithm under Finite Horizon Assumption
Bellman's optimality principle makes it possible to derive an algorithm that solves the finite-horizon Markov decision problem: the celebrated Dynamic Programming algorithm. Its principle is to determine the optimal policy for the last instant, and then to go back in time by sequentially optimizing $J^{\pi}_{N-1,N}, \dots, J^{\pi}_{k,N}, \dots, J^{\pi}_{0,N}$.
For $k$ varying sequentially from $N-1$ down to $0$, one solves the problem
\[ \pi^*(x,k) \;=\; \operatorname{Arg\,min}_{u \,/\, (x,u)\in A} \left[\, \sum_{y\in E} P_u(x,y)\,\bigl( c(x,u,y) + J^{\pi^*}_{k+1,N}(y) \bigr) \right]. \]
Then the optimal cost is updated:
\[ J^{\pi^*}_{k,N}(x) \;=\; \sum_{y\in E} P_{\pi^*(x,k)}(x,y)\,\bigl[\, c\bigl(x,\pi^*(x,k),y\bigr) + J^{\pi^*}_{k+1,N}(y) \,\bigr]. \]
It is convenient to introduce a new quantity, the value function $Q_{k,N}$, which is defined on the set of feasible (state, action) couples:
\[ Q_{k,N}(x,u) \;=\; \sum_{y\in E} P_u(x,y)\,\bigl[\, c(x,u,y) + J^{\pi^*}_{k+1,N}(y) \,\bigr]. \]
Then the dynamic programming algorithm takes the following form:
\[
\begin{aligned}
Q_{k,N}(x,u) &= \sum_{y\in E} P_u(x,y)\,\bigl[\, c(x,u,y) + J^{\pi^*}_{k+1,N}(y) \,\bigr],\\
\pi^*(x,k) &= \operatorname{Arg\,min}_{u \,/\, (x,u)\in A} \; Q_{k,N}(x,u),\\
J^{\pi^*}_{k,N}(x) &= Q_{k,N}\bigl(x,\pi^*(x,k)\bigr).
\end{aligned}
\]
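To make the backward recursion concrete, the following Python sketch runs the dynamic programming algorithm in its $Q$-function form on a small tabular Markov decision problem. The state space $E$, the feasible action sets $A(x)$, the transition probabilities $P_u(x,y)$, the elementary cost $c(x,u,y)$ and the horizon $N$ are all illustrative assumptions, not data from the text.

```python
# Finite-horizon dynamic programming in Q-function form (illustrative sketch;
# all numerical data below is hypothetical, not taken from the text).

E = [0, 1]                                  # state space
A = {0: ["stay", "move"], 1: ["stay"]}      # feasible actions A(x)
N = 3                                       # horizon

# P[(x, u)] is the distribution P_u(x, .) of the next state.
P = {
    (0, "stay"): {0: 0.9, 1: 0.1},
    (0, "move"): {0: 0.2, 1: 0.8},
    (1, "stay"): {1: 1.0},
}

def c(x, u, y):
    # Elementary cost of the transition (x, u) -> y (hypothetical values).
    return 1.0 if u == "move" else 0.5 * y

J = {x: 0.0 for x in E}     # terminal cost J_{N,N}, taken to be zero here
policy = {}                 # policy[(x, k)] = pi*(x, k)

for k in range(N - 1, -1, -1):              # backward in time: k = N-1, ..., 0
    # Q_{k,N}(x, u) = sum_y P_u(x, y) [ c(x, u, y) + J_{k+1,N}(y) ]
    Q = {(x, u): sum(p * (c(x, u, y) + J[y]) for y, p in P[(x, u)].items())
         for x in E for u in A[x]}
    # pi*(x, k) = Arg min_u Q_{k,N}(x, u);  J_{k,N}(x) = Q_{k,N}(x, pi*(x, k))
    policy.update({(x, k): min(A[x], key=lambda u: Q[(x, u)]) for x in E})
    J = {x: Q[(x, policy[(x, k)])] for x in E}

print(policy)   # optimal action for each (state, instant) couple
print(J)        # optimal costs J_{0,N}(x) from each initial state
```

Note that each pass rebuilds the $Q$-table for instant $k$ from $J_{k+1,N}$ alone, so memory does not grow with the horizon; what is stored for every instant is the policy $\pi^*(x,k)$, which is in general non-stationary.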