J_{0,N}^{\pi}(x) = E_{P_{\pi(x,0)}(x,\cdot)}\bigl[ c(x,\pi(x,0),X_1) + J_{1,N}^{\pi}(X_1) \bigr].
This identity is a straightforward consequence of summing the elementary costs step by step along a trajectory. It shows that the optimal policy π, which minimizes J_{0,N}^π, also minimizes J_{k,N}^π for every k. Thus, the following basic statement is proven:
J_{0,N}^{\pi}(x) = \min_{u \,:\, (x,u)\in A}\; E_{P_u(x,\cdot)}\bigl[ c(x,u,X_1) + J_{1,N}^{\pi}(X_1) \bigr].
That equation, which is satisfied by the optimal policy, is called Bellman's optimality principle.
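As a concrete illustration, here is a minimal sketch (in Python, which the text itself does not use) of the one-step lookahead prescribed by the right-hand side of Bellman's optimality principle. The MDP is a hypothetical two-state toy: the names P, c, A, J1 and bellman_rhs are stand-ins introduced here for P_u(x,y), c(x,u,y), the set of feasible couples, the tail cost J_{1,N} and the bracketed expectation; they are not notation from the text.

```python
# Minimal illustration of Bellman's optimality principle (hypothetical toy MDP).
# P[u][(x, y)] stands for P_u(x, y), c[(x, u, y)] for c(x, u, y),
# A for the set of feasible (state, action) couples, and J1[y] for the
# tail cost J_{1,N}(y), assumed to be already known.
states = ["s0", "s1"]
P = {"stay": {("s0", "s0"): 1.0},
     "move": {("s0", "s1"): 0.8, ("s0", "s0"): 0.2}}
c = {("s0", "stay", "s0"): 1.0,
     ("s0", "move", "s1"): 0.5, ("s0", "move", "s0"): 2.0}
A = {("s0", "stay"), ("s0", "move")}
J1 = {"s0": 3.0, "s1": 0.0}


def bellman_rhs(x, u):
    """Expected one-step cost plus tail cost for the couple (x, u)."""
    return sum(P[u].get((x, y), 0.0) * (c.get((x, u, y), 0.0) + J1[y])
               for y in states)


# The optimal cost-to-go at x is the minimum of the right-hand side over
# the actions u that are feasible at x.
x = "s0"
feasible = [u for (s, u) in A if s == x]
u_star = min(feasible, key=lambda u: bellman_rhs(x, u))
print(u_star, round(bellman_rhs(x, u_star), 3))  # -> move 1.4 (vs. 4.0 for "stay")
```

Applying this one-step minimization backwards, instant by instant, is precisely the dynamic programming algorithm described below.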
5.3.4.2 Dynamic Programming Algorithm under Finite Horizon Assumption
Bellman's optimality principle makes it possible to derive an algorithm that solves the finite-horizon Markov decision problem: the celebrated dynamic programming algorithm. Its principle is to determine the optimal policy for the last instant and then to go backwards in time, sequentially optimizing J_{N-1,N}^π, ..., J_{k,N}^π, ..., J_{0,N}^π.
For k varying sequentially from N−1 down to 1, one solves the problem

\pi(x,k) = \arg\min_{u \,:\, (x,u)\in A}\; \sum_{y\in E} P_u(x,y)\bigl[ c(x,u,y) + J_{k+1,N}^{\pi}(y) \bigr].
Then the optimal cost is updated:

J_{k,N}^{\pi}(x) = \sum_{y\in E} P_{\pi(x,k)}(x,y)\bigl[ c(x,\pi(x,k),y) + J_{k+1,N}^{\pi}(y) \bigr].
It is convenient to introduce a new quantity, the value function Q_{k,N}, which is defined on the set of feasible (state, action) couples:

Q_{k,N}(x,u) = \sum_{y\in E} P_u(x,y)\bigl[ c(x,u,y) + J_{k+1,N}^{\pi}(y) \bigr].
Then the dynamic programming algorithm takes the following form:
Q_{k,N}(x,u) = \sum_{y\in E} P_u(x,y)\bigl[ c(x,u,y) + J_{k+1,N}^{\pi}(y) \bigr],

\pi(x,k) = \arg\min_{u \,:\, (x,u)\in A} Q_{k,N}(x,u),

J_{k,N}^{\pi}(x) = Q_{k,N}\bigl(x,\pi(x,k)\bigr).
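As a rough sketch of this algorithm, the Python fragment below runs the backward sweep on the same kind of hypothetical toy MDP: at each instant k it builds Q_{k,N}, extracts the minimizing decision π(·,k), and updates the cost-to-go J_{k,N}. The terminal condition J_{N,N} ≡ 0, the horizon N = 3, and the choice to carry the sweep down to k = 0 are assumptions made for the example, not statements from the text.

```python
# Finite-horizon dynamic programming by backward induction (hypothetical MDP).
# Same conventions as before: P[u][(x, y)] ~ P_u(x, y), c[(x, u, y)] ~ c(x, u, y),
# A = feasible (state, action) couples, N = horizon.
states = ["s0", "s1"]
P = {"stay": {("s0", "s0"): 1.0, ("s1", "s1"): 1.0},
     "move": {("s0", "s1"): 0.8, ("s0", "s0"): 0.2, ("s1", "s0"): 1.0}}
c = {("s0", "stay", "s0"): 1.0, ("s1", "stay", "s1"): 0.0,
     ("s0", "move", "s1"): 0.5, ("s0", "move", "s0"): 2.0,
     ("s1", "move", "s0"): 1.5}
A = {("s0", "stay"), ("s0", "move"), ("s1", "stay"), ("s1", "move")}
N = 3

J = {y: 0.0 for y in states}      # holds J_{k+1,N} during the sweep; J_{N,N} = 0 assumed
policy = {}                       # policy[(x, k)] = pi(x, k)

for k in range(N - 1, -1, -1):    # backwards in time: N-1, ..., 0
    # Q_{k,N}(x, u) = sum_y P_u(x, y) [ c(x, u, y) + J_{k+1,N}(y) ]
    Q = {(x, u): sum(P[u].get((x, y), 0.0) * (c.get((x, u, y), 0.0) + J[y])
                     for y in states)
         for (x, u) in A}
    J_next = {}
    for x in states:
        feasible = [u for (s, u) in A if s == x]
        u_star = min(feasible, key=lambda u: Q[(x, u)])  # arg min over feasible u
        policy[(x, k)] = u_star                          # pi(x, k)
        J_next[x] = Q[(x, u_star)]                       # J_{k,N}(x) = Q_{k,N}(x, pi(x, k))
    J = J_next

print(policy[("s0", 0)], round(J["s0"], 3))  # optimal first decision and J_{0,N}(s0)
```

Each Q value is a sum over the states, so the whole sweep costs on the order of N × |A| × |E| elementary operations, which is what makes the finite-horizon problem tractable by enumeration.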