\[ J^{\pi}_{0,N}(x) \;=\; E_{P_{\pi(x,0)}(x,\cdot)}\bigl[\, c\bigl(x,\pi(x,0),X_1\bigr) + J^{\pi}_{1,N}(X_1) \,\bigr]. \]
This identity is a very simple consequence of the summation of the elementary costs step by step along a trajectory. It shows that the optimal policy $\pi^*$, which minimizes $J^{\pi}_{0,N}$, also minimizes $J^{\pi}_{k,N}$. Thus, the following basic statement is proven:
\[ J^{\pi^*}_{0,N}(x) \;=\; \min_{u \,/\, (x,u)\in A} \; E_{P_u(x,\cdot)}\bigl[\, c(x,u,X_1) + J^{\pi^*}_{1,N}(X_1) \,\bigr]. \]
That equation, which is satisfied by the optimal policy, is called Bellman's optimality principle.
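To see what the minimization in Bellman's optimality principle computes at a single state, here is a minimal Python sketch. The two states, the actions, the transition probabilities and the costs are hypothetical placeholders, not values from the text.

```python
# One-step Bellman minimization at a single state x.
# All numerical data below is hypothetical, for illustration only.
def bellman_backup(x, actions, P, c, J_next):
    """Return min over feasible u of E_{P_u(x,.)}[ c(x, u, X1) + J_next(X1) ]."""
    return min(
        sum(p * (c(x, u, y) + J_next[y]) for y, p in P[(x, u)].items())
        for u in actions
    )

# Two-state illustration: from state "x", action "a" is random, "b" stays put.
P = {("x", "a"): {"x": 0.5, "y": 0.5}, ("x", "b"): {"x": 1.0}}
J_next = {"x": 2.0, "y": 0.0}            # cost-to-go from the next instant on
cost = lambda x, u, y: 1.0 if u == "a" else 1.5
print(bellman_backup("x", ["a", "b"], P, cost, J_next))  # 2.0, attained by "a"
```

Action "a" costs 0.5(1.0 + 2.0) + 0.5(1.0 + 0.0) = 2.0 in expectation, action "b" costs 1.5 + 2.0 = 3.5, so the minimization retains "a".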
5.3.4.2 Dynamic Programming Algorithm under Finite Horizon Assumption
Bellman's optimality principle makes it possible to derive an algorithm that solves the finite-horizon Markov decision problem: the celebrated Dynamic Programming algorithm. Its principle is to determine the optimal policy for the last instant, and then to go back in time by sequentially optimizing $J^{\pi}_{N-1,N}, \dots, J^{\pi}_{k,N}, \dots, J^{\pi}_{0,N}$.
For $k$ varying sequentially from $N-1$ down to $0$, one solves the problem
\[ \pi^*(x,k) \;=\; \operatorname{Arg\,min}_{u \,/\, (x,u)\in A} \left[\, \sum_{y\in E} P_u(x,y)\,\bigl( c(x,u,y) + J^{\pi^*}_{k+1,N}(y) \bigr) \right]. \]
Then the optimal cost is updated:
\[ J^{\pi^*}_{k,N}(x) \;=\; \sum_{y\in E} P_{\pi^*(x,k)}(x,y)\,\bigl[\, c\bigl(x,\pi^*(x,k),y\bigr) + J^{\pi^*}_{k+1,N}(y) \,\bigr]. \]
It is convenient to introduce a new quantity, the value function $Q_{k,N}$, which is defined on the set of feasible (state, action) couples:
\[ Q_{k,N}(x,u) \;=\; \sum_{y\in E} P_u(x,y)\,\bigl[\, c(x,u,y) + J^{\pi^*}_{k+1,N}(y) \,\bigr]. \]
Then the dynamic programming algorithm takes the following form:
\[
\begin{aligned}
Q_{k,N}(x,u) &= \sum_{y\in E} P_u(x,y)\,\bigl[\, c(x,u,y) + J^{\pi^*}_{k+1,N}(y) \,\bigr],\\
\pi^*(x,k) &= \operatorname{Arg\,min}_{u \,/\, (x,u)\in A} \; Q_{k,N}(x,u),\\
J^{\pi^*}_{k,N}(x) &= Q_{k,N}\bigl(x,\pi^*(x,k)\bigr).
\end{aligned}
\]
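To make the backward recursion concrete, the following Python sketch runs the dynamic programming algorithm in its $Q$-function form on a small tabular Markov decision problem. The state space $E$, the feasible action sets $A(x)$, the transition probabilities $P_u(x,y)$, the elementary cost $c(x,u,y)$ and the horizon $N$ are all illustrative assumptions, not data from the text.

```python
# Finite-horizon dynamic programming in Q-function form (illustrative sketch;
# all numerical data below is hypothetical, not taken from the text).

E = [0, 1]                                  # state space
A = {0: ["stay", "move"], 1: ["stay"]}      # feasible actions A(x)
N = 3                                       # horizon

# P[(x, u)] is the distribution P_u(x, .) of the next state.
P = {
    (0, "stay"): {0: 0.9, 1: 0.1},
    (0, "move"): {0: 0.2, 1: 0.8},
    (1, "stay"): {1: 1.0},
}

def c(x, u, y):
    # Elementary cost of the transition (x, u) -> y (hypothetical values).
    return 1.0 if u == "move" else 0.5 * y

J = {x: 0.0 for x in E}     # terminal cost J_{N,N}, taken to be zero here
policy = {}                 # policy[(x, k)] = pi*(x, k)

for k in range(N - 1, -1, -1):              # backward in time: k = N-1, ..., 0
    # Q_{k,N}(x, u) = sum_y P_u(x, y) [ c(x, u, y) + J_{k+1,N}(y) ]
    Q = {(x, u): sum(p * (c(x, u, y) + J[y]) for y, p in P[(x, u)].items())
         for x in E for u in A[x]}
    # pi*(x, k) = Arg min_u Q_{k,N}(x, u);  J_{k,N}(x) = Q_{k,N}(x, pi*(x, k))
    policy.update({(x, k): min(A[x], key=lambda u: Q[(x, u)]) for x in E})
    J = {x: Q[(x, policy[(x, k)])] for x in E}

print(policy)   # optimal action for each (state, instant) couple
print(J)        # optimal costs J_{0,N}(x) from each initial state
```

Note that each pass rebuilds the $Q$-table for instant $k$ from $J_{k+1,N}$ alone, so memory does not grow with the horizon; what is stored for every instant is the policy $\pi^*(x,k)$, which is in general non-stationary.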