$$J = \int_0^{\infty} e^{-\alpha t}\, c[x(t), u(t)]\, dt.$$
A stationary policy $\pi$ defines an autonomous dynamical system $dx/dt = f(x, \pi(x))$.
To value policy $\pi$, one must compute the state function
$$J^{\pi}(x) = \int_0^{\infty} e^{-\alpha t}\, c[x(t), \pi(x(t))]\, dt;$$
the integral is computed on the trajectory of the autonomous dynamical system originating from the initial state $x$.
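The evaluation of a policy described above can be sketched numerically. The following is a minimal illustration, not the authors' method: it assumes a hypothetical scalar system $f(x,u) = -x + u$, a quadratic cost $c(x,u) = x^2 + u^2$, a linear policy $\pi(x) = -kx$, and a simple explicit Euler scheme with a finite horizon standing in for the infinite integral. All names (`evaluate_policy`, `k`, `horizon`) are illustrative assumptions.

```python
import math


def evaluate_policy(f, c, policy, x0, alpha, dt=1e-3, horizon=20.0):
    """Approximate J^pi(x0) = int_0^inf exp(-alpha t) c(x(t), pi(x(t))) dt.

    Uses explicit Euler integration of dx/dt = f(x, pi(x)) and a left
    Riemann sum for the cost; the integral is truncated at `horizon`.
    """
    x, t, total = x0, 0.0, 0.0
    for _ in range(int(horizon / dt)):
        u = policy(x)                                  # control chosen by the policy
        total += math.exp(-alpha * t) * c(x, u) * dt   # discounted running cost
        x += f(x, u) * dt                              # Euler step along the trajectory
        t += dt
    return total


# Hypothetical example (assumed for illustration, not taken from the text).
k = 1.0
alpha = 0.1
f = lambda x, u: -x + u
c = lambda x, u: x * x + u * u
policy = lambda x: -k * x

j_numeric = evaluate_policy(f, c, policy, x0=1.0, alpha=alpha)

# For this linear system x(t) = x0 * exp(-(1 + k) t), so the integral has
# the closed form (1 + k^2) * x0^2 / (alpha + 2 (1 + k)).
j_exact = (1 + k ** 2) / (alpha + 2 * (1 + k))
```

With these assumed settings the numerical value should agree with the closed form up to the first-order error of the Euler scheme; a smaller `dt` tightens the agreement.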
Therefore, a stationary optimal policy $\pi^*$ satisfies the variational equation:
$$\pi^*(x) = \operatorname*{Arg\,min}_{u \,:\, (x,u) \in A} \left[ c(x,u) + \nabla_x (J^{\pi^*})\, \frac{dx}{dt} \right] = \operatorname*{Arg\,min}_{u \,:\, (x,u) \in A} \left[ c(x,u) + \nabla_x (J^{\pi^*})\, f(x,u) \right].$$
That equation is exactly the HJB (Hamilton-Jacobi-Bellman) equation of the control problem. When a neural network approximates the total cost of a policy $\pi$, the network can also compute the gradient of the cost function $\nabla_x (J^{\pi^*})$, which can be plugged into the previous formula. Thus, it is possible to derive a training algorithm for the continuous value function $Q$ that is defined by
$$Q(x,u) = c(x,u) + \nabla_x (J^{\pi^*})\, f(x,u)$$
and to use it within a generalized continuous Q-learning algorithm.
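As a rough sketch of how this value function $Q$ could be used to select a control, the snippet below builds $Q(x,u) = c(x,u) + \nabla_x J \cdot f(x,u)$ from a cost approximator and picks the minimizing control on a grid. Several things here are assumptions made for illustration only: the scalar dynamics and cost, the quadratic function `J` standing in for a trained neural network, the central finite difference used in place of the network's own gradient computation, and the helper names `make_q` and `greedy_control`.

```python
def make_q(f, c, J, eps=1e-5):
    """Return Q(x, u) = c(x, u) + dJ/dx(x) * f(x, u) for a scalar state.

    The gradient of J is estimated by a central finite difference; a
    neural network approximator would normally supply it directly.
    """
    def q(x, u):
        grad_j = (J(x + eps) - J(x - eps)) / (2.0 * eps)
        return c(x, u) + grad_j * f(x, u)
    return q


def greedy_control(q, x, candidates):
    """Pick the candidate control that minimizes Q(x, u)."""
    return min(candidates, key=lambda u: q(x, u))


# Hypothetical scalar example (assumed, not taken from the text).
f = lambda x, u: -x + u          # dynamics dx/dt = f(x, u)
c = lambda x, u: x * x + u * u   # running cost
p = 0.5
J = lambda x: p * x * x          # stand-in for a trained cost approximator

q = make_q(f, c, J)
candidates = [i / 100.0 for i in range(-300, 301)]  # admissible controls on a grid
u_star = greedy_control(q, 2.0, candidates)
```

For this quadratic stand-in, setting the derivative of $Q$ with respect to $u$ to zero gives $u = -p\,x$, so the grid search can be checked against that value. In a learning setting, `J` would be updated from observed costs rather than fixed in advance.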
Recent publications systematically investigate the use of reinforcement learning to learn an optimal control law when the model is not known. See, for instance, [Bertsekas et al. 1996] for a general introduction. More recently, [Doya 2000] presents a derivation of several reinforcement learning algorithms in the continuous framework and tests them using the inverted pendulum problem as a benchmark.
References
1. Anderson B.D.O., Moore J.B. [1979], Optimal Filtering, Prentice Hall
2. Azencott R., Dacunha-Castelle D. [1984], Series d'observations irregulieres. Modelisation et prevision, Masson
3. Barto A.G., Sutton R.S., Anderson C.W. [1983], Neuron-like elements that can solve difficult learning control problems, IEEE Trans. on Systems, Man and Cybernetics, 13, pp 835-846
4. Benveniste A., Metivier M., Priouret P. [1987], Algorithmes adaptatifs et approximations stochastiques. Theorie et application a l'identification, au traitement du signal et a la reconnaissance des formes, Masson