optimal feasible policy. That is not the case in partially observed Markov
decision problems: it is important to emphasize that, in such problems, the
feasible optimal policy is generally neither deterministic nor Markov.
Another way to cope with these problems, when the immediate perception
is not sufficient for determining a good policy, consists in taking advantage of
past observations to estimate the current state. When the model is unknown,
that reconstruction step may be time-consuming, and its completion may
be checked using statistical tests [Dutech 1999]. No general solution to those
problems is available, so that each application requires the design of a specific
solution.
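As a purely illustrative sketch of that state-reconstruction step, the following Python function maintains a belief, i.e., a probability distribution over the hidden states, and folds each new (action, observation) pair into it. It assumes, for the sake of the example only, that the transition and observation models are known, which is precisely what is lacking in the situations discussed above; all names and array layouts are assumptions.

```python
import numpy as np

def belief_update(belief, action, observation, transition, emission):
    """One Bayesian filtering step: estimate the current (hidden) state
    from past observations by updating a distribution over states.

    belief:      belief[s] = P(current state = s | past observations)
    transition:  transition[action][s, s_next] = P(s_next | s, action)
    emission:    emission[action][s_next, o]  = P(o | s_next, action)
    """
    predicted = belief @ transition[action]                   # prediction step
    weighted = predicted * emission[action][:, observation]   # correction by the new observation
    return weighted / weighted.sum()                          # renormalization
```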
5.4.4 Reinforcement Learning and Neuronal Approximation
5.4.4.1 Approximate Reinforcement Learning
It is often difficult to use reinforcement learning to deal with large-scale prob-
lems, since those algorithms are relatively complex. The algorithms that have
been outlined here are based on the iterative updating of a value table. Sto-
chastic approximation allows the algorithm to be implemented on an adaptive
basis: simultaneous updating of all the values is not necessary, so that the
cost information provided by a single time step can be used efficiently for
updating the relevant values of the cost function. Nevertheless, when the
cardinality of the state space or of the feasible state-action couple set is too
large, any given couple is visited only rarely: the updates of its value do
not occur often enough, and the convergence of the algorithm slows down.
Reliable results cannot then be obtained within reasonable computational times.
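To make the sparsity issue concrete, here is a minimal Python sketch of the tabular update that the preceding discussion refers to; the table sizes, the learning rate, and the cost-minimizing form of the update are illustrative assumptions rather than values taken from the text.

```python
import numpy as np

n_states, n_actions = 10_000, 20        # illustrative sizes: 200 000 table entries
Q = np.zeros((n_states, n_actions))     # one value per feasible state-action couple
alpha, gamma = 0.1, 0.95                # assumed learning rate and discount factor

def tabular_q_update(s, a, cost, s_next):
    """Update the single table entry visited at this time step.
    With a very large table, any given couple (s, a) is visited rarely,
    so its entry is updated rarely: this is the slow-convergence problem
    discussed above."""
    Q[s, a] += alpha * (cost + gamma * Q[s_next].min() - Q[s, a])
```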
An alternative solution consists in using the methodology of supervised
learning to maintain a current approximation of the cost function or of the
value function. A linear approximation or a neural network approximation
may be used. The input is the state (when value iteration of the optimistic
policy is performed) or the state-action couple (when Q-learning is performed),
and the output is the desired approximation of the updated function.
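The following sketch shows the simplest instance of this idea, a linear approximation of the Q-function in which the value table is replaced by a parameter vector; a neural network could play the same role. The class name, the feature-vector interface, and the learning rate are assumptions made for the example.

```python
import numpy as np

class LinearQ:
    """Linear approximation Q(s, a) ~ w . phi(s, a): instead of one table
    entry per state-action couple, a single parameter vector is maintained."""

    def __init__(self, n_features):
        self.w = np.zeros(n_features)

    def value(self, phi):
        """phi: feature vector of the (state, action) couple."""
        return float(self.w @ phi)

    def update(self, phi, target, alpha=0.1):
        """Supervised-style correction: move the approximation towards the
        updated value (the target) for the visited couple."""
        self.w += alpha * (target - self.value(phi)) * phi
```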
Many algorithms along those lines have been published. One of them is selected
according to the conditions of the learning process (use of a simulator or of
an experimental device) and to the relevant exploration policy.
Here is a description of one iteration of a commonly used approximate
Q-learning algorithm.
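As a purely illustrative point of comparison, a generic iteration of this kind, built on the LinearQ sketch above and on the cost-minimizing form of the update, can be written as follows; it is a standard scheme under assumed names and parameters, not necessarily the specific algorithm referred to here.

```python
def approximate_q_learning_iteration(q, phi, state, action, cost, next_state,
                                     feasible_actions, gamma=0.95):
    """One generic iteration of approximate Q-learning:
    1. observe the transition (state, action, cost, next_state);
    2. build the target cost-to-go from the current approximation;
    3. correct the approximation towards that target for the visited couple."""
    target = cost + gamma * min(q.value(phi(next_state, a))
                                for a in feasible_actions)
    q.update(phi(state, action), target)
```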