optimal feasible policy. That is not the case in partially observed Markov
decision problems: it is important to emphasize that, in such problems, the
feasible optimal policy is generally neither deterministic nor Markov.
Another way to cope with these problems, when the immediate perception
is not sufficient for determining a good policy, consists in taking advantage of
past observations to estimate the current state. When the model is unknown,
that reconstruction step may be time-consuming, and its completion may
be checked using statistical tests [Dutech 1999]. No general solution to those
problems is available, so that each application requires the design of a specific
solution.
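As a purely illustrative sketch of that state-reconstruction step, the following Python function maintains a belief, i.e., a probability distribution over the hidden states, and folds each new (action, observation) pair into it. It assumes, for the sake of the example only, that the transition and observation models are known, which is precisely what is lacking in the situations discussed above; all names and array layouts are assumptions.

```python
import numpy as np

def belief_update(belief, action, observation, transition, emission):
    """One Bayesian filtering step: estimate the current (hidden) state
    from past observations by updating a distribution over states.

    belief:      belief[s] = P(current state = s | past observations)
    transition:  transition[action][s, s_next] = P(s_next | s, action)
    emission:    emission[action][s_next, o]  = P(o | s_next, action)
    """
    predicted = belief @ transition[action]                   # prediction step
    weighted = predicted * emission[action][:, observation]   # correction by the new observation
    return weighted / weighted.sum()                          # renormalization
```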
5.4.4 Reinforcement Learning and Neuronal Approximation
5.4.4.1 Approximate Reinforcement Learning
It is often difficult to use reinforcement learning to deal with large-scale prob-
lems, since those algorithms are relatively complex. The algorithms that have
been outlined here are based on the iterative updating of a value table. Sto-
chastic approximation allows the algorithm to be implemented on an adaptive
basis: simultaneous updating of all the values is not necessary, so that the
cost information provided by a single time step can be used efficiently for
updating the relevant values of the cost function. Nevertheless, when the
cardinality of the state space or of the feasible state-action couple set is too
large, any given couple is visited only rarely: the updates of its value do
not occur often enough, and the convergence of the algorithm slows down.
Reliable results cannot then be obtained within reasonable computational times.
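To make the sparsity issue concrete, here is a minimal Python sketch of the tabular update that the preceding discussion refers to; the table sizes, the learning rate, and the cost-minimizing form of the update are illustrative assumptions rather than values taken from the text.

```python
import numpy as np

n_states, n_actions = 10_000, 20        # illustrative sizes: 200 000 table entries
Q = np.zeros((n_states, n_actions))     # one value per feasible state-action couple
alpha, gamma = 0.1, 0.95                # assumed learning rate and discount factor

def tabular_q_update(s, a, cost, s_next):
    """Update the single table entry visited at this time step.
    With a very large table, any given couple (s, a) is visited rarely,
    so its entry is updated rarely: this is the slow-convergence problem
    discussed above."""
    Q[s, a] += alpha * (cost + gamma * Q[s_next].min() - Q[s, a])
```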
An alternative solution consists in using the methodology of supervised
learning to maintain a current approximation of the cost function or of the
value function. A linear approximation or a neural network approximation
may be used. The input is the state (when value iteration of the optimistic
policy is performed) or the state-action couple (when Q-learning is performed),
and the output is the desired approximation of the updated function.
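The following sketch shows the simplest instance of this idea, a linear approximation of the Q-function in which the value table is replaced by a parameter vector; a neural network could play the same role. The class name, the feature-vector interface, and the learning rate are assumptions made for the example.

```python
import numpy as np

class LinearQ:
    """Linear approximation Q(s, a) ~ w . phi(s, a): instead of one table
    entry per state-action couple, a single parameter vector is maintained."""

    def __init__(self, n_features):
        self.w = np.zeros(n_features)

    def value(self, phi):
        """phi: feature vector of the (state, action) couple."""
        return float(self.w @ phi)

    def update(self, phi, target, alpha=0.1):
        """Supervised-style correction: move the approximation towards the
        updated value (the target) for the visited couple."""
        self.w += alpha * (target - self.value(phi)) * phi
```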
Many algorithms along those lines have been published. One of them is selected
according to the conditions of the learning process (use of a simulator or of
an experimental device) and to the relevant exploration policy.
Here is a description of one iteration of a commonly used approximate
Q-learning algorithm.
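As a purely illustrative point of comparison, a generic iteration of this kind, built on the LinearQ sketch above and on the cost-minimizing form of the update, can be written as follows; it is a standard scheme under assumed names and parameters, not necessarily the specific algorithm referred to here.

```python
def approximate_q_learning_iteration(q, phi, state, action, cost, next_state,
                                     feasible_actions, gamma=0.95):
    """One generic iteration of approximate Q-learning:
    1. observe the transition (state, action, cost, next_state);
    2. build the target cost-to-go from the current approximation;
    3. correct the approximation towards that target for the visited couple."""
    target = cost + gamma * min(q.value(phi(next_state, a))
                                for a in feasible_actions)
    q.update(phi(state, action), target)
```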