provides a unified way of handling those models, whatever their special archi-
tecture may be (delay distribution, etc.). That state representation was called
canonical form and was fully described in Chap. 2.
Any recurrent neural network, however complex, has a minimal state representation, called "canonical form." The algorithms that are described in the previous section may be applied to the canonical form in a straightforward way.
In Chap. 2, the paragraph that is entitled “Canonical form of dynamical
models” and complementary sections address that issue. Several examples are
presented there to illustrate the approach.
4.6 Learning for Recurrent Networks
E. Sontag proved in [Sontag 1996] that recurrent neural networks are universal approximators of controlled, observable, deterministic dynamical systems. Note that, just like Hornik's universal approximation theorem for function approximation, that theorem is not constructive: it provides no indication of either the architecture or the learning algorithm.
The main difficulty in training recurrent neural networks with a descent method (first-order gradient method or second-order method) comes from the time range of the consequences of a change in a weight value: the influence of a weight on the cost function is not limited to the current time, but propagates through the computation horizon, which is theoretically unbounded.
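To make that point concrete, here is a minimal sketch (the scalar linear model and all names are illustrative assumptions, not from the text) showing that the gradient of a cost defined at the end of the horizon accumulates one contribution per time step, so the exact gradient depends on the entire past:

```python
import numpy as np

# Minimal sketch (illustrative assumption): scalar recurrent model
# x[t+1] = w * x[t] + u[t].  The exact derivative dx[T]/dw obeys the
# recursion dx[t+1]/dw = x[t] + w * dx[t]/dw, i.e. it picks up one
# term per time step of the horizon.

def run(w, u, x0=0.0):
    x = x0
    for ut in u:
        x = w * x + ut
    return x

def grad_final_state(w, u, x0=0.0):
    x, dx = x0, 0.0
    for ut in u:
        dx = x + w * dx      # one new contribution per time step
        x = w * x + ut
    return dx

u = np.array([1.0, -0.5, 2.0, 0.3, 1.2])
w = 0.9
g = grad_final_state(w, u)

# Sanity check against a central finite difference.
eps = 1e-6
g_num = (run(w + eps, u) - run(w - eps, u)) / (2 * eps)
print(abs(g - g_num) < 1e-6)  # → True
```

Truncating the recursion after a few steps (as the approximate algorithms below do) amounts to dropping the oldest contributions to `dx`.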
In a rigorous mathematical treatment, the computation of the gradient of the cost function requires propagating the computation, for each example, over the full computation horizon, computing the weight correction, and iterating as necessary. Training a recurrent network would then be a very expensive procedure for very long training sequences, and would be difficult to implement in real-time applications. Therefore, when recurrent neural network architectures were suggested for dynamical system identification and control, approximate solutions were used. The seminal paper [Williams 1989] presents an interesting approach.
When the state of the system is completely known because it is measured at each time step, there is no particular problem: a teacher-forcing algorithm can readily be implemented, although (see Chap. 2) that technique is appropriate only in applications where the relevant uncertainty is modeled by state noise. That approach was shown to perform poorly when measurement noise must be taken into account, which is a very frequent situation in industrial applications.
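As an illustration of teacher forcing under the full-state-measurement assumption, the sketch below (the linear process, its coefficients, and the noise level are assumptions for the example, not from the text) trains a one-step predictor by feeding the *measured* state back at every step, so each time step becomes an independent static training example and no gradient needs to flow through time:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic measured trajectory of a stable linear process with state noise
# (assumed example: x[t+1] = 0.8*x[t] + 0.5*u[t] + noise).
a_true, b_true, T = 0.8, 0.5, 200
u = rng.normal(size=T)
x = np.zeros(T + 1)
for t in range(T):
    x[t + 1] = a_true * x[t] + b_true * u[t] + 0.01 * rng.normal()

# Teacher forcing: the predictor x̂[t+1] = a*x[t] + b*u[t] always receives
# the measured x[t], never its own previous prediction, so plain per-step
# gradient descent on the one-step-ahead squared error suffices.
a, b, lr = 0.0, 0.0, 0.01
for epoch in range(200):
    for t in range(T):
        err = (a * x[t] + b * u[t]) - x[t + 1]
        a -= lr * err * x[t]
        b -= lr * err * u[t]

print(round(a, 2), round(b, 2))  # close to the true values 0.8 and 0.5
```

Note that this works here precisely because the uncertainty is state noise; with significant measurement noise, feeding the measured state back would bias the estimates, which is the weakness the text points out.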
In the general case, where knowledge of the state of the process is corrupted by measurement noise, or where the state is not fully measured, one must choose between two approximations:

Either compute the true gradient with respect to the current weights, but change the cost function by truncating the computation period to a sliding window: that is called back-propagation through time (BPTT)
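The sliding-window truncation can be sketched as follows (a minimal illustration under assumed choices: a scalar tanh recurrence, a window of 5 steps, and a toy teacher signal; none of this is prescribed by the text). The backward pass runs only over the window, so contributions from earlier time steps are deliberately discarded:

```python
import numpy as np

def tbptt_step(w, x0, u_win, d_win, lr):
    """One truncated-BPTT update for x[t+1] = tanh(w*x[t] + u[t])."""
    # Forward pass through the window.
    xs = [x0]
    for u in u_win:
        xs.append(np.tanh(w * xs[-1] + u))
    # Backward pass through the window only: gradient contributions from
    # before the window are discarded (the truncation).
    grad, delta = 0.0, 0.0                        # delta = dJ/dx flowing back
    for k in reversed(range(len(u_win))):
        delta += 2.0 * (xs[k + 1] - d_win[k])     # error injected at step k
        dpre = delta * (1.0 - xs[k + 1] ** 2)     # tanh'(p) = 1 - tanh(p)^2
        grad += dpre * xs[k]                      # contribution to dJ/dw
        delta = dpre * w                          # propagate one step back
    return w - lr * grad, xs[-1]                  # new weight, carried state

# Toy usage: track a "teacher" recurrence with weight 0.7.
rng = np.random.default_rng(1)
u = rng.normal(size=400)
x_teacher, d = 0.0, []
for ut in u:
    x_teacher = np.tanh(0.7 * x_teacher + ut)
    d.append(x_teacher)

w, x0, window = 0.0, 0.0, 5
for i in range(0, len(u) - window + 1, window):
    w, x0 = tbptt_step(w, x0, u[i:i + window], d[i:i + window], lr=0.05)
# w should have moved toward the teacher weight 0.7
```

Within a window the computed gradient is exact; the approximation lies entirely in ignoring how the window's initial state depends on the weights, which keeps the cost per update bounded regardless of sequence length.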