in reverse order, each unit multiplies its accumulated error with the derivative of its
transfer function and increments the accumulators of its source units by this quan-
tity, weighted with the efficacy of the connection. This requires the same address
computations as the forward step that computes the activities of the network. In the
same loop, the weight modifications can be computed since the source activity can
be easily accessed. It is also useful to store in the forward step not only the outputs
of the units, but also their weighted input sums, which have not yet been passed through
the transfer function, since these may be needed to compute the derivative.
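For a single fully connected layer, this backward step can be written compactly. The following Python sketch is only an illustration of the procedure described here, not code from the book; it assumes a logistic transfer function, a weight matrix W indexed as W[target, source], the source activities and the stored weighted input sums net from the forward step, and an accumulated error vector for the target units.

import numpy as np

def f(x):                 # assumed transfer function: logistic sigmoid
    return 1.0 / (1.0 + np.exp(-x))

def f_prime(net):         # derivative computed from the stored weighted input sums
    y = f(net)
    return y * (1.0 - y)

def backward_step(W, source_act, net, acc_error, eta):
    # each target unit multiplies its accumulated error with the derivative
    # of its transfer function
    delta = acc_error * f_prime(net)
    # and increments the accumulators of its source units by this quantity,
    # weighted with the efficacy of the connection (same indices as forward)
    source_error = W.T @ delta
    # in the same loop the weight modifications are computed, since the
    # source activities are readily available
    W -= eta * np.outer(delta, source_act)
    return source_error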
The choice of the learning rate η is also important for the convergence of gra-
dient descent. It should be chosen inversely proportional to the square root of the
degree of weight sharing [135] since the effective learning rate is increased by shar-
ing the weight. Hence, the learning rate is low in the lower layers of the pyramid
and increases with height.
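As a small illustration of this scaling rule (the base rate and the sharing degrees below are placeholder values, not taken from the book), the learning rate of each layer could be derived as follows:

import math

base_eta = 0.01                       # assumed global learning rate
sharing_degree = {                    # assumed number of positions sharing each layer's weights
    "layer 0 (bottom)": 4096,
    "layer 1": 1024,
    "layer 2": 256,
    "layer 3 (top)": 1,               # no weight sharing
}

for layer, degree in sharing_degree.items():
    eta = base_eta / math.sqrt(degree)   # inversely proportional to the square root of sharing
    print(f"{layer}: eta = {eta:.2e}")

The printed rates are low at the bottom of the pyramid and increase with height, as stated above.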
6.2.2 Improvements to Backpropagation
Since the training of neural networks with backpropagation can be time-consuming,
several improvements to the basic method have been developed to speed up training.
Online Training. One example of such an improvement is online training. In the orig-
inal formulation, the contributions for the weight update from all examples of the
training set must be computed before a weight is modified. This is called batch
training. If the number of training examples is large, this may be computationally
inefficient. Online training updates the weights after every example presentation.
To avoid oscillations, the examples must be presented in a randomized order. On-
line training is only an approximation to gradient descent. Nevertheless, due to the
noise introduced by the randomized example presentation, it may escape small local
minima of the error function and may even improve the generalization of the network [30].
If the training set contains much redundancy, online training can be significantly
faster than batch training since it estimates the gradient from subsets of the training
set and updates the weights more often.
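The difference between the two schemes is perhaps easiest to see in code. The sketch below uses a toy linear model with squared error; the data, the model, and the learning rate are placeholder assumptions chosen only to show when the weights are updated:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))              # assumed training inputs
t = X @ np.array([1.0, -2.0, 0.5])         # assumed targets of a linear model
eta = 0.01

def gradient(w, x, target):                # gradient of the squared error for one example
    return (x @ w - target) * x

# batch training: contributions from all examples are accumulated
# before the weights are modified
w = np.zeros(3)
for epoch in range(10):
    g = np.mean([gradient(w, x, tt) for x, tt in zip(X, t)], axis=0)
    w -= eta * g

# online training: the weights are updated after every example,
# presented in a randomized order to avoid oscillations
w = np.zeros(3)
for epoch in range(10):
    for i in rng.permutation(len(X)):
        w -= eta * gradient(w, X[i], t[i])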
Momentum Term. Another modification to speed up gradient descent is the addi-
tion of a momentum term:
∆w(t) = η ∂E/∂w + α ∆w(t−1),                              (6.7)
with 0 ≤ α < 1. It adds a fraction of the last weight update to the current update.
The momentum term makes gradient descent analogous to particles moving through
a viscous medium in a force field [180]. This averages out oscillations and helps to
overcome flat regions in the error surface.
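A minimal sketch of the update rule (6.7), where the gradient, η, and α are illustrative assumptions and the weight change is applied as w ← w − ∆w(t), i.e. descending the gradient:

import numpy as np

def momentum_step(w, grad, delta_w_prev, eta=0.01, alpha=0.9):
    # delta_w(t) = eta * dE/dw + alpha * delta_w(t-1), cf. Eq. (6.7)
    delta_w = eta * grad + alpha * delta_w_prev
    return w - delta_w, delta_w

# usage: the previous weight change is kept between updates so that
# oscillating gradient components average out
w = np.zeros(3)
delta_w = np.zeros(3)
for grad in [np.array([1.0, -0.5, 0.2])] * 5:   # assumed sequence of gradients
    w, delta_w = momentum_step(w, grad, delta_w)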
Advanced optimization methods have been developed for the supervised training
of neural networks [135]. They include conjugate gradient, second-order methods,
and adaptive step-size algorithms.