Conjugate Gradient. The main idea of conjugate gradient [66, 161] is to find a descent direction that does not disturb the result of the previous iteration. It picks a descent direction, e.g. the negative gradient, finds a minimum along this direction using a line search, and then moves in the conjugate direction, along which the gradient changes only its length, not its direction. Conjugate gradient can be viewed as an advanced way of choosing the momentum. It requires fewer iterations than generic gradient descent. However, each iteration is more expensive, and the method can only be used for batch training since an accurate line search must be performed.
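As a rough illustration, the following sketch applies a nonlinear conjugate gradient update with a crude line search to a small quadratic error function. The Fletcher-Reeves coefficient is one common choice and is not prescribed by the text; the function names and the test problem are likewise illustrative.

import numpy as np

def line_search(f, w, d, alphas=np.linspace(0.0, 1.0, 201)):
    """Crude line search: evaluate f along direction d and keep the best step."""
    values = [f(w + a * d) for a in alphas]
    return alphas[int(np.argmin(values))]

def conjugate_gradient(f, grad, w, iterations=20):
    g = grad(w)
    d = -g                                        # first direction: steepest descent
    for _ in range(iterations):
        alpha = line_search(f, w, d)              # minimum along the current direction
        w = w + alpha * d
        g_new = grad(w)
        beta = g_new @ g_new / (g @ g + 1e-12)    # Fletcher-Reeves coefficient
        d = -g_new + beta * d                     # move on in the conjugate direction
        g = g_new
    return w

# Toy quadratic error E(w) = 0.5 w^T A w - b^T w with minimum at A^{-1} b = [0, 2]
A = np.array([[3.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, 2.0])
f = lambda w: 0.5 * w @ A @ w - b @ w
grad = lambda w: A @ w - b
print(conjugate_gradient(f, grad, np.zeros(2)))   # approaches [0, 2]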
Second Order Methods. Second order methods [18] consider more information about the shape of the error function than the mere gradient. They use a quadratic approximation to estimate the curvature of the error function. Quasi-Newton methods, for example, maintain a positive definite estimate of the inverse Hessian directly, without the need for matrix inversion. They are based on gradient information only but require a line search. One popular algorithm is the BFGS method [67], which uses O(N²) operations per iteration and needs to store an N × N matrix, where N is the number of weights.
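To make the quasi-Newton approach concrete, the following minimal sketch fits the two weights of a toy linear model with SciPy's BFGS implementation, which maintains the N × N inverse-Hessian estimate internally. The toy least-squares error function and variable names are illustrative assumptions, not taken from the text.

import numpy as np
from scipy.optimize import minimize

# Toy data: y = 1 + 2x, so the optimal weights are [1, 2].
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])

def error(w):
    residual = (w[0] + w[1] * x) - y
    return 0.5 * np.sum(residual ** 2)

def gradient(w):
    residual = (w[0] + w[1] * x) - y
    return np.array([np.sum(residual), np.sum(residual * x)])

result = minimize(error, x0=np.zeros(2), method='BFGS', jac=gradient)
print(result.x)   # close to [1, 2]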
Another popular second order method is the Quickprop algorithm, proposed
by Fahlman [63]. It uses a one-dimensional quadratic approximation to the error
function for each weight, yielding the learning rule:
\[
\Delta w_{ij}(t) \;=\; \Delta w_{ij}(t-1) \cdot \frac{\nabla_{ij} E(t)}{\nabla_{ij} E(t-1) - \nabla_{ij} E(t)} ,
\qquad (6.8)
\]
where $\nabla_{ij} E(t) = \partial E(t) / \partial w_{ij}$ denotes the partial derivative of the error with respect to weight $w_{ij}$. Quickprop can be initialized using a gradient descent step. Care must be taken to avoid weight updates that are too large.
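A minimal sketch of the update of Eq. (6.8) on a toy quadratic error is given below. The learning rate of the initial gradient descent step and the cap on how fast a step may grow are illustrative choices, not values prescribed by the text.

import numpy as np

def quickprop(grad, w, eta=0.01, steps=30, max_factor=1.75):
    g_prev = grad(w)
    dw = -eta * g_prev                      # initialize with a gradient descent step
    w = w + dw
    for _ in range(steps):
        g = grad(w)
        denom = g_prev - g
        denom = np.where(np.abs(denom) > 1e-12, denom, 1e-12)   # guard against division by zero
        new_dw = dw * g / denom                                  # Eq. (6.8), per weight
        # avoid weight updates that are too large: cap growth relative to the last step
        new_dw = np.clip(new_dw, -max_factor * np.abs(dw), max_factor * np.abs(dw))
        w, dw, g_prev = w + new_dw, new_dw, g
    return w

# Toy quadratic error E(w) = sum_i 0.5 * c_i * (w_i - target_i)^2
c = np.array([1.0, 4.0])
target = np.array([2.0, -1.0])
grad = lambda w: c * (w - target)
print(quickprop(grad, np.zeros(2)))   # approaches [2, -1]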
Adaptive Step Size. The choice of the learning rate η is crucial for both the stability
and the convergence speed of gradient descent. One can try to adapt a global learning
rate automatically [202], but this is difficult when the error function has different
properties in different dimensions.
For this reason, several methods have been developed that maintain a local learning rate for each weight. The algorithm proposed by Silva and Almeida [212], for example, increases the learning rate if successive weight updates go in the same direction and decreases it otherwise. Another example is the SuperSAB algorithm [228] that combines the adaptive learning rate with a momentum term.
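The following sketch works in the spirit of such local adaptation: each weight keeps its own learning rate, which grows when successive gradient components agree in sign and shrinks when they disagree. The growth and shrink factors and the toy problem are illustrative assumptions, not the exact constants of Silva and Almeida [212] or SuperSAB [228].

import numpy as np

def adaptive_lr_descent(grad, w, eta0=0.01, up=1.2, down=0.5, steps=50):
    eta = np.full_like(w, eta0)          # one learning rate per weight
    g_prev = np.zeros_like(w)
    for _ in range(steps):
        g = grad(w)
        # grow eta where the gradient kept its sign, shrink it where the sign flipped
        eta = np.where(g * g_prev > 0, eta * up,
                       np.where(g * g_prev < 0, eta * down, eta))
        w = w - eta * g
        g_prev = g
    return w

# Toy error with very different curvature per dimension
c = np.array([1.0, 10.0])
target = np.array([3.0, -2.0])
grad = lambda w: c * (w - target)
print(adaptive_lr_descent(grad, np.zeros(2)))   # approaches [3, -2]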
One of the fastest and most robust methods for training neural networks is the
resilient propagation (RPROP) algorithm, proposed by Riedmiller and Braun [191].
It uses only the sign of the gradient $\nabla_{ij} E(t)$ and an adaptable step size $\Delta_{ij}(t)$ to change the weights:
\[
\Delta w_{ij}(t) \;=\;
\begin{cases}
-\Delta_{ij}(t), & \text{if } \nabla_{ij} E(t) > 0 \\
+\Delta_{ij}(t), & \text{if } \nabla_{ij} E(t) < 0 \\
0, & \text{else.}
\end{cases}
\qquad (6.9)
\]
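A minimal sketch of the weight change of Eq. (6.9) on a badly scaled toy problem follows. Only the sign-based update itself is taken from the text; the step-size adaptation factors and bounds used here are commonly cited defaults and are added as assumptions.

import numpy as np

def rprop(grad, w, delta0=0.1, up=1.2, down=0.5,
          delta_min=1e-6, delta_max=50.0, steps=60):
    delta = np.full_like(w, delta0)       # adaptable step size per weight
    g_prev = np.zeros_like(w)
    for _ in range(steps):
        g = grad(w)
        # grow the step where the gradient kept its sign, shrink it where the sign flipped
        delta = np.where(g * g_prev > 0, np.minimum(delta * up, delta_max), delta)
        delta = np.where(g * g_prev < 0, np.maximum(delta * down, delta_min), delta)
        # Eq. (6.9): move each weight against the sign of its gradient
        w = w - np.sign(g) * delta
        g_prev = g
    return w

# Badly scaled quadratic error; RPROP ignores the gradient magnitude entirely
c = np.array([0.5, 20.0])
target = np.array([4.0, 1.0])
grad = lambda w: c * (w - target)
print(rprop(grad, np.zeros(2)))           # approaches [4, 1]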