Conjugate Gradient. The main idea of conjugate gradient [66, 161] is to find a descent direction that does not disturb the result of the previous iteration. It picks a descent direction, e.g. the negative gradient, finds a minimum along this direction using a line search, and then moves in the conjugate direction, along which the gradient changes only its length, not its direction. Conjugate gradient can be viewed as an advanced way of choosing the momentum. It requires fewer iterations than generic gradient descent. However, each iteration is more expensive, and the method can only be used for batch training since an accurate line search must be performed.
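As a rough illustration, the following sketch applies a nonlinear conjugate gradient update with a crude line search to a small quadratic error function. The Fletcher-Reeves coefficient is one common choice and is not prescribed by the text; the function names and the test problem are likewise illustrative.

import numpy as np

def line_search(f, w, d, alphas=np.linspace(0.0, 1.0, 201)):
    """Crude line search: evaluate f along direction d and keep the best step."""
    values = [f(w + a * d) for a in alphas]
    return alphas[int(np.argmin(values))]

def conjugate_gradient(f, grad, w, iterations=20):
    g = grad(w)
    d = -g                                        # first direction: steepest descent
    for _ in range(iterations):
        alpha = line_search(f, w, d)              # minimum along the current direction
        w = w + alpha * d
        g_new = grad(w)
        beta = g_new @ g_new / (g @ g + 1e-12)    # Fletcher-Reeves coefficient
        d = -g_new + beta * d                     # move on in the conjugate direction
        g = g_new
    return w

# Toy quadratic error E(w) = 0.5 w^T A w - b^T w with minimum at A^{-1} b = [0, 2]
A = np.array([[3.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, 2.0])
f = lambda w: 0.5 * w @ A @ w - b @ w
grad = lambda w: A @ w - b
print(conjugate_gradient(f, grad, np.zeros(2)))   # approaches [0, 2]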
Second Order Methods. Second order methods [18] consider more information about the shape of the error function than the mere gradient. They use a quadratic approximation to estimate the curvature of the error function. Quasi-Newton methods, for example, maintain a positive definite estimate of the inverse Hessian directly, without the need for matrix inversion. They are based on gradient information only but require a line search. One popular algorithm is the BFGS method [67], which uses O(N²) operations per iteration and needs to store an N × N matrix, where N is the number of weights.
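To make the quasi-Newton approach concrete, the following minimal sketch fits the two weights of a toy linear model with SciPy's BFGS implementation, which maintains the N × N inverse-Hessian estimate internally. The toy least-squares error function and variable names are illustrative assumptions, not taken from the text.

import numpy as np
from scipy.optimize import minimize

# Toy data: y = 1 + 2x, so the optimal weights are [1, 2].
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])

def error(w):
    residual = (w[0] + w[1] * x) - y
    return 0.5 * np.sum(residual ** 2)

def gradient(w):
    residual = (w[0] + w[1] * x) - y
    return np.array([np.sum(residual), np.sum(residual * x)])

result = minimize(error, x0=np.zeros(2), method='BFGS', jac=gradient)
print(result.x)   # close to [1, 2]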
Another popular second order method is the Quickprop algorithm, proposed
by Fahlman [63]. It uses a one-dimensional quadratic approximation to the error
function for each weight, yielding the learning rule:
\[
\Delta w_{ij}(t) \;=\; \Delta w_{ij}(t-1) \cdot \frac{\nabla_{ij} E(t)}{\nabla_{ij} E(t-1) - \nabla_{ij} E(t)} ,
\qquad (6.8)
\]
where $\nabla_{ij} E(t) = \partial E(t) / \partial w_{ij}$ denotes the partial derivative of the error with respect to weight $w_{ij}$. Quickprop can be initialized using a gradient descent step. Care must be taken to avoid weight updates that are too large.
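A minimal sketch of the update of Eq. (6.8) on a toy quadratic error is given below. The learning rate of the initial gradient descent step and the cap on how fast a step may grow are illustrative choices, not values prescribed by the text.

import numpy as np

def quickprop(grad, w, eta=0.01, steps=30, max_factor=1.75):
    g_prev = grad(w)
    dw = -eta * g_prev                      # initialize with a gradient descent step
    w = w + dw
    for _ in range(steps):
        g = grad(w)
        denom = g_prev - g
        denom = np.where(np.abs(denom) > 1e-12, denom, 1e-12)   # guard against division by zero
        new_dw = dw * g / denom                                  # Eq. (6.8), per weight
        # avoid weight updates that are too large: cap growth relative to the last step
        new_dw = np.clip(new_dw, -max_factor * np.abs(dw), max_factor * np.abs(dw))
        w, dw, g_prev = w + new_dw, new_dw, g
    return w

# Toy quadratic error E(w) = sum_i 0.5 * c_i * (w_i - target_i)^2
c = np.array([1.0, 4.0])
target = np.array([2.0, -1.0])
grad = lambda w: c * (w - target)
print(quickprop(grad, np.zeros(2)))   # approaches [2, -1]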
Adaptive Step Size. The choice of the learning rate η is crucial for both the stability
and the convergence speed of gradient descent. One can try to adapt a global learning
rate automatically [202], but this is difficult when the error function has different
properties in different dimensions.
For this reason, several methods have been developed that maintain a local learning rate for each weight. The algorithm proposed by Silva and Almeida [212], for example, increases the learning rate if successive weight updates go in the same direction and decreases it otherwise. Another example is the SuperSAB algorithm [228] that combines the adaptive learning rate with a momentum term.
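The following sketch works in the spirit of such local adaptation: each weight keeps its own learning rate, which grows when successive gradient components agree in sign and shrinks when they disagree. The growth and shrink factors and the toy problem are illustrative assumptions, not the exact constants of Silva and Almeida [212] or SuperSAB [228].

import numpy as np

def adaptive_lr_descent(grad, w, eta0=0.01, up=1.2, down=0.5, steps=50):
    eta = np.full_like(w, eta0)          # one learning rate per weight
    g_prev = np.zeros_like(w)
    for _ in range(steps):
        g = grad(w)
        # grow eta where the gradient kept its sign, shrink it where the sign flipped
        eta = np.where(g * g_prev > 0, eta * up,
                       np.where(g * g_prev < 0, eta * down, eta))
        w = w - eta * g
        g_prev = g
    return w

# Toy error with very different curvature per dimension
c = np.array([1.0, 10.0])
target = np.array([3.0, -2.0])
grad = lambda w: c * (w - target)
print(adaptive_lr_descent(grad, np.zeros(2)))   # approaches [3, -2]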
One of the fastest and most robust methods for training neural networks is the
resilient propagation (RPROP) algorithm, proposed by Riedmiller and Braun [191].
It uses only the sign of the gradient $\nabla_{ij} E(t)$ and an adaptable step size $\Delta_{ij}(t)$ to change the weights:
\[
\Delta w_{ij}(t) \;=\;
\begin{cases}
-\Delta_{ij}(t), & \text{if } \nabla_{ij} E(t) > 0 \\
+\Delta_{ij}(t), & \text{if } \nabla_{ij} E(t) < 0 \\
0, & \text{else.}
\end{cases}
\qquad (6.9)
\]
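A minimal sketch of the weight change of Eq. (6.9) on a badly scaled toy problem follows. Only the sign-based update itself is taken from the text; the step-size adaptation factors and bounds used here are commonly cited defaults and are added as assumptions.

import numpy as np

def rprop(grad, w, delta0=0.1, up=1.2, down=0.5,
          delta_min=1e-6, delta_max=50.0, steps=60):
    delta = np.full_like(w, delta0)       # adaptable step size per weight
    g_prev = np.zeros_like(w)
    for _ in range(steps):
        g = grad(w)
        # grow the step where the gradient kept its sign, shrink it where the sign flipped
        delta = np.where(g * g_prev > 0, np.minimum(delta * up, delta_max), delta)
        delta = np.where(g * g_prev < 0, np.maximum(delta * down, delta_min), delta)
        # Eq. (6.9): move each weight against the sign of its gradient
        w = w - np.sign(g) * delta
        g_prev = g
    return w

# Badly scaled quadratic error; RPROP ignores the gradient magnitude entirely
c = np.array([0.5, 20.0])
target = np.array([4.0, 1.0])
grad = lambda w: c * (w - target)
print(rprop(grad, np.zeros(2)))           # approaches [4, 1]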