In the neighborhood of a minimum $w^*$, the cost function can be approximated by its second-order Taylor expansion,
$$J(w) = J(w^*) + \frac{1}{2}\,(w - w^*)^2 \left.\frac{d^2 J}{dw^2}\right|_{w = w^*} + O\big((w - w^*)^3\big),$$
for the gradient of the cost function is zero at the minimum. Differentiating
the above relation with respect to w gives an approximation of the gradient
of the cost function in the neighborhood of a minimum,
$$\frac{dJ}{dw} = (w - w^*)\left.\frac{d^2 J}{dw^2}\right|_{w = w^*}.$$
Therefore, if variable $w$ is in the neighborhood of $w^*$, the minimum could
be reached in a single iteration if the second derivative of the cost function at
the minimum were known: $w$ would simply be updated by the amount
$$\Delta w = -\,\frac{dJ/dw}{\left.\left(d^2 J/dw^2\right)\right|_{w = w^*}}.$$
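As a quick illustration of this single-variable update (a worked example added here with a hypothetical cost, not taken from the text), consider $J(w) = (w - 2)^2$, whose minimum is at $w^* = 2$. Its derivatives are $dJ/dw = 2(w - 2)$ and $d^2J/dw^2 = 2$ everywhere, so starting from $w = 5$,
$$\Delta w = -\frac{2\,(5 - 2)}{2} = -3,$$
and the updated value $w + \Delta w = 2 = w^*$ is reached in a single iteration, as claimed.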
The same argument holds for a function of several variables, except for the
fact that the second derivative becomes the Hessian matrix $H(w)$ of the
cost function, whose general term is $\partial^2 J / (\partial w_i\,\partial w_j)$: in order to reach the
minimum of the cost function in a single iteration, the weight vector should
be updated (provided the Hessian matrix is invertible) by the amount
$$\Delta w = -\,[H(w^*)]^{-1}\,\nabla J(w).$$
Thus, by contrast to simple gradient descent, the direction of motion, in
parameter space, is not the direction of the gradient, but a linear transformation
of the gradient.
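The following minimal sketch (in Python with NumPy, not from the original text) shows one such Newton update $\Delta w = -[H]^{-1}\,\nabla J(w)$ on a hypothetical quadratic cost $J(w) = \frac{1}{2} w^\top A w - b^\top w$; for a quadratic cost the Hessian is constant, so the Hessian at the minimum is available and a single step suffices, exactly as in the argument above. The matrices and function names are illustrative.

```python
import numpy as np

# Hypothetical quadratic cost J(w) = 1/2 w^T A w - b^T w; its minimum w*
# solves A w = b, and its Hessian is the constant matrix A.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, -1.0])

def gradient(w):
    return A @ w - b          # gradient of J at w

def hessian(w):
    return A                  # H(w), constant for a quadratic cost

w = np.array([5.0, -4.0])     # arbitrary starting point
# Solve H * delta_w = -grad J(w) instead of inverting H explicitly.
delta_w = np.linalg.solve(hessian(w), -gradient(w))
w = w + delta_w
print(np.allclose(gradient(w), 0.0))  # True: the minimum is reached in one step
```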
Clearly, that relation is not applicable in practice, since vector $w^*$ is not
known. However, it suggests several iterative techniques that use an approximation
of the Hessian matrix (or of its inverse). We discuss two of them in the
additional material at the end of the present chapter: the Broyden-Fletcher-Goldfarb-Shanno
algorithm (BFGS algorithm [Broyden 1970]) and the Levenberg-Marquardt
algorithm ([Levenberg et al. 1944; Marquardt et al. 1963]).
Obviously, those minimization methods are by no means specific to neural
networks. Detailed descriptions are to be found in [Press et al. 1992], where
the conjugate gradient method is also discussed.
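As a purely illustrative sketch of how such a quasi-Newton method is used in practice (the data, model and variable names below are made up for this example), SciPy's general-purpose minimizer can run BFGS given only the cost function and its gradient:

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical synthetic least-squares problem, for illustration only.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=(100, 2))
w_true = np.array([1.5, -0.7])
y = x @ w_true + 0.1 * rng.standard_normal(100)

def cost(w):
    r = x @ w - y             # residuals of a linear model
    return 0.5 * np.mean(r ** 2)

def gradient(w):
    return x.T @ (x @ w - y) / len(y)

# BFGS builds an approximation of the inverse Hessian from successive
# gradient evaluations, so the true Hessian never has to be computed.
result = minimize(cost, x0=np.zeros(2), jac=gradient, method="BFGS")
print(result.x)               # close to w_true
```

The Levenberg-Marquardt algorithm, being tailored to least-squares costs, instead approximates the Hessian from first derivatives of the model output; both methods are detailed in the additional material referred to above.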
What to Do in Practice?
First of all, one should by all means refrain from using simple gradient descent
and its variants: their convergence times to a minimum (in number of
iterations and in net computation time) are larger than those of second-order
methods by several orders of magnitude. Simple gradient descent should be used only
in extreme cases, for very large networks (several thousands of parameters) that
may be useful in image processing with low-level picture representations, or for
very large databases (with millions of examples). In such cases, minimization
 