In the neighborhood of a minimum $w^*$, the cost function can be approximated by its second-order Taylor expansion,
$$J(w) = J(w^*) + \frac{1}{2}\,(w - w^*)^2 \left.\frac{d^2 J}{dw^2}\right|_{w = w^*} + O\big((w - w^*)^3\big),$$
for the gradient of the cost function is zero at the minimum. Differentiating
the above relation with respect to w gives an approximation of the gradient
of the cost function in the neighborhood of a minimum,
$$\frac{dJ}{dw} = (w - w^*)\left.\frac{d^2 J}{dw^2}\right|_{w = w^*}.$$
Therefore, if variable $w$ is in the neighborhood of $w^*$, the minimum could
be reached in a single iteration if the second derivative of the cost function at
the minimum were known: $w$ would simply be updated by the amount
$$\Delta w = -\,\frac{dJ/dw}{\left.\left(d^2 J/dw^2\right)\right|_{w = w^*}}.$$
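As a quick illustration of this single-variable update (a worked example added here with a hypothetical cost, not taken from the text), consider $J(w) = (w - 2)^2$, whose minimum is at $w^* = 2$. Its derivatives are $dJ/dw = 2(w - 2)$ and $d^2J/dw^2 = 2$ everywhere, so starting from $w = 5$,
$$\Delta w = -\frac{2\,(5 - 2)}{2} = -3,$$
and the updated value $w + \Delta w = 2 = w^*$ is reached in a single iteration, as claimed.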
The same argument holds for a function of several variables, except for the
fact that the second derivative becomes the Hessian matrix $H(w)$ of the
cost function, whose general term is $\partial^2 J / (\partial w_i\,\partial w_j)$: in order to reach the
minimum of the cost function in a single iteration, the weight vector should
be updated (provided the Hessian matrix is invertible) by the amount
$$\Delta w = -\,[H(w^*)]^{-1}\,\nabla J(w).$$
Thus, by contrast to simple gradient descent, the direction of motion, in
parameter space, is not the direction of the gradient, but a linear transformation
of the gradient.
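The following minimal sketch (in Python with NumPy, not from the original text) shows one such Newton update $\Delta w = -[H]^{-1}\,\nabla J(w)$ on a hypothetical quadratic cost $J(w) = \frac{1}{2} w^\top A w - b^\top w$; for a quadratic cost the Hessian is constant, so the Hessian at the minimum is available and a single step suffices, exactly as in the argument above. The matrices and function names are illustrative.

```python
import numpy as np

# Hypothetical quadratic cost J(w) = 1/2 w^T A w - b^T w; its minimum w*
# solves A w = b, and its Hessian is the constant matrix A.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, -1.0])

def gradient(w):
    return A @ w - b          # gradient of J at w

def hessian(w):
    return A                  # H(w), constant for a quadratic cost

w = np.array([5.0, -4.0])     # arbitrary starting point
# Solve H * delta_w = -grad J(w) instead of inverting H explicitly.
delta_w = np.linalg.solve(hessian(w), -gradient(w))
w = w + delta_w
print(np.allclose(gradient(w), 0.0))  # True: the minimum is reached in one step
```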
Clearly, that relation is not applicable in practice, since vector $w^*$ is not
known. However, it suggests several iterative techniques that use an approximation
of the Hessian matrix (or of its inverse). We discuss two of them in the
additional material at the end of the present chapter: the Broyden-Fletcher-Goldfarb-Shanno
algorithm (BFGS algorithm [Broyden 1970]) and the Levenberg-Marquardt
algorithm ([Levenberg et al. 1944; Marquardt et al. 1963]).
Obviously, those minimization methods are by no means specific to neural
networks. Detailed descriptions are to be found in [Press et al. 1992], where
the conjugate gradient method is also discussed.
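As a purely illustrative sketch of how such a quasi-Newton method is used in practice (the data, model and variable names below are made up for this example), SciPy's general-purpose minimizer can run BFGS given only the cost function and its gradient:

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical synthetic least-squares problem, for illustration only.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=(100, 2))
w_true = np.array([1.5, -0.7])
y = x @ w_true + 0.1 * rng.standard_normal(100)

def cost(w):
    r = x @ w - y             # residuals of a linear model
    return 0.5 * np.mean(r ** 2)

def gradient(w):
    return x.T @ (x @ w - y) / len(y)

# BFGS builds an approximation of the inverse Hessian from successive
# gradient evaluations, so the true Hessian never has to be computed.
result = minimize(cost, x0=np.zeros(2), jac=gradient, method="BFGS")
print(result.x)               # close to w_true
```

The Levenberg-Marquardt algorithm, being tailored to least-squares costs, instead approximates the Hessian from first derivatives of the model output; both methods are detailed in the additional material referred to above.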
What to Do in Practice?
First of all, one should by all means refrain from using simple gradient descent
and its variants: their convergence times to a minimum (in number of
iterations and in net computation time) are larger than those of second-order
methods by several orders of magnitude. Simple gradient descent should be used only
in extreme cases, for very large networks (several thousands of parameters) that
may be useful in image processing with low-level picture representations, or for
very large databases (with millions of examples). In such cases, minimization
 