For both neurons, the search directions of the algorithm are taken conjugate with
respect to the Hessian matrix:
\[ d_i^T\,H\,d_j = 0, \qquad i \neq j \tag{4.12} \]
where the $d_i$'s are the search directions (at time instants $i$). Hence, $E$ being the
cost function, the CG algorithm can be formulated as
\[ w(t+1) = w(t) + \alpha(t)\,d(t) \tag{4.13} \]
\[ d(0) = -\nabla E(w(0)) \tag{4.14} \]
\[ d(t+1) = -\nabla E(w(t+1)) + \beta(t)\,d(t) \tag{4.15} \]
\[ \beta(t) = \frac{\nabla E^T(w(t+1))\left[\nabla E(w(t+1)) - \nabla E(w(t))\right]}{d^T(t)\left[\nabla E(w(t+1)) - \nabla E(w(t))\right]} \tag{4.16} \]
where eq. (4.16) is called the Hestenes-Stiefel formula (other formulations are
possible). The learning rate parameter is defined as
\[ \alpha(t) = -\frac{d^T(t)\,\nabla E(w(t))}{d^T(t)\,H\,d(t)} \tag{4.17} \]
In the case of these two neurons there is no need to avoid the computation of the
Hessian matrix by means of a line minimization, because this matrix is known a priori
[see eq. (2.10)]. The CG algorithm has been derived on the assumption of
a quadratic error function with a positive definite Hessian matrix: In this case it
finds the minimum after at most n iterations, with n the dimension of the weight
vector. This clearly represents a significant improvement on the simple gradient
descent approach, which could take a very large number of steps to minimize
even a quadratic error function. In the case of MCA EXIN and TLS EXIN, the
error function is not quadratic and, for MCA EXIN, the corresponding Hessian
matrix near the minimum is not positive definite; this implies the possibility
of nondescent directions. To improve the method, the scaled conjugate gradient
(SCG) algorithm [133] has been implemented. It combines the CG approach with
the model trust region approach of the Levenberg-Marquardt algorithm. It adds
some multiple ( λ ) of the unit matrix to the Hessian matrix, giving the following
learning rate:
\[ \alpha(t) = -\frac{d^T(t)\,\nabla E(w(t))}{\delta(t)} = -\frac{d^T(t)\,\nabla E(w(t))}{d^T(t)\,H\,d(t) + \lambda(t)\,\|d(t)\|^2} \tag{4.18} \]
For a positive definite Hessian matrix, δ(t) > 0. If this is not the case, the value
of λ(t) must be increased to make the denominator positive. Let the raised value
of λ(t) be called λ̄(t). In [133] the following value is proposed:
\[ \bar{\lambda}(t) = 2\left(\lambda(t) - \frac{\delta(t)}{\|d(t)\|^2}\right) \tag{4.19} \]
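A minimal Python sketch of how eqs. (4.18) and (4.19) interact (again only an illustration; the function name and the example data are invented, and the Hessian H and the parameter λ are assumed to be available) is:

import numpy as np

def scg_learning_rate(d, g, H, lam):
    """Scaled learning rate of eqs. (4.18)-(4.19): d = search direction, g = gradient."""
    delta = d @ H @ d + lam * (d @ d)        # delta(t), denominator of eq. (4.18)
    if delta <= 0:                           # Hessian not positive definite along d
        lam = 2.0 * (lam - delta / (d @ d))  # eq. (4.19): raised value of lambda
        delta = d @ H @ d + lam * (d @ d)    # equals -d^T H d, positive for lam > 0
    alpha = -(d @ g) / delta                 # eq. (4.18): scaled learning rate
    return alpha, lam

# Example with an indefinite Hessian, where the raise of lambda is triggered:
H = np.array([[1.0, 0.0], [0.0, -2.0]])
d = np.array([0.0, 1.0])
g = np.array([0.5, -1.0])
print(scg_learning_rate(d, g, H, lam=0.5))

After the raise, the denominator equals −d^T(t)Hd(t), which is positive because δ(t) ≤ 0 together with λ(t) > 0 forces d^T(t)Hd(t) < 0.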