Information Technology Reference
In-Depth Information
The above relations are valid for a single-output model. The extension to
multiple outputs is straightforward.
Because the second term of the above relation is proportional to the error
e k , it can be neglected in a first approximation, which yields
∂e k
w ( i )
∂e k
w ( i )
T
∂y k
w ( i )
∂y k
w ( i )
T .
N
N
H ( w ( i )) =
=
k =1
k =1
In the case of a model that is linear with respect to its parameters, y is a
linear function of w , so that the second term of the expression of is equal to
zero.
Several techniques can be useful for inverting matrix [ H + µ i I ].
Indirect inversion.
The inverse of a matrix can be computed recursively by the following
inversion lemma. If A , B , C and D denote four matrices, one has:
A 1 B C 1 + DA 1 B 1 DA 1 .
( A + BCD ) 1 = A 1
Moreover, with the notation ζ k =( ∂y k ) / ( w ( i )), matrix H can be con-
structed recursively by defining “partial” matrices H k , of dimension ( k , k ),
by
H k = H k− 1 + z K ( z k ) T with k =1 ,...,N . One has H = H N as desired.
If the inversion lemma is applied to the previous relation with A =
H , B
= ζ k , C = I ,and D = ζ k T , one gets:
H k− 1 1
ζ k ζ k T H k− 1 1
1+( ζ k ) T H k− 1 1
H k 1
= H k− 1 1
ζ k
At the first step ( k := 1), one takes H 0 = µ i I , which gives, at step N :
H N 1
= ( H )+ µ i I 1
.
Direct inversion.
Many inversion methods exist. Since the algorithm is iterative, and since
the line search procedure (described below) often requires several matrix in-
versions, an e cient inversion method is mandatory. Since the approximation
of the Hessian matrix µ i I is symmetric definite positive, it is advantageous
to use Cholesky's method [Press et al. 1992].
As for simple gradient descent and for BFGS, µ i must be adjusted at each
iteration. A line search method can be used as discussed in the next section.
Note that the expression of the Hessian of the cost function is specific to the
least squares cost function; in contrast to the BFGS method, the Levenberg-
Marquardt algorithm cannot be used for minimizing arbitrary cost functions,
in particular the cross-entropy cost function, often used for classification.
 
Search WWH ::




Custom Search