where $\mathbf{e}_{ij}$ is the unit vector in the weight space corresponding to the weight $w_{ij}$, and $w_{ij}$ is the weight connecting the $i$th input to the $j$th hidden unit.
To solve the minimization problem, we form the corresponding Lagrangian

$L = \frac{1}{2}\,\delta\mathbf{w}^{T}\mathbf{H}\,\delta\mathbf{w} + \lambda\left(\mathbf{e}_{ij}^{T}\,\delta\mathbf{w} + w_{ij}\right),$
where $\lambda$ is the Lagrange multiplier. The derivative of the Lagrangian with respect to $\delta\mathbf{w}$ and the constraint equation $\mathbf{e}_{ij}^{T}\,\delta\mathbf{w} + w_{ij} = 0$ define the optimal weight change

$\delta\mathbf{w} = -\,\frac{w_{ij}}{[\mathbf{H}^{-1}]_{ij}}\,\mathbf{H}^{-1}\mathbf{e}_{ij}.$
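In more detail, setting the derivative of the Lagrangian with respect to $\delta\mathbf{w}$ to zero gives

$\frac{\partial L}{\partial(\delta\mathbf{w})} = \mathbf{H}\,\delta\mathbf{w} + \lambda\,\mathbf{e}_{ij} = 0 \quad\Rightarrow\quad \delta\mathbf{w} = -\lambda\,\mathbf{H}^{-1}\mathbf{e}_{ij},$

and substituting this into the constraint $\mathbf{e}_{ij}^{T}\,\delta\mathbf{w} + w_{ij} = 0$ yields $\lambda = w_{ij}/[\mathbf{H}^{-1}]_{ij}$, from which the above expression for $\delta\mathbf{w}$ follows.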
Correspondingly, the related optimal value of the Lagrangian $L$ for the weight $w_{ij}$ is

$L_{ij} = \frac{w_{ij}^{2}}{2\,[\mathbf{H}^{-1}]_{ij}},$
where $[\mathbf{H}^{-1}]_{ij}$ is the diagonal element of the inverse Hessian matrix $\mathbf{H}^{-1}$ corresponding to the weight $w_{ij}$. The value $L_{ij}$ of the Lagrangian determined in this way represents the increase of mean square error caused by the removal of the weight $w_{ij}$ and is known as the saliency of the weight $w_{ij}$. It is obvious that, because the saliency depends on the square value of $w_{ij}$, small weight values have a low influence on the mean square error. However, because the saliency is inversely proportional to $[\mathbf{H}^{-1}]_{ij}$, small values of $[\mathbf{H}^{-1}]_{ij}$ can also have a strong influence on the mean square error.
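To make the use of these quantities concrete, the following sketch computes the saliencies from a given weight vector and inverse Hessian and carries out a single optimal-brain-surgeon pruning step. It is a minimal illustration, not an implementation from the text: it assumes the weights are flattened into a one-dimensional vector, the inverse Hessian is already available, and the function name obs_prune_step is chosen here only for the example.

import numpy as np

def obs_prune_step(w, H_inv):
    # Saliency of every weight: L_q = w_q^2 / (2 [H^-1]_qq),
    # i.e. the estimated increase of mean square error if weight q is removed.
    diag = np.diag(H_inv)
    saliency = w ** 2 / (2.0 * diag)
    # Remove the least salient weight and adjust the remaining weights:
    # delta_w = -(w_q / [H^-1]_qq) * H^-1 e_q, where H^-1 e_q is the q-th column of H^-1.
    q = int(np.argmin(saliency))
    delta_w = -(w[q] / H_inv[q, q]) * H_inv[:, q]
    w_new = w + delta_w          # the pruned weight w_new[q] becomes (numerically) zero
    return w_new, q, float(saliency[q])

A full pruning run would repeat this step, recomputing or updating the inverse Hessian after each removal, until a stopping criterion such as the one discussed below is met.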
Although pruning methods such as optimal brain damage and optimal brain surgeon both rely on ranking the weights with respect to saliency, i.e. on the change in training error caused by pruning an individual weight, there is still an essential difference between them: the optimal brain surgeon procedure adjusts the remaining weights after removing a weight element and therefore does not require retraining of the network, whereas the optimal brain damage procedure requires this.
The disadvantage of both methods is that, if no stopping criterion is built in, the removal of the least significant weights can lead to network overfitting. As an efficient stopping criterion, the test error can be estimated using Akaike's (1970) final prediction error (FPE), or its modification that extends the estimate to the average generalization error of regularized networks (Moody, 1991).
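A possible realization of such a criterion is sketched below. It uses the classical FPE formula $\mathrm{FPE} = E_{\mathrm{train}}\,(N + p)/(N - p)$, with $N$ training samples and $p$ remaining weights (Moody's modification would replace $p$ by the effective number of parameters of the regularized network); the helper prune_least_salient_weight and the loop structure are assumptions made only for this illustration.

def fpe(train_mse, n_samples, n_params):
    # Akaike's final prediction error: an estimate of the test error
    # obtained from the training error and the model size.
    return train_mse * (n_samples + n_params) / (n_samples - n_params)

# Schematic pruning loop that stops when the FPE estimate starts to grow:
# prev_fpe = float("inf")
# while True:
#     train_mse, n_params = prune_least_salient_weight()   # assumed helper
#     current_fpe = fpe(train_mse, n_samples, n_params)
#     if current_fpe > prev_fpe:
#         break          # estimated generalization error increases: stop pruning
#     prev_fpe = current_fpe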
 