method reduces the rank of the weights in each layer by deleting the smallest salient eigen-nodes. Finally, the proposed method does not require network training.
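As a rough illustration of this kind of rank reduction, the sketch below truncates the singular value decomposition of a layer's weight matrix, discarding the least salient components. The function name, the use of SVD as the decomposition, and the fixed target rank are assumptions made purely for illustration, not details of the method described above.

import numpy as np

def reduce_rank(W, keep):
    # Decompose the weight matrix and keep only the `keep` largest
    # (most salient) singular components; the rest are deleted.
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U[:, :keep] @ np.diag(s[:keep]) @ Vt[:keep, :]

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8))          # a hypothetical layer weight matrix
W_reduced = reduce_rank(W, keep=4)
print(np.linalg.matrix_rank(W_reduced))  # prints 4

Because the discarded components are the smallest ones, the reduced matrix stays close to the original while containing fewer effective parameters, which is the general idea behind this family of pruning methods.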
A network pruning approach is preferable when designing networks with a high generalization capability, i.e. networks that are not only good enough to solve the prediction or classification problems present in the training set, but can also solve similar problems using fresh, previously unseen sets of data. This is achieved through a trade-off between the intention that the trained network should be capable of learning a broad spectrum of similar problem categories, which would require a large network, and the requirement that the network should be as simple as possible, in order to avoid overtraining.
In the practical application of trained networks there is a fundamental recommendation: where several trained networks have approximately the same final performance, the structurally simplest network should be selected as the one with the best generalization. This recommendation reflects the philosophy of Occam's razor, which holds that a scientific model should favour simplicity.
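As a small illustration of this selection rule, the sketch below picks, from a set of hypothetical trained networks, the structurally simplest one among those whose validation error is within a tolerance of the best. The candidate list, the tolerance value, and the use of parameter count as the complexity measure are all assumptions for illustration.

# (validation_error, number_of_weights) for hypothetical trained networks
candidates = [(0.101, 850), (0.100, 420), (0.099, 1200), (0.150, 90)]

best_err = min(err for err, _ in candidates)
tolerance = 0.005            # assumed: errors this close count as "the same"
comparable = [c for c in candidates if c[0] <= best_err + tolerance]

# Among comparably performing networks, prefer the fewest parameters.
chosen = min(comparable, key=lambda c: c[1])
print(chosen)                # -> (0.1, 420)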
Many training strategies have been investigated for network simplification at lower training cost. Such strategies have been developed within the framework of minimization of the error function extended by a penalty term. To this category of strategies belong:
• the weight decay approach (Hinton, 1989), a subset of regularization approaches in which the weight tuning rule is augmented by a complexity penalty term,

$\Delta w_{ij}(t) = \eta \delta_i x_j - \lambda w_{ij}(t)$,

that penalizes large weight values;
• the weight elimination approach (Weigend et al., 1991), based on minimization of a network training cost function to which a term is added that accounts for the number of parameters, giving the tuning rule

$\Delta w_{ij}(t) = \eta \delta_i x_j - \lambda \frac{2 w_{ij}(t)}{\left[1 + w_{ij}^{2}(t)\right]^{2}}$,
where λ represents the weight decay constant, δ is the local error, x is the local activation, and η is the learning rate. Both tuning rules are illustrated in the sketch below.
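As a rough sketch of how these two tuning rules behave, the following Python fragment applies a single update step of each rule to one weight; the variable names and the numerical values of η, λ, δ and x are assumptions chosen purely for illustration.

eta, lam = 0.1, 0.01        # assumed learning rate and decay constant
delta_i, x_j = 0.5, 1.0     # assumed local error and local activation

def weight_decay_step(w):
    # Delta w = eta*delta*x - lambda*w: the penalty grows with |w|,
    # so large weights are shrunk the most.
    return w + eta * delta_i * x_j - lam * w

def weight_elimination_step(w):
    # Delta w = eta*delta*x - lambda*2w/(1 + w^2)^2: the penalty
    # gradient is largest for small |w| and fades for large weights.
    return w + eta * delta_i * x_j - lam * 2 * w / (1 + w**2) ** 2

for w in (0.1, 5.0):        # one small and one large weight
    print(w, weight_decay_step(w), weight_elimination_step(w))

Running this for a small and a large weight makes the contrast discussed next concrete: the decay penalty −λw keeps growing with the weight, whereas the elimination penalty almost vanishes for large weights, leaving large, presumably important, weights nearly untouched.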
In contrast to weight decay, which shrinks large weight values more than small ones, weight elimination shrinks predominantly the small weight values and is, to a certain degree, similar to the pruning process. Hansen and Rasmussen (1994) have demonstrated that network pruning may result when the weight decay parameter is determined by the data. The added term penalizes large weight values, forcing them to take small absolute values, while retaining the other values essentially unchanged. This, however, is favourable in preventing worsening of