6.2.3 Regularization
As already discussed in Section 6.1, the capacity of a learning machine must be
appropriate for the task to ensure generalization. One way to restrict the capacity of
a neural network is to use few adaptable parameters. This can be done by keeping
the network small or by sharing weights.
Another way is to lower the capacity of a high-capacity machine by regular-
ization. Regularization constrains the parameters such that only smooth approxima-
tions to the training set are possible.
It was already mentioned that early stopping has a regularizing effect. The reason
is that the weights of a neural network are still relatively small when training is
stopped early. This limits the nonlinearity of the network, since the transfer
functions are almost linear for small inputs. Limited nonlinearity yields decision
functions that smoothly approximate the training set.
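A minimal sketch of the stopping mechanism is given below in Python. It is only an illustration, not taken from the text: it uses hypothetical synthetic data and a plain linear model rather than a network, so it shows only the validation-based stopping criterion; the learning rate, patience, and data are assumed values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (assumed): noisy linear targets, split into training and validation sets.
X = rng.normal(size=(200, 10))
y = X @ rng.normal(size=10) + 0.5 * rng.normal(size=200)
X_train, y_train = X[:100], y[:100]
X_val, y_val = X[100:], y[100:]

w = np.zeros(10)                       # small initial weights
eta = 0.01                             # learning rate (assumed)
best_val, best_w = np.inf, w.copy()
patience, wait = 20, 0                 # stop after 20 epochs without improvement

for epoch in range(1000):
    grad = X_train.T @ (X_train @ w - y_train) / len(y_train)
    w -= eta * grad
    val_err = np.mean((X_val @ w - y_val) ** 2)
    if val_err < best_val:
        best_val, best_w, wait = val_err, w.copy(), 0
    else:
        wait += 1
        if wait >= patience:           # validation error stopped improving
            break

w = best_w                             # keep weights from the best validation epoch
```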
Weight Decay. Another possibility for regularizing a neural network is weight decay.
Krogh and Hertz [128] proposed adding a term to the cost function E that penalizes
large weights:
E_d = E + ½λ Σ_k w_k² ,    (6.11)
where λ is a parameter that determines the strength of the penalty. If gradient descent
is used for learning, the penalty leads to a new term λw_k in the weight update:

Δw_k = −η ∂E_d/∂w_k = −η (∂E/∂w_k + λw_k) .    (6.12)
The new term would decay weights exponentially if no forces from the cost function
E were present.
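The update of Eq. (6.12) can be sketched in a few lines of Python. The function name decayed_update, the learning rate, and the decay strength are assumptions for illustration; with no gradient from E, each step multiplies the weights by (1 − ηλ), so they shrink exponentially as stated above.

```python
import numpy as np

eta, lam = 0.1, 0.01        # learning rate and decay strength (assumed values)

def decayed_update(w, grad_E):
    # One step of Eq. (6.12): Δw = −η (∂E/∂w + λ w)
    return w - eta * (grad_E + lam * w)

# Without any force from the cost function E (grad_E = 0), the weights decay
# exponentially: each step multiplies them by (1 − η·λ).
w = np.array([1.0, -2.0, 0.5])
for _ in range(5):
    w = decayed_update(w, grad_E=np.zeros_like(w))
```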
Low-Activity Prior. It is also possible to include terms in the cost function that
enforce properties of the representation present at hidden units. For instance, one
can force units to have a low average activity, e.g. α = 0.1:
E_a = E + ½λ Σ_k (⟨o_k⟩ − α)² ,    (6.13)
where ⟨o_k⟩ denotes the expected value of the activity of unit k. Gradient descent
yields the additional term λ(⟨o_k⟩ − α), which must be multiplied by the derivative
of the transfer function of unit k and added to its error component δ_k. A low-activity
prior for hidden units, combined with a force that produces variance, can yield sparse
representations.
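As a rough sketch of how such a penalty could enter backpropagation, the following Python fragment adds the term λ(⟨o_k⟩ − α), multiplied by the derivative of the transfer function, to the hidden-unit error components. The function name low_activity_deltas, the choice of a logistic transfer function, and the parameter values are assumptions, not taken from the text.

```python
import numpy as np

lam, alpha = 0.1, 0.1       # penalty strength and target mean activity (assumed values)

def low_activity_deltas(delta, net, o_mean):
    """Add the low-activity penalty term to the hidden-unit error components.

    delta   -- error components δ_k backpropagated from the layer above
    net     -- net inputs of the hidden units
    o_mean  -- ⟨o_k⟩: mean activity of each hidden unit over the batch
    """
    o = 1.0 / (1.0 + np.exp(-net))       # logistic transfer function (assumed)
    f_prime = o * (1.0 - o)              # its derivative
    return delta + lam * (o_mean - alpha) * f_prime

# Example use on a single pattern (all values are made up for illustration):
net = np.array([0.2, -1.0, 0.5])
o_mean = 1.0 / (1.0 + np.exp(-net))
print(low_activity_deltas(np.zeros_like(net), net, o_mean))
```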
6.3 Recurrent Neural Networks
So far, the function graph describing the neural network has been acyclic. If the graph of
primitive functions contains cycles, it is called a recurrent neural network (RNN). In