In feedforward neural networks for classification, a constraint must frequently be
obeyed: some parameters of the model must have equal values at the end of
training (this is known as the “shared weight” technique [Waibel et al. 1989]).
Since the weights are updated, at each epoch of training, as a function of the
gradient of the cost function, there is no reason why different weights, even if
initialized at equal values at the beginning of training, should stay equal even
after a single epoch. Therefore, a special procedure must be implemented.
We assume that, in a given network, \(\nu\) parameters must stay equal: \(w_1 = w_2 = \cdots = w_\nu = w\).
The corresponding component of the gradient of the cost function can be written as
\[
\frac{\partial J}{\partial w} = \frac{\partial J}{\partial w_1}\,\frac{\partial w_1}{\partial w} + \frac{\partial J}{\partial w_2}\,\frac{\partial w_2}{\partial w} + \cdots + \frac{\partial J}{\partial w_\nu}\,\frac{\partial w_\nu}{\partial w}.
\]
Because \(\partial w_1/\partial w = \partial w_2/\partial w = \cdots = \partial w_\nu/\partial w = 1\), one has
\[
\frac{\partial J}{\partial w} = \sum_{i=1}^{\nu} \frac{\partial J}{\partial w_i}.
\]
Thus, when a network contains shared weights, backpropagation must be performed, at each epoch, in the conventional way, in order to compute the partial derivatives of the cost function with respect to each of those weights; then the sum of those partial derivatives must be computed, and that value must be assigned to each of them, before updating the parameters by one of the methods discussed in the next section.
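As an illustration only, the following is a minimal sketch of that procedure, assuming that the per-copy partial derivatives have already been obtained by conventional backpropagation; the names apply_weight_sharing, gradients, and shared_groups are hypothetical, not taken from the text.

```python
# Minimal sketch of the shared-weight procedure: the gradients of all copies
# of a shared weight are summed, and the sum is assigned to every copy
# before the parameter update.

def apply_weight_sharing(gradients, shared_groups):
    """gradients: dict mapping weight names to partial derivatives dJ/dw_i.
    shared_groups: list of lists; each inner list names weights that must stay equal."""
    for group in shared_groups:
        total = sum(gradients[name] for name in group)  # dJ/dw = sum_i dJ/dw_i
        for name in group:
            gradients[name] = total                     # same value for every copy
    return gradients

# Hypothetical example: w1, w2, w3 share a single value w.
grads = {"w1": 0.10, "w2": -0.25, "w3": 0.05, "b": 0.02}
grads = apply_weight_sharing(grads, [["w1", "w2", "w3"]])
# grads["w1"] == grads["w2"] == grads["w3"] == -0.10
```

Since every copy of a shared weight receives the same gradient value, copies that are equal before an update remain equal after it, whatever update rule of the next section is used.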
2.5.2.3 Updating the Parameters as a Function of the Gradient of
the Cost Function
In the previous section, the evaluation of the gradient of the cost function, at
a given epoch of training, was discussed. The gradient is subsequently used
in an iterative minimization algorithm. The present section examines some
popular iterative schemes for the minimization of the cost function.
Simple Gradient Descent
The simple gradient descent consists in updating the weights by the following
relation, at epoch i of training:
\[
w(i) = w(i-1) - \mu_i \,\nabla J\bigl(w(i-1)\bigr), \quad \text{with } \mu_i > 0.
\]
Thus, the descent direction, in parameter space, is opposite to the direction
of the gradient. \(\mu_i\) is called the gradient step or learning rate.
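As a sketch of this update rule only, assuming a user-supplied function grad_J that returns the gradient of the cost function at a given point and a constant learning rate mu (the function name gradient_descent and its parameters are illustrative, not from the text):

```python
import numpy as np

def gradient_descent(w0, grad_J, mu=0.1, n_epochs=100):
    """Iterate w(i) = w(i-1) - mu * grad J(w(i-1)) for a fixed number of epochs."""
    w = np.asarray(w0, dtype=float)
    for _ in range(n_epochs):
        w = w - mu * grad_J(w)  # step opposite to the gradient direction
    return w

# Usage on a toy quadratic cost J(w) = ||w||^2 / 2, whose gradient is w:
w_min = gradient_descent(w0=[1.0, -2.0], grad_J=lambda w: w, mu=0.1, n_epochs=200)
```

On this toy quadratic cost the iterates shrink toward the minimum at the origin; the rate at which they do so depends entirely on the choice of the learning rate, which leads to the shortcomings discussed next.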
This very simple, attractive method has several shortcomings:
If the learning rate is too small, the cost function decreases very slowly;
if the rate is too large, the cost may increase or oscillate; that situation is