In feedforward neural networks for classification, a constraint must frequently be
obeyed: some parameters of the model must have equal values at the end of
training (this is known as the “shared weight” technique [Waibel et al. 1989]).
Since the weights are updated, at each epoch of training, as a function of the
gradient of the cost function, there is no reason why different weights, even if
initialized at equal values at the beginning of training, should stay equal even
after a single epoch. Therefore, a special procedure must be implemented.
We assume that, in a given network, \(\nu\) parameters must stay equal: \(w_1 = w_2 = \cdots = w_\nu = w\).
The corresponding component of the gradient of the cost function can be written as
\[
\frac{\partial J}{\partial w} = \frac{\partial J}{\partial w_1}\,\frac{\partial w_1}{\partial w} + \frac{\partial J}{\partial w_2}\,\frac{\partial w_2}{\partial w} + \cdots + \frac{\partial J}{\partial w_\nu}\,\frac{\partial w_\nu}{\partial w}.
\]
Because \(\partial w_1/\partial w = \partial w_2/\partial w = \cdots = \partial w_\nu/\partial w = 1\), one has
\[
\frac{\partial J}{\partial w} = \sum_{i=1}^{\nu} \frac{\partial J}{\partial w_i}.
\]
Thus, when a network contains shared weights, backpropagation must be performed, at each epoch, in the conventional way, in order to compute the partial derivatives of the cost function with respect to each of those weights; then the sum of those partial derivatives must be computed, and that value must be assigned to each of them, before updating the parameters by one of the methods discussed in the next section.
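As an illustration only, the following is a minimal sketch of that procedure, assuming that the per-copy partial derivatives have already been obtained by conventional backpropagation; the names apply_weight_sharing, gradients, and shared_groups are hypothetical, not taken from the text.

```python
# Minimal sketch of the shared-weight procedure: the gradients of all copies
# of a shared weight are summed, and the sum is assigned to every copy
# before the parameter update.

def apply_weight_sharing(gradients, shared_groups):
    """gradients: dict mapping weight names to partial derivatives dJ/dw_i.
    shared_groups: list of lists; each inner list names weights that must stay equal."""
    for group in shared_groups:
        total = sum(gradients[name] for name in group)  # dJ/dw = sum_i dJ/dw_i
        for name in group:
            gradients[name] = total                     # same value for every copy
    return gradients

# Hypothetical example: w1, w2, w3 share a single value w.
grads = {"w1": 0.10, "w2": -0.25, "w3": 0.05, "b": 0.02}
grads = apply_weight_sharing(grads, [["w1", "w2", "w3"]])
# grads["w1"] == grads["w2"] == grads["w3"] == -0.10
```

Since every copy of a shared weight receives the same gradient value, copies that are equal before an update remain equal after it, whatever update rule of the next section is used.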
2.5.2.3 Updating the Parameters as a Function of the Gradient of
the Cost Function
In the previous section, the evaluation of the gradient of the cost function, at
a given epoch of training, was discussed. The gradient is subsequently used
in an iterative minimization algorithm. The present section examines some
popular iterative schemes for the minimization of the cost function.
Simple Gradient Descent
The simple gradient descent consists in updating the weights by the following
relation, at epoch i of training:
\[
w(i) = w(i-1) - \mu_i \,\nabla J\bigl(w(i-1)\bigr), \quad \text{with } \mu_i > 0.
\]
Thus, the descent direction, in parameter space, is opposite to the direction
of the gradient. \(\mu_i\) is called the gradient step or learning rate.
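As a sketch of this update rule only, assuming a user-supplied function grad_J that returns the gradient of the cost function at a given point and a constant learning rate mu (the function name gradient_descent and its parameters are illustrative, not from the text):

```python
import numpy as np

def gradient_descent(w0, grad_J, mu=0.1, n_epochs=100):
    """Iterate w(i) = w(i-1) - mu * grad J(w(i-1)) for a fixed number of epochs."""
    w = np.asarray(w0, dtype=float)
    for _ in range(n_epochs):
        w = w - mu * grad_J(w)  # step opposite to the gradient direction
    return w

# Usage on a toy quadratic cost J(w) = ||w||^2 / 2, whose gradient is w:
w_min = gradient_descent(w0=[1.0, -2.0], grad_J=lambda w: w, mu=0.1, n_epochs=200)
```

On this toy quadratic cost the iterates shrink toward the minimum at the origin; the rate at which they do so depends entirely on the choice of the learning rate, which leads to the shortcomings discussed next.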
This very simple, attractive method has several shortcomings:
If the learning rate is too small, the cost function decreases very slowly;
if the rate is too large, the cost may increase or oscillate; that situation is