is stopped before a minimum is reached, in order to prevent overfitting. That is a regularization method called early stopping, which will be discussed in the section devoted to training with regularization.
A heuristic called "momentum term" is often mentioned in the literature ([Plaut et al. 1986]); it consists in adding to the gradient term −µ_i ∇J, in simple gradient descent, a term that is proportional to the parameter update at the previous epoch, λ[w(i − 1) − w(i − 2)]; that kind of low-pass filter may prevent oscillations and improve convergence speed if an appropriate value of λ is found.
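As an illustration, the following is a minimal sketch of simple gradient descent with such a momentum term; the function name, the constant learning rate, and the fixed number of epochs are illustrative assumptions, not prescriptions from the text.

```python
import numpy as np

def gradient_descent_with_momentum(grad_J, w0, mu=0.01, lam=0.9, n_epochs=100):
    """Simple gradient descent on a cost J(w) with a momentum term.

    grad_J : callable returning the gradient of J at w (assumed supplied)
    mu     : learning rate (the mu_i of the text, kept constant here)
    lam    : momentum coefficient lambda weighting the previous update
    """
    w = np.asarray(w0, dtype=float)
    prev_update = np.zeros_like(w)          # w(i-1) - w(i-2), zero at the start
    for _ in range(n_epochs):
        update = -mu * grad_J(w) + lam * prev_update
        w = w + update                      # w(i) = w(i-1) + update
        prev_update = update
    return w
```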
The choice between BFGS and Levenberg-Marquardt is based on computation time and memory size. The BFGS method requires starting training with simple gradient descent in order to reach the vicinity of a minimum, then switching to BFGS to speed up the convergence; there is no principled method for finding the most appropriate number of iterations of simple descent before switching to BFGS, so some trial-and-error procedure is necessary. The Levenberg-Marquardt method does not have that drawback, but it becomes demanding in memory size for large networks (on the order of a hundred parameters or more), because of the necessary matrix inversions. Therefore, the Levenberg-Marquardt method should be preferred for "small" networks, and BFGS otherwise. If time is available, both should be tried.
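The two-stage procedure described above might be sketched as follows. This is an assumption-laden illustration: it uses a constant learning rate, a user-chosen number of preliminary gradient epochs (to be set by trial and error, as noted), and SciPy's BFGS routine as the quasi-Newton step; none of these choices come from the text.

```python
import numpy as np
from scipy.optimize import minimize

def train_two_stage(cost, grad, w0, n_pre_epochs=50, mu=0.01):
    """Illustrative training: simple gradient descent, then BFGS.

    cost, grad : cost function J(w) and its gradient (assumed supplied)
    n_pre_epochs : number of plain gradient steps before switching,
                   chosen by trial and error
    """
    w = np.asarray(w0, dtype=float)
    for _ in range(n_pre_epochs):           # stage 1: reach the vicinity of a minimum
        w = w - mu * grad(w)
    # stage 2: quasi-Newton (BFGS) iterations to speed up final convergence
    result = minimize(cost, w, jac=grad, method="BFGS")
    return result.x
```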
Parameter Initialization
Since the above training methods are iterative, the parameters must be assigned initial values prior to training. The following arguments are guidelines for initialization:
The parameters related to the bias inputs (constant inputs equal to 1) must be initialized to zero, in order to ascertain that the sigmoids of the hidden neurons are initialized around zero; then, if the inputs have been appropriately normalized and centered as recommended earlier, the values of the outputs of the hidden neurons will be normalized and centered too. Moreover, it should be ascertained that the values of the outputs of the hidden neurons are not too close to +1 or −1 (the sigmoids are said to be saturated). That is important because the gradient of the cost function, which is the driving force of minimization during training, depends on the derivatives of the activation functions of the hidden neurons with respect to the potential. If the outputs of the hidden neurons are initially near +1 or −1, the derivatives are very small, so that training starts very slowly, if at all.
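A minimal sketch of these guidelines is given below, assuming tanh activation functions and a 1/sqrt(n) standard deviation for the non-bias weights; that scaling anticipates the variance argument developed in the next paragraph and is an assumption here, not a prescription from the text.

```python
import numpy as np

def init_hidden_layer(n_inputs, n_hidden, rng=None):
    """Initialize one layer of tanh hidden neurons.

    Bias parameters are set to zero; the other parameters are drawn small
    (standard deviation 1/sqrt(n_inputs), an assumed scaling) so that, for
    centered and normalized inputs, the potentials have a variance of
    order 1 and the sigmoids start far from saturation.
    """
    rng = np.random.default_rng() if rng is None else rng
    W = rng.normal(0.0, 1.0 / np.sqrt(n_inputs), size=(n_hidden, n_inputs))
    b = np.zeros(n_hidden)          # parameters related to the bias input
    return W, b

def hidden_outputs(W, b, x):
    """Outputs of the hidden neurons for one input vector x."""
    v = W @ x + b                   # potentials of the hidden neurons
    return np.tanh(v)               # should not be close to +1 or -1 initially
```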
If n is the number of inputs of the network, each hidden neuron receives n + 1 variables x_i (including the bias input). The nonzero parameters should be small enough that the potentials of the hidden neurons have a variance on the order of 1, in order to prevent the sigmoids from going into saturation. Assume that the inputs x_i can be viewed as realizations of random, identically distributed, centered and