is stopped before a minimum is reached, in order to prevent overfitting. That is a regularization method called early stopping, which will be discussed in the section devoted to training with regularization.
A heuristic called "momentum term" is often mentioned in the literature ([Plaut et al. 1986]); it consists in adding to the gradient term −µ_i ∇J, in simple gradient descent, a term that is proportional to the parameter update at the previous epoch, λ[w(i − 1) − w(i − 2)]; that kind of low-pass filter may prevent oscillations and improve convergence speed if an appropriate value of λ is found.
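As an illustration, the following is a minimal sketch of simple gradient descent with such a momentum term; the function name, the constant learning rate, and the fixed number of epochs are illustrative assumptions, not prescriptions from the text.

```python
import numpy as np

def gradient_descent_with_momentum(grad_J, w0, mu=0.01, lam=0.9, n_epochs=100):
    """Simple gradient descent on a cost J(w) with a momentum term.

    grad_J : callable returning the gradient of J at w (assumed supplied)
    mu     : learning rate (the mu_i of the text, kept constant here)
    lam    : momentum coefficient lambda weighting the previous update
    """
    w = np.asarray(w0, dtype=float)
    prev_update = np.zeros_like(w)          # w(i-1) - w(i-2), zero at the start
    for _ in range(n_epochs):
        update = -mu * grad_J(w) + lam * prev_update
        w = w + update                      # w(i) = w(i-1) + update
        prev_update = update
    return w
```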
The choice between BFGS and Levenberg-Marquardt is based on computation time and memory size. The BFGS method requires starting training with simple gradient descent in order to reach the vicinity of a minimum, then switching to BFGS to speed up the convergence; there is no principled method for finding the most appropriate number of iterations of simple descent before switching to BFGS, so some trial-and-error procedure is necessary. The Levenberg-Marquardt method does not have that drawback, but it becomes demanding in memory size for large networks (on the order of a hundred parameters or more), because of the necessary matrix inversions. Therefore, the Levenberg-Marquardt method should be preferred for "small" networks, and BFGS otherwise. If time is available, both should be tried.
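The two-stage procedure described above might be sketched as follows. This is an assumption-laden illustration: it uses a constant learning rate, a user-chosen number of preliminary gradient epochs (to be set by trial and error, as noted), and SciPy's BFGS routine as the quasi-Newton step; none of these choices come from the text.

```python
import numpy as np
from scipy.optimize import minimize

def train_two_stage(cost, grad, w0, n_pre_epochs=50, mu=0.01):
    """Illustrative training: simple gradient descent, then BFGS.

    cost, grad : cost function J(w) and its gradient (assumed supplied)
    n_pre_epochs : number of plain gradient steps before switching,
                   chosen by trial and error
    """
    w = np.asarray(w0, dtype=float)
    for _ in range(n_pre_epochs):           # stage 1: reach the vicinity of a minimum
        w = w - mu * grad(w)
    # stage 2: quasi-Newton (BFGS) iterations to speed up final convergence
    result = minimize(cost, w, jac=grad, method="BFGS")
    return result.x
```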
Parameter Initialization
Since the above training methods are iterative, the parameters must be assigned initial values prior to training. The following arguments are guidelines for initialization:
The parameters related to the bias inputs (constant inputs equal to 1) must be initialized to zero, in order to ascertain that the sigmoids of the hidden neurons are initialized around zero; then, if the inputs have been appropriately normalized and centered as recommended earlier, the values of the outputs of the hidden neurons will be normalized and centered too. Moreover, it should be ascertained that the values of the outputs of the hidden neurons are not too close to +1 or −1 (the sigmoids are said to be saturated). That is important because the gradient of the cost function, which is the driving force of minimization during training, depends on the derivatives of the activation functions of the hidden neurons with respect to the potential. If the outputs of the hidden neurons are initially near +1 or −1, the derivatives are very small, so that training starts very slowly, if at all.
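A minimal sketch of these guidelines is given below, assuming tanh activation functions and a 1/sqrt(n) standard deviation for the non-bias weights; that scaling anticipates the variance argument developed in the next paragraph and is an assumption here, not a prescription from the text.

```python
import numpy as np

def init_hidden_layer(n_inputs, n_hidden, rng=None):
    """Initialize one layer of tanh hidden neurons.

    Bias parameters are set to zero; the other parameters are drawn small
    (standard deviation 1/sqrt(n_inputs), an assumed scaling) so that, for
    centered and normalized inputs, the potentials have a variance of
    order 1 and the sigmoids start far from saturation.
    """
    rng = np.random.default_rng() if rng is None else rng
    W = rng.normal(0.0, 1.0 / np.sqrt(n_inputs), size=(n_hidden, n_inputs))
    b = np.zeros(n_hidden)          # parameters related to the bias input
    return W, b

def hidden_outputs(W, b, x):
    """Outputs of the hidden neurons for one input vector x."""
    v = W @ x + b                   # potentials of the hidden neurons
    return np.tanh(v)               # should not be close to +1 or -1 initially
```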
If n is the number of inputs of the network, each hidden neuron receives n + 1 variables x_i (including the bias input). The nonzero parameters should be small enough that the potentials of the hidden neurons have a variance on the order of 1, in order to prevent the sigmoids from going into saturation. Assume that the inputs x_i can be viewed as realizations of random, identically distributed, centered and