errors. Otherwise, the training algorithm, the program, or both should be
checked for errors.
The structure of the student network is identical to that of the teacher
network, up to a permutation of the hidden neurons. This is a consequence of
the uniqueness theorem [Sontag 1993].
Two Test Problems
Problem 1: A network with 8 inputs, 6 hidden neurons and one output is
generated by drawing weights uniformly in the interval [−20, +20]; a training
set and a test set of 1,500 examples each are generated with random inputs
from a uniform distribution in [−1, +1]; a network having the same structure is
trained as follows: initialization of the parameters from a uniform distribution
in [−0.6, +0.6], computation of the gradient by backpropagation, minimization
of the cost function by the Levenberg-Marquardt algorithm. The teacher
network is retrieved exactly (TMSE and VMSE on the order of 10⁻³¹) in
96% of trainings (48 trainings out of the 50 performed with different
initializations).
Problem 2: A network with 10 inputs, 5 hidden neurons and one output is
generated with weights drawn uniformly in [−1, +1]; a training set and a test
set are generated with random inputs from a normal distribution; training is
performed as in the previous example; the teacher network is retrieved in 96%
of the trainings if the training set has 400 examples, and in 100% of
the trainings if the training set has 2,000 examples.
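Both problems share the same teacher-student setup, so a single sketch can illustrate them. The following is only a minimal sketch, not the authors' code: it assumes tanh hidden neurons with a linear output neuron and biases (the text does not specify the activations), and it replaces the backpropagation gradient of the text with scipy's Levenberg-Marquardt solver using a numerically estimated Jacobian. The dimensions, intervals and set sizes are those of Problem 1; changing n_in, n_hid, the weight interval and the input distribution gives Problem 2.

```python
import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(0)
n_in, n_hid = 8, 6
n_params = n_in * n_hid + n_hid + n_hid + 1  # hidden weights, hidden biases, output weights, output bias

def unpack(p):
    # split the flat parameter vector into layer weights and biases
    w1 = p[:n_in * n_hid].reshape(n_hid, n_in)
    b1 = p[n_in * n_hid:n_in * n_hid + n_hid]
    w2 = p[n_in * n_hid + n_hid:n_in * n_hid + 2 * n_hid]
    b2 = p[-1]
    return w1, b1, w2, b2

def net(p, X):
    # one hidden layer of tanh neurons, linear output (assumed activations)
    w1, b1, w2, b2 = unpack(p)
    return np.tanh(X @ w1.T + b1) @ w2 + b2

# teacher network: weights drawn uniformly in [-20, +20]
p_teacher = rng.uniform(-20, 20, n_params)

# training and test sets of 1,500 examples, inputs uniform in [-1, +1]
X_train = rng.uniform(-1, 1, (1500, n_in))
X_test = rng.uniform(-1, 1, (1500, n_in))
y_train = net(p_teacher, X_train)
y_test = net(p_teacher, X_test)

# student network: same structure, parameters initialized uniformly in [-0.6, +0.6]
p0 = rng.uniform(-0.6, 0.6, n_params)

# Levenberg-Marquardt minimization of the sum of squared residuals
# (Jacobian estimated numerically here; the text uses backpropagation)
res = least_squares(lambda p: net(p, X_train) - y_train, p0, method='lm')

tmse = np.mean((net(res.x, X_train) - y_train) ** 2)
vmse = np.mean((net(res.x, X_test) - y_test) ** 2)
print(f"TMSE = {tmse:.3e}, VMSE = {vmse:.3e}")
```

Repeating the run with different seeds (different initializations) reproduces the experiment of the text, in which the teacher network is retrieved in most, but not all, trainings.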
For the same problems, training always fails to retrieve the teacher network
if simple gradient descent or stochastic gradient (see next section) is used,
with or without a momentum term.
Note that the teacher-student problem becomes difficult for some architec-
tures because of a large number of local minima.
2.5.2.4 Summary
We summarize the procedure that must be used for training a feedforward
neural network with a given number of inputs and hidden neurons:
• Initialize the parameters with the method described above.
• Compute the gradient of the cost function by backpropagation.
• Update the parameters iteratively with an appropriate minimization algorithm (simple gradient descent, BFGS, Levenberg-Marquardt, conjugate gradient, etc.).
• If a prescribed maximum number of epochs is reached, or if the norm of the variation of the parameter vector is smaller than a given threshold (the weights no longer change significantly), or if the norm of the gradient is smaller than a given threshold (a minimum has been reached), terminate the procedure; otherwise, start a new epoch by returning to the gradient computation. A minimal sketch of such a loop is given below.
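As an illustration only, the following sketch combines the three stopping criteria above in a generic training loop. Here grad_fn is a hypothetical function returning the gradient of the cost function at the current parameter vector (computed by backpropagation for a neural network), and plain gradient descent stands in for whichever minimization algorithm is chosen; the learning rate and thresholds are arbitrary defaults.

```python
import numpy as np

def train(grad_fn, p0, lr=0.01, max_epochs=10000, dp_tol=1e-8, grad_tol=1e-6):
    # grad_fn(p): gradient of the cost function at p (e.g. by backpropagation);
    # simple gradient descent is used here as the minimization algorithm.
    p = p0.copy()
    epoch = 0
    for epoch in range(1, max_epochs + 1):
        g = grad_fn(p)
        if np.linalg.norm(g) < grad_tol:        # a minimum has been reached
            break
        p_new = p - lr * g                      # parameter update for this epoch
        if np.linalg.norm(p_new - p) < dp_tol:  # weights no longer change significantly
            p = p_new
            break
        p = p_new
    return p, epoch
```

The same skeleton applies when the update line is replaced by a BFGS, Levenberg-Marquardt or conjugate-gradient step; only the parameter update changes, while the gradient computation and the stopping criteria remain as above.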