3. hyperparameter $\beta_+$ (suggested value: $10^{2}/x_{\max}$)
Training
1. update and normalize the weights according to
\[
  w(t+1) \;=\; \frac{w(t) + \Delta w}{\|\,w(t) + \Delta w\,\|}\,, \qquad
  \Delta w \;=\; \frac{\mu}{M}\,\bigl(\delta w_+ + \delta w_-\bigr),
\]
where $\mu > 0$ is the learning rate, and
\[
  \delta w_\pm \;=\; \sum_{k \in \gamma_\pm} \frac{\beta_\pm}{\cosh^2\!\bigl(\beta_\pm\,\gamma_k\bigr)}\, y_k\, x_k \,,
\]
where $\gamma_\pm$ denotes the subset of examples with positive ($\gamma_+$) and negative ($\gamma_-$) stabilities, respectively.
2. update the iteration counter and the hyperparameters: $t \leftarrow t+1$, $\beta_+ \leftarrow \beta_+ + \delta\beta_+$, with $\beta_-$ updated from $\beta_+$ (the ratio $\beta_+/\beta_-$ is kept fixed).
3. if $\beta_+$ and $\beta_-$ are sufficiently large that $\beta_\pm\,\gamma_k \gg 1$ for all $k$, no example can significantly contribute to modifying the weights (within the accuracy limits of the problem),
then stop.
else, go to training.
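The following is a minimal NumPy sketch of the training loop above. It assumes the stability is computed as $\gamma_k = y_k\, w \cdot x_k$ with the weights kept at unit norm (consistent with step 1), and that $\beta_-$ is tied to $\beta_+$ by a fixed ratio; the default values of mu, beta_plus, delta_beta, ratio and the stopping threshold big are illustrative placeholders, not the values recommended in the text.

import numpy as np

def minimerror_train(X, y, w0, mu=0.02, beta_plus=0.01, delta_beta=0.01,
                     ratio=6.0, big=10.0, max_iter=100_000):
    """Sketch of a Minimerror-style training loop (illustrative hyperparameters).

    X: (M, N) array of inputs; y: (M,) array of labels in {-1, +1};
    w0: initial weight vector (e.g., obtained from Hebb's rule).
    """
    M = len(y)
    w = w0 / np.linalg.norm(w0)
    for t in range(max_iter):
        beta_minus = beta_plus / ratio            # assumed fixed asymmetry beta_+ / beta_-
        gamma = y * (X @ w)                       # stabilities (w has unit norm)
        pos, neg = gamma >= 0.0, gamma < 0.0      # subsets gamma_+ and gamma_-

        def partial_update(mask, beta):
            # delta w_± = sum over gamma_± of beta_± / cosh^2(beta_± gamma_k) * y_k x_k
            coef = beta / np.cosh(beta * gamma[mask]) ** 2
            return (coef * y[mask]) @ X[mask]

        delta_w = (mu / M) * (partial_update(pos, beta_plus) +
                              partial_update(neg, beta_minus))
        w = w + delta_w
        w /= np.linalg.norm(w)                    # step 1: update and normalize
        beta_plus += delta_beta                   # step 2: anneal the hyperparameters

        # step 3: stop once beta*|gamma_k| >> 1 for every example, so that no
        # example can still contribute significantly to the weight update
        if np.all(beta_minus * np.abs(gamma) > big):
            break
    return w

The stopping test uses $\beta_-$ because $\beta_- \le \beta_+$ here, which is a conservative reading of step 3.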
It is possible, and often useful, to modify the learning rate and adapt it at
each iteration, as discussed in Chap. 2.
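As one illustration only (a generic "bold driver" rule, not necessarily the scheme of Chap. 2), the learning rate can be nudged up when the last update decreased the cost and cut back when it increased it:

def adapt_learning_rate(mu, cost, previous_cost, grow=1.05, shrink=0.5):
    """Generic 'bold driver' adjustment of the learning rate (illustrative only)."""
    if cost < previous_cost:
        return mu * grow    # the last step helped: increase mu cautiously
    return mu * shrink      # the cost went up: reduce mu sharply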
Remark. The Minimerror algorithm combines gradient descent with the adaptation of the hyperparameters; it converges towards a local minimum of the cost function. It has been shown [Gordon 1995] that, if the training patterns are linearly separable, minimizing the cost function for increasing values of β finds the hyperplane of maximal margin. If the examples are not linearly separable, the algorithm converges to weights that locally minimize (in the neighborhood of the hyperplane) the number of training errors. These properties are very useful for constructive training algorithms, as explained later in this chapter.
The hyperparameter β may be interpreted as the inverse of a noise, or of a temperature, $T = 1/\beta$ [Gordon 1995]. We will come back to that interpretation below. Further details, and examples of applications of Minimerror, can be found in [Raffin et al. 1995], [Torres Moreno et al. 1998], [Torres Moreno 1997] and [Godin 2000].
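To make this interpretation concrete (a sketch based only on the update rule of step 1 above, writing $c_k$ for the factor with which example $k$ enters the update):
\[
  c_k \;=\; \frac{\beta_\pm}{\cosh^2\!\bigl(\beta_\pm\,\gamma_k\bigr)}
  \;\approx\; 4\,\beta_\pm\, e^{-2\beta_\pm |\gamma_k|}
  \qquad \text{for } \beta_\pm |\gamma_k| \gg 1 ,
\]
so that $c_k$ is significant only when $|\gamma_k| \lesssim 1/\beta_\pm = T_\pm$: only the examples lying within a band of width of order $T$ around the separating hyperplane effectively drive the learning, and the stopping criterion of step 3 states that this band has become empty.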
Remark. The least-squares partial cost is of particular interest when applied to a network without hidden units, i.e., to a single neuron with a sigmoidal activation function. Since $y_k = \pm 1$, one has