3. hyperparameter $\beta_+$ (suggested value: $10^{2}/x_{\max}$)
Training
1. update and normalize the weights according to
\[
  w(t+1) \;=\; \frac{w(t) + \Delta w}{\|\,w(t) + \Delta w\,\|}\,, \qquad
  \Delta w \;=\; \frac{\mu}{M}\,\bigl(\delta w_+ + \delta w_-\bigr),
\]
where $\mu > 0$ is the learning rate, and
\[
  \delta w_\pm \;=\; \sum_{k \in \gamma_\pm} \frac{\beta_\pm}{\cosh^2\!\bigl(\beta_\pm\,\gamma_k\bigr)}\, y_k\, x_k \,,
\]
where $\gamma_\pm$ denotes the subset of examples with positive ($\gamma_+$) and negative ($\gamma_-$) stabilities, respectively.
2. update the iteration counter and the hyperparameters: $t \leftarrow t+1$, $\beta_+ \leftarrow \beta_+ + \delta\beta_+$, with $\beta_-$ updated from $\beta_+$ (the ratio $\beta_+/\beta_-$ is kept fixed).
3. if $\beta_+$ and $\beta_-$ are sufficiently large that $\beta_\pm\,\gamma_k \gg 1$ for all $k$, no example can significantly contribute to modifying the weights (within the accuracy limits of the problem),
then stop.
else, go to training.
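The following is a minimal NumPy sketch of the training loop above. It assumes the stability is computed as $\gamma_k = y_k\, w \cdot x_k$ with the weights kept at unit norm (consistent with step 1), and that $\beta_-$ is tied to $\beta_+$ by a fixed ratio; the default values of mu, beta_plus, delta_beta, ratio and the stopping threshold big are illustrative placeholders, not the values recommended in the text.

import numpy as np

def minimerror_train(X, y, w0, mu=0.02, beta_plus=0.01, delta_beta=0.01,
                     ratio=6.0, big=10.0, max_iter=100_000):
    """Sketch of a Minimerror-style training loop (illustrative hyperparameters).

    X: (M, N) array of inputs; y: (M,) array of labels in {-1, +1};
    w0: initial weight vector (e.g., obtained from Hebb's rule).
    """
    M = len(y)
    w = w0 / np.linalg.norm(w0)
    for t in range(max_iter):
        beta_minus = beta_plus / ratio            # assumed fixed asymmetry beta_+ / beta_-
        gamma = y * (X @ w)                       # stabilities (w has unit norm)
        pos, neg = gamma >= 0.0, gamma < 0.0      # subsets gamma_+ and gamma_-

        def partial_update(mask, beta):
            # delta w_± = sum over gamma_± of beta_± / cosh^2(beta_± gamma_k) * y_k x_k
            coef = beta / np.cosh(beta * gamma[mask]) ** 2
            return (coef * y[mask]) @ X[mask]

        delta_w = (mu / M) * (partial_update(pos, beta_plus) +
                              partial_update(neg, beta_minus))
        w = w + delta_w
        w /= np.linalg.norm(w)                    # step 1: update and normalize
        beta_plus += delta_beta                   # step 2: anneal the hyperparameters

        # step 3: stop once beta*|gamma_k| >> 1 for every example, so that no
        # example can still contribute significantly to the weight update
        if np.all(beta_minus * np.abs(gamma) > big):
            break
    return w

The stopping test uses $\beta_-$ because $\beta_- \le \beta_+$ here, which is a conservative reading of step 3.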
It is possible, and often useful, to modify the learning rate and adapt it at
each iteration, as discussed in Chap. 2.
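As one illustration only (a generic "bold driver" rule, not necessarily the scheme of Chap. 2), the learning rate can be nudged up when the last update decreased the cost and cut back when it increased it:

def adapt_learning_rate(mu, cost, previous_cost, grow=1.05, shrink=0.5):
    """Generic 'bold driver' adjustment of the learning rate (illustrative only)."""
    if cost < previous_cost:
        return mu * grow    # the last step helped: increase mu cautiously
    return mu * shrink      # the cost went up: reduce mu sharply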
Remark. The Minimerror algorithm combines gradient descent with the adaptation of the hyperparameters; it converges towards a local minimum of the cost function. It has been shown [Gordon 1995] that, if the training patterns are linearly separable, minimizing the cost function for increasing values of β finds the hyperplane of maximal margin. If the examples are not linearly separable, the algorithm converges to weights that locally minimize (in the neighborhood of the hyperplane) the number of training errors. These properties are very useful for constructive training algorithms, as explained later in this chapter.
The hyperparameter β may be interpreted as the inverse of a noise, or of a temperature, $T = 1/\beta$ [Gordon 1995]. We will come back to that interpretation below. Further details, and examples of applications of Minimerror, can be found in [Raffin et al. 1995], [Torres Moreno et al. 1998], [Torres Moreno 1997] and [Godin 2000].
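To make this interpretation concrete (a sketch based only on the update rule of step 1 above, writing $c_k$ for the factor with which example $k$ enters the update):
\[
  c_k \;=\; \frac{\beta_\pm}{\cosh^2\!\bigl(\beta_\pm\,\gamma_k\bigr)}
  \;\approx\; 4\,\beta_\pm\, e^{-2\beta_\pm |\gamma_k|}
  \qquad \text{for } \beta_\pm |\gamma_k| \gg 1 ,
\]
so that $c_k$ is significant only when $|\gamma_k| \lesssim 1/\beta_\pm = T_\pm$: only the examples lying within a band of width of order $T$ around the separating hyperplane effectively drive the learning, and the stopping criterion of step 3 states that this band has become empty.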
Remark. The least-squares partial cost is of particular interest when applied to a network without hidden units, i.e., to a single neuron with a sigmoidal activation function. Since $y_k = \pm 1$, one has