Remark 2. Even the examples that are correctly classified, i.e., those with γ > 0, contribute to the cost function: the closer they lie to the hyperplane, the larger their contribution.
Remark 3. If β is small enough that βγ_k ≪ 1 for all k, then all the examples contribute with almost the same prefactor, as in Hebb's rule discussed before. Moreover, in the limit β → 0, the stabilities of all the examples lie in the region where the cost function is linear (the neighborhood of γ = 0), and the prefactor in the gradient of the cost function is the same for all examples.
Remark 4. For intermediate values of β, the examples whose stabilities are large with respect to the virtual window width 1/β (i.e., β|γ| ≫ 1) do not contribute significantly to training, since their prefactor in the gradient of the cost function is exponentially small: in the limit β|γ| ≫ 1, one has 1/cosh²(βγ) < 4 exp(−2β|γ|). For example, if β|γ| > 5, the prefactor is of order 10⁻⁴. Loosely speaking, the algorithm uses for learning only the examples lying inside a virtual window of width 1/β on both sides of the hyperplane.
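To make these orders of magnitude concrete, the following short Python check (a sketch; it assumes the per-example cost ½[1 − tanh(βγ_k)], so that the gradient prefactor is proportional to 1/cosh²(βγ_k)) compares the prefactor with the bound above for a few values of β|γ|:

```python
import math

# Size of the gradient prefactor 1/cosh^2(beta*gamma) versus the bound 4*exp(-2*beta*|gamma|).
for bg in (0.5, 1.0, 2.0, 5.0):
    prefactor = 1.0 / math.cosh(bg) ** 2
    bound = 4.0 * math.exp(-2.0 * bg)
    print(f"beta*|gamma| = {bg:3.1f}   1/cosh^2 = {prefactor:.2e}   bound = {bound:.2e}")
# At beta*|gamma| = 5 both values are about 1.8e-4, i.e. of order 10^-4:
# such examples lie outside the virtual window and are essentially ignored.
```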
The above remarks are the basis of the Minimerror algorithm. The hyperparameter β, which increases throughout the iterations to optimize the solution, allows one to obtain a linear separation with a large margin if one exists, or to find surfaces that are locally discriminant (with large margins) otherwise. The weights are initialized using Hebb's rule, which corresponds to β = 0. The iterations start with β sufficiently small for all the patterns to be inside the virtual window. If ‖x‖_max is the norm of the example of largest norm, one can use β_ini = 10⁻²/‖x‖_max. Then, at each training step (iteration), the weights are updated and β is increased by a small amount δβ. This procedure is known in the literature as deterministic annealing, a concept close to that of simulated annealing, used in optimization problems (see Chap. 8 on optimization).
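The following Python sketch illustrates one such annealing step. It assumes the per-example cost ½[1 − tanh(βγ_k)] with stability γ_k = y_k w·x_k/‖w‖; only the single-β version is shown (the two-value refinement described next is omitted), and the function name and the fixed-norm convention are illustrative rather than prescriptive:

```python
import numpy as np

def minimerror_step(w, X, y, beta, mu, dbeta):
    """One deterministic-annealing step (illustrative sketch, not a reference implementation).

    X : (P, N+1) array of examples (bias component included),
    y : (P,) array of targets in {-1, +1},
    w : (N+1,) weight vector, kept at constant norm,
    beta : current value of the hyperparameter, increased by dbeta after the update.
    """
    norm = np.linalg.norm(w)
    gamma = y * (X @ w) / norm                    # stabilities of all the examples
    prefactor = 1.0 / np.cosh(beta * gamma) ** 2  # vanishes outside the virtual window
    grad = (prefactor * y) @ X                    # Hebb-like update weighted by the prefactor
    w_new = w + mu * grad                         # constant factors absorbed into mu
    w_new *= norm / np.linalg.norm(w_new)         # keep ||w|| fixed (a convention of this sketch)
    return w_new, beta + dbeta
```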
A heuristic improvement consists in considering two different values of β: β₊ for the examples with positive stability and β₋ for those with negative stability. In order to keep the number of parameters small, the ratio β₊/β₋ does not change during training. Thus, the Minimerror algorithm has three parameters: the learning rate µ, the annealing step δβ₊, and the asymmetry β₊/β₋. It proceeds as follows:
Minimerror Algorithm
Parameter Settings
1. learning rate µ (suggested value: 10⁻²),
2. ratio β₊/β₋ (suggested value: 6),
3. annealing step δβ₊ (suggested value: 10⁻²).
Initialization
1. iteration counter: t = 0,
2. weights: w(0) (suggested initialization: apply Hebb's rule and then normalize the weights to ‖w‖ = √(N + 1)).
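A minimal Python sketch of this initialization is given below; the function name and array conventions are illustrative, while the normalization to √(N + 1) and the choice β_ini = 10⁻²/‖x‖_max follow the suggestions above:

```python
import numpy as np

def minimerror_init(X, y):
    """Hebbian initialization for the sketch above (illustrative names and conventions).

    X : (P, N+1) array of examples with the bias component included,
    y : (P,) array of targets in {-1, +1}.
    Returns the initial weights w(0) and the initial value beta_ini.
    """
    w = y @ X                                        # Hebb's rule: w = sum_k y_k x_k
    w = w * np.sqrt(X.shape[1]) / np.linalg.norm(w)  # normalize to ||w|| = sqrt(N+1)
    norm_max = np.linalg.norm(X, axis=1).max()       # norm of the example of largest norm
    beta_ini = 1e-2 / norm_max                       # all patterns start inside the virtual window
    return w, beta_ini
```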