Remark 2. Even the examples that are correctly classified, i.e., with γ > 0, contribute to the cost function: the closer they lie to the hyperplane, the larger their contribution.
Remark 3.
If
β
is small enough that
βγ
k
1 for all
k
, then all the examples
contribute with almost the same prefactor, like in Hebb's rule discussed before.
Moreover, in the limit
β
0, the stabilities of all the examples are in the
region where the cost function is linear (in the neighborhood of
γ
= 0), and
the prefactor in the gradient of the cost function is the same for all examples.
→
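As a one-line check of Remarks 2 and 3, assume the per-example cost has the form V(γ_k) = (1 − tanh(βγ_k))/2, as in the usual formulation of Minimerror (the exact expression is defined earlier in the chapter). Its gradient prefactor is then

\[
-\frac{\partial V}{\partial \gamma_k}
= \frac{\beta}{2\cosh^2(\beta\gamma_k)}
\approx \frac{\beta}{2}\left(1 - (\beta\gamma_k)^2\right)
\to \frac{\beta}{2}
\quad\text{as } \beta\gamma_k \to 0,
\]

which is maximal at γ_k = 0 (Remark 2) and becomes the same for all examples when βγ_k ≪ 1 (Remark 3), recovering Hebb's rule up to the common factor β/2.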
Remark 4. For intermediate values of β, the examples with large stabilities with respect to the virtual window width 1/β (β|γ| ≫ 1) do not contribute significantly to training, since their prefactor in the gradient of the cost function is exponentially small: in the limit β|γ| ≫ 1, one has 1/cosh²(βγ) < 4 exp(−2β|γ|). For example, if β|γ| > 5, the prefactor is of order 10⁻⁴. Loosely speaking, the algorithm uses for learning only the examples lying inside a virtual window of width 1/β on both sides of the hyperplane.
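The window effect is easy to check numerically; the sketch below assumes the per-example cost (1 − tanh(βγ))/2, so the gradient prefactor is proportional to 1/cosh²(βγ):

```python
import numpy as np

beta = 1.0
gamma = np.array([0.0, 0.5, 1.0, 2.0, 5.0, 10.0])   # example stabilities

# gradient prefactor of (1 - tanh(beta*gamma))/2, up to the factor beta/2
prefactor = 1.0 / np.cosh(beta * gamma) ** 2
# exponential bound quoted in Remark 4
bound = 4.0 * np.exp(-2.0 * beta * np.abs(gamma))

for g, p, b in zip(gamma, prefactor, bound):
    print(f"beta*|gamma| = {beta * g:5.1f}   1/cosh^2 = {p:.2e}   bound = {b:.2e}")
# at beta*|gamma| = 5 the prefactor is about 1.8e-04, i.e. of order 1e-4,
# so examples outside the virtual window barely influence the gradient
```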
The above remarks form the basis of the Minimerror algorithm. The hyperparameter β, which increases throughout the iterations to optimize the solution, allows one to obtain a linear separation with large margin if one exists, or to find surfaces that are locally discriminant (with large margins) otherwise.
The weights are initialized using Hebb's rule, which corresponds to β = 0. The iterations start with β sufficiently small for all the patterns to be inside the virtual window. If ‖x_max‖ is the norm of the example of largest norm, one can use β_ini = 10⁻²/‖x_max‖. Then, at each training step (iteration) the weights are updated and β is increased by a small amount δβ. This procedure is known in the literature as deterministic annealing, a concept close to that of simulated annealing used in optimization problems (see Chap. 8 on optimization).
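As an illustration, here is a minimal sketch of this deterministic-annealing loop, assuming the per-example cost (1 − tanh(βγ_k))/2 with stability γ_k = y_k w·x_k/‖w‖; the function and variable names are illustrative, not the book's:

```python
import numpy as np

def epoch(w, X, y, beta, lr):
    """One gradient-descent sweep on E = sum_k (1 - tanh(beta*gamma_k))/2."""
    nw = np.linalg.norm(w)
    gamma = y * (X @ w) / nw                           # stabilities gamma_k
    pref = beta / (2.0 * np.cosh(beta * gamma) ** 2)   # per-example prefactor
    dgamma = y[:, None] * X / nw - np.outer(gamma, w) / nw**2
    return w + lr * (pref[:, None] * dgamma).sum(axis=0)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = np.where(X @ rng.normal(size=10) > 0, 1.0, -1.0)   # separable labels

w = (y[:, None] * X).sum(axis=0)                   # Hebb's rule (beta = 0)
beta = 1e-2 / np.linalg.norm(X, axis=1).max()      # all patterns in the window
dbeta, lr = 1e-2, 1e-2
for t in range(1000):                              # deterministic annealing
    w = epoch(w, X, y, beta, lr)
    beta += dbeta                                  # shrink the virtual window
```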
A heuristic improvement consists in considering two different values of β: β+ for the examples with positive stability and β− for those with negative stability. In order to keep the number of parameters small, the ratio β+/β− does not change during training. Thus, the Minimerror algorithm has three parameters: the learning rate µ, the annealing step δβ, and the asymmetry β± ≡ β+/β−. It proceeds as follows:
Minimerror Algorithm
• Parameter Settings
1. learning rate µ (suggested value: 10⁻²),
2. ratio β± (suggested value: 6),
3. annealing step δβ+ (suggested value: 10⁻²).
• Initialization
1. iteration counter: t = 0,
2. weights: w(0) (suggested initialization: apply Hebb's rule and then normalize the weights to ‖w‖ = √(N + 1)).
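As a sketch, the parameter settings and the initialization steps listed above might look as follows in code (the bias handling, the Hebbian scaling, and how β+ and β− are derived from the ratio β± are assumptions; the iteration steps of the algorithm continue beyond this excerpt):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))            # examples x_k (N = 5 inputs)
y = np.where(X[:, 0] > 0, 1.0, -1.0)     # labels y_k = +/-1

# -- Parameter settings (suggested values from the text) --
mu = 1e-2          # learning rate
ratio = 6.0        # asymmetry beta_plus / beta_minus
dbeta = 1e-2       # annealing step delta(beta_plus)

# -- Initialization --
t = 0                                           # iteration counter
Xb = np.hstack([X, np.ones((len(X), 1))])       # append a bias input
w = (y[:, None] * Xb).sum(axis=0)               # Hebb's rule
w *= np.sqrt(Xb.shape[1]) / np.linalg.norm(w)   # ||w|| = sqrt(N + 1)

# beta_plus small enough that every pattern starts inside the virtual window
beta_plus = 1e-2 / np.linalg.norm(Xb, axis=1).max()
beta_minus = beta_plus / ratio                  # ratio stays fixed during training
```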