In $R_{\mathrm{Moller}}$, $\beta$ is the width of a region, $R$, of acceptable error around the desired target and $\alpha$ controls the steepness of the risk outside $R$. Both parameters are positive. If we increase $\alpha$ then $R_{\mathrm{Moller}}$ becomes steeper outside $R$, forcing the outputs towards the boundary of that region. By decreasing $\beta$, the outputs are pulled towards the desired targets (see [160] for a detailed discussion). This risk functional was proposed in the framework of monotonic risk functionals [90]. In short, one can say that a risk is monotonic if its minimization (or maximization) implies the minimization of the number of misclassifications (recall the discussion in Example 2.6 where we have shown that $L_{\mathrm{MSE}}$ is non-monotonic). Møller defines his functional as being soft-monotonic, where the degree of monotonicity is controlled by $\alpha$, becoming monotonic when $\alpha \to +\infty$.

We now point out the main differences between the two risks. While in $R_{\mathrm{Moller}}$ we compute a sum (over $k$) of the exponentials of the squared errors, in $R_{\mathrm{EXP}}$ we compute the exponential of the sum (over $k$) of the squared errors. This brings a significant difference in terms of the gradients. In fact, for $\beta = 0$, we have

\[
\frac{\partial R_{\mathrm{EXP}}}{\partial w} = -2 \sum_{i=1}^{n} e_{ik} \exp\!\left( \frac{1}{\tau} \sum_{k=1}^{c} e_{ik}^{2} \right) \frac{\partial y_{ik}}{\partial w}, \tag{5.39}
\]

\[
\frac{\partial R_{\mathrm{Moller}}}{\partial w} = - \sum_{i=1}^{n} \alpha e_{ik} \exp\!\left( \alpha e_{ik}^{2} \right) \frac{\partial y_{ik}}{\partial w}. \tag{5.40}
\]
Thus, with $R_{\mathrm{EXP}}$ the backpropagated error through the output $y_k$ uses information from all the other outputs, while $R_{\mathrm{Moller}}$ only uses the error associated with that particular output.
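The contrast between the two gradients can be checked numerically. The sketch below, in plain Python, evaluates the factor that multiplies $\partial y_{ik}/\partial w$ in (5.39) and in (5.40) for a single instance. The function names, the error values and the settings $\tau = \alpha = 1$ are illustrative assumptions and do not come from the text.

```python
import math

def delta_exp(errors, tau):
    """Factors multiplying dy_ik/dw in (5.39), for one instance i.

    Every output shares the same exponential of the summed squared errors.
    """
    coupling = math.exp(sum(e * e for e in errors) / tau)
    return [-2.0 * e * coupling for e in errors]

def delta_moller(errors, alpha):
    """Factors multiplying dy_ik/dw in (5.40) with beta = 0, for one instance i.

    Each output only sees its own error.
    """
    return [-alpha * e * math.exp(alpha * e * e) for e in errors]

# Hypothetical errors of one instance on c = 3 outputs.
errors_a = [0.2, 0.1, -0.3]
errors_b = [0.2, 0.9, -0.3]   # only the second output's error has changed

tau, alpha = 1.0, 1.0          # illustrative parameter values
exp_a, exp_b = delta_exp(errors_a, tau), delta_exp(errors_b, tau)
mol_a, mol_b = delta_moller(errors_a, alpha), delta_moller(errors_b, alpha)

print("R_EXP    factor for output 1:", exp_a[0], "->", exp_b[0])
print("R_Moller factor for output 1:", mol_a[0], "->", mol_b[0])
```

Changing the error of the second output alters the factor backpropagated through the first output under $R_{\mathrm{EXP}}$, while the corresponding $R_{\mathrm{Moller}}$ factor stays the same.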
5.2.1.1 Gradient Descent
A gradient descent optimization can be applied to minimize $R_{\mathrm{EXP}}$ for any $\tau \neq 0$. Note that for $\tau < 0$ we get a negative scaled version of $R_{\mathrm{ZED}}$ that should now be minimized.
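A short worked step may help to see why a negative $\tau$ still calls for minimization. It assumes (the definition is not restated in this section) that the single-output risk has the form $R_{\mathrm{EXP}} = \tau \sum_i \exp(e_i^2/\tau)$, which is the form consistent with the derivative used in Algorithm 5.2. The symbol $h > 0$ is introduced here only for illustration, through $\tau = -2h^2$.

```latex
R_{\mathrm{EXP}}
  = \tau \sum_{i=1}^{n} \exp\!\left( \frac{e_i^{2}}{\tau} \right)
  = -2h^{2} \sum_{i=1}^{n} \exp\!\left( -\frac{e_i^{2}}{2h^{2}} \right),
  \qquad \tau = -2h^{2} < 0 .
```

Under this assumption the sum of Gaussian-shaped terms is largest when the errors are close to zero, and the negative prefactor turns that maximum into a minimum of $R_{\mathrm{EXP}}$.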
Algorithm 5.2: $R_{\mathrm{EXP}}$ gradient descent algorithm
1. Choose a random initial parameter vector, $w$ (with components $w_k$).
2. Compute $R_{\mathrm{EXP}}$ using the classifier outputs $y_i = \varphi(x_i; w)$ at the $n$ available instances.
3. Compute the partial derivatives of $R_{\mathrm{EXP}}$ with respect to the parameters:

\[
\frac{\partial R_{\mathrm{EXP}}}{\partial w_k} = 2 \sum_{i=1}^{n} e_i \exp\!\left( \frac{e_i^{2}}{\tau} \right) \frac{\partial e_i}{\partial w_k} \tag{5.41}
\]

where $\partial e_i / \partial w_k$ depends on the classifier architecture.
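The steps above can be sketched in plain Python. Several things here are assumptions made only for illustration: the risk is taken to be $R_{\mathrm{EXP}} = \tau \sum_i \exp(e_i^2/\tau)$ with $e_i = t_i - y_i$, the classifier is a hypothetical single unit $y_i = \tanh(w \cdot x_i)$, and the update $w \leftarrow w - \eta\, \partial R_{\mathrm{EXP}}/\partial w$ with a fixed step $\eta$ and a fixed number of epochs is a standard gradient descent choice that is not part of the steps listed above.

```python
import math
import random

def outputs(w, X):
    """Hypothetical classifier: y_i = tanh(w . x_i)."""
    return [math.tanh(sum(wk * xk for wk, xk in zip(w, x))) for x in X]

def risk_exp(w, X, T, tau):
    """Assumed form R_EXP = tau * sum_i exp(e_i^2 / tau), with e_i = t_i - y_i."""
    return tau * sum(math.exp((t - y) ** 2 / tau)
                     for t, y in zip(T, outputs(w, X)))

def grad_exp(w, X, T, tau):
    """Equation (5.41); for this classifier de_i/dw_k = -(1 - y_i^2) * x_ik."""
    g = [0.0] * len(w)
    for x, t, y in zip(X, T, outputs(w, X)):
        e = t - y
        common = 2.0 * e * math.exp(e * e / tau)
        for k, xk in enumerate(x):
            g[k] += common * (-(1.0 - y * y) * xk)
    return g

def gradient_descent(X, T, tau, eta=0.05, epochs=200, seed=0):
    rng = random.Random(seed)
    w = [rng.uniform(-0.1, 0.1) for _ in X[0]]       # step 1: random start
    history = []
    for _ in range(epochs):
        history.append(risk_exp(w, X, T, tau))       # step 2: evaluate the risk
        g = grad_exp(w, X, T, tau)                   # step 3: partial derivatives
        w = [wk - eta * gk for wk, gk in zip(w, g)]  # assumed update rule
    return w, history

# Toy two-class data: one feature plus a constant bias input, targets in {-1, +1}.
X = [[-2.0, 1.0], [-1.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
T = [-1.0, -1.0, 1.0, 1.0]

w, history = gradient_descent(X, T, tau=2.0)
print("risk at start:", history[0], "risk at end:", history[-1])
print("outputs:", outputs(w, X))
```

The step size and epoch count were picked for this toy data set only; other data or another value of $\tau$ will generally need different settings.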