In R_Moller, β is the width of a region, R, of acceptable error around the
desired target, and α controls the steepness of the risk outside R. Both
parameters are positive. If we increase α, then R_Moller becomes steeper
outside R, forcing the outputs towards the boundary of that region. By
decreasing β, the outputs are pulled towards the desired targets (see [160] for
a detailed discussion). This risk functional was proposed in the framework
of monotonic risk functionals [90]. In short, a risk is monotonic if its
minimization (or maximization) implies the minimization of the number of
misclassifications (recall the discussion in Example 2.6, where we showed
that L_MSE is non-monotonic). Møller defines his functional as being
soft-monotonic, where the degree of monotonicity is controlled by α,
becoming monotonic when α → +∞.
We now point out the main differences between the two risks. In R_Moller
we compute a sum (over k) of the exponentials of the squared errors, whereas
in R_EXP we compute the exponential of the sum (over k) of the squared
errors. This makes a significant difference to the gradients. In fact,
for β = 0, we have
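\[
\frac{\partial R_{EXP}}{\partial w} = 2 \sum_{i=1}^{n} e_{ik} \exp\!\left( \frac{1}{\tau} \sum_{k'=1}^{c} e_{ik'}^{2} \right) \frac{\partial y_{ik}}{\partial w},
\tag{5.39}
\]
\[
\frac{\partial R_{Moller}}{\partial w} = \sum_{i=1}^{n} \alpha\, e_{ik} \exp\!\left( \alpha\, e_{ik}^{2} \right) \frac{\partial y_{ik}}{\partial w}.
\tag{5.40}
\]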
Thus, with R_EXP the error backpropagated through the output y_k uses
information from all the other outputs, while R_Moller uses only the error
associated with that particular output.
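The coupling between outputs can be checked numerically. The sketch below is only an illustration: it assumes the per-instance gradient factors as we read them from (5.39) and (5.40) with β = 0, namely 2 e_k exp((1/τ) Σ_j e_j²) for R_EXP and α e_k exp(α e_k²) for R_Moller, and it leaves out the common term ∂y_ik/∂w. The function names and the error values are hypothetical.

```python
import math

def exp_risk_factor(errors, tau):
    """Factor 2*e_k*exp(sum_j e_j^2 / tau) for each output k of one instance."""
    coupling = math.exp(sum(e * e for e in errors) / tau)  # shared by all outputs
    return [2.0 * e * coupling for e in errors]

def moller_factor(errors, alpha):
    """Factor alpha*e_k*exp(alpha*e_k^2) for each output k (beta = 0 assumed)."""
    return [alpha * e * math.exp(alpha * e * e) for e in errors]

# Two hypothetical error vectors that differ only at the second output.
errors_a = [0.2, 0.1, -0.3]
errors_b = [0.2, 0.9, -0.3]

exp_a = exp_risk_factor(errors_a, tau=1.0)
exp_b = exp_risk_factor(errors_b, tau=1.0)
mol_a = moller_factor(errors_a, alpha=1.0)
mol_b = moller_factor(errors_b, alpha=1.0)

# The first output has the same error in both cases, yet its R_EXP factor
# changes because the exponential sees the errors of all outputs.
print("R_EXP factor, output 1:", exp_a[0], exp_b[0])
# The Moller factor of the first output is unaffected by the second output.
print("Moller factor, output 1:", mol_a[0], mol_b[0])
```

Changing the error at one output rescales the R_EXP factor of every other output, whereas each Møller factor depends on its own error alone.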
5.2.1.1 Gradient Descent
A gradient descent optimization can be applied to minimize R_EXP for any
τ ≠ 0. Note that for τ < 0 we get a negatively scaled version of R_ZED, which
should now be minimized.
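To make the procedure concrete before the formal statement, here is a minimal sketch of gradient descent on R_EXP for a toy single-output classifier. Everything specific in it is an assumption made for illustration: the form R_EXP = Σ_i τ exp(e_i²/τ) (chosen because its derivative is 2 Σ_i e_i exp(e_i²/τ) ∂e_i/∂w_k), the error convention e = t − y, the model y = tanh(w0 + w1 x), the fixed learning rate `eta`, the fixed number of steps, and the data. The update w ← w − η ∇R_EXP is the standard gradient descent step.

```python
import math

def r_exp(errors, tau):
    """Assumed form of the risk: sum_i tau * exp(e_i^2 / tau)."""
    return sum(tau * math.exp(e * e / tau) for e in errors)

def fit(xs, ts, tau=2.0, eta=0.05, steps=200):
    """Plain gradient descent on R_EXP for the toy model y = tanh(w0 + w1*x)."""
    w = [0.1, -0.1]  # fixed start for reproducibility (a random start also works)
    history = []     # risk value before each update
    for _ in range(steps):
        ys = [math.tanh(w[0] + w[1] * x) for x in xs]
        es = [t - y for t, y in zip(ts, ys)]  # assumed convention e = t - y
        history.append(r_exp(es, tau))
        grad = [0.0, 0.0]
        for x, y, e in zip(xs, ys, es):
            common = 2.0 * e * math.exp(e * e / tau)  # 2 e_i exp(e_i^2 / tau)
            dy = 1.0 - y * y                          # derivative of tanh
            grad[0] += common * (-dy)                 # de_i/dw0 = -dy_i/dw0
            grad[1] += common * (-dy * x)             # de_i/dw1 = -dy_i/dw1
        w = [wk - eta * gk for wk, gk in zip(w, grad)]
    return w, history

# Hypothetical one-dimensional, two-class data with targets -1 and +1.
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ts = [-1.0, -1.0, -1.0, 1.0, 1.0, 1.0]

w, history = fit(xs, ts)
print("risk before first and last update:", history[0], history[-1])
print("weights:", w)
```

With τ > 0 every term of the assumed risk is at least τ, so the risk is bounded below by nτ and approaches that bound as the errors shrink. For τ < 0 the same code applies unchanged, provided the chosen `eta` remains suitable.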
Algorithm 5.2: R_EXP gradient descent algorithm
1. Choose a random initial parameter vector, w (with components w_k).
2. Compute R_EXP using the classifier outputs y_i = ϕ(x_i; w) at the n
available instances.
3. Compute the partial derivatives of R_EXP with respect to the parameters:
\[
\frac{\partial R_{EXP}}{\partial w_k} = 2 \sum_{i=1}^{n} e_i \exp\!\left( \frac{e_i^{2}}{\tau} \right) \frac{\partial e_i}{\partial w_k}
\tag{5.41}
\]
where ∂e_i/∂w_k depends on the classifier architecture.
 