EE-Inspired Risks - Minimum Error Entropy Classification


In $R_{\text{Moller}}$, β is the width of a region of acceptable error around the desired target, and α controls the steepness of the risk outside that region. Both parameters are positive. Increasing α makes $R_{\text{Moller}}$ steeper outside the region, forcing the outputs towards its boundary. Decreasing β pulls the outputs towards the desired targets (see [160] for a detailed discussion). This risk functional was proposed in the framework of monotonic risk functionals [90]. In short, a risk is monotonic if its minimization (or maximization) implies the minimization of the number of misclassifications (recall the discussion in Example 2.6, where we showed that $L_{\text{MSE}}$ is non-monotonic). Møller defines his functional as soft-monotonic, where the degree of monotonicity is controlled by α, becoming monotonic when α → ∞.

We now point out the main differences between the two risks. While in $R_{\text{Moller}}$ we compute a sum (over k) of the exponentials of the squared errors, in $R_{\text{EXP}}$ we compute the exponential of the sum (over k) of the squared errors. This leads to a significant difference in the gradients. In fact, for β = 0, we have

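$$
\frac{\partial R_{\text{EXP}}}{\partial w} = -2 \sum_{i=1}^{n} \exp\!\left(\frac{1}{\tau}\sum_{k} e_{ik}^{2}\right) \sum_{k} e_{ik}\,\frac{\partial y_{ik}}{\partial w}, \tag{5.39}
$$

$$
\frac{\partial R_{\text{Moller}}}{\partial w} = -2 \sum_{i=1}^{n} \sum_{k} \alpha\, e_{ik} \exp\!\left(\alpha e_{ik}^{2}\right) \frac{\partial y_{ik}}{\partial w}. \tag{5.40}
$$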

Thus, with $R_{\text{EXP}}$ the error backpropagated through the output $y_k$ uses information from all the other outputs, while $R_{\text{Moller}}$ only uses the error associated with that particular output.
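The coupling between outputs can be checked numerically. The sketch below is illustrative only and rests on assumptions not stated in this passage: the errors are taken as e = t − y, $R_{\text{EXP}}$ is taken as the sum over instances of τ·exp(Σ_k e_ik²/τ), and $R_{\text{Moller}}$ with β = 0 is taken as the sum of exp(α·e_ik²). The function names and the toy targets and outputs are hypothetical. It compares the derivatives with respect to the first output when only the second output changes.

```python
import numpy as np

def grad_outputs_exp(t, y, tau):
    """dR_EXP/dy_ik, assuming R_EXP = sum_i tau * exp(sum_k e_ik**2 / tau), e = t - y."""
    e = t - y
    # One factor per instance, built from the errors of ALL outputs.
    coupling = np.exp(np.sum(e**2, axis=1, keepdims=True) / tau)
    return -2.0 * e * coupling

def grad_outputs_moller(t, y, alpha):
    """dR_Moller/dy_ik for beta = 0, assuming R_Moller = sum_i sum_k exp(alpha * e_ik**2)."""
    e = t - y
    # Each entry depends only on its own error e_ik.
    return -2.0 * alpha * e * np.exp(alpha * e**2)

# One instance, two outputs (hypothetical values).
t = np.array([[1.0, -1.0]])
y_a = np.array([[0.5, -0.9]])  # second output close to its target
y_b = np.array([[0.5, 0.2]])   # same first output, second output much worse

g_exp_a = grad_outputs_exp(t, y_a, tau=1.0)
g_exp_b = grad_outputs_exp(t, y_b, tau=1.0)
g_mol_a = grad_outputs_moller(t, y_a, alpha=1.0)
g_mol_b = grad_outputs_moller(t, y_b, alpha=1.0)

print("R_EXP    d/dy_1:", g_exp_a[0, 0], "vs", g_exp_b[0, 0])
print("R_Moller d/dy_1:", g_mol_a[0, 0], "vs", g_mol_b[0, 0])
```

Under these assumptions, the derivative with respect to the first output changes for $R_{\text{EXP}}$ when the second output degrades, whereas for $R_{\text{Moller}}$ it stays the same.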

5.2.1.1 Gradient Descent

A gradient descent optimization can be applied to minimize $R_{\text{EXP}}$ for any τ ≠ 0. Note that for τ < 0 we get a negative scaled version of $R_{\text{ZED}}$, which should now be minimized.
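The remark about τ < 0 can be made concrete with a short worked equation. This is a sketch under assumptions, since neither definition appears in this passage: $R_{\text{EXP}}$ is taken in its single-output form, the sum of τ·exp(e_i²/τ), and $R_{\text{ZED}}$ is taken as a Gaussian-kernel density estimate of the error at zero with bandwidth h.

```latex
% Assumed definitions (not restated in this passage):
%   R_EXP = \sum_i \tau \exp(e_i^2 / \tau)
%   R_ZED = \frac{1}{nh} \sum_i G(e_i / h),  G(u) = \frac{1}{\sqrt{2\pi}} e^{-u^2/2}
\begin{align*}
R_{\text{EXP}}
  &= \sum_{i=1}^{n} \tau \exp\!\left(\frac{e_i^{2}}{\tau}\right)
   = -|\tau| \sum_{i=1}^{n} \exp\!\left(-\frac{e_i^{2}}{|\tau|}\right)
   \qquad (\tau < 0), \\
\intertext{and, choosing $|\tau| = 2h^{2}$,}
R_{\text{EXP}}
  &= -2h^{2} \sum_{i=1}^{n} \exp\!\left(-\frac{e_i^{2}}{2h^{2}}\right)
   = -2\sqrt{2\pi}\, n h^{3}\, R_{\text{ZED}} .
\end{align*}
```

Under these assumptions the scale factor is negative, so maximizing $R_{\text{ZED}}$ corresponds to minimizing $R_{\text{EXP}}$.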

Algorithm 5.2: $R_{\text{EXP}}$ gradient descent algorithm

1. Choose a random initial parameter vector, w (with components $w_k$).

2. Compute $R_{\text{EXP}}$ using the classifier outputs $y_i = \varphi(x_i; w)$ at the n available instances.

3. Compute the partial derivatives of $R_{\text{EXP}}$ with respect to the parameters:

$$
\frac{\partial R_{\text{EXP}}}{\partial w_k} = 2 \sum_{i=1}^{n} e_i \exp\!\left(\frac{e_i^{2}}{\tau}\right) \frac{\partial e_i}{\partial w_k}, \tag{5.41}
$$

where $\partial e_i / \partial w_k$ depends on the classifier architecture.
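The steps listed so far can be sketched in code. This is not the book's implementation: it assumes a single tanh unit as the classifier, targets in {−1, 1}, errors e = t − y, and $R_{\text{EXP}}$ = Σ τ·exp(e_i²/τ). The weight update w ← w − η·gradient is the standard gradient descent step and is also an assumption, since the remaining steps of the algorithm are not shown here. The function names, the learning rate `eta`, the epoch count and the toy data are all hypothetical.

```python
import numpy as np

def r_exp(e, tau):
    """Assumed single-output risk: R_EXP = sum_i tau * exp(e_i**2 / tau)."""
    return tau * np.sum(np.exp(e**2 / tau))

def train_r_exp(X, t, tau=2.0, eta=0.01, epochs=200, seed=0):
    """Gradient descent on R_EXP for a single tanh unit (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])    # append a constant bias input
    w = rng.normal(scale=0.1, size=Xb.shape[1])      # step 1: random initial w
    history = []
    for _ in range(epochs):
        y = np.tanh(Xb @ w)                          # step 2: classifier outputs
        e = t - y
        history.append(r_exp(e, tau))                # step 2: risk value
        de_dw = -(1.0 - y**2)[:, None] * Xb          # de_i/dw_k for a tanh unit
        grad = 2.0 * (e * np.exp(e**2 / tau)) @ de_dw  # step 3: form of (5.41)
        w = w - eta * grad                           # assumed descent update
    return w, history

# Toy two-class data: two Gaussian clusters (hypothetical).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1.0, 0.5, (20, 2)), rng.normal(1.0, 0.5, (20, 2))])
t = np.concatenate([-np.ones(20), np.ones(20)])

w, history = train_r_exp(X, t)
print("risk at start:", history[0], "risk at end:", history[-1])
```

For a different architecture only the line computing `de_dw` would change, which is the sense in which the derivative of the error depends on the classifier.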
