$$
\Delta w_l^{(m-1)} = \eta \, \frac{1}{n^2 h^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \sum_{k=1}^{c} G_h(e_i - e_j)\,(e_{ik} - e_{jk}) \left( \frac{\partial e_{ik}}{\partial w_l} - \frac{\partial e_{jk}}{\partial w_l} \right), \tag{6.4}
$$

with, for a hidden-layer weight $w_l$,

$$
\frac{\partial e_{ik}}{\partial w_l} = -\,\varphi'\!\left( \sum_{l=0}^{n_h} w_{lk}\, u_{il} \right) w_{lk}\, \varphi'\!\left( \sum_{m=0}^{d} w_{ml}\, x_{im} \right) x_{im},
$$

where $\varphi$ is the activation function, $u_{il}$ the output of hidden neuron $l$ for pattern $x_i$, and $x_{im}$ the $m$-th component of $x_i$.
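To make the double-sum structure of (6.4) concrete, here is a minimal NumPy sketch of the update for a single scalar weight, assuming a multivariate Gaussian kernel $G_h$ and taking the per-pattern error Jacobians as given; the function names and the $(n, c)$ array layout are illustrative assumptions, not from the source.

```python
import numpy as np

def gaussian_kernel(u, h):
    """Unnormalized Gaussian kernel G_h on error-difference vectors (rows of u).

    The normalization constant is omitted; it can be absorbed into eta.
    """
    return np.exp(-np.sum(u**2, axis=-1) / (2.0 * h**2))

def r2ee_weight_update(errors, jacobian, eta, h):
    """Double-sum weight update of Eq. (6.4) for one scalar weight w_l.

    errors:   (n, c) array, e_ik = error of pattern i at output k
    jacobian: (n, c) array, d e_ik / d w_l
    """
    n, _ = errors.shape
    diff_e = errors[:, None, :] - errors[None, :, :]      # e_i - e_j, shape (n, n, c)
    diff_j = jacobian[:, None, :] - jacobian[None, :, :]  # d(e_i - e_j)/d w_l
    g = gaussian_kernel(diff_e, h)                        # G_h(e_i - e_j), shape (n, n)
    # triple sum over i, j, k of G_h(e_i - e_j)(e_ik - e_jk)(de_ik/dw_l - de_jk/dw_l)
    return eta / (n**2 * h**2) * np.einsum('ij,ijk,ijk->', g, diff_e, diff_j)
```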
Sometimes a so-called momentum factor, dependent on the weight differences in consecutive epochs, is added to expressions (6.3) and (6.4) with the intent of speeding up convergence (see, e.g., [212]). We will not make use of the momentum factor and will instead use other means, to be explained later, for improving convergence.
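For reference, a common formulation of such a momentum term, with momentum factor $\alpha$, is

$$
\Delta w^{(m)} = -\,\eta\, \frac{\partial R}{\partial w} + \alpha\, \Delta w^{(m-1)}, \qquad 0 \le \alpha < 1,
$$

so that a fraction of the previous epoch's weight change is carried over, damping oscillations and accelerating progress along persistent gradient directions.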
Note that the back-propagation formulas for the ZED and EXP risks are considerably simpler than the ones above, essentially because there is no double sum over the errors. In fact, one has, for example,
$$
\frac{\partial R_{\mathrm{EXP}}}{\partial w_k} = -\frac{2}{\tau} \sum_{i=1}^{n} e^{-e_i^{\mathsf{T}} e_i / \tau}\, e_{ik}\, \frac{\partial e_{ik}}{\partial w_k}, \tag{6.5}
$$

with $\partial e_{ik}/\partial w_l$ as given above,
which gives these risks a complexity equivalent to that of the MSE risk, for which
$$
\frac{\partial R_{\mathrm{MSE}}}{\partial w_k} = \frac{2}{n} \sum_{i=1}^{n} e_{ik}\, \frac{\partial e_{ik}}{\partial w_k}. \tag{6.6}
$$
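Both (6.5) and (6.6) reduce to a single pass over the $n$ training errors, i.e., they are $O(n)$ per weight, unlike the $O(n^2)$ double sum in (6.4). A sketch under the same illustrative assumptions as before (the function names and array layouts are not from the source):

```python
import numpy as np

def exp_risk_gradient(errors, jac_k, k, tau):
    """Eq. (6.5): gradient of the EXP risk w.r.t. a weight w_k of output k.

    errors: (n, c) array of errors e_i;  jac_k: (n,) array of d e_ik / d w_k.
    """
    sq_norms = np.sum(errors**2, axis=1)              # e_i^T e_i
    return -2.0 / tau * np.sum(np.exp(-sq_norms / tau) * errors[:, k] * jac_k)

def mse_gradient(errors, jac_k, k):
    """Eq. (6.6): gradient of the MSE risk w.r.t. a weight w_k of output k."""
    n = errors.shape[0]
    return 2.0 / n * np.sum(errors[:, k] * jac_k)
```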
We now present an example from [198] of an MLP using Rényi's quadratic entropy, trained with the back-propagation algorithm to discriminate a 4-class dataset. The example illustrates the convergence towards Dirac-δ error densities (see Sect. 3.1.1). In this example and throughout the present section we only use one-hidden-layer MLP architectures, denoted $[d : n_h : c]$, with $n_h$ the number of hidden neurons. A 1-of-$c$ coding scheme of the outputs is assumed.
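To fix ideas, the following minimal NumPy sketch sets up such a $[d : n_h : c]$ architecture with tanh activations and 1-of-$c$ coded targets. It is an illustration only: the $\{-1, 1\}$ target convention, the bias handling, and all names are assumptions, not details taken from [198].

```python
import numpy as np

rng = np.random.default_rng(0)

d, n_h, c = 2, 2, 4                              # a [2:2:4] architecture

# initial random weights in [-0.1, 0.1]; the extra row holds the bias weights
W1 = rng.uniform(-0.1, 0.1, size=(d + 1, n_h))   # input  -> hidden
W2 = rng.uniform(-0.1, 0.1, size=(n_h + 1, c))   # hidden -> output

def forward(X):
    """Forward pass of a one-hidden-layer MLP with tanh activations."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])   # append bias input
    U = np.tanh(Xb @ W1)                            # hidden outputs u_il
    Ub = np.hstack([U, np.ones((U.shape[0], 1))])
    return np.tanh(Ub @ W2)                         # network outputs

def one_of_c(labels, c):
    """1-of-c target coding with {-1, 1} entries (one plausible convention)."""
    T = -np.ones((labels.size, c))
    T[np.arange(labels.size), labels] = 1.0
    return T

X = rng.normal(size=(100, d))                    # placeholder data
T = one_of_c(rng.integers(0, c, size=100), c)    # placeholder labels
E = T - forward(X)                               # (100, c) error matrix
```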
Example 6.1.
Consider the two-dimensional artificial dataset shown in Fig. 6.2
consisting of 200 data instances in four separable classes, with 52, 54, 42 and 52 instances, respectively.
MLPs with one hidden layer, tanh activation function, and initial random weights in $[-0.1, 0.1]$ were trained using the R2EE risk functional. Only half of the dataset (a total of 100 instances, with approximately 25 instances per class) was used in the training process.
Figures 6.3, 6.4 and 6.5 show, for one experiment with $n_h = 2$, error graphs corresponding to training epochs 1, 10 and 40, respectively. Since we have a neural network with four outputs, the error vectors $e_k$, $k = 1, \ldots, 4$, form a $100 \times 4$ matrix. Each figure shows a $4 \times 4$ array whose off-diagonal cells are the $(e_i, e_k)$ scatter plots of the column-class $k$ error values (in $[-2, 2]$) versus the row-class $i$ error values. The diagonal cells contain the histograms of each column-class error vector $e_k$.
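Such a $4 \times 4$ error-graph array is easy to reproduce from any $(n, 4)$ error matrix, e.g. the `E` computed in the sketch above. The following matplotlib fragment is illustrative only (binning and styling are assumptions, not the authors' plotting code):

```python
import matplotlib.pyplot as plt

def error_graphs(E):
    """Scatter plots of (e_i, e_k) off the diagonal, histograms on it."""
    c = E.shape[1]
    fig, axes = plt.subplots(c, c, figsize=(8, 8))
    for i in range(c):
        for k in range(c):
            ax = axes[i, k]
            if i == k:
                ax.hist(E[:, k], bins=20)          # histogram of class-k errors
            else:
                ax.scatter(E[:, k], E[:, i], s=5)  # column-class k vs row-class i
                ax.set_xlim(-2, 2)
                ax.set_ylim(-2, 2)
    plt.show()
```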
Analyzing these graphs one can see that the errors converge to Dirac-δ distributions, moreover with uncorrelated errors for the four classes.