Let us start by considering the $k$th output perceptron with its weights being adjusted by gradient descent. When using the empirical Shannon's entropy of the error, $H_S$, we then apply expression (3.8), which we rewrite below for the $k$th output perceptron in vector notation:
$$
\frac{\partial H_S}{\partial \mathbf{w}_k} = \frac{1}{n^2 h^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \frac{G_h(\mathbf{e}_i - \mathbf{e}_j)}{f(\mathbf{e}_i)} \,(e_{ik} - e_{jk}) \left( \frac{\partial e_{ik}}{\partial \mathbf{w}_k} - \frac{\partial e_{jk}}{\partial \mathbf{w}_k} \right). \tag{6.1}
$$
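For concreteness, the following is a minimal NumPy sketch of the double sum in (6.1). It is an illustration under assumed conventions, not the book's code: a normalized multivariate Gaussian kernel $G_h$, the Parzen window estimate $f(\mathbf{e}_i) = \frac{1}{n}\sum_j G_h(\mathbf{e}_i - \mathbf{e}_j)$, and illustrative array names (`E`, `dE_dwk`) of our own choosing.

```python
import numpy as np

def gauss_kernel(u, h):
    """Multivariate Gaussian kernel G_h of bandwidth h (assumed normalized form)."""
    d = u.shape[-1]
    return np.exp(-np.sum(u**2, axis=-1) / (2 * h**2)) / ((2 * np.pi)**(d / 2) * h**d)

def grad_HS_wk(E, dE_dwk, h, k):
    """Double sum (6.1): gradient of the empirical Shannon entropy H_S
    w.r.t. the weight vector w_k of the k-th output perceptron.

    E      : (n, n_c) array; rows are the error vectors e_i
    dE_dwk : (n, p) array; rows are the derivatives de_ik/dw_k (p = len(w_k))
    k      : zero-based index of the output, i.e., of error component e_ik
    """
    n = E.shape[0]
    # Parzen window estimate f(e_i) = (1/n) sum_j G_h(e_i - e_j)
    f = np.array([gauss_kernel(E[i] - E, h).mean() for i in range(n)])
    grad = np.zeros(dE_dwk.shape[1])
    for i in range(n):
        for j in range(n):
            grad += (gauss_kernel(E[i] - E[j], h) / f[i]) \
                    * (E[i, k] - E[j, k]) * (dE_dwk[i] - dE_dwk[j])
    return grad / (n**2 * h**2)
```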
Whereas expression (3.8) contemplated the adjustment of a single weight, we now formulate the adjustment with respect to a whole vector of weights (including biases): the weight vector $\mathbf{w}_k$ of an arbitrary $k$th output perceptron. The derivative of $H_S$ with respect to the weights depends on the $n_c$-dimensional error vectors denoted $\mathbf{e}_i$ and $\mathbf{e}_j$. Each component $\partial H_S/\partial w_{lk}$ of the vector $\partial H_S/\partial \mathbf{w}_k$ in (6.1) can be conveniently expressed (namely, for implementation purposes) as the sum of all elements of the matrix resulting from:
$$
\frac{1}{n^2 h^2}
\begin{bmatrix}
\frac{1}{f(\mathbf{e}_1)} & \cdots & \frac{1}{f(\mathbf{e}_1)} \\
\vdots & & \vdots \\
\frac{1}{f(\mathbf{e}_n)} & \cdots & \frac{1}{f(\mathbf{e}_n)}
\end{bmatrix}
\mathbin{.\times}
\begin{bmatrix}
G_h(\mathbf{e}_1 - \mathbf{e}_1) & \cdots & G_h(\mathbf{e}_1 - \mathbf{e}_n) \\
\vdots & & \vdots \\
G_h(\mathbf{e}_n - \mathbf{e}_1) & \cdots & G_h(\mathbf{e}_n - \mathbf{e}_n)
\end{bmatrix}
\mathbin{.\times}
\begin{bmatrix}
\frac{\partial e_{1k}}{\partial w_{lk}} - \frac{\partial e_{1k}}{\partial w_{lk}} & \cdots & \frac{\partial e_{1k}}{\partial w_{lk}} - \frac{\partial e_{nk}}{\partial w_{lk}} \\
\vdots & & \vdots \\
\frac{\partial e_{nk}}{\partial w_{lk}} - \frac{\partial e_{1k}}{\partial w_{lk}} & \cdots & \frac{\partial e_{nk}}{\partial w_{lk}} - \frac{\partial e_{nk}}{\partial w_{lk}}
\end{bmatrix}
\mathbin{.\times}
\begin{bmatrix}
e_{1k} - e_{1k} & \cdots & e_{1k} - e_{nk} \\
\vdots & & \vdots \\
e_{nk} - e_{1k} & \cdots & e_{nk} - e_{nk}
\end{bmatrix}
\tag{6.2}
$$
where '$.\times$' denotes the element-wise product [212]. The first matrix is not present when Rényi's quadratic entropy or the information potential is used (see also expression (3.9)).
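In code, the component $\partial H_S/\partial w_{lk}$ is then just the sum of all entries of the element-wise product of the four $n \times n$ matrices in (6.2). A sketch under the same assumptions as before (normalized Gaussian kernel, precomputed Parzen estimates `f`; array names illustrative) could read:

```python
import numpy as np

def grad_HS_component(E, de_dwlk, f, h, k):
    """dH_S/dw_lk as the sum of all elements of the matrix product (6.2).

    E       : (n, n_c) error vectors e_i
    de_dwlk : (n,) derivatives de_ik/dw_lk for the single weight w_lk
    f       : (n,) Parzen window estimates f(e_i)
    """
    n, n_c = E.shape
    diff = E[:, None, :] - E[None, :, :]          # e_i - e_j, shape (n, n, n_c)
    G = np.exp(-np.sum(diff**2, axis=-1) / (2 * h**2)) \
        / ((2 * np.pi)**(n_c / 2) * h**n_c)       # matrix of G_h(e_i - e_j)
    F = np.tile((1.0 / f)[:, None], (1, n))       # rows of 1/f(e_i)
    D = de_dwlk[:, None] - de_dwlk[None, :]       # de_ik/dw_lk - de_jk/dw_lk
    ek = E[:, k]
    C = ek[:, None] - ek[None, :]                 # e_ik - e_jk
    return (F * G * D * C).sum() / (n**2 * h**2)
```

Dropping the first matrix `F` gives the Rényi's quadratic entropy / information potential variant mentioned above.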
Once all $n$ error vectors (for the $n$ input vectors $\mathbf{x}_i$) relative to the $m$th training epoch have been obtained, one is then able to compute the updated weights for the output perceptron:
$$
\mathbf{w}_k^{(m)} = \mathbf{w}_k^{(m-1)} - \Delta \mathbf{w}_k^{(m-1)}, \quad \text{with} \quad \Delta \mathbf{w}_k^{(m-1)} = \eta \left. \frac{\partial H_S}{\partial \mathbf{w}_k} \right|_{\mathbf{w}^{(m-1)}}, \tag{6.3}
$$

where $\eta$ is the learning rate.
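Assuming the `grad_HS_wk` sketch given after (6.1), the epoch update (6.3) reduces to a single gradient step; the function name and signature below are our own:

```python
def update_wk(w_k, E, dE_dwk, h, k, eta):
    """Epoch update (6.3): w_k(m) = w_k(m-1) - eta * dH_S/dw_k."""
    return w_k - eta * grad_HS_wk(E, dE_dwk, h, k)
```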
The updating of the weight vector $\mathbf{w}_l$, relative to an arbitrary $l$th perceptron of the hidden layer, is done as usual with the back-propagation algorithm. One needs all back-propagated errors from the output layer (incident dotted arrows in Fig. 6.1). Denoting by $\varphi(\cdot)$ the activation function, assumed the same for all perceptrons, the updating vector for $\mathbf{w}_l$ at the $m$th training epoch is then: