pattern i. The generalization of expression (5.9) to a p-dimensional output (the dimension of e, i.e., the number of network outputs) is given by
\hat{f}(\mathbf{0};\mathbf{w}) = \frac{1}{n h^{p}} \sum_{i=1}^{n} K\!\left(\frac{\mathbf{0}-\mathbf{e}(i)}{h}\right),    (6.16)
where h represents the bandwidth of the kernel K . Using the Gaussian kernel
with zero mean and unit covariance in this expression, the estimator for the
error density becomes
\hat{f}(\mathbf{0};\mathbf{w}) = \frac{1}{(2\pi)^{p/2}\, n h^{p}} \sum_{i=1}^{n} \exp\!\left(-\frac{\mathbf{e}(i)^{T}\mathbf{e}(i)}{2h^{2}}\right).    (6.17)
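To make the estimator concrete, here is a minimal numerical sketch of (6.17) in Python; the function name, and the convention that errors holds one p-dimensional error vector e(i) per row, are assumptions for illustration:

import numpy as np

def zero_error_density(errors, h):
    """Gaussian KDE estimate of the error density at the origin, eq. (6.17).

    errors : (n, p) array with one p-dimensional error vector e(i) per row.
    h      : kernel bandwidth.
    """
    n, p = errors.shape
    sq_norms = np.sum(errors**2, axis=1)               # e(i)^T e(i) for each i
    kernel_sum = np.sum(np.exp(-sq_norms / (2 * h**2)))
    return kernel_sum / ((2 * np.pi) ** (p / 2) * n * h**p)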
For reasons related to the speed of convergence of ZED, discussed in [214] and Chap. 5, the following simplification of expression (6.17) shall be used instead (in line with the EXP risk):
\hat{f}(\mathbf{0};\mathbf{w}) = h^{2} \sum_{i=1}^{n} \exp\!\left(-\frac{\mathbf{e}(i)^{T}\mathbf{e}(i)}{2h^{2}}\right);    (6.18)

given that the difference relies only on constant terms, the same extrema are found.
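As a quick check of that claim (reusing the hypothetical zero_error_density from the sketch above), the ratio of (6.18) to (6.17) is the constant (2π)^{p/2} n h^{p+2}, which does not depend on the errors and therefore not on w:

def zed_objective(errors, h):
    """Simplified ZED objective, eq. (6.18)."""
    sq_norms = np.sum(errors**2, axis=1)
    return h**2 * np.sum(np.exp(-sq_norms / (2 * h**2)))

rng = np.random.default_rng(0)
e = rng.normal(size=(100, 3))      # hypothetical errors: n = 100, p = 3
h = 0.5
ratio = zed_objective(e, h) / zero_error_density(e, h)
print(np.isclose(ratio, (2 * np.pi) ** (3 / 2) * 100 * h**(3 + 2)))   # True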
The gradient of (6.18) is

\frac{\partial \hat{f}(\mathbf{0};\mathbf{w})}{\partial \mathbf{w}} = -\sum_{i=1}^{n} \exp\!\left(-\frac{\mathbf{e}(i)^{T}\mathbf{e}(i)}{2h^{2}}\right) \mathbf{e}(i)^{T}\, \frac{\partial \mathbf{e}(i)}{\partial \mathbf{w}}.    (6.19)
Since the search is for the weights yielding the maximum of the error density
at 0, the network weight update shall be made by
\Delta \mathbf{w} = \eta\, \frac{\partial \hat{f}(\mathbf{0};\mathbf{w})}{\partial \mathbf{w}},    (6.20)
where η stands for the learning rate.
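As an illustration of update (6.20) with gradient (6.19), consider a minimal sketch for a hypothetical single linear layer y(i) = W x(i) with e(i) = t(i) − y(i), so that the Jacobian of each error component with respect to a weight row is −x(i) and (6.19) reduces to a weighted sum of outer products. A real network would obtain ∂e/∂w by backpropagation; all names below are illustrative:

import numpy as np

def zed_gradient_step(W, X, T, h, eta):
    """One gradient-ascent step on (6.18) via eqs. (6.19)-(6.20), for y = W x.

    W : (p, d) weights; X : (n, d) inputs; T : (n, p) targets.
    """
    E = T - X @ W.T                                   # errors e(i) = t(i) - y(i)
    k = np.exp(-np.sum(E**2, axis=1) / (2 * h**2))    # exp(-e^T e / 2h^2)
    # With e = t - W x, (6.19) becomes  df/dW = sum_i k(i) e(i) x(i)^T
    grad = (k[:, None] * E).T @ X
    return W + eta * grad                             # Delta w = eta df/dw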
Adapting ZED to the online training of the RNN changes expression (6.19) to
\frac{\partial \hat{f}(\mathbf{0},t;\mathbf{w})}{\partial \mathbf{w}} = -\sum_{i=1}^{L} \exp\!\left(-\frac{\mathbf{e}(t-i)^{T}\mathbf{e}(t-i)}{2h^{2}}\right) \mathbf{e}(t-i)^{T}\, \frac{\partial \mathbf{e}(t-i)}{\partial \mathbf{w}},    (6.21)
where, instead of computing the density over the n data instances of a training set, it is computed over the last L errors of the RNN. Notice that the time dependency of the gradient of the density is now explicit. This approach is an approximation to the real gradient of the density, since it uses error values from different time steps to build a single estimate of the error density. Given that learning is online and the weights are adjusted at each time step, the construction of a density from errors at different time steps is valid if L is small enough that the weights do not change significantly over those steps.
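A sketch of the corresponding online update for the same hypothetical linear model, with a buffer of the past L input/error pairs standing in for the RNN history (in a real RNN the Jacobians ∂e(t−i)/∂w would come from the recurrent learning algorithm, e.g. backpropagation through time):

from collections import deque
import numpy as np

class OnlineZED:
    """Online ZED ascent over the last L errors, eq. (6.21) (illustrative)."""

    def __init__(self, W, h, eta, L):
        self.W, self.h, self.eta = W, h, eta
        self.buffer = deque(maxlen=L)              # last L (x, e) pairs

    def step(self, x, t):
        e = t - self.W @ x                         # current error e(t)
        grad = np.zeros_like(self.W)
        for x_i, e_i in self.buffer:               # e(t - i), i = 1..L
            k = np.exp(-(e_i @ e_i) / (2 * self.h**2))
            grad += k * np.outer(e_i, x_i)         # k e(t-i) x(t-i)^T
        self.W = self.W + self.eta * grad          # update (6.20) with (6.21)
        self.buffer.append((x, e))                 # e(t) joins the window
        return e

Note that the buffered errors were computed with earlier weight values, which is precisely the approximation to the real gradient that the text describes.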
 