$$H_{R\alpha}(E) = \frac{1}{1-\alpha}\,\ln\int_{E} f^{\alpha}(e)\,de,\qquad \alpha \ge 0,\ \alpha \neq 1. \tag{2.37}$$
For $\alpha \to 1$ one obtains the Shannon entropy. We will find it useful to use Rényi's quadratic entropy ($\alpha = 2$), expressing the risk as
$$R_{R2EE}(E) \equiv H_{R2}(E) = -\ln\int_{E} f^{2}(e)\,de. \tag{2.38}$$
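For concreteness, here is a small numerical sketch of (2.38). It assumes a zero-mean Gaussian error PDF with standard deviation sigma (an arbitrary illustrative choice, not taken from the text), approximates $\int f^{2}(e)\,de$ by a Riemann sum on a grid, and compares the result with the well-known Gaussian closed form $\tfrac{1}{2}\ln(4\pi\sigma^{2})$.

```python
import numpy as np

def r2ee_numeric(pdf, grid):
    """Approximate R_R2EE(E) = -ln( integral of f^2(e) de ) by a Riemann sum on a uniform grid."""
    f = pdf(grid)
    de = grid[1] - grid[0]
    return -np.log(np.sum(f**2) * de)

sigma = 0.5                                      # hypothetical error spread
gauss = lambda e: np.exp(-e**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
grid = np.linspace(-6 * sigma, 6 * sigma, 20001)

print(r2ee_numeric(gauss, grid))                 # ~ 0.5724
print(0.5 * np.log(4 * np.pi * sigma**2))        # Gaussian closed form, ~ 0.5724
```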
Until now we have considered continuous distributions of the errors. In some
problems, however, one has to deal with a discrete error r.v.; one then uses
the discrete versions of the entropies (historically, simply called entropies):
$$H_S(E) = -\sum_{i=1}^{m} P(e_i)\ln P(e_i), \tag{2.39}$$

$$H_{R2}(E) = -\ln\sum_{i=1}^{m} P^{2}(e_i), \tag{2.40}$$
where $P(e) \equiv P_E(e)$ is the error PMF.
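As an illustration of (2.39) and (2.40), the following sketch computes both discrete entropies for a hypothetical error PMF; the probability values are arbitrary and chosen only for the example.

```python
import numpy as np

def shannon_entropy(p):
    """H_S(E) = -sum_i P(e_i) ln P(e_i); zero-probability terms contribute nothing."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return -np.sum(nz * np.log(nz))

def renyi2_entropy(p):
    """H_R2(E) = -ln sum_i P(e_i)^2."""
    p = np.asarray(p, dtype=float)
    return -np.log(np.sum(p**2))

pmf = [0.7, 0.2, 0.1]              # hypothetical error PMF over m = 3 error values
print(shannon_entropy(pmf))        # ~ 0.8018
print(renyi2_entropy(pmf))         # ~ 0.6162
print(shannon_entropy([1, 0, 0]))  # 0: a discrete Dirac-delta PMF has zero entropy
```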
We shall later see how to estimate $H_S(E)$ and $H_{R2}(E)$ for the continuous error case. We shall also see that when applying the $H_S(E)$ formula (2.36) to data classification a crude and simple estimation of $f(e)$ is all that is required; moreover, when applying $H_{R2}(E)$ the estimation of $f(e)$ is even short-circuited. Thus, in both cases we avoid having to accurately estimate a PDF, a problem traditionally considered, in general, more difficult than that of designing an accurate classifier [41, 227].
Note that indeed one may interpret the above entropies as risk functionals.
$R_{SEE}(E)$ is the expectation of the loss function $L_{SEE}(e) = -\ln f(e)$. For Rényi's quadratic error-entropy, instead of minimizing $R_{R2EE}(E)$ we will see later that it turns out to be more convenient to maximize $V_{R2}(E) = \exp(-R_{R2EE}(E))$, the so-called information potential [175]. In this case, instead of a loss function we may speak of a gain function: $V_{R2}(E)$ is the expectation of the gain function $f(e)$. One can also, of course, consider $-V_{R2}(E)$ as the risk functional expressed in terms of a loss $-f(e)$.
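Although the estimation details only appear later in the text, a common kernel-based sketch of the information potential, in the spirit of [175], replaces the expectation of $f(e)$ by a Parzen-window average over all pairs of error samples. The Gaussian kernel, the bandwidth h, and the sample below are illustrative assumptions, not the book's own prescription.

```python
import numpy as np

def information_potential(errors, h):
    """V_R2 estimate: (1/n^2) * sum_{i,j} G(e_i - e_j), G a Gaussian of variance 2*h^2 (Parzen-based sketch)."""
    e = np.asarray(errors, dtype=float)
    d = e[:, None] - e[None, :]                  # all pairwise error differences
    s2 = 2.0 * h**2                              # variance of the pairwise kernel
    g = np.exp(-d**2 / (2.0 * s2)) / np.sqrt(2.0 * np.pi * s2)
    return g.mean()                              # average over the n^2 pairs

rng = np.random.default_rng(0)
errors = rng.normal(0.0, 0.5, size=200)          # hypothetical error sample
v_r2 = information_potential(errors, h=0.1)
print(v_r2, -np.log(v_r2))                       # estimated V_R2 and the implied R_R2EE
```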
As an initial motivation to use entropic risk functionals, let us recall that entropy provides a measure of how concentrated a distribution is. For discrete distributions its minimum value (zero) corresponds to a discrete Dirac-$\delta$ PMF. For continuous distributions the minimum value (minus infinity for $H_S$ and zero for $H_{R2}$) corresponds to a PDF represented by a sequence of continuous Dirac-$\delta$ functions, a Dirac-$\delta$ comb. Let us consider the 1-of-$c$ coding scheme with $T \in \{a, b\}$. For an interval-codomain regression-like classifier, implementing an $X \to T$ mapping, the class-conditional densities of any output $Y_k \in [a, b]$ are Dirac-$\delta$ combs iff they are non-null only at $\{a, b\}$. Let us assume equal priors and $b$ as the label value of $\omega_k$ and $a$ of its complement. As
}