$d_{X,Y}(x, y) = 0$ in the $(X, Y)$ domain. The “probability density matching” expressed by (2.42) is of course important for regression applications.
For data classification, the application of the MEE approach raises three difficulties [219]. We restrict our discussion to two-class problems and therefore use (2.22) as the error PDF. We also assume the interval codomain restriction, implying that each class-conditional density $f_{Y|t}(t - e)$ lies in a separate $[t-1, t+1]$ interval. As a consequence, the differential Shannon entropy of the error, $H_S(E)$, can be decomposed as
\[
H_S(E) = p\,H_{S|1}(E) + q\,H_{S|-1}(E) + H_S(T),
\tag{2.43}
\]
where $H_{S|t}$ is the Shannon entropy of the error for class $\omega_t$ and $H_S(T) = -\sum_{t \in T} P(t) \ln P(t)$ is the Shannon entropy of the priors ($P(1) \equiv p$, $P(-1) \equiv q$). Rényi's quadratic entropy also satisfies a similar additive property when exponentially scaled (see Appendix C for both derivations). Let us recall that class-conditional distributions and entropies depend on the classifier parameter $w$, although we have been omitting this dependency for the sake of simpler notation.
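To see why the decomposition holds, here is a condensed sketch of the Appendix C argument: with disjoint class-conditional error supports, the mixture $f_E(e) = p\,f_{E|1}(e) + q\,f_{E|-1}(e)$ reduces to a single term on each support, so
\[
H_S(E) = -\int f_E \ln f_E \, de = -p \int f_{E|1} \ln\bigl(p\,f_{E|1}\bigr)\, de - q \int f_{E|-1} \ln\bigl(q\,f_{E|-1}\bigr)\, de = p\,H_{S|1}(E) + q\,H_{S|-1}(E) - p \ln p - q \ln q,
\]
and the last two terms are precisely $H_S(T)$.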
The first difficulty when applying MEE to data classification has to do with expression (2.43). Since $H_S(T)$ is a constant, $\min H_S(E)$ implies $\min\,[p\,H_{S|1}(E) + q\,H_{S|-1}(E)]$. Thus, in general, one can say nothing about the minimum (location and value) of $H_S$, since it will depend on the particular shapes of $H_{S|t}$ as functions of $w$ and on the particular value of $p$. For instance, an arbitrarily low value of $H_S$ can be achieved if one of the $f_{Y|t}$ is arbitrarily close to a Dirac-$\delta$ distribution, even if the other has the largest possible entropy (i.e., is a Gaussian distribution, under specified variance).
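The following numeric sketch, ours rather than the book's, illustrates this point. It uses untruncated Gaussians for convenience, even though the interval-codomain assumption strictly calls for compactly supported densities, and all parameter values are hypothetical:

```python
import numpy as np

# Sketch of the first difficulty: by (2.43), with disjoint class-conditional
# error supports, H_S(E) = p*H_{S|1} + q*H_{S|-1} + H_S(T). Letting one
# class-conditional density shrink toward a Dirac-delta drives H_S(E) to
# -infinity regardless of the other class, so a low H_S(E) value alone says
# nothing about classifier quality.

p, q = 0.5, 0.5                          # hypothetical class priors
H_T = -(p * np.log(p) + q * np.log(q))   # Shannon entropy of the priors

def gaussian_diff_entropy(sigma):
    """Differential entropy of a Gaussian: 0.5 * ln(2*pi*e*sigma^2)."""
    return 0.5 * np.log(2 * np.pi * np.e * sigma**2)

sigma_fixed = 1.0                        # class omega_{-1}: fixed Gaussian error
for sigma in [1.0, 1e-2, 1e-4, 1e-8]:    # class omega_1: shrinks toward a delta
    H_S = (p * gaussian_diff_entropy(sigma)
           + q * gaussian_diff_entropy(sigma_fixed) + H_T)
    print(f"sigma_1 = {sigma:.0e}  ->  H_S(E) = {H_S:8.3f}")
```

The printed values decrease without bound as $\sigma_1 \to 0$, matching the Dirac-$\delta$ argument above.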
The second difficulty has to do with the fact that, in the regression setting, one may write $f_E(e) = f_{Y|x}(d - e)$, as in [67], since there is only one distribution of $y$ values and $d$ can be seen as the average of the $y$ values. In the classification setting, however, one has to write $f_{E|t}(e) = f_{Y|t,X}(d - e, x)$. That is, one has to study what happens to each class-conditional distribution individually and, therefore, to study individually the KL divergence relative to each class-conditional distribution:
\[
D_{KL}\bigl(f_{X,Y|t} \,\|\, d_{X,Y|t}\bigr) = \int_X \int_Y f_{X,Y|t}(x, y) \ln \frac{f_{X,Y|t}(x, y)}{d_{X,Y|t}(x, y)} \, dx\, dy,
\tag{2.44}
\]
where $d_{X,Y|t}(x, y)$ is the desired joint probability density function for class $\omega_t$.
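As an illustration only, the sketch below approximates such a per-class divergence on a one-dimensional grid in $y$ for a fixed $x$; the densities, grid, and helper names are hypothetical choices of ours, not the book's:

```python
import numpy as np

# Grid approximation of a per-class KL divergence in the spirit of (2.44),
# reduced to one dimension (y only, for a fixed x). f_1 plays the role of the
# model's class-conditional density, d_1 the desired one; both are invented
# here for illustration.

y = np.linspace(-3.0, 3.0, 2001)
dy = y[1] - y[0]

def normal_pdf(y, mu, sigma):
    return np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def kl_divergence(f, d, dy):
    """D_KL(f || d) on a grid; returns inf if d vanishes where f does not."""
    mask = f > 0
    with np.errstate(divide="ignore"):
        return np.sum(f[mask] * np.log(f[mask] / d[mask])) * dy

# Class omega_1 (target t = 1): model output density vs. desired density.
f_1 = normal_pdf(y, mu=0.8, sigma=0.3)   # what the classifier currently yields
d_1 = normal_pdf(y, mu=1.0, sigma=0.2)   # what we would like it to yield
print("D_KL(f_1 || d_1) ~", kl_divergence(f_1, d_1, dy))
```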
Finally, the third difficulty arises from the fact that the KL divergence does not exist whenever $d_{X,Y|t}(x, y)$ has zeros in the domains of $X$ and $Y$ (more precisely, zeros at points where $f_{X,Y|t}$ is positive). This problem, which may or may not be present in the regression setting, is almost always present in the classification setting, since the desired input-output probability density functions are usually continuous functions with zeros in their domains; namely, one may desire the $d_{X,Y|t}(x, y)$ to be Dirac-$\delta$ functions.
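A one-dimensional worked example (ours, not the book's) makes the divergence explicit. Take a desired density supported only on a subinterval, $d(y) = \frac{1}{2a}\,\mathbf{1}_{[t-a,\,t+a]}(y)$, against a continuous model density $f$ that remains positive outside $[t-a, t+a]$. Then
\[
D_{KL}(f \,\|\, d) = \int_{|y-t|<a} f(y) \ln\bigl(2a\,f(y)\bigr)\, dy + \int_{|y-t|\ge a} f(y) \ln \frac{f(y)}{0}\, dy = +\infty,
\]
since the second integrand is unbounded wherever $f > 0$; the Dirac-$\delta$ target is the limit $a \to 0$, for which the divergence is undefined a fortiori.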