Note that, since the $p_{ik} = P(T_k|x_i)$ are unknown, (2.15) cannot be used as a risk estimator. However, the $p_{ik}$ do not depend on the classifier parameter vector $w$; therefore, the minimization of (2.15) is equivalent to the minimization of
$$R_{CE}(y) = -\sum_{i=1}^{n}\sum_{k=1}^{c} t_{ik}\ln(y_{ik})\,. \qquad (2.16)$$
The empirical risk (2.16) is known in the literature as the \emph{cross-entropy} (CE) risk. This designation is, however, a misnomer. Despite the similarity between (2.16) and the cross-entropy of two discrete distributions, $-\sum_{x} P(x)\ln Q(x)$, with PMFs $P(x)$ and $Q(x)$, one should note that the $t_{ik}$ are \emph{not} probabilities (the $t_i$ are random vectors with multinomial distribution). There is a tendency to “interpret” the $t_{ik}$ as $P(T_k|x_i)$, and some literature is misleading in that sense. As a matter of fact, since the $t_{ik}$ are binary-valued (in $\{0,1\}$ for the 0-1 coding we are assuming), such an “interpretation” is \emph{incorrect} (no matter which coding scheme we are using): it would amount to saying that every object is correctly classified! Briefly, the $t_{ik}$ do \emph{not} form a valid probability distribution. They should be interpreted as mere switches: when a particular $t_{ik}$ is equal to 1 (meaning that $x_i$ belongs to class $\omega_k$), $y_{ik}$ should be maximum, and we then just minimize $-\ln(y_{ik})$, since all the remaining $t_{il}$, with $l \neq k$, are zero.
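To make the switch behavior concrete, here is a minimal sketch (ours, for illustration; the toy arrays `T` and `Y` are assumed, not taken from the text) that computes (2.16) for 0-1 coded targets and checks that each term reduces to $-\ln(y_{ik})$ for the true class of $x_i$:

```python
import numpy as np

# Toy data: n = 3 objects, c = 2 classes, 0-1 (one-hot) coded targets t_ik.
T = np.array([[1, 0],
              [0, 1],
              [1, 0]])

# Classifier outputs y_ik (each row sums to 1, e.g. softmax outputs).
Y = np.array([[0.9, 0.1],
              [0.2, 0.8],
              [0.6, 0.4]])

# Empirical CE risk (2.16): R_CE = -sum_i sum_k t_ik ln(y_ik).
R_CE = -np.sum(T * np.log(Y))

# The t_ik act as switches: only the true-class output of each x_i survives.
R_CE_switch = -np.sum(np.log(Y[np.arange(len(T)), T.argmax(axis=1)]))

assert np.isclose(R_CE, R_CE_switch)
print(R_CE)  # -(ln 0.9 + ln 0.8 + ln 0.6), approx. 0.839
```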
Although, as we have explained, the designation of (2.16) as cross-entropy
is incorrect, we will keep it given its wide acceptance.
When applying the empirical $R_{CE}$ risk, one should note that whenever the classifier outputs are continuous and differentiable, $R_{CE}$ is also continuous and differentiable. The usual optimization algorithms, namely any gradient descent algorithm, can then be applied to the minimization of the empirical cross-entropy risk.
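As an illustration of this point, the following sketch (under our own assumptions: a linear classifier with softmax outputs and plain batch gradient descent, none of which is prescribed by the text) minimizes the CE risk by gradient descent:

```python
import numpy as np

def softmax(z):
    """Row-wise softmax, so the outputs y_ik lie in (0, 1) and sum to 1."""
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def train_ce(X, T, lr=0.1, epochs=500):
    """Batch gradient descent on the (mean) empirical CE risk (2.16)."""
    n, d = X.shape
    W = np.zeros((d, T.shape[1]))
    for _ in range(epochs):
        Y = softmax(X @ W)
        # For softmax outputs, the gradient of the CE risk w.r.t. W
        # takes the simple form X^T (Y - T).
        W -= lr * (X.T @ (Y - T)) / n
    return W

# Usage: W = train_ce(X, T); predicted classes: softmax(X @ W).argmax(axis=1)
```

The simple form of the gradient for softmax outputs, $X^\top(Y - T)$, is one practical reason this risk pairs so well with gradient-based training.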
From the above discussion it would seem appropriate to always employ a minimum cross-entropy (MCE) approach to train classifiers, because when interpreting the outputs as probabilities this is the optimal solution (in a maximum likelihood sense). In fact, $R_{CE}$ takes into account the binary characteristic of the targets. No similar interpretation exists for $R_{MSE}$. (The ML equivalence to MSE is only valid for zero-mean, equal-variance Gaussian targets.)
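To spell out that parenthetical remark (our sketch of the standard argument, not the book's own derivation): treating each $t_i$ as a multinomial draw with cell probabilities $y_{ik}$, the likelihood of the training set is

$$L(w) = \prod_{i=1}^{n}\prod_{k=1}^{c} y_{ik}^{\,t_{ik}} \;\Longrightarrow\; -\ln L(w) = -\sum_{i=1}^{n}\sum_{k=1}^{c} t_{ik}\ln(y_{ik}) = R_{CE}(y),$$

so maximum likelihood under the (correct) discrete target model is exactly minimum CE. By contrast, for Gaussian targets $t_{ik} = y_{ik} + \varepsilon_{ik}$ with $\varepsilon_{ik} \sim \mathcal{N}(0,\sigma^2)$,

$$-\ln L(w) = \frac{1}{2\sigma^2}\sum_{i=1}^{n}\sum_{k=1}^{c} (t_{ik} - y_{ik})^2 + \text{const},$$

which is, up to scaling, $R_{MSE}$; this Gaussian model is the only setting in which MSE inherits the ML interpretation.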
The derivation of $R_{CE}$ can be found in the works of [83, 185], applying either the maximum likelihood or the maximum mutual information principle, and assuming the classifier outputs are approximations of posterior probabilities. The analysis provided by [89] goes further and presents a general expression that any loss function should satisfy so that $y_k = P(T_k|x)$. It assumes the independence of the target components $t_k$ (a condition that is never fulfilled, since any component is the complement of all the other ones) and, in addition, that the empirical risk is expressed as a distance functional of outputs and targets as follows: