Note that, since the $p_{ik} = P(T_k \mid x_i)$ are unknown, (2.15) cannot be used as a risk
estimator. However, the $p_{ik}$ do not depend on the classifier parameter vector $w$;
therefore the minimization of (2.15) is equivalent to the minimization of
$$R_{CE}(y) = -\sum_{i=1}^{n} \sum_{k=1}^{c} t_{ik} \ln(y_{ik}). \qquad (2.16)$$
The empirical risk (2.16) is known in the literature as the cross-entropy (CE)
risk. This designation is, however, a misnomer. Despite the similarity between
(2.16) and the cross-entropy of two discrete distributions, $-\sum_{x} P(x)\ln Q(x)$,
with PMFs $P(x)$ and $Q(x)$, one should note that the $t_{ik}$ are not probabilities
(the $t_i$ are random vectors with multinomial distribution). There is a tendency
to "interpret" the $t_{ik}$ as $P(T_k \mid x_i)$, and some literature is misleading in that
sense. As a matter of fact, since the $t_{ik}$ are binary-valued (in $\{0, 1\}$ for the
0-1 coding we are assuming), such an "interpretation" is incorrect (no matter
which coding scheme we are using): it would amount to saying that every
object is correctly classified! Briefly, the $t_{ik}$ do not form a valid probability
distribution. They should be interpreted as mere switches: when a particular
$t_{ik}$ is equal to 1 (meaning that $x_i$ belongs to class $\omega_k$), $y_{ik}$ should be maximum,
and we then just minimize $-\ln(y_{ik})$, since all the remaining $t_{il}$, with $l \neq k$,
are zero.
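This switch behaviour is easy to check numerically. The following sketch is ours, not taken from the cited works; the function name cross_entropy_risk and the toy arrays are illustrative assumptions. It evaluates the empirical risk (2.16) for 0-1 coded targets:

import numpy as np

# Empirical cross-entropy risk (2.16) for 0-1 coded targets t and
# strictly positive classifier outputs y (both of shape (n, c)).
def cross_entropy_risk(t, y):
    return -np.sum(t * np.log(y))

# Toy example (illustrative values): two objects, three classes.
t = np.array([[1., 0., 0.],
              [0., 1., 0.]])          # one-hot targets t_ik
y = np.array([[0.7, 0.2, 0.1],
              [0.3, 0.6, 0.1]])       # classifier outputs y_ik
# Only the true-class terms survive: -(ln 0.7 + ln 0.6), about 0.8675
print(cross_entropy_risk(t, y))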
Although, as we have explained, the designation of (2.16) as cross-entropy
is incorrect, we will keep it given its wide acceptance.
When using the empirical risk $R_{CE}$, one should note that whenever the
classifier outputs are continuous and differentiable, $R_{CE}$ is also continuous
and differentiable. The usual optimization algorithms can then be applied
to the minimization of the empirical cross-entropy risk, in particular any gradient
descent algorithm.
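As an illustration of this point, a plain gradient descent on the empirical CE risk fits in a few lines. The sketch below is ours and assumes a softmax (multinomial logistic) classifier with weight matrix W trained on synthetic Gaussian data; none of these choices are prescribed by the text:

import numpy as np

rng = np.random.default_rng(0)
n, d, c = 300, 2, 3
labels = rng.integers(0, c, size=n)
means = np.array([[0., 0.], [3., 0.], [0., 3.]])     # illustrative class means
X = means[labels] + rng.normal(size=(n, d))          # inputs x_i
T = np.eye(c)[labels]                                # 0-1 coded targets t_ik

W = np.zeros((d, c))                                 # classifier parameters
eta = 0.5                                            # step size (illustrative)
for _ in range(200):
    Z = X @ W
    Y = np.exp(Z - Z.max(axis=1, keepdims=True))
    Y /= Y.sum(axis=1, keepdims=True)                # outputs y_ik (softmax)
    R = -np.sum(T * np.log(Y)) / n                   # empirical CE risk per object
    W -= eta * X.T @ (Y - T) / n                     # gradient of R w.r.t. W
print("final empirical CE risk:", R)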
From the above discussion it would seem appropriate to always employ
a minimum cross-entropy (MCE) approach to train classifiers, because, when
the outputs are interpreted as probabilities, this is the optimal solution (in a
maximum likelihood sense). In fact, $R_{CE}$ takes into account the binary character
of the targets. No similar interpretation exists for $R_{MSE}$. (The ML
equivalence to MSE is only valid for zero-mean and equal-variance Gaussian
targets.)
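For concreteness, the maximum likelihood reading can be spelled out (this is the standard argument, sketched here in our own notation rather than quoted from [83, 185]): if the outputs $y_{ik}$ are taken as estimates of $P(T_k \mid x_i)$ and the target vectors $t_i$ as independent multinomial draws, the training-set likelihood is

$$L(w) = \prod_{i=1}^{n} \prod_{k=1}^{c} y_{ik}^{\,t_{ik}}, \qquad -\ln L(w) = -\sum_{i=1}^{n} \sum_{k=1}^{c} t_{ik} \ln(y_{ik}) = R_{CE}(y),$$

so maximizing the likelihood amounts exactly to minimizing (2.16).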
The derivation of $R_{CE}$ can be found in the works of [83, 185], applying
either the maximum likelihood or maximum mutual information principles,
and assuming the classifier outputs are approximations of posterior probabilities.
The analysis provided by [89] goes further and presents a general
expression that any loss function should satisfy so that $y_k = P(T_k \mid x)$. It
assumes the independence of the target components $t_k$ (a condition that is
never fulfilled, since any component is the complement of all other ones) and
in addition that the empirical risk is expressed as a distance functional of
outputs and targets as follows: