Note that, since the $p_{ik} = P(T_k|x_i)$ are unknown, (2.15) cannot be used as a risk estimator. However, the $p_{ik}$ do not depend on the classifier parameter vector $w$; therefore, the minimization of (2.15) is equivalent to the minimization of
$$R_{CE}(y) = -\sum_{i=1}^{n}\sum_{k=1}^{c} t_{ik}\ln(y_{ik})\,. \qquad (2.16)$$
The empirical risk (2.16) is known in the literature as the \emph{cross-entropy} (CE) risk. This designation is, however, a misnomer. Despite the similarity between (2.16) and the cross-entropy of two discrete distributions, $-\sum_{x} P(x)\ln Q(x)$, with PMFs $P(x)$ and $Q(x)$, one should note that the $t_{ik}$ are \emph{not} probabilities (the $t_i$ are random vectors with multinomial distribution). There is a tendency to “interpret” the $t_{ik}$ as $P(T_k|x_i)$, and some literature is misleading in that sense. As a matter of fact, since the $t_{ik}$ are binary-valued (in $\{0,1\}$ for the 0-1 coding we are assuming), such an “interpretation” is \emph{incorrect} (no matter which coding scheme we are using): it would amount to saying that every object is correctly classified! Briefly, the $t_{ik}$ do \emph{not} form a valid probability distribution. They should be interpreted as mere switches: when a particular $t_{ik}$ is equal to 1 (meaning that $x_i$ belongs to class $\omega_k$), $y_{ik}$ should be maximum, and we then just minimize $-\ln(y_{ik})$, since all the remaining $t_{il}$, with $l \neq k$, are zero.
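To make the switch behavior concrete, here is a minimal sketch (ours, for illustration; the toy arrays `T` and `Y` are assumed, not taken from the text) that computes (2.16) for 0-1 coded targets and checks that each term reduces to $-\ln(y_{ik})$ for the true class of $x_i$:

```python
import numpy as np

# Toy data: n = 3 objects, c = 2 classes, 0-1 (one-hot) coded targets t_ik.
T = np.array([[1, 0],
              [0, 1],
              [1, 0]])

# Classifier outputs y_ik (each row sums to 1, e.g. softmax outputs).
Y = np.array([[0.9, 0.1],
              [0.2, 0.8],
              [0.6, 0.4]])

# Empirical CE risk (2.16): R_CE = -sum_i sum_k t_ik ln(y_ik).
R_CE = -np.sum(T * np.log(Y))

# The t_ik act as switches: only the true-class output of each x_i survives.
R_CE_switch = -np.sum(np.log(Y[np.arange(len(T)), T.argmax(axis=1)]))

assert np.isclose(R_CE, R_CE_switch)
print(R_CE)  # -(ln 0.9 + ln 0.8 + ln 0.6), approx. 0.839
```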
Although, as we have explained, the designation of (2.16) as cross-entropy
is incorrect, we will keep it given its wide acceptance.
When applying the empirical $R_{CE}$ risk, one should note that whenever the classifier outputs are continuous and differentiable, $R_{CE}$ is also continuous and differentiable. The usual optimization algorithms, namely any gradient descent algorithm, can then be applied to the minimization of the empirical cross-entropy risk.
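As an illustration of this point, the following sketch (under our own assumptions: a linear classifier with softmax outputs and plain batch gradient descent, none of which is prescribed by the text) minimizes the CE risk by gradient descent:

```python
import numpy as np

def softmax(z):
    """Row-wise softmax, so the outputs y_ik lie in (0, 1) and sum to 1."""
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def train_ce(X, T, lr=0.1, epochs=500):
    """Batch gradient descent on the (mean) empirical CE risk (2.16)."""
    n, d = X.shape
    W = np.zeros((d, T.shape[1]))
    for _ in range(epochs):
        Y = softmax(X @ W)
        # For softmax outputs, the gradient of the CE risk w.r.t. W
        # takes the simple form X^T (Y - T).
        W -= lr * (X.T @ (Y - T)) / n
    return W

# Usage: W = train_ce(X, T); predicted classes: softmax(X @ W).argmax(axis=1)
```

The simple form of the gradient for softmax outputs, $X^\top(Y - T)$, is one practical reason this risk pairs so well with gradient-based training.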
From the above discussion it would seem appropriate to always employ a minimum cross-entropy (MCE) approach to train classifiers, because when interpreting the outputs as probabilities this is the optimal solution (in a maximum likelihood sense). In fact, $R_{CE}$ takes into account the binary characteristic of the targets. No similar interpretation exists for $R_{MSE}$. (The ML equivalence to MSE is only valid for zero-mean, equal-variance Gaussian targets.)
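To spell out that parenthetical remark (our sketch of the standard argument, not the book's own derivation): treating each $t_i$ as a multinomial draw with cell probabilities $y_{ik}$, the likelihood of the training set is

$$L(w) = \prod_{i=1}^{n}\prod_{k=1}^{c} y_{ik}^{\,t_{ik}} \;\Longrightarrow\; -\ln L(w) = -\sum_{i=1}^{n}\sum_{k=1}^{c} t_{ik}\ln(y_{ik}) = R_{CE}(y),$$

so maximum likelihood under the (correct) discrete target model is exactly minimum CE. By contrast, for Gaussian targets $t_{ik} = y_{ik} + \varepsilon_{ik}$ with $\varepsilon_{ik} \sim \mathcal{N}(0,\sigma^2)$,

$$-\ln L(w) = \frac{1}{2\sigma^2}\sum_{i=1}^{n}\sum_{k=1}^{c} (t_{ik} - y_{ik})^2 + \text{const},$$

which is, up to scaling, $R_{MSE}$; this Gaussian model is the only setting in which MSE inherits the ML interpretation.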
The derivation of $R_{CE}$ can be found in the works of [83, 185], applying either the maximum likelihood or the maximum mutual information principle, and assuming the classifier outputs are approximations of posterior probabilities. The analysis provided by [89] goes further and presents a general expression that any loss function should satisfy so that $y_k = P(T_k|x)$. It assumes the independence of the target components $t_k$ (a condition that is never fulfilled, since any component is the complement of all the other ones) and, in addition, that the empirical risk is expressed as a distance functional of outputs and targets as follows: