Information Technology Reference
In-Depth Information
6
L(e)
4
2
0
e
−2
0
0.5
1
1.5
2
Fig. 2.4
L
MSE
(solid line) and
L
CE
(
for
t
=1)
(dashed line) as functions of
e
.
Thus, for cross-entropy we have a logarithmic loss function,
L
CE
(
t, e
)=
−
te
). Figure 2.4 shows (for
t
=1)the distance functions
L
MSE
(
e
)=
e
2
and
L
CE
(
t, e
)=
ln(2
−
te
). A classifier minimizing the MSE risk functional
is minimizing the second-order moment of the errors, favoring input-output
mappings with low error spread (low variance) and deviation from zero. A
classifier minimizing the CE risk functional is minimizing an average loga-
rithmic distance of the error from its worst value (respectively, 2 for
t
=1
and
−
ln(2
−
1), as shown in Fig. 2.4. As a consequence of the logarithmic
behavior
L
CE
(
t, e
) tends to focus mainly on large errors. Note that one may
have in some cases to restrict
Y
to the open interval ]
−
2 for
t
=
−
−
1
,
1[ in order to satisfy
the integrability condition of (2.31).
Example 2.3.
Let us consider a two-class problem with target set
T
=
{
,P
(0) =
P
(1) = 1
/
2 and a classifier codomain restricted to [0
,
1] ac-
cording to the following family of uniform PDFs:
f
Y
(
y
)=
u
(
y
;0
,d
)
0
,
1
}
if
T
=0
.
(2.32)
u
(
y
;1
−
d,
1) if
T
=1
1
2
u
(
e
;0
,d
)+
2
u
(
e
;
Note that according to (2.22)
f
E
(
e
) is distributed as
−
d,
0) =
u
(
e
;
−
d, d
). Therefore, the MSE risk is simply the variance of this
distribution:
R
MSE
(
d
)=
d
2
3
.
(2.33)
We now compute the cross-entropy risk, which is in this case easier to derive
from
f
Y |t
. First note that (2.28) is
n
times the empirical estimate of
R
CE
(
d
)=
−
P
(0)
E
[ln(1
−
Y
)
|
T
=0]
−
P
(1)
E
[ln(
Y
)
|
T
=1]
.
(2.34)