ent at zero raises difficulties for iterative optimization algorithms. As a matter of fact, the large popularity of the MSE risk functional stems from the existence of efficient optimization algorithms, particularly those based on the original adaptive training process known as the least-mean-square Widrow-Hoff algorithm (see, e.g., [142]).
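As a concrete illustration of that adaptive process, the following minimal sketch implements the Widrow-Hoff update in Python (the data shapes, the learning rate eta, and the function name are illustrative assumptions, not taken from the text). Each sample moves the weight vector along the instantaneous negative gradient of the squared error:

```python
import numpy as np

def lms_train(X, d, eta=0.01, epochs=50):
    # Widrow-Hoff (LMS) rule: for each sample x_i with desired
    # response d_i, update w <- w + eta * (d_i - w.x_i) * x_i,
    # i.e. a stochastic-gradient step on the squared error.
    n, m = X.shape
    w = np.zeros(m)
    for _ in range(epochs):
        for i in range(n):
            e = d[i] - w @ X[i]   # instantaneous error
            w += eta * e * X[i]   # gradient step
    return w
```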
2.1.2 The Cross-Entropy Risk
The cross-entropy (CE) loss function was first proposed (although without naming it that way) in [22]; it can be derived from the maximum likelihood (ML) method applied to the estimation of the posterior probabilities $P(T_k|X)$. Each component $y_k$ of the classifier output vector, assumed as taking value in $[0,1]$, is viewed as an estimate of the posterior probability $P(T_k|x)$, $k = 1, \ldots, c$, for any $x \in X$; i.e., $y_k = \hat{P}(T_k|x)$. Let us denote $P(T_k|x)$ simply by $p_k$. The occurrence of a target vector $t$ conditioned on a given input vector $x$, in other words, a realization of the r.v. $T|x$, is governed by the joint distribution of $(T_1|x, \ldots, T_c|x)$. For 0-1 coding the probability mass function of $T|x$ is multinomial with

$$P(T|x) = p_1^{t_1} p_2^{t_2} \cdots p_c^{t_c}. \qquad (2.12)$$
Note that for $c = 2$ formula (2.12) reduces to a binomial distribution, e.g. of $T_1$, as

$$P(T|x) = p_1^{t_1} (1 - p_1)^{(1 - t_1)}. \qquad (2.13)$$
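To see formulas (2.12) and (2.13) in action, here is a minimal numerical sketch in Python (the function name and the example values of the $p_k$ are illustrative assumptions):

```python
import numpy as np

def pmf_01(t, p):
    # P(T = t | x) = prod_k p_k^{t_k}, formula (2.12),
    # for a 0-1 coded target t (one component equal to 1).
    t, p = np.asarray(t), np.asarray(p)
    return np.prod(p ** t)

# Binomial special case (2.13) with c = 2 and p_1 = 0.7:
p = np.array([0.7, 0.3])
print(pmf_01([1, 0], p))  # 0.7 = p_1^1 (1 - p_1)^0
print(pmf_01([0, 1], p))  # 0.3 = p_1^0 (1 - p_1)^1
```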
Similarly, we assign a probabilistic model to the classifier outputs, by writing

$$P(Y|x) = y_1^{t_1} y_2^{t_2} \cdots y_c^{t_c}, \quad \text{with} \;\; y_k = P(Y_k|x), \qquad (2.14)$$

with the assumption that the outputs satisfy the same constraints as true probabilities do, namely $\sum_k P(Y_k|x) = 1$.
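The text does not prescribe how this constraint is enforced; a common choice (an assumption here, not taken from the text) is a softmax output layer, sketched below, which maps raw classifier scores to values in $[0,1]$ that sum to one and can therefore play the role of the $y_k$ in (2.14):

```python
import numpy as np

def softmax(z):
    # Map raw scores z to outputs in [0, 1] summing to 1,
    # usable as the probabilities y_k = P(Y_k | x) in (2.14).
    e = np.exp(z - z.max())  # shift for numerical stability
    return e / e.sum()

y = softmax(np.array([2.0, -1.0, 0.5]))
print(y, y.sum())  # all components in [0, 1]; sum equals 1.0
```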
We would like the $Y|x$ distribution to approximate the target distribution $T|x$. For this purpose we employ a loss function that maximizes the likelihood of $Y|x$ or, equivalently, minimizes the Kullback-Leibler (KL) divergence of $Y|x$ with respect to $T|x$ (see Appendix A).

The empirical estimate of the KL divergence for i.i.d. random variables is written in the present case as:
$$\hat{D}_{KL} = \frac{1}{n}\sum_{i=1}^{n} \ln \frac{P(T_i|x_i)}{P(Y_i|x_i)} = \frac{1}{n}\sum_{i=1}^{n} \ln \frac{p_{i1}^{t_{i1}} \cdots p_{ic}^{t_{ic}}}{y_{i1}^{t_{i1}} \cdots y_{ic}^{t_{ic}}} = -\frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{c} t_{ik} \ln(y_{ik}) + \frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{c} t_{ik} \ln(p_{ik}). \qquad (2.15)$$
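Only the first term of (2.15) depends on the classifier outputs, so minimizing the empirical KL divergence amounts to minimizing that cross-entropy term. The decomposition can be checked numerically with the following sketch (variable names and example values are illustrative assumptions; rows of T are 0-1 coded targets, and P, Y hold the $p_{ik}$ and $y_{ik}$):

```python
import numpy as np

def empirical_kl(T, P, Y):
    # Empirical KL estimate of (2.15):
    #   (1/n) sum_i ln P(T_i|x_i) / P(Y_i|x_i)
    # = -(1/n) sum_{i,k} t_ik ln(y_ik) + (1/n) sum_{i,k} t_ik ln(p_ik)
    n = T.shape[0]
    cross_entropy = -np.sum(T * np.log(Y)) / n  # depends on the classifier
    target_term = np.sum(T * np.log(P)) / n     # independent of the classifier
    return cross_entropy + target_term

T = np.array([[1, 0, 0], [0, 0, 1]])                # 0-1 coded targets
P = np.array([[0.8, 0.1, 0.1], [0.2, 0.2, 0.6]])    # posteriors p_ik
Y = np.array([[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]])    # classifier outputs y_ik
print(empirical_kl(T, P, Y))                        # small positive value
```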