ent at zero raises difficulties for iterative optimization algorithms. As a matter of fact, the large popularity of the MSE risk functional stems from the existence of efficient optimization algorithms, particularly those based on the original adaptive training process known as the least-mean-square Widrow-Hoff algorithm (see, e.g., [142]).
2.1.2 The Cross-Entropy Risk
The cross-entropy (CE) loss function was first proposed (although without naming it that way) in [22]; it can be derived from the maximum likelihood (ML) method applied to the estimation of the posterior probabilities $P(T_k \mid X)$. Each component $y_k$ of the classifier output vector, assumed to take values in $[0, 1]$, is viewed as an estimate of the posterior probability $P(T_k \mid x)$, $k = 1, \ldots, c$, for any $x$; i.e., $y_k = P(T_k \mid x)$.
Let us denote $P(T_k \mid x)$ simply by $p_k$. The occurrence of a target vector $t$ conditioned on a given input vector $x$, in other words, a realization of the r.v. $T \mid x$, is governed by the joint distribution of $(T_1 \mid x, \ldots, T_c \mid x)$. For 0-1 coding, the probability mass function of $T \mid x$ is multinomial with

$$P(T \mid x) = p_1^{t_1} p_2^{t_2} \cdots p_c^{t_c}. \tag{2.12}$$
Note that for $c = 2$, formula (2.12) reduces to a binomial distribution, e.g. of $T_1$, as

$$P(T \mid x) = p_1^{t_1} (1 - p_1)^{(1 - t_1)}. \tag{2.13}$$
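To make (2.12) and (2.13) concrete, the short numerical sketch below (with hypothetical posterior values) shows that for a 0-1 coded target the multinomial pmf simply picks out the posterior of the class coded by 1, and that for $c = 2$ it collapses to the binomial form (2.13).

```python
import numpy as np

def multinomial_pmf(t, p):
    """Formula (2.12): P(T = t | x) = prod_k p_k^{t_k} for a 0-1 coded target t."""
    t = np.asarray(t, dtype=float)
    p = np.asarray(p, dtype=float)
    return float(np.prod(p ** t))

# Hypothetical posterior values p_k for a 3-class problem (they sum to 1).
p = [0.7, 0.2, 0.1]
print(multinomial_pmf([1, 0, 0], p))   # 0.7 -- the pmf picks out p_1
print(multinomial_pmf([0, 0, 1], p))   # 0.1 -- the pmf picks out p_3

# Two-class check of (2.13): with p = (p_1, 1 - p_1), the pmf of T_1 is
# p_1^{t_1} * (1 - p_1)^{1 - t_1}.
p1 = 0.8
for t1 in (0, 1):
    assert np.isclose(multinomial_pmf([t1, 1 - t1], [p1, 1 - p1]),
                      p1**t1 * (1 - p1)**(1 - t1))
```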
Similarly, we assign a probabilistic model to the classifier outputs, by writing

$$P(Y \mid x) = y_1^{t_1} y_2^{t_2} \cdots y_c^{t_c}, \quad \text{with } y_k = P(Y_k \mid x), \tag{2.14}$$

with the assumption that the outputs satisfy the same constraints as true probabilities do, namely $\sum_k P(Y_k \mid x) = 1$.
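The text only requires that the outputs behave like probabilities; it does not say how a particular classifier enforces this. As one possible illustration (not taken from the text), a softmax mapping produces outputs in $[0, 1]$ that sum to one:

```python
import numpy as np

def softmax(z):
    """One common way (not prescribed by the text) to obtain outputs y_k in
    [0, 1] satisfying the probability constraint sum_k y_k = 1."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())      # subtract the max for numerical stability
    return e / e.sum()

y = softmax([2.0, -1.0, 0.5])
print(y, y.sum())                # each component in [0, 1], summing to 1
```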
We would like the $Y \mid x$ distribution to approximate the target distribution $T \mid x$. For this purpose we employ a loss function that maximizes the likelihood of $Y \mid x$ or, equivalently, minimizes the Kullback-Leibler (KL) divergence of $Y \mid x$ with respect to $T \mid x$ (see Appendix A).

The empirical estimate of the KL divergence for i.i.d. random variables is written in the present case as:
$$
D_{KL}(p \,\|\, y) = \frac{1}{n} \sum_{i=1}^{n} \ln \frac{P(T_i \mid x_i)}{P(Y_i \mid x_i)}
= \frac{1}{n} \sum_{i=1}^{n} \ln \frac{p_{i1}^{t_{i1}} \cdots p_{ic}^{t_{ic}}}{y_{i1}^{t_{i1}} \cdots y_{ic}^{t_{ic}}}
= -\frac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{c} t_{ik} \ln(y_{ik}) + \frac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{c} t_{ik} \ln(p_{ik}). \tag{2.15}
$$
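A direct transcription of (2.15) into code may help in reading the formula; the arrays below (0-1 coded targets, assumed true posteriors $p_{ik}$, classifier outputs $y_{ik}$) are hypothetical, and a small `eps` is added only to avoid taking the logarithm of zero. Note that the second term of (2.15) does not depend on the classifier outputs.

```python
import numpy as np

def empirical_kl(T, P, Y, eps=1e-12):
    """Formula (2.15): empirical KL divergence between the target and output
    distributions, for 0-1 coded targets T (n x c), true posteriors P (n x c)
    and classifier outputs Y (n x c). eps guards against log(0)."""
    T, P, Y = (np.asarray(a, dtype=float) for a in (T, P, Y))
    n = T.shape[0]
    cross_entropy = -np.sum(T * np.log(Y + eps)) / n   # term depending on y
    target_term   =  np.sum(T * np.log(P + eps)) / n   # independent of y
    return cross_entropy + target_term

# Hypothetical 3-class example with n = 2 patterns.
T = [[1, 0, 0], [0, 0, 1]]               # 0-1 coded targets
P = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]   # assumed true posteriors p_ik
Y = [[0.6, 0.3, 0.1], [0.2, 0.2, 0.6]]   # classifier outputs y_ik
print(empirical_kl(T, P, Y))

# Only the first term varies with Y, so moving the outputs towards the
# targets lowers the cross-entropy part and hence the KL estimate.
```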