Even if we relax the conditions on the desired probability density functions, for instance, by choosing functions with no zeros on the $Y$ support but conveniently close to Dirac-$\delta$ functions, we may not yet reach the MEE condition for classification because of (2.43): attaining the KL minimum for one class-conditional distribution says nothing about the other class-conditional distribution or about $H_S$.
2.3.4 The Quest for Minimum Entropy
We have presented some important properties of $R_{\mathrm{MSE}}$ and $R_{\mathrm{CE}}$ in Sect. 2.2. We now discuss the properties of the $R_{\mathrm{SEE}}$ risk functional for classification problems. Restricting ourselves to the two-class setting with codomain restriction and $T = \{-1, 1\}$, we rewrite (2.43) as
$$
H_S(E) = \sum_{t \in \{-1,1\}} P(t) \int_{t-1}^{t+1} f_{E|t}(e)\,\ln\frac{1}{f_{E|t}(e)}\,de \;+\; H_S(T). \qquad (2.45)
$$
We see that $L_{EE_t}(e) = \ln\frac{1}{f_{E|t}(e)}$ are here the loss functions for the two classes. The difference relative to $L_{SE}$ and $L_{CE}$ (and other conventional, distance-like, loss functions) is that in this case the loss functions are expressed in terms of the unknown $f_{E|t}(e)$. Furthermore, in adaptive training of a classifier $f_{E|t}(e)$ will change in unforeseeable ways. The same can be said of Rényi's quadratic entropy, with gain function $f_{E|t}(e)$. Therefore, the properties of the entropy risk functionals have to be analyzed not in terms of loss functions but of the entropies themselves.
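To make the role of the unknown densities in (2.45) concrete, the following minimal sketch (an illustrative assumption, not the book's estimator) replaces $f_{E|t}$ by Gaussian kernel density estimates fitted to per-class error samples and approximates each per-class integral by the sample mean of the loss $\ln(1/\hat{f}_{E|t}(e))$; the function name and the KDE/Monte-Carlo choices are hypothetical.

```python
import numpy as np
from scipy.stats import gaussian_kde

def shannon_error_entropy(errors, targets):
    """Plug-in estimate of H_S(E) in (2.45) for T = {-1, 1}:
    sum_t P(t) * E[ln 1/f_{E|t}(e)] + H_S(T), with f_{E|t}
    replaced by a Gaussian KDE (assumption of this sketch)."""
    errors, targets = np.asarray(errors), np.asarray(targets)
    h, priors = 0.0, []
    for t in (-1, 1):
        e_t = errors[targets == t]
        p = len(e_t) / len(errors)          # empirical class prior P(t)
        priors.append(p)
        kde = gaussian_kde(e_t)             # estimate of f_{E|t}
        loss = -np.log(kde(e_t))            # per-sample loss ln 1/f_{E|t}(e)
        h += p * loss.mean()                # Monte-Carlo estimate of the integral
    h += -sum(p * np.log(p) for p in priors)  # H_S(T)
    return h

# Hypothetical usage with simulated classifier errors for the two classes.
rng = np.random.default_rng(0)
errs = np.concatenate([rng.normal(0.2, 0.3, 500), rng.normal(-0.1, 0.4, 500)])
tgts = np.concatenate([np.full(500, -1), np.full(500, 1)])
print(shannon_error_entropy(errs, tgts))
```

Note that, as the text stresses, the "loss" evaluated here changes whenever the error densities change during training, unlike a fixed distance-like loss.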
Although pattern recognition is a quest for minimum entropy [237], the
topic of entropy-minimizing distributions has only occasionally been studied,
namely in relation to finding optimal locations of PDFs in a mixture [115,
38] and applying the MinMax information measure to discrete distributions
[251]. Whereas entropy-maximizing distributions obeying given constraints
are well known, minimum entropy distributions on the real line are often
difficult to establish [125]. The only basic known result is that the minimum
entropy of unconstrained continuous densities corresponds to Dirac-
δ
combs
(sequences of Dirac-
δ
functions, including the single Dirac-
δ
function); for
discrete distributions the minimum entropy is zero and corresponds to a
single discrete Dirac-
δ
function.
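As a small numerical illustration of these statements (an assumed example, not from the text), the sketch below shows the discrete Shannon entropy approaching zero as a two-point distribution concentrates its mass on a single point, and the differential entropy of a Gaussian, $\tfrac{1}{2}\ln(2\pi e\sigma^2)$, decreasing without bound as $\sigma \to 0$, i.e., as the density approaches a Dirac-$\delta$.

```python
import numpy as np

# Discrete case: entropy of a two-point distribution as it concentrates.
for p in (0.5, 0.9, 0.99, 0.999):
    probs = np.array([p, 1.0 - p])
    h = -(probs * np.log(probs)).sum()
    print(f"P = ({p:.3f}, {1 - p:.3f})  ->  H_S = {h:.4f}")
# H_S tends to 0 as all mass moves to one point (a "discrete Dirac-delta").

# Continuous case: differential entropy of a Gaussian, 0.5*ln(2*pi*e*sigma^2),
# is unbounded below as sigma -> 0 (density approaching a Dirac-delta).
for sigma in (1.0, 0.1, 0.01, 0.001):
    h = 0.5 * np.log(2 * np.pi * np.e * sigma**2)
    print(f"sigma = {sigma:<6} ->  h(X) = {h:.4f}")
```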
Entropy magnitude is often thought to be associated with the magnitude of the PDF tails, in the sense that larger tails imply larger entropy. (A PDF $f(\cdot)$ has larger right tail than PDF $g(\cdot)$ for positive $x$ if $\exists x_0, \forall x > x_0,\ f(x) > g(x)$; similarly, for the left tail.) However, this presumption fails even in simple cases of constrained densities: the unit-variance Gaussian PDF, $g(x; 0, 1)$, has smaller tails than the unit-variance bilateral-exponential PDF, $e(x; \sqrt{2}) = \exp(-\sqrt{2}\,|x|)/\sqrt{2}$; however, the former has larger Shannon entropy, $\ln\sqrt{2\pi e} \approx 1.42$, than the latter, $1 + \ln\sqrt{2} \approx 1.35$.
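The comparison can be checked numerically; the short sketch below (an illustrative assumption, not part of the text) integrates $-f \ln f$ for both unit-variance densities and evaluates their tails at a few points, confirming that the bilateral exponential has the heavier tails but the smaller entropy.

```python
import numpy as np
from scipy.integrate import quad

# Unit-variance Gaussian and unit-variance bilateral exponential (Laplace).
gauss = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
bexp = lambda x: np.exp(-np.sqrt(2) * np.abs(x)) / np.sqrt(2)

# Differential entropies by numerical integration of -f ln f.
h = lambda f: quad(lambda x: -f(x) * np.log(f(x)), -30, 30)[0]
print(f"Gaussian : h = {h(gauss):.4f}  (analytic ln sqrt(2*pi*e) = {0.5 * np.log(2 * np.pi * np.e):.4f})")
print(f"Bilateral: h = {h(bexp):.4f}  (analytic 1 + ln sqrt(2)   = {1 + 0.5 * np.log(2):.4f})")

# Tail comparison: for large |x| the bilateral exponential dominates the Gaussian.
for x in (2.0, 4.0, 6.0):
    print(f"x = {x}:  gauss = {gauss(x):.2e}   bexp = {bexp(x):.2e}")
```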