Even if we relax the conditions on the desired probability density functions, for instance by choosing functions with no zeros on the Y support but conveniently close to Dirac-δ functions, we may not yet reach the MEE condition for classification because of (2.43): attaining the KL minimum for one class-conditional distribution says nothing about the other class-conditional distribution or about H_S.
2.3.4 The Quest for Minimum Entropy
We have presented some important properties of R_{MSE} and R_{CE} in Sect. 2.2. We now discuss the properties of the R_{SEE} risk functional for classification problems. Restricting ourselves to the two-class setting with codomain restriction and T = {−1, 1}, we rewrite (2.43) as
H_S(E) = \sum_{t \in \{-1,1\}} P(t) \int_{t-1}^{t+1} \ln\frac{1}{f_{E|t}(e)} \, f_{E|t}(e) \, de + H_S(T).    (2.45)
We see that L_{EE}^{t}(e) = −ln f_{E|t}(e) are here the loss functions for the two classes. The difference relative to L_{SE} and L_{CE} (and other conventional, distance-like loss functions) is that in this case the loss functions are expressed in terms of the unknown f_{E|t}(e). Furthermore, in adaptive training of a classifier, f_{E|t}(e) will change in unforeseeable ways. The same can be said of Rényi's quadratic entropy, with gain function f_{E|t}(e). Therefore, the properties of the entropy risk functionals have to be analyzed not in terms of loss functions but in terms of the entropies themselves.
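Because the per-class losses involve the unknown f_{E|t}(e), any practical evaluation of H_S(E) must estimate these densities from the error samples themselves. The following is a minimal sketch of such an empirical evaluation of the decomposition (2.45), assuming Gaussian-kernel (Parzen window) estimates obtained with scipy.stats.gaussian_kde and a resubstitution average for each class-conditional term; the function name and the toy data are illustrative, not a prescribed implementation.

```python
import numpy as np
from scipy.stats import gaussian_kde

def shannon_error_entropy(errors, targets):
    """Resubstitution estimate of H_S(E) via the class-conditional
    decomposition (2.45): sum_t P(t) H_S(E|t) + H_S(T)."""
    errors = np.asarray(errors, dtype=float)
    targets = np.asarray(targets)
    h, priors = 0.0, []
    for t in (-1, 1):
        e_t = errors[targets == t]
        p_t = len(e_t) / len(errors)
        priors.append(p_t)
        f_t = gaussian_kde(e_t)                   # Parzen estimate of f_{E|t}
        h += p_t * np.mean(-np.log(f_t(e_t)))     # estimate of H_S(E|t)
    p = np.array(priors)
    return h - np.sum(p * np.log(p))              # add H_S(T)

# Toy usage: errors e = t - y for a classifier with codomain [-1, 1].
rng = np.random.default_rng(0)
t = rng.choice([-1, 1], size=500)
y = np.clip(0.8 * t + rng.normal(0.0, 0.2, size=500), -1.0, 1.0)
print(shannon_error_entropy(t - y, t))
```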
Although pattern recognition is a quest for minimum entropy [237], the
topic of entropy-minimizing distributions has only occasionally been studied,
namely in relation to finding optimal locations of PDFs in a mixture [115,
38] and applying the MinMax information measure to discrete distributions
[251]. Whereas entropy-maximizing distributions obeying given constraints
are well known, minimum entropy distributions on the real line are often
difficult to establish [125]. The only basic known result is that the minimum entropy of unconstrained continuous densities corresponds to Dirac-δ combs (sequences of Dirac-δ functions, including the single Dirac-δ function); for discrete distributions the minimum entropy is zero and corresponds to a single discrete Dirac-δ function.
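The discrete case is easy to illustrate numerically: the Shannon entropy of a probability vector reaches its minimum, zero, exactly when all the mass sits on a single outcome. The short sketch below (the helper function is illustrative) shows the entropy shrinking as the distribution concentrates.

```python
import numpy as np

def shannon_entropy(p):
    """Shannon entropy in nats of a discrete distribution; 0*ln(0) taken as 0."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return float(-np.sum(nz * np.log(nz)))

# Progressively concentrating the mass drives the entropy to its minimum, zero.
for p in ([1/3, 1/3, 1/3], [0.8, 0.1, 0.1], [0.99, 0.005, 0.005], [1.0, 0.0, 0.0]):
    print(p, round(shannon_entropy(p), 4))
```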
Entropy magnitude is often thought to be associated with the magnitude of the PDF tails, in the sense that larger tails imply larger entropy. (A PDF f(·) has a larger right tail than a PDF g(·) for positive x if there is an x_0 such that f(x) > g(x) for all x > x_0; similarly for the left tail.) However, this presumption fails even in simple cases of constrained densities: the unit-variance Gaussian PDF, g(x; 0, 1), has smaller tails than the unit-variance bilateral exponential PDF, e(x; √2) = exp(−√2 |x|)/√2; yet the former has the larger Shannon entropy, ln √(2πe) ≈ 1.42 nats, against ln(√2 e) ≈ 1.35 nats for the latter.
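This comparison is easily checked numerically. The sketch below assumes scipy's norm and laplace distributions, with scale 1/√2 giving the unit-variance bilateral exponential; it confirms that the heavier-tailed density has the smaller differential entropy.

```python
import numpy as np
from scipy.stats import norm, laplace

g = norm(loc=0, scale=1)                  # unit-variance Gaussian
e = laplace(loc=0, scale=1/np.sqrt(2))    # unit-variance bilateral exponential

print(g.entropy())    # 0.5*ln(2*pi*e) ~ 1.419 nats
print(e.entropy())    # 1 + ln(sqrt(2)) ~ 1.347 nats

# ...yet the bilateral exponential has the heavier tails:
x = 4.0
print(e.pdf(x), g.pdf(x), e.pdf(x) > g.pdf(x))
```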
 