H_{R\alpha}(E) = \frac{1}{1-\alpha}\,\ln \int_E f^{\alpha}(e)\,de, \qquad \alpha > 0,\; \alpha \neq 1.    (2.37)
For α → 1 one obtains the Shannon entropy. We will find it useful to use Rényi's quadratic entropy (α = 2), expressing the risk as
R_{R2EE}(E) \equiv H_{R2}(E) = -\ln \int_E f^{2}(e)\,de.    (2.38)
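As a simple illustration, take a zero-mean Gaussian error density with standard deviation σ: then ∫_E f²(e) de = 1/(2σ√π), so H_{R2}(E) = ln(2σ√π), which grows with σ, i.e., with less concentrated errors.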
Until now we have considered continuous distributions of the errors. In some
problems, however, one has to deal with a discrete error r.v.; one then uses
the discrete versions of the entropies (historically, simply called entropies):
H_S(E) = -\sum_{i=1}^{m} P(e_i)\,\ln P(e_i),    (2.39)
H_{R2}(E) = -\ln \sum_{i=1}^{m} P^{2}(e_i),    (2.40)
where P(e) ≡ P_E(e) is the error PMF.
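Computing the discrete entropies (2.39) and (2.40) is straightforward; the short Python sketch below (the function names and example PMFs are ours, chosen only for illustration) evaluates both and shows that they shrink as the error PMF concentrates on a single value:

import numpy as np

def shannon_entropy(pmf):
    # Discrete Shannon entropy H_S, Eq. (2.39), in nats; uses the convention 0 ln 0 = 0.
    p = np.asarray(pmf, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def renyi2_entropy(pmf):
    # Discrete Renyi quadratic entropy H_R2, Eq. (2.40), in nats.
    p = np.asarray(pmf, dtype=float)
    return -np.log(np.sum(p ** 2))

# A spread-out error PMF versus a nearly degenerate (Dirac-delta-like) one.
spread = [0.25, 0.25, 0.25, 0.25]
peaked = [0.97, 0.01, 0.01, 0.01]
print(shannon_entropy(spread), renyi2_entropy(spread))   # both equal ln 4 = 1.386
print(shannon_entropy(peaked), renyi2_entropy(peaked))   # about 0.17 and 0.06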
We shall later see how to estimate H_S(E) and H_{R2}(E) for the continuous error case. We shall also see that, when applying the H_S(E) formula (2.36) to data classification, a crude and simple estimate of f(e) is all that is required; moreover, when applying H_{R2}(E) the estimation of f(e) is even short-circuited. Thus, in both cases we avoid having to accurately estimate a PDF, a problem that has traditionally been considered, in general, more difficult than designing an accurate classifier [41, 227].
Note that indeed one may interpret the above entropies as risk functionals. R_{SEE}(E) is the expectation of the loss function L_{SEE}(e) = −ln f(e). For Rényi's quadratic error-entropy, instead of minimizing R_{R2EE}(E) we will see later that it turns out to be more convenient to maximize V_{R2}(E) = exp(−R_{R2EE}(E)), the so-called information potential [175]. In this case, instead of a loss function we may speak of a gain function: V_{R2}(E) is the expectation of the gain function f(e). One can also, of course, consider −V_{R2}(E) as the risk functional expressed in terms of a loss −f(e).
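To preview how this works in practice, the sketch below estimates the information potential directly from error samples with a Gaussian Parzen window, so that no explicit PDF estimate is ever formed; the bandwidth sigma, the sample sizes and the function names are illustrative assumptions, not values prescribed by the text:

import numpy as np

def information_potential(errors, sigma=0.5):
    # Parzen-window plug-in estimate of V_R2 = integral of f(e)^2 de.
    # Placing a Gaussian kernel of width sigma on each error sample and
    # integrating the squared density estimate gives the mean, over all pairs,
    # of a Gaussian of width sigma*sqrt(2) evaluated at the sample differences.
    e = np.asarray(errors, dtype=float)
    diffs = e[:, None] - e[None, :]
    var2 = 2.0 * sigma ** 2
    g = np.exp(-diffs ** 2 / (2.0 * var2)) / np.sqrt(2.0 * np.pi * var2)
    return g.mean()

def renyi2_entropy_estimate(errors, sigma=0.5):
    # Sample-based estimate of H_R2(E) = -ln V_R2(E); no explicit PDF is formed.
    return -np.log(information_potential(errors, sigma))

# Tightly clustered errors yield a larger information potential, hence a
# smaller Renyi quadratic entropy, than widely spread errors.
rng = np.random.default_rng(0)
print(renyi2_entropy_estimate(rng.normal(0.0, 0.1, 200)) <
      renyi2_entropy_estimate(rng.normal(0.0, 1.0, 200)))   # True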
As an initial motivation to use entropic risk functionals, let us recall that entropy provides a measure of how concentrated a distribution is. For discrete distributions its minimum value (zero) corresponds to a discrete Dirac-δ PMF. For continuous distributions the minimum value (minus infinity for H_S and zero for H_{R2}) corresponds to a PDF represented by a sequence of continuous Dirac-δ functions, a Dirac-δ comb. Let us consider the 1-of-c coding scheme with T ∈ {a, b}. For an interval-codomain regression-like classifier, implementing an X → T mapping, the class-conditional densities of any output Y_k ∈ [a, b] are Dirac-δ combs iff they are non-null only at {a, b}. Let us assume equal priors and b as the label value of ω_k and a of its complement. As