$$H_{R\alpha}(E) = \frac{1}{1-\alpha}\,\ln\int_{E} f^{\alpha}(e)\,de,\qquad \alpha \ge 0,\ \alpha \neq 1. \tag{2.37}$$
For $\alpha \to 1$ one obtains the Shannon entropy. We will find it useful to use Rényi's quadratic entropy ($\alpha = 2$), expressing the risk as
$$R_{R2EE}(E) \equiv H_{R2}(E) = -\ln\int_{E} f^{2}(e)\,de. \tag{2.38}$$
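For concreteness, here is a small numerical sketch of (2.38). It assumes a zero-mean Gaussian error PDF with standard deviation sigma (an arbitrary illustrative choice, not taken from the text), approximates $\int f^{2}(e)\,de$ by a Riemann sum on a grid, and compares the result with the well-known Gaussian closed form $\tfrac{1}{2}\ln(4\pi\sigma^{2})$.

```python
import numpy as np

def r2ee_numeric(pdf, grid):
    """Approximate R_R2EE(E) = -ln( integral of f^2(e) de ) by a Riemann sum on a uniform grid."""
    f = pdf(grid)
    de = grid[1] - grid[0]
    return -np.log(np.sum(f**2) * de)

sigma = 0.5                                      # hypothetical error spread
gauss = lambda e: np.exp(-e**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
grid = np.linspace(-6 * sigma, 6 * sigma, 20001)

print(r2ee_numeric(gauss, grid))                 # ~ 0.5724
print(0.5 * np.log(4 * np.pi * sigma**2))        # Gaussian closed form, ~ 0.5724
```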
Until now we have considered continuous distributions of the errors. In some
problems, however, one has to deal with a discrete error r.v.; one then uses
the discrete versions of the entropies (historically, simply called entropies):
$$H_S(E) = -\sum_{i=1}^{m} P(e_i)\ln P(e_i), \tag{2.39}$$

$$H_{R2}(E) = -\ln\sum_{i=1}^{m} P^{2}(e_i), \tag{2.40}$$
where $P(e) \equiv P_E(e)$ is the error PMF.
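As an illustration of (2.39) and (2.40), the following sketch computes both discrete entropies for a hypothetical error PMF; the probability values are arbitrary and chosen only for the example.

```python
import numpy as np

def shannon_entropy(p):
    """H_S(E) = -sum_i P(e_i) ln P(e_i); zero-probability terms contribute nothing."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return -np.sum(nz * np.log(nz))

def renyi2_entropy(p):
    """H_R2(E) = -ln sum_i P(e_i)^2."""
    p = np.asarray(p, dtype=float)
    return -np.log(np.sum(p**2))

pmf = [0.7, 0.2, 0.1]              # hypothetical error PMF over m = 3 error values
print(shannon_entropy(pmf))        # ~ 0.8018
print(renyi2_entropy(pmf))         # ~ 0.6162
print(shannon_entropy([1, 0, 0]))  # 0: a discrete Dirac-delta PMF has zero entropy
```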
We shall later see how to estimate $H_S(E)$ and $H_{R2}(E)$ for the continuous error case. We shall also see that when applying the $H_S(E)$ formula (2.36) to data classification a crude and simple estimation of $f(e)$ is all that is required; moreover, when applying $H_{R2}(E)$ the estimation of $f(e)$ is even short-circuited. Thus, in both cases we avoid having to accurately estimate a PDF, a problem traditionally considered, in general, more difficult than that of designing an accurate classifier [41, 227].
Note that indeed one may interpret the above entropies as risk functionals.
$R_{SEE}(E)$ is the expectation of the loss function $L_{SEE}(e) = -\ln f(e)$. For Rényi's quadratic error-entropy, instead of minimizing $R_{R2EE}(E)$ we will see later that it turns out to be more convenient to maximize $V_{R2}(E) = \exp(-R_{R2EE}(E))$, the so-called information potential [175]. In this case, instead of a loss function we may speak of a gain function: $V_{R2}(E)$ is the expectation of the gain function $f(e)$. One can also, of course, consider $-V_{R2}(E)$ as the risk functional expressed in terms of a loss $-f(e)$.
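Although the estimation details only appear later in the text, a common kernel-based sketch of the information potential, in the spirit of [175], replaces the expectation of $f(e)$ by a Parzen-window average over all pairs of error samples. The Gaussian kernel, the bandwidth h, and the sample below are illustrative assumptions, not the book's own prescription.

```python
import numpy as np

def information_potential(errors, h):
    """V_R2 estimate: (1/n^2) * sum_{i,j} G(e_i - e_j), G a Gaussian of variance 2*h^2 (Parzen-based sketch)."""
    e = np.asarray(errors, dtype=float)
    d = e[:, None] - e[None, :]                  # all pairwise error differences
    s2 = 2.0 * h**2                              # variance of the pairwise kernel
    g = np.exp(-d**2 / (2.0 * s2)) / np.sqrt(2.0 * np.pi * s2)
    return g.mean()                              # average over the n^2 pairs

rng = np.random.default_rng(0)
errors = rng.normal(0.0, 0.5, size=200)          # hypothetical error sample
v_r2 = information_potential(errors, h=0.1)
print(v_r2, -np.log(v_r2))                       # estimated V_R2 and the implied R_R2EE
```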
As an initial motivation to use entropic risk functionals, let us recall that entropy provides a measure of how concentrated a distribution is. For discrete distributions its minimum value (zero) corresponds to a discrete Dirac-$\delta$ PMF. For continuous distributions the minimum value (minus infinity for $H_S$ and zero for $H_{R2}$) corresponds to a PDF represented by a sequence of continuous Dirac-$\delta$ functions, a Dirac-$\delta$ comb. Let us consider the 1-of-$c$ coding scheme with $T \in \{a, b\}$. For an interval-codomain regression-like classifier, implementing an $X \to T$ mapping, the class-conditional densities of any output $Y_k \in [a, b]$ are Dirac-$\delta$ combs iff they are non-null only at $\{a, b\}$. Let us assume equal priors and $b$ as the label value of $\omega_k$ and $a$ of its complement. As
}