$d_{X,Y}(x, y) = 0$ in the $(X, Y)$ domain. The “probability density matching” expressed by (2.42) is of course important for regression applications.
For data classification, the application of the MEE approach raises three difficulties [219]. We restrict our discussion to two-class problems and therefore use (2.22) as the error PDF. We also assume the interval codomain restriction, implying that each class-conditional density $f_{Y|t}(t - e)$ lies in a separate $[t-1, t+1]$ interval. As a consequence, the differential Shannon entropy of the error, $H_S(E)$, can be decomposed as
\[
H_S(E) = p\,H_{S|1}(E) + q\,H_{S|-1}(E) + H_S(T),
\tag{2.43}
\]
where $H_{S|t}$ is the Shannon entropy of the error for class $\omega_t$ and $H_S(T) = -\sum_{t \in T} P(t) \ln P(t)$ is the Shannon entropy of the priors ($P(1) \equiv p$, $P(-1) \equiv q$). Rényi's quadratic entropy also satisfies a similar additive property when exponentially scaled (see Appendix C for both derivations). Let us recall that class-conditional distributions and entropies depend on the classifier parameter $w$, although we have been omitting this dependency for the sake of simpler notation.
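To see why the decomposition holds, here is a condensed sketch of the Appendix C argument: with disjoint class-conditional error supports, the mixture $f_E(e) = p\,f_{E|1}(e) + q\,f_{E|-1}(e)$ reduces to a single term on each support, so
\[
H_S(E) = -\int f_E \ln f_E \, de = -p \int f_{E|1} \ln\bigl(p\,f_{E|1}\bigr)\, de - q \int f_{E|-1} \ln\bigl(q\,f_{E|-1}\bigr)\, de = p\,H_{S|1}(E) + q\,H_{S|-1}(E) - p \ln p - q \ln q,
\]
and the last two terms are precisely $H_S(T)$.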
The first difficulty when applying MEE to data classification has to do with expression (2.43). Since $H_S(T)$ is a constant, $\min H_S(E)$ implies $\min\,[p\,H_{S|1}(E) + q\,H_{S|-1}(E)]$. Thus, in general, one can say nothing about the minimum (location and value) of $H_S$, since it will depend on the particular shapes of $H_{S|t}$ as functions of $w$ and on the particular value of $p$. For instance, an arbitrarily low value of $H_S$ can be achieved if one of the $f_{Y|t}$ is arbitrarily close to a Dirac-$\delta$ distribution, even if the other has the largest possible entropy (i.e., is a Gaussian distribution, under specified variance).
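The following numeric sketch, ours rather than the book's, illustrates this point. It uses untruncated Gaussians for convenience, even though the interval-codomain assumption strictly calls for compactly supported densities, and all parameter values are hypothetical:

```python
import numpy as np

# Sketch of the first difficulty: by (2.43), with disjoint class-conditional
# error supports, H_S(E) = p*H_{S|1} + q*H_{S|-1} + H_S(T). Letting one
# class-conditional density shrink toward a Dirac-delta drives H_S(E) to
# -infinity regardless of the other class, so a low H_S(E) value alone says
# nothing about classifier quality.

p, q = 0.5, 0.5                          # hypothetical class priors
H_T = -(p * np.log(p) + q * np.log(q))   # Shannon entropy of the priors

def gaussian_diff_entropy(sigma):
    """Differential entropy of a Gaussian: 0.5 * ln(2*pi*e*sigma^2)."""
    return 0.5 * np.log(2 * np.pi * np.e * sigma**2)

sigma_fixed = 1.0                        # class omega_{-1}: fixed Gaussian error
for sigma in [1.0, 1e-2, 1e-4, 1e-8]:    # class omega_1: shrinks toward a delta
    H_S = (p * gaussian_diff_entropy(sigma)
           + q * gaussian_diff_entropy(sigma_fixed) + H_T)
    print(f"sigma_1 = {sigma:.0e}  ->  H_S(E) = {H_S:8.3f}")
```

The printed values decrease without bound as $\sigma_1 \to 0$, matching the Dirac-$\delta$ argument above.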
The second difficulty has to do with the fact that, in the regression setting, one may write $f_E(e) = f_{Y|x}(d - e)$, as in [67], since there is only one distribution of $y$ values and $d$ can be seen as the average of the $y$ values. In the classification setting, however, one has to write $f_{E|t}(e) = f_{Y|t,X}(d - e, x)$. That is, one has to study what happens to each class-conditional distribution individually and, therefore, to study individually the KL divergence relative to each class-conditional distribution:
\[
D_{KL}\bigl(f_{X,Y|t} \,\|\, d_{X,Y|t}\bigr) = \int_X \int_Y f_{X,Y|t}(x, y) \ln \frac{f_{X,Y|t}(x, y)}{d_{X,Y|t}(x, y)} \, dx\, dy,
\tag{2.44}
\]
where $d_{X,Y|t}(x, y)$ is the desired joint probability density function for class $\omega_t$.
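As an illustration only, the sketch below approximates such a per-class divergence on a one-dimensional grid in $y$ for a fixed $x$; the densities, grid, and helper names are hypothetical choices of ours, not the book's:

```python
import numpy as np

# Grid approximation of a per-class KL divergence in the spirit of (2.44),
# reduced to one dimension (y only, for a fixed x). f_1 plays the role of the
# model's class-conditional density, d_1 the desired one; both are invented
# here for illustration.

y = np.linspace(-3.0, 3.0, 2001)
dy = y[1] - y[0]

def normal_pdf(y, mu, sigma):
    return np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def kl_divergence(f, d, dy):
    """D_KL(f || d) on a grid; returns inf if d vanishes where f does not."""
    mask = f > 0
    with np.errstate(divide="ignore"):
        return np.sum(f[mask] * np.log(f[mask] / d[mask])) * dy

# Class omega_1 (target t = 1): model output density vs. desired density.
f_1 = normal_pdf(y, mu=0.8, sigma=0.3)   # what the classifier currently yields
d_1 = normal_pdf(y, mu=1.0, sigma=0.2)   # what we would like it to yield
print("D_KL(f_1 || d_1) ~", kl_divergence(f_1, d_1, dy))
```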
Finally, the third difficulty arises from the fact that the KL divergence does not exist whenever $d_{X,Y|t}(x, y)$ has zeros in the domains of $X$ and $Y$ (more precisely, zeros at points where $f_{X,Y|t}$ is positive). This problem, which may or may not be present in the regression setting, is almost always present in the classification setting, since the desired input-output probability density functions are usually continuous functions with zeros in their domains; namely, one may desire the $d_{X,Y|t}(x, y)$ to be Dirac-$\delta$ functions.
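A one-dimensional worked example (ours, not the book's) makes the divergence explicit. Take a desired density supported only on a subinterval, $d(y) = \frac{1}{2a}\,\mathbf{1}_{[t-a,\,t+a]}(y)$, against a continuous model density $f$ that remains positive outside $[t-a, t+a]$. Then
\[
D_{KL}(f \,\|\, d) = \int_{|y-t|<a} f(y) \ln\bigl(2a\,f(y)\bigr)\, dy + \int_{|y-t|\ge a} f(y) \ln \frac{f(y)}{0}\, dy = +\infty,
\]
since the second integrand is unbounded wherever $f > 0$; the Dirac-$\delta$ target is the limit $a \to 0$, for which the divergence is undefined a fortiori.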