$d_{X,Y}(x,y) = 0$ in the $(X,Y)$ domain. The "probability density matching" expressed by (2.42) is of course important for regression applications.
For data classification, the application of the MEE approach raises three difficulties [219]. We restrict our discussion to two-class problems and therefore use (2.22) as the error PDF. We also assume the interval codomain restriction, implying that each class-conditional density $f_{Y|t}(t-e)$ lies in a separate $[t-1,\,t+1]$ interval. As a consequence, the differential Shannon's entropy of the error $H_S(E)$ can be decomposed as
$$H_S(E) = p\,H_{S|1}(E) + q\,H_{S|-1}(E) + H_S(T), \qquad (2.43)$$
where $H_{S|t}$ is the Shannon's entropy of the error for class $\omega_t$ and $H_S(T) = -\sum_{t\in T} P(t)\ln P(t)$ is the Shannon's entropy of the priors ($P(1)=p$, $P(-1)=q$). Rényi's quadratic entropy also satisfies a similar additive property when
exponentially scaled (see Appendix C for both derivations). Let us recall
that class conditional distributions and entropies depend on the classifier
parameter w , although we have been omitting this dependency for the sake
of simpler notation.
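As a concrete check of the decomposition (2.43), the following minimal sketch compares a direct numerical evaluation of $H_S(E)$ with the right-hand side of (2.43). The priors and class-conditional error densities are illustrative choices of our own (a rescaled Beta(2,2) and a uniform density on disjoint intervals), not taken from the text.

```python
# Numerical check of the decomposition (2.43) under the interval codomain
# restriction: the class-conditional error densities live on the disjoint
# intervals [t-1, t+1] for t = +1 and t = -1.
import numpy as np

p, q = 0.3, 0.7                      # assumed priors P(1) = p, P(-1) = q

def shannon_entropy(f, x):
    """Differential Shannon entropy -integral f ln f, trapezoidal rule."""
    fx = np.where(f > 0, f, 1.0)     # avoid log(0); 0*ln(0) treated as 0
    return -np.trapz(np.where(f > 0, f * np.log(fx), 0.0), x)

# Class omega_1: error on [0, 2]; class omega_{-1}: error on [-2, 0].
e1 = np.linspace(0.0, 2.0, 20001)
f1 = 6.0 * (e1 / 2.0) * (1.0 - e1 / 2.0) / 2.0   # Beta(2,2) rescaled to [0, 2]
em1 = np.linspace(-2.0, 0.0, 20001)
fm1 = np.full_like(em1, 0.5)                      # uniform on [-2, 0]

H_1  = shannon_entropy(f1, e1)                    # H_{S|1}(E)
H_m1 = shannon_entropy(fm1, em1)                  # H_{S|-1}(E)
H_T  = -(p * np.log(p) + q * np.log(q))           # entropy of the priors

# Overall error density: f_E = p f_{E|1} + q f_{E|-1} (disjoint supports).
e  = np.concatenate([em1, e1])
fE = np.concatenate([q * fm1, p * f1])
H_E = shannon_entropy(fE, e)

print(H_E, p * H_1 + q * H_m1 + H_T)              # the two values coincide
```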
The first difficulty when applying MEE to data classification has to do with expression (2.43). Since $H_S(T)$ is a constant, $\min H_S(E)$ implies $\min\,[p\,H_{S|1}(E) + q\,H_{S|-1}(E)]$. Thus, in general, one can say nothing about the minimum (location and value) of $H_S$, since it will depend on the particular shapes of the $H_{S|t}$ as functions of $w$ and on the particular value of $p$. For instance, an arbitrarily low value of $H_S$ can be achieved if one of the $f_{Y|t}$ is arbitrarily close to a Dirac-$\delta$ distribution, even if the other has the largest possible entropy (i.e., is a Gaussian distribution, under specified variance).
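To illustrate this point numerically (a sketch with assumed uniform class-conditional error densities, not an example from the text), note how the weighted sum $p\,H_{S|1}(E) + q\,H_{S|-1}(E)$ decreases without bound as just one of the densities narrows:

```python
# Since H_S(T) is constant, minimizing H_S(E) amounts to minimizing
# p*H_{S|1} + q*H_{S|-1}; that sum can be driven arbitrarily low by letting
# one class-conditional error density approach a Dirac-delta, regardless of
# how spread out the other one is.  Densities and priors are assumptions.
import numpy as np

p, q = 0.5, 0.5
H_broad = np.log(2.0)        # entropy of a uniform density on an interval of length 2

for width in [1.0, 0.1, 0.01, 0.001]:
    H_narrow = np.log(width) # entropy of a uniform density of the given width
    print(width, p * H_narrow + q * H_broad)
# The weighted sum decreases without bound as width -> 0, so min H_S(E) alone
# says nothing about how the errors of the other class are distributed.
```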
The second difficulty has to do with the fact that in the regression setting one may write $f_E(e) = f_{Y|x}(d-e)$, as in [67], since there is only one distribution of $y$ values and $d$ can be seen as the average of the $y$ values. However, in the classification setting one has to write $f_{E|t}(e) = f_{Y|t,X}(d-e, x)$. That is, one has to study what happens to each class-conditional distribution individually and, therefore, to study the KL divergence relative to each class-conditional distribution, that is:
$$D_{KL}(f_{X,Y|t}\,\|\,d_{X,Y|t}) = \int_X \int_Y f_{X,Y|t}(x,y)\,\ln\frac{f_{X,Y|t}(x,y)}{d_{X,Y|t}(x,y)}\,dx\,dy, \qquad (2.44)$$
where $d_{X,Y|t}(x,y)$ is the desired joint probability density function for class $\omega_t$.
Finally, the third difficulty arises from the fact that the KL divergence does not exist whenever $d_{X,Y|t}(x,y)$ has zeros in the domains of $X$ and $Y$. This problem, which may or may not be present in the regression setting, is almost always present in the classification setting, since the desired input-output probability density functions are usually continuous functions with zeros in their domains; namely, one may desire the $d_{X,Y|t}(x,y)$ to be Dirac-$\delta$ functions.
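The following minimal sketch (with illustrative one-dimensional densities of our own choosing) shows this behaviour on a discretized domain: the divergence in (2.44) is finite for a strictly positive desired density, but ceases to exist as soon as the desired density vanishes where the class-conditional density does not:

```python
# Sketch of the third difficulty: D_KL(f || d) is only defined when the desired
# density d is nonzero wherever f is; a desired density that vanishes over part
# of the domain (anything approaching a Dirac-delta) is not a valid KL target.
# The densities below are illustrative assumptions, not taken from the text.
import numpy as np

x = np.linspace(-3.0, 3.0, 6001)
dx = x[1] - x[0]
f = np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)   # classifier-induced density

def kl(f, d):
    """Discretized D_KL(f || d); infinite if d = 0 somewhere f > 0."""
    mask = f > 0
    if np.any(d[mask] == 0.0):
        return np.inf
    return np.sum(f[mask] * np.log(f[mask] / d[mask])) * dx

# A strictly positive desired density gives a finite (if large) divergence ...
d_pos = np.exp(-0.5 * (x / 0.1)**2) / (0.1 * np.sqrt(2.0 * np.pi))
print(kl(f, d_pos))                               # finite

# ... but a delta-like box with zeros in the domain does not.
d_box = np.where(np.abs(x) <= 0.05, 10.0, 0.0)
print(kl(f, d_box))                               # inf
```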
 