Theorem 2.2. Given any loss function $L(e)$, which satisfies $\lim_{|e|\to+\infty} L(e) = +\infty$, one has
\[
R_L(E) \;\propto\; \bigl\{ H_S(E) + D_{KL}\bigl(f_E(e) \,\|\, q_L(e)\bigr) \bigr\},
\]
where $q_L(e)$ is a PDF related to $L(e)$ by $q_L(e) = \exp\bigl(\gamma_0 - \gamma_1 L(e)\bigr)$.
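As a brief illustration of how $\gamma_0$ and $\gamma_1$ arise (a sketch only, assuming the convention $\gamma_1 > 0$ so that $q_L$ is integrable), take the squared-error loss $L(e) = e^2$. Normalizing $q_L$,
\[
\int_{-\infty}^{+\infty} \exp\bigl(\gamma_0 - \gamma_1 e^2\bigr)\, de \;=\; e^{\gamma_0}\sqrt{\pi/\gamma_1} \;=\; 1
\quad\Longrightarrow\quad
\gamma_0 = \tfrac{1}{2}\ln(\gamma_1/\pi),
\]
and the choice $\gamma_1 = 1/(2\sigma^2)$ gives $\gamma_0 = -\tfrac{1}{2}\ln(2\pi\sigma^2)$, i.e., $q_L$ is the zero-mean Gaussian density with variance $\sigma^2$. Under this reading, $D_{KL}(f_E(e) \,\|\, q_L(e))$ measures how far the error PDF is from the Gaussian shape induced by the quadratic loss.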
The proof of the theorem and the existence of $q_L(e)$ are demonstrated in the cited work, which also provides the means of computing the $\gamma_0$ and $\gamma_1$ constants. Note that all loss functions we have seen so far, with the exception of the one corresponding to Rényi's quadratic entropy, satisfy the condition $\lim_{|e|\to+\infty} L(e) = +\infty$.
Theorem 2.2 also provides an interesting bound on $R_L(E)$. Since the Kullback-Leibler divergence is always non-negative, we have
\[
H_S(E) + D_{KL}\bigl(f_E(e) \,\|\, q_L(e)\bigr) \;\geq\; H_S(E), \qquad (2.41)
\]
with equality iff $f_E(e) = q_L(e)$. Therefore, minimizing any risk functional $R_L(E)$, with $L(e)$ satisfying the above condition, is equivalent to minimizing an upper bound of the error entropy $H_S(E)$. Moreover, Theorem 2.2 allows us to interpret the minimization of any risk functional $R_L(E)$ as being driven by two “forces”: one, $D_{KL}(f_E(e) \,\|\, q_L(e))$, attempts to shape the error PDF in a way that reflects the loss function itself; the other, $H_S(E)$, drives the decrease of the error dispersion, i.e., of its uncertainty.
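The relation underlying Theorem 2.2 can be verified numerically in a simple Gaussian setting. The following short Python sketch (the distributions, parameter values, and variable names are merely illustrative) takes a Gaussian error PDF, the squared-error loss with the Gaussian $q_L$ derived above, and checks that $H_S(E) + D_{KL}(f_E \,\|\, q_L)$ coincides with $\gamma_1 R_L(E) - \gamma_0$, i.e., with an affine function of the risk when the risk is written as the expected loss, while never falling below the lower bound of (2.41):

import numpy as np
from scipy import integrate
from scipy.stats import norm

# Error PDF f_E: Gaussian with mean mu and standard deviation sigma_f (illustrative choice).
mu, sigma_f = 0.5, 1.2
f_E = norm(loc=mu, scale=sigma_f)

# q_L derived from the squared-error loss L(e) = e^2:
# gamma_1 = 1/(2*sigma_q^2), gamma_0 = -0.5*ln(2*pi*sigma_q^2), so q_L is N(0, sigma_q^2).
sigma_q = 1.0
gamma_1 = 1.0 / (2.0 * sigma_q**2)
gamma_0 = -0.5 * np.log(2.0 * np.pi * sigma_q**2)
q_L = norm(loc=0.0, scale=sigma_q)

# Shannon entropy of the error, H_S(E), in closed form for a Gaussian PDF.
H_S = 0.5 * np.log(2.0 * np.pi * np.e * sigma_f**2)

# Kullback-Leibler divergence D_KL(f_E || q_L), by numerical integration.
def kl_integrand(e):
    return f_E.pdf(e) * (f_E.logpdf(e) - q_L.logpdf(e))

D_KL, _ = integrate.quad(kl_integrand, -20, 20)

# Risk for the squared-error loss: R_L = E[e^2] = mu^2 + sigma_f^2.
R_L = mu**2 + sigma_f**2

lhs = H_S + D_KL                # entropy plus divergence
rhs = gamma_1 * R_L - gamma_0   # affine function of the risk

print(f"H_S + D_KL            = {lhs:.6f}")
print(f"gamma_1*R_L - gamma_0 = {rhs:.6f}")   # should match lhs
print(f"lower bound H_S       = {H_S:.6f}")   # inequality (2.41): lhs >= H_S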
There is an abundant literature on information-theoretic topics. For the reader unfamiliar with this area, an overview of the definitions and properties of entropies (Shannon, generalized Rényi, and others) can be found in the following works: [131, 183, 168, 48, 184, 164, 96, 62]. Appendix B presents a short survey of the properties that are particularly important throughout the text.
2.3.3 MEE Is Harder for Classification than for Regression
An important result concerning the minimization of error entropy was shown in [67] for a machine solving a regression task, whose output $y$ approximates some desired continuous function $d(x)$. The authors showed that the MEE approach corresponds to the minimum of the Kullback-Leibler divergence of $f_{X,Y}$ (the joint PDF when the output is $y_w(x)$) with respect to $d_{X,Y}$ (the joint PDF when the output is the desired $d(x)$). Concretely, they showed that
\[
\min H_S(E) \;\equiv\; \min D_{KL}\bigl(f_{X,Y} \,\|\, d_{X,Y}\bigr) \;=\; \min \int_{X \times Y} f_{X,Y}(x,y)\, \ln \frac{f_{X,Y}(x,y)}{d_{X,Y}(x,y)}\, dx\, dy \,. \qquad (2.42)
\]
Their demonstration was, in fact, presented for the generalized family of Rényi entropies, which includes the Shannon entropy $H_S(E)$ as a limiting case. Moreover, although not explicit in their demonstration, the above result is only valid if the above integrals exist, which among other things require