The asymptotic and generalization behaviors of Example 3.6 can be confirmed for other Gaussian datasets with equal and unequal covariances. There is a theoretical justification for the good generalization of the MEE linear discriminant with independent Gaussian inputs. It is based on the following theorem.

Theorem 3.1. The minimization of Shannon's or Rényi's quadratic entropy of a weighted sum of $d$ independent Gaussian distributions implies the minimization of the norm of the weights.
Proof. The weighted sum of $d$ independent Gaussian distributions, $y = \mathbf{w}^T\mathbf{x}$, has the PDF

$$ f(y) = g\left(y;\; \mathbf{w}^T\boldsymbol{\mu},\; \mathbf{w}^T\Sigma\mathbf{w}\right), \qquad (3.35) $$

with $\Sigma$ a diagonal matrix of the variances, since the distributions are independent. But:

$$ H_S(Y) = \ln\sqrt{2\pi e\,\mathbf{w}^T\Sigma\mathbf{w}}\,; \qquad (3.36) $$

$$ H_{R_2}(Y) = \ln\left(2\sqrt{\pi\,\mathbf{w}^T\Sigma\mathbf{w}}\right). \qquad (3.37) $$

The quadratic form $\mathbf{w}^T\Sigma\mathbf{w}$ can be written as $\sum_{i=1}^{d}\sigma_i^2 w_i^2$, a positively weighted sum of the squared weights; therefore, the minimization of either $H_S$ or $H_{R_2}$ implies the minimization of $\|\mathbf{w}\|^2$.
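A quick numerical check of (3.36) and (3.37) can be made with the minimal sketch below. The input variances in sigma2, the weight vector w, and the shrinking scale factors are arbitrary illustrative choices, not values from the text; the sketch only evaluates the closed-form entropies of $y = \mathbf{w}^T\mathbf{x}$ and confirms that both decrease together with $\|\mathbf{w}\|$ along a fixed direction.

```python
import numpy as np

# Closed-form entropies of y = w^T x for independent Gaussian inputs
# with diagonal covariance Sigma (Eqs. 3.35-3.37).
def shannon_entropy(w, sigma2):
    var_y = np.sum(w**2 * sigma2)        # w^T Sigma w for diagonal Sigma
    return 0.5 * np.log(2 * np.pi * np.e * var_y)

def renyi2_entropy(w, sigma2):
    var_y = np.sum(w**2 * sigma2)
    return np.log(2 * np.sqrt(np.pi * var_y))

sigma2 = np.array([1.0, 2.0, 0.5])       # input variances (illustrative)
w = np.array([1.0, -0.5, 2.0])           # arbitrary weight vector (illustrative)
for scale in (1.0, 0.5, 0.1):            # shrink ||w|| along a fixed direction
    ws = scale * w
    print(f"scale={scale:4.2f}  H_S={shannon_entropy(ws, sigma2):7.3f}  "
          f"H_R2={renyi2_entropy(ws, sigma2):7.3f}")
# Both entropies decrease monotonically as ||w|| decreases.
```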
Whenever the error PDF approaches a Gaussian distribution in the final stages of the training process, Theorem 3.1 applies and we expect the minimization of $\|\mathbf{w}\|^2$ to take place. As is known from the theory of SVMs, the minimization of $\|\mathbf{w}\|^2$ is desirable since it implies a smaller Vapnik-Chervonenkis dimension, and therefore a smaller classifier complexity with better generalization [228, 43]. As a matter of fact, for Rényi's quadratic entropy a stronger assertion can be made:
Corollary 3.1. The minimization of Rényi's quadratic entropy of the error
of a linear discriminant for independent Gaussian input distributions implies
the minimization of the norm of the weights.
Proof. We have $f_{Y|t}(y) = g\left(y;\; m_t,\; \sigma^2\right)$ with $m_t = \mathbf{w}^T\boldsymbol{\mu}_t + w_0$ and $\sigma^2 = \mathbf{w}^T\Sigma\mathbf{w}$; hence

$$ f_{E|t}(e) = f_{Y|t}(t - e) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{1}{2\sigma^2}\left(t - e - m_t\right)^2\right). \qquad (3.38) $$

Therefore, for equal class priors,

$$ V_{R_2}(E) = \frac{1}{4\sqrt{\pi}\,\sigma}\left[1 + \exp\left(-\frac{\left(2 + \mathbf{w}^T(\boldsymbol{\mu}_{-1} - \boldsymbol{\mu}_1)\right)^2}{4\sigma^2}\right)\right], \qquad (3.39) $$

which is an increasing function for decreasing $\sigma$. Since minimizing $H_{R_2}(E) = -\ln V_{R_2}(E)$ amounts to maximizing $V_{R_2}(E)$, it is achieved with decreasing $\sigma = \sqrt{\mathbf{w}^T\Sigma\mathbf{w}}$, thus, as we saw in Theorem 3.1, with decreasing $\|\mathbf{w}\|^2$.
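The closed form (3.39) is also easy to probe numerically. The sketch below uses arbitrary illustrative class means, covariance, bias and weights (mu_m1, mu_p1, Sigma, w0, w are not from the text); it evaluates $V_{R_2}(E)$ both from (3.39) and by numerically integrating $f_E^2(e)$ for equal-prior classes with targets $\pm 1$, and shows that shrinking $\|\mathbf{w}\|$ increases $V_{R_2}(E)$, i.e., decreases $H_{R_2}(E) = -\ln V_{R_2}(E)$.

```python
import numpy as np

def gauss(x, m, s):
    return np.exp(-0.5 * ((x - m) / s)**2) / (np.sqrt(2 * np.pi) * s)

# Two-class setting with targets -1 and +1, equal priors, common covariance Sigma.
mu_m1 = np.array([-1.0, 0.0])     # class mean for t = -1 (illustrative)
mu_p1 = np.array([ 1.0, 0.5])     # class mean for t = +1 (illustrative)
Sigma = np.diag([1.0, 0.7])       # independent inputs -> diagonal covariance
w0 = 0.1
w = np.array([0.8, -0.3])

e = np.linspace(-40.0, 40.0, 200001)     # integration grid for f_E^2
de = e[1] - e[0]
for scale in (1.0, 0.5, 0.25):           # shrink ||w|| along a fixed direction
    ws = scale * w
    sigma = np.sqrt(ws @ Sigma @ ws)
    # Error PDF: equal-prior mixture of the class-conditional densities (Eq. 3.38).
    f_e = 0.5 * (gauss(e, -1 - (ws @ mu_m1 + w0), sigma)
                 + gauss(e,  1 - (ws @ mu_p1 + w0), sigma))
    v_numeric = np.sum(f_e**2) * de
    # Closed form of the information potential, Eq. (3.39).
    v_closed = (1 + np.exp(-(2 + ws @ (mu_m1 - mu_p1))**2 / (4 * sigma**2))) \
               / (4 * np.sqrt(np.pi) * sigma)
    print(f"scale={scale:4.2f}  V_closed={v_closed:.4f}  "
          f"V_numeric={v_numeric:.4f}  H_R2={-np.log(v_closed):.4f}")
# V_R2 grows (H_R2 falls) as ||w|| shrinks, in line with Corollary 3.1.
```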