The asymptotic and generalization behaviors of Example 3.6 can be confirmed for other Gaussian datasets with equal and unequal covariances. There is a theoretical justification for the good generalization of the MEE linear discriminant with independent Gaussian inputs, based on the following theorem.

Theorem 3.1. The minimization of Shannon's or Rényi's quadratic entropy of a weighted sum of $d$ independent Gaussian distributions implies the minimization of the norm of the weights.
Proof. The weighted sum of $d$ independent Gaussian distributions, $y = w^T x$, has the PDF

f(y) = g(y;\, w^T \mu,\, w^T \Sigma w),   (3.35)

with $\Sigma$ a diagonal matrix of the variances, since the distributions are independent. But:
H_S(Y) = \ln \sqrt{2\pi e\, w^T \Sigma w};   (3.36)

H_{R_2}(Y) = \ln \left( 2 \sqrt{\pi\, w^T \Sigma w} \right).   (3.37)
The quadratic form $w^T \Sigma w$ can be written as $\sum_{i=1}^{d} w_i^2 \sigma_i^2$; therefore, the minimization of either $H_S$ or $H_{R_2}$ implies the minimization of $\|w\|^2$.
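The theorem is easy to check numerically. The following minimal sketch (the weights, means, and variances are arbitrary values chosen for illustration) samples $y = w^T x$ for independent Gaussian inputs, verifies that the sample variance of $y$ matches the quadratic form $w^T \Sigma w = \sum_i w_i^2 \sigma_i^2$, and evaluates the closed-form entropies (3.36) and (3.37), which depend on $w$ only through that quadratic form:

import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -0.5, 2.0])      # arbitrary input means (illustration only)
sigma = np.array([0.8, 1.5, 0.3])    # arbitrary input standard deviations
w = np.array([0.4, -0.7, 1.1])       # arbitrary weight vector

# Sample y = w^T x with independent inputs x_i ~ N(mu_i, sigma_i^2).
x = rng.normal(mu, sigma, size=(1_000_000, 3))
y = x @ w

# The variance of y equals the quadratic form w^T Sigma w = sum_i w_i^2 sigma_i^2.
q = np.sum(w**2 * sigma**2)
print(y.var(), q)  # the two values should agree closely

# Closed-form entropies (3.36) and (3.37): both are increasing in q,
# so minimizing either entropy drives the variance-weighted norm of w down.
H_S = np.log(np.sqrt(2 * np.pi * np.e * q))
H_R2 = np.log(2 * np.sqrt(np.pi * q))
print(H_S, H_R2)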
Whenever the error PDF approaches a Gaussian distribution in the final stages of the training process, Theorem 3.1 applies and we expect the minimization of $\|w\|^2$ to take place. As is known from the theory of SVMs, the minimization of $\|w\|^2$ is desirable since it implies a smaller Vapnik-Chervonenkis dimension, and therefore a smaller classifier complexity with better generalization [228, 43]. As a matter of fact, for Rényi's quadratic entropy a stronger assertion can be made:
Corollary 3.1.
The minimization of Rényi's quadratic entropy of the error
of a linear discriminant for independent Gaussian input distributions implies
the minimization of the norm of the weights.
Proof. We have:

f_{Y|t}(y) = g(y;\, m_t,\, \sigma), \quad \text{with } m_t = w^T \mu_t + w_0,\ \ \sigma^2 = w^T \Sigma w;

f_{E|t}(e) = f_{Y|t}(t - e) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{1}{2\sigma^2} (t - e - m_t)^2 \right).   (3.38)
Therefore, for targets $t = \pm 1$ and equal class priors, the error density is the mixture $f_E = \frac{1}{2}(f_{E|-1} + f_{E|1})$, and the Gaussian overlap integral $\int g(e; a, \sigma)\, g(e; b, \sigma)\, de = \frac{1}{2\sqrt{\pi}\,\sigma} \exp\left( -\frac{(a-b)^2}{4\sigma^2} \right)$ yields the information potential $V_{R_2}(E) = \int f_E^2(e)\, de$:

V_{R_2}(E) = \frac{1}{4\sqrt{\pi}\,\sigma} \left( 1 + \exp\left( -\frac{\left( 2 + w^T (\mu_{-1} - \mu_1) \right)^2}{4\sigma^2} \right) \right),   (3.39)

which is an increasing function of decreasing $\sigma$. Since $H_{R_2}(E) = -\ln V_{R_2}(E)$, minimizing the entropy maximizes $V_{R_2}(E)$ and therefore decreases $\sigma = \sqrt{w^T \Sigma w}$, thus, as we saw in Theorem 3.1, decreasing $\|w\|^2$.
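The behavior of (3.39) can be confirmed numerically. A minimal sketch (the mean-separation value $w^T(\mu_{-1} - \mu_1)$ below is an arbitrary assumption for illustration) evaluates $V_{R_2}(E)$ over shrinking $\sigma$ and shows that the potential grows, i.e., $H_{R_2}(E) = -\ln V_{R_2}(E)$ shrinks:

import numpy as np

def v_r2(sigma, delta):
    # Information potential (3.39); delta stands for w^T (mu_{-1} - mu_1).
    return (1 + np.exp(-(2 + delta) ** 2 / (4 * sigma ** 2))) / (4 * np.sqrt(np.pi) * sigma)

for sigma in [2.0, 1.0, 0.5, 0.25]:
    v = v_r2(sigma, delta=-0.8)  # arbitrary mean separation (assumption)
    print(sigma, v, -np.log(v))  # V_R2 increases and H_R2 decreases as sigma shrinks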