and
$$
f_{E|t}(e) = f_{Y|t}(t - e) = g\!\left(e;\; t - \mathbf{w}^T \boldsymbol{\mu}_{X|t} - w_0,\; \sqrt{\mathbf{w}^T \boldsymbol{\Sigma}_{X|t}\, \mathbf{w}}\right). \tag{3.31}
$$
We now proceed to compute the information potential as in Example 3.1:
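$$
\int_{-\infty}^{+\infty} f_{E|t}^2(e)\, de = g\!\left(0;\, 0,\, \sqrt{2}\,\sigma_{Y|t}\right)
\quad \text{and} \quad
\int_{-\infty}^{+\infty} f_{E|-1}(e)\, f_{E|1}(e)\, de = g\!\left(0;\, 2 - d,\, \sqrt{2}\,\sigma_m\right), \tag{3.32}
$$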
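with $\sigma_{Y|t} = \sqrt{\mathbf{w}^T \boldsymbol{\Sigma}_{X|t}\, \mathbf{w}}$, $\sigma_m = \sqrt{\left(\sigma_{Y|-1}^2 + \sigma_{Y|1}^2\right)/2}$, and $d = \mathbf{w}^T(\boldsymbol{\mu}_{X|1} - \boldsymbol{\mu}_{X|-1})$.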
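Both integrals in (3.32) follow from the standard identity for the integral of a product of two Gaussian densities. The sketch below assumes, consistently with (3.32), that the third argument of $g$ is a standard deviation; the symbols $\mu_1, \mu_2, \sigma_1, \sigma_2$ are generic placeholders introduced only for this derivation.

$$
\int_{-\infty}^{+\infty} g(e; \mu_1, \sigma_1)\, g(e; \mu_2, \sigma_2)\, de
= g\!\left(0;\, \mu_1 - \mu_2,\, \sqrt{\sigma_1^2 + \sigma_2^2}\right)
= \frac{1}{\sqrt{2\pi(\sigma_1^2 + \sigma_2^2)}} \exp\!\left(-\frac{(\mu_1 - \mu_2)^2}{2(\sigma_1^2 + \sigma_2^2)}\right).
$$

Taking $\mu_1 = \mu_2$ and $\sigma_1 = \sigma_2 = \sigma_{Y|t}$ gives $g(0; 0, \sqrt{2}\,\sigma_{Y|t}) = 1/(2\sqrt{\pi}\,\sigma_{Y|t})$. For the cross term, the means from (3.31) are $-1 - \mathbf{w}^T\boldsymbol{\mu}_{X|-1} - w_0$ and $1 - \mathbf{w}^T\boldsymbol{\mu}_{X|1} - w_0$; their difference is $\pm(2 - d)$, the $w_0$ terms cancel, and the combined standard deviation is $\sqrt{\sigma_{Y|-1}^2 + \sigma_{Y|1}^2}$.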
Hence, for equal priors:
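$$
V_{R_2} \equiv V_{R_2}(d, \sigma_{Y|-1}, \sigma_{Y|1}) = \frac{1}{8\sqrt{\pi}} \left[ \frac{1}{\sigma_{Y|-1}} + \frac{1}{\sigma_{Y|1}} + \frac{2}{\sigma_m} \exp\!\left(-\frac{(2 - d)^2}{4\sigma_m^2}\right) \right]. \tag{3.33}
$$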
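It is clear that Rényi's quadratic entropy doesn't depend on $w_0$. This is a direct consequence of the invariance of entropy to translations, since from (3.31) we observe that, for equal priors, $E[E] = -\tfrac{1}{2}\mathbf{w}^T(\boldsymbol{\mu}_{X|1} + \boldsymbol{\mu}_{X|-1}) - w_0$, so that $w_0$ only shifts the error distribution. Shannon's entropy and $\alpha$-order Rényi's entropies are also insensitive to the constant $w_0$ term.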
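The closed form (3.33) and its independence from $w_0$ can be checked numerically. The following is a minimal sketch, not the authors' code: it assumes that $\sigma_m$ is the root mean square of the two class standard deviations (the value that makes (3.32) consistent with the Gaussian product integral), and the function names, the weight vector `w`, and the two `w0` values are hypothetical choices made for illustration.

```python
import numpy as np

def gauss(e, mean, std):
    """Univariate Gaussian PDF g(e; mean, std); the third argument is a standard deviation."""
    return np.exp(-0.5 * ((e - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def v_r2_closed_form(d, s_neg, s_pos):
    """Information potential of (3.33) for equal priors.

    Assumption: sigma_m = sqrt((s_neg^2 + s_pos^2) / 2).
    """
    s_m = np.sqrt(0.5 * (s_neg ** 2 + s_pos ** 2))
    cross = (2.0 / s_m) * np.exp(-((2.0 - d) ** 2) / (4.0 * s_m ** 2))
    return (1.0 / s_neg + 1.0 / s_pos + cross) / (8.0 * np.sqrt(np.pi))

def v_r2_numeric(w, w0, mu_neg, mu_pos, cov_neg, cov_pos, n_grid=200001):
    """Riemann-sum approximation of the integral of f_E(e)^2, with f_E built from (3.31)."""
    s_neg = np.sqrt(w @ cov_neg @ w)
    s_pos = np.sqrt(w @ cov_pos @ w)
    m_neg = -1.0 - w @ mu_neg - w0   # mean of the error for class t = -1
    m_pos = 1.0 - w @ mu_pos - w0    # mean of the error for class t = +1
    lo = min(m_neg - 10.0 * s_neg, m_pos - 10.0 * s_pos)
    hi = max(m_neg + 10.0 * s_neg, m_pos + 10.0 * s_pos)
    e = np.linspace(lo, hi, n_grid)
    f = 0.5 * gauss(e, m_neg, s_neg) + 0.5 * gauss(e, m_pos, s_pos)  # equal priors
    return np.sum(f ** 2) * (e[1] - e[0])

# Hypothetical setting: class means [0 0] and [2 0], identity covariances.
mu_neg, mu_pos = np.array([0.0, 0.0]), np.array([2.0, 0.0])
cov = np.eye(2)
w = np.array([0.5, 0.5])             # arbitrary weight vector

d = w @ (mu_pos - mu_neg)
s = np.sqrt(w @ cov @ w)
closed = v_r2_closed_form(d, s, s)
numeric_a = v_r2_numeric(w, 0.0, mu_neg, mu_pos, cov, cov)
numeric_b = v_r2_numeric(w, 3.7, mu_neg, mu_pos, cov, cov)  # different bias term

print("closed form:", closed)
print("numeric, w0 = 0.0:", numeric_a)
print("numeric, w0 = 3.7:", numeric_b)
print("Renyi quadratic entropy:", -np.log(closed))
```

The closed form never receives `w0`, while the numerical integral does; the two numerical values agree because changing `w0` only translates the error density along the integration axis.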
Things are different, however, when a linear classifier is trained with gradient descent using empirical entropies. Off the convergent solution, the $e_i - e_j$ deviations in formula (3.3) are scattered, and the estimate $\hat{f}(e)$ doesn't usually reproduce well a sum of Gaussians with the above mean value. As a consequence, the bias term of the solution will undergo adjustments. Near the convergent solution, with the $e_i - e_j$ deviations crowding a small interval, the $\hat{f}(e)$ estimate then provides a close approximation of the theoretical error PDF and the insensitivity to bias adjustments plays its role.
This empirical MEE behavior is illustrated in the following bivariate two-
class example, where Shannon's entropy gradient descent is used.
Example 3.3.
Consider two normally distributed class-conditional PDFs, $g(\mathbf{x}; \boldsymbol{\mu}_t, \boldsymbol{\Sigma}_t)$, with $\boldsymbol{\mu}_{-1} = [0\ 0]^T$, $\boldsymbol{\mu}_1 = [2\ 0]^T$, $\boldsymbol{\Sigma}_{-1} = \boldsymbol{\Sigma}_1 = \mathbf{I}$. Independent training and test datasets with $n = 250$ instances (125 instances per class) were generated and the Shannon MEE algorithm applied with $h = 1$ and $\eta = 0.001$ (see footnote 2). Note that according to formula (E.19) the optimal bandwidth for the number of instances being used is $h_{IMSE} = 0.4$. We are, therefore, using fat estimation of the error PDF.
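The data setting of this example, and the effect of choosing $h = 1$ rather than $h = 0.4$, can be sketched as follows. This is only an illustration under stated assumptions: the random seed, the helper names `make_dataset` and `parzen_pdf`, and the weights `w`, `w0` are hypothetical, the weights are not the MEE solution, and the Shannon MEE training loop itself is not reproduced here. A Gaussian kernel is assumed for the Parzen window estimate.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, chosen only for reproducibility

def make_dataset(rng, n_per_class=125):
    """Draw n_per_class points per class from N([0 0], I) and N([2 0], I)."""
    x_neg = rng.multivariate_normal([0.0, 0.0], np.eye(2), size=n_per_class)
    x_pos = rng.multivariate_normal([2.0, 0.0], np.eye(2), size=n_per_class)
    x = np.vstack([x_neg, x_pos])
    t = np.concatenate([-np.ones(n_per_class), np.ones(n_per_class)])
    return x, t

def parzen_pdf(points, samples, h):
    """Parzen window estimate with a Gaussian kernel of bandwidth h."""
    z = (points[:, None] - samples[None, :]) / h
    return np.exp(-0.5 * z ** 2).sum(axis=1) / (len(samples) * h * np.sqrt(2.0 * np.pi))

# Independent training and test sets, n = 250 each (125 per class).
x_train, t_train = make_dataset(rng)
x_test, t_test = make_dataset(rng)

# Hypothetical linear discriminant (NOT the result of MEE training).
w, w0 = np.array([1.0, 0.0]), -1.0
errors = t_train - (x_train @ w + w0)   # e = t - y

grid = np.linspace(-6.0, 6.0, 601)
step = grid[1] - grid[0]
pdf_fat = parzen_pdf(grid, errors, h=1.0)    # bandwidth used in the example
pdf_imse = parzen_pdf(grid, errors, h=0.4)   # bandwidth quoted as h_IMSE

print("peak of estimate, h = 1.0:", pdf_fat.max())
print("peak of estimate, h = 0.4:", pdf_imse.max())
```

With $h = 1$ the estimate is smoother and flatter than with $h = 0.4$, which is what "fat estimation" refers to: each kernel is wider than the IMSE-optimal one, so fine structure of the error sample is averaged out.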
2 From now on the indicated $\eta$ values are initial values of an adaptive rule to be described in Sect. 6.1.1.