$$
\Delta w_l^{(m-1)} = \eta \, \frac{1}{n^2 h^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \sum_{k=1}^{c} G_h(e_i - e_j)\,(e_{ik} - e_{jk}) \left( \frac{\partial e_{ik}}{\partial w_l} - \frac{\partial e_{jk}}{\partial w_l} \right), \tag{6.4}
$$

with, for a hidden-layer weight $w_l$,

$$
\frac{\partial e_{ik}}{\partial w_l} = -\,\varphi'\!\left( \sum_{l=0}^{n_h} w_{lk}\, u_{il} \right) w_{lk}\, \varphi'\!\left( \sum_{m=0}^{d} w_{ml}\, x_{im} \right) x_{im},
$$

where $\varphi$ is the activation function, $u_{il}$ the output of hidden neuron $l$ for pattern $x_i$, and $x_{im}$ the $m$-th component of $x_i$.
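To make the double-sum structure of (6.4) concrete, here is a minimal NumPy sketch of the update for a single scalar weight, assuming a multivariate Gaussian kernel $G_h$ and taking the per-pattern error Jacobians as given; the function names and the $(n, c)$ array layout are illustrative assumptions, not from the source.

```python
import numpy as np

def gaussian_kernel(u, h):
    """Unnormalized Gaussian kernel G_h on error-difference vectors (rows of u).

    The normalization constant is omitted; it can be absorbed into eta.
    """
    return np.exp(-np.sum(u**2, axis=-1) / (2.0 * h**2))

def r2ee_weight_update(errors, jacobian, eta, h):
    """Double-sum weight update of Eq. (6.4) for one scalar weight w_l.

    errors:   (n, c) array, e_ik = error of pattern i at output k
    jacobian: (n, c) array, d e_ik / d w_l
    """
    n, _ = errors.shape
    diff_e = errors[:, None, :] - errors[None, :, :]      # e_i - e_j, shape (n, n, c)
    diff_j = jacobian[:, None, :] - jacobian[None, :, :]  # d(e_i - e_j)/d w_l
    g = gaussian_kernel(diff_e, h)                        # G_h(e_i - e_j), shape (n, n)
    # triple sum over i, j, k of G_h(e_i - e_j)(e_ik - e_jk)(de_ik/dw_l - de_jk/dw_l)
    return eta / (n**2 * h**2) * np.einsum('ij,ijk,ijk->', g, diff_e, diff_j)
```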
Sometimes a so-called momentum factor, dependent on the weight differences in consecutive epochs, is added to expressions (6.3) and (6.4) with the intent of speeding up convergence (see, e.g., [212]). We will not make use of the momentum factor and will instead use other means, to be explained later, for improving convergence.
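For reference, a common formulation of such a momentum term, with momentum factor $\alpha$, is

$$
\Delta w^{(m)} = -\,\eta\, \frac{\partial R}{\partial w} + \alpha\, \Delta w^{(m-1)}, \qquad 0 \le \alpha < 1,
$$

so that a fraction of the previous epoch's weight change is carried over, damping oscillations and accelerating progress along persistent gradient directions.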
Note that the back-propagation formulas for the ZED and EXP risks are considerably simpler than the ones above, essentially because there is no double sum over the errors. In fact, one has, for example,
$$
\frac{\partial R_{\mathrm{EXP}}}{\partial w_k} = -\frac{2}{\tau} \sum_{i=1}^{n} e^{-e_i^{\mathsf{T}} e_i / \tau}\, e_{ik}\, \frac{\partial e_{ik}}{\partial w_k}, \tag{6.5}
$$

with $\partial e_{ik}/\partial w_l$ as given above,
which gives these risks a complexity equivalent to that of the MSE risk, for which
$$
\frac{\partial R_{\mathrm{MSE}}}{\partial w_k} = \frac{2}{n} \sum_{i=1}^{n} e_{ik}\, \frac{\partial e_{ik}}{\partial w_k}. \tag{6.6}
$$
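Both (6.5) and (6.6) reduce to a single pass over the $n$ training errors, i.e., they are $O(n)$ per weight, unlike the $O(n^2)$ double sum in (6.4). A sketch under the same illustrative assumptions as before (the function names and array layouts are not from the source):

```python
import numpy as np

def exp_risk_gradient(errors, jac_k, k, tau):
    """Eq. (6.5): gradient of the EXP risk w.r.t. a weight w_k of output k.

    errors: (n, c) array of errors e_i;  jac_k: (n,) array of d e_ik / d w_k.
    """
    sq_norms = np.sum(errors**2, axis=1)              # e_i^T e_i
    return -2.0 / tau * np.sum(np.exp(-sq_norms / tau) * errors[:, k] * jac_k)

def mse_gradient(errors, jac_k, k):
    """Eq. (6.6): gradient of the MSE risk w.r.t. a weight w_k of output k."""
    n = errors.shape[0]
    return 2.0 / n * np.sum(errors[:, k] * jac_k)
```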
We now present an example from [198] of an MLP using Rényi's quadratic entropy, trained with the back-propagation algorithm to discriminate a 4-class dataset. The example illustrates the convergence towards Dirac-δ error densities (see Sect. 3.1.1). In this example and throughout the present section we only use one-hidden-layer MLP architectures, denoted $[d : n_h : c]$, with $n_h$ the number of hidden neurons. A 1-of-$c$ coding scheme of the outputs is assumed.
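To fix ideas, the following minimal NumPy sketch sets up such a $[d : n_h : c]$ architecture with tanh activations and 1-of-$c$ coded targets. It is an illustration only: the $\{-1, 1\}$ target convention, the bias handling, and all names are assumptions, not details taken from [198].

```python
import numpy as np

rng = np.random.default_rng(0)

d, n_h, c = 2, 2, 4                              # a [2:2:4] architecture

# initial random weights in [-0.1, 0.1]; the extra row holds the bias weights
W1 = rng.uniform(-0.1, 0.1, size=(d + 1, n_h))   # input  -> hidden
W2 = rng.uniform(-0.1, 0.1, size=(n_h + 1, c))   # hidden -> output

def forward(X):
    """Forward pass of a one-hidden-layer MLP with tanh activations."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])   # append bias input
    U = np.tanh(Xb @ W1)                            # hidden outputs u_il
    Ub = np.hstack([U, np.ones((U.shape[0], 1))])
    return np.tanh(Ub @ W2)                         # network outputs

def one_of_c(labels, c):
    """1-of-c target coding with {-1, 1} entries (one plausible convention)."""
    T = -np.ones((labels.size, c))
    T[np.arange(labels.size), labels] = 1.0
    return T

X = rng.normal(size=(100, d))                    # placeholder data
T = one_of_c(rng.integers(0, c, size=100), c)    # placeholder labels
E = T - forward(X)                               # (100, c) error matrix
```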
Example 6.1.
Consider the two-dimensional artificial dataset shown in Fig. 6.2
consisting of 200 data instances in four separable classes, with 52, 54, 42 and 52 instances, respectively.
MLPs with one hidden layer, tanh activation function, and initial random weights in $[-0.1, 0.1]$ were trained using the R2EE risk functional. Only half of the dataset (a total of 100 instances, with approximately 25 instances per class) was used in the training process.
Figures 6.3, 6.4 and 6.5 show, for one experiment with $n_h = 2$, error graphs corresponding to training epochs 1, 10 and 40, respectively. Since we have a neural network with four outputs, the error vectors $e_k$, $k = 1, \ldots, 4$, form a $100 \times 4$ matrix. Each figure shows a $4 \times 4$ array whose off-diagonal cells are the $(e_i, e_k)$ scatter plots of the column-class $k$ error values (in $[-2, 2]$) versus the row-class $i$ error values. The diagonal cells contain the histograms of each column-class error vector $e_k$.
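Such a $4 \times 4$ error-graph array is easy to reproduce from any $(n, 4)$ error matrix, e.g. the `E` computed in the sketch above. The following matplotlib fragment is illustrative only (binning and styling are assumptions, not the authors' plotting code):

```python
import matplotlib.pyplot as plt

def error_graphs(E):
    """Scatter plots of (e_i, e_k) off the diagonal, histograms on it."""
    c = E.shape[1]
    fig, axes = plt.subplots(c, c, figsize=(8, 8))
    for i in range(c):
        for k in range(c):
            ax = axes[i, k]
            if i == k:
                ax.hist(E[:, k], bins=20)          # histogram of class-k errors
            else:
                ax.scatter(E[:, k], E[:, i], s=5)  # column-class k vs row-class i
                ax.set_xlim(-2, 2)
                ax.set_ylim(-2, 2)
    plt.show()
```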
Analyzing these graphs one can see that the errors converge to Dirac-δ distributions, moreover with uncorrelated errors for the four classes.