Let us start by considering the $k$th output perceptron with its weights being adjusted by gradient descent. When using the empirical Shannon's entropy of the error, $H_S$, we then apply expression (3.8), which we rewrite below for the $k$th output perceptron in vector notation:
$$
\frac{\partial H_S}{\partial \mathbf{w}_k} = \frac{1}{n^2 h^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \frac{G_h(\mathbf{e}_i - \mathbf{e}_j)}{f(\mathbf{e}_i)} \,(e_{ik} - e_{jk}) \left( \frac{\partial e_{ik}}{\partial \mathbf{w}_k} - \frac{\partial e_{jk}}{\partial \mathbf{w}_k} \right). \tag{6.1}
$$
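For concreteness, the following is a minimal NumPy sketch of the double sum in (6.1). It is an illustration under assumed conventions, not the book's code: a normalized multivariate Gaussian kernel $G_h$, the Parzen window estimate $f(\mathbf{e}_i) = \frac{1}{n}\sum_j G_h(\mathbf{e}_i - \mathbf{e}_j)$, and illustrative array names (`E`, `dE_dwk`) of our own choosing.

```python
import numpy as np

def gauss_kernel(u, h):
    """Multivariate Gaussian kernel G_h of bandwidth h (assumed normalized form)."""
    d = u.shape[-1]
    return np.exp(-np.sum(u**2, axis=-1) / (2 * h**2)) / ((2 * np.pi)**(d / 2) * h**d)

def grad_HS_wk(E, dE_dwk, h, k):
    """Double sum (6.1): gradient of the empirical Shannon entropy H_S
    w.r.t. the weight vector w_k of the k-th output perceptron.

    E      : (n, n_c) array; rows are the error vectors e_i
    dE_dwk : (n, p) array; rows are the derivatives de_ik/dw_k (p = len(w_k))
    k      : zero-based index of the output, i.e., of error component e_ik
    """
    n = E.shape[0]
    # Parzen window estimate f(e_i) = (1/n) sum_j G_h(e_i - e_j)
    f = np.array([gauss_kernel(E[i] - E, h).mean() for i in range(n)])
    grad = np.zeros(dE_dwk.shape[1])
    for i in range(n):
        for j in range(n):
            grad += (gauss_kernel(E[i] - E[j], h) / f[i]) \
                    * (E[i, k] - E[j, k]) * (dE_dwk[i] - dE_dwk[j])
    return grad / (n**2 * h**2)
```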
Whereas expression (3.8) contemplated the adjustment of a single weight, we now formulate the adjustment with respect to a whole vector of weights (including biases): the weight vector $\mathbf{w}_k$ of an arbitrary $k$th output perceptron. The derivative of $H_S$ with respect to the weights depends on the $n_c$-dimensional error vectors denoted $\mathbf{e}_i$ and $\mathbf{e}_j$. Each component $\partial H_S/\partial w_{lk}$ of the vector $\partial H_S/\partial \mathbf{w}_k$ in (6.1) can be conveniently expressed (namely, for implementation purposes) as the sum of all elements of the matrix resulting from:
$$
\frac{1}{n^2 h^2}
\begin{bmatrix}
\frac{1}{f(\mathbf{e}_1)} & \cdots & \frac{1}{f(\mathbf{e}_1)} \\
\vdots & & \vdots \\
\frac{1}{f(\mathbf{e}_n)} & \cdots & \frac{1}{f(\mathbf{e}_n)}
\end{bmatrix}
\mathbin{.\times}
\begin{bmatrix}
G_h(\mathbf{e}_1 - \mathbf{e}_1) & \cdots & G_h(\mathbf{e}_1 - \mathbf{e}_n) \\
\vdots & & \vdots \\
G_h(\mathbf{e}_n - \mathbf{e}_1) & \cdots & G_h(\mathbf{e}_n - \mathbf{e}_n)
\end{bmatrix}
\mathbin{.\times}
\begin{bmatrix}
\frac{\partial e_{1k}}{\partial w_{lk}} - \frac{\partial e_{1k}}{\partial w_{lk}} & \cdots & \frac{\partial e_{1k}}{\partial w_{lk}} - \frac{\partial e_{nk}}{\partial w_{lk}} \\
\vdots & & \vdots \\
\frac{\partial e_{nk}}{\partial w_{lk}} - \frac{\partial e_{1k}}{\partial w_{lk}} & \cdots & \frac{\partial e_{nk}}{\partial w_{lk}} - \frac{\partial e_{nk}}{\partial w_{lk}}
\end{bmatrix}
\mathbin{.\times}
\begin{bmatrix}
e_{1k} - e_{1k} & \cdots & e_{1k} - e_{nk} \\
\vdots & & \vdots \\
e_{nk} - e_{1k} & \cdots & e_{nk} - e_{nk}
\end{bmatrix}
\tag{6.2}
$$
where '$.\times$' denotes the element-wise product [212]. The first matrix is not present when Rényi's quadratic entropy or the information potential is used (see also expression (3.9)).
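In code, the component $\partial H_S/\partial w_{lk}$ is then just the sum of all entries of the element-wise product of the four $n \times n$ matrices in (6.2). A sketch under the same assumptions as before (normalized Gaussian kernel, precomputed Parzen estimates `f`; array names illustrative) could read:

```python
import numpy as np

def grad_HS_component(E, de_dwlk, f, h, k):
    """dH_S/dw_lk as the sum of all elements of the matrix product (6.2).

    E       : (n, n_c) error vectors e_i
    de_dwlk : (n,) derivatives de_ik/dw_lk for the single weight w_lk
    f       : (n,) Parzen window estimates f(e_i)
    """
    n, n_c = E.shape
    diff = E[:, None, :] - E[None, :, :]          # e_i - e_j, shape (n, n, n_c)
    G = np.exp(-np.sum(diff**2, axis=-1) / (2 * h**2)) \
        / ((2 * np.pi)**(n_c / 2) * h**n_c)       # matrix of G_h(e_i - e_j)
    F = np.tile((1.0 / f)[:, None], (1, n))       # rows of 1/f(e_i)
    D = de_dwlk[:, None] - de_dwlk[None, :]       # de_ik/dw_lk - de_jk/dw_lk
    ek = E[:, k]
    C = ek[:, None] - ek[None, :]                 # e_ik - e_jk
    return (F * G * D * C).sum() / (n**2 * h**2)
```

Dropping the first matrix `F` gives the Rényi's quadratic entropy / information potential variant mentioned above.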
Once all $n$ error vectors (for the $n$ input vectors $\mathbf{x}_i$) relative to the $m$th training epoch have been obtained, one is then able to compute the updated weights for the output perceptron:
$$
\mathbf{w}_k^{(m)} = \mathbf{w}_k^{(m-1)} - \Delta \mathbf{w}_k^{(m-1)}, \quad \text{with} \quad \Delta \mathbf{w}_k^{(m-1)} = \eta \left. \frac{\partial H_S}{\partial \mathbf{w}_k} \right|_{\mathbf{w}^{(m-1)}}, \tag{6.3}
$$

where $\eta$ is the learning rate.
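Assuming the `grad_HS_wk` sketch given after (6.1), the epoch update (6.3) reduces to a single gradient step; the function name and signature below are our own:

```python
def update_wk(w_k, E, dE_dwk, h, k, eta):
    """Epoch update (6.3): w_k(m) = w_k(m-1) - eta * dH_S/dw_k."""
    return w_k - eta * grad_HS_wk(E, dE_dwk, h, k)
```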
The updating of the weight vector $\mathbf{w}_l$, relative to an arbitrary $l$th perceptron of the hidden layer, is done as usual with the back-propagation algorithm. One needs all back-propagated errors from the output layer (incident dotted arrows in Fig. 6.1). Denoting by $\varphi(\cdot)$ the activation function, assumed the same for all perceptrons, the updating vector for $\mathbf{w}_l$ at the $m$th training epoch is then: