The gradient-based learning algorithm of LSTM presented in [103] can be modified such that the empirical MMSE risk functional is replaced by $H_{R_2}$.
The change in the derivation presented in [103] occurs in the following expression (the backpropagation error seen at the output neuron $k$):
$$E_k = f'_k(\mathrm{net}_k)(T_k - Y_k), \qquad (6.25)$$
where $f(\cdot)$ is the sigmoid transfer function, $\mathrm{net}_k$ is the activation of the output neuron $k$ at time $\tau$, $T_k$ is the target variable for the output neuron $k$ at time $\tau$ and $Y_k$ is the output of neuron $k$ at time $\tau$.
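For concreteness, (6.25) can be evaluated directly. The following minimal sketch assumes a sigmoid output layer, so that $f'_k(\mathrm{net}_k) = Y_k(1 - Y_k)$; the values are illustrative, not taken from [103]:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Backpropagation error at the output neurons, eq. (6.25):
# E_k = f'_k(net_k) (T_k - Y_k); for a sigmoid f, f' = f (1 - f).
net = np.array([0.3, -1.2, 0.7])   # activations net_k (illustrative)
T = np.array([1.0, 0.0, 1.0])      # targets T_k (illustrative)
Y = sigmoid(net)                   # outputs Y_k
E = Y * (1.0 - Y) * (T - Y)        # eq. (6.25) for every k at once
print(E)
```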
The term $(T_k - Y_k)$ in equation (6.25) comes from the derivative of the MSE, $\frac{1}{n}\sum_{i=1}^{n}(t_i - y_i)^2$, w.r.t. the output $y_k$. This same derivative is computed now, using expression (6.24). Note that, since the logarithm in (6.24) is a monotonically increasing function, to minimize it is the same as to minimize its operand. So, the partial derivative of the operand will be derived, which is
$$\frac{\partial}{\partial y_k}\left[\frac{1}{n^2 h\sqrt{2\pi}}\sum_{i=1}^{n}\sum_{j=1}^{n}\exp\left(-\frac{(e_i - e_j)^2}{2h^2}\right)\right] = \frac{\partial}{\partial y_k}\left[\frac{1}{n^2 h\sqrt{2\pi}}\sum_{i=1}^{n}\sum_{j=1}^{n}\exp\left(-\frac{(t_i - y_i - t_j + y_j)^2}{2h^2}\right)\right]. \qquad (6.26)$$
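The operand being differentiated in (6.26) can be coded directly. The sketch below assumes the operand of (6.24) has exactly the bracketed Gaussian-kernel form shown above, with illustrative data:

```python
import numpy as np

def operand(y, t, h):
    # Operand of (6.24), as it appears inside the brackets of (6.26):
    # (1 / (n^2 h sqrt(2 pi))) sum_i sum_j exp(-(e_i - e_j)^2 / (2 h^2)),
    # with errors e_i = t_i - y_i.
    n = len(y)
    e = t - y
    d = e[:, None] - e[None, :]        # all pairwise differences e_i - e_j
    return np.exp(-d**2 / (2.0 * h**2)).sum() / (n**2 * h * np.sqrt(2.0 * np.pi))

t = np.array([1.0, 0.0, 1.0, 0.0])     # targets (illustrative)
y = np.array([0.8, 0.3, 0.6, 0.1])     # outputs (illustrative)
print(operand(y, t, h=0.5))
```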
Now, when $i = k$ the derivative of the term inside the summation becomes
$$\exp\left(-\frac{(t_k - y_k - t_j + y_j)^2}{2h^2}\right)\left(-\frac{1}{2h^2}\right)2(t_k - y_k - t_j + y_j)(-1). \qquad (6.27)$$
Likewise, if $j = k$, the derivative becomes
$$\exp\left(-\frac{(t_i - y_i - t_k + y_k)^2}{2h^2}\right)\left(-\frac{1}{2h^2}\right)2(t_i - y_i - t_k + y_k). \qquad (6.28)$$
Expressions (6.27) and (6.28) yield the same values, allowing the derivative of the operand of (6.24) to be written as
$$-Q\sum_{i=1}^{n}\exp\left(-\frac{(t_i - y_i - t_k + y_k)^2}{2h^2}\right)(t_i - y_i - t_k + y_k), \qquad (6.29)$$
where
$$Q = \frac{2}{n^2 h^3\sqrt{2\pi}}.$$
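As a quick numerical sanity check of (6.29), the analytic derivative can be compared against a central finite difference of the operand. This sketch reuses the `operand` helper and the illustrative arrays `t`, `y` from the sketch after (6.26):

```python
def d_operand_dy(y, t, h, k):
    # Eq. (6.29): -Q sum_i exp(-a_ik^2 / (2 h^2)) a_ik,
    # with a_ik = t_i - y_i - t_k + y_k and Q = 2 / (n^2 h^3 sqrt(2 pi)).
    n = len(y)
    Q = 2.0 / (n**2 * h**3 * np.sqrt(2.0 * np.pi))
    a = (t - y) - (t[k] - y[k])
    return -Q * np.sum(np.exp(-a**2 / (2.0 * h**2)) * a)

k, h, eps = 2, 0.5, 1e-6
yp, ym = y.copy(), y.copy()
yp[k] += eps
ym[k] -= eps
print(d_operand_dy(y, t, h, k))                             # analytic, eq. (6.29)
print((operand(yp, t, h) - operand(ym, t, h)) / (2 * eps))  # numerical check
```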
So expression (6.25) becomes
$$E_k = Q f'_k(\mathrm{net}_k)\sum_{i=1}^{n}\exp\left(-\frac{a_{ik}^2}{2h^2}\right)a_{ik}, \qquad (6.30)$$
with $a_{ik} = t_i - y_i - t_k + y_k$.
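Finally, (6.30) vectorizes naturally over the output neurons. The self-contained sketch below again assumes sigmoid outputs, so that $f'_k(\mathrm{net}_k) = Y_k(1 - Y_k)$; all values are illustrative:

```python
import numpy as np

def entropy_error(net, T, h):
    # Eq. (6.30): E_k = Q f'_k(net_k) sum_i exp(-a_ik^2 / (2 h^2)) a_ik,
    # with a_ik = t_i - y_i - t_k + y_k, computed for all k at once.
    n = len(net)
    Y = 1.0 / (1.0 + np.exp(-net))                 # sigmoid outputs Y_k
    Q = 2.0 / (n**2 * h**3 * np.sqrt(2.0 * np.pi))
    e = T - Y
    a = e[:, None] - e[None, :]                    # a[i, k] = e_i - e_k
    s = (np.exp(-a**2 / (2.0 * h**2)) * a).sum(axis=0)
    return Q * Y * (1.0 - Y) * s                   # f'_k = Y_k (1 - Y_k) for a sigmoid

print(entropy_error(np.array([0.3, -1.2, 0.7]), np.array([1.0, 0.0, 1.0]), h=0.5))
```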