$\pi_{jkl}^{(i,0)} = 0$. Parallel to the update of the unit's activities, the sensitivities are updated as well:

$$\pi_{jkl}^{(i,t+1)} = \frac{\mathrm{d}f_j}{\mathrm{d}\,\mathrm{net}_j^{(i,t)}} \left[ \sum_m w_{jm}^{(i,t)} \cdot \pi_{mkl}^{(i,t)} + \delta_{kj}\, o_l^{(i,t)} \right], \tag{6.16}$$
where $\delta_{kj}$ denotes the Kronecker delta function. Gradient descent updates on the weights are then achieved by the learning rule:

$$\Delta w_{kl}^{(i,t)} = \eta \sum_j e_j^{(i,t)} \cdot \pi_{jkl}^{(i,t)}. \tag{6.17}$$
RTRL does not need to go back in time. Hence, it can be applied in an online fashion. However, it is computationally more expensive than BPTT since $O(n^4)$ operations are needed per time step, with $n$ representing the number of units in the network. If the network is fully connected, this corresponds to $O(n^2)$ operations for each weight per time step. It also needs $O(n^3)$ memory cells to store the sensitivities $\pi_{jkl}$.
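To make the two updates concrete, the following is a minimal NumPy sketch of RTRL for a small fully connected network. The network size, the logistic transfer function, the error definition $e_j = t_j - o_j$, and all identifiers are illustrative assumptions, not notation from the text:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
n = 5                      # number of units (hypothetical size)
eta = 0.1                  # learning rate
W = rng.normal(0.0, 0.1, size=(n, n))   # small initial weights

o = np.zeros(n)            # unit outputs; no external inputs, for simplicity
pi = np.zeros((n, n, n))   # sensitivities pi[j, k, l] = d o_j / d w_kl,
                           # initialized to zero as in the text

def rtrl_step(W, o, pi, target, eta):
    """One forward step plus the RTRL updates (6.16) and (6.17)."""
    net = W @ o                        # net inputs of all units
    o_new = sigmoid(net)               # new activities
    fprime = o_new * (1.0 - o_new)     # df_j/dnet_j for the logistic function

    # Eq. (6.16): pi'[j,k,l] = f'_j * ( sum_m w_jm pi[m,k,l] + delta_kj o_l )
    recurrent = np.tensordot(W, pi, axes=1)          # sum_m w_jm pi[m,k,l]
    direct = np.einsum('jk,l->jkl', np.eye(n), o)    # delta_kj * o_l
    pi_new = fprime[:, None, None] * (recurrent + direct)

    # Eq. (6.17): dW[k,l] = eta * sum_j e_j * pi[j,k,l]
    e = target - o_new                               # output errors
    W_new = W + eta * np.einsum('j,jkl->kl', e, pi_new)
    return W_new, o_new, pi_new

# drive the network toward an arbitrary constant target for a few steps
target = rng.uniform(size=n)
for t in range(10):
    W, o, pi = rtrl_step(W, o, pi, target, eta)
```

The contraction implementing the sum over $m$ runs over all $n^4$ index combinations $(j, k, l, m)$, which is where the $O(n^4)$ per-step cost quoted above comes from; the array `pi` holds the $O(n^3)$ sensitivities.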
6.3.3 Difficulty of Learning Long-Term Dependencies
Although the above algorithms for training RNNs have been known for as long as the backpropagation algorithm for FFNNs, RNNs are used less often for real-world applications than FFNNs. One of the reasons might be that training RNNs is difficult.
Since RNNs are nonlinear dynamical systems, they can expand or contract the gradient flow. If the magnitude of a loop's gain is larger than one for multiple consecutive time steps, the gradient will explode exponentially. In contrast, if the magnitude of a loop's gain is smaller than one, the gradient will decay exponentially.
The gain of a loop in an RNN depends on the magnitudes of the weights involved and on the derivatives of the transfer functions. Since the networks are frequently initialized with small weights and use sigmoidal transfer functions with small derivatives, most of the gradients decay over time.
This affects the learning of long-term dependencies, where in long sequences early inputs determine late desired outputs. Bengio et al. [28] showed that this gradient decay is the reason why gradient-based learning algorithms face an increasingly difficult problem as the duration of the dependencies to be captured increases. They showed that either it is impossible to store long-term memories or the gradient vanishes. Learning with vanishing long-term gradients is difficult because the total gradient, which is a sum of short-term and long-term gradient components, is dominated by the short-term influences.
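A small calculation illustrates this domination. If the lag-$\tau$ component of the gradient is scaled by $g^\tau$ with $g < 1$ (an illustrative assumption in line with the loop-gain argument above), the long-term share of the total gradient is geometrically small:

```python
import numpy as np

# Total gradient as a sum over time lags, where the lag-tau component is
# scaled by gain**tau (gain < 1; the value is illustrative).
gain = 0.8
components = gain ** np.arange(1, 201)
long_term_share = components[10:].sum() / components.sum()
print(long_term_share)   # ~0.11: lags beyond 10 contribute about 11 percent
```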
Long Short-Term Memory. Several proposals have been made to overcome this difficulty. One is the use of long short-term memory (LSTM), proposed by Hochreiter and Schmidhuber [100]. This algorithm works in networks that include special-purpose units, called memory cells, which are used to latch information. The