$\pi_{jkl}^{(i,0)} = 0$. Parallel to the update of the unit's activities, the sensitivities are updated as well:

$$\pi_{jkl}^{(i,t+1)} = \frac{\mathrm{d}f_j}{\mathrm{d}\,\mathrm{net}_j^{(i,t)}} \left[ \sum_m w_{jm}^{(i,t)} \cdot \pi_{mkl}^{(i,t)} + \delta_{kj}\, o_l^{(i,t)} \right], \tag{6.16}$$
where $\delta_{kj}$ denotes the Kronecker delta function. Gradient descent updates on the weights are then achieved by the learning rule:

$$\Delta w_{kl}^{(i,t)} = \eta \sum_j e_j^{(i,t)} \cdot \pi_{jkl}^{(i,t)}. \tag{6.17}$$
RTRL does not need to go back in time. Hence, it can be applied in an online fashion. However, it is computationally more expensive than BPTT since $O(n^4)$ operations are needed per time step, with $n$ representing the number of units in the network. If the network is fully connected, this corresponds to $O(n^2)$ operations for each weight per time step. It also needs $O(n^3)$ memory cells to store the sensitivities $\pi_{jkl}$.
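To make the two updates concrete, the following is a minimal NumPy sketch of RTRL for a small fully connected network. The network size, the logistic transfer function, the error definition $e_j = t_j - o_j$, and all identifiers are illustrative assumptions, not notation from the text:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
n = 5                      # number of units (hypothetical size)
eta = 0.1                  # learning rate
W = rng.normal(0.0, 0.1, size=(n, n))   # small initial weights

o = np.zeros(n)            # unit outputs; no external inputs, for simplicity
pi = np.zeros((n, n, n))   # sensitivities pi[j, k, l] = d o_j / d w_kl,
                           # initialized to zero as in the text

def rtrl_step(W, o, pi, target, eta):
    """One forward step plus the RTRL updates (6.16) and (6.17)."""
    net = W @ o                        # net inputs of all units
    o_new = sigmoid(net)               # new activities
    fprime = o_new * (1.0 - o_new)     # df_j/dnet_j for the logistic function

    # Eq. (6.16): pi'[j,k,l] = f'_j * ( sum_m w_jm pi[m,k,l] + delta_kj o_l )
    recurrent = np.tensordot(W, pi, axes=1)          # sum_m w_jm pi[m,k,l]
    direct = np.einsum('jk,l->jkl', np.eye(n), o)    # delta_kj * o_l
    pi_new = fprime[:, None, None] * (recurrent + direct)

    # Eq. (6.17): dW[k,l] = eta * sum_j e_j * pi[j,k,l]
    e = target - o_new                               # output errors
    W_new = W + eta * np.einsum('j,jkl->kl', e, pi_new)
    return W_new, o_new, pi_new

# drive the network toward an arbitrary constant target for a few steps
target = rng.uniform(size=n)
for t in range(10):
    W, o, pi = rtrl_step(W, o, pi, target, eta)
```

The contraction implementing the sum over $m$ runs over all $n^4$ index combinations $(j, k, l, m)$, which is where the $O(n^4)$ per-step cost quoted above comes from; the array `pi` holds the $O(n^3)$ sensitivities.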
6.3.3 Difficulty of Learning Long-Term Dependencies
Although the above algorithms for training RNNs have been known for as long as the backpropagation algorithm for FFNNs, RNNs are used less often for real-world applications than FFNNs. One of the reasons might be that training RNNs is difficult.
Since RNNs are nonlinear dynamical systems, they can expand or contract the gradient flow. If the magnitude of a loop's gain is larger than one for multiple consecutive time steps, the gradient will explode exponentially. In contrast, if the magnitude of a loop's gain is smaller than one, the gradient will decay exponentially.
The gain of a loop in an RNN depends on the magnitudes of the weights involved and on the derivatives of the transfer functions. Since the networks are frequently initialized with small weights and use sigmoidal transfer functions with small derivatives, most of the gradients decay over time.
This affects the learning of long-term dependencies, where in long sequences early inputs determine late desired outputs. Bengio et al. [28] showed that this gradient decay is the reason why gradient-based learning algorithms face an increasingly difficult problem as the duration of the dependencies to be captured increases. They showed that either it is impossible to store long-term memories or the gradient vanishes. Learning with vanishing long-term gradients is difficult because the total gradient, which is a sum of short-term and long-term gradient components, is dominated by the short-term influences.
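A small calculation illustrates this domination. If the lag-$\tau$ component of the gradient is scaled by $g^\tau$ with $g < 1$ (an illustrative assumption in line with the loop-gain argument above), the long-term share of the total gradient is geometrically small:

```python
import numpy as np

# Total gradient as a sum over time lags, where the lag-tau component is
# scaled by gain**tau (gain < 1; the value is illustrative).
gain = 0.8
components = gain ** np.arange(1, 201)
long_term_share = components[10:].sum() / components.sum()
print(long_term_share)   # ~0.11: lags beyond 10 contribute about 11 percent
```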
Long Short-Term Memory. Several proposals have been made to overcome this difficulty. One is the use of long short-term memory (LSTM), proposed by Hochreiter and Schmidhuber [100]. This algorithm works in networks that include special-purpose units, called memory cells, which are used to latch information. The