is equal to h_j (1 − h_j). We also continue to use the cross-entropy error function.

The main idea behind backpropagation is that we can train the weights from the input to the hidden layer by applying the chain rule all the way down through the network to this point. This chain of derivatives is:

\frac{\partial CE}{\partial w_{ij}} = \sum_k \frac{\partial CE}{\partial o_k} \frac{\partial o_k}{\partial \eta_k} \frac{\partial \eta_k}{\partial h_j} \frac{\partial h_j}{\partial \eta_j} \frac{\partial \eta_j}{\partial w_{ij}}    (5.25)

Because this is quite a mouthful, let's break it up into smaller bites. First, let's look at the first three derivative terms:

\frac{\partial CE}{\partial h_j} = \sum_k \frac{\partial CE}{\partial o_k} \frac{\partial o_k}{\partial \eta_k} \frac{\partial \eta_k}{\partial h_j}    (5.26)

This expression tells us how much a change in the activation of this hidden unit would affect the resulting error value on the output layer. Thus, you might imagine that this expression has something to do with the weights from this hidden unit to the output units, because it is through these weights that the hidden unit influences the outputs.

Indeed, as you might have already noticed, this expression is very similar to the chain rule that gave us the delta rule, because it shares the first two steps with it. The only difference is that the final term in the delta rule is the derivative of the net input into the output unit with respect to the weights (which is equal to the sending unit activation), whereas the final term here is the derivative of this same net input with respect to the sending unit activation h_j (which is equal to the weight w_jk). In other words, the derivative of h_j w_jk with respect to w_jk is h_j, and its derivative with respect to h_j is w_jk.

Thus the expression for the derivative in equation 5.26 is just like the delta rule, except with a weight at the end instead of an activation:

\frac{\partial CE}{\partial h_j} = -\sum_k (t_k - o_k) w_{jk}    (5.27)

This computation is very interesting, because it suggests that hidden units can compute their contribution to the overall output error by just summing over the errors of all the output units that they project to, and weighting these errors by the strength of that unit's contribution to the output units. You might recognize that this is the first term in the computation of δ_j from equation 5.24.
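To make equation 5.27 concrete, here is a small numerical sketch for a hidden unit that projects to two output units. The targets, outputs, and weights are made-up values chosen only for illustration, not taken from the text.

```python
# Hypothetical numbers for a hidden unit j projecting to two output units.
t = [1.0, 0.0]        # targets t_k
o = [0.6, 0.3]        # actual outputs o_k
w_jk = [0.5, -0.2]    # weights from hidden unit j to each output unit k

# Equation 5.27: dCE/dh_j = -sum_k (t_k - o_k) * w_jk
dCE_dh_j = -sum((tk - ok) * w for tk, ok, w in zip(t, o, w_jk))
print(dCE_dh_j)       # -(0.4*0.5 + (-0.3)*(-0.2)) = -0.26
```

Each output's error (t_k − o_k) is simply weighted by how strongly this hidden unit drives that output.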
For now, we continue the great chain of backpropagation and compute the remaining derivatives. These are actually quite similar to those we have already computed, consisting just of the derivative of the sigmoidal activation function itself, and the derivative of the net input (to the hidden unit) with respect to the weights (which you know to be the sending activation):

\frac{\partial h_j}{\partial \eta_j} = h_j (1 - h_j), \qquad \frac{\partial \eta_j}{\partial w_{ij}} = s_i    (5.28)

so the overall result of the entire chain of derivatives is:

\frac{\partial CE}{\partial w_{ij}} = -\sum_k (t_k - o_k) w_{jk} \, h_j (1 - h_j) \, s_i    (5.29)

As before, we can use the negative of equation 5.29 to adjust the weights in such a way as to minimize the overall error. Thus, in a three-layer network, backpropagation is just the delta rule for the output unit's weights, and equation 5.29 for the hidden unit's weights.
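These two update rules can be written out directly in code. The following is a minimal NumPy sketch of one backpropagation step for a three-layer sigmoid network with cross-entropy error, following the delta rule for the hidden-to-output weights and equation 5.29 for the input-to-hidden weights. The function and variable names (backprop_step, W_ih, W_ho, lrate) are illustrative choices, not from the text.

```python
import numpy as np

def sigmoid(eta):
    return 1.0 / (1.0 + np.exp(-eta))

def backprop_step(s, t, W_ih, W_ho, lrate=0.1):
    """One weight update for a three-layer network (inputs s, targets t).

    W_ih: input-to-hidden weights, shape (n_input, n_hidden)
    W_ho: hidden-to-output weights, shape (n_hidden, n_output)
    """
    # Forward pass: h_j = sigma(eta_j), o_k = sigma(eta_k)
    h = sigmoid(s @ W_ih)
    o = sigmoid(h @ W_ho)

    # Output deltas: with cross-entropy error and sigmoid outputs,
    # delta_k reduces to (t_k - o_k), giving the delta rule for W_ho.
    delta_o = t - o

    # Hidden deltas: sum the output deltas through the weights, times the
    # sigmoid derivative h_j * (1 - h_j), as in equations 5.29 / 5.30.
    delta_h = (delta_o @ W_ho.T) * h * (1.0 - h)

    # Weight changes are the negative of the error derivatives:
    # delta times the sending activation.
    W_ho += lrate * np.outer(h, delta_o)
    W_ih += lrate * np.outer(s, delta_h)
    return o

# Example usage with made-up sizes and random values.
rng = np.random.default_rng(0)
s = rng.random(4)                        # input activations s_i
t = np.array([1.0, 0.0])                 # targets t_k
W_ih = rng.normal(scale=0.5, size=(4, 3))
W_ho = rng.normal(scale=0.5, size=(3, 2))
backprop_step(s, t, W_ih, W_ho)
```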
5.6.2 Generic Recursive Formulation
If we just use the chain rule, we miss out on the elegant recursiveness of backpropagation expressed in equations 5.21 and 5.24. Thus, we need to introduce the δ term to achieve this more elegant formulation. As we mentioned previously, equation 5.27 only contains the first term in the expression for δ. Thus, δ_j is equal to -\frac{\partial CE}{\partial \eta_j}, and not -\frac{\partial CE}{\partial h_j}. The reason is simply that the computation breaks up more cleanly if we formalize δ in this way. Thus, for the hidden units, δ_j is:

\delta_j = \sum_k (t_k - o_k) w_{jk} \, h_j (1 - h_j)    (5.30)

where it then becomes apparent that we can express δ_j in terms of the δ_k variables on the layer above it (the output units, in this case).
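This recursive form is what lets backpropagation extend beyond three layers: each layer's deltas are computed from the deltas of the layer above it. Below is a minimal sketch of that recursion, assuming NumPy arrays and sigmoid units throughout; the list-of-weights representation and the name backward_deltas are illustrative, not from the text.

```python
import numpy as np

def backward_deltas(activations, weights, delta_top):
    """Propagate deltas down through any number of hidden layers.

    activations: hidden-layer activations [h_1, ..., h_top], bottom to top
    weights: weights[l] connects layer l to the layer above it
    delta_top: deltas on the output layer, e.g. t - o for cross-entropy
    """
    deltas = [delta_top]
    # Walk down the layers: delta_l = (delta_{l+1} summed through the
    # weights) * h_l * (1 - h_l), the same pattern as equation 5.30.
    for h, W in zip(reversed(activations), reversed(weights)):
        delta_above = deltas[0]
        deltas.insert(0, (delta_above @ W.T) * h * (1.0 - h))
    return deltas  # one delta vector per layer, bottom to top

# Example with made-up sizes: two hidden layers (4 and 3 units), 2 outputs.
rng = np.random.default_rng(1)
h1, h2 = rng.random(4), rng.random(3)
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(3, 2))
deltas = backward_deltas([h1, h2], [W1, W2], delta_top=np.array([0.4, -0.3]))
```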