is equal to h_j (1 − h_j). We also continue to use the cross-entropy error function.

The main idea behind backpropagation is that we can train the weights from the input to the hidden layer by applying the chain rule all the way down through the network to this point. This chain of derivatives is:

\frac{\partial CE}{\partial w_{ij}} = \sum_k \frac{\partial CE}{\partial o_k} \frac{\partial o_k}{\partial \eta_k} \frac{\partial \eta_k}{\partial h_j} \frac{\partial h_j}{\partial \eta_j} \frac{\partial \eta_j}{\partial w_{ij}}    (5.25)

Because this is quite a mouthful, let's break it up into smaller bites. First, let's look at the first three derivative terms:

\frac{\partial CE}{\partial h_j} = \sum_k \frac{\partial CE}{\partial o_k} \frac{\partial o_k}{\partial \eta_k} \frac{\partial \eta_k}{\partial h_j}    (5.26)

This expression tells us how much a change in the activation of this hidden unit would affect the resulting error value on the output layer. Thus, you might imagine that this expression has something to do with the weights from this hidden unit to the output units, because it is through these weights that the hidden unit influences the outputs.

Indeed, as you might have already noticed, this expression is very similar to the chain rule that gave us the delta rule, because it shares the first two steps with it. The only difference is that the final term in the delta rule is the derivative of the net input into the output unit with respect to the weights (which is equal to the sending unit activation), whereas the final term here is the derivative of this same net input with respect to the sending unit activation h_j (which is equal to the weight w_jk). In other words, the derivative of h_j w_jk with respect to w_jk is h_j, and its derivative with respect to h_j is w_jk.

Thus the expression for the derivative in equation 5.26 is just like the delta rule, except with a weight at the end instead of an activation:

\frac{\partial CE}{\partial h_j} = -\sum_k (t_k - o_k) w_{jk}    (5.27)

This computation is very interesting, because it suggests that hidden units can compute their contribution to the overall output error by just summing over the errors of all the output units that they project to, and weighting these errors by the strength of that unit's contribution to the output units. You might recognize that this is the first term in the computation of δ_j from equation 5.24.
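To make equation 5.27 concrete, here is a small numerical sketch for a hidden unit that projects to two output units. The targets, outputs, and weights are made-up values chosen only for illustration, not taken from the text.

```python
# Hypothetical numbers for a hidden unit j projecting to two output units.
t = [1.0, 0.0]        # targets t_k
o = [0.6, 0.3]        # actual outputs o_k
w_jk = [0.5, -0.2]    # weights from hidden unit j to each output unit k

# Equation 5.27: dCE/dh_j = -sum_k (t_k - o_k) * w_jk
dCE_dh_j = -sum((tk - ok) * w for tk, ok, w in zip(t, o, w_jk))
print(dCE_dh_j)       # -(0.4*0.5 + (-0.3)*(-0.2)) = -0.26
```

Each output's error (t_k − o_k) is simply weighted by how strongly this hidden unit drives that output.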
For now, we continue the great chain of backpropagation and compute the remaining derivatives. These are actually quite similar to those we have already computed, consisting just of the derivative of the sigmoidal activation function itself, and the derivative of the net input (to the hidden unit) with respect to the weights (which you know to be the sending activation):

\frac{\partial h_j}{\partial \eta_j} = h_j (1 - h_j), \qquad \frac{\partial \eta_j}{\partial w_{ij}} = s_i    (5.28)

so the overall result of the entire chain of derivatives is:

\frac{\partial CE}{\partial w_{ij}} = -\sum_k (t_k - o_k) w_{jk} \, h_j (1 - h_j) \, s_i    (5.29)

As before, we can use the negative of equation 5.29 to adjust the weights in such a way as to minimize the overall error. Thus, in a three-layer network, backpropagation is just the delta rule for the output unit's weights, and equation 5.29 for the hidden unit's weights.
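These two update rules can be written out directly in code. The following is a minimal NumPy sketch of one backpropagation step for a three-layer sigmoid network with cross-entropy error, following the delta rule for the hidden-to-output weights and equation 5.29 for the input-to-hidden weights. The function and variable names (backprop_step, W_ih, W_ho, lrate) are illustrative choices, not from the text.

```python
import numpy as np

def sigmoid(eta):
    return 1.0 / (1.0 + np.exp(-eta))

def backprop_step(s, t, W_ih, W_ho, lrate=0.1):
    """One weight update for a three-layer network (inputs s, targets t).

    W_ih: input-to-hidden weights, shape (n_input, n_hidden)
    W_ho: hidden-to-output weights, shape (n_hidden, n_output)
    """
    # Forward pass: h_j = sigma(eta_j), o_k = sigma(eta_k)
    h = sigmoid(s @ W_ih)
    o = sigmoid(h @ W_ho)

    # Output deltas: with cross-entropy error and sigmoid outputs,
    # delta_k reduces to (t_k - o_k), giving the delta rule for W_ho.
    delta_o = t - o

    # Hidden deltas: sum the output deltas through the weights, times the
    # sigmoid derivative h_j * (1 - h_j), as in equations 5.29 / 5.30.
    delta_h = (delta_o @ W_ho.T) * h * (1.0 - h)

    # Weight changes are the negative of the error derivatives:
    # delta times the sending activation.
    W_ho += lrate * np.outer(h, delta_o)
    W_ih += lrate * np.outer(s, delta_h)
    return o

# Example usage with made-up sizes and random values.
rng = np.random.default_rng(0)
s = rng.random(4)                        # input activations s_i
t = np.array([1.0, 0.0])                 # targets t_k
W_ih = rng.normal(scale=0.5, size=(4, 3))
W_ho = rng.normal(scale=0.5, size=(3, 2))
backprop_step(s, t, W_ih, W_ho)
```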
5.6.2 Generic Recursive Formulation
If we just use the chain rule, we miss out on the elegant recursiveness of backpropagation expressed in equations 5.21 and 5.24. Thus, we need to introduce the δ term to achieve this more elegant formulation. As we mentioned previously, equation 5.27 only contains the first term in the expression for δ. Thus, δ_j is equal to -\frac{\partial CE}{\partial \eta_j}, and not -\frac{\partial CE}{\partial h_j}. The reason is simply that the computation breaks up more cleanly if we formalize δ in this way. Thus, for the hidden units, δ_j is:

\delta_j = \sum_k (t_k - o_k) w_{jk} \, h_j (1 - h_j)    (5.30)

where it then becomes apparent that we can express δ_j in terms of the δ_k variables on the layer above it (the output units, in this case).
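This recursive form is what lets backpropagation extend beyond three layers: each layer's deltas are computed from the deltas of the layer above it. Below is a minimal sketch of that recursion, assuming NumPy arrays and sigmoid units throughout; the list-of-weights representation and the name backward_deltas are illustrative, not from the text.

```python
import numpy as np

def backward_deltas(activations, weights, delta_top):
    """Propagate deltas down through any number of hidden layers.

    activations: hidden-layer activations [h_1, ..., h_top], bottom to top
    weights: weights[l] connects layer l to the layer above it
    delta_top: deltas on the output layer, e.g. t - o for cross-entropy
    """
    deltas = [delta_top]
    # Walk down the layers: delta_l = (delta_{l+1} summed through the
    # weights) * h_l * (1 - h_l), the same pattern as equation 5.30.
    for h, W in zip(reversed(activations), reversed(weights)):
        delta_above = deltas[0]
        deltas.insert(0, (delta_above @ W.T) * h * (1.0 - h))
    return deltas  # one delta vector per layer, bottom to top

# Example with made-up sizes: two hidden layers (4 and 3 units), 2 outputs.
rng = np.random.default_rng(1)
h1, h2 = rng.random(4), rng.random(3)
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(3, 2))
deltas = backward_deltas([h1, h2], [W1, W2], delta_top=np.array([0.4, -0.3]))
```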