This equation shows that the backpropagation of error involves two terms. The first term passes back the δ terms from the output layer in much the same way that the activation values are passed forward in the network: by computing a weighted sum of δ_k w_jk over the output units. This weighted sum is then multiplied by the second term, h_j(1 − h_j), which is the derivative of the activation function σ'(η_j). This multiplication by the derivative is analogous to passing the net input through the activation function in the forward propagation of activation. Thus, it is useful to think of error backpropagation as roughly the inverse of forward activation propagation.
The multiplication by the derivative of the activation function in equation 5.24 has some important implications for the qualitative behavior of learning. Specifically, this derivative can be understood in terms of how much difference a given amount of weight change is actually going to make on the activation value of the unit: if the unit is in the sensitive middle range of the sigmoid function, then a weight change will make a big difference. This is where the activation derivative is at its maximum (e.g., .5(1 − .5) = .25). Conversely, when a unit is “pegged” against one of the two extremes (0 or 1), then a weight change will make relatively little difference, which is consistent with the derivative being very small (e.g., .99(1 − .99) = .0099). Thus, the learning rule will effectively focus learning on those units that are more labile (think of lobbying undecided senators versus those who are steadfast in their opinions; you want to focus your efforts where they have a chance of paying off).
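The two derivative values quoted above can be checked directly; this small snippet simply evaluates h(1 − h) at a mid-range and at a nearly saturated activation:

```python
# Sigmoid derivative h * (1 - h) for a mid-range versus a "pegged" unit.
for h in (0.5, 0.99):
    print(f"h = {h}: derivative = {h * (1 - h):.4f}")
# Output: 0.2500 for h = 0.5  (weight changes matter a lot),
#         0.0099 for h = 0.99 (weight changes matter very little).
```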
One final note before we proceed with the details of
the derivation is that one can iteratively apply equa-
tion 5.24 for as many hidden layers as there are in the
network, and all the math works out correctly. Thus,
backpropagation allows many hidden layers to be used.
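As a sketch of what this iteration over hidden layers might look like in code (the layer and weight lists below are hypothetical, and the activations are assumed to have already been computed by a forward pass):

```python
import numpy as np

def backprop_deltas(activations, weights, targets):
    """Apply the delta recursion of equation 5.24 from the output layer back
    through every hidden layer, however many there are.

    activations[0] is the input pattern, activations[-1] the output;
    weights[l] connects activations[l] to activations[l + 1]."""
    deltas = [None] * len(weights)
    # Output layer: delta_k = -(t_k - o_k)
    deltas[-1] = -(targets - activations[-1])
    # Hidden layers, from the last hidden layer back toward the input:
    # delta_j = (sum_k delta_k w_jk) * h_j * (1 - h_j)
    for l in range(len(weights) - 2, -1, -1):
        h = activations[l + 1]
        deltas[l] = (weights[l + 1] @ deltas[l + 1]) * h * (1.0 - h)
    return deltas

# Example with two hidden layers (sizes chosen arbitrarily for illustration):
rng = np.random.default_rng(0)
acts = [rng.random(4), rng.random(5), rng.random(3), rng.random(2)]
wts = [rng.standard_normal((4, 5)),
       rng.standard_normal((5, 3)),
       rng.standard_normal((3, 2))]
deltas = backprop_deltas(acts, wts, targets=np.array([1.0, 0.0]))
```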
Box 5.2 summarizes the backpropagation algorithm.
Box 5.2: The Backpropagation Algorithm
[Box figure: a three-layer network with input activations s_i, hidden activations h_j, and output activations o_k compared against targets t_k. The output-layer error is δ_k = −(t_k − o_k); it is passed back to the hidden layer as δ_j = Σ_k δ_k w_jk σ'(η_j), where σ'(η_j) = h_j(1 − h_j). The weight changes are based on δ_k h_j for w_jk and δ_j s_i for w_ij.]
Activations are computed using a standard sigmoidal activation function (see box 2.1 in chapter 2). We use the cross entropy error function (though sum-squared error, SSE, is also commonly used):

CE = −Σ_k [t_k log o_k + (1 − t_k) log(1 − o_k)]

where t_k is the target value, and o_k is the actual output activation. Minimizing this error is the objective of backpropagation.
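For concreteness, a minimal NumPy sketch of this error measure (the function name and the small epsilon guard are additions for illustration, not part of the algorithm as stated):

```python
import numpy as np

def cross_entropy(t, o, eps=1e-12):
    """CE = -sum_k [ t_k log o_k + (1 - t_k) log(1 - o_k) ].
    eps guards against log(0) when an output is pegged at 0 or 1."""
    o = np.clip(o, eps, 1.0 - eps)
    return -np.sum(t * np.log(o) + (1.0 - t) * np.log(1.0 - o))
```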
Error backpropagation takes place via the same kinds of
weighted-sum equations that produce the activation, except
everything happens in reverse, so it is like the inverse of the
activation propagation.
For the output units, the delta (δ) error term is:

δ_k = −(t_k − o_k)
which is backpropagated down to the hidden units using the following equation (which can be iteratively applied for as many hidden layers as exist in the network):

δ_j = Σ_k δ_k w_jk h_j(1 − h_j)

where h_j(1 − h_j) is the derivative of the sigmoidal activation function for hidden unit activation h_j, also written as σ'(η_j), where η_j is the net input for unit j.
The weights are then updated to minimize this error (by taking the negative of it):

Δw_ij = −ε δ_j x_i

where ε is the learning rate parameter (lrate), and x_i is the sending unit activation.

5.6.1
Derivation of Backpropagation

For this derivation, we again need the net input and sigmoidal activation equations. The net input for unit j is η_j = Σ_i x_i w_ij. The sigmoidal activation function is h_j = σ(η_j) = 1/(1 + e^(−η_j)), and its derivative is written as σ'(η_j).
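As a quick check on this derivative (a step not spelled out at this point in the text), differentiating the logistic function gives:

σ'(η_j) = d/dη_j [ 1/(1 + e^(−η_j)) ] = e^(−η_j)/(1 + e^(−η_j))² = σ(η_j)(1 − σ(η_j)) = h_j(1 − h_j)

which is exactly the h_j(1 − h_j) factor that appears in the hidden-unit delta above.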