5.3.1 Deriving the Delta Rule
Now that we can see how the delta rule works, we will show how it can be derived directly from the derivative of the sum-squared error measure in equation 5.2 with respect to the weights from the input units. Box 5.1 provides a primer on derivatives for readers unfamiliar with the mathematical details. The most important thing is to understand the effects of these derivatives in terms of credit assignment, as explained above.

The mathematical expression of the derivative of the error in terms of its components is written as follows:

∂SSE/∂w_ik = (∂SSE/∂o_k)(∂o_k/∂w_ik)    (5.4)
which is simply to say that the derivative of the error with respect to the weights is the product of two terms: the derivative of the error with respect to the output, and the derivative of the output with respect to the weights. In other words, we can understand how the error changes as the weights change in terms of how the error changes as the output changes, together with how the output changes as the weights change. We can break down derivatives in this way ad infinitum (as we will see later in the chapter in the derivation of backpropagation) according to the chain rule from calculus, in which the value in the denominator in one term must be the numerator in the next term. Then we can consider each term separately.
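To make this decomposition concrete, here is a minimal numerical sketch (our own illustration, not from the text; the inputs, weights, and target values are invented) checking that a finite-difference estimate of ∂SSE/∂w_ik matches the product of the two factors in equation 5.4 for a single linear unit:

```python
# Finite-difference check of equation 5.4 for one weight of a linear unit.
# All values (s, w, t) are invented for illustration.
eps = 1e-6
s = [0.5, 1.0, 0.2]     # input activations s_i
w = [0.3, -0.1, 0.8]    # weights w_ik into output unit k
t = 1.0                 # target t_k

def output(weights):
    # Linear activation: o_k = sum_i s_i * w_ik
    return sum(si * wi for si, wi in zip(s, weights))

def sse(o):
    # Error for this one unit and event: (t_k - o_k)^2
    return (t - o) ** 2

o = output(w)
w_plus = [w[0] + eps] + w[1:]                    # perturb only the first weight

dSSE_dw = (sse(output(w_plus)) - sse(o)) / eps   # left-hand side of 5.4
dSSE_do = (sse(o + eps) - sse(o)) / eps          # first factor
do_dw = (output(w_plus) - o) / eps               # second factor

print(dSSE_dw, dSSE_do * do_dw)                  # the two agree up to O(eps)
```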
We first consider ∂SSE/∂o_k, where SSE = Σ_k (t_k − o_k)². To break this more complicated function into more tractable parts, we use the fact that the derivative of a function h(x) that can be written in terms of two component functions, h(x) = f(g(x)), is the product of the derivatives of the two component functions, h′(x) = f′(g(x)) g′(x) (which is actually just another instantiation of the chain rule).
In the case of SSE, using o_k as the variable instead of x, f(g(o_k)) = g(o_k)² and g(o_k) = t_k − o_k, and f′(g(o_k)) = 2g(o_k) (because the derivative of x² is 2x) and g′(o_k) = −1 (because t_k is a constant with respect to changes in o_k, its derivative is 0, so it disappears, and then the derivative of −1x with respect to x is −1x⁰ = −1). Multiplying these terms gives us:

∂SSE/∂o_k = −2(t_k − o_k)    (5.5)
Notice that the sums over events and different output units drop out when considering how to change the weights for a particular output unit for a particular event. The learning rule is thus “local” in the sense that it only depends on the single output unit and a single input/output pattern.
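As a quick check of both equation 5.5 and this locality property, one can perturb a single output numerically; the targets and outputs below are invented for illustration:

```python
# Numerical check of equation 5.5 and locality: perturbing o_k changes
# only unit k's term in the summed error. All values are invented.
eps = 1e-6
t = [1.0, 0.0, 1.0]   # targets t_k
o = [0.6, 0.2, 0.9]   # outputs o_k

def sse(outs):
    # Squared error summed over all output units
    return sum((tk - ok) ** 2 for tk, ok in zip(t, outs))

k = 1
o_plus = o[:k] + [o[k] + eps] + o[k + 1:]
numeric = (sse(o_plus) - sse(o)) / eps
analytic = -2 * (t[k] - o[k])      # equation 5.5; other units drop out
print(numeric, analytic)           # agree up to O(eps)
```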
Now we can consider the second term, ∂o_k/∂w_ik. Although we will subsequently be able to accommodate more complex activation functions, we start with a linear activation function to make the derivation simpler:

o_k = Σ_i s_i w_ik    (5.6)
The derivative of the linear activation function is simply:

∂o_k/∂w_ik = s_i    (5.7)
In other words, the input signal indicates how the output changes with changes in the weights. Notice again that all the elements of the sum that do not involve the particular weight drop out.
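The same finite-difference style of check (again with invented values) shows that the derivative with respect to each weight is just the corresponding input:

```python
# Equation 5.7: for a linear unit, do_k/dw_ik equals the input s_i,
# whichever weight we perturb. Values are invented for illustration.
eps = 1e-6
s = [0.5, 1.0, 0.2]    # inputs s_i
w = [0.3, -0.1, 0.8]   # weights w_ik

def output(weights):
    return sum(si * wi for si, wi in zip(s, weights))

for i in range(len(w)):
    w_plus = list(w)
    w_plus[i] += eps
    do_dw = (output(w_plus) - output(w)) / eps
    print(round(do_dw, 4), s[i])   # matches s_i for every i
```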
Putting the two terms together, the full derivative for the case with linear activations is:

∂SSE/∂w_ik = −2(t_k − o_k) s_i    (5.8)

We can use the negative of this derivative for our learning rule, because we minimize functions (e.g., error) by moving the relevant variable (e.g., the weight) opposite the direction of the derivative. Also, we can either absorb the factor of 2 into the arbitrary learning rate constant ε, or introduce a factor of 1/2 in the error measure (as is often done), canceling it out. This gives us the delta rule shown previously:

Δw_ik = ε (t_k − o_k) s_i    (5.9)
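Read as an algorithm, equations 5.6 and 5.9 give a complete learning procedure. A minimal sketch (the patterns, targets, and learning rate ε below are invented; this is our illustration rather than the book's code) might look like:

```python
# Delta rule (equation 5.9) training one linear output unit.
# Patterns, targets, and the learning rate epsilon are invented.
epsilon = 0.1
patterns = [([1.0, 0.0, 1.0], 1.0),   # (input activations s, target t)
            ([0.0, 1.0, 1.0], 0.0)]
w = [0.0, 0.0, 0.0]

for epoch in range(50):
    total_sse = 0.0
    for s, t in patterns:
        o = sum(si * wi for si, wi in zip(s, w))    # equation 5.6
        for i, si in enumerate(s):
            w[i] += epsilon * (t - o) * si          # equation 5.9
        total_sse += (t - o) ** 2

print(round(total_sse, 4))   # summed error approaches 0 as learning proceeds
```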
5.3.2 Learning Bias Weights
One issue we have not focused on yet is how the bias weights learn. Recall that the bias weights provide a constant additional input to the neuron (section 2.5.1), and that proper bias weight values can be essential for