5.3.1 Deriving the Delta Rule
Now that we can see how the delta rule works, we will show how it can be derived directly from the derivative of the sum-squared error measure in equation 5.2 with respect to the weights from the input units. Box 5.1 provides a primer on derivatives for readers unfamiliar with the mathematical details. The most important thing is to understand the effects of these derivatives in terms of credit assignment, as explained above.

The mathematical expression of the derivative of the error in terms of its components is written as follows:

∂SSE/∂w_ik = (∂SSE/∂o_k)(∂o_k/∂w_ik)    (5.4)
which is simply to say that the derivative of the error with respect to the weights is the product of two terms: the derivative of the error with respect to the output, and the derivative of the output with respect to the weights. In other words, we can understand how the error changes as the weights change in terms of how the error changes as the output changes, together with how the output changes as the weights change. We can break down derivatives in this way ad infinitum (as we will see later in the chapter in the derivation of backpropagation) according to the chain rule from calculus, in which the value in the denominator in one term must be the numerator in the next term. Then we can consider each term separately.
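To make this decomposition concrete, here is a minimal numerical sketch (our own illustration, not from the text; the inputs, weights, and target values are invented) checking that a finite-difference estimate of ∂SSE/∂w_ik matches the product of the two factors in equation 5.4 for a single linear unit:

```python
# Finite-difference check of equation 5.4 for one weight of a linear unit.
# All values (s, w, t) are invented for illustration.
eps = 1e-6
s = [0.5, 1.0, 0.2]     # input activations s_i
w = [0.3, -0.1, 0.8]    # weights w_ik into output unit k
t = 1.0                 # target t_k

def output(weights):
    # Linear activation: o_k = sum_i s_i * w_ik
    return sum(si * wi for si, wi in zip(s, weights))

def sse(o):
    # Error for this one unit and event: (t_k - o_k)^2
    return (t - o) ** 2

o = output(w)
w_plus = [w[0] + eps] + w[1:]                    # perturb only the first weight

dSSE_dw = (sse(output(w_plus)) - sse(o)) / eps   # left-hand side of 5.4
dSSE_do = (sse(o + eps) - sse(o)) / eps          # first factor
do_dw = (output(w_plus) - o) / eps               # second factor

print(dSSE_dw, dSSE_do * do_dw)                  # the two agree up to O(eps)
```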
We first consider ∂SSE/∂o_k, where SSE = Σ_k (t_k − o_k)². To break this more complicated function into more tractable parts, we use the fact that the derivative of a function h(x) that can be written in terms of two component functions, h(x) = f(g(x)), is the product of the derivatives of the two component functions, h′(x) = f′(g(x)) g′(x) (which is actually just another instantiation of the chain rule).
In the case of SSE, using o_k as the variable instead of x, f(g(o_k)) = g(o_k)² and g(o_k) = t_k − o_k, and f′(g(o_k)) = 2g(o_k) (because the derivative of x² is 2x) and g′(o_k) = −1 (because t_k is a constant with respect to changes in o_k, its derivative is 0, so it disappears, and then the derivative of −1x with respect to x is −1x⁰ = −1). Multiplying these terms gives us:

∂SSE/∂o_k = −2(t_k − o_k)    (5.5)
Notice that the sums over events and different output units drop out when considering how to change the weights for a particular output unit for a particular event. The learning rule is thus “local” in the sense that it only depends on the single output unit and a single input/output pattern.
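As a quick check of both equation 5.5 and this locality property, one can perturb a single output numerically; the targets and outputs below are invented for illustration:

```python
# Numerical check of equation 5.5 and locality: perturbing o_k changes
# only unit k's term in the summed error. All values are invented.
eps = 1e-6
t = [1.0, 0.0, 1.0]   # targets t_k
o = [0.6, 0.2, 0.9]   # outputs o_k

def sse(outs):
    # Squared error summed over all output units
    return sum((tk - ok) ** 2 for tk, ok in zip(t, outs))

k = 1
o_plus = o[:k] + [o[k] + eps] + o[k + 1:]
numeric = (sse(o_plus) - sse(o)) / eps
analytic = -2 * (t[k] - o[k])      # equation 5.5; other units drop out
print(numeric, analytic)           # agree up to O(eps)
```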
Now we can consider the second term, ∂o_k/∂w_ik. Although we will subsequently be able to accommodate more complex activation functions, we start with a linear activation function to make the derivation simpler:

o_k = Σ_i s_i w_ik    (5.6)
The derivative of the linear activation function is simply:

∂o_k/∂w_ik = s_i    (5.7)
In other words, the input signal indicates how the output changes with changes in the weights. Notice again that all the elements of the sum that do not involve the particular weight drop out.
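The same finite-difference style of check (again with invented values) shows that the derivative with respect to each weight is just the corresponding input:

```python
# Equation 5.7: for a linear unit, do_k/dw_ik equals the input s_i,
# whichever weight we perturb. Values are invented for illustration.
eps = 1e-6
s = [0.5, 1.0, 0.2]    # inputs s_i
w = [0.3, -0.1, 0.8]   # weights w_ik

def output(weights):
    return sum(si * wi for si, wi in zip(s, weights))

for i in range(len(w)):
    w_plus = list(w)
    w_plus[i] += eps
    do_dw = (output(w_plus) - output(w)) / eps
    print(round(do_dw, 4), s[i])   # matches s_i for every i
```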
Putting the two terms together, the full derivative for the case with linear activations is:

∂SSE/∂w_ik = −2(t_k − o_k) s_i    (5.8)

We can use the negative of this derivative for our learning rule, because we minimize functions (e.g., error) by moving the relevant variable (e.g., the weight) opposite the direction of the derivative. Also, we can either absorb the factor of 2 into the arbitrary learning rate constant ε, or introduce a factor of 1/2 in the error measure (as is often done), canceling it out. This gives us the delta rule shown previously:

Δw_ik = ε (t_k − o_k) s_i    (5.9)
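Read as an algorithm, equations 5.6 and 5.9 give a complete learning procedure. A minimal sketch (the patterns, targets, and learning rate ε below are invented; this is our illustration rather than the book's code) might look like:

```python
# Delta rule (equation 5.9) training one linear output unit.
# Patterns, targets, and the learning rate epsilon are invented.
epsilon = 0.1
patterns = [([1.0, 0.0, 1.0], 1.0),   # (input activations s, target t)
            ([0.0, 1.0, 1.0], 0.0)]
w = [0.0, 0.0, 0.0]

for epoch in range(50):
    total_sse = 0.0
    for s, t in patterns:
        o = sum(si * wi for si, wi in zip(s, w))    # equation 5.6
        for i, si in enumerate(s):
            w[i] += epsilon * (t - o) * si          # equation 5.9
        total_sse += (t - o) ** 2

print(round(total_sse, 4))   # summed error approaches 0 as learning proceeds
```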
5.3.2 Learning Bias Weights
One issue we have not focused on yet is how the bias weights learn. Recall that the bias weights provide a constant additional input to the neuron (section 2.5.1), and that proper bias weight values can be essential for