This equation shows that the backpropagation of error involves two terms. The first term passes back the δ terms from the output layer in much the same way that the activation values are passed forward in the network: by computing a weighted sum of δ_k w_jk over the output units. This weighted sum is then multiplied by the second term, h_j(1 − h_j), which is the derivative of the activation function σ'(η_j). This multiplication by the derivative is analogous to passing the net input through the activation function in the forward propagation of activation. Thus, it is useful to think of error backpropagation as roughly the inverse of forward activation propagation.
The multiplication by the derivative of the activation function in equation 5.24 has some important implications for the qualitative behavior of learning. Specifically, this derivative can be understood in terms of how much difference a given amount of weight change is actually going to make on the activation value of the unit: if the unit is in the sensitive middle range of the sigmoid function, then a weight change will make a big difference. This is where the activation derivative is at its maximum (e.g., .5(1 − .5) = .25). Conversely, when a unit is “pegged” against one of the two extremes (0 or 1), then a weight change will make relatively little difference, which is consistent with the derivative being very small (e.g., .99(1 − .99) = .0099). Thus, the learning rule will effectively focus learning on those units that are more labile (think of lobbying undecided senators versus those who are steadfast in their opinions; you want to focus your efforts where they have a chance of paying off).
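The two derivative values quoted above can be checked directly; this small snippet simply evaluates h(1 − h) at a mid-range and at a nearly saturated activation:

```python
# Sigmoid derivative h * (1 - h) for a mid-range versus a "pegged" unit.
for h in (0.5, 0.99):
    print(f"h = {h}: derivative = {h * (1 - h):.4f}")
# Output: 0.2500 for h = 0.5  (weight changes matter a lot),
#         0.0099 for h = 0.99 (weight changes matter very little).
```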
One final note before we proceed with the details of
the derivation is that one can iteratively apply equa-
tion 5.24 for as many hidden layers as there are in the
network, and all the math works out correctly. Thus,
backpropagation allows many hidden layers to be used.
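As a sketch of what this iteration over hidden layers might look like in code (the layer and weight lists below are hypothetical, and the activations are assumed to have already been computed by a forward pass):

```python
import numpy as np

def backprop_deltas(activations, weights, targets):
    """Apply the delta recursion of equation 5.24 from the output layer back
    through every hidden layer, however many there are.

    activations[0] is the input pattern, activations[-1] the output;
    weights[l] connects activations[l] to activations[l + 1]."""
    deltas = [None] * len(weights)
    # Output layer: delta_k = -(t_k - o_k)
    deltas[-1] = -(targets - activations[-1])
    # Hidden layers, from the last hidden layer back toward the input:
    # delta_j = (sum_k delta_k w_jk) * h_j * (1 - h_j)
    for l in range(len(weights) - 2, -1, -1):
        h = activations[l + 1]
        deltas[l] = (weights[l + 1] @ deltas[l + 1]) * h * (1.0 - h)
    return deltas

# Example with two hidden layers (sizes chosen arbitrarily for illustration):
rng = np.random.default_rng(0)
acts = [rng.random(4), rng.random(5), rng.random(3), rng.random(2)]
wts = [rng.standard_normal((4, 5)),
       rng.standard_normal((5, 3)),
       rng.standard_normal((3, 2))]
deltas = backprop_deltas(acts, wts, targets=np.array([1.0, 0.0]))
```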
Box 5.2 summarizes the backpropagation algorithm.
Box 5.2: The Backpropagation Algorithm
[Box figure: a three-layer network with input activations s_i, hidden activations h_j, and output activations o_k compared against targets t_k. The output-layer error is δ_k = −(t_k − o_k); it is passed back to the hidden layer as δ_j = Σ_k δ_k w_jk σ'(η_j), where σ'(η_j) = h_j(1 − h_j). The weight changes are based on δ_k h_j for w_jk and δ_j s_i for w_ij.]
Activations are computed using a standard sigmoidal activation function (see box 2.1 in chapter 2). We use the cross entropy error function (though sum-squared error, SSE, is also commonly used):

CE = −Σ_k [t_k log o_k + (1 − t_k) log(1 − o_k)]

where t_k is the target value, and o_k is the actual output activation. Minimizing this error is the objective of backpropagation.
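For concreteness, a minimal NumPy sketch of this error measure (the function name and the small epsilon guard are additions for illustration, not part of the algorithm as stated):

```python
import numpy as np

def cross_entropy(t, o, eps=1e-12):
    """CE = -sum_k [ t_k log o_k + (1 - t_k) log(1 - o_k) ].
    eps guards against log(0) when an output is pegged at 0 or 1."""
    o = np.clip(o, eps, 1.0 - eps)
    return -np.sum(t * np.log(o) + (1.0 - t) * np.log(1.0 - o))
```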
Error backpropagation takes place via the same kinds of
weighted-sum equations that produce the activation, except
everything happens in reverse, so it is like the inverse of the
activation propagation.
For the output units, the delta (δ) error term is:

δ_k = −(t_k − o_k)
which is backpropagated down to the hidden units using the following equation (which can be iteratively applied for as many hidden layers as exist in the network):

δ_j = Σ_k δ_k w_jk h_j(1 − h_j)

where h_j(1 − h_j) is the derivative of the sigmoidal activation function for hidden unit activation h_j, also written as σ'(η_j), where η_j is the net input for unit j.
The weights are then updated to minimize this error (by taking the negative of it):

Δw_ij = −ε δ_j x_i

where ε is the learning rate parameter (lrate), and x_i is the sending unit activation.

5.6.1
Derivation of Backpropagation

For this derivation, we again need the net input and sigmoidal activation equations. The net input for unit j is η_j = Σ_i x_i w_ij. The sigmoidal activation function is h_j = σ(η_j) = 1/(1 + e^(−η_j)), and its derivative is written as σ'(η_j).
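As a quick check on this derivative (a step not spelled out at this point in the text), differentiating the logistic function gives:

σ'(η_j) = d/dη_j [ 1/(1 + e^(−η_j)) ] = e^(−η_j)/(1 + e^(−η_j))² = σ(η_j)(1 − σ(η_j)) = h_j(1 − h_j)

which is exactly the h_j(1 − h_j) factor that appears in the hidden-unit delta above.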