Error-Driven Task Learning - Computational Explorations in Cognitive Neuroscience

Information Technology Reference

In-Depth Information

Box 5.1: How to Compute a Derivative

(x=−2)

(x=1)

−3

−2

−1

To see how derivatives work, we will first consider a very simple function, y = x 2 , shown in the above figure, with the goal of

minimizing this function. Remember that eventually we will want to minimize the more complex function that relates a network's

error and weights. In this simple case of y = x 2 , it is clear that to minimize y ,wewant x to be 0, because y goes to 0 as x goes to

0. However, it is not always so apparent how to minimize a function, so we will go through the more formal process of doing so.

To minimize y = x 2 , we need to understand how y changes with changes to x . In other words, we need the derivative of y with

respect to x . It turns out that the derivative of x 2 is 2x (and more generally, the derivative of bx a is bax (a 1) ). This means that the

slope of the function y = x 2 at any given point equals 2x . The above figure shows the slope of the function y = x 2 at two points,

where x =1 (so the slope equals 2) and where x = 2 (so the slope = -4). It should be clear from the figure and the formula for

the derivative that when x is positive the slope is positive, indicating that y increases as x increases; and, when x is negative the

slope is negative, indicating that y decreases as x increases. Moreover, the further x is from 0, the further we are from minimizing

, and the larger the derivative. To minimize y , we thus want to change x in the direction opposite the derivative, and in step sizes

according to the absolute magnitude of the derivative. Doing this for any value of x will eventually take x to 0, thereby minimizing

The logic behind minimizing the network's error with respect to its weights is identical, but the process is complicated by the fact

that the relationship between error and weights is not as simple as y = x 2 . More specifically, the weights do not appear directly in

the error equation 5.2 in the way that x appears in the equation for y in our simple y = x 2 example. To deal with this complexity,

we can break the relationship between error and weights into its components. Rather than determining in a single step how the error

changes as the weights change, we can determine in separate steps how the error changes as the network output changes (because

the output does appear directly in the error equation), and how the network output changes as the weights change (because the

weights appear directly in the equation for the network output, simplified for the present as described below). Then we can put

these terms together to determine how the error changes as the weights change.

producing useful representations, for example by allow-

ing units to represent weaker inputs (e.g., section 3.3.1).

There really is no straightforward way to train bias

weights in the Hebbian model-learning paradigm, be-

cause they do not reflect any correlational information

(i.e., the correlation between a unit and itself, which is

the only kind of correlational information a bias weight

could represent, is uninformative). However, error-

driven task learning using the delta rule algorithm can

put the bias weights to good use.

The standard way to train bias weights is to treat them

as weights coming from a unit that is always active (i.e.,

). If we substitute this 1 into equation 5.9, the

bias weight ( k ) change is just:

(5.10)

Thus, the bias weight will always adjust to directly de-

crease the error. For example, to the extent that a unit

is often active when it shouldn't be, the bias weight

change will be more negative than positive, causing the

Computational Explorations in Cognitive Neuroscience

Search WWH ::

Custom Search

Home