To the extent that some earlier stimulus can be reliably used to produce the correct V(t) value, the network will learn this as a function of the TD error. Where no such reliable stimulus predictor exists, the reward will remain unpredictable.

Now, let's see how TD learning works in practice by revisiting the same kind of simple conditioning experiment shown in figure 6.17, where a reward was predicted by a tone that precedes it by some fixed time interval. We can simulate this by having a "tone" stimulus starting at t = 2, followed by a reward at t = 16. Figure 6.20a shows what happens to the TD error δ(t) as a function of time on the first trial of learning.
There is a large TD error when the reward occurs at t = 16 because it is completely unpredicted. Thus, if you refer to equation 6.5, V(16) = 0, V(17) = 0, and r(16) = 1. This means that δ(16) = 1 (note we have set γ = 1 in this case), and thus that the weights that produce V(16) will increase so that this value will be larger next time. This weight increase has two effects. First, it will reduce the value of δ(16) next time around, because this reward will be better predicted. Second, it will start to propagate the reward backward one time step. Thus, on the next trial, δ(15) will be .2 because the equation at time 15 includes V(16). Figure 6.20b shows how this propagation occurs all the way back to t = 2. Finally, figure 6.20c shows the "final" state where the network has learned as much as it can. It cannot propagate any further back because there is no predictive stimulus earlier in time. Thus, the network is always "surprised" when this tone occurs, but not surprised when the reward follows it.
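To make this concrete, the following is a minimal sketch in Python of the value-learning (critic) half of this simulation, not the book's actual simulator. The complete-serial-compound stimulus representation, the variable names, and the learning rate of .2 are our assumptions for illustration; the update itself is just equation 6.5 with γ = 1, applied at each step of a 20-step trial with tone onset at t = 2 and reward at t = 16.

    import numpy as np

    n_steps, tone_t, reward_t = 20, 2, 16
    gamma = 1.0   # discount factor (gamma = 1, as in the text)
    lrate = 0.2   # assumed learning rate; yields delta(15) = .2 on the second trial

    def stimulus(t):
        # Complete-serial-compound input: unit k is active when the tone has
        # been on for k steps.  Before tone onset there is no input at all,
        # so V(t) is forced to 0 there and the tone itself stays unpredicted.
        x = np.zeros(n_steps)
        if t >= tone_t:
            x[t - tone_t] = 1.0
        return x

    w = np.zeros(n_steps)                # weights producing V(t) = w . x(t)

    for trial in range(500):
        deltas = np.zeros(n_steps)
        for t in range(n_steps - 1):
            r = 1.0 if t == reward_t else 0.0
            V_t, V_next = w @ stimulus(t), w @ stimulus(t + 1)
            delta = (r + gamma * V_next) - V_t    # equation 6.5
            w += lrate * delta * stimulus(t)      # strengthen weights producing V(t)
            deltas[t] = delta
        if trial in (0, 1, 499):
            print(f"trial {trial}: delta =", np.round(deltas, 2))

Running this reproduces the qualitative pattern in figure 6.20: on the first trial the only error is δ(16) = 1; on the second trial δ(15) = .2 while δ(16) shrinks to .8; and in the final state the error has marched all the way back, leaving a single spike at the step where the tone's prediction first becomes available and zero error at the reward itself, because nothing earlier in time can predict the tone.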
The general properties of this TD model of conditioning provide a nice fit to the neural data shown in figure 6.17, suggesting that the VTA is computing something like TD error. One important discrepancy, however, is that evidence for a continuous transition like that shown in figure 6.20b is lacking. This has some important implications, and can still be explained from within the basic TD framework, as we will discuss further when we later explore the simulation that produced these figures.
We also need to specify something about the other half of TD learning, the actor. The TD error signal can easily be used to train an actor network to produce actions that increase the total expected reward. To see this, let's imagine that the actor network has produced a given action a at time t. If this action either leads directly to a reward, or leads to a previously unpredicted increase in estimated future rewards, then δ(t) will be positive. Thus, if δ(t) is used to adjust the weights in the actor network in a similar way as in the AC network, then this will increase the likelihood that this action will be produced again under similar circumstances. If there were another possible action at that time step that led to an even greater reward, it would produce larger weight changes, and would thus come to dominate over the action associated with the weaker reward.
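As a minimal illustration of this point (ours, not the book's actor network), the sketch below pairs a simple critic with a two-action softmax actor on a one-step choice task. The reward values of .5 and 1 and names such as actor_w are assumptions for illustration; the key point is that the same δ that trains the critic also adjusts the weight of whichever action was just taken.

    import numpy as np

    rng = np.random.default_rng(0)
    rewards = np.array([0.5, 1.0])   # hypothetical payoffs: action 1 is better

    V = 0.0                    # critic's estimate of expected reward
    actor_w = np.zeros(2)      # actor's weight (preference) for each action
    lrate_v, lrate_a = 0.1, 0.5

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    for trial in range(500):
        a = rng.choice(2, p=softmax(actor_w))   # actor produces an action
        r = rewards[a]
        delta = r - V                   # TD error; no successor state in a one-step task
        V += lrate_v * delta            # critic update
        actor_w[a] += lrate_a * delta   # positive delta makes this action more likely

    print("P(action):", np.round(softmax(actor_w), 3))   # action 1 comes to dominate

Once the critic's estimate rises above the smaller payoff, taking the weaker action yields a negative δ that actively weakens its weight, so the action leading to the greater reward does not just strengthen faster but eventually wins outright.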
Figure 6.20: Three stages of learning in a simple conditioning experiment, showing TD error δ(t) as a function of time. a) Shows the initial trial, where the reward is not predicted. b) Shows the transition as the estimate of reward gets predicted earlier and earlier. c) Shows the final trial, where the tone onset completely predicts the reward.