is actually clamped by external reward, and intermedi-
ate plus phases prior to that in time are all ultimately
driven by that later plus phase by the requirement of
consistency in reward prediction. Thus, this settling in
the plus phase is a kind of “estimated plus phase” in lieu
of actually having an external plus phase value.
The minus phase AC unit value is always clamped to
the undiscounted value of reward that we estimated on
the plus phase of the previous time step. In other words,
the minus phase of the next time step is equal to the
plus phase of the previous time step. To account for the
fact that we assumed that the plus phase computed the
discounted estimated reward, we have to multiply that
plus phase value by 1/γ to undiscount it when we copy it over to the next time step as our estimate of V(t).

In practice, we typically use a γ value of 1, which simplifies the implementational picture somewhat by allowing the next minus phase state to be a direct copy of the prior plus phase state. Thus, one could imagine that this just corresponds to a single maintained activation value across the previous plus phase and the next minus phase. By also using absorbing rewards with γ = 1, we avoid the problem of accounting for an infinity of future states: our horizon extends only to the point at which we receive our next reward. We will discuss in chapter 11 how the effective choosing of greater delayed rewards over lesser immediate rewards can be achieved by simultaneously performing TD-like learning at multiple time scales (Sutton, 1995).

Figure 6.21 also makes it clear that the weight adjustment computation must use the sending activations at time t but the TD error (plus-minus phase difference) at time t+1. This is because while the AC unit is computing V(t+1) based on stimulus activities at time t, the TD error for updating V(t+1) is not actually computed until the next time step. It is important to note that this skewing of time is not an artifact of the phase-based implementation of TD, but rather an intrinsic aspect of the algorithm, which requires the use of future states (i.e., V(t+1)) to adapt prior estimates. It is this spanning of contingencies across time steps that allows the network to propagate information from the future back in time.
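To make this phase-based bookkeeping concrete, here is a minimal sketch of one AC-unit time step. It is written in plain Python with hypothetical names (td_phase_step, ac_plus_prev, x_prev, and so on) rather than anything from the actual simulator, and it simply assumes a linear value estimate over the stimulus units. The point is the copy-and-undiscount step for the minus phase and the use of the previous step's sending activations in the weight update.

    import numpy as np

    def td_phase_step(w, x_t, r_t, ac_plus_prev, x_prev, gamma=1.0, lrate=0.1):
        """One hypothetical phase-based TD step for a single AC unit.
        w            -- weights from the stimulus units to the AC unit
        x_t          -- stimulus activations at the current time step
        r_t          -- external reward (None if no reward is delivered)
        ac_plus_prev -- the AC unit's plus-phase value from the previous step
        x_prev       -- stimulus activations from the previous step (needed
                        for the time-skewed weight update)
        """
        # Minus phase: copy the previous plus phase forward, undiscounting it
        # by 1/gamma (with gamma = 1 this is just a maintained activation).
        ac_minus = ac_plus_prev / gamma

        # Plus phase: clamped by external reward if present; otherwise the
        # unit settles on its own discounted estimate of future reward
        # (the "estimated plus phase").
        if r_t is not None:
            ac_plus = r_t
        else:
            ac_plus = gamma * float(np.dot(w, x_t))

        # The TD error is the plus-minus phase difference computed on this step...
        delta = ac_plus - ac_minus

        # ...but the weight change pairs it with the sending activations from
        # the previous step, because those produced the estimate being corrected.
        w = w + lrate * delta * x_prev
        return w, ac_plus

On each call, the value returned as ac_plus is passed back in as ac_plus_prev on the next step, and x_t becomes x_prev, which is how the prediction made at time t gets corrected by the error that only becomes available at time t+1.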
The implementation of TD that we have explained here can be made somewhat more biologically plausible by combining it with the context representations used in the SRN model. As we will explain further in chapters 9 and 11, we can use the TD error signal to control when the context representations get updated, and the use of these context representations simplifies the issues of time skew that we just discussed. Thus, the version of the algorithm that we actually think the brain is implementing is somewhat more biologically plausible than it might otherwise seem.
6.7.3 Exploration of TD: Classical Conditioning
To explore the TD learning rule (using the phase-based
implementation just described), we use the simple clas-
sical conditioning task discussed above. Thus, the net-
work will learn that a stimulus (tone) reliably predicts
the reward (and then that another stimulus reliably pre-
dicts that tone). First, we need to justify the use of the
TD algorithm in this context, and motivate the nature of
the stimulus representations used in the network.
You might recall that we said that the delta rule
(aka the Rescorla-Wagner rule) provides a good model
of classical conditioning, and thus wonder why TD is
needed. It all has to do with the issue of timing. If
one ignores the timing of the stimulus relative to the re-
sponse, then in fact the TD rule becomes equivalent to
the delta rule when everything happens at one time step
(it just trains V(t) to match r(t)). However, animals are
sensitive to the timing relationship, and, more impor-
tantly for our purposes, modeling this timing provides a
particularly clear and simple demonstration of the basic
properties of TD learning.
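To spell the equivalence out, the TD error is the difference between the new and old estimates of future reward:

    δ(t) = [r(t) + γ V(t+1)] − V(t)

If everything happens within a single time step, there is no later state to predict, so V(t+1) = 0 and the error reduces to:

    δ(t) = r(t) − V(t)

which is just the delta rule's difference between actual and predicted reward. TD only departs from the Rescorla-Wagner model when predictions at successive time steps are chained together, which is exactly where the timing of the stimulus matters.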
The only problem is that this simple demonstration
involves a somewhat unrealistic representation of tim-
ing. Basically, the stimulus representation has a dis-
tinct unit for each stimulus for each point in time, so
that there is something unique for the AC unit's weights
to learn from. This representation is the complete se-
rial compound (CSC) proposed by Sutton and Barto
(1990), and we will see exactly how it works when we
look at the model. As we have noted, we will explore
a more plausible alternative in chapter 9 where the TD
error signal controls the updating of a context represen-
tation that maintains the stimulus over time.
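As a rough picture of what the CSC involves, the following Python fragment gives the tone a distinct input unit for each time step and runs a simple TD update over repeated tone-then-reward trials. This is an illustrative sketch with made-up sizes, onset times, and names, not the representation or code used in the actual project; in the phase-based network, the two value estimates compared below appear as the AC unit's minus and plus phase values, as described above.

    import numpy as np

    # Toy complete serial compound (CSC): one input unit per stimulus per time
    # step, so the AC unit's weights have something unique to learn from at
    # each point in the trial. All sizes and times here are made up.
    n_steps   = 20
    tone_on   = 5      # tone comes on at step 5 and stays on until the reward
    reward_at = 15     # reward is delivered at step 15

    def csc_input(t):
        """CSC activation vector at time t: a distinct unit for (tone, t)."""
        x = np.zeros(n_steps)
        if tone_on <= t < reward_at:
            x[t] = 1.0
        return x

    w = np.zeros(n_steps)   # AC unit weights, one per CSC unit
    lrate, gamma = 0.5, 1.0

    for trial in range(50):
        for t in range(n_steps - 1):
            x_t, x_next = csc_input(t), csc_input(t + 1)
            r_next = 1.0 if t + 1 == reward_at else 0.0
            v_t = float(w @ x_t)
            # absorbing reward: no predictions are made beyond the reward
            v_next = 0.0 if r_next > 0 else float(w @ x_next)
            delta = r_next + gamma * v_next - v_t    # TD error
            w += lrate * delta * x_t                 # only the active unit learns

    # After training, w is near 1 for each (tone, t) unit from tone onset up to
    # the reward, so the positive TD error now occurs at tone onset rather than
    # at the time of the (now fully predicted) reward.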
Open project rl_cond.proj.gz in chapter_6.