Intuitively, this thought experiment makes clear that the TD error signal provides a useful means of training both the AC system itself and the actor. This dual use of the TD signal is reflected in the biology by the fact that the dopamine signal (which putatively represents the TD error δ(t)) projects both to the areas that control the dopamine signal itself and to the other areas that can be considered the actor network (figure 6.16). It should also be noted that many different varieties of TD learning exist (and even more so within the broader category of reinforcement learning algorithms), and that extensive mathematical analysis has been performed showing that the algorithm will converge to the correct result (e.g., Dayan, 1992). One particularly important variation has to do with the use of something called an eligibility trace, which is basically a time-averaged activation value that is used for learning instead of the instantaneous activation value. The role of this trace is analogous to the hysteresis parameter fm_prv in the SRN context units (where fm_hid is then 1 − fm_prv). The value of the trace parameter is usually represented with the symbol λ, and the form of TD using this parameter is written TD(λ). The case we have (implicitly) been considering is TD(0), because we have not included any trace activations.
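To make the eligibility trace concrete, here is a minimal sketch (not the simulator's actual code) of a linear TD(λ) update in which the trace e is a decaying, time-averaged copy of the input activations and is used in place of the instantaneous activations when changing the weights. The names td_lambda_step, alpha, lam, and so on are illustrative assumptions.

```python
import numpy as np

def td_lambda_step(w, e, x_t, x_next, r_t, alpha=0.1, gamma=0.9, lam=0.8):
    """One TD(lambda) update for a linear value estimate V_hat(x) = w . x.

    The eligibility trace e is a time-averaged activation: it decays by
    gamma * lam each step and accumulates the current input activations,
    so recently active inputs share credit for the current TD error,
    rather than only the instantaneous activations.
    """
    v_t = w @ x_t                        # current estimate V_hat(t)
    v_next = w @ x_next                  # next-step estimate V_hat(t+1)
    delta = r_t + gamma * v_next - v_t   # TD error delta(t)
    e = gamma * lam * e + x_t            # decaying, time-averaged activations
    w = w + alpha * delta * e            # learn on the trace, not just x_t
    return w, e, delta
```

Setting lam to 0 makes the trace collapse to the current activations, recovering the TD(0) case considered here; values closer to 1 let earlier activations share credit for later rewards.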
6.7.2 Phase-Based Temporal Differences

Just as we were able to use phase-based activation differences to implement error-driven learning, it is relatively straightforward to do the same with TD learning, which makes it transparent to introduce TD learning within the overall Leabra framework. As figure 6.19 makes clear, there are two values whose difference constitutes the TD error δ: V̂(t), and γV̂(t+1) + r(t). Thus, we can implement TD by setting the minus phase activation of the AC unit to V̂(t), and the plus phase to γV̂(t+1) + r(t) (figure 6.21). To the extent that there is a network of units supporting the ultimate computation of the AC δ value, these weights will be automatically updated to reduce the TD error just by doing the standard GeneRec learning on these two phases of activation. In this section, we address some of the issues the phase-based implementation raises.
Figure 6.21: Computation of TD using the minus-plus phase framework. [Diagram: a stimulus and a reward r(3) drive the AC unit across time steps 1-3; at each step the minus phase holds V̂(t) and the plus phase holds γV̂(t+1) or the reward, yielding the TD error δ.] The minus phase value of the AC unit for each time step is clamped to be the prior estimate of future rewards, V̂(t), and the plus phase is either a computed estimate of discounted future rewards, γV̂(t+1), or an actual injected reward, r(t) (but not both, which requires an absorbing reward assumption as described in the text).
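To illustrate how this phase difference drives learning mechanically, the following sketch applies a simplified delta-rule form of the phase-based update to the weights coming into the AC unit; it is a stand-in for the full GeneRec/Leabra rule rather than the actual implementation, and the names ac_phase_update, sender_acts, and lrate are illustrative.

```python
import numpy as np

def ac_phase_update(w, sender_acts, v_minus, v_plus, lrate=0.01):
    """Delta-rule-style update for weights into the AC unit.

    v_minus is the clamped prior estimate V_hat(t); v_plus is either the
    settled estimate gamma * V_hat(t+1) or the clamped reward r(t).  Their
    difference is exactly the TD error delta, so the same phase-based
    learning used elsewhere in the framework reduces the TD error here.
    """
    delta = v_plus - v_minus                 # TD error as a phase difference
    w = w + lrate * delta * sender_acts      # credit each sending unit's activation
    return w, delta
```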
First, let's consider what happens when the network experiences a reward. In this case, the plus phase activation should be equal to the reward value, plus any additional discounted expected reward beyond this current reward (recall that the plus phase value is γV̂(t+1) + r(t)). It would be much simpler if we could consider this additional γV̂(t+1) term to be zero, because then we could just clamp the unit with the reward value in the plus phase. In fact, in many applications of reinforcement learning, the entire network is reset after reward is achieved and a new trial is begun, which is often referred to as an absorbing reward. We use this absorbing reward assumption, and just clamp the reward value in the plus phase when an external reward is delivered.
In the absence of an external reward, the plus phase should represent the estimated discounted future rewards, γV̂(t+1). This estimate is computed by the AC unit in the plus phase by standard activation updating as a function of the weights. Thus, in the absence of external reward, the plus phase for the AC unit is actually an unclamped settling phase (represented by the forward-going arrow in figure 6.21), which is in contrast with the usual error-driven phase schema, but consistent with the needs of the TD algorithm. Essentially, the ultimate plus phase comes later in time, when the AC unit is clamped with an actual external reward.
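Putting the two cases together, here is a minimal sketch of how the AC unit's minus and plus phase values could be chosen on each time step under the absorbing reward assumption; the function ac_phases and its arguments are hypothetical, and v_hat_next stands for whatever the network's standard settling process computes for V̂(t+1).

```python
def ac_phases(v_hat_t, v_hat_next, reward=None, gamma=0.9):
    """Minus and plus phase values for the AC unit at time t.

    Minus phase: clamped to the prior estimate V_hat(t).
    Plus phase:  clamped to r(t) if an external reward arrives (the
                 absorbing reward assumption lets us drop the
                 gamma * V_hat(t+1) term, since the trial ends here);
                 otherwise the freely settled estimate gamma * V_hat(t+1).
    """
    minus = v_hat_t
    plus = reward if reward is not None else gamma * v_hat_next
    return minus, plus
```

The plus-minus difference from these two values is the δ that drives the weight update sketched after figure 6.21; once a reward has been clamped, the network state is reset and a new trial begins.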