Of the two TD components, the adaptive critic has the
harder job, because the actor can just use the estimated
value function for different alternative actions to select
which action to perform next (e.g., “if I go right, how
much reward will I receive compared to going left?”).
Thus, we will focus on the adaptive critic.
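As a toy illustration of the actor's (easier) job, the minimal Python
sketch below simply compares the estimated values associated with two
candidate actions and picks the larger one; the action names and value
numbers are hypothetical, purely for illustration.

    # Toy illustration of the actor's job: pick the action whose predicted
    # outcome has the higher estimated value.  The values are hypothetical.
    V_hat = {"left": 0.2, "right": 0.7}   # estimated value after each action

    def choose_action(actions, V_hat):
        # "if I go right, how much reward will I receive compared to going left?"
        return max(actions, key=lambda a: V_hat[a])

    print(choose_action(["left", "right"], V_hat))   # -> right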
Figure 6.19: The adaptive critic computes estimated total
expected future reward (V̂(t)) based on the current stimuli
(optionally processed by a hidden layer), and learns by
adjusting the weights to minimize the difference between
this estimate and its value based on a one-time-step
look-ahead, γ V̂(t+1) + r(t), i.e., the TD error δ(t).

The adaptive critic (AC) uses sensory cues to estimate
the value of V(t). We will call this estimated value V̂(t)
to distinguish it from the actual value V(t). The
AC needs to learn which sensory cues are predictive of
reward, just as in conditioning. Obviously, the AC only
ever knows about a reward when it actually receives it.
Thus, the trick is to propagate the reward information
backwards in time to the point where it could have been
reliably predicted by a sensory cue. This is just what
TD does, by using at each point in time the prediction
for the next point in time (i.e., V (t +1) ) to adjust the
prediction for the current point in time.
In other words, the AC “looks ahead” one time step
and updates its estimate to predict this look-ahead value.
Thus, it will initially learn to predict a reward just im-
mediately (one time step) before the reward happens,
and then, the next time around, it will be able to pre-
dict this prediction of the reward, and then, predict that
prediction, and so on backward in time from the reward.
Note that this propagation takes place over repeated
trials , and not within one trial (which is thus unlike the
error backpropagation procedure, which propagates er-
ror all the way through the network at the point that
the error was received). Practically, this means that we
do not require that the AC magically remember all the
information leading up to the point of reward, which
would otherwise be required to propagate the reward
information back in time all at once at the point of reward.
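To make this concrete, here is a minimal tabular sketch of that
backward creep. The setup is assumed purely for illustration: a single
trial is a fixed sequence of five time steps with a reward only at the
last step, and both γ and the learning rate are set to 1. Each pass
through the trial moves the prediction back exactly one step, so the
reward information spreads backward over repeated trials rather than
within one trial.

    # Sketch: the reward prediction creeps backward one step per trial.
    # Assumed setup: one trial = 5 time steps, reward only at the last
    # step; gamma and the learning rate are 1 purely for illustration.
    gamma, lrate = 1.0, 1.0
    n_steps = 5
    r = [0.0, 0.0, 0.0, 0.0, 1.0]     # reward arrives only at the final step
    V_hat = [0.0] * n_steps           # estimated value at each time step

    for trial in range(n_steps):
        for t in range(n_steps):
            # look-ahead target: r(t) plus the next step's current estimate
            V_next = V_hat[t + 1] if t + 1 < n_steps else 0.0
            delta = r[t] + gamma * V_next - V_hat[t]   # target minus estimate
            V_hat[t] += lrate * delta
        print(trial, V_hat)
    # trial 0 prints [0.0, 0.0, 0.0, 0.0, 1.0]; trial 1 prints
    # [0.0, 0.0, 0.0, 1.0, 1.0]; the prediction moves back one step per trial.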
To see how the temporal backpropagation happens
mathematically, we start by noting that equation 6.3 can
be written recursively in terms of V(t+1) as follows:

    V(t) = ⟨r(t) + γ V(t+1)⟩                        (6.4)

Given that this same relationship should hold for our
estimates V̂(t), we can now define a TD error that will
tell us how to update our current estimate in terms of
this look-ahead estimate at the next point in time. We
do this by just computing the difference (represented as
δ(t)) between the value that this estimate should be
according to equation 6.4 and our current estimate V̂(t):

    δ(t) = (r(t) + γ V̂(t+1)) − V̂(t)                 (6.5)
Note that we got rid of the expected value notation ⟨…⟩,
because we will compute this on each trial and incre-
ment the changes slowly over time to compute the ex-
pected value (much as we did with our Hebbian learning
rule). Note too, that this equation is based on the notion
that the predictions of future reward have to be consis-
tent over time (i.e., the prediction at time t has to agree
with that at time t +1 ), and that the error signal is a
measure of the residual inconsistency. Thus, TD learn-
ing is able to span temporal delays by building a bridge
of consistency in its predictions across time.
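To make the error signal concrete, here is a small numeric check of
equation 6.5 under assumed values (γ = 0.9; the particular numbers are
only for illustration): when the current estimate is too low relative
to the look-ahead target, δ(t) comes out positive, signaling that the
estimate should be raised.

    # Numeric check of the TD error (equation 6.5) with illustrative values.
    gamma = 0.9
    r_t, V_hat_t, V_hat_next = 0.0, 0.5, 0.8

    delta_t = (r_t + gamma * V_hat_next) - V_hat_t
    print(round(delta_t, 2))   # 0.22: the estimate is below the look-ahead target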
The last thing we have to specify for the AC is exactly
how V̂(t) is computed directly from external stimuli, and
then how this TD error signal can be used to adapt these
estimates. As you might expect, we will do this
computation using a neural network that computes V̂(t)
based on weights from representations of the stimuli
(and potentially processed by one or more hidden layers)
(figure 6.19). The TD error can then be used to train the
weights of the network that computes V̂(t) by treating it
the same as the error signal one would get from
sum-squared error or cross-entropy error (i.e., as in
equation 5.23 from section 5.6). Thus, to the extent
that there is some stimulus in the environment that
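As a concrete illustration of this training scheme, here is a minimal
sketch of an adaptive critic with no hidden layer, where V̂(t) is a
weighted sum of the current stimulus vector and the TD error δ(t) of
equation 6.5 plays the role of the delta-rule error signal for updating
the weights. The trial structure, γ, and learning rate are illustrative
assumptions, not values from the text.

    import numpy as np

    # Sketch of an adaptive critic with no hidden layer: V_hat(t) = w . s(t),
    # trained by using the TD error delta(t) (equation 6.5) as the error
    # signal, much like the delta rule.  The trial structure, gamma, and
    # lrate are illustrative assumptions.
    gamma, lrate = 0.9, 0.1

    # A trial: a distinct (one-hot) stimulus is present at each of 3 time
    # steps, and a reward of 1 arrives at the final step.
    stimuli = np.eye(3)                      # s(0), s(1), s(2)
    rewards = np.array([0.0, 0.0, 1.0])

    w = np.zeros(3)                          # weights from stimuli to V_hat

    for trial in range(200):
        for t in range(3):
            s_now = stimuli[t]
            V_now = w @ s_now                                   # V_hat(t)
            V_next = w @ stimuli[t + 1] if t + 1 < 3 else 0.0   # V_hat(t+1)
            delta = rewards[t] + gamma * V_next - V_now         # TD error (eq. 6.5)
            w += lrate * delta * s_now                          # delta-rule-style update

    print(np.round(w, 2))   # approx [0.81, 0.9, 1.0], i.e., [gamma**2, gamma, 1]

With the reward arriving only at the end of the trial, the learned
weights for the three successive stimuli settle near γ², γ, and 1,
which is just the bridge of consistency in predictions described above.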