To the extent that some earlier stimulus can be reliably used to produce the correct V(t) value, the network will learn this as a function of the TD error. Where no such reliable stimulus predictor exists, the reward will remain unpredictable.

Now, let's see how TD learning works in practice by revisiting the same kind of simple conditioning experiment shown in figure 6.17, where a reward was predicted by a tone that precedes it by some fixed time interval. We can simulate this by having a "tone" stimulus starting at t = 2, followed by a reward at t = 16. Figure 6.20a shows what happens to the TD error δ(t) as a function of time on the first trial of learning.
There is a large TD error when the reward occurs at t = 16 because it is completely unpredicted. Thus, if you refer to equation 6.5, V(16) = 0, V(17) = 0, and r(16) = 1. This means that δ(16) = 1 (note we have set γ = 1 in this case), and thus that the weights that produce V(16) will increase so that this value will be larger next time. This weight increase has two effects. First, it will reduce the value of δ(16) next time around, because this reward will be better predicted. Second, it will start to propagate the reward backward one time step. Thus, on the next trial, δ(15) will be .2 because the equation at time 15 includes V(16). Figure 6.20b shows how this propagation occurs all the way back to t = 2. Finally, figure 6.20c shows the "final" state where the network has learned as much as it can. It cannot propagate any further back because there is no predictive stimulus earlier in time. Thus, the network is always "surprised" when this tone occurs, but not surprised when the reward follows it.
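To make this concrete, the following is a minimal sketch in Python of the value-learning (critic) half of this simulation, not the book's actual simulator. The complete-serial-compound stimulus representation, the variable names, and the learning rate of .2 are our assumptions for illustration; the update itself is just equation 6.5 with γ = 1, applied at each step of a 20-step trial with tone onset at t = 2 and reward at t = 16.

    import numpy as np

    n_steps, tone_t, reward_t = 20, 2, 16
    gamma = 1.0   # discount factor (gamma = 1, as in the text)
    lrate = 0.2   # assumed learning rate; yields delta(15) = .2 on the second trial

    def stimulus(t):
        # Complete-serial-compound input: unit k is active when the tone has
        # been on for k steps.  Before tone onset there is no input at all,
        # so V(t) is forced to 0 there and the tone itself stays unpredicted.
        x = np.zeros(n_steps)
        if t >= tone_t:
            x[t - tone_t] = 1.0
        return x

    w = np.zeros(n_steps)                # weights producing V(t) = w . x(t)

    for trial in range(500):
        deltas = np.zeros(n_steps)
        for t in range(n_steps - 1):
            r = 1.0 if t == reward_t else 0.0
            V_t, V_next = w @ stimulus(t), w @ stimulus(t + 1)
            delta = (r + gamma * V_next) - V_t    # equation 6.5
            w += lrate * delta * stimulus(t)      # strengthen weights producing V(t)
            deltas[t] = delta
        if trial in (0, 1, 499):
            print(f"trial {trial}: delta =", np.round(deltas, 2))

Running this reproduces the qualitative pattern in figure 6.20: on the first trial the only error is δ(16) = 1; on the second trial δ(15) = .2 while δ(16) shrinks to .8; and in the final state the error has marched all the way back, leaving a single spike at the step where the tone's prediction first becomes available and zero error at the reward itself, because nothing earlier in time can predict the tone.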
The general properties of this TD model of conditioning provide a nice fit to the neural data shown in figure 6.17, suggesting that the VTA is computing something like TD error. One important discrepancy, however, is that evidence for a continuous transition like that shown in figure 6.20b is lacking. This has some important implications, and can still be explained from within the basic TD framework, as we will discuss further when we later explore the simulation that produced these figures.
We also need to specify something about the other half of TD learning, the actor. The TD error signal can easily be used to train an actor network to produce actions that increase the total expected reward. To see this, let's imagine that the actor network has produced a given action a at time t. If this action either leads directly to a reward, or leads to a previously unpredicted increase in estimated future rewards, then δ(t) will be positive. Thus, if δ(t) is used to adjust the weights in the actor network in a similar way as in the AC network, then this will increase the likelihood that this action will be produced again under similar circumstances. If there were another possible action at that time step that led to an even greater reward, it would produce larger weight changes, and would thus come to dominate over the action associated with the weaker reward.
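As a minimal illustration of this point (ours, not the book's actor network), the sketch below pairs a simple critic with a two-action softmax actor on a one-step choice task. The reward values of .5 and 1 and names such as actor_w are assumptions for illustration; the key point is that the same δ that trains the critic also adjusts the weight of whichever action was just taken.

    import numpy as np

    rng = np.random.default_rng(0)
    rewards = np.array([0.5, 1.0])   # hypothetical payoffs: action 1 is better

    V = 0.0                    # critic's estimate of expected reward
    actor_w = np.zeros(2)      # actor's weight (preference) for each action
    lrate_v, lrate_a = 0.1, 0.5

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    for trial in range(500):
        a = rng.choice(2, p=softmax(actor_w))   # actor produces an action
        r = rewards[a]
        delta = r - V                   # TD error; no successor state in a one-step task
        V += lrate_v * delta            # critic update
        actor_w[a] += lrate_a * delta   # positive delta makes this action more likely

    print("P(action):", np.round(softmax(actor_w), 3))   # action 1 comes to dominate

Once the critic's estimate rises above the smaller payoff, taking the weaker action yields a negative δ that actively weakens its weight, so the action leading to the greater reward does not just strengthen faster but eventually wins outright.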
Figure 6.20: Three stages of learning in a simple conditioning experiment, showing TD error δ(t) as a function of time. a) Shows the initial trial, where the reward is not predicted. b) Shows the transition as the estimate of reward gets predicted earlier and earlier. c) Shows the final trial, where the tone onset completely predicts the reward.