is actually clamped by external reward, and intermedi-
ate plus phases prior to that in time are all ultimately
driven by that later plus phase by the requirement of
consistency in reward prediction. Thus, this settling in
the plus phase is a kind of “estimated plus phase” in lieu
of actually having an external plus phase value.
The minus phase AC unit value is always clamped to
the undiscounted value of reward that we estimated on
the plus phase of the previous time step. In other words,
the minus phase of the next time step is equal to the
plus phase of the previous time step. To account for the
fact that we assumed that the plus phase computed the
discounted estimated reward, we have to multiply that
plus phase value by 1/γ to undiscount it when we copy it over to the next time step as our estimate of V(t).

In practice, we typically use a γ value of 1, which simplifies the implementational picture somewhat by allowing the next minus phase state to be a direct copy of the prior plus phase state. Thus, one could imagine that this just corresponds to a single maintained activation value across the previous plus phase and the next minus phase. By also using absorbing rewards with γ = 1, we avoid the problem of accounting for an infinity of future states: our horizon extends only to the point at which we receive our next reward. We will discuss in chapter 11 how the effective choosing of greater delayed rewards over lesser immediate rewards can be achieved by simultaneously performing TD-like learning at multiple time scales (Sutton, 1995).

Figure 6.21 also makes it clear that the weight adjustment computation must use the sending activations at time t but the TD error (plus-minus phase difference) at time t+1. This is because while the AC unit is computing V(t+1) based on stimulus activities at time t, the TD error for updating V(t+1) is not actually computed until the next time step. It is important to note that this skewing of time is not an artifact of the phase-based implementation of TD, but rather an intrinsic aspect of the algorithm, which requires the use of future states (i.e., V(t+1)) to adapt prior estimates. It is this spanning of contingencies across time steps that allows the network to propagate information from the future back in time.
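To make this phase-based bookkeeping concrete, here is a minimal sketch of one AC-unit time step. It is written in plain Python with hypothetical names (td_phase_step, ac_plus_prev, x_prev, and so on) rather than anything from the actual simulator, and it simply assumes a linear value estimate over the stimulus units. The point is the copy-and-undiscount step for the minus phase and the use of the previous step's sending activations in the weight update.

    import numpy as np

    def td_phase_step(w, x_t, r_t, ac_plus_prev, x_prev, gamma=1.0, lrate=0.1):
        """One hypothetical phase-based TD step for a single AC unit.
        w            -- weights from the stimulus units to the AC unit
        x_t          -- stimulus activations at the current time step
        r_t          -- external reward (None if no reward is delivered)
        ac_plus_prev -- the AC unit's plus-phase value from the previous step
        x_prev       -- stimulus activations from the previous step (needed
                        for the time-skewed weight update)
        """
        # Minus phase: copy the previous plus phase forward, undiscounting it
        # by 1/gamma (with gamma = 1 this is just a maintained activation).
        ac_minus = ac_plus_prev / gamma

        # Plus phase: clamped by external reward if present; otherwise the
        # unit settles on its own discounted estimate of future reward
        # (the "estimated plus phase").
        if r_t is not None:
            ac_plus = r_t
        else:
            ac_plus = gamma * float(np.dot(w, x_t))

        # The TD error is the plus-minus phase difference computed on this step...
        delta = ac_plus - ac_minus

        # ...but the weight change pairs it with the sending activations from
        # the previous step, because those produced the estimate being corrected.
        w = w + lrate * delta * x_prev
        return w, ac_plus

On each call, the value returned as ac_plus is passed back in as ac_plus_prev on the next step, and x_t becomes x_prev, which is how the prediction made at time t gets corrected by the error that only becomes available at time t+1.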
The implementation of TD that we have explained here can be made somewhat more biologically plausible by combining it with the context representations used in the SRN model. As we will explain further in chapters 9 and 11, we can use the TD error signal to control when the context representations get updated, and the use of these context representations simplifies the issues of time skew that we just discussed. Thus, the version of the algorithm that we actually think the brain is implementing is somewhat more biologically plausible than it might otherwise seem.
6.7.3 Exploration of TD: Classical Conditioning
To explore the TD learning rule (using the phase-based
implementation just described), we use the simple clas-
sical conditioning task discussed above. Thus, the net-
work will learn that a stimulus (tone) reliably predicts
the reward (and then that another stimulus reliably pre-
dicts that tone). First, we need to justify the use of the
TD algorithm in this context, and motivate the nature of
the stimulus representations used in the network.
You might recall that we said that the delta rule
(aka the Rescorla-Wagner rule) provides a good model
of classical conditioning, and thus wonder why TD is
needed. It all has to do with the issue of timing. If
one ignores the timing of the stimulus relative to the re-
sponse, then in fact the TD rule becomes equivalent to
the delta rule when everything happens at one time step
(it just trains V(t) to match r(t)). However, animals are
sensitive to the timing relationship, and, more impor-
tantly for our purposes, modeling this timing provides a
particularly clear and simple demonstration of the basic
properties of TD learning.
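To spell the equivalence out, the TD error is the difference between the new and old estimates of future reward:

    δ(t) = [r(t) + γ V(t+1)] − V(t)

If everything happens within a single time step, there is no later state to predict, so V(t+1) = 0 and the error reduces to:

    δ(t) = r(t) − V(t)

which is just the delta rule's difference between actual and predicted reward. TD only departs from the Rescorla-Wagner model when predictions at successive time steps are chained together, which is exactly where the timing of the stimulus matters.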
The only problem is that this simple demonstration
involves a somewhat unrealistic representation of tim-
ing. Basically, the stimulus representation has a dis-
tinct unit for each stimulus for each point in time, so
that there is something unique for the AC unit's weights
to learn from. This representation is the complete se-
rial compound (CSC) proposed by Sutton and Barto
(1990), and we will see exactly how it works when we
look at the model. As we have noted, we will explore
a more plausible alternative in chapter 9 where the TD
error signal controls the updating of a context represen-
tation that maintains the stimulus over time.
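As a rough picture of what the CSC involves, the following Python fragment gives the tone a distinct input unit for each time step and runs a simple TD update over repeated tone-then-reward trials. This is an illustrative sketch with made-up sizes, onset times, and names, not the representation or code used in the actual project; in the phase-based network, the two value estimates compared below appear as the AC unit's minus and plus phase values, as described above.

    import numpy as np

    # Toy complete serial compound (CSC): one input unit per stimulus per time
    # step, so the AC unit's weights have something unique to learn from at
    # each point in the trial. All sizes and times here are made up.
    n_steps   = 20
    tone_on   = 5      # tone comes on at step 5 and stays on until the reward
    reward_at = 15     # reward is delivered at step 15

    def csc_input(t):
        """CSC activation vector at time t: a distinct unit for (tone, t)."""
        x = np.zeros(n_steps)
        if tone_on <= t < reward_at:
            x[t] = 1.0
        return x

    w = np.zeros(n_steps)   # AC unit weights, one per CSC unit
    lrate, gamma = 0.5, 1.0

    for trial in range(50):
        for t in range(n_steps - 1):
            x_t, x_next = csc_input(t), csc_input(t + 1)
            r_next = 1.0 if t + 1 == reward_at else 0.0
            v_t = float(w @ x_t)
            # absorbing reward: no predictions are made beyond the reward
            v_next = 0.0 if r_next > 0 else float(w @ x_next)
            delta = r_next + gamma * v_next - v_t    # TD error
            w += lrate * delta * x_t                 # only the active unit learns

    # After training, w is near 1 for each (tone, t) unit from tone onset up to
    # the reward, so the positive TD error now occurs at tone onset rather than
    # at the time of the (now fully predicted) reward.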
Open project rl_cond.proj.gz in chapter_6.