Of the two TD components, the adaptive critic has the
harder job, because the actor can just use the estimated
value function for different alternative actions to select
which action to perform next (e.g., “if I go right, how
much reward will I receive compared to going left?”).
Thus, we will focus on the adaptive critic.
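As a toy illustration of the actor's (easier) job, the minimal Python
sketch below simply compares the estimated values associated with two
candidate actions and picks the larger one; the action names and value
numbers are hypothetical, purely for illustration.

    # Toy illustration of the actor's job: pick the action whose predicted
    # outcome has the higher estimated value.  The values are hypothetical.
    V_hat = {"left": 0.2, "right": 0.7}   # estimated value after each action

    def choose_action(actions, V_hat):
        # "if I go right, how much reward will I receive compared to going left?"
        return max(actions, key=lambda a: V_hat[a])

    print(choose_action(["left", "right"], V_hat))   # -> right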
Figure 6.19: The adaptive critic computes estimated total
expected future reward (V̂(t)) based on the current stimuli
(optionally processed by a hidden layer), and learns by
adjusting the weights to minimize the difference between
this estimate and its value based on a one-time-step
look-ahead, γ V̂(t+1) + r(t), i.e., the TD error δ(t).

The adaptive critic (AC) uses sensory cues to estimate
the value of V(t). We will call this estimated value V̂(t)
to distinguish it from the actual value V(t). The
AC needs to learn which sensory cues are predictive of
reward, just as in conditioning. Obviously, the AC only
ever knows about a reward when it actually receives it.
Thus, the trick is to propagate the reward information
backwards in time to the point where it could have been
reliably predicted by a sensory cue. This is just what
TD does, by using at each point in time the prediction
for the next point in time (i.e., V (t +1) ) to adjust the
prediction for the current point in time.
In other words, the AC “looks ahead” one time step
and updates its estimate to predict this look-ahead value.
Thus, it will initially learn to predict a reward just im-
mediately (one time step) before the reward happens,
and then, the next time around, it will be able to pre-
dict this prediction of the reward, and then, predict that
prediction, and so on backward in time from the reward.
Note that this propagation takes place over repeated
trials , and not within one trial (which is thus unlike the
error backpropagation procedure, which propagates er-
ror all the way through the network at the point that
the error was received). Practically, this means that we
do not require that the AC magically remember all the
information leading up to the point of reward, which
would otherwise be required to propagate the reward
information back in time all at once at the point of reward.
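To make this concrete, here is a minimal tabular sketch of that
backward creep. The setup is assumed purely for illustration: a single
trial is a fixed sequence of five time steps with a reward only at the
last step, and both γ and the learning rate are set to 1. Each pass
through the trial moves the prediction back exactly one step, so the
reward information spreads backward over repeated trials rather than
within one trial.

    # Sketch: the reward prediction creeps backward one step per trial.
    # Assumed setup: one trial = 5 time steps, reward only at the last
    # step; gamma and the learning rate are 1 purely for illustration.
    gamma, lrate = 1.0, 1.0
    n_steps = 5
    r = [0.0, 0.0, 0.0, 0.0, 1.0]     # reward arrives only at the final step
    V_hat = [0.0] * n_steps           # estimated value at each time step

    for trial in range(n_steps):
        for t in range(n_steps):
            # look-ahead target: r(t) plus the next step's current estimate
            V_next = V_hat[t + 1] if t + 1 < n_steps else 0.0
            delta = r[t] + gamma * V_next - V_hat[t]   # target minus estimate
            V_hat[t] += lrate * delta
        print(trial, V_hat)
    # trial 0 prints [0.0, 0.0, 0.0, 0.0, 1.0]; trial 1 prints
    # [0.0, 0.0, 0.0, 1.0, 1.0]; the prediction moves back one step per trial.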
To see how the temporal backpropagation happens
mathematically, we start by noting that equation 6.3 can
be written recursively in terms of V(t+1) as follows:

    V(t) = ⟨r(t) + γ V(t+1)⟩                        (6.4)

Given that this same relationship should hold for our
estimates V̂(t), we can now define a TD error that will
tell us how to update our current estimate in terms of
this look-ahead estimate at the next point in time. We
do this by just computing the difference (represented as
δ(t)) between the value that this estimate should be
according to equation 6.4 and our current estimate V̂(t):

    δ(t) = (r(t) + γ V̂(t+1)) − V̂(t)                 (6.5)
Note that we got rid of the expected value notation ⟨…⟩,
because we will compute this on each trial and incre-
ment the changes slowly over time to compute the ex-
pected value (much as we did with our Hebbian learning
rule). Note too, that this equation is based on the notion
that the predictions of future reward have to be consis-
tent over time (i.e., the prediction at time t has to agree
with that at time t +1 ), and that the error signal is a
measure of the residual inconsistency. Thus, TD learn-
ing is able to span temporal delays by building a bridge
of consistency in its predictions across time.
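To make the error signal concrete, here is a small numeric check of
equation 6.5 under assumed values (γ = 0.9; the particular numbers are
only for illustration): when the current estimate is too low relative
to the look-ahead target, δ(t) comes out positive, signaling that the
estimate should be raised.

    # Numeric check of the TD error (equation 6.5) with illustrative values.
    gamma = 0.9
    r_t, V_hat_t, V_hat_next = 0.0, 0.5, 0.8

    delta_t = (r_t + gamma * V_hat_next) - V_hat_t
    print(round(delta_t, 2))   # 0.22: the estimate is below the look-ahead target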
The last thing we have to specify for the AC is exactly
how V̂(t) is computed directly from external stimuli, and
then how this TD error signal can be used to adapt these
estimates. As you might expect, we will do this
computation using a neural network that computes V̂(t)
based on weights from representations of the stimuli
(and potentially processed by one or more hidden layers)
(figure 6.19). The TD error can then be used to train the
weights of the network that computes V̂(t) by treating it
the same as the error signal one would get from
sum-squared error or cross-entropy error (i.e., as in
equation 5.23 from section 5.6). Thus, to the extent
that there is some stimulus in the environment that
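As a concrete illustration of this training scheme, here is a minimal
sketch of an adaptive critic with no hidden layer, where V̂(t) is a
weighted sum of the current stimulus vector and the TD error δ(t) of
equation 6.5 plays the role of the delta-rule error signal for updating
the weights. The trial structure, γ, and learning rate are illustrative
assumptions, not values from the text.

    import numpy as np

    # Sketch of an adaptive critic with no hidden layer: V_hat(t) = w . s(t),
    # trained by using the TD error delta(t) (equation 6.5) as the error
    # signal, much like the delta rule.  The trial structure, gamma, and
    # lrate are illustrative assumptions.
    gamma, lrate = 0.9, 0.1

    # A trial: a distinct (one-hot) stimulus is present at each of 3 time
    # steps, and a reward of 1 arrives at the final step.
    stimuli = np.eye(3)                      # s(0), s(1), s(2)
    rewards = np.array([0.0, 0.0, 1.0])

    w = np.zeros(3)                          # weights from stimuli to V_hat

    for trial in range(200):
        for t in range(3):
            s_now = stimuli[t]
            V_now = w @ s_now                                   # V_hat(t)
            V_next = w @ stimuli[t + 1] if t + 1 < 3 else 0.0   # V_hat(t+1)
            delta = rewards[t] + gamma * V_next - V_now         # TD error (eq. 6.5)
            w += lrate * delta * s_now                          # delta-rule-style update

    print(np.round(w, 2))   # approx [0.81, 0.9, 1.0], i.e., [gamma**2, gamma, 1]

With the reward arriving only at the end of the trial, the learned
weights for the three successive stimuli settle near γ², γ, and 1,
which is just the bridge of consistency in predictions described above.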