Let's start by examining the network (figure 6.22). The input layer contains three rows of 20 units each. This is the CSC (complete serial compound), where the rows each represent a different stimulus, and the columns represent points in time. Then, there is a single AC unit that receives weights from all of these input units.

[Figure 6.22: The reinforcement learning network with CSC input representation of stimuli by time. The input layer has one row per stimulus (tone, light, odor) and one column per time step (0-19), with the AC unit above.]

Click on r.wt and then on the AC unit to see that the weights start out initialized to zero. Then, click back to act.

Let's see how the CSC works in action.

Do Step on the control panel.

Nothing should happen, because no stimulus or reward was present at t = 0. However, you can monitor the time steps from the tick: value displayed at the bottom of the network view (a tick is one time step in a sequence of events in the simulator).

Continue to Step until you see an activation in the input layer (this should take 10 more steps).

This input activation represents the fact that the first stimulus (i.e., the "tone" stimulus in row 1) came on at t = 10.

Continue to Step some more.

You will see that this stimulus remains active for 6 more time steps (through t = 15). Then, notice that just as the stimulus disappears, the AC unit becomes activated (at t = 16). This activation reflects the fact that a reward was received, and the plus-phase activation of this unit was clamped to the reward value (.95 here).

Now, let's see what this reward did to the weights.

Click on r.wt and then on the AC unit.

Notice that the weights have increased for the unit representing the stimulus in its last position just before it went off (at t = 15). Thus, the reward caused the AC unit to go from 0 in the minus phase to .95 in the plus phase, and this δ(t = 16) updated the weights based on the sending activations at the previous time step (t = 15), just as discussed in the previous section.

We can monitor the δ(t) values (i.e., the plus-minus phase difference) for the AC unit as a function of time step using a graph log.

Do View and select GRAPH_LOG. Then, Step once and the graph log should update.

This log clearly shows the blip at t = 16, which goes back down to 0 as you continue to Step. This is because we are maintaining the reward active until the end of the entire sequence (at t = 20), so there is no change in the AC unit, and therefore δ = 0.

Now, switch back to act and Step again until you get to t = 15 again on the second pass through.

Recall that the weight for this unit has been increased, but there is no activation of the AC unit as one might have expected. This is due to the thresholded nature of the units.

To see this, click on net.

You will see that the unit did receive some positive net input.

Continue to Step until you get to trial 3 (also shown at the bottom of the network as trial:3), time step t = 15.

Due to accumulating weight changes from the previous 3 trials, the weight into the AC unit is now strong enough to drive it over threshold. If you look at the graph log, you will see that there is now a positive δ(t) at time step 15.

Thus, the network is now anticipating the reward one time step earlier. This anticipation has two effects. First, click on r.wt.

You should notice that the weight from the previous time step (now t = 14) is increased as a result of this positive δ(t = 15). These weight changes will eventually lead to the reward being anticipated earlier and earlier.

Now, do one more Step, and observe the graph log.

The second effect is that this anticipation reduced the magnitude of the δ(t = 16).
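The dynamic traced in this exercise can be sketched as a minimal simulation: a single "tone" row of the CSC drives a thresholded AC unit, δ(t) is the difference between the current activation (or the clamped reward) and the expectation carried from the previous tick, and each δ(t) trains the weights from the senders active at t − 1. The learning rate, threshold, and the simplification of applying the reward only at t = 16 are assumptions for illustration, not the simulator's actual parameters.

```python
import numpy as np

N_STEPS  = 20    # ticks per trial (columns of the CSC)
TONE_ON  = 10    # the tone comes on at t = 10 ...
TONE_OFF = 16    # ... and stays on through t = 15
REWARD_T = 16    # reward arrives just as the tone goes off
REWARD   = 0.95  # plus-phase clamp value for the AC unit
EPS      = 0.2   # learning rate (assumed for illustration)
THETA    = 0.5   # AC activation threshold (assumed)

# CSC input for the tone: x[t] is a vector over tone-by-time units,
# with unit t active only while the tone is on at time t.
x = np.zeros((N_STEPS, N_STEPS))
for t in range(TONE_ON, TONE_OFF):
    x[t, t] = 1.0

w = np.zeros(N_STEPS)  # weights from the CSC units into the AC unit

def ac(t):
    """Thresholded AC activation given the CSC input at time t."""
    net = w @ x[t]
    return net if net >= THETA else 0.0

for trial in range(4):
    deltas = np.zeros(N_STEPS)
    for t in range(1, N_STEPS):
        minus = ac(t - 1)  # expectation carried from the previous tick
        plus = REWARD if t == REWARD_T else ac(t)  # clamp to reward
        deltas[t] = plus - minus
        # delta(t) trains the weights from the senders active at t - 1
        w += EPS * deltas[t] * x[t - 1]
    print(f"trial {trial}: delta(15) = {deltas[15]:.2f}, "
          f"delta(16) = {deltas[16]:.2f}")
```

With these assumed parameters the sketch reproduces the walkthrough: for the first three trials δ(15) = 0 and δ(16) = .95 because the growing weight stays below threshold, and on trial 3 the accumulated weight crosses threshold, a positive δ appears at t = 15 (training the t = 14 weight), and the δ at t = 16 shrinks accordingly.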