This is the minus phase for the beginning of a sequence (one pass through the FSA grammar), which always starts with the letter B, and the context units zeroed. The network will produce some random expectation of which letters are coming next. Note that there is some noise in the unit activations; this helps the network pick one unit out of the two possible ones at random.
To monitor the network's performance over learning, we need an error statistic that converges to zero when the network has learned the task perfectly (which is not the case with the standard SSE, due to the randomness of the task). Thus, we have a new statistic that reports an error (of 1) if the output unit was not one of the two possible outputs (i.e., as shown in the Targets layer). This is labeled as sum_fsa_err in the log displays.
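In effect, this statistic asks only whether the letter the network chose was one of the legal next letters. A minimal sketch of such a check in Python (the function and variable names here are hypothetical, not the simulator's actual code):

    import numpy as np

    def fsa_err(output_acts, target_acts, thresh=0.5):
        """Return 1 if the most active output unit is not one of the
        units marked as legal in the Targets layer, else 0."""
        winner = int(np.argmax(output_acts))       # the letter the network chose
        legal = np.where(target_acts > thresh)[0]  # the two legal next letters
        return 0.0 if winner in legal else 1.0

    # Example: the Targets layer marks indices 2 and 3 (say T and P) as legal,
    # and the network most strongly activated index 2.
    targets = np.array([0., 0., 1., 1., 0., 0., 0.])
    outputs = np.array([.05, .10, .90, .20, .00, .10, .00])
    print(fsa_err(outputs, targets))   # -> 0.0 (not an error)

Summed over the events in an epoch, a score like this can reach zero even though the individual letter guesses remain random.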
Then, Step again to see the plus phase.
You should see that one of the two possible subsequent letters (T or P) is strongly activated; this unit indicates which letter actually came next in the sequence. Thus, the network only ever learns about one of the two possible subsequent letters on each trial (because they are chosen at random). It has to learn that a given node has two possible outputs by integrating experience over different trials, which is one of the things that makes this a somewhat challenging task to learn.
An interesting aspect of this task is that even when the network has done as well as it possibly could, it should still make roughly 50 percent "errors," because it ends up making a discrete guess as to which output will come next, which can only be right 50 percent of the time. This could cause problems for learning if it introduced a systematic error signal that constantly increased or decreased the bias weights. It does not, however, because a unit will be correctly active about as often as it will be incorrectly inactive, so the overall net error will be zero. Note that if we allowed both units to become active, this would not be the case: one of the units would always be incorrectly active, and this would introduce a net negative error and large negative bias weights (which would eventually shut down the activation of the output units).
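This cancellation can be verified with a little arithmetic or a quick simulation. The bias weight change for an output unit in error-driven (GeneRec/CHL-style) learning is roughly proportional to the difference between its plus-phase and minus-phase activations; the sketch below (hypothetical, with idealized 0/1 activations) averages that difference for one of the two letter units under the two output policies:

    import random

    def avg_bias_error(pick_one, n_trials=100_000):
        """Average plus-minus activation difference for the 'T' output unit
        at a node whose two legal continuations (T and P) each occur half
        the time.  pick_one=True: the network guesses one letter at random;
        pick_one=False: it activates both letters in the minus phase."""
        total = 0.0
        for _ in range(n_trials):
            actual_is_T = random.random() < 0.5                       # plus phase (target)
            minus_T = (random.random() < 0.5) if pick_one else True   # minus phase guess
            total += float(actual_is_T) - float(minus_T)
        return total / n_trials

    print(round(avg_bias_error(pick_one=True), 3))    # ~ 0.0  -> no net drift
    print(round(avg_bias_error(pick_one=False), 3))   # ~ -0.5 -> systematic negative error

With a single random guess, the incorrectly active and incorrectly inactive cases occur equally often and cancel; with both units active, the error is systematically negative, which is the pressure toward large negative bias weights described above.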
One possible objection to having the network pick one output at random instead of allowing both to be on is that it somehow means that the network will be "surprised" by the actual response when it differs from the guess (i.e., about 50% of the time). This is actually not the case, because the hidden layer representation remains essentially the same for both outputs (reflecting the node identity, more or less), and thus does not change when the actual output is presented in the plus phase. Thus, the "higher level" internal representation encompasses both possible outputs, while the lower-level output representation randomly chooses one of them. This situation will be important later as we consider how networks can efficiently represent multiple items (see chapter 7 for further discussion).
Now, continue to Step into the minus phase of the
next event in the sequence.
You should see now that the Context units are updated with a copy of the prior hidden unit activations.
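This copy operation is the defining feature of a simple recurrent network. A minimal sketch of the idea in Python (the layer sizes, random weights, and letter coding are made up for illustration and do not reproduce the actual network):

    import numpy as np

    rng = np.random.default_rng(0)
    n_hidden, n_letters = 12, 7                  # hypothetical layer sizes
    letters = "BTSXVPE"                          # the grammar's letter set

    W_in  = rng.normal(0, 0.1, (n_letters, n_hidden))   # input -> hidden
    W_ctx = rng.normal(0, 0.1, (n_hidden, n_hidden))    # context -> hidden
    W_out = rng.normal(0, 0.1, (n_hidden, n_letters))   # hidden -> output

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def one_hot(ch):
        v = np.zeros(n_letters)
        v[letters.index(ch)] = 1.0
        return v

    def srn_step(letter, context):
        """One minus-phase pass: the hidden layer sees the current letter
        plus the context, a copy of the previous hidden activations."""
        hidden = sigmoid(one_hot(letter) @ W_in + context @ W_ctx)
        output = sigmoid(hidden @ W_out)
        return hidden, output

    context = np.zeros(n_hidden)        # zeroed at the start of each sequence
    for ch in "BTXSE":                  # one string the grammar can produce
        hidden, output = srn_step(ch, context)
        context = hidden.copy()         # context units copy the hidden layer

In the simulation, the values copied into the Context layer are the hidden units' plus-phase activations from the previous event, which is what the next step lets you verify.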
To verify this, click on act_p.
This will show the plus phase activations from the
previous event.
Now you can continue to Step through the rest of the sequence. We can open up a training graph log by doing View, TRAIN_GRAPH_LOG, and then we can Run.
As the network runs, a special type of environment (called a ScriptEnv) dynamically creates 25 new sequences of events every other epoch (to speed the computation, because the script is relatively slow). Thus, instead of creating a whole bunch of training examples from the underlying FSA in advance, they are created on-line with a script that implements the Reber grammar FSA.
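Such a script amounts to a few lines of FSA traversal. Here is a minimal Python sketch of a Reber-grammar generator of this kind (the node numbering below is one common convention for the Reber grammar and may not match the labels used in the simulation):

    import random

    # Each node lists its two possible (letter, next node) transitions;
    # the grammar chooses between them at random with equal probability.
    REBER = {
        0: [("T", 1), ("P", 2)],
        1: [("S", 1), ("X", 3)],
        2: [("T", 2), ("V", 4)],
        3: [("X", 2), ("S", 5)],
        4: [("P", 3), ("V", 5)],
    }

    def reber_string():
        """Generate one sequence: it always starts with B and ends with E."""
        s, node = "B", 0
        while node != 5:                       # node 5 is the terminal node
            letter, node = random.choice(REBER[node])
            s += letter
        return s + "E"

    # e.g., a fresh batch of 25 training sequences, as the ScriptEnv
    # does every other epoch
    sequences = [reber_string() for _ in range(25)]
    print(sequences[:5])

Because each call makes its random choices independently, regenerating a small batch of sequences every other epoch keeps the training set statistically representative of the grammar without ever enumerating it in advance.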
Because it takes a while to train, you can opt to load
a fully trained network and its training log.
To do so, Stop the network at any time. To load the network, do Object/Load in the network window, and select fsa.trained.net.gz. To load the log file, go to the Epoch_0_GraphLog, and do LogFile/Load File and select fsa.epc.log.
The network should take anywhere between 13 and 80 epochs to learn the problem to the point where it gets zero errors in one epoch (this was the range for ten random networks we ran). The pre-trained network took 15 epochs to get to this first zero, but we trained it longer (54 epochs total) to get it to the point where it got 4 zeros in a row. This stamping in of the representations makes them more robust to the noise, but the network still makes occasional errors even with this extra training.