the squares of the sine and cosine inner products of the logons of the same scale and rotational
orientation in each jet (which reduces the total dimensionality of V to half that of the total
number of logons). (Note: Other mathematical transformations are then applied to each of these sums to make their values insensitive to lighting gradient slopes and other lighting-dependent effects; these details go beyond the scope of this sketch and are left out. See Hecht-Nielsen and Zhou (1995) for examples of such transformations.)
Each component of V essentially represents an estimate of the localized spatial frequency
content of the camera image (at the position of the associated gridpoint) at the spatial frequency
of the involved logon pair, in the direction of oscillation of that pair. It is on the basis of local spatial
frequency structure (which V accurately defines) that fixation points are chosen by the gaze
controller.
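To make this concrete, the following is a minimal Python sketch (not the authors' code) of how one such V component could be computed, assuming Gabor-type logons applied to a local image patch; the function names, the patch-based formulation, and all parameter choices are illustrative assumptions.

import numpy as np

def gabor_pair(size, wavelength, theta, sigma):
    # Cosine- and sine-phase logons of one scale and oscillation direction.
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)   # coordinate along oscillation
    envelope = np.exp(-(x**2 + y**2) / (2.0 * sigma**2))
    return (envelope * np.cos(2.0 * np.pi * xr / wavelength),
            envelope * np.sin(2.0 * np.pi * xr / wavelength))

def jet_component(patch, wavelength, theta, sigma):
    # One V component: squared cosine response plus squared sine response,
    # a phase-invariant estimate of local spatial frequency content.
    cos_logon, sin_logon = gabor_pair(patch.shape[0], wavelength, theta, sigma)
    c = float(np.sum(patch * cos_logon))
    s = float(np.sum(patch * sin_logon))
    return c**2 + s**2

Summing the squared sine and cosine responses discards phase, which is why each logon pair contributes a single component and the dimensionality of V is half the total number of logons.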
The job of the gaze controller is to learn to mimic the performance of a skilled human observer
performing the visual task that is to be mechanized. The manner in which the gaze controller works
and the method used to train it are now described.
The gaze controller (a perceptron; Hecht-Nielsen, 2004) has 224 inputs and two outputs. The
inputs represent the components of V corresponding to the jet at a particular image gridpoint (the
current position of regard of the gaze controller). The outputs of the gaze controller are estimates of
the a posteriori probability of this gridpoint being chosen by the skilled human as a fixation point
along with the a posteriori probability of this gridpoint not being chosen by the skilled human as a
fixation point. Training of the gaze controller is discussed below; but, to set the stage, the manner in
which the gaze controller is used operationally is described first.
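The text fixes only the input and output dimensions and the probabilistic reading of the outputs; the following hedged Python sketch fills in the rest with common choices (a single tanh hidden layer of assumed width 64, softmax outputs) purely for illustration, and should not be read as Hecht-Nielsen's implementation.

import numpy as np

rng = np.random.default_rng(0)

class GazePerceptron:
    # 224 inputs (one jet's V components), 2 outputs (fixate / do not fixate).
    def __init__(self, n_in=224, n_hidden=64, n_out=2):
        self.W1 = rng.normal(0.0, 0.1, (n_hidden, n_in))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(0.0, 0.1, (n_out, n_hidden))
        self.b2 = np.zeros(n_out)

    def forward(self, v):
        self.v = np.asarray(v, dtype=float)
        self.h = np.tanh(self.W1 @ self.v + self.b1)   # hidden layer
        z = self.W2 @ self.h + self.b2
        e = np.exp(z - z.max())
        self.p = e / e.sum()                           # a posteriori estimates
        return self.p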
Once trained, the gaze controller is used to select a fixation point in a newly acquired video
frame by evaluating each of the V component sets from each of the 263,169 gridpoints of the frame.
If the first output of the controller is above a fixed threshold (say, 0.8), and the second output is
below a fixed threshold (say, 0.2), then that gridpoint is selected as a candidate fixation point. If
there are no candidate fixation points for the frame, then that frame is skipped. If there are one or
more, the one with the highest first output value is selected as the fixation point. The gaze controller
also has provisions for creating multiple successive "looks" at the same object during visual
training to facilitate learning of pose insensitivity (see below). In operational use, when a visual
object of interest has been fixated on and described, the gaze controller tracks that object's fixation
points and prevents return to it until the other visual objects of interest in the scene have been
described.
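A sketch of this operational selection rule follows; the thresholds (0.8 and 0.2) and the gridpoint count (263,169) come from the text, while the data layout and function name are assumptions.

def select_fixation(jets, controller, t_hi=0.8, t_lo=0.2):
    # jets: array of shape (263169, 224), one V component set per gridpoint.
    # Returns the winning gridpoint index, or None if the frame is skipped.
    best_idx, best_p = None, 0.0
    for i, v in enumerate(jets):
        p_fix, p_not = controller.forward(v)
        if p_fix > t_hi and p_not < t_lo:        # candidate fixation point
            if p_fix > best_p:                   # keep highest first output
                best_idx, best_p = i, p_fix
    return best_idx

Note that with a two-way softmax the second output is one minus the first, so the two-threshold test collapses to a single threshold; both tests are kept here only to mirror the description above.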
To train the gaze controller, each fixation point example has its pixel coordinates (supplied by the frequently recalibrated eye tracker) stored together with a reference frame: the frame, taken a fixed time increment before the beginning of the subject's saccade, that serves as the definitive "image input" the human used. Eventually, many thousands of such
fixation point and reference frame pairs are produced, randomly scrambled to remove possible
content correlations between them, and stored. The V vector for each reference frame is also
calculated and stored with it.
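The following Python sketch outlines this data-assembly step. The recording and saccade data structures and the helper names are assumptions; the fixed lead time, the shuffling, and the stored V vector follow the text.

import random

def build_training_set(recordings, lead_time, grab_frame, compute_V):
    examples = []
    for rec in recordings:
        for saccade in rec.saccades:
            # Reference frame: a fixed increment before the saccade began.
            frame = grab_frame(rec, saccade.start_time - lead_time)
            examples.append({
                "fixation_xy": saccade.target_xy,   # from the eye tracker
                "frame": frame,
                "V": compute_V(frame),              # stored with the frame
            })
    random.shuffle(examples)   # remove possible content correlations
    return examples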
The gaze controller perceptron is trained by marching through the fixation point and reference frame example pairs, in sequence, many times. At each training episode, the next pair in sequence is selected and the gridpoint nearest to the fixation point is
located. The jet components of the reference frame V vector for that gridpoint are then extracted
and provided to the perceptron, along with desired outputs 1 and 0, and one backpropagation
training episode using these specified inputs and outputs is carried out. Another gridpoint, distant
from the fixation point, is then selected and its jet V components are provided to the perceptron,
along with desired outputs 0 and 1, and a second perceptron training episode is carried out using
these inputs and outputs. The training process then moves on to the next fixation point and reference frame pair. Thus, this training procedure beneficially oversamples examples of the class of human-supplied fixation points (Hecht-Nielsen, 2004).
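A hedged sketch of this training loop, reusing the GazePerceptron from the earlier sketch: one backpropagation episode on the jet nearest the human fixation point (desired outputs 1 and 0) and one on a distant gridpoint (desired outputs 0 and 1) per stored example. The grid helper methods, learning rate, and pass count are assumptions.

import numpy as np

def backprop_step(net, v, targets, lr=0.01):
    # One backpropagation episode with cross-entropy loss against the
    # desired outputs, for the GazePerceptron sketched earlier.
    p = net.forward(v)
    dz2 = p - np.asarray(targets, dtype=float)     # softmax/cross-entropy grad
    dh = (net.W2.T @ dz2) * (1.0 - net.h ** 2)     # back through tanh layer
    net.W2 -= lr * np.outer(dz2, net.h)
    net.b2 -= lr * dz2
    net.W1 -= lr * np.outer(dh, net.v)
    net.b1 -= lr * dh

def train(controller, examples, grid, n_passes=10):
    for _ in range(n_passes):                      # march through many times
        for ex in examples:
            g = grid.nearest_gridpoint(ex["fixation_xy"])        # positive
            backprop_step(controller, ex["V"][g], (1.0, 0.0))
            g_far = grid.distant_gridpoint(ex["fixation_xy"])    # negative
            backprop_step(controller, ex["V"][g_far], (0.0, 1.0))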