19.6.1 Is scene understanding purely attentional?
Psychophysical experiments pioneered by Biederman and colleagues [3] have demonstrated that we can derive a coarse understanding of a visual scene from a single presentation so brief (80 ms or less) that it precludes any attentional scanning or eye movement. A particularly striking example of such experiments consists of presenting to an observer a rapid succession of unrelated photographs of natural scenes at a high frame rate (over 10 scenes/s). After presentation of the stimuli for several tens of seconds, observers are asked whether a particular scene, for example an outdoor market scene, was present among the several hundred photographs shown. Although the observers are not informed of the question in advance, they are able to provide a correct answer with an overall performance well above chance (Biederman, personal communication). Furthermore, observers are able to recall a number of coarse details about the scene of interest, such as whether it contained humans, or whether it was highly colorful or rather dull.
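As a rough back-of-the-envelope check of the numbers quoted above, the short Python sketch below works out the per-photograph exposure and the total photograph count of such a rapid serial presentation stream; the 10 scenes/s rate and 30 s run length are illustrative assumptions consistent with the text, not parameters from the original experiments.

```python
# Rough timing arithmetic for a rapid serial presentation stream like the
# one described above. The specific values are illustrative assumptions.

PRESENTATION_RATE_HZ = 10   # "over 10 scenes/s" -> use 10 as a lower bound
RUN_DURATION_S = 30         # "several tens of seconds" -> assume 30 s

frame_duration_ms = 1000.0 / PRESENTATION_RATE_HZ
n_photographs = int(PRESENTATION_RATE_HZ * RUN_DURATION_S)

print(f"Each photograph is visible for {frame_duration_ms:.0f} ms, "
      f"close to the ~80 ms single-presentation regime cited above.")
print(f"A {RUN_DURATION_S} s run shows {n_photographs} photographs, "
      f"i.e., 'several hundred' as reported.")
```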
These and many related experiments clearly demonstrate that scene understanding does not rely exclusively on attentional analysis. Rather, a very fast visual subsystem that operates in parallel with attention allows us to rapidly derive the gist and coarse layout of a novel visual scene. This rapid subsystem is certainly one of the key components by which attention may be guided top-down towards specific visual locations.
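Computationally, such a gist subsystem is often modeled as a set of global image statistics pooled over a coarse spatial grid, computed in a single parallel pass with no attentional scanning. The NumPy sketch below is a minimal, hypothetical descriptor in that spirit; the grid size and the choice of luminance and gradient-energy statistics are assumptions made for illustration, not the features of any model cited here.

```python
import numpy as np

def gist_vector(image: np.ndarray, grid: int = 4) -> np.ndarray:
    """Very coarse 'gist' descriptor: luminance and gradient-energy
    statistics pooled over a grid x grid spatial layout.

    A simplified illustration of gist-style global features, not the
    descriptor of any particular published model.
    """
    img = image.astype(float)
    if img.ndim == 3:                 # collapse color to luminance
        img = img.mean(axis=2)
    gy, gx = np.gradient(img)         # crude, orientation-free edge energy
    energy = np.hypot(gx, gy)
    h, w = img.shape
    feats = []
    for i in range(grid):
        for j in range(grid):
            cell = (slice(i * h // grid, (i + 1) * h // grid),
                    slice(j * w // grid, (j + 1) * w // grid))
            feats += [img[cell].mean(), img[cell].std(),
                      energy[cell].mean()]
    return np.asarray(feats)          # one compact vector per scene

# A classifier over such vectors can predict coarse scene properties
# (e.g., indoor/outdoor) from a single parallel pass over the image.
rng = np.random.default_rng(0)
print(gist_vector(rng.random((128, 192))).shape)   # (4*4*3,) = (48,)
```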
19.6.2 Cooperation between where and what
Several computer vision models have been proposed for extended object and scene analysis that rely on cooperation between an attentional (“where”) subsystem and a localized recognition (“what”) subsystem.
A very interesting instance was recently provided by Schill et al. [46]. Their model aims to perform scene (or object) recognition, using attention (or eye movements) to focus on those parts of the scene being analyzed that are most informative for disambiguating its identity. To this end, the model is trained with a hierarchical knowledge tree, in which leaves represent identified objects, intermediate nodes represent more general object classes, and links between nodes carry the sensorimotor information used to discriminate between candidate objects (i.e., the bottom-up feature responses expected at particular points on the object, and the eye-movement vectors targeted at those points). During the iterative recognition of an object, the system programs its next fixation towards the location that will maximally increase the information gain about the object being recognized, and thus will best allow the model to discriminate between the various candidate object classes.
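The fixation-selection rule described above can be made concrete: given a posterior over candidate object classes and, for each candidate fixation point, class-conditional distributions over the feature responses expected there (the information stored on the links of the knowledge tree), the next fixation is the one that maximizes the expected reduction in entropy of the posterior. The following Python sketch is a simplified, hypothetical rendering of that criterion; the discrete outcome model and the toy numbers are assumptions, not the published implementation of Schill et al.

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def expected_information_gain(posterior, likelihoods):
    """Expected entropy reduction from observing one fixation.

    posterior:   (n_classes,) current belief over candidate objects
    likelihoods: (n_classes, n_outcomes) P(feature outcome | class) at
                 this fixation point (a stand-in for the feature models
                 stored on the links of the knowledge tree)
    """
    h_before = entropy(posterior)
    p_outcome = posterior @ likelihoods              # predictive distribution
    h_after = 0.0
    for k, p_k in enumerate(p_outcome):
        if p_k == 0:
            continue
        post_k = posterior * likelihoods[:, k] / p_k  # Bayes update
        h_after += p_k * entropy(post_k)
    return h_before - h_after

def next_fixation(posterior, fixation_models):
    """Pick the fixation point whose observation is most informative."""
    gains = [expected_information_gain(posterior, lik)
             for lik in fixation_models]
    return int(np.argmax(gains))

# Toy example: 3 candidate objects, 2 candidate fixation points.
posterior = np.array([0.5, 0.3, 0.2])
fixation_models = [
    np.array([[0.9, 0.1], [0.9, 0.1], [0.9, 0.1]]),  # uninformative point
    np.array([[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]]),  # discriminative point
]
print(next_fixation(posterior, fixation_models))      # -> 1
```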
Several related models have been proposed [13, 23, 44, 49, 50], in which scanpaths (containing motor control directives stored in a “where” memory and the locally expected bottom-up features stored in a “what” memory) are learned for each scene or object to be recognized. The models differ in the algorithm used to match these sequences of where/what information to the visual scene. These include deterministic matching algorithms (i.e., focus next onto the location indicated by the next stored motor directive, and verify that the expected features are found there).
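A minimal sketch of the shared where/what scanpath representation may help fix ideas: a scanpath is a sequence of (motor directive, expected features) pairs, and a deterministic matcher replays the motor directives and scores how well the features observed at each predicted location agree with the stored expectations. The data structure and the cosine-similarity test below are illustrative assumptions, not the mechanism of any single cited model.

```python
import numpy as np

def match_scanpath(scanpath, feature_map, start, threshold=0.8):
    """Deterministic scanpath matching: replay the stored motor
    directives from `start` and check that the features found at each
    predicted location resemble the stored expectations.

    scanpath:    list of (saccade_vector, expected_feature_vector) pairs,
                 i.e., the "where" and "what" memories interleaved.
    feature_map: dict mapping (x, y) -> feature vector observed there.
    Returns the fraction of steps whose features matched.
    """
    pos = np.asarray(start, dtype=float)
    hits = 0
    for saccade, expected in scanpath:
        pos = pos + saccade                    # "where": next fixation
        observed = feature_map.get(tuple(pos.astype(int)))
        if observed is None:
            continue
        # "what": cosine similarity of expected vs. observed features
        sim = observed @ expected / (np.linalg.norm(observed)
                                     * np.linalg.norm(expected) + 1e-9)
        hits += sim >= threshold
    return hits / len(scanpath)

# The stored scanpath (scene or object) that matches best is reported.
path = [(np.array([5, 0]), np.array([1.0, 0.0])),
        (np.array([0, 5]), np.array([0.0, 1.0]))]
fmap = {(5, 0): np.array([0.9, 0.1]), (5, 5): np.array([0.1, 0.9])}
print(match_scanpath(path, fmap, start=(0, 0)))   # -> 1.0
```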