tions), resulting in an underdetermined combinatorial
many-to-many search problem. A successful imple-
mentation of this type of approach has yet to be demon-
strated. Furthermore, the evidence from neural record-
ing in monkeys suggests that visual object representa-
tions are somewhat more view-specific (e.g., Tanaka,
1996), and not fully 3-D invariant or canonical. For
example, although IT neurons appear to be relatively
location and size invariant, they are not fully invariant
with respect to rotations either in the plane or in depth.
Behavioral studies in humans appear to provide some
support for view-specific object representations, but this
issue is still strongly debated (e.g., Tarr & Bülthoff,
1995; Biederman & Gerhardstein, 1995; Biederman &
Cooper, 1992; Burgund & Marsolek, in press).
For these reasons, we and others have taken a differ-
ent approach to object recognition based on the gradual,
hierarchical, parallel transformations that the brain is so
well suited for performing. Instead of casting object
recognition as a massive dynamic search problem, we
can think of it in terms of a gradual sequence of trans-
formations (operating in parallel) that emphasize cer-
tain distinctions and collapse across others. If the end
result of this sequence of transformations retains suffi-
cient distinctions to disambiguate different objects, but
collapses across irrelevant differences produced by dif-
ferent viewing perspectives, then invariant object recog-
nition has been achieved. This approach is consider-
ably simpler because it does not try to recover the com-
plete 3-D structural information or form complex inter-
nal models. It simply strives to preserve sufficient dis-
tinctions to disambiguate different objects, while allow-
ing lots of other information to be discarded. Note that
we are not denying that people perceive 3-D informa-
tion, just that object recognition is not based on canoni-
cal, structural representations of this information.

Figure 8.10: Hierarchical sequence of transformations that
produce spatially invariant representations. The first level
encodes simple feature conjunctions across a relatively small
range of locations. The next level encodes more complex
feature conjunctions in a wider range of locations. Finally, in
this simple case, the third level can integrate across all
locations of the same object, producing a fully invariant
representation.

One of the most important challenges for the gradual
transformation approach to spatially invariant object
recognition is the binding problem discussed in chapter 7.
In recognizing an object, one must encode the spatial
relationships among the different features of the object
(e.g., it matters whether a particular edge is on the right-
or left-hand side of the object), while at the same time
collapsing across the overall spatial location of the object
as it appears on the retina. If you simply encoded each
feature completely separately in a spatially invariant
fashion, and then tried to recognize objects on the basis of
the resulting collection of features, you would lose track of
the spatial arrangement (binding) of these features relative
to each other, and would thus confuse objects that have the
same features but in different arrangements. For example,
the capital letters “T” and “L” are both composed of a
horizontal and a vertical line, so one needs to represent the
way these lines intersect to disambiguate the letters.
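
The “T” versus “L” confusion can be made concrete with a small
sketch. Everything in it is an illustrative assumption rather than
the model described in this text: the 3 x 3 letter glyphs, the 6 x 6
canvas, the two bar detectors, and the use of 2 x 2 pixel patches as
a stand-in for "limited combinations of features." The first function
encodes each feature separately and invariantly; the second collects
small local conjunctions wherever they occur.

```python
# Toy glyphs (assumed for illustration): '#' marks an active pixel.
# Each letter is built from one horizontal and one vertical bar.
T_GLYPH = ["###",
           ".#.",
           ".#."]
L_GLYPH = ["#..",
           "#..",
           "###"]

def place(glyph, row, col, size=6):
    """Draw the glyph on an empty size x size canvas, top-left at (row, col)."""
    canvas = [[0] * size for _ in range(size)]
    for r, line in enumerate(glyph):
        for c, ch in enumerate(line):
            if ch == "#":
                canvas[row + r][col + c] = 1
    return canvas

def bag_of_features(img):
    """Separate, location-invariant features: is there a 3-pixel horizontal
    bar anywhere, and is there a 3-pixel vertical bar anywhere?"""
    n = len(img)
    horiz = any(img[r][c] and img[r][c + 1] and img[r][c + 2]
                for r in range(n) for c in range(n - 2))
    vert = any(img[r][c] and img[r + 1][c] and img[r + 2][c]
               for r in range(n - 2) for c in range(n))
    return (horiz, vert)

def pixel(img, r, c):
    """Pixel value, treating everything outside the canvas as empty."""
    n = len(img)
    return img[r][c] if 0 <= r < n and 0 <= c < n else 0

def conjunction_set(img):
    """Set of distinct non-empty 2x2 patches found anywhere in the image.
    Each patch keeps the local arrangement of its pixels, but the set
    discards where on the canvas the patch occurred."""
    n = len(img)
    patches = set()
    for r in range(-1, n):
        for c in range(-1, n):
            patch = (pixel(img, r, c), pixel(img, r, c + 1),
                     pixel(img, r + 1, c), pixel(img, r + 1, c + 1))
            if any(patch):
                patches.add(patch)
    return patches

t_here, t_there = place(T_GLYPH, 0, 0), place(T_GLYPH, 3, 2)
l_here = place(L_GLYPH, 1, 3)

print(bag_of_features(t_here) == bag_of_features(l_here))   # separate features
print(conjunction_set(t_here) == conjunction_set(t_there))  # same letter, moved
print(conjunction_set(t_here) == conjunction_set(l_here))   # different letters
```

Under these assumptions, the separately encoded features are identical
for the two letters, whereas the sets of local conjunctions differ
between “T” and “L” yet stay the same when a letter is shifted. A
single pooling step over raw pixels is of course far cruder than a
multilevel hierarchy; the sketch only isolates the binding issue.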
As perhaps most clearly enunciated by Mozer
(1987) (see also Mozer, 1991; Fukushima, 1988; Le-
Cun, Boser, Denker, Henderson, Howard, Hubbard, &
Jackel, 1989), the binding problem for shape recogni-
tion can be managed by encoding limited combinations
of features in a way that reflects their spatial arrange-
ment, while at the same time recognizing these feature
combinations in a range of different spatial locations.
By repeatedly performing this type of transformation
over many levels of processing, one ends up with spa-