Fig. 3.7. HMAX model of object recognition proposed by Riesenhuber and Poggio. The net-
work consists of alternating S-layers and C-layers that extract features of increasing complex-
ity, size, and invariance. S-cells extract features by template matching while C-cells produce
invariance by pooling of S-cells with a maximum operator (image from [192]).
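To make the two canonical operations concrete, the following Python sketch implements an S-cell as a template match and a C-cell as a maximum pooling over its afferents. The Gaussian form of the template match and the parameter sigma are illustrative assumptions; HMAX variants differ in the exact S-cell transfer function.

import numpy as np

def s_cell_response(patch, template, sigma=1.0):
    # Template matching: response falls off as a Gaussian of the
    # Euclidean distance between the input patch and the stored
    # template. (Illustrative choice; HMAX variants differ in the
    # exact S-cell transfer function.)
    d = np.linalg.norm(patch - template)
    return np.exp(-d * d / (2.0 * sigma * sigma))

def c_cell_response(afferents):
    # Invariance pooling: a C-cell outputs the maximum over its
    # afferent S-cells, which share a feature but differ in
    # position and scale.
    return np.max(afferents)

# Usage: shifting a feature changes which S-cell fires most,
# but not the pooled C-cell output.
patch = np.ones((4, 4))
print(c_cell_response([s_cell_response(patch, np.ones((4, 4))),
                       s_cell_response(patch, np.zeros((4, 4)))]))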
Again, when going up the hierarchy, the receptive field size of the feature detec-
tors is enlarged, the feature complexity rises, and the responses become more and
more invariant to input transformations, such as shifts or rotations. Cells in layer
S1 correspond to V1 simple cells. They analyze the 160 × 160 input image and ex-
tract oriented features at different positions, scales, and orientations. Space is sam-
pled at every pixel, 12 scales are used, and four orientations are extracted, yielding
1,228,800 cells. The huge number of S1 cells is reduced in layer C1 to 46,000 by pooling cells of the same orientation with similar position and scale. C1 cells correspond to V1 complex cells, which detect oriented image structure invariant to phase. S2 cells receive input from 2 × 2 neighboring C1 units of arbitrary orientation, yielding a total of almost three million S2 cells of 256 different types. They detect composite features, such as corners and line crossings. All cells of a given type are pooled into a single C2 cell, which is thus fully invariant to stimulus position. At the top of the hierarchy reside view-tuned cells with Gaussian transfer functions. Each receives input from a subset of typically 40 of the 256 C2 cells.
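The combinatorics of the S2/C2 stage can be checked with a small sketch. The code below uses a toy C1 grid and an illustrative combination rule (a product of the four afferents); the grid size, the combination rule, and the random test input are assumptions, but the 256 S2 types and the position-invariant C2 maximum follow the text.

import numpy as np
from itertools import product

rng = np.random.default_rng(0)

# Toy stand-in for C1: 4 orientations on a coarse grid of positions
# (the real model has 46,000 C1 cells over several scale bands).
grid, n_orient = 8, 4
c1 = rng.random((grid, grid, n_orient))

# One S2 type per assignment of an orientation to each of the
# 2 x 2 underlying C1 positions: 4^4 = 256 types.
s2_types = list(product(range(n_orient), repeat=4))
assert len(s2_types) == 256

def s2_response(y, x, combo):
    # Composite feature at position (y, x): combine one orientation
    # from each C1 cell of the 2 x 2 neighborhood (here simply a
    # product; the exact combination rule is a modeling choice).
    a, b, c, d = combo
    return (c1[y, x, a] * c1[y, x + 1, b] *
            c1[y + 1, x, c] * c1[y + 1, x + 1, d])

# C2: one cell per S2 type, pooling that type over all positions
# with a maximum, hence fully invariant to stimulus position.
c2 = np.array([max(s2_response(y, x, t)
                   for y in range(grid - 1)
                   for x in range(grid - 1))
               for t in s2_types])
print(c2.shape)  # (256,)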
Almost all weights in the network are prewired. Only the weights of the view-
tuned cells can be adapted to a dataset. They are chosen such that a view-tuned unit
receives inputs from the C2 cells most active when the associated object view is
presented at the input of the network.
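A minimal sketch of this weight-setting step, assuming the unit connects to the C2 cells most active for the training view and stores their activities as the center of its Gaussian transfer function, could look as follows; the selection rule and sigma are assumptions beyond what the text specifies.

import numpy as np

def imprint_view_tuned_unit(c2_activity, n_afferents=40):
    # Connect the new unit to the C2 cells most active for this
    # object view and store their activities as the Gaussian center.
    # (Picking the strongest cells is an assumption; the text says
    # the unit receives input from typically 40 of the 256 C2 cells.)
    idx = np.argsort(c2_activity)[-n_afferents:]
    return idx, c2_activity[idx].copy()

def view_tuned_response(c2_activity, idx, center, sigma=1.0):
    # Gaussian transfer function on the distance between the current
    # C2 pattern (restricted to the unit's afferents) and the center.
    d = np.linalg.norm(c2_activity[idx] - center)
    return np.exp(-d * d / (2.0 * sigma * sigma))

# Usage: the unit responds maximally (1.0) to its imprinted view.
rng = np.random.default_rng(1)
c2 = rng.random(256)
idx, center = imprint_view_tuned_unit(c2)
print(view_tuned_response(c2, idx, center))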
Riesenhuber and Poggio showed that these view-tuned cells have properties sim-
ilar to the cells found in the inferotemporal cortex (IT). They also demonstrated that
view-invariant recognition of 3D paper clips is possible by combining the outputs
of units tuned to different views of an object. In addition, the model was used re-