Fig. 3.7. HMAX model of object recognition proposed by Riesenhuber and Poggio. The net-
work consists of alternating S-layers and C-layers that extract features of increasing complex-
ity, size, and invariance. S-cells extract features by template matching while C-cells produce
invariance by pooling of S-cells with a maximum operator (image from [192]).
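To make the two canonical operations concrete, the following Python sketch implements an S-cell as a template match and a C-cell as a maximum pooling over its afferents. The Gaussian form of the template match and the parameter sigma are illustrative assumptions; HMAX variants differ in the exact S-cell transfer function.

import numpy as np

def s_cell_response(patch, template, sigma=1.0):
    # Template matching: response falls off as a Gaussian of the
    # Euclidean distance between the input patch and the stored
    # template. (Illustrative choice; HMAX variants differ in the
    # exact S-cell transfer function.)
    d = np.linalg.norm(patch - template)
    return np.exp(-d * d / (2.0 * sigma * sigma))

def c_cell_response(afferents):
    # Invariance pooling: a C-cell outputs the maximum over its
    # afferent S-cells, which share a feature but differ in
    # position and scale.
    return np.max(afferents)

# Usage: shifting a feature changes which S-cell fires most,
# but not the pooled C-cell output.
patch = np.ones((4, 4))
print(c_cell_response([s_cell_response(patch, np.ones((4, 4))),
                       s_cell_response(patch, np.zeros((4, 4)))]))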
Again, when going up the hierarchy, the receptive field size of the feature detec-
tors is enlarged, the feature complexity rises, and the responses become more and
more invariant to input transformations, such as shifts or rotations. Cells in layer
S1 correspond to V1 simple cells. They analyze the 160 × 160 input image and ex-
tract oriented features at different positions, scales, and orientations. Space is sam-
pled at every pixel, 12 scales are used, and four orientations are extracted, yielding
1,228,800 cells. The huge number of S1 cells is reduced in layer C1 to 46,000 by pooling cells of the same orientation with similar position and scale. C1 cells correspond to V1 complex cells, which detect oriented image structure invariant to phase. S2 cells receive input from 2 × 2 neighboring C1 units of arbitrary orientation, yielding a total of almost three million S2 cells of 256 different types. They detect composite features, such as corners and line crossings. All cells of a given type are pooled into a single C2 cell, which is thus fully invariant to stimulus position. At the top of the hierarchy reside view-tuned cells with Gaussian transfer functions. Each receives input from a subset of typically 40 of the 256 C2 cells.
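The combinatorics of the S2/C2 stage can be checked with a small sketch. The code below uses a toy C1 grid and an illustrative combination rule (a product of the four afferents); the grid size, the combination rule, and the random test input are assumptions, but the 256 S2 types and the position-invariant C2 maximum follow the text.

import numpy as np
from itertools import product

rng = np.random.default_rng(0)

# Toy stand-in for C1: 4 orientations on a coarse grid of positions
# (the real model has 46,000 C1 cells over several scale bands).
grid, n_orient = 8, 4
c1 = rng.random((grid, grid, n_orient))

# One S2 type per assignment of an orientation to each of the
# 2 x 2 underlying C1 positions: 4^4 = 256 types.
s2_types = list(product(range(n_orient), repeat=4))
assert len(s2_types) == 256

def s2_response(y, x, combo):
    # Composite feature at position (y, x): combine one orientation
    # from each C1 cell of the 2 x 2 neighborhood (here simply a
    # product; the exact combination rule is a modeling choice).
    a, b, c, d = combo
    return (c1[y, x, a] * c1[y, x + 1, b] *
            c1[y + 1, x, c] * c1[y + 1, x + 1, d])

# C2: one cell per S2 type, pooling that type over all positions
# with a maximum, hence fully invariant to stimulus position.
c2 = np.array([max(s2_response(y, x, t)
                   for y in range(grid - 1)
                   for x in range(grid - 1))
               for t in s2_types])
print(c2.shape)  # (256,)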
Almost all weights in the network are prewired. Only the weights of the view-
tuned cells can be adapted to a dataset. They are chosen such that a view-tuned unit
receives inputs from the C2 cells most active when the associated object view is
presented at the input of the network.
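A minimal sketch of this weight-setting step, assuming the unit connects to the C2 cells most active for the training view and stores their activities as the center of its Gaussian transfer function, could look as follows; the selection rule and sigma are assumptions beyond what the text specifies.

import numpy as np

def imprint_view_tuned_unit(c2_activity, n_afferents=40):
    # Connect the new unit to the C2 cells most active for this
    # object view and store their activities as the Gaussian center.
    # (Picking the strongest cells is an assumption; the text says
    # the unit receives input from typically 40 of the 256 C2 cells.)
    idx = np.argsort(c2_activity)[-n_afferents:]
    return idx, c2_activity[idx].copy()

def view_tuned_response(c2_activity, idx, center, sigma=1.0):
    # Gaussian transfer function on the distance between the current
    # C2 pattern (restricted to the unit's afferents) and the center.
    d = np.linalg.norm(c2_activity[idx] - center)
    return np.exp(-d * d / (2.0 * sigma * sigma))

# Usage: the unit responds maximally (1.0) to its imprinted view.
rng = np.random.default_rng(1)
c2 = rng.random(256)
idx, center = imprint_view_tuned_unit(c2)
print(view_tuned_response(c2, idx, center))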
Riesenhuber and Poggio showed that these view-tuned cells have properties sim-
ilar to the cells found in the inferotemporal cortex (IT). They also demonstrated that
view-invariant recognition of 3D paper clips is possible by combining the outputs
of units tuned to different views of an object. In addition, the model was used re-