recently for categorization tasks, such as distinguishing images of dogs from images of cats. Riesenhuber and Poggio argue that in such an architecture the binding problem might not be as severe as originally perceived [192]. Since the lower levels
of the hierarchy contain retinotopic representations, features of spatially separated
objects do not interact and hence are bound by spatial proximity. Features in the
higher levels are complex combinations of simple features. Since there are many
such combinations, it is unlikely that the features of two objects can be combined to
a valid third object. However, the experiments showed that recognition performance decreased only slightly when two non-overlapping objects were present, but was severely impaired when the two objects overlapped.
The HMAX architecture was designed to recognize a single object in a feed-
forward manner. The use of the maximum operation for pooling makes the cell re-
sponses invariant to input transformations and also suppresses noise. The response
of a C-cell that reacts to a feature is not changed by nearby clutter, as long as the
strongest S-cell response to the feature exceeds the S-responses to the distractor. However, a C-cell cannot distinguish a single instance of a feature from multiple instances of the same feature within its receptive field.
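Both properties of maximum pooling can be seen in a minimal sketch. The function below (a hypothetical helper, not taken from the HMAX implementation) pools a grid of S-cell responses by taking the maximum in each window: shifting a feature within a pooling window leaves the C-response unchanged, and a second instance of the feature in the same window produces exactly the same C-response as a single instance.

```python
import numpy as np

def max_pool(responses, size=2):
    """Pool S-cell responses by taking the maximum over each size x size window."""
    h, w = responses.shape
    h2, w2 = h // size, w // size
    r = responses[:h2 * size, :w2 * size].reshape(h2, size, w2, size)
    return r.max(axis=(1, 3))

# An S-cell response at position (1, 1) ...
a = np.zeros((4, 4)); a[1, 1] = 1.0
# ... and the same feature shifted within the pooling window:
b = np.zeros((4, 4)); b[0, 0] = 1.0
assert np.array_equal(max_pool(a), max_pool(b))   # invariant to the shift

# Two instances of the feature inside one window yield the same C-response as one:
c = np.zeros((4, 4)); c[0, 0] = 1.0; c[1, 1] = 1.0
assert np.array_equal(max_pool(a), max_pool(c))   # count information is lost
```

The same mechanism explains the robustness to clutter: a distractor only alters the pooled response if it drives some S-cell above the strongest response to the feature itself.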
Convolutional Networks. The creation of features by enumeration of all possi-
ble subfeature-combinations is easy, but computationally inefficient. For practical
applications, such as optical character recognition (OCR) and the interpretation of
handwritten text, the network size plays an important role since real-time conditions
must be met for the network recall.
If more of the network parameters can be adapted to a specific task, smaller net-
works suffice to extract the relevant features. One example of a fully adaptable hier-
archical neural network is the convolutional network proposed by LeCun et al. [133]
for the recognition of isolated normalized digits. A recent version of such a network,
which is called LeNet-5 [134], is illustrated in Figure 3.8.
The network consists of seven layers and an input plane that contains a digit. The digit has been normalized to 20 × 20 pixels and centered in the 32 × 32 frame. The input intensities are scaled such that the white background becomes −0.1 and the black foreground becomes 1.175.

Fig. 3.8. Convolutional neural network LeNet-5, developed by LeCun et al. [134] for digit recognition. The first layers compute an increasing number of feature maps with decreasing resolution by convolution with 5 × 5 kernels and subsampling. At the higher layers, the resolution drops to 1 × 1 and the weights are fully connected (image adapted from [134]).
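The first stage of such a network can be sketched as follows. This is a simplified illustration, not the LeNet-5 implementation: the kernels are random rather than learned, and the subsampling layer is plain 2 × 2 averaging, whereas LeNet-5 additionally applies a trainable coefficient, a bias, and a squashing function. It shows how a 32 × 32 input frame yields six 28 × 28 feature maps under valid 5 × 5 convolution, which subsampling reduces to 14 × 14.

```python
import numpy as np

def convolve_valid(image, kernel):
    """'Valid' 2D convolution: slide the kernel over all fully contained positions."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def subsample(fmap, size=2):
    """Average each size x size neighborhood, halving the resolution for size=2."""
    h, w = fmap.shape[0] // size, fmap.shape[1] // size
    r = fmap[:h * size, :w * size].reshape(h, size, w, size)
    return r.mean(axis=(1, 3))

rng = np.random.default_rng(0)
image = rng.normal(size=(32, 32))        # stand-in for the normalized input frame
kernels = rng.normal(size=(6, 5, 5))     # six 5x5 kernels (random here, learned in LeNet-5)

c1 = [convolve_valid(image, k) for k in kernels]  # six 28x28 feature maps
s2 = [subsample(m) for m in c1]                   # six 14x14 subsampled maps
assert c1[0].shape == (28, 28) and s2[0].shape == (14, 14)
```

Because each map is produced by a single shared kernel, the number of free parameters stays small even though the maps themselves are large; this weight sharing is what lets the whole hierarchy be trained for a specific task.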