11. Summary and Conclusions
11.1 Short Summary of Contributions
In order to overcome limitations of current computer vision systems, this thesis pro-
posed an architecture for image interpretation, called Neural Abstraction Pyramid.
This hierarchical architecture consists of simple processing elements that interact
with their neighbors. The recurrent interactions are described by weight templates.
Weighted links form horizontal and vertical feedback loops that mediate contextual
influences. Images are transformed into a sequence of representations that become
increasingly abstract as their spatial resolution decreases, while feature diversity as
well as invariance increase. This process works iteratively. If the interpretation of
an image patch cannot be decided locally, the decision is deferred until contextual
evidence arrives that can be used as bias. Local ambiguities are resolved in this way.
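
As a rough illustration of how iterated local interactions can resolve an ambiguity, the following sketch updates a one-dimensional layer in which every cell applies the same weight template to its neighbors. It is a strong simplification and not the implementation used in the thesis: the names `lateral_step` and `iterate`, the 3-tap template, the sigmoid transfer function, and the zero padding at the border are all assumptions made for this example, and vertical (bottom-up and top-down) links are left out.

```python
import math


def sigmoid(x):
    """Logistic transfer function, maps any real input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def lateral_step(activity, inputs, template, bias=0.0):
    """One synchronous update of a 1-D layer with a shared 3-tap template.

    template = (left_weight, self_weight, right_weight) is reused at every
    position, which is what makes the weights "shared".
    """
    n = len(activity)
    left_w, self_w, right_w = template
    updated = []
    for i in range(n):
        left = activity[i - 1] if i > 0 else 0.0       # zero padding at the border
        right = activity[i + 1] if i < n - 1 else 0.0
        net = (inputs[i] + bias
               + left_w * left + self_w * activity[i] + right_w * right)
        updated.append(sigmoid(net))
    return updated


def iterate(inputs, template, steps, bias=0.0):
    """Start from zero activity and apply the local update `steps` times."""
    activity = [0.0] * len(inputs)
    for _ in range(steps):
        activity = lateral_step(activity, inputs, template, bias)
    return activity


# The middle cell has no local evidence (input 0.0), its neighbors do.
result = iterate([2.0, 2.0, 0.0, 2.0, 2.0], (1.0, 0.0, 1.0), steps=3)
```

After the first iteration the middle cell sits at exactly 0.5, since it only sees its own uninformative input. From the second iteration on, the activity of its neighbors arrives through the lateral links and pushes it above 0.5, which mirrors the deferred decision described above.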
The proposed architecture defines a hierarchical recurrent neural network with
shared weights. Unsupervised and supervised learning techniques can be applied to
it. It turned out that the combination of RPROP learning and backpropagation
through time ensures stable and fast training, despite the difficulties involved in
training recurrent neural networks.
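
To make the learning rule more concrete, the sketch below shows a sign-based RPROP update in plain Python. It follows the common variant in which the stored gradient is set to zero after a sign change (often called iRPROP-). Which variant and which constants were used in the thesis is not stated here, so the function name `rprop_step` and the values of `eta_plus`, `eta_minus`, `step_min`, and `step_max` are assumptions. In a recurrent network, the gradients passed in would be the ones accumulated by backpropagation through time over all iterations; here a simple quadratic stands in for that.

```python
def sign(x):
    """Return -1, 0, or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)


def rprop_step(weights, grads, prev_grads, steps,
               eta_plus=1.2, eta_minus=0.5, step_min=1e-6, step_max=50.0):
    """Update weights in place using only the sign of each gradient.

    Each weight has its own step size: it grows while the gradient keeps
    its sign and shrinks when the sign flips (the minimum was overshot).
    """
    for i, g in enumerate(grads):
        change = g * prev_grads[i]
        if change > 0:
            steps[i] = min(steps[i] * eta_plus, step_max)
        elif change < 0:
            steps[i] = max(steps[i] * eta_minus, step_min)
            g = 0.0  # skip the update for this weight after a sign change
        weights[i] -= sign(g) * steps[i]
        prev_grads[i] = g


def minimize_quadratic(target=3.0, iterations=200):
    """Toy stand-in for training: minimize (w - target)^2 with RPROP."""
    w, prev, steps = [0.0], [0.0], [0.1]
    for _ in range(iterations):
        grads = [2.0 * (w[0] - target)]
        rprop_step(w, grads, prev, steps)
    return w[0]
```

Because only the sign of the gradient enters the update, the step size does not depend on the gradient's magnitude. This is one plausible reason why such a rule copes well with the very small or very large gradients that can arise when errors are propagated through many iterations.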
The proposed architecture was applied to example problems, including the bi-
narization of handwriting, local contrast normalization, and shift-invariant feature
extraction. Unsupervised learning was used to produce a hierarchy of sparse digit
features. The extracted features were meaningful and facilitated digit recognition.
Supervised learning was applied to several computer vision tasks. Meter values
were recognized by a block classifier without the need for prior digit segmentation.
The binarization of matrix codes was learned. The recurrent network discovered the
cell structure of the code and used it to improve binarization.
The architecture was also applied to learning several image reconstruction
tasks. Images were degraded and recurrent networks were trained to reproduce
the originals iteratively. For a super-resolution problem, small recurrent networks
outperformed feed-forward networks of similar complexity. A larger network was
used for the filling-in of occlusions, the removal of noise, and the enhancement of
image contrast.
Finally, the proposed architecture was used to localize faces in complex office
environments. It developed a top-down strategy to produce blobs that indicate eye
positions. The localization performance compared well to the hybrid system, pro-