The number of excitatory and inhibitory feature arrays per layer increases when going from Layer 0 (4 + 2) to Layer 2 (16 + 8). Layer 3
contains 10 excitatory and 5 inhibitory feature cells. In addition, an input feature
array is present in all layers except the topmost one.
Most projections in the network are either excitatory or inhibitory. Weights of projections that access excitatory units are non-negative; weights of projections from inhibitory units are non-positive. In contrast, the weights of projections accessing the input feature array can have any sign. These projections have a window size of 5 × 5 and either lead to excitatory features in the same layer or belong to forward projections of excitatory feature cells in the next higher layer.
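Such sign constraints are typically maintained during gradient training by clipping the weights after each update. The following NumPy sketch illustrates this idea; the function name and the clipping scheme are assumptions for illustration, not taken from the text:

import numpy as np

def constrain_weights(w, source):
    # Clip projection weights in place according to the sign of the source units.
    if source == "excitatory":
        np.maximum(w, 0.0, out=w)   # weights from excitatory units: non-negative
    elif source == "inhibitory":
        np.minimum(w, 0.0, out=w)   # weights from inhibitory units: non-positive
    # source == "input": any sign is allowed, so nothing is clipped
    return w

w = np.random.randn(8, 5, 5)        # e.g. eight 5 x 5 projection windows
constrain_weights(w, "excitatory")
assert (w >= 0.0).all()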
The excitatory feature cells of Layer 1 and Layer 2 receive forward projections
from the 4 × 4 hyper-neighborhood in the layer below them. Connections between
Layer 2 and the topmost Layer 3 are different since the resolution drops from 12 × 9
to 1 × 1. Here, the forward and backward projections implement a full connectivity
between the excitatory feature cells of one layer and all feature cells of the other
layer. The backward projections of Layer 0 and Layer 1 access all feature cells of
a single hypercolumn in the next higher layer. 2 × 2 different backward projections
exist for each excitatory feature. In all layers except the topmost one, lateral projections access all features of the 3 × 3 hyper-neighborhood around a feature cell. In
Layer 3 lateral projections are smaller because all feature cells are contained in a
1 × 1 hyper-neighborhood.
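For reference, the projection structure described above can be restated compactly as plain data; the dictionary layout and field names below are invented for clarity and carry no information beyond the text:

projections = {
    "forward": {
        "Layer1": {"from": "Layer0", "window": "4 x 4 hyper-neighborhood"},
        "Layer2": {"from": "Layer1", "window": "4 x 4 hyper-neighborhood"},
        "Layer3": {"from": "Layer2", "window": "full (12 x 9 -> 1 x 1)"},
    },
    "backward": {
        "Layer2": {"from": "Layer3", "window": "full"},
        "Layer1": {"from": "Layer2", "window": "one hypercolumn", "variants": "2 x 2"},
        "Layer0": {"from": "Layer1", "window": "one hypercolumn", "variants": "2 x 2"},
    },
    "lateral": {
        "Layer0-2": {"window": "3 x 3 hyper-neighborhood"},
        "Layer3": {"window": "1 x 1 hyper-neighborhood"},
    },
}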
The projections of the inhibitory features are simpler. They access 5 × 5 windows of all excitatory feature arrays within the same layer. In Layer 3, of course, this window size reduces to 1 × 1. While all projection units have linear transfer functions, a smooth rectifying transfer function f_st (β = 10, see Fig. 4.6(a) in Section 4.2.4) is used for the output units of all feature cells.
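The exact shape of f_st is given in Fig. 4.6(a). Purely as an illustration of what a smooth rectifier with steepness parameter β looks like, one may consider the scaled softplus below, which approaches a hard rectifier as β grows; this particular formula is an assumption and not necessarily identical to f_st:

import numpy as np

def smooth_rectifier(x, beta=10.0):
    # approximately 0 for strongly negative inputs, approximately x for
    # strongly positive inputs; converges to max(0, x) as beta -> infinity
    return np.log1p(np.exp(beta * x)) / beta

x = np.linspace(-1.0, 1.0, 5)
print(smooth_rectifier(x))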
The feature arrays are surrounded by a two-pixel-wide border. The activities of the border cells are copied from feature cells using wrap-around.
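In NumPy, this border handling corresponds to cyclic ("wrap") padding, as the following minimal sketch shows:

import numpy as np

feature_array = np.arange(12 * 9, dtype=float).reshape(12, 9)
padded = np.pad(feature_array, pad_width=2, mode="wrap")  # two-pixel border

assert padded.shape == (16, 13)
# the top border rows are copies of the bottom rows of the array:
assert np.array_equal(padded[0:2, 2:-2], feature_array[-2:, :])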
10.4 Experimental Results
Because the BioID dataset does not specify which images constitute the training and test sets, the dataset was divided randomly into 1000 training images (TRN) and 521 test images (TST). The network was trained for ten iterations on random
subsets of the training set with increasing size using backpropagation through time
(BPTT) and RPROP, as described in Chapter 6. The weighting of the quadratic error
increased linearly in time.
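The details of BPTT and RPROP are covered in Chapter 6. The sketch below shows only the linearly increasing error weighting; the assumption that the weight of iteration t is proportional to t is illustrative, as the exact schedule is specified in Chapter 6:

import numpy as np

def weighted_quadratic_error(outputs, target):
    # outputs: one network output per unfolded iteration (here: ten)
    T = len(outputs)
    total = 0.0
    for t, out in enumerate(outputs, start=1):
        total += (t / T) * np.sum((out - target) ** 2)  # later errors count more
    return total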
The first two excitatory feature arrays of the three lower layers are trained to produce the desired output blobs that indicate the eye positions. All other features are hidden; they are forced to have low mean activity.
Figure 10.5 shows the development of the trained network's output over time
when the test image from Fig. 10.3 is presented as input. One can observe that
the blobs signaling the locations of the eyes develop in a top-down fashion. After
the first iteration, they appear only in the lowest resolution. This coarse localization
is used to bias the development of blobs in lower layers. After five iterations, the