Hierarchical Products of Experts. Another approach that makes the learning of multi-level statistical image models possible is the products of experts (PoE) method recently proposed by Hinton [97]. Each expert specifies a probability distribution p_m(d | θ_m) over the visible variables d, and the n experts are combined by multiplying these distributions together and renormalizing: p(d | θ_1, ..., θ_n) = ∏_m p_m(d | θ_m) / Σ_c ∏_m p_m(c | θ_m), where c enumerates all possible vectors in data
space. The motivation for multiplying the experts is that the combined distribution
can be much sharper than the individual expert models. For example, each expert
can constrain only a small subset of the many image space dimensions and the prod-
uct will constrain all of them if the subsets cover the dimensions. Furthermore, the
PoE construction makes it easy to infer the values of the latent variables of each ex-
pert because the latent variables of different experts are conditionally independent,
given the data.
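To make the combination rule concrete, the following is a minimal sketch (not taken from [97]) that multiplies two hypothetical experts over a tiny binary data space and renormalizes by summing over all vectors c; the expert definitions and the dimensionality are illustrative assumptions only.

```python
import numpy as np
from itertools import product

def poe_distribution(experts, dim=3):
    """Normalized product-of-experts distribution over all binary vectors in {0,1}^dim."""
    vectors = list(product([0, 1], repeat=dim))
    # Unnormalized product prod_m p_m(c | theta_m) for every candidate vector c
    unnorm = np.array([np.prod([expert(c) for expert in experts]) for c in vectors])
    return vectors, unnorm / unnorm.sum()          # renormalize over the whole data space

# Each hypothetical expert constrains only one dimension ...
expert_a = lambda c: 0.9 if c[0] == 1 else 0.1     # prefers the first bit to be 1
expert_b = lambda c: 0.9 if c[1] == 0 else 0.1     # prefers the second bit to be 0

# ... yet their product concentrates probability mass on vectors satisfying both
# constraints, so the combined distribution is sharper than either expert alone.
vectors, probs = poe_distribution([expert_a, expert_b])
for v, p in sorted(zip(vectors, probs), key=lambda t: -t[1]):
    print(v, round(float(p), 3))
```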
One expert type for which this inference is tractable is the restricted Boltzmann machine (RBM) [218]. These networks consist of one visible layer and one hidden
layer. They have no intralayer connections. The vertical connections between the
binary stochastic units are symmetrical. Each hidden unit can be viewed as an expert
since the probability of the network generating a data vector is proportional to the
product of the probabilities that the data vector is generated by each of the hidden
units alone [74].
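The tractability of inference can be illustrated with a short sketch: given a data vector, each hidden unit's activation probability is obtained independently by one weighted sum followed by a sigmoid, with no iteration. The layer sizes and weight initialization below are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_visible, n_hidden = 784, 100                            # assumed sizes, e.g. 28x28 images
W = rng.normal(scale=0.01, size=(n_visible, n_hidden))    # symmetric weights between the two layers
b = np.zeros(n_hidden)                                    # hidden biases

def infer_hidden(d):
    """p(h_j = 1 | d) for every hidden unit j; computed in one pass because the
    hidden units are conditionally independent given the data vector d."""
    return 1.0 / (1.0 + np.exp(-(d @ W + b)))
```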
Because it is time-consuming to train RBMs with the standard Boltzmann machine learning algorithm, Hinton proposed not to minimize the Kullback-Leibler divergence Q^0 ‖ Q^∞ between the data distribution Q^0 and the equilibrium distribution of fantasies over the visible units Q^∞, but to minimize the difference, called contrastive divergence, between Q^0 ‖ Q^∞ and Q^1 ‖ Q^∞. Q^1 is the distribution of one-step reconstructions of the data that are produced by first choosing hidden states according to their conditional distribution, given the data, and then choosing binary visible states, given the hidden states. For image data this leads to the learning rule ∆w_ij ∝ ⟨p_i p_j⟩_{Q^0} − ⟨p_i p_j⟩_{Q^1}, where p_i are the pixel intensities that have been scaled to [0,1], p_j = 1/(1 + exp(−Σ_i w_ij p_i)) is the expected value of the hidden units, and ⟨·⟩_{Q^k} denotes the expected value after k network updates.
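A minimal sketch of one such contrastive-divergence (CD-1) weight update is given below; the learning rate, the omission of biases, and the single-example update are simplifying assumptions, not details taken from the text.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(W, p_i, lr=0.01, rng=np.random.default_rng(0)):
    """One CD-1 step for image data: p_i are pixel intensities scaled to [0,1],
    W has shape (n_visible, n_hidden); biases are omitted for brevity."""
    # Positive phase (Q^0): expected hidden values given the data
    p_j0 = sigmoid(p_i @ W)
    h = (rng.random(p_j0.shape) < p_j0).astype(float)        # sample binary hidden states
    # One-step reconstruction (Q^1): binary visible states given the hidden states
    v1 = (rng.random(p_i.shape) < sigmoid(h @ W.T)).astype(float)
    p_j1 = sigmoid(v1 @ W)                                    # hidden expectations for the reconstruction
    # Delta w_ij is proportional to <p_i p_j>_{Q^0} - <p_i p_j>_{Q^1}
    return W + lr * (np.outer(p_i, p_j0) - np.outer(v1, p_j1))
```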
Since the hidden-unit activities are not independent, they can also be viewed as
data generated by a second PoE network. The hidden units of this second network
will then capture some of the remaining structure, but may still have dependencies
which can be analyzed by a third PoE network. Mayraz and Hinton [154] applied
this idea to the recognition of handwritten digits. They trained a separate hierarchy
of three PoE networks for each digit class using the MNIST [132] dataset. After
training, they observed that the units of the first hidden layer had localized receptive
fields, which described common local deviations from a class prototype. They used
log ∏_m p_m(d | θ_m) as unnormalized log-probability scores to measure the deviation of a digit from a class model. All 30 scores were fed to a linear classifier which was trained on a
validation set. When 500 hidden units were used in each layer, a test set error rate
of 1.7% was observed.
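A sketch of how such class-specific scores might be assembled is shown below; the free-energy-style scoring function, the bottom-up propagation of hidden expectations, and all names are assumptions made for illustration, not the exact procedure of [154].

```python
import numpy as np

def score(W, v):
    """Unnormalized log-probability of v under one PoE network (biases omitted):
    sum_j log(1 + exp(sum_i w_ij v_i))."""
    return float(np.sum(np.logaddexp(0.0, v @ W)))

def digit_scores(hierarchies, image):
    """hierarchies: one list of three weight matrices per digit class (10 x 3 = 30 networks).
    Returns the 30 scores that would be fed to the linear classifier."""
    scores = []
    for levels in hierarchies:                    # one three-level hierarchy per digit class
        v = image
        for W in levels:
            scores.append(score(W, v))
            v = 1.0 / (1.0 + np.exp(-(v @ W)))    # hidden expectations serve as data for the next level
    return np.array(scores)
```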