Hierarchical Products of Experts. Another approach that makes the learning of multi-level statistical image models possible is the products of experts (PoE) method recently proposed by Hinton [97]. Each expert specifies a probability distribution p_m(d | θ_m) over the visible variables d, and the n experts are combined by multiplying these distributions together and renormalizing: p(d | θ_1, ..., θ_n) = ∏_m p_m(d | θ_m) / Σ_c ∏_m p_m(c | θ_m), where c enumerates all possible vectors in data
space. The motivation for multiplying the experts is that the combined distribution
can be much sharper than the individual expert models. For example, each expert
can constrain only a small subset of the many image space dimensions and the prod-
uct will constrain all of them if the subsets cover the dimensions. Furthermore, the
PoE construction makes it easy to infer the values of the latent variables of each ex-
pert because the latent variables of different experts are conditionally independent,
given the data.
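To make the combination rule concrete, the following is a minimal sketch (not taken from [97]) that multiplies two hypothetical experts over a tiny binary data space and renormalizes by summing over all vectors c; the expert definitions and the dimensionality are illustrative assumptions only.

```python
import numpy as np
from itertools import product

def poe_distribution(experts, dim=3):
    """Normalized product-of-experts distribution over all binary vectors in {0,1}^dim."""
    vectors = list(product([0, 1], repeat=dim))
    # Unnormalized product prod_m p_m(c | theta_m) for every candidate vector c
    unnorm = np.array([np.prod([expert(c) for expert in experts]) for c in vectors])
    return vectors, unnorm / unnorm.sum()          # renormalize over the whole data space

# Each hypothetical expert constrains only one dimension ...
expert_a = lambda c: 0.9 if c[0] == 1 else 0.1     # prefers the first bit to be 1
expert_b = lambda c: 0.9 if c[1] == 0 else 0.1     # prefers the second bit to be 0

# ... yet their product concentrates probability mass on vectors satisfying both
# constraints, so the combined distribution is sharper than either expert alone.
vectors, probs = poe_distribution([expert_a, expert_b])
for v, p in sorted(zip(vectors, probs), key=lambda t: -t[1]):
    print(v, round(float(p), 3))
```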
One expert type for which this inference is tractable is the restricted Boltzmann machine (RBM) [218]. These networks consist of one visible layer and one hidden
layer. They have no intralayer connections. The vertical connections between the
binary stochastic units are symmetrical. Each hidden unit can be viewed as an expert
since the probability of the network generating a data vector is proportional to the
product of the probabilities that the data vector is generated by each of the hidden
units alone [74].
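The tractability of inference can be illustrated with a short sketch: given a data vector, each hidden unit's activation probability is obtained independently by one weighted sum followed by a sigmoid, with no iteration. The layer sizes and weight initialization below are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_visible, n_hidden = 784, 100                            # assumed sizes, e.g. 28x28 images
W = rng.normal(scale=0.01, size=(n_visible, n_hidden))    # symmetric weights between the two layers
b = np.zeros(n_hidden)                                    # hidden biases

def infer_hidden(d):
    """p(h_j = 1 | d) for every hidden unit j; computed in one pass because the
    hidden units are conditionally independent given the data vector d."""
    return 1.0 / (1.0 + np.exp(-(d @ W + b)))
```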
Because it is time-consuming to train RBMs with the standard Boltzmann machine learning algorithm, Hinton proposed not to minimize the Kullback-Leibler divergence Q^0 ‖ Q^∞ between the data distribution Q^0 and the equilibrium distribution of fantasies over the visible units Q^∞, but to minimize the difference, called contrastive divergence, between Q^0 ‖ Q^∞ and Q^1 ‖ Q^∞. Q^1 is the distribution of one-step reconstructions of the data that are produced by first choosing hidden states according to their conditional distribution, given the data, and then choosing binary visible states, given the hidden states. For image data this leads to the learning rule ∆w_ij ∝ ⟨p_i p_j⟩_{Q^0} − ⟨p_i p_j⟩_{Q^1}, where p_i are the pixel intensities that have been scaled to [0,1], p_j = 1/(1 + exp(−Σ_i w_ij p_i)) is the expected value of the hidden units, and ⟨·⟩_{Q^k} denotes the expected value after k network updates.
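A minimal sketch of one such contrastive-divergence (CD-1) weight update is given below; the learning rate, the omission of biases, and the single-example update are simplifying assumptions, not details taken from the text.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(W, p_i, lr=0.01, rng=np.random.default_rng(0)):
    """One CD-1 step for image data: p_i are pixel intensities scaled to [0,1],
    W has shape (n_visible, n_hidden); biases are omitted for brevity."""
    # Positive phase (Q^0): expected hidden values given the data
    p_j0 = sigmoid(p_i @ W)
    h = (rng.random(p_j0.shape) < p_j0).astype(float)        # sample binary hidden states
    # One-step reconstruction (Q^1): binary visible states given the hidden states
    v1 = (rng.random(p_i.shape) < sigmoid(h @ W.T)).astype(float)
    p_j1 = sigmoid(v1 @ W)                                    # hidden expectations for the reconstruction
    # Delta w_ij is proportional to <p_i p_j>_{Q^0} - <p_i p_j>_{Q^1}
    return W + lr * (np.outer(p_i, p_j0) - np.outer(v1, p_j1))
```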
Since the hidden-unit activities are not independent, they can also be viewed as
data generated by a second PoE network. The hidden units of this second network
will then capture some of the remaining structure, but may still have dependencies
which can be analyzed by a third PoE network. Mayraz and Hinton [154] applied
this idea to the recognition of handwritten digits. They trained a separate hierarchy
of three PoE networks for each digit class using the MNIST [132] dataset. After
training, they observed that the units of the first hidden layer had localized receptive
fields, which described common local deviations from a class prototype. They used
log ∏_m p_m(d | θ_m) as unnormalized log-probability scores to measure the deviation of a digit from a class model. All 30 scores were fed to a linear classifier which was trained on a
validation set. When 500 hidden units were used in each layer, a test set error rate
of 1.7% was observed.
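A sketch of how such class-specific scores might be assembled is shown below; the free-energy-style scoring function, the bottom-up propagation of hidden expectations, and all names are assumptions made for illustration, not the exact procedure of [154].

```python
import numpy as np

def score(W, v):
    """Unnormalized log-probability of v under one PoE network (biases omitted):
    sum_j log(1 + exp(sum_i w_ij v_i))."""
    return float(np.sum(np.logaddexp(0.0, v @ W)))

def digit_scores(hierarchies, image):
    """hierarchies: one list of three weight matrices per digit class (10 x 3 = 30 networks).
    Returns the 30 scores that would be fed to the linear classifier."""
    scores = []
    for levels in hierarchies:                    # one three-level hierarchy per digit class
        v = image
        for W in levels:
            scores.append(score(W, v))
            v = 1.0 / (1.0 + np.exp(-(v @ W)))    # hidden expectations serve as data for the next level
    return np.array(scores)
```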