dark regions of facial images, such as the eyes and the shadow of the nose. The
figure also shows the encoding h of a face and its reconstruction. Because both the
weights and the coefficients of h contain a large number of vanishing components,
the encoding is sparse. The reason for this is that the model is only allowed to add
positively weighted non-negative basis-vectors to the reconstruction. Thus, different
contributions cannot cancel each other out, as they can, for instance, in principal components analysis.
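A toy example makes this concrete. The following sketch (the 2×2 matrix and SciPy's nnls solver are choices of this illustration, not from the text) shows that with signed coefficients two basis-vectors can cancel each other, while under the non-negativity constraint the superfluous basis-vector is switched off entirely, yielding a sparser code:

```python
import numpy as np
from scipy.optimize import nnls  # non-negative least squares

W = np.array([[1.0, 0.0],
              [1.0, 1.0]])       # two overlapping non-negative basis-vectors
v = np.array([1.0, 0.0])

# Unconstrained solution: the second pixel is explained by cancellation,
# and both coefficients are active.
h_signed = np.linalg.solve(W, v)  # [1.0, -1.0]

# Non-negative solution: no cancellation is possible, so the second
# basis-vector receives a zero coefficient and the code is sparse.
h_nonneg, _ = nnls(W, v)          # [0.5, 0.0]
```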
Although the generative model is linear, inference of the hidden representation
h from an image v is highly non-linear. The reason for this is the non-negativity
constraint. It is not clear how the best hidden representation could be computed
directly from W and v . However, as seen above, h can be computed by a simple
iterative scheme. Because learning of weights should occur on a much slower time-
scale than this inference, W can be regarded as constant. Then only the update-
equations for H remain. When minimizing ‖v − Wh‖², h is sent in the top-down direction through W. The reconstruction Wh has dimension n and is passed in the bottom-up direction through Wᵀ. The resulting vector WᵀWh has the same number r of components as h. It is compared to Wᵀv, which is the image v passed in the bottom-up direction through Wᵀ. The comparison is done by element-wise division, yielding a vector of ones if the reconstruction is perfect: v = Wh. In this case, h is not changed.
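In code, this is the standard multiplicative NMF update for the squared error. The following NumPy sketch assumes a fixed non-negative W, a single image v, and a positive initial encoding h; the function name and the small eps guard against division by zero are additions of this sketch:

```python
import numpy as np

def update_h_squared_error(W, v, h, n_iter=50, eps=1e-9):
    """Iterate h to minimize ||v - Wh||^2 for fixed non-negative W.

    W: (n, r) weight matrix, v: (n,) image, h: (r,) encoding, all non-negative.
    """
    h = h.copy()
    bottom_up_v = W.T @ v                         # W^T v: image passed bottom-up
    for _ in range(n_iter):
        reconstruction = W @ h                    # Wh: top-down pass, dimension n
        bottom_up_rec = W.T @ reconstruction      # W^T W h: same dimension r as h
        h *= bottom_up_v / (bottom_up_rec + eps)  # factor is 1 where v = Wh
    return h
```

Because h is only ever multiplied by non-negative factors, it stays non-negative throughout, and the update leaves h unchanged exactly when the two bottom-up vectors agree.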
When minimizing D(v ‖ Wh), the similarity of v and its top-down reconstruction Wh is measured in the bottom layer of the network by element-wise division vᵢ/(Wh)ᵢ. The n-dimensional similarity-vector is passed in the bottom-up direction through Wᵀ, yielding a vector of dimension r. Its components are scaled down with the element-wise inverse of a vector of ones passed through Wᵀ, to make the update factors for h unity if the reconstruction is perfect.
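The corresponding sketch for the divergence, under the same fixed-W assumptions (again, eps is an addition of this sketch):

```python
import numpy as np

def update_h_divergence(W, v, h, n_iter=50, eps=1e-9):
    """Iterate h to minimize D(v || Wh) for fixed non-negative W."""
    h = h.copy()
    col_sums = W.sum(axis=0) + eps      # a vector of ones passed through W^T
    for _ in range(n_iter):
        ratio = v / (W @ h + eps)       # element-wise similarity v_i / (Wh)_i
        h *= (W.T @ ratio) / col_sums   # bottom-up pass, scaled to unit factors
    return h
```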
This scheme of expanding the hidden representation to the visible layer, mea-
suring differences to the observations in the visible layer, contracting the deviations
to the hidden layer, and updating the estimate resembles the operation of a Kalman
filter [116]. The difference is that in a Kalman filter deviations are measured as
differences and the update is additive, while in non-negative matrix factorization
deviations are measured as quotients and the updates are multiplicative. Because the
optimized cost function is convex for fixed W, the iterative algorithm is guaranteed to
find the optimal solution.
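Written side by side, the contrast reads as follows (the Kalman gain K, the vector of ones 𝟏, and the element-wise operators ⊙ and ⊘ are notation introduced here, not from the text):

```latex
% Kalman filter: deviations as differences, additive update
\hat{h} \;\leftarrow\; \hat{h} + K\,(v - W\hat{h})
% NMF (divergence): deviations as quotients, multiplicative update
h \;\leftarrow\; h \odot \frac{W^{\top}\bigl(v \oslash (Wh)\bigr)}{W^{\top}\mathbf{1}}
```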
Learning Continuous Attractors. In most models of associative memories, patterns
are stored as attractive fixed points at discrete locations in state space, as illustrated in Fig. 3.17(a).
Fig. 3.17. Representing objects by attractors: (a) discrete attractors represent isolated patterns;
(b) continuous attractors represent pattern manifolds (images after [209]).