Hierarchical Kalman Filters. If one does not use binary stochastic processing units, but instead a generative model that is a weighted sum of basis functions with added Gaussian noise, inference is tractable as well. The Kalman filter [116] makes it possible to infer the hidden causes from the data, even if the causes change over time according to a linear dynamical system. Rao [186] proposed using Kalman filters to learn image models. Segmentation and recognition of objects and image sequences were demonstrated in the presence of occlusions and clutter.
To account for extra-classical receptive-field effects in the early visual system, Rao and Ballard [187] combined several simplified Kalman filters in a hierarchical fashion. In this model, static images I are represented in terms of potential causes r: I = Ur + n, where n is zero-mean Gaussian noise with variance σ^2. The matrix U contains the basis vectors U_j that mediate between the causes and the image. To make the model hierarchical, the causes r are represented in terms of higher-level causes r^h: r = r^td + n^td, where r^td = U^h r^h is a top-down prediction of r and n^td is zero-mean Gaussian noise with variance σ_td^2.
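To make the two-level generative model concrete, the following NumPy sketch samples a single image patch from it; all dimensions, noise levels, and random basis matrices are illustrative assumptions rather than values used by Rao and Ballard.

import numpy as np

rng = np.random.default_rng(0)

# Assumed sizes: a 16 x 16 patch (256 pixels), 32 Level-1 causes, 16 higher-level causes.
n_pixels, n_r, n_rh = 256, 32, 16
sigma, sigma_td = 1.0, 1.0                         # assumed noise standard deviations

U = rng.normal(scale=0.1, size=(n_pixels, n_r))    # columns are the basis vectors U_j
U_h = rng.normal(scale=0.1, size=(n_r, n_rh))      # higher-level basis vectors

r_h = rng.normal(size=n_rh)                        # higher-level causes
r_td = U_h @ r_h                                   # top-down prediction of r
r = r_td + rng.normal(scale=sigma_td, size=n_r)    # r = r^td + n^td
I = U @ r + rng.normal(scale=sigma, size=n_pixels) # I = U r + n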
The goal is now to estimate, for each hierarchical level, the coefficients r for a given image and, on a longer time scale, to learn appropriate basis vectors U_j. This is achieved by minimizing:

E = \frac{1}{\sigma^2} (I - Ur)^T (I - Ur) + \frac{1}{\sigma_{td}^2} (r - r^{td})^T (r - r^{td}) + g(r) + h(U),

where g(r) = α Σ_i r_i^2 and h(U) = λ Σ_{i,j} U_{i,j}^2 are the negative logarithms of the Gaussian prior probabilities of r and U, respectively. The first two terms of E describe the negative logarithm of the probability of the data, given the parameters. They are the squared prediction errors for Level 1 and Level 2, weighted with the inverse variances.

An optimal estimate of r can be obtained by gradient descent on E with respect to r:

\frac{dr}{dt} = -\frac{k_1}{2} \frac{\partial E}{\partial r} = \frac{k_1}{\sigma^2} U^T (I - Ur) + \frac{k_1}{\sigma_{td}^2} (r^{td} - r) - k_1 \alpha r,
where k_1 is a positive constant. This computation is done in the predictive estimator (PE) module, sketched in Figure 3.11(a). It combines the bottom-up residual error (I - Ur) that has been passed through U^T with the top-down error (r^td - r) to improve r. Note that all the information required is available locally at each level.
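As a minimal illustration of this inference step, the sketch below performs the gradient descent on E with respect to r by simple Euler integration; the step size k1, the prior weight alpha, the iteration count, and the use of NumPy arrays are assumptions made only for the example.

def estimate_r(I, U, r_td, sigma=1.0, sigma_td=1.0, alpha=0.05, k1=0.01, n_steps=200):
    # Euler integration of dr/dt = -(k1/2) dE/dr, starting from the top-down prediction.
    r = r_td.copy()
    for _ in range(n_steps):
        bottom_up = U.T @ (I - U @ r) / sigma**2   # bottom-up residual error passed through U^T
        top_down = (r_td - r) / sigma_td**2        # error of the top-down prediction
        r = r + k1 * (bottom_up + top_down - alpha * r)
    return r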
A synaptic learning rule for adapting the weights U can be obtained by performing gradient descent on E with respect to U after the estimate r becomes stable:
\frac{dU}{dt} = -\frac{k_2}{2} \frac{\partial E}{\partial U} = \frac{k_2}{\sigma^2} (I - Ur) r^T - k_2 \lambda U,
where k_2 is the learning rate. This is a Hebbian [91] type of learning with weight decay.
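A corresponding sketch of the learning rule, again with assumed constants and NumPy arrays, applies one step of this dynamics once r has stabilized:

import numpy as np

def update_U(U, I, r, sigma=1.0, lam=0.02, k2=0.001):
    # One Euler step of dU/dt = -(k2/2) dE/dU: Hebbian outer product plus weight decay.
    residual = I - U @ r                                          # Level-1 prediction error
    return U + k2 * (np.outer(residual, r) / sigma**2 - lam * U)

In practice the two sketches would be alternated: run estimate_r until r settles for a given patch, then apply update_U, looping over many patches.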
Rao and Ballard applied this optimization to the three-layered network sketched
in Figure 3.11(b). In Level 0, three 16 × 16 image patches enter the network which