Hierarchical Kalman Filters.
If one does not use binary stochastic processing units, but instead a generative model that is a weighted sum of basis functions with added Gaussian noise, inference is tractable as well. The Kalman filter [116] makes it possible to infer the hidden causes from data, even if the causes change in time according to a linear dynamical system. Rao [186] proposed using Kalman filters to learn image models. Segmentation and recognition of objects and image sequences were demonstrated in the presence of occlusions and clutter.
To account for extra-classical receptive-field effects in the early visual system,
Rao and Ballard [187] combined several simplified Kalman filters in a hierarchical
fashion. In this model, static images $I$ are represented in terms of potential causes $r$: $I = Ur + n$, where $n$ is zero-mean Gaussian noise with variance $\sigma^2$. The matrix $U$ contains the basis vectors $U_j$ that mediate between the causes and the image. To make the model hierarchical, the causes $r$ are represented in terms of higher-level causes $r^h$: $r = r^{td} + n^{td}$, where $r^{td} = U^h r^h$ is a top-down prediction of $r$ and $n^{td}$ is zero-mean Gaussian noise with variance $\sigma_{td}^2$.
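To make the generative model concrete, the following NumPy sketch samples once from the two-level model; the dimensions, noise levels, and variable names (n_pixels, n_causes, U_h, and so on) are illustrative assumptions, not values from Rao and Ballard's experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions -- assumptions for this sketch, not values from the model
n_pixels, n_causes, n_causes_h = 256, 32, 16
sigma, sigma_td = 1.0, 0.5        # noise standard deviations (assumed)

U   = rng.normal(size=(n_pixels, n_causes))    # basis vectors U_j as columns
U_h = rng.normal(size=(n_causes, n_causes_h))  # higher-level basis U^h

# Sample once from the two-level generative model
r_h  = rng.normal(size=n_causes_h)                        # higher-level causes r^h
r_td = U_h @ r_h                                          # top-down prediction r^td
r    = r_td + rng.normal(scale=sigma_td, size=n_causes)   # r = r^td + n^td
I    = U @ r + rng.normal(scale=sigma, size=n_pixels)     # I = U r + n
```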
The goal is now to estimate, for each hierarchical level, the coefficients $r$ for a given image and, on a longer time scale, to learn appropriate basis vectors $U_j$. This is achieved by minimizing:

$$E = \frac{1}{\sigma^2}(I - Ur)^T(I - Ur) + \frac{1}{\sigma_{td}^2}(r - r^{td})^T(r - r^{td}) + g(r) + h(U),$$

where $g(r) = \alpha \sum_i r_i^2$ and $h(U) = \lambda \sum_{i,j} U_{i,j}^2$ are the negative logarithms of the Gaussian prior probabilities of $r$ and $U$, respectively. The first two terms of $E$ give the negative logarithm of the probability of the data, given the parameters. They are the squared prediction errors for Level 1 and Level 2, weighted by the inverse variances.
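As a hedged illustration, the cost $E$ can be written down directly in NumPy, using the same hypothetical variable names as in the sketch above; alpha and lam stand for the prior weights $\alpha$ and $\lambda$.

```python
import numpy as np

def energy(I, r, r_td, U, sigma, sigma_td, alpha, lam):
    """Cost E: squared prediction errors of Level 1 and Level 2, weighted by
    the inverse variances, plus the Gaussian priors g(r) and h(U)."""
    e1 = I - U @ r                      # Level-1 (bottom-up) prediction error
    e2 = r - r_td                       # Level-2 (top-down) prediction error
    return (e1 @ e1 / sigma**2
            + e2 @ e2 / sigma_td**2
            + alpha * np.sum(r**2)      # g(r) = alpha * sum_i r_i^2
            + lam * np.sum(U**2))       # h(U) = lambda * sum_ij U_ij^2
```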
An optimal estimate of $r$ can be obtained by gradient descent on $E$ with respect to $r$:

$$\frac{dr}{dt} = -\frac{k_1}{2}\frac{\partial E}{\partial r} = \frac{k_1}{\sigma^2} U^T (I - Ur) + \frac{k_1}{\sigma_{td}^2}(r^{td} - r) - k_1 \alpha r,$$

where $k_1$ is a positive constant. This computation is done in the predictive estimator (PE) module, sketched in Figure 3.11(a). It combines the bottom-up residual error $(I - Ur)$ that has been passed through $U^T$ with the top-down error $(r^{td} - r)$ to improve $r$. Note that all the information required is available locally at each level.
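A minimal sketch of this estimation step, assuming a simple Euler discretization with an illustrative step size dt and the hypothetical names used above; the top-down prediction r_td is treated as fixed during the update.

```python
import numpy as np

def update_r(I, r, r_td, U, sigma, sigma_td, alpha, k1, dt=0.01):
    """One Euler step of dr/dt = -(k1/2) dE/dr at a single level."""
    bottom_up = (k1 / sigma**2) * (U.T @ (I - U @ r))    # residual error passed through U^T
    top_down  = (k1 / sigma_td**2) * (r_td - r)          # top-down error
    decay     = k1 * alpha * r                           # contribution of the prior g(r)
    return r + dt * (bottom_up + top_down - decay)
```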
A synaptic learning rule for adapting the weights $U$ can be obtained by performing gradient descent on $E$ with respect to $U$ after the estimate $r$ becomes stable:

$$\frac{dU}{dt} = -\frac{k_2}{2}\frac{\partial E}{\partial U} = \frac{k_2}{\sigma^2}(I - Ur)\, r^T - k_2 \lambda U,$$

where $k_2$ is the learning rate. This is a Hebbian [91] type of learning with weight decay.
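The corresponding weight update can be sketched in the same style, again with an assumed Euler discretization; as stated above, it would be applied only after the estimate r has stabilized.

```python
import numpy as np

def update_U(I, r, U, sigma, lam, k2, dt=0.01):
    """One Euler step of dU/dt = -(k2/2) dE/dU, applied after r has stabilized."""
    hebbian = (k2 / sigma**2) * np.outer(I - U @ r, r)   # residual error times presynaptic activity
    decay   = k2 * lam * U                                # weight decay from the prior h(U)
    return U + dt * (hebbian - decay)
```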
Rao and Ballard applied this optimization to the three-layered network sketched
in Figure 3.11(b). In Level 0, three 16 × 16 image patches enter the network which