The average ⟨·⟩_0 can be readily computed using the sample data v, but the average ⟨·⟩_∞ involves the normalization constant Z, which cannot generally be computed efficiently (being a sum of an exponential number of terms).
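To make the intractability of Z concrete, the following sketch computes it by brute force for a toy binary RBM; the parameter values and sizes are illustrative assumptions, not from the text. The sum runs over all 2^(V+H) joint configurations, which is feasible only at toy scale.

```python
import itertools
import numpy as np

# Toy binary RBM with V visible and H hidden units; the weights and
# biases below are arbitrary illustrative values.
V, H = 3, 2
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(V, H))  # weights w_ij
b_v = np.zeros(V)                       # visible biases
b_h = np.zeros(H)                       # hidden biases

def energy(v, h):
    # Standard binary RBM energy: E(v,h) = -v.b_v - h.b_h - v^T W h
    return -v @ b_v - h @ b_h - v @ W @ h

# Z sums exp(-E) over ALL 2^V * 2^H joint configurations -- already
# 32 terms for this toy model, and exponential in general.
Z = sum(np.exp(-energy(np.array(v), np.array(h)))
        for v in itertools.product([0, 1], repeat=V)
        for h in itertools.product([0, 1], repeat=H))
print(Z)
```

Doubling V or H quadruples or more the number of terms, which is why ⟨·⟩_∞ cannot be evaluated exactly for realistic model sizes.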
To avoid the difficulty in computing the log-likelihood gradient, Hinton (2002) proposed the Contrastive Divergence (CD) algorithm, which approximately follows the gradient of the difference of two divergences:
\[
\frac{\partial \log p(v)}{\partial w_{ij}} \approx \langle v_i h_j \rangle_0 - \langle v_i h_j \rangle_k \qquad (19.8)
\]
The expectation ⟨·⟩_k represents an average under the distribution obtained by running a Gibbs sampler (cf. (19.2) and (19.3)), initialized at the data, for k full steps. This process is shown in Fig. 19.2. In practice, we typically choose k = 1. This is a rather crude approximation of the true maximum likelihood gradient, but it works well in practice.
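The CD-k procedure described above can be sketched as follows for a binary RBM. The function names, array shapes, and sigmoid-based conditionals are assumptions based on the standard formulation, not code from the text; the returned matrix is the per-sample estimate of the two expectations in Eq. (19.8).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd_k(v0, W, b_v, b_h, k=1, rng=None):
    """CD-k estimate of <v_i h_j>_0 - <v_i h_j>_k for one data vector v0."""
    rng = rng or np.random.default_rng()
    # Positive phase: hidden probabilities given the data v_0
    h0_prob = sigmoid(v0 @ W + b_h)
    v, h_prob = v0.copy(), h0_prob
    for _ in range(k):                      # k full Gibbs steps
        h = (rng.random(h_prob.shape) < h_prob).astype(float)
        v_prob = sigmoid(h @ W.T + b_v)     # reconstruct visibles
        v = (rng.random(v_prob.shape) < v_prob).astype(float)
        h_prob = sigmoid(v @ W + b_h)       # hidden probs after step k
    # Negative phase uses the k-step sample; the outer products give
    # the two expectations in Eq. (19.8)
    return np.outer(v0, h0_prob) - np.outer(v, h_prob)

# Usage sketch with illustrative shapes (6 visible, 4 hidden units)
rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(6, 4))
grad = cd_k(np.array([1., 0., 1., 1., 0., 0.]), W,
            np.zeros(6), np.zeros(4), k=1, rng=rng)
print(grad.shape)  # (6, 4)
```

In a training loop this estimate would be averaged over a mini-batch and added to w_ij with a learning rate; setting k = 1 gives the usual CD-1 variant mentioned above.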
19.2.2 Gaussian-Bernoulli RBM
In most speech applications the input data are real-valued. A popular approach to modeling this kind of data is to normalize each input variable to fall into the range [0, 1] and treat it as a probability. However, even though this approach might seem appropriate at first glance, it has serious drawbacks, as it poorly models the underlying data distribution of true real-valued processes (Wang et al. 2013). To cope with this problem we use a slightly modified RBM in the input layer, referred to as a Gaussian-Bernoulli Restricted Boltzmann Machine (GBRBM). In this model we replace the binary visible units with visible units sampled from a Gaussian distribution, using a modified energy function:
\[
E(v, h) = \sum_{i=1}^{V} \frac{(v_i - b_i)^2}{2\sigma_i^2} - \sum_{i=1}^{V} \sum_{j=1}^{H} \frac{v_i}{\sigma_i} h_j w_{ij} - \sum_{j=1}^{H} h_j b_j \qquad (19.9)
\]
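A minimal sketch of the GBRBM energy in Eq. (19.9), assuming per-visible-unit standard deviations σ_i; the function name and the illustrative parameter values in the usage example are my own, not from the text.

```python
import numpy as np

def gbrbm_energy(v, h, W, b_v, b_h, sigma):
    """Energy of a Gaussian-Bernoulli RBM, Eq. (19.9)."""
    # Quadratic term for real-valued visibles: sum_i (v_i - b_i)^2 / (2 sigma_i^2)
    quad = np.sum((v - b_v) ** 2 / (2.0 * sigma ** 2))
    # Interaction term: sum_i sum_j (v_i / sigma_i) w_ij h_j
    inter = (v / sigma) @ W @ h
    # Hidden bias term: sum_j h_j b_j
    hid = h @ b_h
    return quad - inter - hid

# Usage with tiny illustrative values (2 visible, 2 hidden units)
v = np.array([1.0, 2.0])
h = np.array([1.0, 0.0])
E = gbrbm_energy(v, h, np.ones((2, 2)), np.zeros(2), np.zeros(2), np.ones(2))
print(E)  # -0.5
```

With σ_i fixed to 1 (a common simplification when the inputs are mean-and-variance normalized), the interaction and bias terms reduce to those of the binary RBM, and only the quadratic visible term differs.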
Fig. 19.2 Illustration of k-step Gibbs sampling for approximating the model data distribution, initialized at the data v_0