The average $\langle\cdot\rangle_0$ can be readily computed from the sample data $\mathbf{v}$, but the average $\langle\cdot\rangle_\infty$ involves the normalization constant $Z$, which cannot generally be computed efficiently, as it is a sum over an exponential number of terms.
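To make the exponential cost concrete, the toy sketch below evaluates $Z$ by brute force for a binary RBM with the standard bipartite energy $E(\mathbf{v},\mathbf{h}) = -\mathbf{v}^\top W \mathbf{h} - \mathbf{b}^\top \mathbf{v} - \mathbf{c}^\top \mathbf{h}$; the variable names and tiny layer sizes are illustrative assumptions, not the chapter's notation.

```python
import itertools
import numpy as np

# Toy brute-force partition function for a binary RBM with energy
# E(v, h) = -v'Wh - b'v - c'h (names are illustrative assumptions).
# Z sums exp(-E) over all 2^(V+H) joint states, which is why it is
# intractable for realistic layer sizes.

rng = np.random.default_rng(0)
V, H = 4, 3                          # tiny layer sizes (assumption)
W = rng.normal(scale=0.1, size=(V, H))
b = rng.normal(scale=0.1, size=V)    # visible biases
c = rng.normal(scale=0.1, size=H)    # hidden biases

def energy(v, h):
    return -(v @ W @ h + b @ v + c @ h)

Z = sum(
    np.exp(-energy(np.array(v), np.array(h)))
    for v in itertools.product([0, 1], repeat=V)
    for h in itertools.product([0, 1], repeat=H)
)
print(f"Z summed over {2 ** (V + H)} states: {Z:.4f}")
```

Already at $V = 4$, $H = 3$ the sum runs over $2^{7} = 128$ states; the count doubles with every added unit.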
To avoid the difficulty of computing the log-likelihood gradient, Hinton (2002) proposed the Contrastive Divergence (CD) algorithm, which approximately follows the gradient of the difference of two divergences:
$$
\frac{\partial \log p(\mathbf{v})}{\partial w_{ij}} \approx \langle v_i h_j \rangle_0 - \langle v_i h_j \rangle_k
\qquad (19.8)
$$
The expectation $\langle\cdot\rangle_k$ is taken under the distribution obtained by running a Gibbs sampler (cf. (19.2) and (19.3)), initialized at the data, for $k$ full steps. This process is shown in Fig. 19.2. In practice we typically choose $k = 1$. This is a rather crude approximation of the true log-likelihood gradient, but it works well in practice.
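As an illustration, a minimal NumPy sketch of one CD-$k$ parameter update following (19.8) might look as follows; the function name, batch layout, and learning rate are assumptions for the sketch, not the chapter's code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd_k_update(W, b, c, v0, k=1, lr=0.01, rng=None):
    """One CD-k step for a binary RBM (sketch following Eq. (19.8)).

    W: (V, H) weights, b: (V,) visible biases, c: (H,) hidden biases,
    v0: (N, V) batch of training vectors.
    """
    rng = np.random.default_rng() if rng is None else rng

    # Positive phase: <v_i h_j>_0, hidden probabilities driven by the data.
    ph0 = sigmoid(v0 @ W + c)
    pos = v0.T @ ph0

    # Negative phase: k full Gibbs steps initialized at the data.
    v, ph = v0, ph0
    for _ in range(k):
        h = (rng.random(ph.shape) < ph).astype(float)  # sample h ~ p(h|v)
        pv = sigmoid(h @ W.T + b)                      # p(v|h)
        v = (rng.random(pv.shape) < pv).astype(float)  # sample v ~ p(v|h)
        ph = sigmoid(v @ W + c)                        # p(h|v) for next step
    neg = v.T @ ph                                     # <v_i h_j>_k

    n = v0.shape[0]
    W += lr * (pos - neg) / n
    b += lr * (v0 - v).mean(axis=0)
    c += lr * (ph0 - ph).mean(axis=0)
    return W, b, c
```

Using the hidden probabilities rather than sampled states for the statistics, as above, is the common variance-reduction choice in CD implementations.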
19.2.2 Gaussian-Bernoulli RBM
In most speech applications the input data are real-valued. A popular approach to modeling such data is to normalize each input variable to the range [0, 1] and treat it as a probability. Although this approach might seem appropriate at first glance, it has serious drawbacks, as it poorly models the underlying distribution of truly real-valued processes (Wang et al. 2013). To cope with real-valued inputs, we use a slightly modified RBM in the input layer, referred to as the Gaussian-Bernoulli Restricted Boltzmann Machine (GBRBM). In this model the binary visible units are replaced by visible units sampled from a Gaussian distribution, using the modified energy function:
$$
E(\mathbf{v},\mathbf{h}) = \sum_{i=1}^{V} \frac{(v_i - b_i)^2}{2\sigma_i^2}
- \sum_{i=1}^{V}\sum_{j=1}^{H} \frac{v_i}{\sigma_i}\, h_j w_{ij}
- \sum_{j=1}^{H} h_j b_j
\qquad (19.9)
$$

where $\sigma_i$ denotes the standard deviation of the Gaussian noise model associated with visible unit $i$.
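Completing the square in (19.9) gives the conditionals $p(h_j = 1 \mid \mathbf{v}) = \mathrm{sigmoid}\big(b_j + \sum_i (v_i/\sigma_i) w_{ij}\big)$ and $v_i \mid \mathbf{h} \sim \mathcal{N}\big(b_i + \sigma_i \sum_j w_{ij} h_j,\ \sigma_i^2\big)$. Below is a minimal NumPy sketch of the two GBRBM sampling steps; the function names and the use of separate arrays b (visible biases) and c (hidden biases) are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gbrbm_sample_h(v, W, c, sigma, rng):
    """Sample hidden units: p(h_j=1|v) = sigmoid(c_j + sum_i (v_i/sigma_i) w_ij)."""
    ph = sigmoid((v / sigma) @ W + c)
    return ph, (rng.random(ph.shape) < ph).astype(float)

def gbrbm_sample_v(h, W, b, sigma, rng):
    """Sample visible units: v_i|h ~ N(b_i + sigma_i * sum_j w_ij h_j, sigma_i^2)."""
    mean = b + sigma * (h @ W.T)
    return rng.normal(mean, sigma)
```

The only changes relative to the binary RBM are the $1/\sigma_i$ scaling of the visible units when driving the hidden layer and the Gaussian (rather than Bernoulli) reconstruction of the visible layer.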
Fig. 19.2 Illustration of $k$-step Gibbs sampling for approximating the model distribution, initialized at the data $\mathbf{v}_0$