of the generatively learned features may be irrelevant for the discrimination task,
but those that are relevant are usually much more useful than the input features,
because they capture the complex higher-order statistical structure that is present in
the input data.
Erhan et al. (2010) showed that greedy layer-wise unsupervised pre-training plays a crucial role in deep learning by introducing a useful prior into the supervised fine-tuning procedure. The regularization effect is attributed to the pre-training establishing an initialization point for fine-tuning inside a region of parameter space to which the parameters are henceforth effectively restricted. Furthermore, overfitting can be substantially reduced if a generative model is used to find sensible features without making any use of the labels.
Strictly speaking, a DBN is a generative model consisting of several RBM layers.
However, a DBN can be used to initialize the hidden layers of a standard feed-
forward DNN. An additional output layer is then built on top of the DNN, typically
a softmax layer for classification tasks or a linear layer for regression tasks. In the
literature the terms DBN and DNN are often used interchangeably.
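The initialization scheme just described can be sketched in a few lines of numpy. This is a minimal illustration, not the authors' implementation: the function name `init_dnn_from_rbms`, the list-of-(weights, biases) layer format, and the small Gaussian initialization of the output layer are all assumptions made for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_dnn_from_rbms(rbm_weights, rbm_biases, n_classes, rng):
    """Copy generatively pre-trained RBM parameters into a feed-forward
    stack and append a randomly initialized softmax output layer.
    (Hypothetical helper for illustration only.)"""
    layers = [(W.copy(), b.copy()) for W, b in zip(rbm_weights, rbm_biases)]
    n_hidden = rbm_weights[-1].shape[1]
    W_out = 0.01 * rng.standard_normal((n_hidden, n_classes))  # assumed init scale
    layers.append((W_out, np.zeros(n_classes)))
    return layers

def forward(layers, x):
    """Forward pass through the resulting DNN: sigmoid hidden layers,
    softmax output layer for classification."""
    h = x
    for W, b in layers[:-1]:
        h = sigmoid(h @ W + b)
    W, b = layers[-1]
    logits = h @ W + b
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)
```

The pre-trained weights then serve only as a starting point; all layers, including the new output layer, are subsequently adjusted by supervised fine-tuning (e.g. backpropagation).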
19.2.4 Dropout
Despite their big success, DNNs suffer from a major weakness. Due to their many
non-linear hidden layers they are very expressive models and are thus very prone
to the phenomenon of overfitting. This term describes the effect that a large feed-
forward neural network typically performs poorly on held-out test data when trained
on a small training set, as is often the case in computational paralinguistics.
Dropout was introduced by Hinton et al. (2012) as a powerful technique for reducing overfitting and improving the generalization of large neural networks. Each hidden unit is randomly omitted from the network with probability p for each training case, so that a unit cannot rely on other specific hidden units being present. This prevents complex co-adaptations in which a hidden unit is helpful only in the context of several other particular units; instead, each unit learns to detect a feature that is generally helpful for producing the correct answer given the combinatorially large variety of internal contexts in which it must operate.
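A minimal numpy sketch of this masking step during the training forward pass follows. Note one assumption: the original formulation of Hinton et al. (2012) scales the weights down at test time, whereas the variant below ("inverted dropout") rescales the kept activations during training instead, so that the test-time pass needs no change; both are mathematically equivalent in expectation.

```python
import numpy as np

def dropout_forward(h, p, rng):
    """Zero each hidden activation independently with probability p
    (training only), then rescale the survivors by 1/(1-p) so the
    expected activation matches the unmasked forward pass."""
    mask = rng.random(h.shape) >= p   # keep a unit with probability 1 - p
    return h * mask / (1.0 - p)

rng = np.random.default_rng(0)
h = np.ones((4, 8))                   # toy hidden activations
out = dropout_forward(h, p=0.5, rng=rng)
```

With p = 0.5, roughly half of the units in `out` are zeroed and the rest are doubled, so on average the layer's output magnitude is preserved.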
This is equivalent to adding a particular type of noise to the hidden unit
activations during the forward pass in training, similar to the noise added to the input
units in the denoising auto-encoder approach presented in Vincent et al. (2008).
However, unlike in the auto-encoder pre-training, dropout can be used in all hidden
and input layers of a network and even during the fine-tuning stage of training.
An interesting way to view dropout is to consider it as a very efficient method of
model averaging. Averaging the predictions of a large number of different networks
is a well-known approach to reduce the test error. With neural networks this can
be achieved by training many separate networks and then applying each of them
to the test data, but especially with deep networks this is computationally very