of the generatively learned features may be irrelevant for the discrimination task,
but those that are relevant are usually much more useful than the input features,
because they capture the complex higher-order statistical structure that is present in
the input data.
Erhan et al. (2010) showed that greedy layer-wise unsupervised pre-training plays a crucial role in deep learning by introducing a useful prior into the supervised fine-tuning procedure. The regularization effect is attributed to the pre-training establishing an initialization point for fine-tuning inside a region of parameter space to which the parameters are henceforth effectively restricted. Furthermore, overfitting can be substantially reduced if a generative model is used to find sensible features without making any use of the labels.
Strictly speaking, a DBN is a generative model consisting of several RBM layers.
However, a DBN can be used to initialize the hidden layers of a standard feed-
forward DNN. An additional output layer is then built on top of the DNN, typically
a softmax layer for classification tasks or a linear layer for regression tasks. In the
literature the terms DBN and DNN are often used interchangeably.
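The initialization scheme just described can be sketched in a few lines of numpy. This is a minimal illustration, not the authors' implementation: the function name `init_dnn_from_rbms`, the list-of-(weights, biases) layer format, and the small Gaussian initialization of the output layer are all assumptions made for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_dnn_from_rbms(rbm_weights, rbm_biases, n_classes, rng):
    """Copy generatively pre-trained RBM parameters into a feed-forward
    stack and append a randomly initialized softmax output layer.
    (Hypothetical helper for illustration only.)"""
    layers = [(W.copy(), b.copy()) for W, b in zip(rbm_weights, rbm_biases)]
    n_hidden = rbm_weights[-1].shape[1]
    W_out = 0.01 * rng.standard_normal((n_hidden, n_classes))  # assumed init scale
    layers.append((W_out, np.zeros(n_classes)))
    return layers

def forward(layers, x):
    """Forward pass through the resulting DNN: sigmoid hidden layers,
    softmax output layer for classification."""
    h = x
    for W, b in layers[:-1]:
        h = sigmoid(h @ W + b)
    W, b = layers[-1]
    logits = h @ W + b
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)
```

The pre-trained weights then serve only as a starting point; all layers, including the new output layer, are subsequently adjusted by supervised fine-tuning (e.g. backpropagation).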
19.2.4 Dropout
Despite their big success, DNNs suffer from a major weakness. Due to their many
non-linear hidden layers they are very expressive models and are thus very prone
to the phenomenon of overfitting. This term describes the effect that a large feed-
forward neural network typically performs poorly on held-out test data when trained
on a small training set, as is often the case in computational paralinguistics.
Dropout was introduced by Hinton et al. (2012) as a powerful technique for reducing overfitting and improving the generalization of large neural networks. Each hidden unit is randomly omitted from the network with probability p for each training case, so that a unit cannot rely on other specific hidden units being present. This prevents complex co-adaptations in which a hidden unit is helpful only in the context of several other particular units; instead, each unit learns to detect a feature that is generally helpful for producing the correct answer given the combinatorially large variety of internal contexts in which it must operate.
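A minimal numpy sketch of this masking step during the training forward pass follows. Note one assumption: the original formulation of Hinton et al. (2012) scales the weights down at test time, whereas the variant below ("inverted dropout") rescales the kept activations during training instead, so that the test-time pass needs no change; both are mathematically equivalent in expectation.

```python
import numpy as np

def dropout_forward(h, p, rng):
    """Zero each hidden activation independently with probability p
    (training only), then rescale the survivors by 1/(1-p) so the
    expected activation matches the unmasked forward pass."""
    mask = rng.random(h.shape) >= p   # keep a unit with probability 1 - p
    return h * mask / (1.0 - p)

rng = np.random.default_rng(0)
h = np.ones((4, 8))                   # toy hidden activations
out = dropout_forward(h, p=0.5, rng=rng)
```

With p = 0.5, roughly half of the units in `out` are zeroed and the rest are doubled, so on average the layer's output magnitude is preserved.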
This is equivalent to adding a particular type of noise to the hidden unit
activations during the forward pass in training, similar to the noise added to the input
units in the denoising auto-encoder approach presented in Vincent et al. (2008).
However, unlike in the auto-encoder pre-training, dropout can be used in all hidden
and input layers of a network and even during the fine-tuning stage of training.
An interesting way to view dropout is to consider it as a very efficient method of
model averaging. Averaging the predictions of a large number of different networks
is a well-known approach to reduce the test error. With neural networks this can
be achieved by training many separate networks and then applying each of them
to the test data, but especially with deep networks this is computationally very