expensive both during training and testing. With the dropout technique, training a huge number of neural networks in reasonable time becomes feasible. By randomly dropping out a certain percentage of hidden units, almost certainly a different network is used for each training case (a network with $n$ hidden units gives rise to up to $2^n$ possible thinned networks). Note that all of these networks share the same weights for the hidden units that are not omitted, which explains the strong regularization effect of dropout.
At test time dropout is not used and we use the “average network” with all hidden units active. However, at this stage more hidden units are active than during training. In order to compensate for this fact, during training we multiply the net input from the layer below by a factor of $1/(1-p)$, as in Dahl et al. (2013), where $p$ is the probability of the hidden units in the lower layer being dropped out. Thus the activation $\mathbf{y}^{\ell}$ of layer $\ell$ during the forward pass becomes

$$\mathbf{y}^{\ell} = g\!\left(\frac{1}{1-p}\,\bigl(\mathbf{y}^{\ell-1} \odot \mathbf{M}\bigr)\,\mathbf{W}^{\ell} + \mathbf{b}^{\ell}\right), \qquad (19.12)$$

where $g(\cdot)$ is the activation function of layer $\ell$, $\mathbf{W}^{\ell}$ and $\mathbf{b}^{\ell}$ are the weights and biases of the layer, respectively, $\odot$ denotes element-wise multiplication, and $\mathbf{M}$ is a binary mask matrix whose elements are sampled i.i.d. from a Bernoulli$(1-p)$ distribution. The factor $1/(1-p)$, which is used during training, ensures that at test time the layer inputs are scaled correctly.
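The following minimal NumPy sketch illustrates the forward pass of Eq. (19.12) for a single layer; the function name, argument layout, and the use of NumPy are illustrative assumptions rather than part of the original text.

```python
import numpy as np

def dropout_forward(y_prev, W, b, p, g=np.tanh, train=True, rng=np.random):
    """One-layer forward pass with dropout as in Eq. (19.12) (illustrative sketch).

    y_prev : activations of the layer below, shape (batch, n_in)
    W, b   : weight matrix (n_in, n_out) and bias vector (n_out,)
    p      : probability of dropping a unit in the lower layer
    g      : activation function of the layer
    """
    if train:
        # Binary mask M: each element is kept with probability 1 - p
        M = rng.binomial(1, 1.0 - p, size=y_prev.shape)
        # Scale the masked input by 1 / (1 - p) so that the test-time
        # "average network" receives correctly scaled layer inputs
        net = (1.0 / (1.0 - p)) * (y_prev * M) @ W + b
    else:
        # Test time: all hidden units active, no mask and no scaling
        net = y_prev @ W + b
    return g(net)
```

With, for example, $p = 0.5$, the surviving activations are doubled during training, so the expected net input seen at test time remains unchanged.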
As mentioned above, dropout strongly reduces overfitting and leads to more robust models. However, applying dropout also increases the training time of the networks. The advantage is that larger networks can then be used to obtain better results, an observation that will be confirmed in our experiments. In the past, dropout has resulted in substantial improvements on many benchmark tasks in speech and object recognition, and we will see that dropout also yields improvements in the automatic prediction of conflict levels.
19.2.5 Rectified Linear Units
The key computational unit of a deep network is a linear projection followed by a point-wise non-linearity, typically a logistic sigmoid or tanh function. Replacing this function with the recently proposed rectified linear unit (ReLU) has been shown to improve generalization and to make training of deep networks faster and simpler (Zeiler et al. 2013; Maas et al. 2013), and has become state of the art in speech and object recognition. A ReLU is linear when its input is positive and zero otherwise, and is given by

$$g(x) = \max(x, 0) = \begin{cases} x, & \text{if } x > 0 \\ 0, & \text{else} \end{cases} \qquad (19.13)$$
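As a minimal illustration of Eq. (19.13) (the function name and the use of NumPy are assumptions, not part of the original text):

```python
import numpy as np

def relu(x):
    """Rectified linear unit, Eq. (19.13): identity for positive inputs, zero otherwise."""
    return np.maximum(x, 0.0)

# Negative inputs are clipped to zero, positive inputs pass through unchanged
print(relu(np.array([-2.0, -0.5, 0.0, 1.5, 3.0])))  # [0.  0.  0.  1.5 3. ]
```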