expensive both during training and testing. With the dropout technique, training a huge number of neural networks in reasonable time becomes feasible. By randomly dropping out a certain percentage of hidden units, almost certainly a different network is used for each training case (a network with $n$ hidden units gives rise to up to $2^n$ possible thinned networks). Note that all of these networks share the same weights for the hidden units that are not omitted, which explains the strong regularization effect of dropout.
At test time dropout is not used and we use the “average network” with all hidden units active. However, at this stage more hidden units are active than during training. In order to compensate for this fact, during training we multiply the net input from the layer below by a factor of $1/(1-p)$, as in Dahl et al. (2013), where $p$ is the probability of the hidden units in the lower layer being dropped out. Thus the activation $\mathbf{y}^{\ell}$ of layer $\ell$ during the forward pass becomes

$$\mathbf{y}^{\ell} = g\!\left(\frac{1}{1-p}\,\bigl(\mathbf{y}^{\ell-1} \odot \mathbf{M}\bigr)\,\mathbf{W}^{\ell} + \mathbf{b}^{\ell}\right), \qquad (19.12)$$

where $g(\cdot)$ is the activation function of layer $\ell$, $\mathbf{W}^{\ell}$ and $\mathbf{b}^{\ell}$ are the weights and biases of the layer, respectively, $\odot$ denotes element-wise multiplication, and $\mathbf{M}$ is a binary mask matrix whose elements are sampled i.i.d. from a Bernoulli$(1-p)$ distribution. The factor $1/(1-p)$, which is used during training, ensures that at test time the layer inputs are scaled correctly.
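The following minimal NumPy sketch illustrates the forward pass of Eq. (19.12) for a single layer; the function name, argument layout, and the use of NumPy are illustrative assumptions rather than part of the original text.

```python
import numpy as np

def dropout_forward(y_prev, W, b, p, g=np.tanh, train=True, rng=np.random):
    """One-layer forward pass with dropout as in Eq. (19.12) (illustrative sketch).

    y_prev : activations of the layer below, shape (batch, n_in)
    W, b   : weight matrix (n_in, n_out) and bias vector (n_out,)
    p      : probability of dropping a unit in the lower layer
    g      : activation function of the layer
    """
    if train:
        # Binary mask M: each element is kept with probability 1 - p
        M = rng.binomial(1, 1.0 - p, size=y_prev.shape)
        # Scale the masked input by 1 / (1 - p) so that the test-time
        # "average network" receives correctly scaled layer inputs
        net = (1.0 / (1.0 - p)) * (y_prev * M) @ W + b
    else:
        # Test time: all hidden units active, no mask and no scaling
        net = y_prev @ W + b
    return g(net)
```

With, for example, $p = 0.5$, the surviving activations are doubled during training, so the expected net input seen at test time remains unchanged.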
As mentioned above, dropout strongly reduces overfitting and leads to more robust models. However, applying dropout also increases the training time of the networks. The advantage is that larger networks can then be used to obtain better results, an observation that will be confirmed in our experiments. In the past, dropout has resulted in substantial improvements on many benchmark tasks in speech and object recognition, and we will see that dropout also yields improvements in the automatic prediction of conflict levels.
19.2.5 Rectified Linear Units
The key computational unit of a deep network is a linear projection followed by a point-wise non-linearity, typically a logistic sigmoid or tanh function. Replacing this function with the recently proposed rectified linear unit (ReLU) has been shown to improve generalization and to make training of deep networks faster and simpler (Zeiler et al. 2013; Maas et al. 2013), and has become state of the art in speech and object recognition. A ReLU is linear when its input is positive and zero otherwise, and is given by

$$g(x) = \max(x, 0) = \begin{cases} x, & \text{if } x > 0 \\ 0, & \text{else} \end{cases} \qquad (19.13)$$
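As a minimal illustration of Eq. (19.13) (the function name and the use of NumPy are assumptions, not part of the original text):

```python
import numpy as np

def relu(x):
    """Rectified linear unit, Eq. (19.13): identity for positive inputs, zero otherwise."""
    return np.maximum(x, 0.0)

# Negative inputs are clipped to zero, positive inputs pass through unchanged
print(relu(np.array([-2.0, -0.5, 0.0, 1.5, 3.0])))  # [0.  0.  0.  1.5 3. ]
```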