This function is also referred to as a squashing function, because it maps a large input
domain onto the smaller range of 0 to 1. The logistic function is nonlinear and
differentiable, allowing the backpropagation algorithm to model classification problems
that are linearly inseparable.
We compute the output values, O_j, for each hidden layer, up to and including the
output layer, which gives the network's prediction. In practice, it is a good idea to
cache (i.e., save) the intermediate output values at each unit as they are required again
later when backpropagating the error. This trick can substantially reduce the amount of
computation required.
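The caching idea might be sketched as follows; this is an illustrative fragment under an assumed representation (lists of NumPy weight matrices and bias vectors), not the book's own code:

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward_pass(x, weights, biases):
    """Compute the output O_j of every unit, layer by layer, caching each
    layer's outputs so they can be reused when backpropagating the error.

    weights[l] connects layer l to layer l + 1 (shape: units_l x units_l+1);
    biases[l] holds the biases of the units in layer l + 1.
    """
    outputs = [np.asarray(x, dtype=float)]   # outputs[0] is the input tuple
    for W, b in zip(weights, biases):
        net = outputs[-1] @ W + b            # weighted sum of inputs plus bias
        outputs.append(logistic(net))        # squashed output of each unit
    return outputs                           # outputs[-1] is the network's prediction
```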
Backpropagate the error: The error is propagated backward by updating the weights
and biases to reflect the error of the network's prediction. For a unit j in the output
layer, the error Err_j is computed by

Err_j = O_j (1 - O_j)(T_j - O_j),    (9.6)

where O_j is the actual output of unit j, and T_j is the known target value of the given
training tuple. Note that O_j (1 - O_j) is the derivative of the logistic function.
To compute the error of a hidden layer unit j, the weighted sum of the errors of the
units connected to unit j in the next layer is considered. The error of a hidden layer
unit j is

Err_j = O_j (1 - O_j) Σ_k Err_k w_jk,    (9.7)
where w_jk is the weight of the connection from unit j to a unit k in the next higher layer,
and Err_k is the error of unit k.
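Putting Eqs. (9.6) and (9.7) together, a hedged sketch of the error computation, reusing the cached layer outputs from a forward pass; the names `outputs`, `weights`, and `target` are assumptions for this illustration:

```python
def compute_errors(outputs, weights, target):
    """Backpropagate the error from the output layer toward the input.

    outputs: cached layer outputs (NumPy arrays) from the forward pass;
             outputs[-1] is the network's prediction.
    weights: weights[l] connects layer l to layer l + 1.
    target:  known target values T_j of the training tuple.
    Returns a list with one error vector per non-input layer.
    """
    O = outputs[-1]
    errors = [O * (1 - O) * (target - O)]          # Eq. (9.6), output-layer units
    for l in range(len(weights) - 1, 0, -1):
        O_h = outputs[l]                            # cached output of hidden layer l
        err_next = errors[0]                        # errors of the next higher layer
        # Eq. (9.7): Err_j = O_j (1 - O_j) * sum_k Err_k w_jk
        errors.insert(0, O_h * (1 - O_h) * (err_next @ weights[l].T))
    return errors
```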
The weights and biases are updated to reflect the propagated errors. Weights are
updated by the following equations, where Δw_ij is the change in weight w_ij:

Δw_ij = (l) Err_j O_i,    (9.8)

w_ij = w_ij + Δw_ij.    (9.9)
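A minimal sketch of the weight updates in Eqs. (9.8) and (9.9); `l_rate` stands for the learning rate l, and the data layout follows the earlier illustrative fragments (an assumption, not the book's code):

```python
import numpy as np

def update_weights(weights, outputs, errors, l_rate):
    """Apply Eq. (9.8), Delta w_ij = (l) Err_j O_i, and Eq. (9.9),
    w_ij = w_ij + Delta w_ij, to every weight matrix in place."""
    for layer, W in enumerate(weights):
        O_i = outputs[layer]                       # cached outputs O_i feeding these weights
        err_j = errors[layer]                      # errors Err_j of the units they feed into
        delta_w = l_rate * np.outer(O_i, err_j)    # Eq. (9.8)
        W += delta_w                               # Eq. (9.9)
    return weights
```

One training iteration over a tuple would then chain these illustrative pieces: forward_pass, then compute_errors, then update_weights.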
“What is l in Eq. (9.8)?” The variable l is the learning rate, a constant typically having
a value between 0.0 and 1.0. Backpropagation learns using a gradient descent method
to search for a set of weights that fits the training data so as to minimize the mean-
squared distance between the network's class prediction and the known target value of
the tuples. 1 The learning rate helps avoid getting stuck at a local minimum in decision
space (i.e., where the weights appear to converge, but are not the optimum solution) and
encourages finding the global minimum. If the learning rate is too small, then learning
will occur at a very slow pace. If the learning rate is too large, then oscillation between
inadequate solutions may occur.
1 A method of gradient descent was also used for training Bayesian belief networks, as described in
Section 9.1.2.
 