This function is also referred to as a squashing function, because it maps a large input
domain onto the smaller range of 0 to 1. The logistic function is nonlinear and
differentiable, allowing the backpropagation algorithm to model classification problems
that are linearly inseparable.
We compute the output values, O_j, for each hidden layer, up to and including the
output layer, which gives the network's prediction. In practice, it is a good idea to
cache (i.e., save) the intermediate output values at each unit as they are required again
later when backpropagating the error. This trick can substantially reduce the amount of
computation required.
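The caching idea might be sketched as follows; this is an illustrative fragment under an assumed representation (lists of NumPy weight matrices and bias vectors), not the book's own code:

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward_pass(x, weights, biases):
    """Compute the output O_j of every unit, layer by layer, caching each
    layer's outputs so they can be reused when backpropagating the error.

    weights[l] connects layer l to layer l + 1 (shape: units_l x units_l+1);
    biases[l] holds the biases of the units in layer l + 1.
    """
    outputs = [np.asarray(x, dtype=float)]   # outputs[0] is the input tuple
    for W, b in zip(weights, biases):
        net = outputs[-1] @ W + b            # weighted sum of inputs plus bias
        outputs.append(logistic(net))        # squashed output of each unit
    return outputs                           # outputs[-1] is the network's prediction
```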
Backpropagate the error: The error is propagated backward by updating the weights
and biases to reflect the error of the network's prediction. For a unit j in the output
layer, the error Err_j is computed by

Err_j = O_j (1 - O_j)(T_j - O_j),    (9.6)

where O_j is the actual output of unit j, and T_j is the known target value of the given
training tuple. Note that O_j (1 - O_j) is the derivative of the logistic function.
To compute the error of a hidden layer unit j, the weighted sum of the errors of the
units connected to unit j in the next layer is considered. The error of a hidden layer
unit j is

Err_j = O_j (1 - O_j) Σ_k Err_k w_jk,    (9.7)
where w_jk is the weight of the connection from unit j to a unit k in the next higher layer,
and Err_k is the error of unit k.
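Putting Eqs. (9.6) and (9.7) together, a hedged sketch of the error computation, reusing the cached layer outputs from a forward pass; the names `outputs`, `weights`, and `target` are assumptions for this illustration:

```python
def compute_errors(outputs, weights, target):
    """Backpropagate the error from the output layer toward the input.

    outputs: cached layer outputs (NumPy arrays) from the forward pass;
             outputs[-1] is the network's prediction.
    weights: weights[l] connects layer l to layer l + 1.
    target:  known target values T_j of the training tuple.
    Returns a list with one error vector per non-input layer.
    """
    O = outputs[-1]
    errors = [O * (1 - O) * (target - O)]          # Eq. (9.6), output-layer units
    for l in range(len(weights) - 1, 0, -1):
        O_h = outputs[l]                            # cached output of hidden layer l
        err_next = errors[0]                        # errors of the next higher layer
        # Eq. (9.7): Err_j = O_j (1 - O_j) * sum_k Err_k w_jk
        errors.insert(0, O_h * (1 - O_h) * (err_next @ weights[l].T))
    return errors
```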
The weights and biases are updated to reflect the propagated errors. Weights are
updated by the following equations, where Δw_ij is the change in weight w_ij:

Δw_ij = (l) Err_j O_i,    (9.8)

w_ij = w_ij + Δw_ij.    (9.9)
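A minimal sketch of the weight updates in Eqs. (9.8) and (9.9); `l_rate` stands for the learning rate l, and the data layout follows the earlier illustrative fragments (an assumption, not the book's code):

```python
import numpy as np

def update_weights(weights, outputs, errors, l_rate):
    """Apply Eq. (9.8), Delta w_ij = (l) Err_j O_i, and Eq. (9.9),
    w_ij = w_ij + Delta w_ij, to every weight matrix in place."""
    for layer, W in enumerate(weights):
        O_i = outputs[layer]                       # cached outputs O_i feeding these weights
        err_j = errors[layer]                      # errors Err_j of the units they feed into
        delta_w = l_rate * np.outer(O_i, err_j)    # Eq. (9.8)
        W += delta_w                               # Eq. (9.9)
    return weights
```

One training iteration over a tuple would then chain these illustrative pieces: forward_pass, then compute_errors, then update_weights.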
“What is l in Eq. (9.8)?” The variable l is the learning rate, a constant typically having
a value between 0.0 and 1.0. Backpropagation learns using a gradient descent method
to search for a set of weights that fits the training data so as to minimize the mean-
squared distance between the network's class prediction and the known target value of
the tuples. 1 The learning rate helps avoid getting stuck at a local minimum in decision
space (i.e., where the weights appear to converge, but are not the optimum solution) and
encourages finding the global minimum. If the learning rate is too small, then learning
will occur at a very slow pace. If the learning rate is too large, then oscillation between
inadequate solutions may occur.
1 A method of gradient descent was also used for training Bayesian belief networks, as described in
Section 9.1.2.
 