Fig. 22.1 Neural networks
22.2.1 Neural Networks
Neural networks provide a general method for learning real-valued, discrete-valued,
and vector-valued functions from examples. Algorithms such as Back Propagation
(BP) use gradient descent to tune the network parameters to best fit a training
set, and have proven surprisingly successful in many practical problems such as
speech recognition and handwriting recognition.
In this subsection, we take multi-layer neural networks as an example to illustrate
the basic idea of neural networks and the BP algorithm.
A typical multi-layer neural network is shown in Fig. 22.1. The first layer corresponds
to the input vector; the second layer contains the hidden nodes, which determine
the representational power of the network; the third layer corresponds to the output.
Note that if each node in the network is defined as a linear function, the multi-layer
structure offers little beyond a single-layer network, since a composition of linear
functions is still a linear function. To avoid this problem, a sigmoid unit is used
in the network. The sigmoid unit first computes a linear combination of its inputs,
and then applies the sigmoid function to the result. Specifically,
$$ o = \frac{1}{1 + e^{-w^{T}x}}. $$
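As a concrete illustration, the following Python sketch (using NumPy; the particular weight and input values are arbitrary, chosen only for the example) evaluates a single sigmoid unit as defined above.

    import numpy as np

    def sigmoid_unit(w, x):
        """Output of one sigmoid unit: o = 1 / (1 + exp(-w^T x))."""
        return 1.0 / (1.0 + np.exp(-np.dot(w, x)))

    # Arbitrary illustrative weights and input vector.
    w = np.array([0.5, -1.2, 0.3])
    x = np.array([1.0, 0.4, 2.0])
    print(sigmoid_unit(w, x))  # a value strictly between 0 and 1

Because the output is a smooth, differentiable function of the weights, the gradient needed by BP is easy to compute, which is exactly why the sigmoid unit is preferred over a hard threshold.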
Usually the BP method is used to learn the weights of a neural network, given
a fixed set of nodes and edges in the network. Specifically, the BP algorithm
repeatedly iterates over the training examples. For each training example, it applies
the network to the example, calculates the error of the network's output for this
example, computes the gradient of the error with respect to the weights, and then
updates all the weights in the network. This gradient descent step is iterated until
the network performs acceptably well.
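To make this procedure concrete, here is a minimal Python/NumPy sketch of the BP loop for a network with one hidden layer of sigmoid units, trained with the squared error on a tiny illustrative data set (XOR). The learning rate, number of epochs, network size, and initialization are arbitrary choices for the example, not values prescribed by the algorithm.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # Tiny illustrative training set (XOR) with targets y.
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    y = np.array([[0], [1], [1], [0]], dtype=float)

    rng = np.random.default_rng(0)
    W1, b1 = rng.normal(size=(2, 3)), np.zeros(3)   # input -> hidden weights
    W2, b2 = rng.normal(size=(3, 1)), np.zeros(1)   # hidden -> output weights
    eta = 0.5                                       # learning rate (assumed value)

    for epoch in range(20000):
        for x_i, y_i in zip(X, y):                  # iterate over training examples
            # Apply the network to the example (forward pass).
            h = sigmoid(x_i @ W1 + b1)              # hidden-layer activations
            o = sigmoid(h @ W2 + b2)                # network output

            # Error terms of the squared error on this example (backward pass).
            delta_o = (o - y_i) * o * (1 - o)
            delta_h = (delta_o @ W2.T) * h * (1 - h)

            # Update all the weights (one gradient descent step).
            W2 -= eta * np.outer(h, delta_o)
            b2 -= eta * delta_o
            W1 -= eta * np.outer(x_i, delta_h)
            b1 -= eta * delta_h

    # After training, the outputs should move toward the targets [0, 1, 1, 0].
    print(sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2).round(2))

In practice the outer loop is stopped once the training error (or the error on a held-out validation set) is acceptably small, rather than after a fixed number of epochs.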
Note that the advantage of using gradient descent is its simplicity. In particular,
when sigmoid units are used in the network, the gradient can be computed very
efficiently. The disadvantage is that gradient descent can only find a local optimum
of the parameters. Usually one needs to perform random restarts to reduce the risk
of ending up in a poor local optimum.