Fig. 22.1 Neural networks
22.2.1 Neural Networks
Neural networks provide a general method for learning real-valued, discrete-valued,
and vector-valued functions from examples. Algorithms such as Back Propagation
(BP) use gradient descent to tune the network parameters to best fit a training
set, and have proven surprisingly successful in many practical problems such as
speech recognition and handwriting recognition.
In this subsection, we take multi-layer neural networks as an example to illustrate
the basic idea of neural networks and the BP algorithm.
A typical multi-layer neural network is shown in Fig. 22.1. The first layer corresponds
to the input vector; the second layer contains the hidden nodes, which determine
the representational power of the network; the third layer corresponds to the output.
Note that if each node in the network is defined as a linear function, the multi-layer
structure offers little beyond a single-layer network, since a composition of linear
functions is still a linear function. To avoid this problem, a sigmoid unit is used
in the network. The sigmoid unit first computes a linear combination of its inputs,
and then applies the sigmoid function to the result. Specifically,
$$ o = \frac{1}{1 + e^{-w^{T}x}}. $$
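As a concrete illustration, the following Python sketch (using NumPy; the particular weight and input values are arbitrary, chosen only for the example) evaluates a single sigmoid unit as defined above.

    import numpy as np

    def sigmoid_unit(w, x):
        """Output of one sigmoid unit: o = 1 / (1 + exp(-w^T x))."""
        return 1.0 / (1.0 + np.exp(-np.dot(w, x)))

    # Arbitrary illustrative weights and input vector.
    w = np.array([0.5, -1.2, 0.3])
    x = np.array([1.0, 0.4, 2.0])
    print(sigmoid_unit(w, x))  # a value strictly between 0 and 1

Because the output is a smooth, differentiable function of the weights, the gradient needed by BP is easy to compute, which is exactly why the sigmoid unit is preferred over a hard threshold.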
Usually the BP method is used to learn the weights of a neural network, given
a fixed set of nodes and edges in the network. Specifically, the BP algorithm
repeatedly iterates over the training examples. For each training example, it applies
the network to the example, calculates the error of the network's output for this
example, computes the gradient of the error with respect to the weights, and then
updates all the weights in the network. This gradient descent step is iterated until
the network performs acceptably well.
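To make this procedure concrete, here is a minimal Python/NumPy sketch of the BP loop for a network with one hidden layer of sigmoid units, trained with the squared error on a tiny illustrative data set (XOR). The learning rate, number of epochs, network size, and initialization are arbitrary choices for the example, not values prescribed by the algorithm.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # Tiny illustrative training set (XOR) with targets y.
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    y = np.array([[0], [1], [1], [0]], dtype=float)

    rng = np.random.default_rng(0)
    W1, b1 = rng.normal(size=(2, 3)), np.zeros(3)   # input -> hidden weights
    W2, b2 = rng.normal(size=(3, 1)), np.zeros(1)   # hidden -> output weights
    eta = 0.5                                       # learning rate (assumed value)

    for epoch in range(20000):
        for x_i, y_i in zip(X, y):                  # iterate over training examples
            # Apply the network to the example (forward pass).
            h = sigmoid(x_i @ W1 + b1)              # hidden-layer activations
            o = sigmoid(h @ W2 + b2)                # network output

            # Error terms of the squared error on this example (backward pass).
            delta_o = (o - y_i) * o * (1 - o)
            delta_h = (delta_o @ W2.T) * h * (1 - h)

            # Update all the weights (one gradient descent step).
            W2 -= eta * np.outer(h, delta_o)
            b2 -= eta * delta_o
            W1 -= eta * np.outer(x_i, delta_h)
            b1 -= eta * delta_h

    # After training, the outputs should move toward the targets [0, 1, 1, 0].
    print(sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2).round(2))

In practice the outer loop is stopped once the training error (or the error on a held-out validation set) is acceptably small, rather than after a fixed number of epochs.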
Note that the advantage of using gradient descent is its simplicity. In particular,
when sigmoid units are used in the network, the gradient can be computed very
efficiently. The disadvantage is that gradient descent can only find a local optimum
of the parameters. Usually one needs to perform random restarts to reduce the risk
of ending up in a poor local optimum.