Machine Learning - Learning to Rank for Information Retrieval

Information Technology Reference

In-Depth Information

Given the training set, in order to learn the hypothesis, a loss function is needed.

In particular, the loss function can be defined as follows.

n

w T x i − y i 2 .

1

2

L(w) =

i

=

1

In order to find the optimal w that can minimize the above loss function, one

can simply use the gradient descent method, since this loss function is convex. By

some simple mathematical tricks, one can get the following update rule for gradient

descent, which we usually call the least mean squares (LMS) update rule:

n

y i −

w T x i x i,j ,

w j =

w j +

α

i =

1

where j is the index for the features, and x i,j means the j th feature of input variable

x i .

The above update rule is quite intuitive. The magnitude of the update is propor-

tional to the error term (y i −

w T x i ) . As a result, if we encounter a training example

on which the hypothesis can predict the target value in a perfect manner, there will

be no need to change the parameters. In contrast, if the prediction made by the hy-

pothesis has a large error, a large change will be made to the parameters.

Note that the above update rule makes use of the entire training set in each step

of the update. There is an alternative approach, in which we only use one training

example in each step. This is usually called a stochastic gradient descent method,

and the original update rule corresponds to the so-called batch gradient descend

method. Often the stochastic gradient method gets the loss function close to the

minimum much faster than the batch gradient descent method, mainly because it

updates the parameter more frequently in the training process. In this regard, es-

pecially when the training set is very large, the stochastic gradient descent method

is usually preferred. However, when using the stochastic gradient descent method,

there is the risk of never converging to the minimum when the parameter w keeps

oscillating around the minimum of L(w) , although in practice most of the values

near the minimum will have been good enough.

22.1.2 Probabilistic Explanation

There are many different ways of explaining the above linear regression algorithm,

among which the probabilistic explanation shows that the least-square linear regres-

sion is a very natural algorithm.

Let us consider the following relationship between the target variable and the

input variable:

w T x i +

y i =

ε i ,

Learning to Rank for Information Retrieval

Search WWH ::

Custom Search

Home