might not be the best choice, since it would lead to poor performance on unseen
instances. To tackle this problem, many learning algorithms consider the generalization
of the model from training data to test data. For this purpose, one should choose
models that fit the data well but at the same time are as simple as possible
(to avoid overfitting the training data). An immediate question is then how to
measure the simplicity (or its counterpart, complexity) of a model. This is one of the
central problems in statistical learning theory.
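To make the trade-off concrete, the following sketch (a hypothetical illustration; the target function $\sin(2\pi x)$, the noise level, the sample sizes, and the polynomial degrees are all invented for this example) fits polynomials of increasing degree to noisy data drawn from one distribution. The more complex model achieves lower training error but higher test error, which is exactly the overfitting phenomenon described above.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    """Draw n i.i.d. pairs (x, y) with y = sin(2*pi*x) + Gaussian noise."""
    x = rng.uniform(0.0, 1.0, n)
    y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, n)
    return x, y

x_train, y_train = sample(15)    # small training set
x_test, y_test = sample(1000)    # large test set from the same distribution

for degree in (1, 3, 9):
    # Least-squares polynomial fit of the given degree.
    coeffs = np.polyfit(x_train, y_train, degree)
    mse_train = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    mse_test = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE {mse_train:.3f}, test MSE {mse_test:.3f}")
```

The degree-9 fit typically drives the training error close to zero while its test error exceeds that of the degree-3 fit, illustrating why model selection must weigh fit against complexity.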
Regarding this question, one needs to address an important issue: the assumption
on the data. Here one should be aware of the famous No Free Lunch theorem,
which basically states that if no assumption is made on the connection between
training and test data, prediction is impossible. Furthermore, if there are no
a priori assumptions on the possible test data, it is impossible to generalize,
and thus no algorithm is uniformly better than the others (any algorithm will be
beaten by some other algorithm on certain kinds of test data). Therefore, one needs
to make a reasonable assumption on the data before deriving learning algorithms
and performing theoretical analysis on them.
In statistical learning theory, the assumption on the data is that both training and
test data are sampled independently from the same distribution. The independence
assumption means that each new example yields maximum information, and the
identical-distribution assumption means that the new examples carry information
about the same underlying probability distribution.
In the rest of the section, we will take binary classification as an example to show
some basic concepts of statistical learning theory.
22.3.1 Formalization
In this subsection, we formalize the basic problems in statistical learning theory.
For simplicity, we consider binary classification, where the input space is $\mathcal{X}$ and
the output space is $\mathcal{Y} = \{+1, -1\}$. We assume that the pairs $(X, Y) \in \mathcal{X} \times \mathcal{Y}$ are
random variables distributed according to an unknown distribution $P$, and the goal
is to construct a function $g : \mathcal{X} \to \mathcal{Y}$ which predicts $Y$ from $X$.
For randomly chosen test data, the error made by the function $g$ can be represented as
$$R(g) = P\bigl(g(X) \neq Y\bigr) = \mathbb{E}\bigl[\mathbb{I}_{\{g(X) \neq Y\}}\bigr].$$
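In practice $P$ is unknown, so $R(g)$ is estimated by the sample average of the indicator loss over i.i.d. data, which is exactly the assumption stated above. A minimal Python sketch follows; the distribution ($Y$ uniform on $\{+1, -1\}$, $X \sim N(Y, 1)$) and the classifier $g$ are invented here purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_pairs(n):
    """Draw n i.i.d. pairs (X, Y): Y uniform on {+1, -1}, X ~ N(Y, 1)."""
    y = rng.choice([-1, 1], size=n)
    x = y + rng.normal(0.0, 1.0, n)
    return x, y

def g(x):
    """A candidate classifier: predict the sign of x."""
    return np.where(x >= 0, 1, -1)

x, y = sample_pairs(100_000)
# Empirical risk: the fraction of misclassified points, i.e. the sample
# average of I{g(X) != Y}, which converges to R(g) = P(g(X) != Y)
# by the law of large numbers.
print(f"estimated R(g) = {np.mean(g(x) != y):.4f}")
```

For this particular distribution the true risk of $g$ is $\Phi(-1) \approx 0.159$, which the printed estimate should approach as the sample size grows.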
It is clear that if we introduce the regression function
$$\eta(x) = \mathbb{E}[Y \mid X = x] = 2P(Y = 1 \mid X = x) - 1$$
and the target function (or Bayes classifier) $t(x) = \operatorname{sgn}(\eta(x))$, then this target
function would achieve the minimum risk over all the possible measurable functions:
$$R(t) = \inf_{g} R(g).$$
We denote the value of $R(t)$ as $R^{*}$, and refer to it as the Bayes risk.
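Continuing the invented distribution from the previous sketch, $\eta(x)$ has the closed form $\tanh(x)$, so the Bayes classifier is $t(x) = \operatorname{sgn}(x)$ and the Bayes risk is $R^{*} = \Phi(-1) \approx 0.159$. The following sketch checks numerically that $t$ beats a competitor with a shifted decision threshold (the competitor is, again, a hypothetical choice):

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_pairs(n):
    """Same synthetic distribution as before: Y uniform on {+1, -1}, X ~ N(Y, 1)."""
    y = rng.choice([-1, 1], size=n)
    x = y + rng.normal(0.0, 1.0, n)
    return x, y

def eta(x):
    """Regression function eta(x) = E[Y | X = x] = 2 P(Y = 1 | X = x) - 1.
    For this distribution P(Y = 1 | X = x) = 1 / (1 + exp(-2x)), so eta = tanh."""
    return np.tanh(x)

def bayes(x):
    """Bayes classifier t(x) = sgn(eta(x)); here eta(x) >= 0 iff x >= 0."""
    return np.where(eta(x) >= 0, 1, -1)

def shifted(x):
    """A suboptimal competitor: the same rule with the threshold moved to 0.5."""
    return np.where(x >= 0.5, 1, -1)

x, y = sample_pairs(200_000)
for name, clf in (("Bayes t", bayes), ("shifted g", shifted)):
    print(f"{name}: empirical risk {np.mean(clf(x) != y):.4f}")
```

The first estimate should be close to $0.159$ and the second close to $\tfrac{1}{2}(\Phi(-0.5) + \Phi(-1.5)) \approx 0.188$, consistent with $R(t) = \inf_g R(g)$: no measurable classifier can improve on predicting the more probable label at each $x$.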