Our goal is to find this target function t(x). However, since P is unknown, we cannot directly know the value of t at the data points. We can only use the following empirical risk to estimate it:
$$\hat{R}(g) = \frac{1}{n}\sum_{i=1}^{n} I_{\{g(X_i) \neq Y_i\}}.$$
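As a minimal illustration (not from the original text), the empirical 0-1 risk can be computed directly from a sample; the function and variable names below are hypothetical:

```python
import numpy as np

def empirical_risk(g, X, Y):
    """Empirical 0-1 risk: the fraction of points where g(X_i) != Y_i."""
    predictions = np.array([g(x) for x in X])
    return np.mean(predictions != Y)

# Toy example: a threshold classifier on 1-D inputs.
X = np.array([0.2, 0.5, 0.8, 1.1])
Y = np.array([0, 0, 1, 1])
g = lambda x: int(x > 0.6)
print(empirical_risk(g, X, Y))  # -> 0.0, since g labels every point correctly
```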
However, directly minimizing the empirical risk over the entire function class may not be a good choice, due to the risk of overfitting. There are usually two ways of tackling this challenge: the first is to restrict the class of functions over which the minimization is performed, and the second is to modify the criterion to be minimized (e.g., by adding a penalty for complex functions). These two ways can also be combined in some algorithms.
Considering the aforementioned ways of avoiding overfitting, researchers usually design their learning algorithms according to the following paradigms.
Empirical Risk Minimization. Algorithms belonging to this category are the most straightforward yet usually very effective. The idea is to restrict the function class to $\mathcal{G}$ when minimizing the empirical risk. That is,

$$g_n = \arg\min_{g \in \mathcal{G}} \hat{R}(g).$$
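For concreteness, here is a sketch (not from the original text) of empirical risk minimization over a deliberately small class $\mathcal{G}$ of one-dimensional threshold classifiers; all names and the grid of thresholds are illustrative:

```python
import numpy as np

def empirical_risk(g, X, Y):
    return np.mean(np.array([g(x) for x in X]) != Y)

# Restrict G to threshold classifiers g_t(x) = I{x > t} on a fixed grid.
G = [(t, (lambda x, t=t: int(x > t))) for t in np.linspace(0.0, 1.0, 21)]

X = np.array([0.12, 0.33, 0.48, 0.61, 0.77, 0.93])
Y = np.array([0, 0, 0, 1, 1, 1])

# ERM: pick the member of G with the lowest empirical risk.
t_star, g_n = min(G, key=lambda tg: empirical_risk(tg[1], X, Y))
print(t_star, empirical_risk(g_n, X, Y))  # -> 0.5 0.0
```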
Structural Risk Minimization. Algorithms belonging to this category introduce a penalty on the size of the function class. Suppose there is an infinite sequence $\{\mathcal{G}_d : d = 1, 2, \ldots\}$ of function classes with increasing size; the learning algorithm then minimizes the following objective function:

$$g_n = \arg\min_{g \in \mathcal{G}_d,\; d = 1, 2, \ldots} \hat{R}(g) + \mathrm{Pen}(d, n),$$

where the penalty $\mathrm{Pen}(d, n)$ gives preference to models from a smaller function class.
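The sketch below (again illustrative, with a hypothetical penalty $\sqrt{d/n}$) instantiates this paradigm with nested classes $\mathcal{G}_d$ of threshold classifiers whose threshold grids grow with d:

```python
import numpy as np

def empirical_risk(g, X, Y):
    return np.mean(np.array([g(x) for x in X]) != Y)

def make_class(d):
    """G_d: threshold classifiers with thresholds on a grid of 2**d points."""
    return [(lambda x, t=t: int(x > t)) for t in np.linspace(0.0, 1.0, 2 ** d)]

def penalty(d, n):
    return np.sqrt(d / n)  # hypothetical Pen(d, n): grows with d, shrinks with n

X = np.array([0.12, 0.33, 0.48, 0.61, 0.77, 0.93])
Y = np.array([0, 0, 0, 1, 1, 1])
n = len(X)

# Structural risk minimization: search jointly over d and g in G_d.
obj, d_star, g_n = min(
    ((empirical_risk(g, X, Y) + penalty(d, n), d, g)
     for d in range(1, 8) for g in make_class(d)),
    key=lambda item: item[0],
)
print(d_star, obj)  # the smallest d whose class already fits the data well wins
```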
Sometimes, an alternative and easier-to-implement approach to structural risk minimization is to add a regularization term (typically the norm of g) to the objective function, as follows:

$$g_n = \arg\min_{g \in \mathcal{G}} \hat{R}(g) + \lambda \|g\|^2.$$
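A minimal sketch of this paradigm (not from the original text) for a linear scoring function $g_w(x) = \langle w, x \rangle$: since the 0-1 loss is not differentiable, the sketch substitutes the logistic surrogate loss so that gradient descent applies; all names and constants are illustrative:

```python
import numpy as np

def objective(w, X, Y, lam):
    margins = Y * (X @ w)                       # labels Y in {-1, +1}
    return np.mean(np.log1p(np.exp(-margins))) + lam * np.dot(w, w)

def gradient(w, X, Y, lam):
    margins = Y * (X @ w)
    coef = -Y / (1.0 + np.exp(margins))         # derivative of the logistic loss
    return (X.T @ coef) / len(Y) + 2 * lam * w  # surrogate grad + regularizer grad

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
Y = np.sign(X[:, 0] + 0.3 * rng.normal(size=100))

w, lam, lr = np.zeros(2), 0.1, 0.5
for _ in range(200):                            # plain gradient descent
    w -= lr * gradient(w, X, Y, lam)
print(w, objective(w, X, Y, lam))               # lam penalizes large ||w||^2
```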
As can be seen above, the eventual goal is to minimize the expected risk R(g), while what the learning algorithms actually minimize is the empirical risk $\hat{R}(g)$ (or the structural risk). Given the learned function $g_n \in \mathcal{G}$, in order to evaluate whether it is good or not, one still needs to look at $R(g_n)$. However, $R(g_n)$ is a random variable and cannot be computed from the data. Therefore, one can usually only resort to probabilistic bounds when analyzing learning algorithms. In the following paragraphs, we will introduce the basic forms of such bounds.
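As a preview, one standard illustrative result (not taken from this text): for a finite class $\mathcal{G}$, Hoeffding's inequality combined with a union bound yields, with probability at least $1 - \delta$ over the draw of the n samples,

$$\forall g \in \mathcal{G}: \quad R(g) \le \hat{R}(g) + \sqrt{\frac{\log|\mathcal{G}| + \log(1/\delta)}{2n}}.$$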
Here we introduce the best function $g^* \in \mathcal{G}$, with $R(g^*) = \inf_{g \in \mathcal{G}} R(g)$. We then have the decomposition

$$R(g_n) - R^* = \bigl(R(g^*) - R^*\bigr) + \bigl(R(g_n) - R(g^*)\bigr),$$

where $R^*$ denotes the minimal (Bayes) risk, attained by the target function t. The first term, the approximation error, measures how well functions in $\mathcal{G}$ can approach the target; the second term, the estimation error, measures how close $g_n$ comes to the best function within $\mathcal{G}$.