where $\varepsilon_i$ is an error term that captures either the unmodeled effects or random noise. Since we usually do not know much about this error term, a simple way is to assume that the $\varepsilon_i$ are independently and identically distributed (i.i.d.) according to a Gaussian distribution: $\varepsilon_i \sim N(0, \sigma^2)$. In other words, we assume that the probability density of $\varepsilon_i$ is

$$p(\varepsilon_i) = \frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{\varepsilon_i^2}{2\sigma^2}\right).$$
Accordingly, we have

$$p(y_i \mid x_i; w) = \frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(y_i - w^T x_i)^2}{2\sigma^2}\right).$$
Given the above assumptions, we can write the conditional likelihood of the training data as

$$l(w) = \prod_{i=1}^{n} p(y_i \mid x_i; w) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(y_i - w^T x_i)^2}{2\sigma^2}\right).$$
The log likelihood can then be written as

$$\log l(w) = n \log\frac{1}{\sqrt{2\pi}\sigma} - \frac{1}{\sigma^2}\cdot\frac{1}{2}\sum_{i=1}^{n}\left(y_i - w^T x_i\right)^2.$$
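As a concrete check of this expression, the log likelihood can be computed directly from a design matrix whose rows are the $x_i$. The sketch below is illustrative only; the names `gaussian_log_likelihood`, `X`, `y`, and `sigma` are our own, not from the text.

```python
import numpy as np

def gaussian_log_likelihood(w, X, y, sigma):
    """Compute log l(w) = n*log(1/(sqrt(2*pi)*sigma))
    - (1/sigma^2) * (1/2) * sum_i (y_i - w^T x_i)^2,
    where the rows of X are the input vectors x_i."""
    n = len(y)
    residuals = y - X @ w
    return (n * np.log(1.0 / (np.sqrt(2.0 * np.pi) * sigma))
            - 0.5 / sigma**2 * residuals @ residuals)
```

With a perfect fit (all residuals zero), the second term vanishes and only the constant $n \log\frac{1}{\sqrt{2\pi}\sigma}$ remains, which matches the formula above.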
Now we maximize this log likelihood in order to obtain the optimal parameter $w$. It is not difficult to see that this is equivalent to minimizing the following least-square loss function:

$$L(w) = \frac{1}{2}\sum_{i=1}^{n}\left(w^T x_i - y_i\right)^2.$$

The above analysis shows that, under certain probabilistic assumptions on the data, least-square regression corresponds to finding the maximum likelihood estimate of $w$.
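This equivalence can be illustrated numerically: the closed-form least-square solution should attain a Gaussian log likelihood at least as high as that of any other parameter vector. The sketch below uses synthetic data, and all names (`true_w`, `w_ls`, `log_likelihood`) are our own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(42)
n, d = 100, 3
X = rng.normal(size=(n, d))
true_w = np.array([2.0, -1.0, 0.5])
sigma = 0.3
# Data generated according to the assumed model: y_i = w^T x_i + eps_i
y = X @ true_w + sigma * rng.normal(size=n)

# Least-square estimate: minimizes L(w) = 1/2 * sum_i (w^T x_i - y_i)^2
w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

def log_likelihood(w):
    """Gaussian log likelihood of w given the data above."""
    r = y - X @ w
    return n * np.log(1.0 / (np.sqrt(2.0 * np.pi) * sigma)) - 0.5 / sigma**2 * r @ r

# The least-square solution maximizes the log likelihood, so random
# perturbations of it can only decrease the likelihood.
for _ in range(5):
    w_pert = w_ls + 0.05 * rng.normal(size=d)
    assert log_likelihood(w_ls) >= log_likelihood(w_pert)
```

Because the loss is a convex quadratic, the least-square minimizer is unique (for a full-rank design matrix) and coincides with the maximum likelihood estimate regardless of the value of $\sigma$, since $\sigma$ only scales and shifts the log likelihood.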
22.2 Classification

The literature on classification is considerably richer than that on regression. Many classification methods have been proposed, with different loss functions and different formulations. In this section, we will take binary classification as an example to illustrate several widely used classification algorithms that are most relevant to this topic.