Introduction - Learning to Rank for Information Retrieval

Information Technology Reference

In-Depth Information

With query q as input, documents are ranked based on the query likelihood, or

the probability that the document's language model will generate the terms in the

query (i.e., P(q

|

d) ). By further assuming the independence between terms, one has

= i = 1 P(t i |

P(q

d) , if query q contains terms t 1 ,...,t M .

To learn the document's language model, a maximum likelihood method is used.

As in many maximum likelihood methods, the issue of smoothing the estimate is

critical. Usually a background language model estimated using the entire collection

is used for this purpose. Then, the document's language model can be constructed

as follows:

|

d)

λ) TF(t i ,d)

p(t i |

d)

=

( 1

−

LEN(d) +

λp (t i |

C),

(1.3)

where p(t i |

∈[

]

C) is the background language model for term t i , and λ

0 , 1

is a

smoothing factor.

There are many variants of LMIR, some of them even go beyond the query like-

lihood retrieval model (e.g., the models based on K-L divergence [ 92 ]). We would

not like to introduce more about it, and readers are encouraged to read the tutorial

authored by Zhai [ 90 ].

In addition to the above examples, many other models have also been proposed

to compute the relevance between a query and a document. Some of them [ 74 ]

take the proximity of the query terms into consideration, and some others consider

the relationship between documents in terms of content similarity [ 73 ], hyperlink

structure [ 67 ], website structure [ 60 ], and topic diversity [ 91 ].

1.2.1.2 Importance Ranking Models

In the literature of information retrieval, there are also many models that rank doc-

uments based on their own importance. We will take PageRank as an example for

illustration. This model is particularly applicable to Web search because it makes

use of the hyperlink structure of the Web for ranking.

PageRank uses the probability that a surfer randomly clicking on links will arrive

at a particular webpage to rank the webpages. In the general case, the PageRank

value for any page d u can be expressed as

PR(d v )

U(d v ) .

PR(d u )

=

(1.4)

d v ∈ B u

That is, the PageRank value for a page d u is dependent on the PageRank values

for each page d v out of the set B u (containing all pages linking to page d u ), divided

by U(d v ) , the number of outlinks from page d v .

To get a meaningful solution to ( 1.4 ), a smoothing term is introduced. When a

random surfer walks on the link graph, she/he does not necessarily always follow the

existing hyperlinks. There is a small probability that she/he will jump to any other

Learning to Rank for Information Retrieval

Search WWH ::

Custom Search

Home