Statistical Ranking Framework - Learning to Rank for Information Retrieval


The expected risk means the loss that a ranking model f would make for a random document pair. As the distribution P is unknown, the average of the loss over $\tilde{m}$ training document pairs is used to estimate the expected risk,

$$\hat{R}(f) = \frac{1}{\tilde{m}} \sum_{j=1}^{\tilde{m}} L(f; x_{j_1}, x_{j_2}, y_{j_1,j_2}). \tag{16.8}$$
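
As a rough illustration of the estimate in (16.8), the following minimal sketch averages a pairwise loss over a handful of training pairs. The logistic loss, the sum-of-features scorer, and the toy pairs are assumptions chosen for the example; the text does not fix a particular loss L or model f.

```python
import math
from typing import Callable, List, Sequence, Tuple

# One training pair: (features of document 1, features of document 2, label y).
# y = +1 means document 1 should rank above document 2, y = -1 the opposite.
Pair = Tuple[Sequence[float], Sequence[float], int]
Scorer = Callable[[Sequence[float]], float]


def logistic_pair_loss(f: Scorer, x1: Sequence[float], x2: Sequence[float], y: int) -> float:
    """Assumed example loss: log(1 + exp(-y * (f(x1) - f(x2))))."""
    margin = y * (f(x1) - f(x2))
    return math.log1p(math.exp(-margin))


def empirical_risk(f: Scorer, pairs: List[Pair], loss=logistic_pair_loss) -> float:
    """Average of the pairwise loss over the training pairs, as in (16.8)."""
    if not pairs:
        raise ValueError("need at least one training pair")
    return sum(loss(f, x1, x2, y) for x1, x2, y in pairs) / len(pairs)


def linear_scorer(x: Sequence[float]) -> float:
    # Hypothetical ranking model f: the sum of the features.
    return sum(x)


# Hypothetical training pairs, for illustration only.
pairs: List[Pair] = [
    ([1.0, 0.0], [0.0, 0.0], +1),
    ([0.0, 0.0], [2.0, 0.0], -1),
]
risk = empirical_risk(linear_scorer, pairs)
```

Any other pairwise loss with the same signature could be passed as `loss`; only the averaging step comes from the equation itself.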

Note that the “average view” is also technically sound in certain situations. The intuition that two document pairs cannot be independent of each other when they share a common document is not always right. The reason is that dependence (or independence) is defined with regard to random variables, not their values. Therefore, as long as two document pairs are sampled and labeled in an independent manner, they are i.i.d. random variables, regardless of whether their values (the specific documents in the pairs) overlap.
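
The distinction between the sampling procedure and the sampled values can be made concrete with a small sketch. The document pool, the relevance grades, and the helper `sample_pair` are hypothetical; the point is only that each draw ignores all earlier draws, even though the realized pairs end up sharing documents. This illustrates the argument and is not a proof of it.

```python
import random
from typing import Callable, List, Tuple

LabeledPair = Tuple[str, str, int]


def sample_pair(pool: List[str], label_fn: Callable[[str, str], int],
                rng: random.Random) -> LabeledPair:
    """Draw one labeled pair. The draw does not look at any earlier draw."""
    d1, d2 = rng.sample(pool, 2)  # two distinct documents from the pool
    return d1, d2, label_fn(d1, d2)


# Hypothetical pool and ground-truth grades, for illustration only.
pool = ["d1", "d2", "d3"]
relevance = {"d1": 2, "d2": 1, "d3": 0}


def label(a: str, b: str) -> int:
    # +1 if the first document is more relevant, -1 otherwise.
    return 1 if relevance[a] > relevance[b] else -1


rng = random.Random(0)
pairs = [sample_pair(pool, label, rng) for _ in range(5)]

# With only three documents, any two pairs necessarily share a document,
# yet every pair was produced by the same independent procedure.
shares_doc = any(
    set(p[:2]) & set(q[:2])
    for i, p in enumerate(pairs)
    for q in pairs[i + 1:]
)
```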

16.1.3 The Listwise Approach

The document ranking framework cannot describe the listwise approach. Most existing listwise ranking algorithms assume that the training set contains a deterministic set of documents associated with each query, and there is no sampling of documents. In contrast, there is no concept of a query in the document ranking framework while the sampling of documents is assumed.

16.2 Subset Ranking Framework

In the framework of subset ranking [6, 8], it is assumed that there is a hierarchical structure in the data, i.e., queries and the documents associated with each query. However, only the queries are regarded as i.i.d. random variables, while the documents associated with each query are regarded as deterministically generated. For example, in [6], it is assumed that an existing search engine is used to generate the training and test data. Queries are randomly sampled from the query space. After a query is selected, it is submitted to the search engine, and the top-k documents returned are regarded as the associated documents. In other words, there is no i.i.d. sampling with regard to documents, and each query is represented by a fixed set of documents (denoted by x) and their ground-truth labels.
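
The data-generation process described above can be sketched as follows. The corpus, the query space, and `toy_search_engine` are hypothetical stand-ins for the existing search engine assumed in [6]; ground-truth labels are omitted for brevity. Only the query is drawn at random, and the document set is a deterministic function of it.

```python
import random
from typing import Dict, List, Tuple

# Hypothetical corpus (document id -> text) and query space.
CORPUS: Dict[str, str] = {
    "d1": "learning to rank",
    "d2": "rank aggregation methods",
    "d3": "information retrieval basics",
    "d4": "learning theory",
}
QUERY_SPACE: List[str] = ["learning to rank", "information retrieval", "rank aggregation"]


def toy_search_engine(query: str, k: int) -> List[str]:
    """Deterministic stand-in for a search engine: rank documents by the
    number of query terms they contain, breaking ties by document id."""
    terms = set(query.split())
    ranked = sorted(
        CORPUS,
        key=lambda d: (-len(terms & set(CORPUS[d].split())), d),
    )
    return ranked[:k]


def sample_training_queries(n_queries: int, k: int,
                            rng: random.Random) -> List[Tuple[str, List[str]]]:
    """Sample queries i.i.d.; attach the fixed top-k documents to each one."""
    samples = []
    for _ in range(n_queries):
        query = rng.choice(QUERY_SPACE)     # the only random step
        docs = toy_search_engine(query, k)  # fully determined by the query
        samples.append((query, docs))
    return samples


samples = sample_training_queries(n_queries=4, k=2, rng=random.Random(0))
```

Sampling the same query twice yields the same document set, which is what separates this setting from the document ranking framework, where the documents themselves are sampled.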

Note that, generally speaking, the number of documents m can be a random variable; however, for ease of discussion, here we assume it to be a fixed number for all queries.
