A query set with 106 queries on the OHSUMED corpus has been used in many previous works [11, 19], with each query describing a medical search need (associated with patient information and topic information). The relevance degrees of the documents with respect to the queries are judged by human assessors, on three levels: definitely relevant, partially relevant, and irrelevant. There are a total of 16,140 query-document pairs with relevance judgments.
10.2.3 The “Gov2” Corpus and Two Query Sets
The Million Query (MQ) track ran for the first time in TREC 2007 and then became a regular track in the following years. The MQ track has two design purposes. First, it explores ad-hoc retrieval on a large collection of documents. Second, it investigates questions of system evaluation, in particular whether it is better to evaluate using many shallow judgments or fewer thorough judgments.
The MQ track uses the so-called "terabyte" or "Gov2" corpus as its document collection. This corpus is a collection of Web data crawled from websites in the .gov domain in early 2004. The collection includes about 25,000,000 documents in 426 gigabytes.
There are about 1700 queries with labeled documents in the MQ track of 2007 (denoted MQ2007 for short) and about 800 queries in the MQ track of 2008 (denoted MQ2008). The judgments are given on three levels: highly relevant, relevant, and irrelevant.
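Both the OHSUMED and the MQ judgments are therefore three-level graded labels. As a minimal illustration, such textual labels are typically mapped to integer grades before being used by graded evaluation measures or learning-to-rank algorithms; the 2/1/0 encoding below is an assumed convention for illustration, not something prescribed by the datasets themselves.

```python
# Hypothetical mapping of the three-level judgments to integer grades.
# The numeric values (2/1/0) are an assumed convention, chosen only so
# that "more relevant" corresponds to a larger grade.
GRADE = {
    # OHSUMED vocabulary
    "definitely relevant": 2,
    "partially relevant": 1,
    # MQ2007/MQ2008 vocabulary
    "highly relevant": 2,
    "relevant": 1,
    # shared
    "irrelevant": 0,
}

def to_grade(label: str) -> int:
    """Convert a textual relevance judgment into an integer grade."""
    return GRADE[label.lower()]

print(to_grade("Highly relevant"), to_grade("irrelevant"))  # -> 2 0
```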
10.3 Document Sampling
For a reason similar to that for selecting documents for labeling, it is not feasible to extract feature vectors for all the documents in a corpus either. A reasonable strategy is to sample some "possibly" relevant documents, and then extract feature vectors for the corresponding query-document pairs.
For TD2003, TD2004, NP2003, NP2004, HP2003, and HP2004, following the suggestions in [9] and [12], the documents are sampled in the following way. First, the BM25 model is used to rank all the documents with respect to each query; then the top 1000 documents for each query are selected for feature extraction, as sketched below. Note that this sampling strategy is intended to ease the experimental investigation; it by no means implies that learning to rank is applicable only in such a re-ranking scenario.
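The following sketch illustrates this sampling step under simplifying assumptions: an in-memory corpus of pre-tokenized documents and one common Okapi BM25 formulation. The function names, the k1/b defaults, and the toy corpus are illustrative choices, not part of the LETOR specification.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, doc_freqs, num_docs, avg_doc_len,
               k1=1.2, b=0.75):
    """One common Okapi BM25 formulation (the +1 inside the log keeps the
    idf non-negative, as in several widely used implementations)."""
    tf = Counter(doc_terms)
    score = 0.0
    for term in set(query_terms):
        if term not in tf:
            continue
        df = doc_freqs.get(term, 0)
        idf = math.log((num_docs - df + 0.5) / (df + 0.5) + 1.0)
        norm = (tf[term] * (k1 + 1)) / (
            tf[term] + k1 * (1 - b + b * len(doc_terms) / avg_doc_len))
        score += idf * norm
    return score

def sample_top_k(query_terms, corpus, k=1000):
    """Rank every document in `corpus` against the query with BM25 and
    keep only the top-k documents for subsequent feature extraction."""
    num_docs = len(corpus)
    avg_doc_len = sum(len(terms) for terms in corpus.values()) / num_docs
    doc_freqs = Counter()
    for terms in corpus.values():
        doc_freqs.update(set(terms))  # document frequency of each term
    scored = [(doc_id,
               bm25_score(query_terms, terms, doc_freqs, num_docs, avg_doc_len))
              for doc_id, terms in corpus.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy usage: in LETOR the cut-off is k=1000; here k=2 keeps the output small.
corpus = {
    "d1": "tax policy government budget".split(),
    "d2": "health care reform".split(),
    "d3": "government tax reform".split(),
}
print(sample_top_k("tax policy".split(), corpus, k=2))
```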
Unlike the above tasks, in which unjudged documents are regarded as irrelevant, the judgments for OHSUMED, MQ2007, and MQ2008 explicitly contain an "irrelevant" category, and unjudged documents are ignored in the evaluation. Correspondingly, in LETOR, only judged documents are used for feature extraction for these corpora, and all unjudged documents are ignored.
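A minimal sketch of this filtering step is given below, assuming judgments are stored as a dictionary keyed by (query id, document id) and that unjudged pairs are simply absent; the data layout and function name are hypothetical.

```python
def judged_pairs(candidates, judgments):
    """Keep only query-document pairs that carry an explicit judgment;
    unjudged candidates are dropped rather than treated as irrelevant."""
    kept = []
    for qid, doc_ids in candidates.items():
        for doc_id in doc_ids:
            label = judgments.get((qid, doc_id))  # None means unjudged
            if label is not None:
                kept.append((qid, doc_id, label))
    return kept

# Toy usage: d2 has no judgment, so it is excluded from feature extraction.
candidates = {"q1": ["d1", "d2", "d3"]}
judgments = {("q1", "d1"): 2, ("q1", "d3"): 0}  # 0 = explicitly irrelevant
print(judged_pairs(candidates, judgments))
# -> [('q1', 'd1', 2), ('q1', 'd3', 0)]
```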