Chapter 10
The LETOR Datasets
Abstract In this chapter, we introduce the LETOR benchmark datasets, covering the
following aspects: document corpora (together with query sets), document sampling,
feature extraction, meta information, cross validation, and the major ranking tasks
supported.
10.1 Overview
A benchmark dataset with standard features and evaluation measures is very helpful
for machine learning research. For example, there are benchmark datasets such as
Reuters 1 and RCV-1 2 for text classification, and UCI 3 for general machine learning.
However, there were no such benchmark datasets for ranking until the LETOR datasets [8]
were released in early 2007. In recent years, the LETOR datasets have been widely used
in the experiments of learning-to-rank papers, and have greatly advanced research on
learning to rank. Up to the writing of this book, several versions of LETOR have been
released. In this chapter, we mainly introduce the two most widely used versions,
LETOR 3.0 and LETOR 4.0. In particular, we describe the details of these datasets,
including the document corpora, document sampling, feature extraction, meta information,
and learning tasks.
10.2 Document Corpora
Three document corpora together with nine query sets are used in the LETOR
datasets. The first two document corpora are used in LETOR 3.0, while the third
one is used in LETOR 4.0.
1 http://www.daviddlewis.com/resources/testcollections/reuters21578/.
2 http://jmlr.csail.mit.edu/papers/volume5/lewis04a/lyrl2004_rcv1v2_README.htm.
3 http://archive.ics.uci.edu/ml/.
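As a practical aside, the feature files released with the LETOR datasets follow an SVMlight-style plain-text convention, with one document per line of the form "label qid:<query id> <feature id>:<value> ... # <comment>". The sketch below shows how such lines could be parsed and grouped by query; the exact comment contents and the file path used in the example are assumptions for illustration only, not part of this chapter.

```python
# Minimal sketch of a parser for LETOR-style feature files.
# Assumption (not from this chapter): each line looks like
#   <relevance label> qid:<query id> <feature id>:<value> ... # <comment, e.g. docid>

from collections import defaultdict


def parse_letor_line(line):
    """Parse one line into (query id, relevance label, feature vector)."""
    data, _, _comment = line.partition("#")      # drop the trailing comment, if any
    tokens = data.split()
    label = int(tokens[0])                       # graded relevance judgment
    qid = tokens[1].split(":", 1)[1]             # token looks like "qid:10032"
    features = {}
    for tok in tokens[2:]:                       # remaining tokens look like "1:0.0566"
        fid, value = tok.split(":", 1)
        features[int(fid)] = float(value)
    return qid, label, features


def load_letor_file(path):
    """Group (label, feature vector) pairs by query id."""
    queries = defaultdict(list)
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            qid, label, features = parse_letor_line(line)
            queries[qid].append((label, features))
    return queries


if __name__ == "__main__":
    # Hypothetical path; replace with an actual LETOR fold file.
    queries = load_letor_file("Fold1/train.txt")
    print("number of queries:", len(queries))
```

Grouping documents by query id in this way mirrors how ranking algorithms consume the data: training and evaluation are performed per query rather than per document.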