Chapter 10
The LETOR Datasets
Abstract In this chapter, we introduce the LETOR benchmark datasets, covering the
following aspects: document corpora (together with query sets), document sampling,
feature extraction, meta information, cross validation, and the major ranking tasks
supported.
10.1 Overview
A benchmark dataset with standard features and evaluation measures is very helpful
for machine learning research. For example, there are benchmark datasets such as
Reuters 1 and RCV-1 2 for text classification, and UCI 3 for general machine learning.
However, there were no such benchmark datasets for ranking until the LETOR datasets [8]
were released in early 2007. In recent years, the LETOR datasets have been widely used
in the experiments of learning-to-rank papers, and have greatly advanced research on
learning to rank. Up to the writing of this book, several versions of LETOR have been
released. In this chapter, we mainly introduce the two most widely used versions,
LETOR 3.0 and LETOR 4.0. In particular, we describe the details of these datasets,
including the document corpora, document sampling, feature extraction, meta information,
and learning tasks.
10.2 Document Corpora
Three document corpora together with nine query sets are used in the LETOR
datasets. The first two document corpora are used in LETOR 3.0, while the third
one is used in LETOR 4.0.
1 http://www.daviddlewis.com/resources/testcollections/reuters21578/.
2 http://jmlr.csail.mit.edu/papers/volume5/lewis04a/lyrl2004_rcv1v2_README.htm.
3 http://archive.ics.uci.edu/ml/.
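As a practical aside, the feature files released with the LETOR datasets follow an SVMlight-style plain-text convention, with one document per line of the form "label qid:<query id> <feature id>:<value> ... # <comment>". The sketch below shows how such lines could be parsed and grouped by query; the exact comment contents and the file path used in the example are assumptions for illustration only, not part of this chapter.

```python
# Minimal sketch of a parser for LETOR-style feature files.
# Assumption (not from this chapter): each line looks like
#   <relevance label> qid:<query id> <feature id>:<value> ... # <comment, e.g. docid>

from collections import defaultdict


def parse_letor_line(line):
    """Parse one line into (query id, relevance label, feature vector)."""
    data, _, _comment = line.partition("#")      # drop the trailing comment, if any
    tokens = data.split()
    label = int(tokens[0])                       # graded relevance judgment
    qid = tokens[1].split(":", 1)[1]             # token looks like "qid:10032"
    features = {}
    for tok in tokens[2:]:                       # remaining tokens look like "1:0.0566"
        fid, value = tok.split(":", 1)
        features[int(fid)] = float(value)
    return qid, label, features


def load_letor_file(path):
    """Group (label, feature vector) pairs by query id."""
    queries = defaultdict(list)
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            qid, label, features = parse_letor_line(line)
            queries[qid].append((label, features))
    return queries


if __name__ == "__main__":
    # Hypothetical path; replace with an actual LETOR fold file.
    queries = load_letor_file("Fold1/train.txt")
    print("number of queries:", len(queries))
```

Grouping documents by query id in this way mirrors how ranking algorithms consume the data: training and evaluation are performed per query rather than per document.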