Experimental Results on LETOR
Abstract In this chapter, we take the official evaluation results published on the
LETOR website as the source for a discussion of the performance of different
learning-to-rank methods.
11.1 Experimental Settings
Three widely used measures are adopted for evaluation on the LETOR datasets:
P@k [1], MAP [1], and NDCG@k [6]. For a given ranking model, the evaluation
results in terms of these three measures can be computed by the official evaluation
tool provided in LETOR.
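As a rough illustration of how these three measures are computed, the following sketch implements them in a common form (the 2^rel − 1 gain and log2 discount for NDCG). The official LETOR evaluation tool may differ in details such as tie-breaking and the handling of queries with no relevant documents, so this is an assumption-laden sketch, not the official tool.

```python
import math

def precision_at_k(labels, k):
    # P@k: fraction of the top-k ranked documents that are relevant (label > 0);
    # `labels` is the list of ground-truth labels in ranked order.
    return sum(1 for l in labels[:k] if l > 0) / k

def average_precision(labels):
    # Average precision for one query; MAP is its mean over all queries.
    hits, total = 0, 0.0
    for i, l in enumerate(labels, start=1):
        if l > 0:
            hits += 1
            total += hits / i
    return total / hits if hits else 0.0

def ndcg_at_k(labels, k):
    # NDCG@k with gain 2^label - 1 and discount 1/log2(rank + 1).
    def dcg(ls):
        return sum((2 ** l - 1) / math.log2(i + 2) for i, l in enumerate(ls[:k]))
    ideal = dcg(sorted(labels, reverse=True))
    return dcg(labels) / ideal if ideal > 0 else 0.0
```

For example, a ranking whose labels are already in descending order attains NDCG@k of 1.0, since its DCG equals the ideal DCG.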
The LETOR official baselines include several learning-to-rank algorithms: linear
regression, belonging to the pointwise approach; Ranking SVM [5, 7], RankBoost
[4], and FRank [8], belonging to the pairwise approach; and ListNet [2], AdaRank
[10], and SVMmap [11], belonging to the listwise approach. To make fair compar-
isons, the same settings are adopted for all the algorithms. First, most algorithms
use a linear scoring function, except RankBoost and FRank, which use binary
weak rankers. Second, all the algorithms use MAP on the validation set for model
selection. Some detailed experimental settings are listed here.
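The shared protocol can be sketched in a few lines: every baseline scores a document by a linear function of its features, and the final model is the candidate (a regularization setting, an iteration count, etc.) with the highest MAP on the validation set. The function names here are illustrative assumptions, not LETOR APIs.

```python
def linear_score(w, x):
    # Linear scoring function f(x) = w . x shared by most LETOR baselines.
    return sum(wi * xi for wi, xi in zip(w, x))

def select_by_validation_map(candidates, map_on_validation):
    # Model selection used by all baselines: keep the candidate whose
    # MAP on the validation set is highest; `map_on_validation` is a
    # callable evaluating one candidate (assumed interface).
    return max(candidates, key=map_on_validation)
```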
As for linear regression, the validation set is used to select a good mapping from
the ground-truth labels to real values.
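To make this concrete, one plausible reading is that each candidate mapping assigns a real-valued target to every graded label, a least-squares model is fit against those targets, and the mapping whose model scores highest on validation MAP is kept. The mapping dictionaries and the least-squares fit below are assumptions for illustration, not the official setup.

```python
import numpy as np

def fit_regression(X, labels, mapping):
    # Map graded ground-truth labels (e.g., {0, 1, 2}) to real-valued
    # targets via `mapping`, then fit a linear model by least squares.
    y = np.array([mapping[l] for l in labels], dtype=float)
    w, *_ = np.linalg.lstsq(np.asarray(X, dtype=float), y, rcond=None)
    return w

# Candidate mappings to compare on the validation set (hypothetical):
candidate_mappings = [{0: 0.0, 1: 1.0, 2: 2.0}, {0: 0.0, 1: 1.0, 2: 3.0}]
```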
For Ranking SVM, the publicly available SVMlight tool is employed, and the
validation set is used to tune the parameter λ in its loss function.
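The objective being tuned is the standard Ranking SVM loss: a λ-weighted squared norm of the weight vector plus a hinge loss over preference pairs (SVMlight itself exposes the trade-off via a C parameter, so this λ form is a notational sketch rather than the tool's exact interface):

```python
def ranking_svm_loss(w, pairs, lam):
    # Ranking SVM objective: lam * ||w||^2 plus, for each preference
    # pair (xi preferred over xj), the hinge loss max(0, 1 - w.(xi - xj)).
    def dot(a, b):
        return sum(ai * bi for ai, bi in zip(a, b))
    hinge = sum(max(0.0, 1.0 - dot(w, [a - b for a, b in zip(xi, xj)]))
                for xi, xj in pairs)
    return lam * dot(w, w) + hinge
```

A larger λ penalizes model complexity more heavily, which is why it is the natural knob for validation-set tuning.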
For RankBoost, the weak ranker is defined on the basis of a single feature with
255 possible thresholds. The validation set is used to determine the best number
of iterations.
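Such threshold-based weak rankers can be sketched as follows; the even spacing of thresholds over the feature's observed range is an assumption, since the original text does not specify how the 255 thresholds are placed:

```python
def make_weak_rankers(feature_values, num_thresholds=255):
    # Candidate binary weak rankers h(x) = 1 if x > theta else 0, all
    # defined on a single feature; thresholds are assumed to be evenly
    # spaced over the feature's observed range.
    lo, hi = min(feature_values), max(feature_values)
    step = (hi - lo) / (num_thresholds + 1)
    thresholds = [lo + step * (i + 1) for i in range(num_thresholds)]
    return [(lambda x, t=t: 1.0 if x > t else 0.0) for t in thresholds]
```

At each boosting round, RankBoost selects the weak ranker from this pool that best reduces the pairwise loss on the current distribution over document pairs.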
For FRank, the validation set is used to determine the number of weak learners in
the generalized additive model.
Note that there have been several other empirical studies [9, 12] in the literature,
based on LETOR and other datasets. The conclusions drawn from these studies are
similar to what we will introduce in this chapter.