Results on the MQ2008 dataset
however, the number of features in LETOR 4.0 is smaller than that in LETOR 3.0.
In this case, we hypothesize that the large scale of the dataset has enabled all the
learning-to-rank algorithms to (almost) fully realize the potential of the current
feature representation. In other words, the datasets have become saturated and can no
longer distinguish well between different learning algorithms. A similar phenomenon has
been observed in other work [3], especially when the scale of the training data is large.
In this regard, increasing the size of the datasets is sometimes not enough to
obtain meaningful evaluation results for learning-to-rank algorithms. Enriching
the feature representation is equally important, if not more so. This should
be a critical direction of future work for the learning-to-rank research community.
Here we would like to point out that the above experimental results are still preliminary,
since the LETOR baselines have not been fine-tuned and the performance of almost
every baseline algorithm can be further improved.
Most baselines in LETOR use linear scoring functions. For a problem as complex as
ranking, linear scoring functions may be too simple. The experimental results show
that the performance of these algorithms is still much lower than that of the perfect
ranker (whose MAP and NDCG are both one). This partially confirms the limited power
of linear scoring functions. Furthermore, query features such as inverse document
frequency cannot be effectively used by linear ranking functions: such a feature takes
the same value for every document associated with a query, so in a linear function it
shifts all scores by the same amount and leaves the ranking unchanged. In this regard,
if we use non-linear ranking functions, we should be able to greatly improve the
performance of the baseline algorithms.
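
The point about query features can be illustrated with a small sketch. The toy
feature values, the weights, and the helper names (linear_scores,
interaction_scores, ranking) are all hypothetical and are not taken from LETOR. The
interaction term is only one possible non-linear form, chosen for brevity.

```python
import numpy as np

# Hypothetical toy data: three documents retrieved for one query,
# each described by two query-document features.
X = np.array([[0.9, 0.1],
              [0.5, 0.5],
              [0.1, 0.9]])

def ranking(scores):
    """Return document indices sorted by descending score."""
    return np.argsort(-scores)

def linear_scores(X, q_value, w_doc=np.array([1.0, 0.5]), w_query=4.0):
    """Linear scoring function with one query-level feature.

    q_value is the same for every document of the query, so the term
    w_query * q_value is a constant offset added to all scores.
    """
    return X @ w_doc + w_query * q_value

def interaction_scores(X, q_value):
    """Assumed non-linear form: the query feature rescales the second
    document feature, so it can change the relative order of documents."""
    return X[:, 0] + q_value * X[:, 1]

# Two different values of the query-level feature (e.g. a low and a high IDF).
for q_value in (0.1, 5.0):
    print("q =", q_value,
          "linear:", ranking(linear_scores(X, q_value)),
          "interaction:", ranking(interaction_scores(X, q_value)))
```

Under the linear function the printed order is identical for both values of the query
feature, whereas under the interaction form the order depends on it.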
As for the loss functions in the baseline algorithms, there is also considerable room
for further improvement. For example, in regression, we can add a regularization
term to make the model more robust (indeed, according to the experiments conducted
in [3], the regularized linear regression model performs much better than the original
linear regression model, and its ranking performance is comparable to that of many
pairwise ranking methods); for ListNet, we can also add a regularization term to
its loss function so that it generalizes better to the test set.
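
As a minimal sketch of what adding a regularization term can look like, the code below
shows a closed-form L2-regularized (ridge) linear regression and a top-one ListNet
cross-entropy loss with an L2 penalty. The choice of an L2 penalty, the weight lam,
the toy data, and the function names are assumptions made for illustration. They are
not the settings of the LETOR baselines or of the experiments in [3].

```python
import numpy as np

def fit_ridge(X, y, lam):
    """L2-regularized least squares: solve (X^T X + lam * I) w = X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def listnet_loss(w, X, y, lam=0.0):
    """Top-one ListNet loss for the documents of one query.

    The cross entropy between the top-one distributions induced by the
    labels y and by the linear scores X @ w, plus 0.5 * lam * ||w||^2.
    """
    target = softmax(y)
    predicted = softmax(X @ w)
    cross_entropy = -np.sum(target * np.log(predicted))
    return cross_entropy + 0.5 * lam * np.dot(w, w)

# Hypothetical toy query: three documents, two features, graded labels.
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
y = np.array([2.0, 0.0, 1.0])

w_plain = fit_ridge(X, y, lam=0.0)   # ordinary least squares
w_ridge = fit_ridge(X, y, lam=1.0)   # regularized solution, smaller norm
print(np.linalg.norm(w_plain), np.linalg.norm(w_ridge))
print(listnet_loss(w_ridge, X, y, lam=0.0), listnet_loss(w_ridge, X, y, lam=1.0))
```

In practice lam would be chosen on a validation set, and the regularized ListNet loss
would be minimized by gradient descent rather than only evaluated as here.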