Experimental Results on LETOR - Learning to Rank for Information Retrieval

Table 11.10  Results on the MQ2008 dataset

Algorithm   NDCG@1  NDCG@3  NDCG@10  P@1    P@3    P@10   MAP
RankSVM     0.363   0.429   0.228    0.427  0.390  0.249  0.470
RankBoost   0.386   0.429   0.226    0.458  0.392  0.249  0.478
ListNet     0.375   0.432   0.230    0.445  0.384  0.248  0.478
AdaRank     0.375   0.437   0.230    0.443  0.390  0.245  0.476
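For readers unfamiliar with the column headings, the following is a minimal sketch of how P@k, average precision (MAP is its mean over queries) and NDCG@k are commonly computed for a single query. It assumes graded labels where any label above zero counts as relevant, a gain of 2^label - 1 and a log2(position + 1) discount; the exact conventions of the LETOR evaluation script may differ, and the function names and the toy list `ranked_labels` are hypothetical.

```python
import math

def precision_at_k(rels, k):
    """Fraction of the top-k documents that are relevant (label > 0)."""
    return sum(1 for r in rels[:k] if r > 0) / k

def average_precision(rels):
    """Mean of P@i over the positions i that hold a relevant document."""
    hits, total = 0, 0.0
    for i, r in enumerate(rels, start=1):
        if r > 0:
            hits += 1
            total += hits / i
    return total / hits if hits else 0.0

def dcg_at_k(rels, k):
    """Discounted cumulative gain (assumed gain 2^r - 1, discount log2(i + 1))."""
    return sum((2 ** r - 1) / math.log2(i + 1)
               for i, r in enumerate(rels[:k], start=1))

def ndcg_at_k(rels, k):
    """DCG normalized by the DCG of the ideal (label-sorted) ordering."""
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal if ideal > 0 else 0.0

# Relevance labels of one query's documents, in the order the ranker returned them.
ranked_labels = [2, 0, 1, 0]
p_at_3 = precision_at_k(ranked_labels, 3)
ap = average_precision(ranked_labels)
ndcg_at_3 = ndcg_at_k(ranked_labels, 3)
```

The figures in the table are averages of such per-query values over all test queries.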

however, the number of features in LETOR 4.0 is smaller than that in LETOR 3.0. In this case, we hypothesize that the large scale of the dataset has enabled all the learning-to-rank algorithms to (almost) fully realize the potential of the current feature representation. In other words, the datasets have become saturated and can no longer distinguish well between different learning algorithms. A similar phenomenon has been observed in other work [3], especially when the scale of the training data is large.

In this regard, increasing the size of the datasets is sometimes not enough to obtain meaningful evaluation results for learning-to-rank algorithms. Enriching the feature representation is equally important, if not more so. This should be a critical direction of future work for the learning-to-rank research community.

11.4 Discussions

Here we would like to point out that the above experimental results are still preliminary, since the LETOR baselines have not been fine-tuned and the performance of almost every baseline algorithm can be further improved.

• Most baselines in LETOR use linear scoring functions. For a problem as complex as ranking, linear scoring functions may be too simple. The experimental results show that the performance of these algorithms is still much lower than that of the perfect ranker (whose MAP and NDCG are both one), which partially confirms the limited power of linear scoring functions. Furthermore, query features such as inverse document frequency cannot be used effectively by linear ranking functions. In this regard, using non-linear ranking functions should greatly improve the performance of the baseline algorithms.

• There is also considerable room for improvement in the loss functions of the baseline algorithms. For example, a regularization term can be added to regression to make it more robust (indeed, according to the experiments conducted in [3], the regularized linear regression model performs much better than the original linear regression model, and its ranking performance is comparable to that of many pairwise ranking methods). Likewise, a regularization term can be added to the loss function of ListNet so that it generalizes better to the test set.
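Both points above can be checked numerically on toy data. The sketch below first shows why a query-level feature cannot influence a linear ranker: a feature with the same value for every document of a query shifts all of their scores by the same constant, so the within-query order is unchanged. It then fits a linear regression with and without an L2 regularization term using the standard closed-form ridge solution. All names, feature values and the regularization strength are hypothetical, and the ridge formulation is only one common way to regularize; it is not necessarily the model evaluated in [3].

```python
import numpy as np

def linear_scores(w, X):
    """Score each document (row of X) with a linear function f(x) = w . x."""
    return X @ w

def ridge_fit(X, y, lam):
    """Closed-form L2-regularized least squares: w = (X^T X + lam I)^-1 X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

# Toy query with three documents. Column 0 is a query-level feature (the same
# value for every document of the query); columns 1-2 are document-level.
X = np.array([[3.0, 0.2, 0.9],
              [3.0, 0.7, 0.1],
              [3.0, 0.5, 0.5]])
w = np.array([10.0, 1.0, 2.0])

with_query_feature = linear_scores(w, X)
without_query_feature = linear_scores(w[1:], X[:, 1:])

# The query-level feature adds the same constant (10 * 3) to every score,
# so sorting by either score vector gives the same document order.
order_with = np.argsort(-with_query_feature)
order_without = np.argsort(-without_query_feature)

# Synthetic regression data (assumed, for illustration only).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(20, 3))
y_train = X_train @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=20)

w_plain = ridge_fit(X_train, y_train, lam=0.0)  # ordinary least squares
w_ridge = ridge_fit(X_train, y_train, lam=5.0)  # weights shrunk toward zero
```

A non-linear scoring function can combine the query-level feature with document-level features (for example through products or tree splits), which is what allows it to affect the ranking.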
