Table 11.1  Results on the TD2003 dataset

Algorithm    NDCG@1  NDCG@3  NDCG@10  P@1    P@3    P@10   MAP
Regression   0.320   0.307   0.326    0.320  0.260  0.178  0.241
RankSVM      0.320   0.344   0.346    0.320  0.293  0.188  0.263
RankBoost    0.280   0.325   0.312    0.280  0.280  0.170  0.227
FRank        0.300   0.267   0.269    0.300  0.233  0.152  0.203
ListNet      0.400   0.337   0.348    0.400  0.293  0.200  0.275
AdaRank      0.260   0.307   0.306    0.260  0.260  0.158  0.228
SVM^map      0.320   0.320   0.328    0.320  0.253  0.170  0.245
For ListNet, the validation set is used to determine the best mapping from the
ground-truth labels to scores, as required by the Plackett-Luce model, and to
determine the optimal number of iterations in the gradient descent process.
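To make the role of this label-to-score mapping concrete, here is a minimal sketch of the Plackett-Luce top-one probabilities and of ListNet's cross-entropy loss, assuming an illustrative identity mapping from graded labels to ground-truth scores; the function names and numbers below are not part of the LETOR tools.

```python
import numpy as np

def top_one_probs(scores):
    """Plackett-Luce top-one probabilities: a softmax over the document scores."""
    e = np.exp(scores - np.max(scores))  # shift for numerical stability
    return e / e.sum()

def listnet_loss(label_scores, model_scores):
    """Cross entropy between the two top-one distributions (ListNet's listwise loss)."""
    p_true = top_one_probs(label_scores)
    p_model = top_one_probs(model_scores)
    return -np.sum(p_true * np.log(p_model + 1e-12))

# Hypothetical mapping from graded labels {0, 1, 2} to ground-truth scores;
# in the experiments this mapping is chosen on the validation set.
labels = np.array([2, 0, 1, 0])
label_scores = labels.astype(float)            # e.g., identity mapping
model_scores = np.array([1.3, -0.2, 0.8, 0.1])  # illustrative model outputs
print(listnet_loss(label_scores, model_scores))
```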
For AdaRank, MAP is set as the evaluation measure to be optimized, and the
validation set is used to determine the number of iterations.
For SVM^map, the publicly available tool SVM^map (http://projects.yisongyue.com/
svmmap/) is employed, and the validation set is used to determine the parameter
λ in its loss function.
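The validation-based selection of the iteration count used for ListNet and AdaRank can be sketched as follows; the two callables are placeholders standing in for one training step and for an evaluation of MAP on the validation set, not functions from any released tool.

```python
def select_num_iterations(train_one_iteration, map_on_validation, max_iters):
    """Choose the iteration count with the best validation MAP.

    train_one_iteration(): performs one boosting round / gradient descent step.
    map_on_validation():   returns MAP of the current model on the validation set.
    Both callables are supplied by the caller (placeholders in this sketch).
    """
    best_iter, best_map = 0, float("-inf")
    for t in range(1, max_iters + 1):
        train_one_iteration()
        val_map = map_on_validation()
        if val_map > best_map:
            best_iter, best_map = t, val_map
    return best_iter, best_map

# Toy usage with dummy callables simulating a model that improves, then overfits.
history = iter([0.21, 0.24, 0.26, 0.25, 0.23])
print(select_num_iterations(lambda: None, lambda: next(history), max_iters=5))
# -> (3, 0.26)
```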
11.2 Experimental Results on LETOR 3.0
The ranking performances of the aforementioned algorithms on the LETOR 3.0
datasets are listed in Tables 11.1, 11.2, 11.3, 11.4, 11.5, 11.6, and 11.7. According to
these experimental results, we find that the listwise ranking algorithms perform very
well on most datasets. Among the three listwise ranking algorithms, ListNet seems
to be better than the other two. AdaRank and SVM^map obtain similar performances.
Pairwise ranking algorithms obtain good ranking accuracy on some (although not
all) datasets. For example, RankBoost offers the best performance on TD2004 and
NP2003; Ranking SVM shows very promising results on NP2003 and NP2004; and
FRank achieves very good results on TD2004 and NP2004. By comparison, simple
linear regression performs worse than the pairwise and listwise ranking algorithms
on most datasets.
We have also observed that most ranking algorithms perform differently on different
datasets: they may perform very well on some datasets, but not so well on others.
To evaluate the overall ranking performance of an algorithm, we use the number of
other algorithms that it can beat over all seven datasets as a measure. That is,
S_i(M) = \sum_{j=1}^{7} \sum_{k=1}^{7} I_{\{M_i(j) > M_k(j)\}},

where j is the index of a dataset, i and k are the indexes of algorithms, M_i(j) is the
performance of the i-th algorithm on the j-th dataset, and I_{\{\cdot\}} is the indicator
function.
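A minimal sketch of this measure S_i(M) is given below, assuming the per-dataset performances M_i(j) are collected in a matrix with one row per algorithm and one column per dataset; the toy values are illustrative only.

```python
import numpy as np

def winning_number(M):
    """M[i, j] = performance of algorithm i on dataset j (e.g., its MAP score).

    Returns S_i(M): for each algorithm i, the number of (dataset, algorithm)
    pairs (j, k) with M[i, j] > M[k, j].
    """
    num_algos, num_datasets = M.shape
    S = np.zeros(num_algos, dtype=int)
    for i in range(num_algos):
        for j in range(num_datasets):
            # count the algorithms beaten by algorithm i on dataset j
            S[i] += int(np.sum(M[i, j] > M[:, j]))
    return S

# Toy example: 3 algorithms evaluated on 2 datasets (illustrative values).
M = np.array([[0.32, 0.25],
              [0.34, 0.21],
              [0.40, 0.27]])
print(winning_number(M))  # prints [1 1 4]
```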