Table 12.1 Dataset 1 for the Yahoo! learning-to-rank challenge

                     Training    Validation    Test
Number of queries    19,944      2,994         6,983
Number of URLs       473,134     71,083        165,660
Table 12.2 Dataset 2 for the Yahoo! learning-to-rank challenge

                     Training    Validation    Test
Number of queries    1,266       1,266         3,798
Number of URLs       34,815      34,881        103,174
The competition is divided into two tracks:
• A standard learning-to-rank track, using only the larger dataset (dataset 1).
• A transfer learning track, where the goal is to leverage the training set from dataset 1 to build a better ranking function on dataset 2.
Two measures are used for the evaluation of the competition: NDCG [2] and Expected Reciprocal Rank (ERR). The definition of ERR is given as follows:

    ERR = \sum_{i=1}^{m} \frac{1}{i}\,\frac{G(y_i)}{16} \prod_{j=1}^{i-1} \left(1 - \frac{G(y_j)}{16}\right), \quad \text{with } G(y) = 2^{y} - 1.    (12.1)
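Equation (12.1) can be read as a cascade model: the user scans results from the top, stops at position i with probability R_i = G(y_i)/16, and the reciprocal of the stopping position is the utility. A minimal sketch of this computation (the function name and the assumption that grades lie in 0..4, giving the denominator 16 = 2^4, follow the Yahoo! challenge convention):

```python
def err(labels):
    """Expected Reciprocal Rank per Eq. (12.1).

    ERR = sum_i (1/i) * R_i * prod_{j<i} (1 - R_j),
    where R_i = G(y_i) / 16 and G(y) = 2**y - 1, for grades y in 0..4.
    """
    p = 1.0      # probability the user reaches this position without stopping
    score = 0.0
    for i, y in enumerate(labels, start=1):
        r = (2 ** y - 1) / 16.0   # stopping probability at position i
        score += p * r / i
        p *= 1.0 - r
    return score
```

For example, a single perfectly relevant result (`err([4])`) gives 15/16, and swapping a relevant result toward the top always increases the score, reflecting the position discount 1/i.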
The datasets can be downloaded from the sandbox of Yahoo! Research.¹ There are no official baselines on these datasets; however, most of the winners of the competition have published the details of their algorithms in the workshop proceedings, and these can serve as meaningful baselines.
12.2 Microsoft Learning-to-Rank Datasets
Microsoft Research Asia released two large-scale datasets for research on learning to rank in May 2010: MSLR-WEB30K and MSLR-WEB10K. MSLR-WEB30K has 31,531 queries and 3,771,126 documents. Up to the writing of this book, MSLR-WEB30K is the largest publicly available dataset for research on learning to rank. MSLR-WEB10K is a random sample of MSLR-WEB30K, with 10,000 queries and 1,200,193 documents.
In both datasets, queries and URLs are represented by IDs, for a reason similar to the non-disclosure of queries and URLs in the Yahoo! Learning-to-Rank Challenge datasets. The Microsoft datasets consist of feature vectors extracted from query-URL pairs along with relevance judgments:
¹ http://webscope.sandbox.yahoo.com/.
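Lines in the MSLR datasets use an SVMlight-style format, `<label> qid:<id> 1:<v1> 2:<v2> ...`. A minimal parser sketch (the helper name `parse_line` is hypothetical, not part of any official loader):

```python
def parse_line(line):
    """Parse one MSLR-style line: '<label> qid:<id> 1:<v> 2:<v> ...'.

    Returns the relevance label, the query ID, and a dict mapping
    feature index -> feature value.
    """
    parts = line.strip().split()
    label = int(parts[0])
    qid = parts[1].split(":", 1)[1]
    features = {}
    for tok in parts[2:]:
        idx, val = tok.split(":", 1)
        features[int(idx)] = float(val)
    return label, qid, features

# Example with a made-up line in the same format:
label, qid, feats = parse_line("2 qid:10 1:0.5 2:0.0 3:1.2")
```

Grouping parsed lines by `qid` then recovers the per-query structure that learning-to-rank algorithms operate on.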