Table 10.5 Data partitioning for five-fold cross validation

Folds   Training set    Validation set   Test set
Fold1   {S1, S2, S3}    S4               S5
Fold2   {S2, S3, S4}    S5               S1
Fold3   {S3, S4, S5}    S1               S2
Fold4   {S4, S5, S1}    S2               S3
Fold5   {S5, S1, S2}    S3               S4
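The rotation in Table 10.5 can be sketched in code. The sketch below (function name hypothetical) splits the data into five subsets S1-S5 and, for each fold, rotates three subsets into training, one into validation, and one into test:

```python
def five_fold_partition(subsets):
    # Rotation scheme of Table 10.5: with k subsets, each fold uses
    # k-2 subsets for training, the next for validation, the last for test,
    # shifting the window by one subset per fold.
    k = len(subsets)  # k = 5 in Table 10.5
    folds = []
    for i in range(k):
        training = [subsets[(i + j) % k] for j in range(k - 2)]
        validation = subsets[(i + k - 2) % k]
        test = subsets[(i + k - 1) % k]
        folds.append((training, validation, test))
    return folds

for fold_id, (tr, va, te) in enumerate(
        five_fold_partition(["S1", "S2", "S3", "S4", "S5"]), start=1):
    print(f"Fold{fold_id}: train={tr}, validation={va}, test={te}")
```

For example, fold 1 yields training {S1, S2, S3}, validation S4, and test S5, matching the first row of the table.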
10.7 Discussions
LETOR has been widely used in the learning-to-rank research community. However, its current version also has limitations. Here we list some of them.
Document sampling strategy. For the datasets based on the "Gov" corpus, the retrieval problem is essentially cast as a re-ranking task (over the top 1000 documents) in LETOR. On one hand, this is a common practice for real-world Web search engines. For the sake of efficiency, a search engine usually employs two rankers: first a simple ranker (e.g., BM25 [16]) selects candidate documents, and then a more complex ranker (e.g., the learning-to-rank algorithms introduced in this book) produces the final ranking result. On the other hand, however, there are also retrieval applications that should not be cast as re-ranking tasks. It would be good to add datasets beyond the re-ranking setting to LETOR in the future.
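The two-stage practice described above can be sketched as follows. This is a minimal illustration, not the pipeline of any particular engine: the first-stage scorer is a simplified term-overlap count standing in for BM25, and `learned_scorer` is a placeholder for a trained ranking model.

```python
def first_stage_score(query_terms, doc_terms):
    # Cheap stage-1 score: simplified term-overlap count standing in
    # for a real BM25 implementation.
    return sum(doc_terms.count(t) for t in query_terms)

def rerank(query_terms, candidates, learned_scorer, k=1000):
    # Stage 1: rank all documents by the cheap score and keep the top k.
    top_k = sorted(candidates,
                   key=lambda d: first_stage_score(query_terms, d),
                   reverse=True)[:k]
    # Stage 2: apply the expensive learned ranker to the candidates only.
    return sorted(top_k, key=learned_scorer, reverse=True)
```

The point of the design is that the expensive learned ranker only ever scores k documents per query, regardless of corpus size.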
Features. In both academic and industrial communities, more and more features have been studied and applied to improve ranking accuracy. The feature list provided in LETOR is far from comprehensive. For example, document features (such as document length) are not included in the OHSUMED dataset, and proximity features [18] are not included in any of the datasets. It would be helpful to add more features to the LETOR datasets in the future.
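To make the missing features concrete, here is a sketch of one simple proximity feature (function name hypothetical, not the definition used in [18]): the minimum token distance between any pair of distinct query terms in a document. Smaller values indicate that query terms occur close together.

```python
def min_pair_distance(query_terms, doc_tokens):
    # Positions of each distinct query term in the document.
    positions = {t: [i for i, tok in enumerate(doc_tokens) if tok == t]
                 for t in set(query_terms)}
    # Only terms that actually occur in the document.
    present = [t for t in positions if positions[t]]
    best = None
    for i in range(len(present)):
        for j in range(i + 1, len(present)):
            for p in positions[present[i]]:
                for q in positions[present[j]]:
                    d = abs(p - q)
                    if best is None or d < best:
                        best = d
    return best  # None if fewer than two distinct query terms occur
```

A document-length feature, by contrast, is simply the token count of the document; the gap noted above is not that such features are hard to compute but that they are absent from the released datasets.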
Scale and diversity of datasets. Compared with Web search, the scales (numbers of queries) of the datasets in LETOR are much smaller. To verify the performance of learning-to-rank techniques for real Web search, large-scale datasets are needed. Furthermore, although there are nine query sets, only three document corpora are involved. It would be better to create new datasets using more document corpora in the future.
References
1. Aslam, J.A., Kanoulas, E., Pavlu, V., Savev, S., Yilmaz, E.: Document selection methodologies for efficient and effective learning-to-rank. In: Proceedings of the 32nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2009), pp. 468-475 (2009)
2. Craswell, N., Hawking, D., Wilkinson, R., Wu, M.: Overview of the TREC 2003 Web track. In: Proceedings of the 12th Text Retrieval Conference (TREC 2003), pp. 78-92 (2003)