Chapter 13
Data Preprocessing for Learning to Rank
Abstract This chapter is concerned with data preprocessing for learning to rank. In order to learn an effective ranking model, the first step is to prepare high-quality training data. Several important issues need to be considered regarding the training data. First, one should consider how to get the data labeled on a large scale but at a low cost; click-through log mining is one feasible approach for this purpose. Second, since the labeled data are not always correct and effective, the selection of queries and documents, as well as their features, should also be considered. In this chapter, we review several pieces of previous work on these topics and discuss possible directions for future work.
13.1 Overview
In the previous chapters, we introduced different learning-to-rank methods. Throughout that introduction, we assumed that the training data (queries, associated documents, and their feature representations) were already available, and focused mainly on the algorithmic aspects. In practice, however, how to collect and process the training data is also an issue that we need to consider.
The most straightforward approach to obtaining training data is to ask human annotators to label the relevance of a given document with respect to a query. In practice, however, this approach has several problems. First, human annotation is costly, so it is not easy to obtain a large amount of labeled data. As far as we know, the largest labeled dataset used in published papers contains only tens of thousands of queries and millions of documents. Considering that the query space is almost infinite (users can issue any words or combinations of words as queries, and the query vocabulary is constantly evolving), such a training set might not be sufficient for effective training. It is therefore highly desirable to find a more cost-effective way to collect useful training data. Second, even if we can afford a certain amount of cost, there is still a tough decision to be made on how to spend this budget. Is it more beneficial to label more queries, or to label more documents per query? Should we label the data independently of the training process, or ask human annotators to label only those documents that contribute most to the training process?
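To make the idea of click-through log mining more concrete, the following sketch derives pairwise preference judgments from a search log using the well-known "click > skip above" heuristic: a clicked document is assumed to be preferred over every unclicked document ranked above it. The log format and function names here are hypothetical illustrations, not the interface of any particular system, and real click data would additionally require handling noise and position bias.

def extract_preference_pairs(impressions):
    # Derive pairwise preferences from click logs using the
    # "click > skip above" heuristic: a clicked document is
    # preferred over every unclicked document ranked above it.
    pairs = []
    for query, ranked_docs, clicked in impressions:
        for pos, doc in enumerate(ranked_docs):
            if doc not in clicked:
                continue
            # Every unclicked document shown above this click is
            # treated as less relevant than the clicked document.
            for above in ranked_docs[:pos]:
                if above not in clicked:
                    pairs.append((query, doc, above))
    return pairs

# Hypothetical log entries: (query, ranked result list, clicked doc ids).
log = [
    ("learning to rank", ["d1", "d2", "d3", "d4"], {"d3"}),
    ("learning to rank", ["d2", "d1", "d4", "d3"], {"d1", "d4"}),
]
for query, preferred, other in extract_preference_pairs(log):
    print(query, ":", preferred, ">", other)

Each extracted pair (query, preferred, other) can serve directly as a training instance for the pairwise learning-to-rank methods introduced in the previous chapters, although, as noted above, such automatically derived labels are not always correct and may require further filtering.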