Chapter 13
Data Preprocessing for Learning to Rank
Abstract This chapter is concerned with data preprocessing for learning to rank. In order to learn an effective ranking model, the first step is to prepare high-quality training data. Several important issues need to be considered regarding the training data. First, one should consider how to get the data labeled on a large scale but at a low cost; click-through log mining is one feasible approach for this purpose. Second, since the labeled data are not always correct and effective, the selection of queries and documents, as well as their features, should also be considered. In this chapter, we review several pieces of previous work on these topics and discuss possible directions for future work.
13.1 Overview
In the previous chapters, we introduced different learning-to-rank methods. Throughout that introduction, we assumed that the training data (queries, associated documents, and their feature representations) were already available, and focused mainly on the algorithmic aspects. In practice, however, how to collect and process the training data is also an issue that we need to consider.
The most straightforward approach to obtaining training data is to ask human annotators to label the relevance of a given document with respect to a query. In practice, however, this approach has several problems. First, human annotation is costly, so it is not easy to obtain a large amount of labeled data. As far as we know, the largest labeled dataset used in published papers contains only tens of thousands of queries and millions of documents. Considering that the query space is almost infinite (users can issue any words or combinations of words as queries, and the query vocabulary is constantly evolving), such a training set might not be sufficient for effective training. It is therefore highly desirable to find a more cost-effective way to collect useful training data. Second, even if we can afford a certain amount of cost, there is still a tough decision to be made on how to spend this budget. Is it more beneficial to label more queries, or to label more documents per query? Should we label the data independently of the training process, or ask human annotators to label only those documents that contribute most to the training process?
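To make the idea of click-through log mining more concrete, the following sketch derives pairwise preference judgments from a search log using the well-known "click > skip above" heuristic: a clicked document is assumed to be preferred over every unclicked document ranked above it. The log format and function names here are hypothetical illustrations, not the interface of any particular system, and real click data would additionally require handling noise and position bias.

def extract_preference_pairs(impressions):
    # Derive pairwise preferences from click logs using the
    # "click > skip above" heuristic: a clicked document is
    # preferred over every unclicked document ranked above it.
    pairs = []
    for query, ranked_docs, clicked in impressions:
        for pos, doc in enumerate(ranked_docs):
            if doc not in clicked:
                continue
            # Every unclicked document shown above this click is
            # treated as less relevant than the clicked document.
            for above in ranked_docs[:pos]:
                if above not in clicked:
                    pairs.append((query, doc, above))
    return pairs

# Hypothetical log entries: (query, ranked result list, clicked doc ids).
log = [
    ("learning to rank", ["d1", "d2", "d3", "d4"], {"d3"}),
    ("learning to rank", ["d2", "d1", "d4", "d3"], {"d1", "d4"}),
]
for query, preferred, other in extract_preference_pairs(log):
    print(query, ":", preferred, ">", other)

Each extracted pair (query, preferred, other) can serve directly as a training instance for the pairwise learning-to-rank methods introduced in the previous chapters, although, as noted above, such automatically derived labels are not always correct and may require further filtering.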