If we do not use all the data but only a proportion of it, how should we select the documents so as to maximize the effectiveness of the ranking model learned from them? This question is meaningful in the following senses.
Sometimes one suffers from the limited scalability of the learning algorithms. When an algorithm cannot make use of a large amount of training data (e.g., it runs out of memory), the most straightforward remedy is to down-sample the training set.
Sometimes the training data may contain noise or outliers. In this case, if the entire training set is used, the learning process might not converge and/or the effectiveness of the learned model may be hurt.
In this subsection, we introduce previous work that investigates these issues. Specifically, in [2], different document selection strategies originally proposed for evaluation are studied in the context of learning to rank. In [16], the concept of pairwise preference consistency (PPC) is proposed, and the problem of document and query selection is modeled as an optimization problem that maximizes the PPC of the selected subset of the original training data.
13.3.2.1 Document Selection Strategies
In order to understand the influence of different document selection strategies on learning to rank, six document selection strategies widely used in evaluation are empirically investigated in [2]:
Depth-k pooling: In depth-k pooling, the union of the top-k documents retrieved by each retrieval system submitted to TREC in response to a query is formed, and only the documents in this depth-k pool are selected to form the training set (simple sketches of this and the other sampling-based strategies are given after this list).
InfAP sampling: InfAP sampling [31] uses uniform random sampling to select the documents to be judged. In this manner, the selected documents are representative of the documents in the complete collection.
StatAP sampling: In StatAP sampling [26], with a prior of relevance induced by the evaluation measure AP, each document is selected with probability roughly proportional to its likelihood of relevance.
MTC: MTC [6] is a greedy on-line algorithm that selects documents according to how informative they are in determining whether there is a performance difference between two retrieval systems.
Hedge: Hedge is an on-line learning algorithm used to combine expert advice. It aims at choosing documents that are most likely to be relevant [3], and it finds many relevant documents "common" to various retrieval systems (a Hedge-style sketch is also given after this list).
LETOR: In LETOR sampling [25], the documents in the complete collection are first ranked by their BM25 scores for each query, and then the top-k documents are selected.
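
To make the differences among these strategies concrete, the following sketch implements the four pooling/sampling-based strategies in plain Python. The run format (a dict from system id to a ranked list of document ids), the function names, and the 1/rank prior used in the StatAP-style sampler are illustrative assumptions; in particular, the actual AP-induced prior of [26] is more involved than the simple stand-in used here.

import random

def depth_k_pool(runs, k):
    # Depth-k pooling: union of the top-k documents from each system's run.
    # runs: dict mapping system id -> ranked list of document ids.
    pool = set()
    for ranking in runs.values():
        pool.update(ranking[:k])
    return pool

def uniform_sample(doc_ids, n, seed=0):
    # InfAP-style selection: a uniform random sample of the collection.
    rng = random.Random(seed)
    doc_ids = list(doc_ids)
    return set(rng.sample(doc_ids, min(n, len(doc_ids))))

def prior_weighted_sample(runs, n, seed=0):
    # StatAP-style selection: sample documents with probability roughly
    # proportional to a rank-based relevance prior. The 1/rank prior,
    # summed over systems, is a simplified stand-in for the AP-induced
    # prior described in [26].
    rng = random.Random(seed)
    prior = {}
    for ranking in runs.values():
        for rank, doc in enumerate(ranking, start=1):
            prior[doc] = prior.get(doc, 0.0) + 1.0 / rank
    docs = list(prior)
    weights = [prior[d] for d in docs]
    selected = set()
    while docs and len(selected) < n:
        # Draw one document, weight-proportionally, without replacement.
        r = rng.uniform(0.0, sum(weights))
        acc = 0.0
        for i, w in enumerate(weights):
            acc += w
            if acc >= r:
                selected.add(docs.pop(i))
                weights.pop(i)
                break
    return selected

def bm25_top_k(bm25_scores, k):
    # LETOR-style selection: top-k documents by BM25 score for one query.
    # bm25_scores: dict mapping document id -> BM25 score.
    ranked = sorted(bm25_scores, key=bm25_scores.get, reverse=True)
    return set(ranked[:k])

Each function returns the selected document ids for a single query; applying it query by query yields the down-sampled training set.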
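
The Hedge-based strategy can likewise be sketched in a few lines. The sketch below treats each retrieval system as an expert whose weight is updated multiplicatively after every judgment; the weighted reciprocal-rank score, the rank-10 threshold, and the 0/1 loss are illustrative assumptions rather than the exact loss of [3], and judge stands in for a human relevance assessor.

def hedge_select(runs, judge, budget, beta=0.9):
    # Hedge-style document selection, loosely following [3]. Each
    # retrieval system is an expert with a weight; we repeatedly pick
    # the unjudged document the weighted systems most "agree" on,
    # judge it, and penalize systems that disagreed with the judgment.
    systems = list(runs)
    weights = {s: 1.0 for s in systems}
    # Each system's rank for each document it retrieved (1 = best).
    ranks = {s: {doc: r for r, doc in enumerate(runs[s], start=1)}
             for s in systems}
    all_docs = {doc for ranking in runs.values() for doc in ranking}
    selected, judged = [], set()
    for _ in range(min(budget, len(all_docs))):
        def score(doc):
            # Weighted reciprocal rank across systems (an assumption).
            return sum(weights[s] / ranks[s][doc]
                       for s in systems if doc in ranks[s])
        doc = max(all_docs - judged, key=score)
        judged.add(doc)
        selected.append(doc)
        relevant = judge(doc)  # stand-in for a human assessor
        for s in systems:
            r = ranks[s].get(doc)
            ranked_high = r is not None and r <= 10
            # Simplified 0/1 loss: penalize a system that ranked a
            # relevant document low or a non-relevant document high.
            loss = 0.0 if relevant == ranked_high else 1.0
            weights[s] *= beta ** loss
    return selected

Because relevant documents boost the weights of the systems that rank them highly, the selector gravitates toward documents that many well-performing systems agree on, which matches the observation above that Hedge finds relevant documents "common" to various retrieval systems.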