If we do not use all the data but only a proportion of it, how should we select the documents so as to maximize the effectiveness of the ranking model learned from them? This question is meaningful in the following senses.
Sometimes one suffers from the limited scalability of the learning algorithms. When an algorithm cannot make use of a large amount of training data (e.g., it runs out of memory), the most straightforward remedy is to down-sample the training set.
Sometimes the training data may contain noise or outliers. In this case, if the entire training set is used, the learning process might not converge and/or the effectiveness of the learned model may be hurt.
In this subsection, we introduce previous work that investigates these issues. Specifically, in [2], different document selection strategies originally proposed for evaluation are studied in the context of learning to rank. In [16], the concept of pairwise preference consistency (PPC) is proposed, and the problem of document and query selection is modeled as an optimization problem that maximizes the PPC of the selected subset of the original training data.
13.3.2.1 Document Selection Strategies
In order to understand the influence of different document selection strategies on learning to rank, six document selection strategies widely used in evaluation are empirically investigated in [2]:
Depth-k pooling: In depth-k pooling, the union of the top-k documents retrieved by each retrieval system submitted to TREC in response to a query is formed, and only the documents in this depth-k pool are selected to form the training set (simple sketches of this and the other sampling-based strategies are given after this list).
InfAP sampling: InfAP sampling [31] uses uniform random sampling to select the documents to be judged. In this manner, the selected documents are representative of the documents in the complete collection.
StatAP sampling: In StatAP sampling [26], with a prior of relevance induced by the evaluation measure AP, each document is selected with probability roughly proportional to its likelihood of relevance.
MTC: MTC [6] is a greedy on-line algorithm that selects documents according to how informative they are in determining whether there is a performance difference between two retrieval systems.
Hedge: Hedge is an on-line learning algorithm used to combine expert advice. It aims at choosing documents that are most likely to be relevant [3], and it finds many relevant documents "common" to various retrieval systems (a Hedge-style sketch is also given after this list).
LETOR: In LETOR sampling [25], the documents in the complete collection are first ranked by their BM25 scores for each query, and then the top-k documents are selected.
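
To make the differences among these strategies concrete, the following sketch implements the four pooling/sampling-based strategies in plain Python. The run format (a dict from system id to a ranked list of document ids), the function names, and the 1/rank prior used in the StatAP-style sampler are illustrative assumptions; in particular, the actual AP-induced prior of [26] is more involved than the simple stand-in used here.

import random

def depth_k_pool(runs, k):
    # Depth-k pooling: union of the top-k documents from each system's run.
    # runs: dict mapping system id -> ranked list of document ids.
    pool = set()
    for ranking in runs.values():
        pool.update(ranking[:k])
    return pool

def uniform_sample(doc_ids, n, seed=0):
    # InfAP-style selection: a uniform random sample of the collection.
    rng = random.Random(seed)
    doc_ids = list(doc_ids)
    return set(rng.sample(doc_ids, min(n, len(doc_ids))))

def prior_weighted_sample(runs, n, seed=0):
    # StatAP-style selection: sample documents with probability roughly
    # proportional to a rank-based relevance prior. The 1/rank prior,
    # summed over systems, is a simplified stand-in for the AP-induced
    # prior described in [26].
    rng = random.Random(seed)
    prior = {}
    for ranking in runs.values():
        for rank, doc in enumerate(ranking, start=1):
            prior[doc] = prior.get(doc, 0.0) + 1.0 / rank
    docs = list(prior)
    weights = [prior[d] for d in docs]
    selected = set()
    while docs and len(selected) < n:
        # Draw one document, weight-proportionally, without replacement.
        r = rng.uniform(0.0, sum(weights))
        acc = 0.0
        for i, w in enumerate(weights):
            acc += w
            if acc >= r:
                selected.add(docs.pop(i))
                weights.pop(i)
                break
    return selected

def bm25_top_k(bm25_scores, k):
    # LETOR-style selection: top-k documents by BM25 score for one query.
    # bm25_scores: dict mapping document id -> BM25 score.
    ranked = sorted(bm25_scores, key=bm25_scores.get, reverse=True)
    return set(ranked[:k])

Each function returns the selected document ids for a single query; applying it query by query yields the down-sampled training set.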
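
The Hedge-based strategy can likewise be sketched in a few lines. The sketch below treats each retrieval system as an expert whose weight is updated multiplicatively after every judgment; the weighted reciprocal-rank score, the rank-10 threshold, and the 0/1 loss are illustrative assumptions rather than the exact loss of [3], and judge stands in for a human relevance assessor.

def hedge_select(runs, judge, budget, beta=0.9):
    # Hedge-style document selection, loosely following [3]. Each
    # retrieval system is an expert with a weight; we repeatedly pick
    # the unjudged document the weighted systems most "agree" on,
    # judge it, and penalize systems that disagreed with the judgment.
    systems = list(runs)
    weights = {s: 1.0 for s in systems}
    # Each system's rank for each document it retrieved (1 = best).
    ranks = {s: {doc: r for r, doc in enumerate(runs[s], start=1)}
             for s in systems}
    all_docs = {doc for ranking in runs.values() for doc in ranking}
    selected, judged = [], set()
    for _ in range(min(budget, len(all_docs))):
        def score(doc):
            # Weighted reciprocal rank across systems (an assumption).
            return sum(weights[s] / ranks[s][doc]
                       for s in systems if doc in ranks[s])
        doc = max(all_docs - judged, key=score)
        judged.add(doc)
        selected.append(doc)
        relevant = judge(doc)  # stand-in for a human assessor
        for s in systems:
            r = ranks[s].get(doc)
            ranked_high = r is not None and r <= 10
            # Simplified 0/1 loss: penalize a system that ranked a
            # relevant document low or a non-relevant document high.
            loss = 0.0 if relevant == ranked_high else 1.0
            weights[s] *= beta ** loss
    return selected

Because relevant documents boost the weights of the systems that rank them highly, the selector gravitates toward documents that many well-performing systems agree on, which matches the observation above that Hedge finds relevant documents "common" to various retrieval systems.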