13.3.1 Document and Query Selection for Labeling
No matter how the labels are obtained, the process is non-trivial and one needs
to consider how to make it more cost-effective. There are at least two issues to
be considered for this purpose. First, if we can only label a fixed total number of
documents, how should we distribute them (more queries and fewer documents per
query vs. fewer queries and more documents per query)? Second, if we can only
label a fixed total number of documents, which of the documents in the corpus
should we present to the annotators?
13.3.1.1 Deep Versus Shallow Judgments
In [ 32 ], an empirical study is conducted regarding the influence of label distribution
on learning to rank. In the study, LambdaRank [ 11 ] is used as the learning-to-rank
algorithm, and a dataset from a commercial search engine is used as the experimen-
tal platform. The dataset contains 382 features and is split into training, validation,
and test sets with 2,000, 1,000, and 2,000 queries respectively. The average number
of judged documents per query in the training set is 350, although this number
varies considerably across queries.
To test the effect of judging more queries versus more documents per query, dif-
ferent training sets are formed by (i) sampling p% of the queries while keeping all
the judged documents for each sampled query, and (ii) keeping all the queries while
sampling p% of the judged documents per query. LambdaRank is then trained on
each of these training sets, and NDCG@10 is computed on the test set. Each
experiment is repeated ten times, and the average NDCG@10 value is used for the
final study.
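The two sampling schemes can be sketched as follows. This is a simplified illustration rather than the actual setup of [32]: LambdaRank training is omitted, the toy dataset and all function names are invented, and NDCG@10 follows the standard definition with gains of 2^rel − 1 and logarithmic position discounts.

```python
import math
import random

def ndcg_at_10(relevances):
    """NDCG@10 for one query: `relevances` are the graded labels of
    documents in the ranked order produced by some model."""
    def dcg(rels):
        return sum((2 ** r - 1) / math.log2(i + 2)
                   for i, r in enumerate(rels[:10]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

def sample_queries(dataset, p, rng):
    """Scheme (i): keep p% of the queries, with all judged documents
    retained for each sampled query (deep judgments, fewer queries)."""
    queries = list(dataset)
    k = max(1, int(len(queries) * p / 100))
    return {q: dataset[q] for q in rng.sample(queries, k)}

def sample_documents(dataset, p, rng):
    """Scheme (ii): keep all queries, with p% of the judged documents
    retained per query (shallow judgments, more queries)."""
    out = {}
    for q, docs in dataset.items():
        k = max(1, int(len(docs) * p / 100))
        out[q] = rng.sample(docs, k)
    return out

# Toy data: {query_id: [(features, relevance_label), ...]},
# 20 queries with 50 judged documents each.
rng = random.Random(0)
dataset = {q: [((rng.random(),), rng.randint(0, 4)) for _ in range(50)]
           for q in range(20)}

shallow = sample_documents(dataset, 20, rng)  # all queries, fewer docs each
narrow = sample_queries(dataset, 20, rng)     # fewer queries, all their docs

# Both schemes consume the same total judgment budget.
print(sum(len(v) for v in shallow.values()),
      sum(len(v) for v in narrow.values()))
```

With p = 20, both schemes yield the same total number of judgments, which is what makes the comparison between deep and shallow labeling fair.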
According to the experimental results, one has the following observations.
- Given a limited number of judgments, it is better to judge more queries with
  fewer documents per query than fewer queries with more documents per query.
  Sometimes additional documents per query do not result in any further improve-
  ment in the quality of the training set.
- The lower bound on the number of documents per query is 8 on the dataset used
  in the study. When this lower bound is met, if one has to decrease the total
  number of judgments further, it is better to decrease the number of queries in
  the training set.
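As a worked illustration of this rule of thumb (the budget figure here is hypothetical, not from [32]; only the 8-document lower bound comes from the study):

```python
budget = 16_000            # total judgments we can afford (hypothetical)
docs_floor = 8             # per-query lower bound reported in the study

# Shallow allocation: judge only docs_floor documents per query,
# which maximizes the number of distinct queries covered.
queries_shallow = budget // docs_floor   # number of queries covered

# Deep allocation: 50 judged documents per query instead,
# covering far fewer queries under the same budget.
queries_deep = budget // 50

print(queries_shallow, queries_deep)
```

Under the same budget, the shallow allocation covers many more queries, which, per the observations above, tends to produce a more informative training set.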
The explanation in [32] for the above experimental findings is based on the infor-
mativeness of the training set. Given a certain number of judged documents for a
query, judging yet more documents for that query does not add much information
to the training set. Including a new query, however, is much more informative,
since the new query may have quite different properties from the queries already in
the training set.
In [8], a theoretical explanation of this empirical finding is provided,
based on the statistical learning theory for ranking. Please refer to Chap. 17 for
more details.