Information Technology Reference
otherwise, $l_{u,v} = -1$. That is, this kind of judgment captures the relative
preference between documents. 11
3. Total order: Human annotators specify the total order of the documents with
respect to a query. For the set of documents $\{d_j\}_{j=1}^{m}$ associated with query q,
this kind of judgment is usually represented as a certain permutation of these
documents, denoted as $\pi_l$.
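To make the pairwise and total-order representations concrete, the following minimal Python sketch derives both from per-document relevance degrees. In practice these judgments are given directly by human annotators; deriving them from degrees is only an illustration. The function names, the 0-based document indices, and the decision to skip tied pairs are assumptions made for this sketch, not part of the text.

```python
from itertools import permutations


def pairwise_labels(degrees):
    """Build l[(u, v)] for ordered document pairs from relevance degrees.

    Assumption: l[(u, v)] = +1 when document u has a strictly higher degree
    than document v and -1 when it has a strictly lower one. Tied pairs are
    skipped because they express no preference.
    """
    labels = {}
    for u, v in permutations(range(len(degrees)), 2):
        if degrees[u] > degrees[v]:
            labels[(u, v)] = 1
        elif degrees[u] < degrees[v]:
            labels[(u, v)] = -1
    return labels


def total_order(degrees):
    """Return a permutation of document indices, most relevant first.

    sorted() is stable, so tied documents keep their original order
    (an arbitrary tie-breaking choice for this sketch).
    """
    return sorted(range(len(degrees)), key=lambda j: -degrees[j])
```

For example, with hypothetical degrees `[2, 0, 1]`, `pairwise_labels` gives `l[(0, 1)] = 1` and `l[(1, 0)] = -1`, and `total_order` returns `[0, 2, 1]`.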
Among the three kinds of judgments above, the first kind is the most widely
used. This is partly because it is easy to obtain: human assessors only need
to look at each individual document to produce the judgment. By comparison,
the third kind of judgment is the most costly to obtain. Therefore, in this
topic, we will mostly use the first kind of judgment as the example in our
discussions.
Given the vital role that relevance judgments play in a test collection, it is
important to assess the quality of the judgments. In previous practices like TREC,
both the completeness and the consistency of the relevance judgments are of interest.
Completeness measures the degree to which all the relevant documents for a topic have
been found; consistency measures the degree to which the assessor has marked all
the “truly” relevant documents as relevant and the “truly” irrelevant documents as
irrelevant.
Since manual judgment is always time-consuming, it is almost impossible to
judge all the documents with regard to a query. Consequently, there are always
unjudged documents returned by the ranking model. As a common practice, one
regards the unjudged documents as irrelevant in the evaluation process.
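The convention of treating unjudged documents as irrelevant can be sketched in a few lines of Python. The function name `labels_for_ranking`, the dictionary layout of the judgments, and the document identifiers are hypothetical and used only for illustration.

```python
def labels_for_ranking(ranked_doc_ids, judgments):
    """Map a ranked list of document ids to binary relevance labels.

    `judgments` holds only the documents that were actually judged
    (1 = relevant, 0 = irrelevant). Any document missing from it is
    unjudged and is counted as irrelevant (label 0).
    """
    return [judgments.get(doc_id, 0) for doc_id in ranked_doc_ids]
```

For instance, with judgments `{"d1": 1, "d2": 0}` and the ranked list `["d3", "d1", "d2"]`, the unjudged document `d3` receives label 0, so the result is `[0, 1, 0]`.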
With the relevance judgment, several evaluation measures have been proposed
and used in the literature of information retrieval. It is clear that understanding these
measures will be very important for learning to rank, since to some extent they
define the “true” objective function of ranking. Below we list some popularly used
measures. In order to better understand these measures, we use the example shown
in Fig. 1.4 to perform some quantitative calculation with respect to each measure.
In the example, there are three documents retrieved for the query “learning to rank”,
and binary judgment on the relevance of each document is provided.
Most of the evaluation measures are defined first for each query, as a function of
the ranked list π given by the ranking model and the relevance judgment. Then the
measures are averaged over all the queries in the test set.
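The two-step scheme described here (compute the measure per query, then average over the test set) can be sketched as follows. This is a minimal illustration: `average_over_queries` and the stand-in measure `top_result_is_relevant` are hypothetical names, and the stand-in is used only so that the example runs; it is not one of the measures discussed in this section.

```python
def average_over_queries(per_query_measure, ranked_labels_by_query):
    """Average a per-query measure over every query in the test set.

    `ranked_labels_by_query` maps each query to the relevance labels of
    its ranked list, in rank order (top result first).
    """
    if not ranked_labels_by_query:
        raise ValueError("the test set must contain at least one query")
    values = [per_query_measure(labels)
              for labels in ranked_labels_by_query.values()]
    return sum(values) / len(values)


def top_result_is_relevant(ranked_labels):
    """Hypothetical stand-in measure: 1.0 if the top-ranked document is relevant."""
    return 1.0 if ranked_labels and ranked_labels[0] > 0 else 0.0
```

With two hypothetical queries whose ranked labels are `[1, 0, 0]` and `[0, 1, 0]`, the stand-in measure yields 1.0 and 0.0, so the average is 0.5.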
As will be seen below, the maximum values for some evaluation measures, such
as MRR, MAP, and NDCG, are one. Therefore, we can consider one minus these
measures (e.g., (1 − MRR), (1 − NDCG), and (1 − MAP)) as ranking errors. For
ease of reference, we call them measure-based ranking errors.
Mean Reciprocal Rank (MRR)
For query q, the rank position of its first relevant
document is denoted as $r_1$. Then $1/r_1$ is defined as MRR for query q. It is clear that
documents ranked below $r_1$ are not considered in MRR.
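A minimal Python sketch of the per-query reciprocal rank follows. The function name `reciprocal_rank` is hypothetical, and returning 0 when the list contains no relevant document is an assumption of this sketch, since the definition above does not cover that case.

```python
def reciprocal_rank(ranked_labels):
    """Return 1 / r_1, where r_1 is the 1-based rank of the first relevant document.

    `ranked_labels` lists relevance labels in rank order (top result first).
    Assumption: a list with no relevant document scores 0.0.
    """
    for position, label in enumerate(ranked_labels, start=1):
        if label > 0:
            # Everything ranked below r_1 is ignored.
            return 1.0 / position
    return 0.0
```

For the hypothetical label lists `[0, 1, 0]` and `[0, 1, 1]`, the first relevant document is at rank 2 in both cases, so both score 0.5; the corresponding measure-based ranking error is 1 − 0.5 = 0.5.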
11 This kind of judgment can also be mined from click-through logs of search engines [41, 42, 63].