and the τ_K between (1, 2, 3) and (3, 1, 2) is −1/3. Therefore, we can obtain that τ_K(π, Ω_l) = −1/3 in this case.
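As a cross-check, Kendall's τ for such a pair of rankings can be computed with a short script. This is a minimal sketch assuming the standard definition, τ = (#concordant pairs − #discordant pairs) / (#total pairs); other variants in the literature differ in normalization and sign convention.

```python
from itertools import combinations

def kendall_tau(pi1, pi2):
    """Kendall's tau rank correlation between two rank vectors.

    pi1[i] and pi2[i] are the positions assigned to item i by the two
    rankings.  tau = (concordant - discordant) / total pairs, in [-1, 1].
    """
    concordant = discordant = 0
    for i, j in combinations(range(len(pi1)), 2):
        # a pair is concordant when both rankings order items i and j
        # the same way, discordant otherwise
        if (pi1[i] - pi1[j]) * (pi2[i] - pi2[j]) > 0:
            concordant += 1
        else:
            discordant += 1
    total = concordant + discordant
    return (concordant - discordant) / total

print(kendall_tau((1, 2, 3), (3, 1, 2)))
```

For (1, 2, 3) versus (3, 1, 2), one pair is concordant and two are discordant, giving (1 − 2)/3 = −1/3 under this definition.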
To summarize, there are some common properties in these evaluation measures:
1. All these evaluation measures are calculated at the query level . That is, first the
measure is computed for each query, and then averaged over all queries in the test
set. No matter how poorly the documents associated with a particular query are
ranked, that query will not dominate the evaluation process, since each query
contributes equally to the average measure.
2. All these measures are position based . That is, rank position is explicitly used.
Considering that with small changes in the scores given by a ranking model, the
rank positions will not change until one document's score passes another, the
position-based measures are usually discontinuous and non-differentiable with
respect to the scores. This makes the optimization of these measures quite
difficult. We will discuss this further in Sect. 4.2.
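The second property can be seen concretely in the following sketch (the documents and scores are invented for illustration): precision at position 1 stays flat as the relevant document's score varies, and jumps only at the point where that score crosses the other document's score.

```python
# Illustrative sketch: a position-based measure (P@1) as a function of the
# relevant document's score.  The measure is piecewise constant in the score,
# hence discontinuous and non-differentiable.

def precision_at_1(scores, relevant):
    """1.0 if the top-ranked document (by descending score) is relevant."""
    ranked = sorted(range(len(scores)), key=lambda d: -scores[d])
    return 1.0 if ranked[0] in relevant else 0.0

relevant = {0}       # document 0 is the only relevant one (toy label)
other_score = 0.5    # fixed score of the irrelevant document 1
for s in (0.30, 0.45, 0.49, 0.51, 0.70):
    p = precision_at_1([s, other_score], relevant)
    print(f"score={s:.2f}  P@1={p}")
```

Small perturbations of the score (0.45 → 0.49) leave P@1 unchanged; the measure only jumps when the two documents swap rank positions (0.49 → 0.51), which is exactly what makes gradient-based optimization of such measures problematic.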
Note that although many researchers, when designing ranking models, assume
that a ranking model can assign a score to each query-document pair
independently of the other documents, when performing evaluation all the
documents associated with a query are considered together. Otherwise, one
cannot determine the rank position of a document, and the aforementioned
measures cannot be computed.
1.3 Learning to Rank
Many ranking models have been introduced in the previous section, most of which
contain parameters. For example, there are parameters k 1 and b in BM25 (see ( 1.2 )),
parameter λ in LMIR (see ( 1.3 )), and parameter α in PageRank (see ( 1.5 )). In order
to get a reasonably good ranking performance (in terms of evaluation measures), one
needs to tune these parameters using a validation set. Nevertheless, parameter tuning
is far from trivial, especially considering that evaluation measures are discontinuous
and non-differentiable with respect to the parameters. In addition, a model perfectly
tuned on the validation set sometimes performs poorly on unseen test queries. This
is usually called over-fitting. Another issue is regarding the combination of these
ranking models. Given that many models have been proposed in the literature, it is
natural to investigate how to combine these models and create an even more effective
new model. This is, however, not straightforward either.
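Such parameter tuning can be sketched as a simple grid search over (k1, b) for BM25 on a validation set. The corpus, query, relevance label, and parameter grid below are toy placeholders invented for illustration; a real system would evaluate over many validation queries with a measure such as NDCG rather than P@1 on a single query.

```python
import math
from itertools import product

def bm25_score(query_terms, doc, corpus, k1, b):
    """BM25 score of `doc` (a token list) for the given query terms."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    score = 0.0
    for t in query_terms:
        df = sum(1 for d in corpus if t in d)        # document frequency
        if df == 0:
            continue
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1.0)
        tf = doc.count(t)                            # term frequency
        norm = 1 - b + b * len(doc) / avgdl          # length normalization
        score += idf * tf * (k1 + 1) / (tf + k1 * norm)
    return score

# toy validation data: one query, one labeled relevant document
corpus = [
    "dogs dogs the quick brown fox jumps over lazy cats".split(),  # long doc
    "dogs bark loudly".split(),                                    # short doc
    "cats sleep all day long".split(),
]
query = ["dogs"]
relevant = {1}  # invented label: the short document is the relevant one

def validation_p_at_1(k1, b):
    """P@1 of the BM25 ranking under parameters (k1, b) on the toy query."""
    ranked = sorted(range(len(corpus)),
                    key=lambda d: -bm25_score(query, corpus[d], corpus, k1, b))
    return 1.0 if ranked[0] in relevant else 0.0

# pick the grid point with the best validation measure
grid = product([0.5, 1.2, 2.0], [0.0, 0.5, 0.75, 1.0])
best_k1, best_b = max(grid, key=lambda kb: validation_p_at_1(*kb))
print("best parameters: k1 =", best_k1, " b =", best_b)
```

Even in this toy setting the difficulties mentioned above are visible: the validation measure is a step function of (k1, b), so it offers no gradient to follow, and an exhaustive grid becomes expensive as the number of parameters grows.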
12 Note that this is not a complete introduction of evaluation measures for information retrieval.
There are several other measures proposed in the literature, some of which even consider the
novelty and diversity of the search results in addition to their relevance. One may refer to
[2, 17, 56, 91] for more information.