Introduction - Learning to Rank for Information Retrieval


otherwise, $l_{u,v} = -1$. That is, this kind of judgment captures the relative preference between documents (see footnote 11).

3. Total order: Human annotators specify the total order of the documents with respect to a query. For the set of documents $\{d_j\}_{j=1}^{m}$ associated with query $q$, this kind of judgment is usually represented as a certain permutation of these documents, denoted as $\pi_l$.
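To make the relationship between the second and third kinds of judgment concrete, the sketch below derives pairwise preference labels from a total order. It assumes the convention that $l_{u,v} = +1$ when $d_u$ is preferred to $d_v$ and $-1$ otherwise; the function name and the document identifiers are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def pairwise_labels_from_order(order):
    """Derive pairwise labels from a total order.

    `order` lists document ids from most to least preferred (the
    permutation pi_l). Returns {(u, v): +1 or -1}, where +1 means u is
    preferred to v (assumed convention).
    """
    position = {doc: rank for rank, doc in enumerate(order)}
    labels = {}
    # Visit each unordered pair once, in a deterministic order.
    for u, v in combinations(sorted(position), 2):
        labels[(u, v)] = 1 if position[u] < position[v] else -1
    return labels


# Hypothetical total order over three documents: d2 first, then d3, then d1.
pi_l = ["d2", "d3", "d1"]
labels = pairwise_labels_from_order(pi_l)
```

A total order over $m$ documents therefore implies $m(m-1)/2$ pairwise preferences. The reverse does not hold in general, since pairwise judgments collected independently can be incomplete or contradict each other.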

Among the three kinds of judgments above, the first is the most widely used, partly because it is the easiest to obtain: human assessors only need to look at each document individually to produce the judgment. By comparison, obtaining the third kind of judgment is the most costly. Therefore, in this topic, we will mostly use the first kind of judgment as the running example in our discussions.

Given the vital role that relevance judgments play in a test collection, it is important to assess the quality of the judgments. In earlier evaluation efforts such as TREC, both the completeness and the consistency of the relevance judgments are of interest. Completeness measures the degree to which all the relevant documents for a topic have been found; consistency measures the degree to which the assessor has marked all the “truly” relevant documents as relevant and the “truly” irrelevant documents as irrelevant.

Since manual judgment is always time consuming, it is almost impossible to judge all the documents with regard to a query. Consequently, there are always unjudged documents among those returned by the ranking model. As a common practice, one regards the unjudged documents as irrelevant in the evaluation process.
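The treatment of unjudged documents can be sketched in a few lines. The example assumes judgments are stored as a mapping from document id to a relevance label, with 0 meaning irrelevant; the function name and the ids are hypothetical.

```python
def labels_for_ranked_list(ranked_docs, judgments):
    """Return the relevance label of each ranked document.

    `judgments` maps document id -> label. Documents missing from the
    mapping are unjudged and are treated as irrelevant (label 0).
    """
    return [judgments.get(doc, 0) for doc in ranked_docs]


# Hypothetical data: d7 was returned by the ranking model but never judged.
judgments = {"d1": 1, "d2": 0, "d3": 1}
ranked_docs = ["d7", "d3", "d2", "d1"]
labels = labels_for_ranked_list(ranked_docs, judgments)
```

Here `d7` receives the label 0 even though no assessor looked at it, so a measure computed on `labels` may understate the quality of the ranking if `d7` is in fact relevant.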

With the relevance judgment, several evaluation measures have been proposed

and used in the literature of information retrieval. It is clear that understanding these

measures will be very important for learning to rank, since to some extent they

define the “true” objective function of ranking. Below we list some popularly used

measures. In order to better understand these measures, we use the example shown

in Fig. 1.4 to perform some quantitative calculation with respect to each measure.

In the example, there are three documents retrieved for the query “learning to rank”,

and binary judgment on the relevance of each document is provided.

Most of the evaluation measures are defined first for each query, as a function of

the ranked list π given by the ranking model and the relevance judgment. Then the

measures are averaged over all the queries in the test set.
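This two-step structure (a per-query measure, then an average over the test set) can be sketched as follows. The per-query measure used here, the fraction of relevant documents in the list, is only a placeholder chosen for illustration and is not one of the measures discussed in this section; all names and data are hypothetical.

```python
def mean_over_queries(measure, runs):
    """Average a per-query measure over all queries in the test set.

    `runs` maps query id -> list of binary relevance labels, given in
    the order produced by the ranking model.
    """
    values = [measure(labels) for labels in runs.values()]
    return sum(values) / len(values)


def fraction_relevant(labels):
    """Placeholder per-query measure: share of relevant documents."""
    return sum(labels) / len(labels)


# Hypothetical test set with two queries.
runs = {"q1": [1, 0], "q2": [1, 1]}
score = mean_over_queries(fraction_relevant, runs)
```

Any per-query function of the ranked list and the judgments can be passed in place of `fraction_relevant`, which is why the averaging step is kept separate.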

As will be seen below, the maximum values for some evaluation measures, such as MRR, MAP, and NDCG, are one. Therefore, we can consider one minus these measures (e.g., $(1 - \mathrm{MRR})$, $(1 - \mathrm{NDCG})$, and $(1 - \mathrm{MAP})$) as ranking errors. For ease of reference, we call them measure-based ranking errors.

Mean Reciprocal Rank (MRR)

For query $q$, the rank position of its first relevant document is denoted as $r_1$. Then $\frac{1}{r_1}$ is defined as MRR for query $q$. It is clear that documents ranked below $r_1$ are not considered in MRR.
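A minimal sketch of this definition follows, together with the corresponding measure-based ranking error $(1 - \mathrm{MRR})$. It assumes binary labels listed in ranked order and, as an assumption not stated in the text, assigns a reciprocal rank of 0 to a query with no relevant document. The labels are made up and are not those of Fig. 1.4.

```python
def reciprocal_rank(labels):
    """Return 1 / r_1, where r_1 is the 1-based rank of the first relevant document.

    Assumption: a list with no relevant document scores 0.
    """
    for position, label in enumerate(labels, start=1):
        if label > 0:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(runs):
    """Average the per-query reciprocal rank over all queries."""
    values = [reciprocal_rank(labels) for labels in runs.values()]
    return sum(values) / len(values)


# Hypothetical ranked lists of three documents for three queries.
runs = {
    "q1": [0, 1, 0],  # first relevant document at rank 2
    "q2": [1, 0, 0],  # first relevant document at rank 1
    "q3": [0, 0, 0],  # no relevant document retrieved
}
mrr = mean_reciprocal_rank(runs)
mrr_error = 1.0 - mrr  # measure-based ranking error
```

Because the loop returns at the first relevant document, changing any label after that position leaves the result unchanged, which mirrors the remark that documents ranked below $r_1$ are not considered.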

11 This kind of judgment can also be mined from click-through logs of search engines [41, 42, 63].
