1.2.2 Query-Level Position-Based Evaluations
Given the large number of ranking models introduced in the previous subsection, a standard evaluation mechanism is needed to select the most effective one. Evaluation has in fact played a very important role in the history of information retrieval. Information retrieval is an empirical science, and it has been a leader in computer science in understanding the importance of evaluation and benchmarking. Information retrieval has been well served by the Cranfield experimental methodology [81], which is based on sharable document collections, information needs (queries), and relevance assessments. By applying the Cranfield paradigm to document retrieval, the corresponding evaluation process can be described as follows.
- Collect a large number of (randomly sampled) queries to form a test set.
- For each query $q$,
  - Collect the documents $\{d_j\}_{j=1}^{m}$ associated with the query.
  - Get the relevance judgment for each document by human assessment.
  - Use a given ranking model to rank the documents.
  - Measure the difference between the ranking results and the relevance judgments using an evaluation measure.
- Use the average measure on all the queries in the test set to evaluate the performance of the ranking model.
As for collecting the documents associated with a query, a number of strategies
can be used. For example, one can simply collect all the documents containing the
query word. One can also choose to use some predefined rankers to get documents
that are more likely to be relevant. A popular strategy is the pooling method used in
TREC (http://trec.nist.gov/). In this method, a pool of possibly relevant documents is created by taking
a sample of documents selected by the various participating systems. In particular,
the top 100 documents retrieved in each submitted run for a given query are selected
and merged into the pool for human assessment.
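The pooling step can be summarized with a short sketch. The run format, function name, and depth parameter below are illustrative assumptions rather than TREC's actual tooling; TREC runs use their own file format and evaluation scripts.

```python
def build_pools(runs_per_query, depth=100):
    """Build a judgment pool per query from the submitted runs.

    runs_per_query: dict mapping query id -> list of runs,
    where each run is a ranked list of document ids.
    """
    pools = {}
    for query_id, runs in runs_per_query.items():
        pool = set()
        for ranked_docs in runs:
            # Take the top `depth` documents from each submitted run.
            pool.update(ranked_docs[:depth])
        # The merged pool is then sent to human assessors for judgment.
        pools[query_id] = pool
    return pools
```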
As for the relevance judgment, three strategies have been used in the literature.
1. Relevance degree: Human annotators specify whether a document is relevant to the query or not (i.e., binary judgment), or further specify the degree of relevance (i.e., multiple ordered categories, e.g., Perfect, Excellent, Good, Fair, or Bad). Suppose that for document $d_j$ associated with query $q$ we obtain its relevance judgment $l_j$. Then for two documents $d_u$ and $d_v$, if $l_u > l_v$, we say that document $d_u$ is more relevant than document $d_v$ with regard to query $q$, according to the relevance judgment (a small sketch of this is given after the list).
2. Pairwise preference: Human annotators specify whether a document is more relevant than another with regard to a query. For example, if document $d_u$ is judged to be more relevant than document $d_v$, we give the judgment $l_{u,v} = 1$;
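The following sketch illustrates how the two judgment strategies above relate: graded relevance labels induce pairwise preferences. The grade mapping, document ids, and helper name are hypothetical, chosen only for the example.

```python
# Hypothetical mapping from ordered relevance categories to numeric grades.
RELEVANCE_GRADES = {"Bad": 0, "Fair": 1, "Good": 2, "Excellent": 3, "Perfect": 4}

def pairwise_preferences(labels):
    """labels: dict mapping document id -> relevance degree l_j.

    Returns preferences l_{u,v} = 1 whenever d_u is judged more
    relevant than d_v (i.e., l_u > l_v).
    """
    prefs = {}
    docs = list(labels)
    for u in docs:
        for v in docs:
            if labels[u] > labels[v]:
                prefs[(u, v)] = 1
    return prefs

# Example: with l_1 = Perfect and l_2 = Good, d_1 is preferred to d_2.
judged = {"d1": RELEVANCE_GRADES["Perfect"], "d2": RELEVANCE_GRADES["Good"]}
assert pairwise_preferences(judged) == {("d1", "d2"): 1}
```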