1.2.2 Query-Level Position-Based Evaluations
Given the large number of ranking models introduced in the previous subsection, a standard evaluation mechanism is needed to select the most effective one. Evaluation has in fact played a very important role in the history of information retrieval. Information retrieval is an empirical science, and it has been a leader in computer science in understanding the importance of evaluation and benchmarking. Information retrieval has been well served by the Cranfield experimental methodology [81], which is based on sharable document collections, information needs (queries), and relevance assessments. By applying the Cranfield paradigm to document retrieval, the corresponding evaluation process can be described as follows.
- Collect a large number of (randomly sampled) queries to form a test set.
- For each query $q$,
  - Collect the documents $\{d_j\}_{j=1}^{m}$ associated with the query.
  - Get the relevance judgment for each document by human assessment.
  - Use a given ranking model to rank the documents.
  - Measure the difference between the ranking results and the relevance judgments using an evaluation measure.
- Use the average measure on all the queries in the test set to evaluate the performance of the ranking model.
As for collecting the documents associated with a query, a number of strategies
can be used. For example, one can simply collect all the documents containing the
query word. One can also choose to use some predefined rankers to get documents
that are more likely to be relevant. A popular strategy is the pooling method used in
TREC (http://trec.nist.gov/). In this method, a pool of possibly relevant documents is created by taking
a sample of documents selected by the various participating systems. In particular,
the top 100 documents retrieved in each submitted run for a given query are selected
and merged into the pool for human assessment.
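The pooling step can be summarized with a short sketch. The run format, function name, and depth parameter below are illustrative assumptions rather than TREC's actual tooling; TREC runs use their own file format and evaluation scripts.

```python
def build_pools(runs_per_query, depth=100):
    """Build a judgment pool per query from the submitted runs.

    runs_per_query: dict mapping query id -> list of runs,
    where each run is a ranked list of document ids.
    """
    pools = {}
    for query_id, runs in runs_per_query.items():
        pool = set()
        for ranked_docs in runs:
            # Take the top `depth` documents from each submitted run.
            pool.update(ranked_docs[:depth])
        # The merged pool is then sent to human assessors for judgment.
        pools[query_id] = pool
    return pools
```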
As for the relevance judgment, three strategies have been used in the literature.
1. Relevance degree: Human annotators specify whether a document is relevant to the query or not (i.e., binary judgment), or further specify the degree of relevance (i.e., multiple ordered categories, e.g., Perfect, Excellent, Good, Fair, or Bad). Suppose that for document $d_j$ associated with query $q$ we obtain its relevance judgment $l_j$. Then for two documents $d_u$ and $d_v$, if $l_u > l_v$, we say that document $d_u$ is more relevant than document $d_v$ with regard to query $q$, according to the relevance judgment (a small sketch of this is given after the list).
2. Pairwise preference: Human annotators specify whether a document is more relevant than another with regard to a query. For example, if document $d_u$ is judged to be more relevant than document $d_v$, we give the judgment $l_{u,v} = 1$;
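The following sketch illustrates how the two judgment strategies above relate: graded relevance labels induce pairwise preferences. The grade mapping, document ids, and helper name are hypothetical, chosen only for the example.

```python
# Hypothetical mapping from ordered relevance categories to numeric grades.
RELEVANCE_GRADES = {"Bad": 0, "Fair": 1, "Good": 2, "Excellent": 3, "Perfect": 4}

def pairwise_preferences(labels):
    """labels: dict mapping document id -> relevance degree l_j.

    Returns preferences l_{u,v} = 1 whenever d_u is judged more
    relevant than d_v (i.e., l_u > l_v).
    """
    prefs = {}
    docs = list(labels)
    for u in docs:
        for v in docs:
            if labels[u] > labels[v]:
                prefs[(u, v)] = 1
    return prefs

# Example: with l_1 = Perfect and l_2 = Good, d_1 is preferred to d_2.
judged = {"d1": RELEVANCE_GRADES["Perfect"], "d2": RELEVANCE_GRADES["Good"]}
assert pairwise_preferences(judged) == {("d1", "d2"): 1}
```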