Recall evaluates the capacity of a system to select all the relevant documents in
the collection (are all the relevant documents selected?) while precision evaluates the
capacity of the system to select only relevant documents (are all the selected
documents relevant?). Other measures based on these two have been
proposed [SAN 10]. For instance, mean average precision (MAP) is the mean, over a
given set of test queries, of the average precision obtained for each query.
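The measures above can be illustrated with a minimal sketch. The function names, document identifiers and query identifiers below are hypothetical, and average precision is computed in its usual uninterpolated form (the mean of the precision values at the ranks where a relevant document appears, divided by the number of relevant documents), which is an assumption about the exact variant intended here.

```python
from typing import Dict, Sequence, Set


def precision(selected: Sequence[str], relevant: Set[str]) -> float:
    """Share of the selected documents that are relevant."""
    if not selected:
        return 0.0
    return sum(1 for doc in selected if doc in relevant) / len(selected)


def recall(selected: Sequence[str], relevant: Set[str]) -> float:
    """Share of the relevant documents that were selected."""
    if not relevant:
        return 0.0
    return len(set(selected) & relevant) / len(relevant)


def average_precision(ranking: Sequence[str], relevant: Set[str]) -> float:
    """Sum of precision at each rank holding a relevant document,
    divided by the total number of relevant documents."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean_average_precision(
    runs: Dict[str, Sequence[str]], qrels: Dict[str, Set[str]]
) -> float:
    """Mean of the per-query average precision over all judged queries."""
    scores = [average_precision(runs.get(q, []), rel) for q, rel in qrels.items()]
    return sum(scores) / len(scores)


# Hypothetical toy data: two queries, ranked results and relevant sets.
runs = {"q1": ["d1", "d2", "d3", "d4"], "q2": ["d7", "d8"]}
qrels = {"q1": {"d1", "d3", "d5"}, "q2": {"d8"}}

p_q1 = precision(runs["q1"], qrels["q1"])          # 2 of 4 selected are relevant
r_q1 = recall(runs["q1"], qrels["q1"])             # 2 of 3 relevant are selected
ap_q1 = average_precision(runs["q1"], qrels["q1"])
map_score = mean_average_precision(runs, qrels)
```

On this toy data, the system selects only half relevant documents for q1 (precision 0.5) while finding two of the three relevant ones (recall 2/3), which shows how the two measures answer different questions.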
The field of IR is characterized by a long history of evaluation [VOO 02]. A way
to evaluate the IRSs is based on the definition of a “campaign” that occurs in the
1) The organizers spread a call for participation, which presents the proposed IR
given query. In contrast, for a question answering task, the aim is to retrieve a piece of
obtain a list of documents dealing with this subject for the ad hoc task, whereas we
would get the list of the beach names of Anglet for the question answering task.
2) The interested IRS designers register to the tasks of their choice. They are then
referred to as participants.
3) The organizers provide a corpus of documents and 25+ topics representing
information needs (i.e. detailed queries with description and narrative).
ranked by decreasing relevance).
5) The organizers constitute a set of relevant documents for each topic: the
relevance judgments. They then check participants' results against these relevance
judgments by means of predefined appropriate measures. The computed value
represents the effectiveness (i.e. measurement of result quality) of the IRS for the
considered topic. Aggregating all the scores obtained by the IRS for each of the 25+
topics (e.g. averaging over them) leads to an overall evaluation score for the IRS.
6) The organizers publish the results of the participants and generally make
available the test collection (i.e. corpus, topics and relevance judgments). This
collection can then be reused later in order to evaluate an IRS outside the campaign.
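The scoring step of such a campaign (checking submitted results against the relevance judgments, then aggregating over topics) can be sketched as follows. This is only an illustration under stated assumptions: the topic and document identifiers are invented, the data structures are hypothetical, and precision at rank k is chosen arbitrarily as the "predefined appropriate measure"; a real campaign defines its own measures and file formats.

```python
from statistics import mean
from typing import Dict, List, Set, Tuple


def precision_at_k(ranking: List[str], relevant: Set[str], k: int) -> float:
    """Share of the top-k returned documents that are judged relevant."""
    top_k = ranking[:k]
    return sum(1 for doc in top_k if doc in relevant) / k


def evaluate_run(
    run: Dict[str, List[str]],
    judgments: Dict[str, Set[str]],
    k: int = 3,
) -> Tuple[Dict[str, float], float]:
    """Score one participant's run: one value per topic, then their average.

    A topic for which the participant returned nothing scores 0.
    """
    per_topic = {
        topic: precision_at_k(run.get(topic, []), relevant, k)
        for topic, relevant in judgments.items()
    }
    overall = mean(per_topic.values())
    return per_topic, overall


# Hypothetical relevance judgments built by the organizers (topic -> relevant docs).
judgments = {"T1": {"d1", "d4"}, "T2": {"d2"}}

# Hypothetical run submitted by one participant (topic -> ranked documents).
run = {"T1": ["d1", "d2", "d4"], "T2": ["d3", "d2", "d5"]}

per_topic, overall = evaluate_run(run, judgments, k=3)
```

Because the test collection (corpus, topics and judgments) is published, the same `evaluate_run` call could later be applied to a run produced by a system that never took part in the campaign.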
As shown in Figure 2.4, TREC [VOO 05] is a reference campaign in IR allowing
us to evaluate IRSs with respect to the thematic dimension. SemEval [AGI 07] and
SemSearch [HAL 10] are, in particular, involved in the semantic analysis of textual
contents. Little work has been published on the evaluation of the other two
dimensions of geographic information. The spatial and temporal dimensions
have been the object, respectively, of the evaluation framework CLEF ([PET 01], task