3. Repeat step 2 until all the passages in the current list have been
examined.
After applying this algorithm, each passage in the new list is sufficiently dissimilar to the others, thus favoring diversity over redundancy in the new ranked list. The anti-redundancy threshold t is tuned on a training set.
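As a minimal sketch of this filtering step, the Python fragment below scans the ranked list top-down and keeps a passage only if its similarity to every passage kept so far stays below t. The cosine similarity over bag-of-words vectors is an assumption made for illustration; the text specifies only that passages are compared pairwise against the threshold.

def cosine_similarity(a, b):
    # Cosine similarity between two bag-of-words dicts (term -> weight).
    # This particular similarity function is an illustrative choice, not
    # one prescribed by the text.
    shared = set(a) & set(b)
    dot = sum(a[term] * b[term] for term in shared)
    norm_a = sum(w * w for w in a.values()) ** 0.5
    norm_b = sum(w * w for w in b.values()) ** 0.5
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def remove_redundant(ranked_passages, t):
    # Scan the ranked list top-down, keeping a passage only if it is
    # sufficiently dissimilar (similarity < t) to every passage kept so
    # far. t is the anti-redundancy threshold, tuned on a training set.
    kept = []
    for passage in ranked_passages:
        if all(cosine_similarity(passage, prev) < t for prev in kept):
            kept.append(passage)
    return kept

A larger t admits more near-duplicate passages, while a smaller t prunes more aggressively, which is why t must be tuned on held-out training data.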
9.4 Evaluation Methodology
The approach we proposed above for information distillation raises important issues regarding evaluation methodology. Firstly, since our system allows the output to be passages at different levels of granularity (e.g., k-sentence windows where k may vary) instead of a fixed level, it is not possible to have pre-annotated relevance judgments at all such granularity levels.
Secondly, since we wish to measure the utility of the system output as a combination of both relevance and novelty, traditional relevance-only measures must be replaced by measures that penalize the repetition of the same information in the system output across time. Thirdly, since the output of the system consists of ranked lists, we must reward systems that present useful information (both relevant and previously unseen) in shorter ranked lists, and penalize those that present the same information in longer ranked lists. None of the existing measures in ad hoc retrieval, adaptive filtering, novelty detection, or other related areas (text summarization and question answering) has the desired properties in all three aspects. Therefore, we must develop a new measure.
9.4.1 Answer Keys
To enable the evaluation of a system whose output consists of passages of
arbitrary length, we borrow the concept of answer keys from the Question
Answering (QA) community, where systems are allowed to produce arbitrary
spans of text as answers. Answer keys define what should be present in a system response to receive credit, and consist of a collection of information nuggets, i.e., factoid units about which human assessors can make binary decisions as to whether or not a system response contains them.
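To make the answer-key machinery concrete, the Python sketch below scores a response against a list of nuggets. The token-overlap test and its 0.6 threshold are invented stand-ins for the assessor's binary decision; as the next paragraph explains, the real decision requires semantic mapping by humans.

import string

def _tokens(text):
    # Lowercase tokens with surrounding punctuation stripped.
    return [tok.strip(string.punctuation) for tok in text.lower().split()]

def contains_nugget(response, nugget, min_overlap=0.6):
    # Binary decision: does the response contain this nugget?
    # Approximated here by the fraction of nugget tokens appearing in
    # the response; a crude proxy for a human semantic judgment.
    response_tokens = set(_tokens(response))
    nugget_tokens = _tokens(nugget)
    matched = sum(1 for tok in nugget_tokens if tok in response_tokens)
    return matched / len(nugget_tokens) >= min_overlap

def nugget_recall(response, answer_key):
    # Fraction of nuggets in the answer key credited to the response.
    hits = sum(contains_nugget(response, nugget) for nugget in answer_key)
    return hits / len(answer_key)

# Hypothetical answer key, for illustration only.
answer_key = ["magnitude 7.9 earthquake", "struck Sichuan province"]
print(nugget_recall("A magnitude 7.9 earthquake struck Sichuan.", answer_key))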
Defining answer keys and the associated binary decisions is a conceptual
task that requires semantic mapping (22), since a system can present the
same piece of information in many different ways. Hence, QA evaluations have relied on human assessors, making them costly, time-consuming, and not scalable to large query sets, document collections, and extensive system evaluations with various parameter settings.
 