evaluations for the TAC/DUC summarization challenges, with five categories corresponding to
grammaticality, non-redundancy, referential clarity, focus, and structure and coherence.11
The reasons why a summary might be informative but still score poorly on readability are
diverse; for extractive summaries, it may be the case that the source documents feature very noisy
or ungrammatical text. This is particularly an issue with conversational data, where sentences may
contain filled pauses, false starts, misspellings and sentence fragments. For example, Sentence 1
below features a false start, but this has been repaired by the summarization system in Sentence 2,
leading to better readability:
1. So you will have - Baba and David Jordan, you will have to work together on the prototype.
2. Baba and David Jordan, you will have to work together on the prototype.
For abstractive summaries, readability and linguistic quality will largely depend on the quality
of the language generation component. If the abstracts are lacking in lexical diversity or do not
properly handle anaphora (expressions referring to other expressions, e.g., pronouns), to give just
two examples, they will likely be scored poorly on linguistic quality.
2.3.4 EVALUATION METRICS FOR SUMMARIZATION: A FINAL OVERVIEW
Figure 2.7 places the major evaluation methods we have discussed onto two axes, one describing how
automated the evaluation is and one describing how deeply the summaries are analyzed. By auto-
mated vs. manual, we are indicating how much manual intervention is required to evaluate
a newly generated summary. Evaluation methods such as ROUGE and precision/recall/F-score
require only an initial manual generation of gold-standard extracts or abstracts, and subsequently new
summaries are evaluated in a fully automatic fashion. In contrast, the Pyramid evaluation requires
annotation of each new summary, and extrinsic evaluations that measure how well the summaries
aid a user in performing a task require a great deal of manual intervention, such as recruiting par-
ticipants, designing the study, analyzing the results, etc. An extrinsic evaluation setup that measures
how automatic summarization improves an information retrieval task, on the other hand, might be
automated and easily replicable.
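For the extractive case, the fully automatic part of such an evaluation is simple once gold-standard extracts exist. The following is a minimal sketch (not the scoring script of any particular challenge) which assumes, for illustration, that both the system output and the gold-standard extract are represented as sets of sentence indices into the source document:

    # Sentence-level precision/recall/F-score for an extractive summary,
    # compared against a gold-standard extract. Both summaries are assumed
    # to be sets of sentence indices into the source document.
    def sentence_prf(system_sents: set, gold_sents: set) -> tuple:
        """Return (precision, recall, f_score) of system vs. gold extract."""
        if not system_sents or not gold_sents:
            return 0.0, 0.0, 0.0
        overlap = len(system_sents & gold_sents)
        precision = overlap / len(system_sents)
        recall = overlap / len(gold_sents)
        f_score = (2 * precision * recall / (precision + recall)
                   if precision + recall > 0 else 0.0)
        return precision, recall, f_score

    # Example: the system selected sentences 2, 5 and 9; the gold extract
    # contains sentences 2, 5, 7 and 9.
    print(sentence_prf({2, 5, 9}, {2, 5, 7, 9}))  # (1.0, 0.75, ~0.857)

Once the gold extracts have been created, every new system summary can be scored this way with no further human effort, which is what places these metrics at the automated end of the axis.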
By shallow vs. deep, we are indicating whether the evaluation methods are analyzing the
summaries at a superficial, surface level or at a deeper level corresponding to meaning or utility.
ROUGE and sentence precision/recall/F-score are both fairly shallow, measuring n-gram overlap
and sentence overlap with gold-standard summaries, respectively. Pyramid and Basic Elements both
operate at a more semantic, conceptual level, while extrinsic evaluations go beyond meaning to
measure actual utility.
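To make the notion of shallow n-gram overlap concrete, the sketch below computes a simplified ROUGE-N recall: the proportion of reference n-grams that also appear in the candidate summary, with clipped counts. This is only an illustration of the idea; the actual ROUGE package additionally supports stemming, stopword removal, multiple references, and several scoring variants.

    # Simplified ROUGE-N recall: the fraction of reference n-grams that also
    # appear in the candidate summary (counts clipped to the candidate).
    from collections import Counter

    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    def rouge_n_recall(candidate: str, reference: str, n: int = 2) -> float:
        cand = ngrams(candidate.lower().split(), n)
        ref = ngrams(reference.lower().split(), n)
        if not ref:
            return 0.0
        overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
        return overlap / sum(ref.values())

    print(rouge_n_recall("the cat sat on the mat", "the cat lay on the mat", n=2))  # 0.6

Because the score depends only on surface token matches, two summaries expressing the same content in different words can receive very different scores, which is precisely why such metrics sit at the shallow end of the axis.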
11 http://www-nlpir.nist.gov/projects/duc/duc2007/quality-questions.txt