evaluations for the TAC/DUC summarization challenges, with five categories corresponding to
grammaticality, non-redundancy, referential clarity, focus, and structure and coherence.11
The reasons why a summary might be informative but still score poorly on readability are
diverse; for extractive summaries, it may be the case that the source documents feature very noisy
or ungrammatical text. This is particularly an issue with conversational data, where sentences may
contain filled pauses, false starts, misspellings and sentence fragments. For example, Sentence 1
below features a false start, but this has been repaired by the summarization system in Sentence 2,
leading to better readability:
1. So you will have - Baba and David Jordan, you will have to work together on the prototype.
2. Baba and David Jordan, you will have to work together on the prototype.
For abstractive summaries, readability and linguistic quality will largely depend on the quality
of the language generation component. If the abstracts are lacking in lexical diversity or do not
properly handle anaphora (expressions referring to other expressions, e.g., pronouns), to give just
two examples, they will likely be scored poorly on linguistic quality.
2.3.4 EVALUATION METRICS FOR SUMMARIZATION: A FINAL OVERVIEW
Figure 2.7 places the major evaluation methods we have discussed onto two axes, one describing how
automated the evaluation is and one describing how deeply the summaries are analyzed. By auto-
mated vs. manual, we are indicating how much manual intervention is required to evaluate
a newly generated summary. Evaluation methods such as ROUGE and precision/recall/F-score
require only an initial manual generation of gold-standard extracts or abstracts, and subsequently new
summaries are evaluated in a fully automatic fashion. In contrast, the Pyramid evaluation requires
annotation of each new summary, and extrinsic evaluations that measure how well the summaries
aid a user in performing a task require a great deal of manual intervention, such as recruiting par-
ticipants, designing the study, analyzing the results, etc. An extrinsic evaluation setup that measures
how automatic summarization improves an information retrieval task, on the other hand, might be
automated and easily replicable.
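For the extractive case, the fully automatic part of such an evaluation is simple once gold-standard extracts exist. The following is a minimal sketch (not the scoring script of any particular challenge) which assumes, for illustration, that both the system output and the gold-standard extract are represented as sets of sentence indices into the source document:

    # Sentence-level precision/recall/F-score for an extractive summary,
    # compared against a gold-standard extract. Both summaries are assumed
    # to be sets of sentence indices into the source document.
    def sentence_prf(system_sents: set, gold_sents: set) -> tuple:
        """Return (precision, recall, f_score) of system vs. gold extract."""
        if not system_sents or not gold_sents:
            return 0.0, 0.0, 0.0
        overlap = len(system_sents & gold_sents)
        precision = overlap / len(system_sents)
        recall = overlap / len(gold_sents)
        f_score = (2 * precision * recall / (precision + recall)
                   if precision + recall > 0 else 0.0)
        return precision, recall, f_score

    # Example: the system selected sentences 2, 5 and 9; the gold extract
    # contains sentences 2, 5, 7 and 9.
    print(sentence_prf({2, 5, 9}, {2, 5, 7, 9}))  # (1.0, 0.75, ~0.857)

Once the gold extracts have been created, every new system summary can be scored this way with no further human effort, which is what places these metrics at the automated end of the axis.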
By shallow vs. deep, we are indicating whether the evaluation methods are analyzing the
summaries at a superficial, surface level or at a deeper level corresponding to meaning or utility.
ROUGE and sentence precision/recall/F-score are both fairly shallow, measuring n-gram overlap
and sentence overlap with gold-standard summaries, respectively. Pyramid and Basic Elements both
operate at a more semantic, conceptual level, while extrinsic evaluations go beyond meaning to
measure actual utility.
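To make the notion of shallow n-gram overlap concrete, the sketch below computes a simplified ROUGE-N recall: the proportion of reference n-grams that also appear in the candidate summary, with clipped counts. This is only an illustration of the idea; the actual ROUGE package additionally supports stemming, stopword removal, multiple references, and several scoring variants.

    # Simplified ROUGE-N recall: the fraction of reference n-grams that also
    # appear in the candidate summary (counts clipped to the candidate).
    from collections import Counter

    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    def rouge_n_recall(candidate: str, reference: str, n: int = 2) -> float:
        cand = ngrams(candidate.lower().split(), n)
        ref = ngrams(reference.lower().split(), n)
        if not ref:
            return 0.0
        overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
        return overlap / sum(ref.values())

    print(rouge_n_recall("the cat sat on the mat", "the cat lay on the mat", n=2))  # 0.6

Because the score depends only on surface token matches, two summaries expressing the same content in different words can receive very different scores, which is precisely why such metrics sit at the shallow end of the axis.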
11 http://www-nlpir.nist.gov/projects/duc/duc2007/quality-questions.txt