multiple references. With ROUGE, n-gram overlap between a machine summary and multiple human references is calculated, on the assumption that a good machine summary will contain certain elements of each reference. With Pyramid, the SCUs are weighted according to how many reference summaries they occur in, and with weighted F-score, we rely on multiple annotators' links between extracts and abstracts. Teufel and van Halteren [2004] and Nenkova et al. [2007] discussed how many references are needed to produce reliable scores, but the crucial point is that there is no such thing as a single best summary, and multiple gold-standard reference summaries are therefore desirable. As Galley [2006] observed, the challenge is not low inter-annotator agreement itself but rather using evaluation metrics that account for the diversity of reference summaries.
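To make the n-gram overlap idea concrete, the following Python sketch computes a simple multi-reference ROUGE-N recall. It is a minimal illustration rather than the official ROUGE toolkit: the function names and toy sentences are invented for the example, and taking the maximum score over references is only one common convention (averaging over references is another).

```python
from collections import Counter

def ngrams(tokens, n):
    """Return a multiset (Counter) of n-grams from a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(candidate, reference, n=2):
    """Clipped n-gram recall of a candidate summary against one reference."""
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    if not ref:
        return 0.0
    # Clip counts so repeating an n-gram is not rewarded beyond the reference.
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / sum(ref.values())

def multi_ref_rouge_n(candidate, references, n=2):
    """Score against several human references, taking the best match."""
    return max(rouge_n_recall(candidate, ref, n) for ref in references)

# Toy example with invented sentences.
refs = ["the committee approved the budget on friday",
        "the budget was approved by the committee on friday"]
print(multi_ref_rouge_n("the committee approved the budget", refs, n=2))
```

The clipping via min() reflects the standard practice of not rewarding a candidate for repeating an n-gram more often than the reference contains it.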
This has been a necessarily incomplete overview of summarization metrics, as many in-house metrics have proliferated over the years, and for much of that time there was no widespread agreement on which metrics to use. This was a research bottleneck, since researchers could not easily compare their results with one another. It is less of a problem now that the community has largely adopted ROUGE and Pyramid as standard metrics. We have also focused on generally applicable metrics in this section, and so have ignored metrics such as summary accuracy [Zechner and Waibel, 2000], which is speech-specific in that it incorporates speech recognition error rate. Each of the metrics we have described here has advantages and disadvantages. What metrics like ROUGE and
weighted precision have in common is that there is an initial stage of manually creating model
summaries, and subsequently new machine summaries can be quickly and automatically evaluated.
In contrast, Pyramid evaluation requires additional manual annotation of machine summaries. On
the other hand, an evaluation scheme like Pyramid operates at a more meaningful level of granularity
compared to using n-grams or entire sentences since an SCU roughly represents a concept that can
be realized in many surface forms. What all these schemes have in common is replicability: once the relevant annotations have been done, the results can be reproduced, which is not feasible when simply enlisting human judges to conduct subjective evaluations of summary informativeness or
quality. Such human evaluations are very useful for periodic large-scale evaluation of summarization
systems, however, and crucial for ensuring that automatic or semi-automatic metrics correlate with
human judgements or real-world utility.
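To illustrate the Pyramid scoring described above at the SCU level, here is a small Python sketch: each SCU is weighted by the number of reference summaries it appears in, and a machine summary is scored by the weight it recovers relative to an ideally informative summary containing the same number of SCUs. The SCU labels and data structures are invented for the example; in practice the SCUs in both the references and the machine summary must be identified by manual annotation, which is exactly the extra cost noted above.

```python
from collections import Counter

def pyramid_score(machine_scus, reference_scu_sets):
    """Toy Pyramid score: weight each SCU by the number of reference
    summaries it occurs in, then compare the weight recovered by the
    machine summary to the best weight obtainable with the same number
    of SCUs (an ideally informative summary of equal size)."""
    weights = Counter()
    for scus in reference_scu_sets:
        weights.update(set(scus))          # each reference counts an SCU once

    observed = sum(weights[scu] for scu in set(machine_scus))

    k = len(set(machine_scus))
    ideal = sum(sorted(weights.values(), reverse=True)[:k])
    return observed / ideal if ideal else 0.0

# Invented SCU labels, standing in for manually annotated content units.
references = [{"budget_approved", "vote_on_friday", "close_vote"},
              {"budget_approved", "vote_on_friday"},
              {"budget_approved", "amendment_rejected"}]
machine = {"budget_approved", "amendment_rejected"}
print(pyramid_score(machine, references))   # 4 / 5 = 0.8
```

With the toy data above, the machine summary recovers weight 4 out of an ideal 5 for a two-SCU summary, giving a score of 0.8.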
2.3.2 EXTRINSIC SUMMARIZATION EVALUATION
While intrinsic evaluation metrics are essential for expediting development and can be easily replicated, they should be chosen according to whether they are good predictors of extrinsic usefulness, i.e., whether they correlate with a measure of real-world utility. Evaluating against human gold-standard annotations is sensible and practical, but ultimately all summarization work is done for the purpose of facilitating some task and should be evaluated in the context of that task. As Sparck Jones put it, “it is impossible to evaluate summaries properly without knowing what they are for” [Jones, 1999]. Ideally, even evaluation measures that compare a system-generated summary with a full source document or a model summary would do so with regard to use constraints.
One popular extrinsic evaluation has been the relevance assessment task [Mani, 2001b]. With relevance assessment