multiple references. With ROUGE, n-gram overlap between a machine summary and multiple human references is calculated, on the assumption that a good machine summary will contain certain elements of each reference. With Pyramid, the SCUs are weighted according to how many reference summaries they occur in, and with weighted F-score, we rely on multiple annotators' links between extracts and abstracts. Teufel and van Halteren [2004] and Nenkova et al. [2007] discussed how many references are needed to produce reliable scores, but the crucial point is that there is no such thing as a single best summary, and multiple gold-standard reference summaries are therefore desirable. As Galley [2006] observed, the challenge is not low inter-annotator agreement itself but rather using evaluation metrics that account for the diversity of reference summaries.
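To make the n-gram overlap idea concrete, the following Python sketch computes a simple multi-reference ROUGE-N recall. It is a minimal illustration rather than the official ROUGE toolkit: the function names and toy sentences are invented for the example, and taking the maximum score over references is only one common convention (averaging over references is another).

```python
from collections import Counter

def ngrams(tokens, n):
    """Return a multiset (Counter) of n-grams from a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(candidate, reference, n=2):
    """Clipped n-gram recall of a candidate summary against one reference."""
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    if not ref:
        return 0.0
    # Clip counts so repeating an n-gram is not rewarded beyond the reference.
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / sum(ref.values())

def multi_ref_rouge_n(candidate, references, n=2):
    """Score against several human references, taking the best match."""
    return max(rouge_n_recall(candidate, ref, n) for ref in references)

# Toy example with invented sentences.
refs = ["the committee approved the budget on friday",
        "the budget was approved by the committee on friday"]
print(multi_ref_rouge_n("the committee approved the budget", refs, n=2))
```

The clipping via min() reflects the standard practice of not rewarding a candidate for repeating an n-gram more often than the reference contains it.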
This has been a necessarily incomplete overview of summarization metrics, as many in-house metrics have proliferated over the years, and for much of that time there was no widespread agreement on which metrics to use. This was a research bottleneck, since researchers could not easily compare their results with one another. It is less of a problem now that the community has largely adopted ROUGE and Pyramid as standard metrics. We have also focused on generally applicable metrics in this section, and so have ignored metrics such as summary accuracy [Zechner and Waibel, 2000], which is speech-specific in that it incorporates speech recognition error rate. Each of the metrics we have described here has advantages and disadvantages. What metrics like ROUGE and
weighted precision have in common is that there is an initial stage of manually creating model
summaries, and subsequently new machine summaries can be quickly and automatically evaluated.
In contrast, Pyramid evaluation requires additional manual annotation of machine summaries. On
the other hand, an evaluation scheme like Pyramid operates at a more meaningful level of granularity
compared to using n-grams or entire sentences since an SCU roughly represents a concept that can
be realized in many surface forms. What all these schemes have in common is replicability: once the relevant annotations have been done, the results can be reproduced, which is not feasible when simply enlisting human judges to conduct subjective evaluations of summary informativeness or
quality. Such human evaluations are very useful for periodic large-scale evaluation of summarization
systems, however, and crucial for ensuring that automatic or semi-automatic metrics correlate with
human judgements or real-world utility.
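To illustrate the Pyramid scoring described above at the SCU level, here is a small Python sketch: each SCU is weighted by the number of reference summaries it appears in, and a machine summary is scored by the weight it recovers relative to an ideally informative summary containing the same number of SCUs. The SCU labels and data structures are invented for the example; in practice the SCUs in both the references and the machine summary must be identified by manual annotation, which is exactly the extra cost noted above.

```python
from collections import Counter

def pyramid_score(machine_scus, reference_scu_sets):
    """Toy Pyramid score: weight each SCU by the number of reference
    summaries it occurs in, then compare the weight recovered by the
    machine summary to the best weight obtainable with the same number
    of SCUs (an ideally informative summary of equal size)."""
    weights = Counter()
    for scus in reference_scu_sets:
        weights.update(set(scus))          # each reference counts an SCU once

    observed = sum(weights[scu] for scu in set(machine_scus))

    k = len(set(machine_scus))
    ideal = sum(sorted(weights.values(), reverse=True)[:k])
    return observed / ideal if ideal else 0.0

# Invented SCU labels, standing in for manually annotated content units.
references = [{"budget_approved", "vote_on_friday", "close_vote"},
              {"budget_approved", "vote_on_friday"},
              {"budget_approved", "amendment_rejected"}]
machine = {"budget_approved", "amendment_rejected"}
print(pyramid_score(machine, references))   # 4 / 5 = 0.8
```

With the toy data above, the machine summary recovers weight 4 out of an ideal 5 for a two-SCU summary, giving a score of 0.8.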
2.3.2 EXTRINSIC SUMMARIZATION EVALUATION
While intrinsic evaluation metrics are essential for expediting development and can be easily replicated, they should be chosen according to whether they are good predictors of extrinsic usefulness, i.e., whether they correlate with a measure of real-world utility. Evaluating against human gold-standard annotations is sensible and practical, but ultimately all summarization work is done for the purpose of facilitating some task and should be evaluated in the context of that task. As Sparck Jones put it, “it is impossible to evaluate summaries properly without knowing what they are for” [Jones, 1999]. Ideally, even evaluation measures that compare a system-generated summary with a full source document or a model summary would do so with regard to use constraints.
One popular extrinsic evaluation has been the relevance assessment task [Mani, 2001b]. With relevance assessment