same number of SCUs, we would compare it with an ideal summary containing the average number of SCUs in all the human model summaries used to create the Pyramid model. For instance, in our running example, a machine summary containing the three SCUs (SCU1, SCU3, SCU4) would have a precision of 10/11, but a recall of 10/14, because the average number of SCUs in the model summaries is 4 (see Figure 2.5), and, as we have already seen, the sum of the weights of an ideal summary of length four is (4 + 4 + 3 + 3) = 14.
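The calculation can be sketched in a few lines of code. The SCU weights below (SCU1 = 4, SCU2 = 4, SCU3 = 3, SCU4 = 3) are illustrative values chosen to reproduce the sums in the running example rather than figures read off Figure 2.5, and the function names are ours:

```python
def ideal_weight(scu_weights, n):
    """Weight of an ideal summary of size n: the sum of the n largest SCU weights."""
    return sum(sorted(scu_weights.values(), reverse=True)[:n])

def pyramid_scores(scu_weights, summary_scus, avg_model_size):
    """Precision-style and recall-style Pyramid scores for a machine summary.

    Precision divides the observed weight by the weight of an ideal summary of
    the same size; recall divides it by the weight of an ideal summary whose
    size is the average number of SCUs in the human model summaries.
    """
    observed = sum(scu_weights[s] for s in summary_scus)
    precision = observed / ideal_weight(scu_weights, len(summary_scus))
    recall = observed / ideal_weight(scu_weights, avg_model_size)
    return precision, recall

# Illustrative weights consistent with the running example.
weights = {"SCU1": 4, "SCU2": 4, "SCU3": 3, "SCU4": 3}
p, r = pyramid_scores(weights, ["SCU1", "SCU3", "SCU4"], avg_model_size=4)
print(p, r)  # 10/11 ≈ 0.91 and 10/14 ≈ 0.71
```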
The advantage of the Pyramid method is that it uses content units of variable length and assigns weights to them by importance, according to how often they occur in the model summaries; the disadvantage is that the scheme requires a great deal of human annotation, since every new machine summary must be annotated for SCUs. Pyramid was used as part of the DUC 2005 evaluation, with numerous institutions taking part in the peer annotation step, and while the submitted peer annotations required a substantial amount of correction, Nenkova et al. [2007] reported acceptable levels of inter-annotator agreement.
Galley [2006] introduced a matching constraint for the Pyramid method when applied to meeting transcripts: when comparing machine extracts to model extracts, SCUs are only considered to match if they originate from the same sentence in the transcript. This accounts for the fact that two sentences might be superficially similar in each containing a particular SCU, yet have quite different overall meanings.
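One way to realize this constraint is sketched below; the data layout (each SCU mention paired with the index of the transcript sentence it was drawn from) is our own assumption for illustration, not Galley's implementation:

```python
def constrained_matches(machine_scus, model_scus):
    """Count an SCU as matched only if the machine and model mentions
    originate from the same transcript sentence (Galley-style constraint)."""
    matched = []
    for label, machine_sent in machine_scus:
        for model_label, model_sent in model_scus:
            if label == model_label and machine_sent == model_sent:
                matched.append(label)
                break
    return matched

# Each entry is (SCU label, index of the source sentence in the transcript).
machine = [("SCU1", 17), ("SCU3", 42)]
model   = [("SCU1", 17), ("SCU3", 55), ("SCU4", 60)]
print(constrained_matches(machine, model))  # ['SCU1']; SCU3 comes from different sentences
```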
The weighted F-score metric [Murray et al., 2006] is analogous to the Pyramid method, but with full sentences as the SCUs. This evaluation metric relies on human gold-standard abstracts,
multiple human extracts, and the many-to-many mapping between the abstracts and extracts as
described in Section 2.1.1. The idea is that document sentences are weighted according to how
often they are linked to an abstract sentence, analogous to weighted Pyramid SCUs. The metric
was originally precision-based but was later extended to weighted precision/recall/F-score. The
advantage of the scheme is that once the model annotations have been completed, new machine
summaries can be easily and quickly evaluated, but the disadvantage is that it is limited to evaluating
extractive summaries and works only at the dialogue act level.
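A hedged sketch of such a scheme is given below. Sentence weights are the number of times annotators linked each dialogue act to an abstract sentence; the normalization shown here (precision against the best possible extract of the same size, recall against the total available weight) is one plausible choice and may not match the exact formulation of Murray et al. [2006]:

```python
def weighted_prf(sentence_weights, extract):
    """Weighted precision/recall/F-score for an extractive summary.

    sentence_weights: dialogue-act id -> number of annotator links to abstract sentences.
    extract:          dialogue-act ids selected by the summarizer.
    """
    captured = sum(sentence_weights.get(s, 0) for s in extract)
    # Precision: captured weight relative to the best possible extract of the same size.
    best_same_size = sum(sorted(sentence_weights.values(), reverse=True)[:len(extract)])
    precision = captured / best_same_size if best_same_size else 0.0
    # Recall: captured weight relative to all the weight available in the document.
    total = sum(sentence_weights.values())
    recall = captured / total if total else 0.0
    fscore = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return precision, recall, fscore

link_counts = {"da_01": 3, "da_02": 0, "da_03": 2, "da_04": 1}  # hypothetical annotation
print(weighted_prf(link_counts, ["da_01", "da_03"]))  # (1.0, 0.833..., 0.909...)
```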
The challenge with evaluating summaries intrinsically is that there is not normally a single
best summary for a given source document, as illustrated by the low κ scores between human
annotators. Given the same input, human judges will often exhibit low agreement in the units they
select [Mani, 2001b, Mani et al., 1999]. In early work on automatic text summarization, Rath et al. [1961] showed that even a single judge who summarizes a document once and then summarizes it
again several weeks later will often create two very different summaries (in that specific case, judges
could only remember which sentences they had previously selected 42.5% of the time). With many
annotation tasks, such as dialogue act labeling, one can expect high inter-annotator agreement, but summarization annotation is clearly a more difficult task. As Mani et al. [1999]
pointed out, there are similar problems regarding the evaluation of other NLP technologies that may
have more than one acceptable output, such as natural language generation and machine translation.
The metrics described above have various ways of addressing this challenge, relying generally on