same number of SCUs, we would compare it with an ideal summary containing the average number of SCUs in all the human model summaries used to create the Pyramid model. For instance, in our running example, a machine summary containing the three SCUs (SCU1, SCU3, SCU4) would have a precision of 10/11, but a recall of 10/14, because the average number of SCUs in the model summaries is 4 (see Figure 2.5), and, as we have already seen, the sum of the weights of an ideal summary of length four is (4 + 4 + 3 + 3) = 14.
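The calculation can be sketched in a few lines of code. The SCU weights below (SCU1 = 4, SCU2 = 4, SCU3 = 3, SCU4 = 3) are illustrative values chosen to reproduce the sums in the running example rather than figures read off Figure 2.5, and the function names are ours:

```python
def ideal_weight(scu_weights, n):
    """Weight of an ideal summary of size n: the sum of the n largest SCU weights."""
    return sum(sorted(scu_weights.values(), reverse=True)[:n])

def pyramid_scores(scu_weights, summary_scus, avg_model_size):
    """Precision-style and recall-style Pyramid scores for a machine summary.

    Precision divides the observed weight by the weight of an ideal summary of
    the same size; recall divides it by the weight of an ideal summary whose
    size is the average number of SCUs in the human model summaries.
    """
    observed = sum(scu_weights[s] for s in summary_scus)
    precision = observed / ideal_weight(scu_weights, len(summary_scus))
    recall = observed / ideal_weight(scu_weights, avg_model_size)
    return precision, recall

# Illustrative weights consistent with the running example.
weights = {"SCU1": 4, "SCU2": 4, "SCU3": 3, "SCU4": 3}
p, r = pyramid_scores(weights, ["SCU1", "SCU3", "SCU4"], avg_model_size=4)
print(p, r)  # 10/11 ≈ 0.91 and 10/14 ≈ 0.71
```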
The advantage of the Pyramid method is that it uses content units of variable length and assigns weights to them by importance, according to how often they occur in the model summaries; the disadvantage is that the scheme requires a great deal of human annotation, since every new machine summary must be annotated for SCUs. Pyramid was used as part of the DUC 2005 evaluation, with numerous institutions taking part in the peer annotation step, and while the submitted peer annotations required a substantial amount of correction, Nenkova et al. [2007] reported acceptable levels of inter-annotator agreement.
Galley [2006] introduced a matching constraint for the Pyramid method when applied to meeting transcripts: when comparing machine extracts to model extracts, SCUs are only considered to match if they originate from the same sentence in the transcript. This accounts for the fact that two sentences might be superficially similar in each containing a particular SCU, yet have quite different overall meanings.
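One way to realize this constraint is sketched below; the data layout (each SCU mention paired with the index of the transcript sentence it was drawn from) is our own assumption for illustration, not Galley's implementation:

```python
def constrained_matches(machine_scus, model_scus):
    """Count an SCU as matched only if the machine and model mentions
    originate from the same transcript sentence (Galley-style constraint)."""
    matched = []
    for label, machine_sent in machine_scus:
        for model_label, model_sent in model_scus:
            if label == model_label and machine_sent == model_sent:
                matched.append(label)
                break
    return matched

# Each entry is (SCU label, index of the source sentence in the transcript).
machine = [("SCU1", 17), ("SCU3", 42)]
model   = [("SCU1", 17), ("SCU3", 55), ("SCU4", 60)]
print(constrained_matches(machine, model))  # ['SCU1']; SCU3 comes from different sentences
```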
The weighted F-score metric [Murray et al., 2006] is analogous to the Pyramid method, but with full sentences as the SCUs. This evaluation metric relies on human gold-standard abstracts,
multiple human extracts, and the many-to-many mapping between the abstracts and extracts as
described in Section 2.1.1. The idea is that document sentences are weighted according to how
often they are linked to an abstract sentence, analogous to weighted Pyramid SCUs. The metric
was originally precision-based but was later extended to weighted precision/recall/F-score. The
advantage of the scheme is that once the model annotations have been completed, new machine
summaries can be easily and quickly evaluated, but the disadvantage is that it is limited to evaluating
extractive summaries and works only at the dialogue act level.
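A hedged sketch of such a scheme is given below. Sentence weights are the number of times annotators linked each dialogue act to an abstract sentence; the normalization shown here (precision against the best possible extract of the same size, recall against the total available weight) is one plausible choice and may not match the exact formulation of Murray et al. [2006]:

```python
def weighted_prf(sentence_weights, extract):
    """Weighted precision/recall/F-score for an extractive summary.

    sentence_weights: dialogue-act id -> number of annotator links to abstract sentences.
    extract:          dialogue-act ids selected by the summarizer.
    """
    captured = sum(sentence_weights.get(s, 0) for s in extract)
    # Precision: captured weight relative to the best possible extract of the same size.
    best_same_size = sum(sorted(sentence_weights.values(), reverse=True)[:len(extract)])
    precision = captured / best_same_size if best_same_size else 0.0
    # Recall: captured weight relative to all the weight available in the document.
    total = sum(sentence_weights.values())
    recall = captured / total if total else 0.0
    fscore = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return precision, recall, fscore

link_counts = {"da_01": 3, "da_02": 0, "da_03": 2, "da_04": 1}  # hypothetical annotation
print(weighted_prf(link_counts, ["da_01", "da_03"]))  # (1.0, 0.833..., 0.909...)
```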
The challenge with evaluating summaries intrinsically is that there is not normally a single
best summary for a given source document, as illustrated by the low κ scores between human
annotators. Given the same input, human judges will often exhibit low agreement in the units they
select [Mani, 2001b, Mani et al., 1999]. In early work on automatic text summarization, Rath et al. [1961] showed that even a single judge who summarizes a document once and then summarizes it
again several weeks later will often create two very different summaries (in that specific case, judges
could only remember which sentences they had previously selected 42.5% of the time). With many
annotation tasks, such as dialogue act labeling, one can expect high inter-annotator agreement, but summarization annotation is clearly a more difficult task. As Mani et al. [1999]
pointed out, there are similar problems regarding the evaluation of other NLP technologies that may
have more than one acceptable output, such as natural language generation and machine translation.
The metrics described above have various ways of addressing this challenge, relying generally on