relevance assessment, a person is presented with a description of a topic or event and then must
decide whether a given document (which could be a summary or a full text) is relevant to that topic
or event. Such evaluations have been used for a number of years and on a variety of projects [Dang,
2005, Jing et al., 1998, Mani et al., 1999]. Due to issues of low inter-annotator agreement on such
tasks, Dorr et al. [2005] proposed a new evaluation scheme that compares the relevance judgment
made by an annotator given the full text with the judgment made by the same annotator given a condensed text.
A second type of extrinsic evaluation for summarization is the reading comprehension
task [Hirschman et al., 1999, Mani, 2001b, Morris et al., 1992]. With a reading comprehension
task, a user is given either the full source or a summary text and is then given a multiple-choice test
relating to information from the full source. One can then compare how well users perform, in terms of
the quality of their answers and the time taken to produce them, when given only the summary
compared with the full source document. This evaluation framework relies on the assumption that
truly informative summaries should be able to act as substitutes for the full source document. This
does not hold true for certain classes of summaries such as query-based or indicative summaries (as
defined in Chapter 1), which are not intended to convey all of the important information of the
source document.
A decision audit task [Murray et al., 2009] has been proposed for meeting summarization,
and we argue that it could be applied to email summarization as well. In this task, a user must
determine which way a group decided on a particular issue and furthermore what the decision-
making process was. They are presented with the transcripts of the group's meetings as well as
summaries of each meeting and must find the relevant information pertaining to that decision in a
limited timeframe. They then write a synopsis of the decision-making process, which human judges evaluate for correctness. By instrumenting the meeting browser, one can
also inspect where the user clicked, how frequently they used the summary, whether they played the
audio, and so on. Murray et al. [2009] carried out a decision audit evaluation to compare extractive
and abstractive summaries and to assess the impact of ASR errors.
Not all extrinsic summarization evaluations involve using the summaries to aid a person per-
forming a task; the summaries could also be used to aid a system automatically performing a task.
For example, one might be able to improve the precision or recall of a document classification
system by first generating summaries of the documents in the collection. We could then evaluate
the summarizer by running the document classifier with and without the summarization compo-
nent [Mihalcea and Hassan, 2005].
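To make this concrete, below is a minimal sketch of such a classifier-based extrinsic evaluation, assuming scikit-learn is available and using a crude lead-sentence baseline as a stand-in for a real summarizer. The function names (lead_summary, classification_score), the choice of corpus, and all parameter settings are illustrative assumptions; the sketch shows the general with-and-without-summarization comparison, not the exact setup of Mihalcea and Hassan [2005].

# Sketch of a classifier-based extrinsic evaluation: train and test a topic
# classifier on full documents and on automatically produced summaries, then
# compare the downstream classification scores.

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline


def lead_summary(text: str, n_sentences: int = 3) -> str:
    # Crude lead baseline standing in for a real summarizer:
    # keep only the first n sentences of the document.
    sentences = [s.strip() for s in text.replace("\n", " ").split(". ") if s.strip()]
    return ". ".join(sentences[:n_sentences])


def classification_score(documents, labels) -> float:
    # Mean 5-fold macro-F1 of a TF-IDF + logistic regression classifier.
    clf = make_pipeline(TfidfVectorizer(max_features=20000),
                        LogisticRegression(max_iter=1000))
    return cross_val_score(clf, documents, labels, cv=5, scoring="f1_macro").mean()


if __name__ == "__main__":
    data = fetch_20newsgroups(subset="train",
                              categories=["sci.space", "rec.autos", "talk.politics.misc"],
                              remove=("headers", "footers", "quotes"))
    full_docs, labels = data.data, data.target
    summaries = [lead_summary(doc) for doc in full_docs]

    print("macro-F1 on full documents:", round(classification_score(full_docs, labels), 3))
    print("macro-F1 on summaries:     ", round(classification_score(summaries, labels), 3))

Under this view, the closer the score on summaries stays to the score on full documents, the better the summarizer has preserved the information the downstream task needs.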
2.3.3 LINGUISTIC QUALITY EVALUATION
A final type of evaluation we will discuss is the evaluation of readability or linguistic quality. This entails
scoring the summaries according to fluency, coherence, grammaticality, or general readability. It is
possible for a summary to be very relevant and informative but to be nearly unreadable to a user, and
intrinsic measures such as ROUGE and Pyramid cannot capture that distinction. Typically, we must
enlist actual users to make such readability judgments. Linguistic quality assessments are standard