relevance assessment, a person is presented with a description of a topic or event and then must
decide whether a given document (which could be a summary or a full text) is relevant to that topic
or event. Such evaluations have been used for a number of years and on a variety of projects [Dang,
2005, Jing et al., 1998, Mani et al., 1999]. Due to issues of low inter-annotator agreement on such
tasks, Dorr et al. [2005] proposed a new evaluation scheme that compares the relevance judgment
made by an annotator given the full text with the judgment made by the same annotator given a condensed text.
A second type of extrinsic evaluation for summarization is the reading comprehension
task [Hirschman et al., 1999, Mani, 2001b, Morris et al., 1992]. With a reading comprehension
task, a user is given either the full source or a summary text and is then given a multiple-choice test
relating to information from the full source. One can then compare how well users perform, in terms of
the quality of their answers and the time taken to produce them, when given only the summary
compared with the full source document. This evaluation framework relies on the assumption that
truly informative summaries should be able to act as substitutes for the full source document. This
does not hold true for certain classes of summaries such as query-based or indicative summaries (as
defined in Chapter 1), which are not intended to convey all of the important information of the
source document.
A decision audit task [Murray et al., 2009] has been proposed for meeting summarization,
and we argue that it could be applied to email summarization as well. In this task, a user must
determine which way a group decided on a particular issue and furthermore what the decision-
making process was. They are presented with the transcripts of the group's meetings as well as
summaries of each meeting and must find the relevant information pertaining to that decision in a
limited timeframe. They then write a synopsis of the decision-making process, which human judges evaluate for correctness. By instrumenting the meeting browser, one can
also inspect where the user clicked, how frequently they used the summary, whether they played the
audio, and so on. Murray et al. [2009] carried out a decision audit evaluation to compare extractive
and abstractive summaries and to assess the impact of ASR errors.
Not all extrinsic summarization evaluations involve using the summaries to aid a person per-
forming a task; the summaries could also be used to aid a system automatically performing a task.
For example, one might be able to improve the precision or recall of a document classification
system by first generating summaries of the documents in the collection. We could then evaluate
the summarizer by running the document classifier with and without the summarization compo-
nent [Mihalcea and Hassan, 2005].
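To make this concrete, below is a minimal sketch of such a classifier-based extrinsic evaluation, assuming scikit-learn is available and using a crude lead-sentence baseline as a stand-in for a real summarizer. The function names (lead_summary, classification_score), the choice of corpus, and all parameter settings are illustrative assumptions; the sketch shows the general with-and-without-summarization comparison, not the exact setup of Mihalcea and Hassan [2005].

# Sketch of a classifier-based extrinsic evaluation: train and test a topic
# classifier on full documents and on automatically produced summaries, then
# compare the downstream classification scores.

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline


def lead_summary(text: str, n_sentences: int = 3) -> str:
    # Crude lead baseline standing in for a real summarizer:
    # keep only the first n sentences of the document.
    sentences = [s.strip() for s in text.replace("\n", " ").split(". ") if s.strip()]
    return ". ".join(sentences[:n_sentences])


def classification_score(documents, labels) -> float:
    # Mean 5-fold macro-F1 of a TF-IDF + logistic regression classifier.
    clf = make_pipeline(TfidfVectorizer(max_features=20000),
                        LogisticRegression(max_iter=1000))
    return cross_val_score(clf, documents, labels, cv=5, scoring="f1_macro").mean()


if __name__ == "__main__":
    data = fetch_20newsgroups(subset="train",
                              categories=["sci.space", "rec.autos", "talk.politics.misc"],
                              remove=("headers", "footers", "quotes"))
    full_docs, labels = data.data, data.target
    summaries = [lead_summary(doc) for doc in full_docs]

    print("macro-F1 on full documents:", round(classification_score(full_docs, labels), 3))
    print("macro-F1 on summaries:     ", round(classification_score(summaries, labels), 3))

Under this view, the closer the score on summaries stays to the score on full documents, the better the summarizer has preserved the information the downstream task needs.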
2.3.3 LINGUISTIC QUALITY EVALUATION
A final type of evaluation we will discuss is the evaluation of readability or linguistic quality. This entails
scoring the summaries according to fluency, coherence, grammaticality, or general readability. It is
possible for a summary to be very relevant and informative but to be nearly unreadable to a user, and
intrinsic measures such as ROUGE and Pyramid cannot capture that distinction. Typically, we must
enlist actual users to make such readability judgments. Linguistic quality assessments are standard