Intrinsic evaluations measure the information content of a generated summary, typically by
comparing it with human gold-standard summaries. These types of evaluations are concerned with
whether the candidate summary contains the most important information from the source document.
Many of the intrinsic evaluation schemes we will introduce are automated metrics, and as such
it is important to confirm that they correlate with human judgments. A major reason why the
summarization community has been slow to adopt “official” evaluation metrics (compared with,
say, the machine translation community) stems precisely from conflicting results regarding such
correlations in different domains. Liu and Liu [2010] is a recent example of work trying to measure
the usefulness of a popular intrinsic evaluation software package (ROUGE, described in Chapter 2)
on noisy conversational data.
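To make the idea of automated intrinsic evaluation concrete, the following is a minimal sketch of ROUGE-1 recall, i.e., the fraction of unigrams in a human gold-standard summary that also appear in the candidate summary. This is a simplified illustration, not the official ROUGE implementation, which additionally supports stemming, stopword removal, longer n-grams, and multiple reference summaries.

```python
from collections import Counter

def rouge_1_recall(candidate: str, reference: str) -> float:
    """Fraction of reference unigrams covered by the candidate,
    with counts clipped so a repeated candidate word cannot
    match more reference occurrences than actually exist."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum(min(cand[word], count) for word, count in ref.items())
    return overlap / sum(ref.values()) if ref else 0.0

# Toy example: 4 of the 6 reference unigrams appear in the candidate.
print(rouge_1_recall("the cat sat on the mat",
                     "the cat lay on a mat"))  # → 0.666...
```

Validating such a metric then amounts to checking that these scores correlate with human judgments of summary quality on the domain of interest, which, as noted above, has proven inconsistent across domains.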
Extrinsic evaluations, on the other hand, measure the usefulness of a summary in aiding
some real-world task, such as document classification or reading comprehension. The motivation
for conducting extrinsic evaluations is that summaries are generated for some purpose, and we should
directly evaluate how well they serve that purpose, rather than simply comparing them with other
summaries. However, extrinsic evaluations are typically user studies, which demand a great deal of
human effort in design, recruitment, experimentation, and analysis. It is therefore common to employ
intrinsic evaluations regularly to speed research and development, while carrying out extrinsic
evaluations occasionally to assess major development milestones.
1.5 TOPIC PREVIEW
In Chapter 2, we describe popular conversation corpora for summarization and mining research, in-
cluding descriptions of the relevant annotations. We also describe in detail the widely used evaluation
metrics for both text mining in general and automatic summarization in particular.
In Chapter 3, we introduce mining tasks and methods for conversational data. This includes
topic segmentation and labeling, subjectivity and sentiment detection, dialogue act detection, ex-
traction of conversation structure, and detection of decisions and action items.
In Chapter 4, we first give a general characterization of the architecture of summarization
systems, then describe how summarizers have been designed for particular conversation modalities.
We also describe attempts at developing summarizers for conversations across modalities, and give
a detailed case study of an abstractive, multi-modal conversation summarizer.
In Chapter 5, we review our discussion and lay out suggestions for future work in the promising
and still largely unexplored corners of the mining and summarization research space.
Assumptions about Our Readers We have tried to make this book accessible by providing sufficient
background on each topic, and think that it should be suitable for a graduate student who may
have a background in computer science or linguistics but only minimal exposure to NLP. However,
due to space limitations, we do assume that our readers are at least somewhat familiar with several
topics, including basic probability and machine learning. In Section 1.3, we have provided pointers