CHAPTER 2
Background: Corpora and Evaluation Methods
In this chapter we describe some of the conversation datasets that are widely used for summarization and text mining research. In NLP, large collections of documents, possibly with annotations, are called corpora (sing. corpus), and we will use this terminology. We characterize the raw data as well as the available annotations. Most of the techniques presented in this book rely on machine learning methods that need to be trained and tested using such corpora. We then detail the evaluation metrics that are commonly used for summarization and text mining tasks.
2.1 CORPORA AND ANNOTATIONS
In this section, we introduce two meeting corpora and two email corpora, all of which are freely
available. We describe the annotations (or codings) that are most relevant and useful for summarization and text mining. When we say that a corpus has been annotated or coded for a particular task such as summarization, we mean that human judges have manually labeled the data for the phenomena relevant to that task. For summarization, this typically means identifying the most important sentences and writing a high-level abstract summary of the document, but we will describe such annotation schemes in detail momentarily.
At points we refer to the κ statistic for a given set of annotations, which measures agreement between multiple annotators, factoring in the probability of chance agreement [Carletta, 1996]. More
precisely, κ is used to measure agreement between each pair of annotators where the annotators are
making category judgments. In the case of extractive summarization, for example, the category
judgment is whether or not each sentence should be extracted. In the case of opinion mining, to take another example, the judgment is whether the sentence has a positive, negative, or neutral
polarity.
Given two sets of codings representing the category judgments of two annotators, κ is calculated as

κ = (P(A) − P(E)) / (1 − P(E)),
where P(A) is the proportion of times the annotators agree with one another and P(E) is the
proportion of agreement that we would expect based purely on chance. When multiple coders are carrying out annotations on the same data, we expect some baseline level of agreement just by chance.
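As a concrete illustration of the formula, the short Python sketch below computes κ for two annotators' binary extract/don't-extract codings of the same ten sentences. The labels and the function name cohens_kappa are hypothetical, and P(E) is estimated here from each annotator's individual label distribution (the Cohen formulation); other definitions of chance agreement are also in use.

```python
# A minimal sketch, assuming two annotators have each made a binary
# extract / don't-extract judgment on the same set of sentences.
# The labels below are invented purely for illustration.
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Return kappa = (P(A) - P(E)) / (1 - P(E)) for two label sequences."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)

    # P(A): observed proportion of items on which the annotators agree.
    p_agree = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # P(E): chance agreement, estimated from each annotator's label distribution.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    p_chance = sum(
        (counts_a[c] / n) * (counts_b[c] / n)
        for c in set(counts_a) | set(counts_b)
    )

    return (p_agree - p_chance) / (1 - p_chance)

# Hypothetical extractive-summarization codings: 1 = extract, 0 = do not extract.
annotator_1 = [1, 0, 0, 1, 1, 0, 0, 0, 1, 0]
annotator_2 = [1, 0, 1, 1, 0, 0, 0, 0, 1, 0]
print(round(cohens_kappa(annotator_1, annotator_2), 3))  # 0.583
```

In this toy example the annotators agree on 8 of 10 sentences (P(A) = 0.8), while their label distributions alone would produce agreement 0.52 by chance, giving κ = (0.8 − 0.52)/(1 − 0.52) ≈ 0.58.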