Summarizing Text Conversations - Methods for Mining and Summarizing Text Conversations

conversation is a document). However, if the document collection is a collection of other emails

(such as the contents of Mr. Skilling's inbox) and those emails often discuss telephone calls, the idf

score may be quite low and the term will not be weighted highly.

It should be clear why term-weighting is relevant to summarization. The goal of

summarization is to identify the most important information in a document, and term-weighting is a useful tool

for identifying important or significant words in a document. One of the simplest summarization

approaches, then, would be to extract the sentences with the highest tf.idf scores (e.g., by summing

or averaging over each sentence). Indeed, this can be a surprisingly decent baseline summarizer in

some cases. But in the following section, we will see that there exist much more advanced methods

of measuring informativeness, and many useful feature types beyond the term-weights described

above.
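
The baseline described above can be sketched in a few lines of Python. This is a minimal illustration and not a reference implementation: it assumes tf is the raw count of a term within the document, idf is log(N / df), and tokenization is a crude regular expression. The function names and the toy email collection are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep runs of letters (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def idf_scores(collection):
    """Assumed form: idf(t) = log(N / df(t)), df = number of documents containing t."""
    n_docs = len(collection)
    df = Counter()
    for doc in collection:
        df.update(set(tokenize(doc)))
    return {term: math.log(n_docs / count) for term, count in df.items()}


def rank_sentences(sentences, idf, average=True):
    """Score each sentence of one document by its summed or averaged tf.idf."""
    tf = Counter(tok for s in sentences for tok in tokenize(s))
    scored = []
    for sentence in sentences:
        tokens = tokenize(sentence)
        total = sum(tf[t] * idf.get(t, 0.0) for t in tokens)
        score = total / len(tokens) if (average and tokens) else total
        scored.append((score, sentence))
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Toy data (hypothetical): one email to summarize plus two other emails.
sentences = [
    "We should schedule a telephone call about the budget.",
    "The budget is due Friday.",
]
collection = [
    " ".join(sentences),
    "Thanks for the telephone call yesterday.",
    "Please return my telephone call when you can.",
]

idf = idf_scores(collection)
by_average = rank_sentences(sentences, idf, average=True)
by_sum = rank_sentences(sentences, idf, average=False)
```

In this toy collection "telephone" occurs in every email, so its idf is zero and it contributes nothing to any sentence score. The choice between summing and averaging also matters: summing favors the longer first sentence, while averaging ranks the shorter second sentence highest.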

Each of the mining techniques described in Chapter 3 can be considered a potential input

to a summarization system. For example, a system may depend on having fine-grained sentiment

analysis or decision detection. Many summarizers utilize topic detection or clustering modules. For

particular conversation domains such as meetings and emails, summarization systems may make

assumptions about the data and metadata that are available to the summarizer, and we will discuss

these in each subsection.

All conversation summarization systems share the simple assumption that the input is a

multi-party exchange featuring turn-taking and interactions. Indeed, these characteristics define

conversation itself and are the common link between meetings, emails, blogs and discussion forums,

and set these domains apart from lectures, broadcast news and articles, all of which feature little

or no conversation. Beyond those common characteristics of turn-taking and interaction,

conversations can differ widely in terms of number of participants, goal-directedness, synchronicity, etc. A

summarization system designed for a particular domain such as meetings might make assumptions

about the nature of a conversation in that domain, e.g., that it has a definite beginning and end and

that conversation participants have specific roles, and these assumptions may not be true of other

conversation domains such as blog comments or discussion forums.

In terms of inputs, conversation summarization systems diverge according to how the

conversation is documented. For meetings, there may be a rich, multi-modal corpus of data including

transcripts, audio, video, and notes. Email threads contain the email text in addition to metadata

from the email header, and possibly attached documents. Blogs contain posts, comments and links

to other webpages. Some summarization systems might eschew the multi-modal data and process

only the available text of the conversation itself.
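
One way to picture these differing inputs is a shared turn record with optional, domain-specific metadata. The sketch below is only an assumption about how such inputs might be represented; the class, the function, and every metadata key are hypothetical and do not come from any particular system.

```python
from dataclasses import dataclass, field


@dataclass
class Turn:
    """One contribution to a conversation (an utterance, an email, or a comment)."""
    speaker: str
    text: str
    # Optional domain-specific extras, e.g. email headers or attachment names.
    # All keys used below are hypothetical examples.
    metadata: dict = field(default_factory=dict)


def text_only(turns):
    """Discard metadata and keep only (speaker, text) pairs."""
    return [(turn.speaker, turn.text) for turn in turns]


email_thread = [
    Turn("alice", "Can we move the review to Monday?",
         {"subject": "Review", "attachments": ["agenda.pdf"]}),
    Turn("bob", "Monday works for me.", {"subject": "Re: Review"}),
]

plain = text_only(email_thread)
```

A summarizer that processes only the text would consume something like `plain`, while a domain-specific system could also read the metadata.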

In the discussion of each domain below, we will describe what assumptions various

summarizers embody when applied to that domain. We will also see that summarization systems that are

designed to work on conversations across different domains must make comparatively few

assumptions about the nature and structure of their inputs.
