conversation is a document). However, if the document collection is a collection of other emails
(such as the contents of Mr. Skilling's inbox) and those emails often discuss telephone calls, the idf
score may be quite low and the term will not be weighted highly.
It should be clear why term-weighting is relevant to summarization. The goal of summarization
is to identify the most important information in a document, and term-weighting is a useful tool
for identifying important or significant words in a document. One of the simplest summarization
approaches, then, would be to extract the sentences with the highest tf.idf scores (e.g., by summing
or averaging over each sentence). Indeed, this can be a surprisingly decent baseline summarizer in
some cases. But in the following section, we will see that there exist much more advanced methods
of measuring informativeness, and many useful feature types beyond the term-weights described
above.
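To make the baseline concrete, here is a minimal sketch of a tf.idf sentence extractor. It rests on several assumptions that are ours rather than part of the discussion above: a crude regular-expression tokenizer, idf computed as log(N/df) over a collection that includes the document being summarized, tf counted over the whole document, and sentences scored by averaging rather than summing. The function names (tokenize, idf_scores, summarize) are purely illustrative.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep runs of letters/apostrophes (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def idf_scores(collection):
    """Assumed idf variant: idf(t) = log(N / df(t)), df(t) = number of documents containing t."""
    n_docs = len(collection)
    df = Counter()
    for doc in collection:
        df.update(set(tokenize(doc)))
    return {term: math.log(n_docs / count) for term, count in df.items()}


def summarize(sentences, collection, k=1):
    """Return the k sentences with the highest average tf.idf, in document order."""
    idf = idf_scores(collection)
    # Term frequency is counted over the whole document being summarized.
    tf = Counter(tok for sent in sentences for tok in tokenize(sent))
    scored = []
    for index, sent in enumerate(sentences):
        tokens = tokenize(sent)
        if not tokens:
            continue
        # Terms missing from the collection get idf 0.0 here (an assumption of this sketch).
        average = sum(tf[t] * idf.get(t, 0.0) for t in tokens) / len(tokens)
        scored.append((average, index))
    # Highest score first; ties broken by earlier position in the document.
    top = sorted(scored, key=lambda pair: (-pair[0], pair[1]))[:k]
    return [sentences[index] for _, index in sorted(top, key=lambda pair: pair[1])]
```

Averaging keeps long sentences from winning on length alone, whereas summing favors them; which is preferable depends on the data. Note also that a term appearing in every document of the collection, such as "telephone" in an inbox full of emails about telephone calls, receives an idf of zero under this variant and so contributes nothing to a sentence's score.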
Each of the mining techniques described in Chapter 3 can be considered a potential input
to a summarization system. For example, a system may depend on having fine-grained sentiment
analysis or decision detection. Many summarizers utilize topic detection or clustering modules. In
particular conversation domains, such as meetings and emails, summarization systems may make
assumptions about the data and metadata that are available to the summarizer, and we will discuss
these in each subsection.
All conversation summarization systems share the simple assumption that the input is a
multi-party exchange featuring turn-taking and interactions. Indeed, these characteristics define
conversation itself and are the common link between meetings, emails, blogs and discussion forums,
and set these domains apart from lectures, broadcast news and articles, all of which feature little
or no conversation. Beyond those common characteristics of turn-taking and interaction, conversations
can differ widely in terms of number of participants, goal-directedness, synchronicity, etc. A
summarization system designed for a particular domain such as meetings might make assumptions
about the nature of a conversation in that domain, e.g., that it has a definite beginning and end and
that conversation participants have specific roles, and these assumptions may not be true of other
conversation domains such as blog comments or discussion forums.
In terms of inputs, conversation summarization systems diverge according to how the conversation
is documented. For meetings, there may be a rich, multi-modal corpus of data including
transcripts, audio, video, and notes. Email threads contain the email text in addition to metadata
from the email header, and possibly attached documents. Blogs contain posts, comments and links
to other webpages. Some summarization systems might eschew the multi-modal data and process
only the available text of the conversation itself.
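One way to picture these differing inputs is as a single record type in which only the participant and the text are required, and everything domain-specific is optional. The sketch below is hypothetical: the Turn class, its field names, and the keys stored in extras are our own illustrative choices, not a representation used by any particular system.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Turn:
    """One contribution to a conversation: an utterance, an email, or a comment."""
    participant: str
    text: str
    # Domain-specific extras (hypothetical keys): email headers, audio paths, links.
    extras: Dict[str, Any] = field(default_factory=dict)


def text_only(conversation: List[Turn]) -> List[str]:
    """Discard every non-text field, as a text-only summarizer would."""
    return [turn.text for turn in conversation]
```

For example, an email thread might be stored as `[Turn("alice", "Can we meet Friday?", {"subject": "Meeting"}), Turn("bob", "Friday works.")]`, and `text_only` would hand a summarizer just the two message bodies.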
In the discussion of each domain below, we will describe what assumptions various summarizers
embody when applied to that domain. We will also see that summarization systems that are
designed to work on conversations across different domains must make comparatively few
assumptions about the nature and structure of their inputs.