One often-mentioned limitation of LDA is its inability to choose the optimal number of
topics. Possible approaches to this problem are discussed in Blei and Lafferty [2009].
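A common practical workaround is to fit LDA for several candidate numbers of topics and keep the model with the best held-out fit. The following is a minimal sketch of that idea using scikit-learn's LDA implementation and its perplexity score (lower is better); the toy documents and candidate values are illustrative, not from the text.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy corpus: two rough themes (rescue operations vs. reconstruction costs).
docs = ["flood water rescue", "flood rescue boats",
        "budget cost funds", "cost funds rebuild"]
X = CountVectorizer().fit_transform(docs)

def pick_num_topics(X, candidates):
    """Fit one LDA model per candidate K and return the K with lowest perplexity."""
    scores = {}
    for k in candidates:
        lda = LatentDirichletAllocation(n_components=k, random_state=0).fit(X)
        scores[k] = lda.perplexity(X)  # ideally computed on held-out documents
    return min(scores, key=scores.get)

best_k = pick_num_topics(X, [2, 3, 4])
```

In a real setting the perplexity would be computed on documents not used for fitting; scoring on the training set, as above, tends to favor larger K.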
Supervised Classification: TextTiling and LDA are examples of unsupervised techniques,
since they do not need to be trained on a corpus annotated with segmentation and topics. However,
if such a corpus is available for a specific domain (e.g., news articles, emails), supervised machine
learning approaches can be effectively applied to the topic modeling task.
For instance, text segmentation can be framed as a binary classification task in which,
given any two adjacent blocks of sentences, the classifier predicts whether the gap between them
is a segment boundary or not. Several features of the two sentence blocks have been considered in the
literature, including word overlap and cosine word similarity (see Chapter 4) between the blocks,
whether the terms in the two blocks refer to the same entities (e.g., “Ciro” and “He” in sentences
1 and 2 at the beginning of this section), and the presence of discourse markers (also called cue
words) at the beginning of the second block. Discourse markers are specific words or phrases such
as “Well” and “Let's” that strongly correlate with the start of a new segment. Finally, as is often the
case in machine learning, the output of unsupervised techniques (e.g., the estimates of LDA) can
be added to the feature set.
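To make the feature set concrete, here is a minimal sketch of the gap features just listed (word overlap, cosine similarity, and a cue-word indicator) together with a toy decision rule standing in for a trained classifier. The cue-word list and the similarity threshold are illustrative assumptions, not values from the text.

```python
from collections import Counter
import math

# Illustrative discourse-marker list; a real system would use a curated lexicon.
CUE_WORDS = {"well", "now", "okay", "so", "let's"}

def block_features(block_a, block_b):
    """Compute gap features between two adjacent sentence blocks (token lists)."""
    ca = Counter(w.lower() for w in block_a)
    cb = Counter(w.lower() for w in block_b)
    overlap = len(set(ca) & set(cb))                      # shared word types
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = (math.sqrt(sum(v * v for v in ca.values()))
            * math.sqrt(sum(v * v for v in cb.values())))
    cosine = dot / norm if norm else 0.0                  # cosine word similarity
    cue = 1 if block_b and block_b[0].lower() in CUE_WORDS else 0
    return {"overlap": overlap, "cosine": cosine, "cue_word": cue}

def is_boundary(feats, cos_threshold=0.2):
    """Toy rule in place of a trained binary classifier: low cohesion or a cue
    word at the start of the second block suggests a segment boundary."""
    return feats["cue_word"] == 1 or feats["cosine"] < cos_threshold
```

In practice these features would be fed to a learned classifier (e.g., logistic regression) rather than a hand-set threshold.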
Topic labeling can also be framed as a classification task. Given a corpus in which each
segment is labeled with its corresponding topic, a classifier can be trained to predict the topic of
any given segment. All kinds of text features can be used, including lexical and syntactic ones. In this
case, the classification task is a multi-class one, with a class for each topic covered in the corpus. For
instance, in a corpus of documents about “natural disasters,” a multi-class classifier could be built
to identify segments about “effects on the population,” “effects on the infrastructure,” “plan/cost of
reconstruction,” etc.
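A minimal sketch of such a multi-class topic labeler, using a multinomial Naive Bayes over bag-of-words features with add-one smoothing; the training snippets and topic labels below echo the "natural disasters" example but are invented for illustration.

```python
from collections import Counter, defaultdict
import math

class NaiveBayesTopicLabeler:
    """Multinomial Naive Bayes over bag-of-words features, one class per topic."""

    def fit(self, segments, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for seg, lab in zip(segments, labels):
            for w in seg.lower().split():
                self.word_counts[lab][w] += 1
                self.vocab.add(w)
        return self

    def predict(self, segment):
        best, best_lp = None, float("-inf")
        total = sum(self.class_counts.values())
        for lab, n in self.class_counts.items():
            lp = math.log(n / total)  # log prior
            denom = sum(self.word_counts[lab].values()) + len(self.vocab)
            for w in segment.lower().split():
                # add-one (Laplace) smoothed log likelihood
                lp += math.log((self.word_counts[lab][w] + 1) / denom)
            if lp > best_lp:
                best, best_lp = lab, lp
        return best
```

A production system would use richer lexical and syntactic features, but the multi-class structure, one class per topic, is the same.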
Classification (binary or multi-class) is just the simplest way to turn topic modeling into a
supervised problem. Since the task essentially involves labeling a sequence of gaps (or sentences),
more sophisticated supervised sequence labeling techniques can be applied (e.g., Hidden Markov
Models (HMM), Conditional Random Fields (CRF) [Poole and Mackworth, 2010]).
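To illustrate the sequence-labeling view, here is a minimal Viterbi decoder for an HMM over gap labels, where each gap between sentence blocks is tagged boundary ("B") or continuation ("C") based on a discretized cohesion observation ("low" or "high" similarity). The state names, observation alphabet, and probabilities are illustrative assumptions, not parameters from the text.

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most likely HMM state sequence for obs (log-space Viterbi)."""
    # Initialization with start probabilities and first emission.
    V = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    back = []
    for o in obs[1:]:
        row, ptr = {}, {}
        for s in states:
            # Best predecessor state for s at this step.
            prev, lp = max(((p, V[-1][p] + math.log(trans_p[p][s]))
                            for p in states), key=lambda x: x[1])
            row[s] = lp + math.log(emit_p[s][o])
            ptr[s] = prev
        V.append(row)
        back.append(ptr)
    # Backtrack from the best final state.
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))
```

Unlike the independent per-gap classifier, the transition probabilities let the model penalize implausible label sequences, e.g., two boundaries in a row.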
All the basic approaches to topic modeling of generic text are summarized in Figure 3.4. As
we have seen, topic segmentation can be performed in an unsupervised way by either considering
the cohesion between segments (TextTiling), or by learning a probabilistic generative model for
the target documents (LDA). LDA, in particular, can also be used for the topic labeling task. For
supervised approaches to topic modeling (second column in Figure 3.4), binary and multi-class
classification methods, as well as sequence labeling ones, can be effectively applied. In the next
section, we will discuss how all the approaches summarized in Figure 3.4 have been extended and
sometimes combined to perform topic modeling of conversations.
3.2.2 TOPIC MODELING OF CONVERSATIONS
Most previous work on topic modeling of multi-party conversations has focused on meeting
transcripts. Only very recently have researchers started to work with emails and blogs for this task.