One often-mentioned limitation of LDA is its inability to choose the optimal number of
topics. Possible approaches to this problem are discussed in Blei and Lafferty [2009].
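A common practical workaround is to fit LDA for several candidate numbers of topics and keep the model with the best held-out fit. The following is a minimal sketch of that idea using scikit-learn's LDA implementation and its perplexity score (lower is better); the toy documents and candidate values are illustrative, not from the text.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy corpus: two rough themes (rescue operations vs. reconstruction costs).
docs = ["flood water rescue", "flood rescue boats",
        "budget cost funds", "cost funds rebuild"]
X = CountVectorizer().fit_transform(docs)

def pick_num_topics(X, candidates):
    """Fit one LDA model per candidate K and return the K with lowest perplexity."""
    scores = {}
    for k in candidates:
        lda = LatentDirichletAllocation(n_components=k, random_state=0).fit(X)
        scores[k] = lda.perplexity(X)  # ideally computed on held-out documents
    return min(scores, key=scores.get)

best_k = pick_num_topics(X, [2, 3, 4])
```

In a real setting the perplexity would be computed on documents not used for fitting; scoring on the training set, as above, tends to favor larger K.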
Supervised Classification: TextTiling and LDA are examples of unsupervised techniques,
since they do not need to be trained on a corpus annotated with segmentation and topics. However,
if such a corpus is available for a specific domain (e.g., news articles, emails), supervised machine
learning approaches can be effectively applied to the topic modeling task.
For instance, text segmentation can be framed as a binary classification task in which,
given any two adjacent blocks of sentences, the classifier predicts whether the gap between them
is a segment boundary or not. Several features of the two sentence blocks have been considered in the
literature, including word overlap and cosine word similarity (see Chapter 4) between the blocks,
whether the terms in the two blocks refer to the same entities (e.g., “Ciro” and “He” in sentences
1 and 2 at the beginning of this section), and the presence of discourse markers (also called cue
words) at the beginning of the second block. Discourse markers are specific words or phrases such
as “Well” and “Let's” that strongly correlate with the start of a new segment. Finally, as is often the
case in machine learning, the output of unsupervised techniques (e.g., the estimates of LDA) can
be added to the feature set.
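To make the feature set concrete, here is a minimal sketch of the gap features just listed (word overlap, cosine similarity, and a cue-word indicator) together with a toy decision rule standing in for a trained classifier. The cue-word list and the similarity threshold are illustrative assumptions, not values from the text.

```python
from collections import Counter
import math

# Illustrative discourse-marker list; a real system would use a curated lexicon.
CUE_WORDS = {"well", "now", "okay", "so", "let's"}

def block_features(block_a, block_b):
    """Compute gap features between two adjacent sentence blocks (token lists)."""
    ca = Counter(w.lower() for w in block_a)
    cb = Counter(w.lower() for w in block_b)
    overlap = len(set(ca) & set(cb))                      # shared word types
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = (math.sqrt(sum(v * v for v in ca.values()))
            * math.sqrt(sum(v * v for v in cb.values())))
    cosine = dot / norm if norm else 0.0                  # cosine word similarity
    cue = 1 if block_b and block_b[0].lower() in CUE_WORDS else 0
    return {"overlap": overlap, "cosine": cosine, "cue_word": cue}

def is_boundary(feats, cos_threshold=0.2):
    """Toy rule in place of a trained binary classifier: low cohesion or a cue
    word at the start of the second block suggests a segment boundary."""
    return feats["cue_word"] == 1 or feats["cosine"] < cos_threshold
```

In practice these features would be fed to a learned classifier (e.g., logistic regression) rather than a hand-set threshold.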
Topic labeling can also be framed as a classification task. Given a corpus in which each
segment is labeled with its corresponding topic, a classifier can be trained to predict the topic of
any given segment. All kinds of text features can be used, including lexical and syntactic ones. In this
case, the classification task is a multi-class one, with a class for each topic covered in the corpus. For
instance, in a corpus of documents about “natural disasters,” a multi-class classifier could be built
to identify segments about “effects on the population,” “effects on the infrastructure,” “plan/cost of
reconstruction,” etc.
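A minimal sketch of such a multi-class topic labeler, using a multinomial Naive Bayes over bag-of-words features with add-one smoothing; the training snippets and topic labels below echo the "natural disasters" example but are invented for illustration.

```python
from collections import Counter, defaultdict
import math

class NaiveBayesTopicLabeler:
    """Multinomial Naive Bayes over bag-of-words features, one class per topic."""

    def fit(self, segments, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for seg, lab in zip(segments, labels):
            for w in seg.lower().split():
                self.word_counts[lab][w] += 1
                self.vocab.add(w)
        return self

    def predict(self, segment):
        best, best_lp = None, float("-inf")
        total = sum(self.class_counts.values())
        for lab, n in self.class_counts.items():
            lp = math.log(n / total)  # log prior
            denom = sum(self.word_counts[lab].values()) + len(self.vocab)
            for w in segment.lower().split():
                # add-one (Laplace) smoothed log likelihood
                lp += math.log((self.word_counts[lab][w] + 1) / denom)
            if lp > best_lp:
                best, best_lp = lab, lp
        return best
```

A production system would use richer lexical and syntactic features, but the multi-class structure, one class per topic, is the same.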
Classification (binary or multi-class) is just the simplest way to turn topic modeling into a
supervised problem. Since the task essentially involves labeling a sequence of gaps (or sentences),
more sophisticated supervised sequence labeling techniques can be applied (e.g., Hidden Markov
Models (HMM), Conditional Random Fields (CRF) [Poole and Mackworth, 2010]).
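To illustrate the sequence-labeling view, here is a minimal Viterbi decoder for an HMM over gap labels, where each gap between sentence blocks is tagged boundary ("B") or continuation ("C") based on a discretized cohesion observation ("low" or "high" similarity). The state names, observation alphabet, and probabilities are illustrative assumptions, not parameters from the text.

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most likely HMM state sequence for obs (log-space Viterbi)."""
    # Initialization with start probabilities and first emission.
    V = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    back = []
    for o in obs[1:]:
        row, ptr = {}, {}
        for s in states:
            # Best predecessor state for s at this step.
            prev, lp = max(((p, V[-1][p] + math.log(trans_p[p][s]))
                            for p in states), key=lambda x: x[1])
            row[s] = lp + math.log(emit_p[s][o])
            ptr[s] = prev
        V.append(row)
        back.append(ptr)
    # Backtrack from the best final state.
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))
```

Unlike the independent per-gap classifier, the transition probabilities let the model penalize implausible label sequences, e.g., two boundaries in a row.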
All the basic approaches to topic modeling of generic text are summarized in Figure 3.4. As
we have seen, topic segmentation can be performed in an unsupervised way by either considering
the cohesion between segments (TextTiling), or by learning a probabilistic generative model for
the target documents (LDA). LDA, in particular, can also be used for the topic labeling task. For
supervised approaches to topic modeling (second column in Figure 3.4), binary and multi-class
classification methods, as well as sequence labeling ones, can be effectively applied. In the next
section, we will discuss how all the approaches summarized in Figure 3.4 have been extended and
sometimes combined to perform topic modeling of conversations.
3.2.2 TOPIC MODELING OF CONVERSATIONS
Most previous work on topic modeling of multi-party conversations has focused on meeting
transcripts. Only very recently have researchers started to work with emails and blogs for this task.