1.4.1 MINING TEXT CONVERSATIONS
A small set of basic yet challenging questions can be asked about any text conversation: What topics are covered in the conversation? What opinions do participants express on those topics? What is the structure of the conversation or, more specifically, what is the intended function of each particular message
(or sentence) and its relationship to other contributions?
We can consider these questions in order.
Topic Modeling: Topic Segmentation and Topic Labeling Conversations often span different topics;
an initial email message, asking a team to explain low sales in Asia, can generate a thread on what
the best visualization tool is for a particular analysis task. Or, alternatively, the follow-up may be a
discussion of how the team may need to be reorganized.
Even a look at our short sample email conversation shows that it clearly covers at least two topics.
The conversation starts with a proposal for a vacation, but then one sub-thread (on the right of
Figure 1.4) veers off into a discussion of a problematic course assignment.
This example can help us to define the two basic subtasks of topic modeling: topic segmentation
and topic labeling. In topic segmentation, you are interested in identifying what portions of the
conversation are about the same topic, or equivalently, in detecting where in the conversation the
topic shifts are. For instance, in our sample conversation, there is a topic shift between the first
and the second (non-quoted) sentences in Email-1.2, and this shift splits the conversation into two
segments, i.e., the text below the shift in the right sub-thread vs. the rest of the conversation.
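To make the segmentation subtask concrete, here is a minimal sketch (not from the text) of a lexical-cohesion heuristic in the spirit of TextTiling: a topic shift is hypothesized wherever the bag-of-words similarity between adjacent messages drops below a threshold. The function names, toy messages, and threshold value are all illustrative assumptions.

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two bag-of-words vectors.
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def segment(messages, threshold=0.1):
    # Flag message indices where lexical cohesion with the
    # previous message drops below the threshold (a topic shift).
    bows = [Counter(m.lower().split()) for m in messages]
    return [i for i in range(1, len(bows))
            if cosine(bows[i - 1], bows[i]) < threshold]

msgs = [
    "let's plan the spring break trip to mexico",
    "mexico sounds great maybe skiing instead",
    "by the way the course assignment question makes no sense",
]
print(segment(msgs))  # → [2]: the shift to the assignment topic
```

Real segmenters smooth similarity over windows of sentences and pick shift points at local minima rather than using a fixed threshold, but the underlying cohesion signal is the same.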
Topic labeling, on the other hand, is about generating informative labels (typically sets of
words) for all the topics covered by a conversation. In our example, two informative (but still not
ideal) labels for the two identified topics might be “spring break Mexico skiing” and “assignment
question idea”.
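A correspondingly naive labeling heuristic simply ranks the non-stopword terms of a segment by frequency and keeps the top few as the label. The sketch below is an invented illustration, not a method from the text; the stopword list and names are hypothetical.

```python
from collections import Counter

# A tiny illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "a", "an", "is", "in", "on", "to", "for", "and", "of"}

def label_topic(segment_text, k=4):
    # Keep the k most frequent alphabetic, non-stopword tokens.
    words = [w for w in segment_text.lower().split()
             if w.isalpha() and w not in STOPWORDS]
    return [w for w, _ in Counter(words).most_common(k)]

text = "spring break in Mexico maybe skiing in Mexico spring break skiing"
print(label_topic(text))  # → ['spring', 'break', 'mexico', 'skiing']
```

Frequency alone tends to surface generic words; weighting terms by how specific they are to a segment (e.g., tf-idf against the other segments) usually yields more informative labels.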
A large number of topic modeling techniques have been developed for generic text (not neces-
sarily conversational in nature), including supervised and unsupervised machine learning methods,
as well as combinations of the two. Among all these proposals, a novel probabilistic approach
based on Latent Dirichlet Allocation (LDA) [Blei et al., 2003] appears to be the most effective and
influential (see Blei and Lafferty [2009] for a gentle introduction). In LDA, the generation of a
collection of documents is modeled as a stochastic process, and topic modeling consists of estimating
the parameters of the underlying probabilistic generative model.
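LDA's generative story can be sketched directly: for each document, draw a topic mixture from a Dirichlet prior; for each word position, draw a topic from that mixture and then a word from that topic's word distribution. The toy topics, vocabulary, and hyperparameters below are invented for illustration; actual topic modeling runs this process in reverse, estimating the hidden mixtures and topic-word distributions from observed documents.

```python
import random

def sample_dirichlet(alpha):
    # Draw from a Dirichlet by normalizing independent Gamma draws.
    xs = [random.gammavariate(a, 1.0) for a in alpha]
    total = sum(xs)
    return [x / total for x in xs]

def sample_categorical(probs, items):
    # Draw one item according to the given probabilities.
    r, acc = random.random(), 0.0
    for p, item in zip(probs, items):
        acc += p
        if r < acc:
            return item
    return items[-1]

# Two hypothetical topics, each a distribution over a toy vocabulary.
topics = [
    {"vacation": 0.5, "mexico": 0.3, "skiing": 0.2},        # topic 0
    {"assignment": 0.5, "question": 0.3, "deadline": 0.2},  # topic 1
]

def generate_document(n_words, alpha=(0.5, 0.5)):
    # Per-document topic mixture theta, then word-by-word generation.
    theta = sample_dirichlet(alpha)
    doc = []
    for _ in range(n_words):
        z = sample_categorical(theta, range(len(topics)))  # topic for this word
        words = list(topics[z])
        doc.append(sample_categorical([topics[z][w] for w in words], words))
    return doc

random.seed(0)
print(generate_document(8))
```

Each generated document mixes vacation and assignment words in proportions governed by its own theta, mirroring how a single email can span both topics of our example.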
In Chapter 3, we will discuss how topic modeling techniques developed for generic text can
be extended to deal with text conversations. For instance, we will show how variations of the LDA
framework have been successfully applied to meeting transcripts [Purver et al., 2006b], as well as to
Twitter [Ramage et al., 2010] and email conversations [Dredze et al., 2008].
Sentiment and Subjectivity (i.e., Opinion Mining) Conversations typically exhibit a large amount
of highly subjective content. Participants may agree or disagree with one another, argue for or against
various proposals, and generally take turns expressing their opinions and emotions. The task of
mining all this subjective content can be framed at different levels of granularity. At the highest level, you have