Mining Text Conversations - Methods for Mining and Summarizing Text Conversations

Figure 3.6 shows the graphical model for Labeled LDA, where represents the set of all

possible topics. Since we assume the existence of labels for each document, is observed for each

document (grayed in the Figure).

Figure 3.6: Graphical model for Labeled LDA. Additions to the standard LDA model are highlighted in black.

To apply Labeled LDA to Twitter, Ramage et al. first conducted a set of structured interviews to identify the basic dimensions people consider when deciding which posts to read or which users to follow on Twitter. They found four such dimensions (called the 4S): substance topics (about an entity or idea), social topics, where language is used toward a social end (e.g., making plans with friends), status topics conveying personal updates, and style topics (e.g., humor or wit). Then, through a rather complex semi-automated process, they labeled a large Twitter dataset with those four labels and ran Labeled LDA on it.
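To make the role of the observed labels concrete, the following Python sketch illustrates the restriction usually associated with Labeled LDA: a document may draw its topics only from the labels attached to it. This is an illustration under stated assumptions, not the procedure of Ramage et al.; the function name `sample_restricted_theta`, the symmetric prior `alpha`, and the example label set are all hypothetical.

```python
import random


def sample_restricted_theta(doc_labels, all_topics, alpha=1.0, rng=None):
    """Sample a topic distribution that is nonzero only on the document's labels.

    doc_labels: set of labels observed for the document (hypothetical input).
    all_topics: list of every possible topic/label.
    alpha:      symmetric Dirichlet parameter (an assumed prior).
    """
    rng = rng or random.Random()
    allowed = [t for t in all_topics if t in doc_labels]
    if not allowed:
        raise ValueError("document must carry at least one known label")
    # Independent Gamma draws, normalized to sum to one, give a Dirichlet sample.
    draws = {t: rng.gammavariate(alpha, 1.0) for t in allowed}
    total = sum(draws.values())
    # Topics outside the document's label set receive probability zero.
    return {t: draws.get(t, 0.0) / total for t in all_topics}


# Example: a tweet labeled only with the "social" and "status" dimensions.
topics = ["substance", "social", "status", "style"]
theta = sample_restricted_theta({"social", "status"}, topics, rng=random.Random(0))
```

In this sketch `theta` assigns all probability mass to `social` and `status`, which is what distinguishes the labeled setting from standard LDA, where every topic is available to every document.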

The output of this process is a topic model for Twitter conversations that is based on only four topics, namely, the 4S. This model can be applied to any set of tweets. Figure 3.7, from Ramage et al. [2010], shows how the tweets of two sample users can be visualized in the context of a 4S topic model (see footnote 4).

Ramage et al. also ran a user study indicating that the learned topic models would be effective in helping Twitter users identify the most valuable posts in their current feed, as well as which new users to follow.

Recently, there has also been work on applying topic modeling techniques to email conversations, with the limited goal of generating summary keywords for each email message. In an empirical comparison of different unsupervised approaches, LDA was shown to be the best performer [Dredze et al., 2008]. In essence, once a set of email messages has been modeled with LDA, the best keywords to describe an email are the ones with the highest probability given that email. The probability of each candidate keyword c, given an email e, can be formally computed as:

$$P(c \mid e) = \sum_{i=1}^{K} P(c \mid z_i)\, P(z_i \mid e)$$

where the sum runs over the K topics z_1, ..., z_K of the LDA model.
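As a rough illustration of this computation, the Python sketch below sums P(c | z_i) P(z_i | e) over the topics and ranks candidate keywords by the result. The function names and the toy probabilities are assumptions made for the example; in practice both distributions would come from a fitted LDA model, and this is not necessarily the exact procedure of Dredze et al.

```python
def keyword_probabilities(topic_given_email, word_given_topic):
    """Return P(c | e) for every candidate keyword c.

    topic_given_email: list where entry i is P(z_i | e).
    word_given_topic:  list of dicts where word_given_topic[i][c] is P(c | z_i).
    """
    scores = {}
    for p_topic, word_dist in zip(topic_given_email, word_given_topic):
        for word, p_word in word_dist.items():
            # Accumulate this topic's contribution P(c | z_i) * P(z_i | e).
            scores[word] = scores.get(word, 0.0) + p_word * p_topic
    return scores


def top_keywords(topic_given_email, word_given_topic, n=3):
    """Return the n keywords with the highest P(c | e), ties broken alphabetically."""
    scores = keyword_probabilities(topic_given_email, word_given_topic)
    return sorted(scores, key=lambda w: (-scores[w], w))[:n]


# Toy example with two topics (all numbers are made up for illustration).
p_topic_given_email = [0.75, 0.25]
p_word_given_topic = [
    {"meeting": 0.5, "agenda": 0.25, "budget": 0.25},
    {"budget": 0.5, "invoice": 0.5},
]
keywords = top_keywords(p_topic_given_email, p_word_given_topic, n=2)
```

Note that a word such as `budget`, which has moderate probability under both topics, can outrank a word that is prominent in only one minor topic, because the contributions of all topics are added together.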

4 Additional interactive examples can be explored at http://twahpic.cloudapp.net/
