Jeong et al. [ 2009 ] recently applied a semi-supervised learning method [ Bennett et al. , 2002 ]
to dialogue act labeling for both email and forum conversations. Their focus is on labeling at the
sentence level, in which each sentence is labeled with one of the twelve domain-independent tags
shown in Figure 3.2 . With respect to previous work (e.g., [ Shrestha and McKeown , 2004 ]), they use
a more sophisticated set of sentence features, which includes subtrees of the dependency tree of the
sentence [ Kübler et al. , 2009 ].
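The idea behind subtree features can be illustrated with a small sketch (a hypothetical, simplified representation; Jeong et al. extract richer subtrees from a full dependency parser): given a parse encoded as head-dependent arcs, each word together with its immediate dependents forms a depth-one subtree that can be emitted as a sparse feature string.

```python
# Sketch of depth-one dependency-subtree features. The parse encoding
# (one head index per token, -1 for the root) is an illustrative
# simplification, not Jeong et al.'s actual feature extractor.

def subtree_features(tokens, heads):
    """tokens: list of words; heads: head index per token (-1 = root).
    Returns one feature string 'head(dep1,dep2,...)' per head word."""
    children = {i: [] for i in range(len(tokens))}
    for i, h in enumerate(heads):
        if h >= 0:
            children[h].append(i)
    feats = []
    for i, deps in children.items():
        if deps:
            feats.append(f"{tokens[i]}({','.join(tokens[d] for d in deps)})")
    return feats

# "could you send it": 'send' is the root, the other words depend on it
print(subtree_features(["could", "you", "send", "it"], [2, 2, -1, 2]))
# → ['send(could,you,it)']
```

Such features let the classifier see, for example, that a modal verb governs the subject, which a flat bag of words cannot capture.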
Their semi-supervised approach is essentially an attempt to learn from a combination of labeled
transcripts of spoken conversations and unlabeled email and forum conversations. For training, they
used two large corpora of transcribed spoken conversations as labeled data, namely, a corpus of
phone conversations (the SWITCHBOARD corpus), along with a corpus of transcribed meetings
(the MRDA corpus). As unlabeled email data, they used a subset of 23,391 emails from the Enron
Corpus (see Chapter 2 ), while as unlabeled forum data they collected 11,602 threads and 55,743
posts from the TripAdvisor travel forum site.
For testing, they annotated all the emails in the BC3 Corpus (see Chapter 2 ) with dialogue acts,
as well as a small portion of the TripAdvisor posts.
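The general flavor of learning from labeled and unlabeled data together can be conveyed with a generic self-training loop (a deliberate simplification, not Jeong et al.'s actual algorithm): train a classifier on the labeled data, label the unlabeled pool with its confident predictions, and retrain on the enlarged set. The toy nearest-centroid classifier and threshold below are illustrative assumptions.

```python
# Generic self-training sketch (a simplification for illustration, not
# Jeong et al.'s method): repeatedly label the unlabeled pool and retrain.
from collections import Counter

def bow(text):
    return Counter(text.lower().split())

def similarity(a, b):
    # Unnormalized dot product between two bags of words.
    return sum(a[w] * b[w] for w in a)

def centroids(examples):
    """examples: list of (text, label). Returns label -> summed bag."""
    cents = {}
    for text, label in examples:
        cents.setdefault(label, Counter()).update(bow(text))
    return cents

def self_train(labeled, unlabeled, rounds=2):
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        cents = centroids(labeled)
        newly, rest = [], []
        for text in pool:
            scores = {lab: similarity(bow(text), c) for lab, c in cents.items()}
            best = max(scores, key=scores.get)
            if scores[best] > 0:          # crude confidence threshold
                newly.append((text, best))
            else:
                rest.append(text)
        labeled += newly
        pool = rest
    return centroids(labeled)

labeled = [("when does it open", "Question"), ("it opens at nine", "Answer")]
unlabeled = ["when does the museum open", "the museum opens at ten"]
model = self_train(labeled, unlabeled)
```

After self-training, the centroids have absorbed vocabulary (e.g., "museum") that appeared only in the unlabeled data, which is the effect the semi-supervised setting is after.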
Their experiments reveal several interesting findings. First, more sophisticated sentence fea-
tures are beneficial for dialogue act labeling. Second, the application of the semi-supervised method
was successful, as for both emails and forums the semi-supervised method outperforms a supervised
approach trained only on the SWITCHBOARD and MRDA corpora. Third, a closer
analysis of the results indicates that the semi-supervised method achieves larger improvements on
the less frequent dialogue acts, which suggests that it is more effective
when only a minimal amount of labeled data is available. Finally, in terms of differences between email
and forum conversations, forum data seem to be more challenging, possibly because anyone can post
on a forum and this entails more diversity in linguistic and communicative behaviors.
Even more recent work by Ritter et al. [ 2010 ] investigates a completely unsupervised
approach to dialogue act modeling, which could be easily applied across new forms of media and
new domains. The goal here is less ambitious than full dialogue act labeling. Instead of labeling
each utterance (or turn), they cluster together utterances (or turns) that play a similar conversational
function. The dialogue act label for each cluster would then be determined through other means,
which they do not explore in this work, but may include minimal supervision. Preliminary results on
micro-blog data (Twitter) indicate that a sequential HMM-like model can be effectively learned from
the data, and that such a model reveals interesting properties of the structure of Twitter conversations.
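The transition structure at the heart of such a model can be sketched with maximum-likelihood estimation over sequences of act labels (the labels and counts below are toy illustrations; Ritter et al. induce the states themselves without supervision).

```python
# Toy maximum-likelihood estimate of dialogue-act transition
# probabilities. The act labels here are illustrative only; in
# Ritter et al.'s model the states are learned, not given.
from collections import Counter, defaultdict

def transition_probs(conversations):
    """conversations: lists of act labels per conversation.
    Returns {prev_act: {next_act: probability}} by relative frequency."""
    counts = defaultdict(Counter)
    for acts in conversations:
        for prev, nxt in zip(acts, acts[1:]):
            counts[prev][nxt] += 1
    return {prev: {nxt: n / sum(c.values()) for nxt, n in c.items()}
            for prev, c in counts.items()}

convs = [["Status", "Comment"],
         ["Status", "Question", "Answer"],
         ["Question to Followers", "Answer"]]
probs = transition_probs(convs)
# e.g. probs["Status"] == {"Comment": 0.5, "Question": 0.5}
```

Visualizing these probabilities as a weighted graph over states is what produces figures like the one the authors report.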
For instance, when the states of the model are given some meaningful labels by a human annotator,
and transition probabilities of the HMM-like model are visualized as a graph (see Figure 3.11 ), it
becomes clear that Twitter conversations typically start in one of three ways: Status, Reference
Broadcast, or Question to Followers, where a Status dialogue act describes what the user is
doing, a Reference Broadcast act shares an interesting link or post, and a Question to Followers
is self-explanatory. As shown in Figure 3.11 , each of these acts can then be followed by different
combinations of other dialogue acts with different probabilities. For instance, Status can be followed