Mining Text Conversations - Methods for Mining and Summarizing Text Conversations


and negative classes. For example, using the simple bag-of-words (BOW) approach, sentences are represented as unordered collections of their unigrams, and the learning method determines which unigrams tend to occur with each class.
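
The BOW idea can be sketched in a few lines of Python. This is a minimal illustration only, not a method from the works cited here: the toy sentences, the class labels, and the function names (`bag_of_words`, `train`, `classify`) are all assumptions made for the example, and the scoring is a simple add-one-smoothed unigram likelihood that ignores class priors.

```python
from collections import Counter
import math

def bag_of_words(sentence):
    """Represent a sentence as an unordered multiset of lowercase unigrams."""
    return Counter(sentence.lower().split())

def train(labeled_sentences):
    """Count how often each unigram occurs with each class label."""
    counts = {}
    for sentence, label in labeled_sentences:
        counts.setdefault(label, Counter()).update(bag_of_words(sentence))
    return counts

def classify(sentence, counts):
    """Return the class whose unigram counts best explain the sentence."""
    vocab = set().union(*counts.values())
    scores = {}
    for label, unigram_counts in counts.items():
        # Add-one smoothing so unseen unigrams do not zero out a class.
        total = sum(unigram_counts.values()) + len(vocab)
        scores[label] = sum(
            n * math.log((unigram_counts[word] + 1) / total)
            for word, n in bag_of_words(sentence).items()
        )
    return max(scores, key=scores.get)

# Hypothetical toy training data, invented for this sketch.
toy_data = [
    ("i love this great design", "positive"),
    ("what a great idea", "positive"),
    ("i hate this awful remote", "negative"),
    ("that is an awful idea", "negative"),
]
counts = train(toy_data)
print(classify("great design", counts))
print(classify("awful remote", counts))
```

Because each sentence is reduced to unigram counts, word order plays no role: "great design" and "design great" receive the same representation.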


One can go beyond the BOW approach to learn much more complex patterns. Riloff and Phillips [2004] presented a method for learning subjective extraction patterns from a large amount of data, which takes subjective and non-subjective text as input and outputs significant lexico-syntactic patterns that can discriminate between subjective and non-subjective sentences. These patterns are based on shallow syntactic structure output by the Sundance dependency parser [Riloff and Phillips, 2004]. They are extracted by exhaustively applying syntactic templates such as <subj> passive-verb and active-verb <dobj> to a training corpus, with an extracted pattern for every instantiation of the syntactic template. These patterns are scored according to two quantities: the probability that a sentence is subjective given the pattern, and the frequency of the pattern. Because these patterns are based on syntactic structure, they can represent subjective expressions that are not fixed word sequences and would therefore be missed by a simple n-gram approach.
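
The scoring step can be sketched as follows. This is a hedged illustration rather than the authors' implementation: it assumes the patterns have already been extracted for each sentence (no parser is run), the pattern strings are invented instantiations of the two templates, and the thresholds `min_freq` and `min_prob` are hypothetical values chosen for the example.

```python
from collections import Counter

def score_patterns(sentences):
    """Map each pattern to (frequency, P(subjective | pattern)).

    `sentences` is an iterable of (patterns, is_subjective) pairs, where
    `patterns` holds the patterns instantiated in that sentence.
    """
    freq = Counter()
    subjective_freq = Counter()
    for patterns, is_subjective in sentences:
        for pattern in set(patterns):  # count each pattern once per sentence
            freq[pattern] += 1
            if is_subjective:
                subjective_freq[pattern] += 1
    return {p: (freq[p], subjective_freq[p] / freq[p]) for p in freq}

def select_patterns(scores, min_freq=2, min_prob=0.8):
    """Keep patterns that are both frequent and strongly subjective."""
    return sorted(
        p for p, (f, prob) in scores.items()
        if f >= min_freq and prob >= min_prob
    )

# Hypothetical pre-extracted patterns with sentence-level labels.
corpus = [
    ({"<subj> was_criticized", "complained <dobj>"}, True),
    ({"<subj> was_criticized"}, True),
    ({"<subj> was_built", "bought <dobj>"}, False),
    ({"<subj> was_built"}, False),
    ({"<subj> was_criticized", "bought <dobj>"}, True),
]
scores = score_patterns(corpus)
selected = select_patterns(scores)
print(selected)
```

Requiring a minimum frequency as well as a minimum probability guards against patterns that look perfectly subjective only because they were seen once.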

The disadvantage of a lexicon-based system is that it usually relies on a hand-built dictionary, which requires many human hours and limits portability to new domains and modalities, since vocabularies may differ. An advantage of a statistical system, in contrast, is that porting it to a new domain only requires the new dataset and any requisite annotations from which to learn. The annotation itself may admittedly be time-consuming, depending on how coarse or fine it is, but once complete, the system will automatically learn new subjective and polar terms for that domain. On the other hand, the advantage of a lexicon-based system is that it can have very high precision, since the dictionaries are typically hand-built and tuned for a particular domain.

As mentioned, the supervised vs. unsupervised distinction is roughly related to the lexicon-based vs. statistical distinction in practice. Lexicon-based approaches are often unsupervised, rule-based algorithms (e.g., SO-Cal), while statistical systems are typically trained on labeled data. However, systems can easily cut across these distinctions, e.g., by using the output of a lexicon-based system as a feature of a statistical classifier. Moreover, a lexicon-based system itself need not be entirely hand-crafted, but can incorporate words and associated scores that are learned from data in a supervised or semi-supervised fashion.
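
One way such a hybrid might look is sketched below. Everything here is an assumption made for illustration: the four-word `LEXICON` and its scores are invented, the training sentences are toy data, and a plain perceptron stands in for whatever statistical classifier a real system would use.

```python
# Hypothetical polarity lexicon; the words and scores are invented.
LEXICON = {"great": 1.0, "love": 1.0, "awful": -1.0, "hate": -1.0}

def lexicon_score(tokens):
    """Output of the lexicon-based component: summed word polarities."""
    return sum(LEXICON.get(token, 0.0) for token in tokens)

def features(sentence):
    """Unigram indicator features plus the lexicon score as one more feature."""
    tokens = sentence.lower().split()
    feats = {"word=" + token: 1.0 for token in tokens}
    feats["lexicon_score"] = lexicon_score(tokens)
    return feats

def activation(feats, weights):
    return sum(weights.get(name, 0.0) * value for name, value in feats.items())

def train_perceptron(data, epochs=10):
    """Learn feature weights from (sentence, label) pairs, label in {+1, -1}."""
    weights = {}
    for _ in range(epochs):
        for sentence, label in data:
            feats = features(sentence)
            if label * activation(feats, weights) <= 0:  # misclassified
                for name, value in feats.items():
                    weights[name] = weights.get(name, 0.0) + label * value
    return weights

def predict(sentence, weights):
    return 1 if activation(features(sentence), weights) > 0 else -1

# Hypothetical labeled data for the statistical component.
training_data = [
    ("i love it", 1),
    ("this is great", 1),
    ("i hate it", -1),
    ("this is awful", -1),
]
weights = train_perceptron(training_data)
print(predict("what a great plan", weights))
print(predict("an awful plan", weights))
```

In this sketch the classifier learns how much to trust the lexicon score alongside the unigram features, so the lexicon can still contribute when the surrounding words were never seen in training.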

3.3.2 SENTIMENT DETECTION IN CONVERSATIONS

With meetings, most recent sentiment detection work has focused on the AMI corpus (see Chapter 2). Somasundaran et al. [2007] describe their coding scheme for opinion annotation and apply it to a subset of the AMI corpus. They consider two types of opinions: expressing sentiment, which includes feelings and emotions, and arguing, which includes convictions and persuasion. Their system for detecting sentiment and arguing is a good example of combining lexicon-based and statistical approaches, as they avail themselves of existing sentiment lexicons and create a new arguing lexicon, but also combine these knowledge sources with dialogue act and adjacency pair information in a
