machine learning techniques, where a classifier can be trained on data labeled as subjective vs. non-
subjective or positive vs. negative, at the document or sentence level. Alternatively, semi-supervised
or unsupervised algorithms can predict subjectivity and polarity with little or no labeled data; an
example of a semi-supervised approach is to use a manually selected set of seed words that are known
to be subjective or to have a particular polarity, and use those seed words to automatically label
sentences or documents. This can lead to the discovery of new subjective words, expansion of the
seed set, and repetition of the whole process. This would be an example of a bootstrapping procedure.
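One round of such a bootstrapping loop might be sketched as follows. The seed words, the toy corpus, the stopword list, and the frequency threshold are all invented for illustration; a real system would need additional filtering, since frequent co-occurring words are not necessarily subjective:

```python
from collections import Counter

# Sketch of one bootstrapping round: auto-label sentences that contain a
# seed word as subjective, then promote words that occur frequently in
# those sentences into the seed set. All data here is illustrative.
SEEDS = {"love", "hate", "awesome", "terrible"}
STOPWORDS = {"i", "this", "is", "the", "a", "so"}

def bootstrap_round(sentences, seeds, min_count=2):
    counts = Counter()
    for sent in sentences:
        words = set(sent.lower().split())
        if words & seeds:  # auto-label: the sentence contains a seed word
            counts.update(words - seeds - STOPWORDS)
    # Candidate new subjective words: frequent in auto-labeled sentences.
    new_words = {w for w, c in counts.items() if c >= min_count}
    return seeds | new_words  # expanded seed set for the next round

corpus = [
    "I love this phone",
    "this phone is awesome",
    "I hate the battery",
    "the battery is terrible",
]
print(bootstrap_round(corpus, SEEDS))
```

Repeating `bootstrap_round` with the expanded set is exactly the "expansion and repetition" loop described above.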
A related distinction is between lexicon-based approaches and statistical approaches, though
this is less of a theoretical distinction than a reflection of common system implementations. In a
lexicon-based approach, there is a dictionary of subjective or polar words, usually associated with
numerical scores indicating the strength of the word polarity. For example, the scores may range from
-5 to +5, with -5 indicating very negative sentiment (e.g., “terrible”) and +5 very positive sentiment
(e.g., “awesome”). Given a text, a lexicon-based system identifies words contained in its lexicon and
retrieves their word scores. A phrase, sentence or document can be scored, in the simplest case, by
summing over its sentiment word scores.
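This simplest scoring scheme can be sketched in a few lines. The tiny lexicon and its scores below are illustrative only, not drawn from any published resource:

```python
# Minimal sketch of lexicon-based scoring: sum the scores of all known
# sentiment words in a text. Lexicon entries here are illustrative.
LEXICON = {
    "awesome": 5, "love": 4, "good": 2,
    "bad": -2, "hate": -4, "terrible": -5,
}

def score_text(text):
    """Return the summed sentiment score of the lexicon words in text."""
    words = text.lower().split()
    return sum(LEXICON.get(w, 0) for w in words)

print(score_text("an awesome but terrible idea"))  # 5 + (-5) = 0
```

Even this toy version shows a useful property of summing: mixed texts tend toward neutral scores. The next paragraphs explain why a real system still needs much more than this.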
SO-Cal [Taboada et al., 2010] (for Semantic Orientation Calculator) is a lexicon-based system
that is considerably more sophisticated than that and illustrates why simply summing over word
scores is not sufficient. To give just one example, such a system must account for negators that
can weaken or reverse a word's dictionary score. The following three sentences help illustrate this
phenomenon:
1. I love this interface design.
2. I don't love this interface design.
3. I hate this interface design.
It seems clear that Sentence 1 is very positive and Sentence 3 is very negative, as indicated
by the words love and hate, respectively. However, Sentence 2 also contains the word love. Based
solely on the dictionary scores for the sentiment words, this sentence should therefore be considered
positive as well. Of course, we know that the preceding word don't negates that positive sentiment,
and any system will need to account for this effect. However, if we simply reverse the sentence score
due to the presence of the negator, we will end up assigning a very negative score similar to the score
for Sentence 3. Intuitively, it seems that Sentence 2 is more ambivalent than Sentence 3 and should
have more of a neutral score. For that reason, systems like SO-Cal make more subtle adjustments
to a sentence score when a negator is present.
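One such adjustment is a polarity shift: instead of multiplying a negated word's score by -1, the score is shifted toward the opposite polarity by a fixed amount. SO-Cal's actual rules are considerably more elaborate; the sketch below uses a shift of 4, a two-word lexicon, and whitespace tokenization, all of which are simplifications for illustration:

```python
# Sketch of the polarity-shift treatment of negation: a negated word's
# score is shifted toward the opposite polarity rather than reversed.
# The shift value, lexicon, and negator list are illustrative.
LEXICON = {"love": 4, "hate": -4}
NEGATORS = {"not", "don't", "never"}
SHIFT = 4

def score_sentence(sentence):
    score = 0
    tokens = sentence.lower().split()
    for i, tok in enumerate(tokens):
        word_score = LEXICON.get(tok)
        if word_score is None:
            continue
        if i > 0 and tokens[i - 1] in NEGATORS:
            # Shift toward the opposite polarity instead of flipping sign.
            word_score += SHIFT if word_score < 0 else -SHIFT
        score += word_score
    return score

print(score_sentence("I love this interface design"))        # 4
print(score_sentence("I don't love this interface design"))  # 0
print(score_sentence("I hate this interface design"))        # -4
```

Under this scheme, Sentence 2 lands at a neutral 0 rather than mirroring Sentence 3's -4, matching the intuition described above.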
In contrast, many statistical systems do not rely on hand-crafted dictionaries, but rather automatically learn subjective terms or phrases from labeled or unlabeled data. One idea is to build a
list of subjective words by identifying the words that occur most frequently in text labeled as being
subjective, once stopwords have been removed. Other statistical systems never use an explicit list of
subjective or polar words, but rather extract raw lexical features such as unigrams and bigrams and
let the machine learning method automatically learn how those features correlate with the positive