9.7 Determining Sentiments
In addition to the TFIDF and topic models, the Data Science team may want to
identify the sentiments in user comments and reviews of the ACME products.
Sentiment analysis refers to a group of tasks that use statistics and natural
language processing to mine opinions to identify and extract subjective information
from texts.
Early work on sentiment analysis focused on detecting the polarity of product
reviews from Epinions [34] and movie reviews from the Internet Movie Database
(IMDb) [35] at the document level. Later work addressed sentiment analysis at the
sentence level [36]. More recently, the focus has shifted to the phrase level [37] and to
short-text forms, in response to the popularity of micro-blogging services such as
Twitter [38-42].
Intuitively, to conduct sentiment analysis, one can manually construct lists of words
with positive sentiments (such as brilliant, awesome, and spectacular) and
negative sentiments (such as awful, stupid, and hideous). Related work has
pointed out that such an approach can be expected to achieve accuracy around 60%
[35], and it is likely to be outperformed by examination of corpus statistics [43].
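The word-list approach can be sketched in a few lines. The tiny lexicons below reuse the example words from the text and are illustrative only, not a real sentiment lexicon:

```python
# A minimal sketch of the hand-built word-list approach.
# These tiny lists are illustrative only, not a real sentiment lexicon.
POSITIVE = {"brilliant", "awesome", "spectacular"}
NEGATIVE = {"awful", "stupid", "hideous"}

def polarity(text):
    """Label a text by counting positive versus negative lexicon hits."""
    tokens = [t.strip(".,!?") for t in text.lower().split()]
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(polarity("an awesome and spectacular film"))  # positive
print(polarity("what a hideous, stupid plot"))      # negative
```

The limits mentioned above are easy to see here: a review such as "not awesome at all" would be mislabeled as positive, which is one reason this approach tops out near 60% accuracy.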
Classification methods such as naïve Bayes (introduced in Chapter 7), maximum
entropy (MaxEnt), and support vector machines (SVM) are often used to extract
corpus statistics for sentiment analysis. Related research has found that these
classifiers can achieve around 80% accuracy [35, 41, 42] on sentiment analysis over
unstructured data. One or more such classifiers can be applied to unstructured
data, such as movie reviews or even tweets.
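To make the naïve Bayes option concrete, the following is a from-scratch sketch of a multinomial naïve Bayes sentiment classifier over bag-of-words features, in the spirit of Chapter 7. The class name and the toy reviews are hypothetical, not from the Pang et al. corpus:

```python
import math
from collections import Counter, defaultdict

class NaiveBayesSentiment:
    """Multinomial naive Bayes over bag-of-words features (illustrative sketch)."""

    def fit(self, docs, labels):
        self.label_counts = Counter(labels)          # class priors
        self.word_counts = defaultdict(Counter)      # per-class word frequencies
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc.lower().split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, doc):
        total_docs = sum(self.label_counts.values())
        best_label, best_logprob = None, float("-inf")
        for label in self.label_counts:
            # log prior + log likelihoods with Laplace (add-one) smoothing
            logprob = math.log(self.label_counts[label] / total_docs)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for w in doc.lower().split():
                logprob += math.log((self.word_counts[label][w] + 1) / denom)
            if logprob > best_logprob:
                best_label, best_logprob = label, logprob
        return best_label

# Hypothetical tagged reviews for illustration
docs = ["brilliant awesome film", "spectacular and brilliant",
        "awful stupid plot", "hideous and awful"]
labels = ["positive", "positive", "negative", "negative"]
clf = NaiveBayesSentiment().fit(docs, labels)
print(clf.predict("an awesome brilliant movie"))  # positive
```

Unlike the word-list approach, the word weights here are learned from the tagged corpus rather than supplied by hand, which is what "examination of corpus statistics" refers to.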
The movie review corpus by Pang et al. [35] includes 2,000 movie reviews collected
from an IMDb archive of the rec.arts.movies.reviews newsgroup [43]. These movie
reviews have been manually tagged into 1,000 positive reviews and 1,000 negative
reviews.
Depending on the classifier, the data may need to be split into training and testing
sets. As seen previously in Chapter 7, a useful rule of thumb for splitting data is
to make the training set much bigger than the testing set. For example, an 80/20
split produces 80% of the data as the training set and 20% as the testing set.
Next, one or more classifiers are trained over the training set to learn the
characteristics or patterns residing in the data. The sentiment tags in the testing
data are hidden away from the classifiers. After the training, the classifiers are
tested over the testing set.
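The split-and-evaluate procedure can be sketched as below. The corpus and the stand-in classification rule are hypothetical (the real Pang et al. corpus has 2,000 reviews); any trained classifier could be scored the same way, with the true tags consulted only when computing accuracy:

```python
import random

def train_test_split(examples, train_frac=0.8, seed=42):
    """Shuffle labeled examples and split them, e.g. 80% training / 20% testing."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

def accuracy(classifier, test_set):
    """Score a classifier on held-out data; true tags are used only here."""
    hits = sum(classifier(doc) == tag for doc, tag in test_set)
    return hits / len(test_set)

# Hypothetical tagged corpus for illustration
corpus = [("awesome film", "positive"), ("brilliant plot", "positive"),
          ("spectacular acting", "positive"), ("awesome script", "positive"),
          ("awesome movie", "positive"), ("awful film", "negative"),
          ("stupid plot", "negative"), ("hideous acting", "negative"),
          ("awful script", "negative"), ("awful movie", "negative")]

train_set, test_set = train_test_split(corpus)  # 8 training, 2 testing examples
# A trivial stand-in rule so the evaluation mechanics run end to end;
# in practice this would be a classifier trained on train_set.
classify = lambda doc: "positive" if any(
    w in doc for w in ("awesome", "brilliant", "spectacular")) else "negative"
print(accuracy(classify, test_set))
```

Fixing the shuffle seed makes the split reproducible, which matters when comparing the accuracy of several classifiers on the same held-out data.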