9.7 Determining Sentiments
In addition to the TFIDF and topic models, the Data Science team may want to
identify the sentiments in user comments and reviews of the ACME products.
Sentiment analysis refers to a group of tasks that use statistics and natural
language processing to mine opinions to identify and extract subjective information
from texts.
Early work on sentiment analysis focused on detecting the polarity of product
reviews from Epinions [34] and movie reviews from the Internet Movie Database
(IMDb) [35] at the document level. Later work addressed sentiment analysis at the
sentence level [36]. More recently, the focus has shifted to the phrase level [37] and to
short-text forms, in response to the popularity of micro-blogging services such as
Twitter [38-42].
Intuitively, to conduct sentiment analysis, one can manually construct lists of words
with positive sentiments (such as brilliant, awesome, and spectacular) and
negative sentiments (such as awful, stupid, and hideous). Related work has
pointed out that such an approach can be expected to achieve accuracy around 60%
[35], and it is likely to be outperformed by examination of corpus statistics [43].
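The word-list approach can be sketched in a few lines. The tiny lexicons below reuse the example words from the text and are illustrative only, not a real sentiment lexicon:

```python
# A minimal sketch of the hand-built word-list approach.
# These tiny lists are illustrative only, not a real sentiment lexicon.
POSITIVE = {"brilliant", "awesome", "spectacular"}
NEGATIVE = {"awful", "stupid", "hideous"}

def polarity(text):
    """Label a text by counting positive versus negative lexicon hits."""
    tokens = [t.strip(".,!?") for t in text.lower().split()]
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(polarity("an awesome and spectacular film"))  # positive
print(polarity("what a hideous, stupid plot"))      # negative
```

The limits mentioned above are easy to see here: a review such as "not awesome at all" would be mislabeled as positive, which is one reason this approach tops out near 60% accuracy.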
Classification methods such as naïve Bayes (introduced in Chapter 7), maximum
entropy (MaxEnt), and support vector machines (SVM) are often used to extract
corpus statistics for sentiment analysis. Related research has found that these
classifiers can achieve around 80% accuracy [35, 41, 42] on sentiment analysis over
unstructured data. One or more such classifiers can be applied to unstructured
data, such as movie reviews or even tweets.
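To make the naïve Bayes option concrete, the following is a from-scratch sketch of a multinomial naïve Bayes sentiment classifier over bag-of-words features, in the spirit of Chapter 7. The class name and the toy reviews are hypothetical, not from the Pang et al. corpus:

```python
import math
from collections import Counter, defaultdict

class NaiveBayesSentiment:
    """Multinomial naive Bayes over bag-of-words features (illustrative sketch)."""

    def fit(self, docs, labels):
        self.label_counts = Counter(labels)          # class priors
        self.word_counts = defaultdict(Counter)      # per-class word frequencies
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc.lower().split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, doc):
        total_docs = sum(self.label_counts.values())
        best_label, best_logprob = None, float("-inf")
        for label in self.label_counts:
            # log prior + log likelihoods with Laplace (add-one) smoothing
            logprob = math.log(self.label_counts[label] / total_docs)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for w in doc.lower().split():
                logprob += math.log((self.word_counts[label][w] + 1) / denom)
            if logprob > best_logprob:
                best_label, best_logprob = label, logprob
        return best_label

# Hypothetical tagged reviews for illustration
docs = ["brilliant awesome film", "spectacular and brilliant",
        "awful stupid plot", "hideous and awful"]
labels = ["positive", "positive", "negative", "negative"]
clf = NaiveBayesSentiment().fit(docs, labels)
print(clf.predict("an awesome brilliant movie"))  # positive
```

Unlike the word-list approach, the word weights here are learned from the tagged corpus rather than supplied by hand, which is what "examination of corpus statistics" refers to.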
The movie review corpus by Pang et al. [35] includes 2,000 movie reviews collected
from an IMDb archive of the rec.arts.movies.reviews newsgroup [43]. These movie
reviews have been manually tagged into 1,000 positive reviews and 1,000 negative
reviews.
Depending on the classifier, the data may need to be split into training and testing
sets. As seen previously in Chapter 7, a useful rule of thumb for splitting data is
to make the training set much bigger than the testing set. For example, an 80/20
split produces 80% of the data as the training set and 20% as the testing set.
Next, one or more classifiers are trained over the training set to learn the
characteristics or patterns residing in the data. The sentiment tags in the testing
data are hidden away from the classifiers. After the training, the classifiers are
tested over the testing set.
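The split-and-evaluate procedure can be sketched as below. The corpus and the stand-in classification rule are hypothetical (the real Pang et al. corpus has 2,000 reviews); any trained classifier could be scored the same way, with the true tags consulted only when computing accuracy:

```python
import random

def train_test_split(examples, train_frac=0.8, seed=42):
    """Shuffle labeled examples and split them, e.g. 80% training / 20% testing."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

def accuracy(classifier, test_set):
    """Score a classifier on held-out data; true tags are used only here."""
    hits = sum(classifier(doc) == tag for doc, tag in test_set)
    return hits / len(test_set)

# Hypothetical tagged corpus for illustration
corpus = [("awesome film", "positive"), ("brilliant plot", "positive"),
          ("spectacular acting", "positive"), ("awesome script", "positive"),
          ("awesome movie", "positive"), ("awful film", "negative"),
          ("stupid plot", "negative"), ("hideous acting", "negative"),
          ("awful script", "negative"), ("awful movie", "negative")]

train_set, test_set = train_test_split(corpus)  # 8 training, 2 testing examples
# A trivial stand-in rule so the evaluation mechanics run end to end;
# in practice this would be a classifier trained on train_set.
classify = lambda doc: "positive" if any(
    w in doc for w in ("awesome", "brilliant", "spectacular")) else "negative"
print(accuracy(classify, test_set))
```

Fixing the shuffle seed makes the split reproducible, which matters when comparing the accuracy of several classifiers on the same held-out data.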