Advanced Analytical Theory and Methods: Text Analysis - Data Science and Big Data Analytics

over the testing set to infer the sentiment tags. Finally, the result is compared against the original sentiment tags to evaluate the overall performance of the classifier.
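The comparison step can be sketched in a few lines. The example below is a minimal illustration in Python 3 rather than the book's code: the `accuracy` helper and the toy tag lists are hypothetical, and accuracy (the fraction of matching tags) is only one of several possible performance measures.

```python
def accuracy(predicted, actual):
    """Return the fraction of predicted tags that match the true tags."""
    if len(predicted) != len(actual):
        raise ValueError("tag lists must have the same length")
    if not actual:
        raise ValueError("tag lists must not be empty")
    matches = sum(1 for p, a in zip(predicted, actual) if p == a)
    return matches / len(actual)

# Hypothetical tags for five test reviews (made-up data for illustration).
actual_tags = ['pos', 'neg', 'pos', 'neg', 'pos']
predicted_tags = ['pos', 'neg', 'neg', 'neg', 'pos']

score = accuracy(predicted_tags, actual_tags)
print('Accuracy: %.2f' % score)
```

Four of the five hypothetical predictions match the original tags, so the helper reports an accuracy of 0.80.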

The code that follows is written in Python using the Natural Language Toolkit (NLTK) library (http://nltk.org/). It shows how to perform sentiment analysis using the naïve Bayes classifier over the movie review corpus.

The code splits the 2,000 reviews into a training set of 1,600 reviews and a testing set of 400 reviews. The naïve Bayes classifier learns from the training set, while the sentiments in the testing set are hidden from it. For each review in the training set, the classifier learns how each feature affects the outcome sentiment. Next, the classifier is given the testing set. For each review in that set, it predicts the corresponding sentiment from the features in the review.
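To make "learns how each feature impacts the outcome sentiment" concrete, here is a minimal pure-Python sketch of a naïve Bayes classifier over word-presence features. It is not NLTK's implementation: the `train` and `predict` functions, the add-one smoothing, and the four toy reviews are assumptions made for illustration, and only words present in a review contribute to its score.

```python
import math
from collections import Counter, defaultdict

def train(labeled_docs):
    """Count how many documents of each label contain each word."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, label in labeled_docs:
        label_counts[label] += 1
        for word in set(words):          # presence/absence, not frequency
            word_counts[label][word] += 1
            vocab.add(word)
    return label_counts, word_counts, vocab

def predict(model, words):
    """Return the label with the highest log posterior score."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float('-inf')
    for label, n_docs in label_counts.items():
        score = math.log(n_docs / total_docs)   # log prior
        for word in set(words) & vocab:         # ignore words never seen in training
            # Add-one smoothing avoids a zero probability for a word/label
            # pair that never occurred together in training.
            p = (word_counts[label][word] + 1) / (n_docs + 2)
            score += math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training reviews (made-up data for illustration).
train_docs = [
    (['great', 'fun', 'moving'], 'pos'),
    (['great', 'acting'], 'pos'),
    (['boring', 'plot'], 'neg'),
    (['boring', 'bad', 'acting'], 'neg'),
]
model = train(train_docs)
prediction = predict(model, ['great', 'fun', 'plot'])
print(prediction)
```

In this toy data, "great" and "fun" occur only in positive reviews, so they outweigh "plot" and the sketch labels the unseen review as positive.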

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews
from collections import defaultdict
import numpy as np

# define an 80/20 split for train/test
SPLIT = 0.8

def word_feats(words):
    # map each word in the review to True; unseen words default to False
    feats = defaultdict(lambda: False)
    for word in words:
        feats[word] = True
    return feats
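The `word_feats` helper above turns a list of words into a bag-of-words feature set that records presence only. The short sketch below repeats the helper so that it runs on its own (in Python 3) and applies it to a hypothetical token list; the tokens are made up for illustration and do not come from the corpus.

```python
from collections import defaultdict

def word_feats(words):
    # Same helper as in the listing: every word seen becomes a True feature.
    feats = defaultdict(lambda: False)
    for word in words:
        feats[word] = True
    return feats

# Hypothetical tokenized review (made-up text for illustration).
tokens = ['a', 'great', 'film', 'with', 'a', 'great', 'cast']
feats = word_feats(tokens)

# Repeated words collapse into a single feature.
n_features = len(feats)
print(n_features)

print(feats['great'])    # the word occurs in the review
print(feats['boring'])   # an unseen word falls back to the default
```

Note that looking up a missing key in a `defaultdict` also stores it, so reading `feats['boring']` adds a `False` entry to the feature set.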

posids = movie_reviews.fileids('pos')
negids = movie_reviews.fileids('neg')

posfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'pos')
            for f in posids]
negfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'neg')
            for f in negids]

cutoff = int(len(posfeats) * SPLIT)
trainfeats = negfeats[:cutoff] + posfeats[:cutoff]
testfeats = negfeats[cutoff:] + posfeats[cutoff:]

print 'Train on %d instances\nTest on %d instances' %
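The print statement above is cut off in this excerpt. The sketch below walks through the split arithmetic in Python 3 with placeholder feature sets, so it runs without the corpus. It assumes that the two `%d` placeholders are filled with the lengths of `trainfeats` and `testfeats`; that completion is a guess from context, not the book's text.

```python
SPLIT = 0.8  # 80/20 train/test split, as in the listing

# Placeholder feature sets standing in for the 1,000 positive and 1,000
# negative reviews; the real listing builds these from movie_reviews.
posfeats = [({'word_%d' % i: True}, 'pos') for i in range(1000)]
negfeats = [({'word_%d' % i: True}, 'neg') for i in range(1000)]

# The cutoff is computed from the positive list and reused for the negative
# list, which works because both classes have the same number of reviews.
cutoff = int(len(posfeats) * SPLIT)

trainfeats = negfeats[:cutoff] + posfeats[:cutoff]
testfeats = negfeats[cutoff:] + posfeats[cutoff:]

# Assumed completion of the truncated print statement (Python 3 syntax).
print('Train on %d instances\nTest on %d instances'
      % (len(trainfeats), len(testfeats)))
```

With 800 reviews per class in the training slice, the counts match the 1,600/400 split described in the text.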
