over the testing set to infer the sentiment tags. Finally, the result is compared
against the original sentiment tags to evaluate the overall performance of the
classifier.
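The comparison step described above amounts to computing accuracy: the share of held-out reviews whose predicted tag matches the original tag. The following is a minimal sketch written for Python 3; the function name `accuracy` and the sample labels are illustrative assumptions, not part of the original listing.

```python
def accuracy(predicted, actual):
    """Return the fraction of predicted tags that match the original tags."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("cannot score an empty testing set")
    matches = sum(1 for p, a in zip(predicted, actual) if p == a)
    return matches / len(actual)


# Made-up tags for illustration only (not taken from the movie review corpus)
actual = ["pos", "neg", "pos", "neg", "neg"]
predicted = ["pos", "neg", "neg", "neg", "pos"]

# Three of the five predictions agree with the original tags
print(accuracy(predicted, actual))
```

Accuracy is only one possible summary of overall performance; it is used here because it maps directly onto "compare the result against the original tags".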
The code that follows is written in Python using the Natural Language Toolkit
(NLTK) library (http://nltk.org/). It shows how to perform sentiment analysis
using the naïve Bayes classifier over the movie review corpus.
The code splits the 2,000 reviews into a training set of 1,600 reviews and a
testing set of 400 reviews. The naïve Bayes classifier learns from the training
set: for each review in the training set, it learns how each feature affects the
resulting sentiment. The sentiments in the testing set are hidden from the
classifier. Next, the classifier is given the testing set. For each review in
that set, it predicts the corresponding sentiment from the features in that
review.
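Before turning to the NLTK listing, the learn-then-predict cycle can be illustrated on a toy scale. The sketch below is written for Python 3 and is a simplified naïve Bayes with add-one smoothing that scores only the words present in a review; it is not NLTK's estimator. The helper names `train` and `predict` and the four-review training set are assumptions made up for illustration.

```python
import math
from collections import Counter, defaultdict


def train(labeled_docs):
    """Count, per sentiment, how many training reviews contain each word."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)  # sentiment -> word -> review count
    for words, label in labeled_docs:
        label_counts[label] += 1
        for word in set(words):
            word_counts[label][word] += 1
    return label_counts, word_counts


def predict(model, words):
    """Pick the sentiment with the highest log-probability for these words."""
    label_counts, word_counts = model
    total = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in sorted(label_counts):
        n = label_counts[label]
        score = math.log(n / total)  # prior for this sentiment
        for word in set(words):
            # add-one smoothing so unseen words do not zero out the score
            score += math.log((word_counts[label][word] + 1) / (n + 2))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy data for illustration only
train_docs = [
    ({"great", "fun", "moving"}, "pos"),
    ({"great", "acting"}, "pos"),
    ({"dull", "boring", "plot"}, "neg"),
    ({"boring", "acting"}, "neg"),
]
test_docs = [({"great", "plot"}, "pos"), ({"boring", "dull"}, "neg")]

model = train(train_docs)
# The testing labels are ignored here: the classifier sees only the words
predicted = [predict(model, words) for words, _ in test_docs]
print(predicted)
```

The testing labels are dropped when predicting, which mirrors how the sentiments in the testing set stay hidden from the classifier until evaluation.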
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews
from collections import defaultdict
import numpy as np

# define an 80/20 split for train/test
SPLIT = 0.8

def word_feats(words):
    feats = defaultdict(lambda: False)
    for word in words:
        feats[word] = True
    return feats

posids = movie_reviews.fileids('pos')
negids = movie_reviews.fileids('neg')

posfeats = [(word_feats(movie_reviews.words(fileids=[f])),
             'pos')
            for f in posids]
negfeats = [(word_feats(movie_reviews.words(fileids=[f])),
             'neg')
            for f in negids]

cutoff = int(len(posfeats) * SPLIT)
trainfeats = negfeats[:cutoff] + posfeats[:cutoff]
testfeats = negfeats[cutoff:] + posfeats[cutoff:]
print 'Train on %d instances\nTest on %d instances' %