import nltk
import nltk.classify.util
import numpy as np
from nltk.classify import NaiveBayesClassifier

# report the train/test split, then train and evaluate the classifier
print 'Train on %d instances\nTest on %d instances' % (
    len(trainfeats), len(testfeats))
classifier = NaiveBayesClassifier.train(trainfeats)
print 'Accuracy:', nltk.classify.util.accuracy(classifier, testfeats)
classifier.show_most_informative_features()

# prepare the confusion matrix: classify the held-out positive and
# negative feature sets separately so we know the actual class of each
pos = [classifier.classify(fs) for (fs, l) in posfeats[cutoff:]]
pos = np.array(pos)
neg = [classifier.classify(fs) for (fs, l) in negfeats[cutoff:]]
neg = np.array(neg)

print 'Confusion matrix:'
print '\t'*2, 'Predicted class'
print '-'*40
print '|\t %d (TP) \t|\t %d (FN) \t| Actual class' % (
    (pos == 'pos').sum(), (pos == 'neg').sum())
print '-'*40
print '|\t %d (FP) \t|\t %d (TN) \t|' % (
    (neg == 'pos').sum(), (neg == 'neg').sum())
print '-'*40
The output that follows shows that the naïve Bayes classifier is trained on 1,600
instances and tested on 400 instances from the movie review corpus. The classifier
achieves an accuracy of 73.5%. The most informative features for positive reviews
include words such as outstanding, vulnerable, and astounding; words such as
insulting, ludicrous, and uninvolving are the most informative features for
negative reviews. Finally, the output shows the confusion matrix for the
classifier, which allows a finer-grained evaluation of its performance.
Train on 1600 instances
Test on 400 instances
Accuracy: 0.735
Most Informative Features
outstanding = True pos : neg = 13.9 : 1.0
insulting = True neg : pos = 13.7 : 1.0
vulnerable = True pos : neg = 13.0 : 1.0
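The four cells of the confusion matrix translate directly into the standard evaluation metrics. The sketch below uses hypothetical counts (the excerpt above truncates the actual matrix), chosen only so that they reproduce the reported 73.5% accuracy on 400 test instances; precision, recall, and F1 for the positive class then follow from the usual definitions.

```python
# Hypothetical confusion-matrix counts, picked only so that
# (tp + tn) / total matches the reported 73.5% accuracy on 400 instances.
tp, fn = 150, 50   # actual-positive reviews: predicted pos / predicted neg
fp, tn = 56, 144   # actual-negative reviews: predicted pos / predicted neg

total = tp + fn + fp + tn
accuracy = float(tp + tn) / total      # fraction of all predictions correct
precision = float(tp) / (tp + fp)     # of predicted-positive, fraction correct
recall = float(tp) / (tp + fn)        # of actual-positive, fraction recovered
f1 = 2 * precision * recall / (precision + recall)

print('Accuracy: %.3f' % accuracy)
print('Precision (pos): %.3f' % precision)
print('Recall (pos): %.3f' % recall)
print('F1 (pos): %.3f' % f1)
```

Precision and recall are worth computing alongside accuracy here because the test set is balanced by construction; on imbalanced review data, accuracy alone would hide a classifier that favors the majority class.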