ludicrous = True neg : pos = 12.6 : 1.0
uninvolving = True neg : pos = 12.3 : 1.0
astounding = True pos : neg = 11.7 : 1.0
avoids = True pos : neg = 11.7 : 1.0
fascination = True pos : neg = 11.0 : 1.0
animators = True pos : neg = 10.3 : 1.0
symbol = True pos : neg = 10.3 : 1.0
Confusion matrix:
                            Predicted class
                ----------------------------------------
Actual class    |     195 (TP)     |      5 (FN)      |
                ----------------------------------------
                |     101 (FP)     |     99 (TN)      |
                ----------------------------------------
As discussed earlier in Chapter 7, a confusion matrix is a table layout that
visualizes the performance of a model over the testing set. Every row and
column corresponds to a possible class in the dataset, and each cell in the
matrix shows the number of test examples for which the actual class is the row
and the predicted class is the column. Good results correspond to large numbers
down the main diagonal (TP and TN) and small, ideally zero, off-diagonal elements
(FP and FN). Table 9.7 shows the confusion matrix from the previous program output
for the testing set of 400 reviews. Because a well-performing classifier should have a
confusion matrix with large numbers for TP and TN and near-zero numbers for FP and
FN, it can be concluded that the naïve Bayes classifier produces many false positives
(101 of the 200 negative reviews are predicted as positive), so it does not perform
very well on this testing set.
Table 9.7 Confusion Matrix for the Example Testing Set

                              Predicted Class
                          Positive        Negative
Actual Class   Positive   195 (TP)        5 (FN)
               Negative   101 (FP)        99 (TN)
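The counts in Table 9.7 can be tallied directly from the classifier's predictions.
The sketch below is illustrative rather than the book's own program: it assumes a
trained NLTK classifier named classifier and a list of (features, label) pairs named
test_set, with "pos" and "neg" as the two class labels.

# Tally the four confusion-matrix cells for the binary sentiment task.
# Assumptions: `classifier` is a trained nltk.NaiveBayesClassifier and
# `test_set` is a list of (feature_dict, label) pairs labeled 'pos'/'neg'.
tp = fn = fp = tn = 0
for features, actual in test_set:
    predicted = classifier.classify(features)
    if actual == 'pos' and predicted == 'pos':
        tp += 1      # true positive
    elif actual == 'pos' and predicted == 'neg':
        fn += 1      # false negative
    elif actual == 'neg' and predicted == 'pos':
        fp += 1      # false positive
    else:
        tn += 1      # true negative

print('Confusion matrix:')
print('              Predicted pos   Predicted neg')
print('Actual pos    {:>8} (TP)   {:>8} (FN)'.format(tp, fn))
print('Actual neg    {:>8} (FP)   {:>8} (TN)'.format(fp, tn))

For the 400-review testing set above, this tally yields the 195/5/101/99 split shown
in Table 9.7.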
Chapter 7 introduced a few measures for evaluating the performance of a classifier
beyond the confusion matrix. Precision and recall are two measures commonly
used to evaluate text analysis tasks. Definitions of precision and recall
are given in Equations 9.8 and 9.9.
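As a quick numerical check, the sketch below applies the standard definitions that
Equations 9.8 and 9.9 formalize (precision = TP / (TP + FP), recall = TP / (TP + FN))
to the counts in Table 9.7; the low precision reflects the large number of false
positives noted above.

# Precision and recall computed from the Table 9.7 counts.
tp, fn, fp, tn = 195, 5, 101, 99

precision = float(tp) / (tp + fp)   # 195 / 296, roughly 0.659
recall = float(tp) / (tp + fn)      # 195 / 200 = 0.975

print('Precision: {:.3f}'.format(precision))
print('Recall:    {:.3f}'.format(recall))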