difficult to achieve the top-left corner, but a better classifier's curve should come closer to the top left, setting it apart from classifiers whose curves lie nearer the diagonal line.
Related to the ROC curve is the area under the curve (AUC), obtained by measuring the area beneath the ROC curve. A higher AUC indicates a better-performing classifier. The score ranges from 0.5 (the diagonal line TPR = FPR, no better than random guessing) to 1.0 (an ROC curve passing through the top-left corner).
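To make the definition concrete, the short sketch below computes the AUC directly from its probabilistic interpretation: the AUC equals the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one. The scores and labels here are made up for illustration and are not from the bank marketing data.
# hypothetical scores and class labels, for illustration only
score <- c(0.9, 0.8, 0.7, 0.55, 0.4, 0.3)
label <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)
pos <- score[label]    # scores of the positive instances
neg <- score[!label]   # scores of the negative instances
# fraction of positive/negative pairs ranked correctly (ties count half)
auc <- mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))
auc   # 8/9, or about 0.889, for these made-up values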
In the bank marketing example, the training set includes 2,000 instances, and an additional 100 instances are included as the testing set. Figure 7.10 shows the ROC curve of the naïve Bayes classifier built on the training set of 2,000 instances and tested on the testing set of 100 instances. The figure is generated by the following R script. The ROCR package is required for plotting the ROC curve, and the e1071 package provides the naiveBayes() function used to build the classifier. The 2,000 training instances are in a data frame called banktrain, and the additional 100 instances are in a data frame called banktest.
library(ROCR)
library(e1071)   # provides naiveBayes()

# training set
banktrain <- read.table("bank-sample.csv", header=TRUE, sep=",")

# drop a few columns
drops <- c("balance", "day", "campaign", "pdays",
           "previous", "month")
banktrain <- banktrain[, !(names(banktrain) %in% drops)]

# testing set
banktest <- read.table("bank-sample-test.csv", header=TRUE, sep=",")
banktest <- banktest[, !(names(banktest) %in% drops)]

# build the naïve Bayes classifier on the training set
nb_model <- naiveBayes(subscribed ~ .,
                       data=banktrain)

# score the testing set, requesting raw class probabilities
nb_prediction <- predict(nb_model,
                         # remove column "subscribed"
                         banktest[, -ncol(banktest)],
                         type='raw')
score <- nb_prediction[, "yes"]
actual_class <- banktest$subscribed == 'yes'

# compute TPR and FPR at every score threshold and plot the ROC curve
pred <- prediction(score, actual_class)
perf <- performance(pred, "tpr", "fpr")
plot(perf)
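Continuing from the pred object above, the script can also report the AUC discussed earlier; ROCR's performance() function accepts "auc" as a measure, and the value is stored in the y.values slot of the resulting object. The following lines sketch that call.
# extract the AUC from the same prediction object
auc <- performance(pred, measure = "auc")
auc@y.values[[1]]   # area under the ROC curve shown in Figure 7.10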