Building a Classification Model with Spark - Machine Learning with Spark


Accuracy and prediction error

The prediction error for binary classification is possibly the simplest measure available. It is the number of examples that are misclassified, divided by the total number of examples. Similarly, accuracy is the number of correctly classified examples divided by the total number of examples, so the two measures always sum to 1.
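To make the two definitions concrete, here is a minimal sketch in plain Scala (no Spark required). The object name `AccuracySketch`, its helper functions, and the sample (prediction, label) pairs are hypothetical and chosen only for illustration; they are not part of the book's code.

```scala
// A minimal sketch, assuming predictions and labels are both encoded as 0.0 or 1.0.
object AccuracySketch {
  // Fraction of (prediction, label) pairs where the prediction matches the label.
  def accuracy(pairs: Seq[(Double, Double)]): Double = {
    require(pairs.nonEmpty, "need at least one example")
    pairs.count { case (predicted, label) => predicted == label }.toDouble / pairs.size
  }

  // Prediction error is the misclassified fraction, that is, 1 minus the accuracy.
  def predictionError(pairs: Seq[(Double, Double)]): Double =
    1.0 - accuracy(pairs)

  def main(args: Array[String]): Unit = {
    // Hypothetical data: three of the four predictions match their labels.
    val pairs = Seq((1.0, 1.0), (0.0, 0.0), (1.0, 1.0), (1.0, 0.0))
    println(s"accuracy = ${accuracy(pairs)}, error = ${predictionError(pairs)}")
  }
}
```

Because every example is either correctly classified or misclassified, computing one measure is enough to obtain the other.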

We can calculate the accuracy of our models on the training data by making a prediction for each input feature vector and comparing it to the true label. We will sum up the number of correctly classified instances and divide this by the total number of data points to get the average classification accuracy:

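// Emit 1 for each training point whose predicted label matches the true label
val lrTotalCorrect = data.map { point =>
  if (lrModel.predict(point.features) == point.label) 1 else 0
}.sum

// sum returns a Double, so this is floating-point (not integer) division
val lrAccuracy = lrTotalCorrect / data.count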

The output is as follows:

lrAccuracy: Double = 0.5146720757268425

This gives us 51.5 percent accuracy, which doesn't look particularly impressive! Our model got only about half of the training examples correct, which is about as good as random chance.

Note

The predictions made by the model are not inherently exactly 1 or 0. The output is usually a real number that must be turned into a class prediction. This is done by applying a threshold to the classifier's decision or scoring function.

For example, binary logistic regression is a probabilistic model that returns the estimated probability of class 1 in its scoring function. Thus, a decision threshold of 0.5 is typical. That is, if the estimated probability of being in class 1 is higher than 50 percent, the model decides to classify the point as class 1; otherwise, it will be classified as class 0.

The threshold itself is effectively a model parameter that can be tuned in some models. It also plays a role in evaluation measures, as we will see now.
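The thresholding step described in this note can be sketched in plain Scala as follows. This is only an illustration under stated assumptions: the object `ThresholdSketch`, the functions `sigmoid` and `classify`, and the raw scores are hypothetical and do not reproduce how any particular library implements its decision function.

```scala
// A minimal sketch of turning a real-valued score into a 0/1 class prediction.
object ThresholdSketch {
  // Logistic (sigmoid) function: maps a raw linear score to a probability in (0, 1).
  def sigmoid(score: Double): Double = 1.0 / (1.0 + math.exp(-score))

  // Class 1 if the estimated probability is higher than the threshold, otherwise class 0.
  def classify(probability: Double, threshold: Double = 0.5): Double =
    if (probability > threshold) 1.0 else 0.0

  def main(args: Array[String]): Unit = {
    // Hypothetical raw scores, for illustration only.
    val rawScores = Seq(-2.0, -0.1, 0.3, 1.5)
    for (t <- Seq(0.3, 0.5, 0.7)) {
      val predictions = rawScores.map(s => classify(sigmoid(s), t))
      println(s"threshold $t -> $predictions")
    }
  }
}
```

Raising the threshold makes the sketch more reluctant to predict class 1, so the same scores can yield different class predictions, which is why the threshold affects the evaluation measures.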
