Precision and recall
In information retrieval, precision is a commonly used measure of the quality of the results,
while recall is a measure of the completeness of the results.
In the binary classification context, precision is defined as the number of true positives
(that is, the number of examples correctly predicted as class 1) divided by the sum of true
positives and false positives (that is, the number of examples that were incorrectly
predicted as class 1). Thus, we can see that a precision of 1.0 (or 100 percent) is achieved if
every example predicted by the classifier to be class 1 is, in fact, in class 1 (that is, there are
no false positives).
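In symbols, writing TP for the number of true positives and FP for the number of false positives, this definition can be written as:

\[
\text{precision} = \frac{TP}{TP + FP}
\]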
Recall is defined as the number of true positives divided by the sum of true positives and
false negatives (that is, the number of examples that were in class 1, but were predicted as
class 0 by the model). We can see that a recall of 1.0 (or 100 percent) is achieved if the
model doesn't miss any examples that were in class 1 (that is, there are no false negatives).
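Similarly, writing FN for the number of false negatives, recall can be written as:

\[
\text{recall} = \frac{TP}{TP + FN}
\]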
Precision and recall are generally inversely related: higher precision often comes at the
cost of lower recall, and vice versa. To illustrate this, assume that we built a model that
always predicts class 1. Such a model produces no false negatives, because it never misses
an example of class 1, so its recall is 1.0. On the other hand, it can produce a large number
of false positives, so its precision may be low (how low depends on the distribution of the
classes in the dataset).
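As a concrete worked example (the numbers here are illustrative, not taken from the text), suppose a dataset contains 10 examples of class 1 and 90 examples of class 0. A model that always predicts class 1 yields TP = 10, FN = 0, and FP = 90, so its recall is 10 / (10 + 0) = 1.0, while its precision is only 10 / (10 + 90) = 0.1.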
Precision and recall are not particularly useful as standalone metrics, but are typically used
together to form an aggregate or averaged metric. Precision and recall are also dependent
on the threshold selected for the model.
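One widely used aggregate of precision and recall (the text does not name a specific one here; the F1 score is given purely as an illustration) is their harmonic mean:

\[
F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
\]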
Intuitively, if the decision threshold is set low enough, a model will predict class 1 for
every example. Hence, it will have a recall of 1.0, but most likely it will have low precision.
At a high enough threshold, the model will always predict class 0. The model will then have
a recall of 0, since it cannot achieve any true positives, and it will likely have many false
negatives. Furthermore, its precision score will be undefined, as it achieves zero true
positives and zero false positives.
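The following short Python sketch illustrates this behavior; the labels and scores are made up for the example, and the threshold values are chosen so that the two extremes described above are visible.

```python
# Illustrative sketch: sweep the decision threshold over made-up scores and
# labels, and compute precision and recall at each threshold.
labels = [1, 0, 1, 1, 0, 0, 1, 0]                    # true classes
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.3, 0.1]   # model scores for class 1

for threshold in [0.0, 0.5, 1.1]:
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    recall = tp / (tp + fn)
    # Precision is undefined (0 / 0) when the model predicts no positives at all.
    precision = tp / (tp + fp) if (tp + fp) > 0 else float("nan")
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

At the lowest threshold, every example is predicted as class 1, so recall is 1.0; at a threshold above every score, nothing is predicted as class 1, so recall is 0 and precision is undefined.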
The precision-recall (PR) curve shown in the following figure plots precision against recall
outcomes for a given model as the decision threshold of the classifier is changed. The
area under this PR curve is referred to as the average precision. Intuitively, an area under
the PR curve of 1.0 corresponds to a perfect classifier, one that achieves 100 percent in
both precision and recall.
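As a sketch of how the PR curve and average precision can be computed in practice (scikit-learn is not part of the text and is assumed here only for illustration):

```python
# Compute the PR curve and average precision with scikit-learn
# for the same made-up labels and scores used above.
from sklearn.metrics import precision_recall_curve, average_precision_score

labels = [1, 0, 1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.3, 0.1]

precision, recall, thresholds = precision_recall_curve(labels, scores)
average_precision = average_precision_score(labels, scores)
print("average precision:", average_precision)
# Plotting recall on the x-axis against precision on the y-axis
# (for example with matplotlib) produces the PR curve.
```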