1. Introduction

In this tutorial, we’ll explore the concepts of precision and average precision in machine learning (ML). Both are performance metrics for classification, and although their names are similar, the difference between them is fundamental.

To keep things simple, we’ll focus on binary classification, where we have only two classes: one positive and the other negative.

2. Classification Performance Metrics

When testing ML classifiers, we’re interested in several metrics, including but not limited to accuracy, the F_1 score, and AUROC. In binary classification, those metrics evaluate our classifier’s capacity to differentiate between two classes.

To compute the metrics, we use four basic scores:

\begin{tabular}{l|c|c}
 & Actually Positive & Actually Negative \\
\hline
Predicted Positive & TP & FP \\
Predicted Negative & FN & TN \\
\end{tabular}

In the table, TP is the number of true positives, FP denotes the number of false positives, TN is the number of true negatives, and FN stands for the number of false negatives.
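To make these scores concrete, here’s a minimal Python sketch (our own helper, not part of any library) that counts them from lists of true and predicted binary labels:

def basic_scores(y_true, y_pred):
    # Count TP, FP, TN, FN for binary labels (1 = positive, 0 = negative)
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn

# A toy example with four objects
print(basic_scores([1, 0, 1, 0], [1, 0, 0, 0]))  # (1, 0, 2, 1)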

3. Precision

Precision is the ratio of correctly predicted positives to all predicted positives. More specifically, precision tells us how many of the objects we classify as positive actually belong to the positive class.

For instance, let’s say that our classifier labeled 150 objects as positive and that TP is 120. Then, the classifier’s precision is:

    \[\frac{120}{150} = \frac{4}{5} = 80\%\]

From the example, we see that the formula for precision is:

(1)   \begin{equation*}  p = \frac{TP}{TP + FP} \end{equation*}

We can interpret it as the probability that the assigned positive label is correct. That explains the metric’s name: it quantifies how precise the (positive) predictions are.
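As an illustration, here’s a minimal Python sketch reproducing the calculation from our example (precision here is our own helper, not a library function):

def precision(tp, fp):
    # p = TP / (TP + FP); undefined if nothing was predicted as positive
    return tp / (tp + fp) if tp + fp > 0 else float("nan")

# 150 objects labeled positive, 120 of them correctly, so FP = 30
print(precision(tp=120, fp=30))  # 0.8, i.e. 80%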

4. Average Precision

Precision describes a specific ML classifier. In contrast, the average precision evaluates a family of classifiers. To explain the difference, let’s first formalize the notion of threshold-based binary classifiers.

4.1. Threshold-based Binary Classifiers

We can define many binary classifiers as indicator functions:

(2)   \begin{equation*}  f(x) = \begin{cases} 1, & t(x) \geq \tau \\ 0, & t(x) < \tau \end{cases} \end{equation*}

where t(x) is the score the classifier computes for the object x before classifying it, and \tau is the decision threshold. The score denotes the confidence with which the classifier can label x as positive: higher values of t(x) correspond to a higher confidence that x is positive, and vice versa.

For instance, in Support Vector Machines, t(x) is the signed distance to the separating hyperplane, and \tau=0. In Logistic Regression models, t(x) estimates the probability that x is positive, and \tau = 1/2.

So, the precision we calculate using Equation (1) is a function of \tau, just as TP and FP are:

(3)   \begin{equation*} p(\tau) = \frac{TP(\tau)}{TP(\tau) + FP(\tau)} \end{equation*}
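To make this concrete, here’s a small Python sketch of a threshold-based classifier and of the precision p(\tau); the scoring function and the toy data are assumptions we make purely for illustration:

def classify(score_fn, x, tau):
    # Label x as positive (1) if its score reaches the threshold tau
    return 1 if score_fn(x) >= tau else 0

def precision_at(score_fn, data, labels, tau):
    # p(tau) = TP(tau) / (TP(tau) + FP(tau)) on labeled data
    preds = [classify(score_fn, x, tau) for x in data]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    return tp / (tp + fp) if tp + fp > 0 else float("nan")

# Toy example: the score of an object is the object itself
def score_fn(x):
    return x

print(precision_at(score_fn, [0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1], tau=0.3))  # 2/3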

4.2. The Average Precision Metric

By varying \tau, we get a family of classifiers having the same form. If we calculate the precision at each \tau and then compute the mean, we get the average precision (AP):

(4)   \begin{equation*}  AP = \frac{1}{n} \sum_{i=1}^{n} p(\tau_i) \end{equation*}

where \tau_1, \tau_2, \ldots, \tau_n are the chosen thresholds. Defined this way, AP estimates the expected precision with respect to the distribution of the thresholds. Assuming they’re uniformly distributed between \tau_{\min} and \tau_{\max}, the expected precision is:

(5)   \begin{equation*}  \frac{1}{\tau_{\max}-\tau_{\min}}\int_{\tau_{\min}}^{\tau_{\max}}p(\tau)\,\mathrm{d}\tau \end{equation*}

So, to get a precise and unbiased estimate, we need to use enough thresholds and cover the range evenly.
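Under these assumptions, a minimal sketch of such an estimate averages p(\tau) over an evenly spaced grid of thresholds (precision_at is the hypothetical helper from the previous sketch):

def average_precision(score_fn, data, labels, tau_min, tau_max, n=100):
    # Mean of p(tau) over n evenly spaced thresholds in [tau_min, tau_max]
    step = (tau_max - tau_min) / (n - 1)
    taus = [tau_min + i * step for i in range(n)]
    precisions = [precision_at(score_fn, data, labels, tau) for tau in taus]
    # Skip thresholds at which nothing is predicted positive (p(tau) is NaN there)
    valid = [p for p in precisions if p == p]
    return sum(valid) / len(valid)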

5. Average Precision in the Precision-Recall Space

There are two problems with the average precision defined above. First, thresholds can have different, even unbounded, ranges (when \tau_{\min}=-\infty or \tau_{\max}=\infty, or both).

For that reason, a visual comparison of classifiers with different ranges may be difficult or impossible. For example, the t-scores, and consequently the thresholds \tau, are unbounded in Support Vector Machines, so we’d have to cover the whole range (-\infty, \infty) on the x-axis. In contrast, the scores belong to [0, 1] in Logistic Regression. The difference in scales means that we can’t compare the two precision curves easily:

Precision curves over different threshold ranges

When we analyze the plot, it isn’t clear which classifier family works better.

Second, this version of AP doesn’t consider the recall.

It turns out that we can address both issues in the precision-recall (PR) space. So, let’s quickly revise what the recall is and why it’s important.

5.1. Recall

In contrast to precision, the recall score tells us how many of the positive objects our classifier correctly identifies as positive. So, we define and compute it as the ratio of true positives to the total number of positive objects in our data:

(6)   \begin{equation*} r = \frac{TP}{TP + FN} \qquad \text{or, as a function of the threshold:} \qquad r(\tau) = \frac{TP(\tau)}{TP(\tau) + FN(\tau)} \end{equation*}

The difference between the two metrics is subtle but critical. Recall estimates a classifier’s ability to label all positive objects as such. Precision estimates the ability to identify only positive objects as positive.
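Continuing the earlier example, and assuming for illustration that the data contain 200 positive objects (so FN = 80), a minimal sketch of the recall computation looks like this:

def recall(tp, fn):
    # r = TP / (TP + FN): the share of positive objects we actually find
    return tp / (tp + fn) if tp + fn > 0 else float("nan")

print(recall(tp=120, fn=80))  # 0.6, i.e. 60%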

5.2. Weighted Average Precision

If we considered only precision, we could get a good score by classifying as positive only the objects with a high t-value. That would result in an overly conservative classifier that fails to identify many positive objects, so it wouldn’t be useful in practice due to its low recall.

Therefore, we want a metric that considers both scores. Since we’d like both precision and recall to be high, we can incorporate the latter into the formula for the average precision by multiplying each p(\tau_i) by the recall gain realized at that threshold.

So, if p_i = p(\tau_i) and r_i = r(\tau_i), with the thresholds ordered so that recall is non-decreasing and r_0 = 0, our adjusted average precision is:

(7)   \begin{equation*}  AP = \sum_{i=1}^{n} (r_i - r_{i-1}) p_i \end{equation*}
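Here’s a minimal sketch of this computation, assuming the precision-recall pairs are already sorted by increasing recall and taking r_0 = 0; this mirrors how scikit-learn’s average_precision_score weights precision by recall gains:

def weighted_average_precision(precisions, recalls):
    # AP = sum of (r_i - r_{i-1}) * p_i, with r_0 = 0
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap

# Three thresholds: recall grows while precision drops
print(weighted_average_precision([1.0, 0.75, 0.5], [0.2, 0.6, 1.0]))  # 0.7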

5.3. Geometry

The AP from Formula (7) has a nice visual interpretation in the PR space.

Let the x-axis denote recall, and let the y-axis represent precision. Then, the AP (7) estimates the area under the precision curve in the PR space:

Area under the precision curve in the PR space

 

Further, since recall and precision take values from \boldsymbol{[0, 1]}, we can visualize the precision curves of different classifiers on the same plot in a way that allows for direct comparison. For example, let’s take a look at the curves of the SVM and LR models from above, but now in the PR space:

Precision curves in the PR space

 

Since the x-axis ranges are the same, the difference is now easy to spot.
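If scikit-learn and matplotlib are available, here’s one way to draw such a comparison; the synthetic dataset and the model choices below are ours, just to keep the sketch self-contained:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
import matplotlib.pyplot as plt

# A synthetic binary-classification problem, purely for illustration
X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# SVM scores are signed distances; LR scores are positive-class probabilities
svm_scores = LinearSVC().fit(X_train, y_train).decision_function(X_test)
lr_probs = LogisticRegression().fit(X_train, y_train).predict_proba(X_test)[:, 1]

# Both curves live in the same [0, 1] x [0, 1] square, so we can compare them directly
for name, scores in [("SVM", svm_scores), ("Logistic Regression", lr_probs)]:
    precision, recall, _ = precision_recall_curve(y_test, scores)
    plt.plot(recall, precision, label=name)

plt.xlabel("Recall")
plt.ylabel("Precision")
plt.legend()
plt.show()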

6. A Word of Caution

Since precision and recall don’t consider true negatives, we can use the precision and average precision scores only when correctly classified negative objects can be ignored.

However, if true negatives add value, we shouldn’t overlook them. Instead, we should calculate the scores that take both classes into account.

7. Conclusion

In this article, we talked about precision and average precision. The former evaluates a specific classifier, whereas the latter describes the performance of a classifier family.
