1. Introduction

In this tutorial, we’ll explain specificity and sensitivity in machine learning.

We use these scores to estimate the performance of a classifier, such as a neural network or a decision tree.

2. Confusion Matrix

We can use a confusion matrix to calculate both metrics. For example, in binary classification problems, the matrix has two rows and two columns. Columns correspond to the ground truth, whereas rows denote the classifier’s predictions.

Each cell represents the number of test-set objects whose predicted (\widehat{y}) and actual (y) labels correspond to its row and column:

    \[\bordermatrix{ &y=1 & y=0 \cr \widehat{y}=1 & TruePositives & FalsePositives \cr \widehat{y}=0 & FalseNegatives & TrueNegatives \cr }\]

In binary classification, we usually call the classes positive and negative and denote them with 1 and 0.
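
For example, here's a minimal Python sketch of how we could fill in the four cells, assuming the ground-truth and predicted labels come as two equal-length lists of 0s and 1s (binary_confusion_matrix is just a name we picked):

    def binary_confusion_matrix(y_true, y_pred):
        # Rows are predictions, columns are the ground truth, as in the matrix above
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
        tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
        return [[tp, fp],
                [fn, tn]]

    print(binary_confusion_matrix([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))  # [[2, 1], [1, 1]]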

3. Sensitivity and Specificity

Sensitivity is the percentage of positive objects that our classifier labeled as such. It’s the same as recall.

So, the formula to calculate it is:

    \[Sensitivity = \frac{TruePositives}{TruePositives + FalseNegatives}\]

Specificity is the percentage of negative objects that got the correct label:

    \[Specificity = \frac{TrueNegatives}{TrueNegatives + FalsePositives}\]
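
In code, both scores are one-liners. Here's a small Python sketch (the function names are ours, not a library API):

    def sensitivity(tp, fn):
        # Share of positive objects that the classifier labeled as positive
        return tp / (tp + fn)

    def specificity(tn, fp):
        # Share of negative objects that the classifier labeled as negative
        return tn / (tn + fp)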

3.1. Example

Let’s say this is the confusion matrix of our binary classifier:

    \[\bordermatrix{ &y=1 & y=0 \cr \widehat{y}=1 & 139 & 270 \cr \widehat{y}=0 & 40 & 130 \cr }\]

Its sensitivity is:

    \[Sensitivity = \frac{139}{139 + 40} = \frac{139}{179} \approx 0.777\]

The specificity is:

    \[Specificity = \frac{130}{130+270} = \frac{130}{400} = 0.325\]
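
We can double-check both numbers with a few lines of Python:

    # The four counts from the confusion matrix above
    tp, fp, fn, tn = 139, 270, 40, 130
    print(round(tp / (tp + fn), 3))  # sensitivity: 139 / 179 ≈ 0.777
    print(round(tn / (tn + fp), 3))  # specificity: 130 / 400 = 0.325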

3.2. Confidence Intervals

If our test set is small, it’s a good idea to construct confidence or credible intervals for the sensitivity and specificity estimates.

The reason is that our estimates won’t be precise in that case. The actual scores are those we would get if we calculated sensitivity and specificity over all objects from the problem space. Since that’s impossible to do, we can use intervals to capture those values with a high probability.

Since both metrics are proportions, we can use any method for constructing interval estimates of proportions. For example, a normal-approximation confidence interval for an estimated proportion p is:

    \[p \pm z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}}\]

where 1-\alpha is our target confidence (usually 95% or 99%), z_{\alpha/2} is the corresponding quantile of the standard normal distribution, and p is the estimated proportion (sensitivity or specificity calculated from the confusion matrix). If p is the sensitivity, n is the number of positive objects in the test set; if p is the specificity, n is the number of negative ones.
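
As an illustration, here's one way to compute such an interval in Python; the helper name proportion_ci is ours, and we get z_{\alpha/2} from the standard library's NormalDist:

    from math import sqrt
    from statistics import NormalDist

    def proportion_ci(p, n, confidence=0.99):
        # Normal-approximation confidence interval for a proportion p estimated from n objects
        z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # z_{alpha/2}, about 2.576 for 99%
        margin = z * sqrt(p * (1 - p) / n)
        return max(0.0, p - margin), min(1.0, p + margin)   # clip to [0, 1] since p is a proportion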

So, the 99% confidence interval for the sensitivity from our previous example is:

    \[[0.696, 0.857]\]

Although the original estimate is 0.777, the confidence interval reveals that any value between 0.696 and 0.857 is plausible as the classifier’s actual sensitivity.

The 99% confidence interval for the specificity is:

    \[[0.265, 0.385]\]
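
A short standalone snippet reproduces both intervals (small differences in the last digit are possible due to rounding):

    from math import sqrt

    z = 2.576                            # z_{alpha/2} for a 99% confidence level
    for p, n in [(139 / 179, 179),       # sensitivity, n = number of positive objects
                 (130 / 400, 400)]:      # specificity, n = number of negative objects
        margin = z * sqrt(p * (1 - p) / n)
        print(round(p - margin, 3), round(p + margin, 3))
    # prints approximately: 0.696 0.857 and 0.265 0.385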

This confidence interval is narrower than that of sensitivity. The reason is that there are more negative objects. The confidence intervals will be narrow if we have a lot of data. In that case, we can use the estimates from the confusion matrix to assess the classifier’s performance or compare two classifiers.

3.3. Comparing Two Classifiers

If the test dataset isn’t large, a statistically sound way of comparing the scores of two classifiers is to check if the corresponding confidence intervals intersect.

If they don’t, we have more grounds to say that their actual sensitivity (or specificity) scores differ.

For example, let’s say that we trained the same model again and got the one whose confusion matrix is:

    \[\bordermatrix{ &y=1 & y=0 \cr \widehat{y}=1 & 150 & 250 \cr \widehat{y}=0 & 29 & 150 \cr }\]

Its sensitivity and specificity scores are 0.838 and 0.375, and the confidence intervals are [0.767, 0.909] and [0.313, 0.437]. Both estimates are higher than the scores of the previous classifier (0.777 and 0.325), but their confidence intervals intersect:

    \[\begin{aligned} [0.696, 0.857] \cap [0.767, 0.909] &= [0.767, 0.857] & \neq \emptyset \\ [0.265, 0.385] \cap [0.313, 0.437] & = [0.313, 0.385] & \neq \emptyset \end{aligned}\]

So, even though their test-set scores differ, the overlapping intervals show that we can't conclude these two models will perform differently on unseen data.
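
If we run such comparisons often, we can script the check. Here's a minimal sketch in which intervals are plain (low, high) tuples and the helper name is ours:

    def intervals_overlap(a, b):
        # Two closed intervals intersect if the larger lower bound doesn't exceed the smaller upper bound
        return max(a[0], b[0]) <= min(a[1], b[1])

    print(intervals_overlap((0.696, 0.857), (0.767, 0.909)))  # True: we can't claim the sensitivities differ
    print(intervals_overlap((0.265, 0.385), (0.313, 0.437)))  # True: we can't claim the specificities differ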

3.4. How to Choose a Classifier?

Let’s say we have two classifiers: a support vector machine (SVM) and a logistic regression (LR) model. SVM has a specificity of 0.8 and a sensitivity of 0.7, whereas LR has a specificity of 0.7 and a sensitivity of 0.8.

Assuming that the confidence intervals don’t overlap, we have a situation where each classifier beats the other on one of the two metrics. Which one should we use?

The choice depends on our goals. If identifying positive objects is more important, we should go with the model with higher sensitivity. In this case, that would be LR. Conversely, if we care more about correctly classifying negative objects, we should use SVM since its specificity is higher.

4. Multiclass Classification

We can generalize specificity and sensitivity to multiclass classification problems. In this scenario, we’ll have specificity and sensitivity scores for each class.

If there are m classes, the confusion matrix will have m columns and m rows:

    \[\bordermatrix{ & y=1 & y=2 & \ldots & y=m \cr \widehat{y} = 1 & C_{11} & C_{12} & \ldots & C_{1m} \cr \widehat{y} = 2 & C_{21} & C_{22} & \ldots & C_{2m} \cr \vdots & \vdots & \vdots & \ddots & \vdots \cr \widehat{y} = m & C_{m1} & C_{m2} & \ldots & C_{mm} \cr }\]

The sensitivity for class \boldsymbol{j \in \{1,2,\ldots,m\}} is the percentage of objects whose actual label is \boldsymbol{j} that were classified as such:

    \[Sensitivity(j) = \frac{C_{jj}}{\sum_{i=1}^{m} C_{ij}}\]

In the denominator, we sum all the elements of the jth column to get the total number of objects whose actual class is j.

The specificity for class \boldsymbol{j} is the percentage of non-\boldsymbol{j} objects that got a label other than \boldsymbol{j}:

    \[Specificity(j) = \frac{  \sum_{\substack{i=1 \\ i \neq j}}^{m} \sum_{\substack{k=1 \\ k \neq j}}^{m} C_{ik}  }{ \sum_{i=1}^{m} \sum_{\substack{k=1 \\ k \neq j}}^{m} C_{ik} }\]

The numerator sums the entire matrix except for the cells in the jth column and jth row: this is the number of non-j objects that got a non-j label. The denominator sums all the elements except those in the jth column to get the number of non-j objects.
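
Here's a Python sketch of both per-class scores for a confusion matrix stored as a nested list, with rows as predictions and columns as the ground truth (the names class_sensitivity and class_specificity are ours):

    def class_sensitivity(C, j):
        # C[j][j] divided by the sum of the jth column (all objects whose actual class is j)
        return C[j][j] / sum(C[i][j] for i in range(len(C)))

    def class_specificity(C, j):
        m = len(C)
        non_j = sum(C[i][k] for i in range(m) for k in range(m) if k != j)               # all non-j objects
        correct = sum(C[i][k] for i in range(m) for k in range(m) if i != j and k != j)  # those not labeled j
        return correct / non_j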

4.1. Example

We have three classes of days: Hot, Cold, and Mild, and we have trained a random forest to predict what the day will be like based on some early-morning data.

To estimate the forest’s performance, we tested it on a held-out year of data and got the following confusion matrix:

    \[\bordermatrix{ & y=Hot & y=Cold & y=Mild \cr \widehat{y} = Hot & 13 & 10 &  30 \cr \widehat{y} = Cold & 22 &  70 &  20 \cr \widehat{y} = Mild & 30 &  20 & 150 \cr }\]

The sensitivity for Hot days is:

    \[Sensitivity(Hot) = \frac{13}{13 + 22 + 30} = \frac{13}{65} = 0.2\]

The corresponding specificity is:

    \[Specificity(Hot) = \frac{70+20+20+150}{10+30+70+20+20+150} = \frac{260}{300} \approx 0.867\]

So, this model is very bad at detecting hot days but pretty good at guessing which days won’t be hot.
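
As a sanity check, this standalone snippet reproduces both Hot-day scores (it assumes the same row and column order Hot, Cold, Mild as in the matrix above):

    # Rows are predictions, columns are the ground truth; class index 0 corresponds to Hot
    C = [[13, 10, 30],
         [22, 70, 20],
         [30, 20, 150]]
    hot, m = 0, len(C)

    sensitivity_hot = C[hot][hot] / sum(C[i][hot] for i in range(m))            # 13 / 65 = 0.2

    non_hot = [(i, k) for i in range(m) for k in range(m) if k != hot]          # every cell outside the Hot column
    specificity_hot = (sum(C[i][k] for i, k in non_hot if i != hot)             # 260 non-Hot days labeled non-Hot
                       / sum(C[i][k] for i, k in non_hot))                      # 300 non-Hot days in total

    print(round(sensitivity_hot, 3), round(specificity_hot, 3))                 # 0.2 0.867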

5. Conclusion

In this article, we talked about sensitivity and specificity and how to calculate them from the confusion matrix. If our test set is small, it's a good idea to compute interval estimates of these two metrics.
