# Probabilistic Neighbourhood Component Analysis:

Sample Efficient Uncertainty Estimation in Deep Learning

###### Abstract

While Deep Neural Networks (DNNs) achieve state-of-the-art accuracy in various applications, they often fall short in accurately estimating their predictive uncertainty and, in turn, fail to recognize when these predictions may be wrong. Several uncertainty-aware models, such as Bayesian Neural Networks (BNNs) and Deep Ensembles, have been proposed in the literature for quantifying predictive uncertainty. However, research in this area has been largely confined to the big-data regime. In this work, we show that the uncertainty estimation capability of state-of-the-art BNNs and Deep Ensemble models degrades significantly when the amount of training data is small. To address the issue of accurate uncertainty estimation in the small-data regime, we propose a probabilistic generalization of the popular sample-efficient non-parametric kNN approach. Our approach enables a deep kNN classifier to accurately quantify the underlying uncertainties in its predictions. We demonstrate the usefulness of the proposed approach by achieving superior uncertainty quantification compared to the state-of-the-art on a real-world application of COVID-19 diagnosis from chest X-Rays. Our code is available at https://github.com/ankurmallick/sample-efficient-uq.

## 1 Introduction

Deep Neural Networks (DNNs) have achieved remarkable success in a wide range of applications where a large amount of labeled training data is available [10, 9]. However, in many emerging applications of machine learning, such as the diagnosis and treatment of novel coronavirus disease (COVID-19) [6], large labeled training datasets may not be available. Furthermore, test data in these applications may deviate from the training data distribution, e.g., due to sample selection bias or nonstationarity, and in extreme cases can even be Out-of-Distribution [3]. Note that several of these applications are high-regret in nature, implying that incorrect decisions or predictions have significant costs. Therefore, such applications require not only high accuracy but also accurate quantification of predictive uncertainties. Accurate predictive uncertainty in these applications can help practitioners assess the true performance and risks and decide whether the model's predictions should (or should not) be trusted.

Unfortunately, DNNs often make overconfident predictions in the presence of distributional shifts and Out-of-Distribution data. As an example, Fig. 1 shows the predictions of different deep learning models trained to detect the presence of COVID-19 from chest X-ray images. All models achieve similar accuracy on in-distribution validation data. However, the quality of their uncertainty estimates varies widely, as explained next. While all models are forced to output some prediction on every input image, we would want a model to not be overly confident on input data that is very different from the data used to train it. However, we observe that state-of-the-art deep learning models make highly overconfident predictions on Out-of-Distribution data [17]. Interestingly, we found that even popular uncertainty-aware models (e.g., BNNs, deep ensembles) that are designed to address precisely this issue perform poorly in the small-data regime. This is especially problematic given the flurry of papers attempting to use DNNs to detect COVID-19 from chest X-Ray images [15, 5, 21], since real-world test data almost always differs from the training data.

While there have been separate efforts on improving the sample efficiency [14] and uncertainty estimation [7] of deep learning, to the best of our knowledge there has not been any effort to study these seemingly different issues in a unified manner. Therefore, this paper takes some initial steps towards (a) studying the effect of training data size on the quality of uncertainty estimates and (b) developing sample-efficient uncertainty-aware predictive models. Specifically, to overcome the challenge of providing accurate uncertainties without compromising accuracy in the small-data regime, we propose a probabilistic generalization of the popular non-parametric kNN approach, referred to as probabilistic neighbourhood component analysis (PNCA). By mapping data into distributions in a latent space before performing classification, we enable a deep kNN classifier to accurately quantify the underlying uncertainties in its predictions. Following [11, 18], for a meaningful and effective performance evaluation, we compare the quality of predictive uncertainty of different models under distributional shift and on Out-of-Distribution data. We empirically show that the proposed PNCA approach achieves significantly better uncertainty estimation than state-of-the-art approaches in the small-data regime.

## 2 Probabilistic Neighbourhood Component Analysis

In this section, we describe our model to achieve sample-efficient and uncertainty-aware classification. The details of the algorithm and proof of Proposition 1 are presented in Appendix B.

### 2.1 Neighbourhood Components Analysis (NCA)

Our approach is a generalization of NCA proposed in [8], wherein the authors learn a distance metric for kNN classification of points $x_1, \dots, x_n$ with corresponding class labels $y_1, \dots, y_n$. A data point $x_i$ is projected into a latent space to give an embedding $z_i = f_\theta(x_i)$. Here $f_\theta$ can be a linear transformation like a matrix or a non-linear transformation like a neural network with a $d$-dimensional output, and $\theta$ are the parameters of the transformation. The probability $p_{ij}$ of a point $x_i$ selecting another point $x_j$ as its neighbour is given by applying a softmax activation to the distance between points in the latent space

$$p_{ij} = \frac{\exp(-\|z_i - z_j\|^2)}{\sum_{k \neq i} \exp(-\|z_i - z_k\|^2)}, \qquad p_{ii} = 0 \quad (1)$$

The probability of $x_i$ selecting a point in the same class as itself is given by $p_i = \sum_{j : y_j = y_i} p_{ij}$, and the optimal model parameters are obtained by minimizing the loss

$$L(\theta) = -\sum_{i=1}^{n} \log p_i \quad (2)$$

which is the negative log-likelihood of the data under our model. The authors of [8] experiment with a variety of transformations and classification tasks and show that NCA achieves competitive accuracy.

### 2.2 Our Model

The lack of data may cause the NCA model to overfit when learning the weights by optimizing the loss in Eq. (2). We expect that the uncertainty due to the scarcity of training data can be better captured by *probability distributions* in the latent space than by individual data samples. Therefore, we propose a probabilistic generalization of the model, PNCA, which learns a distribution over the model parameters and thus deals with both the lack of training data and the task of accurate uncertainty estimation.

Latent Space Mapping using Probabilistic Neural Networks. Each data point $x_i$ passes through a probabilistic neural network $f_w$ with random parameters $w \sim q(w)$ to give a random variable $z_i = f_w(x_i)$. Due to the stochasticity of $w$, each data point corresponds to a different distribution $P_i$ over the latent space.

NCA over Latent Distributions. Observe that the individual terms in the softmax activation in Eq. (1) correspond to a *kernel* between latent embeddings, e.g., the squared exponential kernel $k(z_i, z_j) = \exp(-\|z_i - z_j\|^2)$. Since in our approach the embedding corresponding to a data point $x_i$ is the probability distribution $P_i$, we propose to use the following kernel between distributions

$$K(P_i, P_j) = \langle \mu_{P_i}, \mu_{P_j} \rangle_{\mathcal{H}} = \mathbb{E}_{z \sim P_i,\, z' \sim P_j}\big[k(z, z')\big] \quad (3)$$

where $\langle \mu_{P_i}, \mu_{P_j} \rangle_{\mathcal{H}}$ corresponds to the inner product between the mean embeddings of distributions $P_i$ and $P_j$ in the Reproducing Kernel Hilbert Space (RKHS) $\mathcal{H}$ defined by the kernel $k$ [16] and, thus, captures similarity between distributions in the same way as $k(z_i, z_j)$ captures the similarity between individual embeddings in NCA.
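In practice, the expectation in Eq. (3) can be estimated by Monte Carlo: draw samples from each latent distribution and average the base kernel over all sample pairs. A small sketch with our own (illustrative) function names:

```python
import numpy as np

def sq_exp_kernel(a, b):
    """Squared exponential kernel k(z, z') = exp(-||z - z'||^2), over the last axis."""
    return np.exp(-((a - b) ** 2).sum(-1))

def distribution_kernel(Zi, Zj):
    """Monte Carlo estimate of the kernel between two latent distributions,
    K(P_i, P_j) = E_{z ~ P_i, z' ~ P_j}[k(z, z')], from samples Zi, Zj (m x d)."""
    return float(sq_exp_kernel(Zi[:, None, :], Zj[None, :, :]).mean())
```

The estimate is symmetric in its arguments and equals one when both distributions are the same point mass, matching the behaviour of the exact kernel mean embedding inner product.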

### 2.3 Training Algorithm

The forward pass described above is used to compute a kernel $K(P_i, P_j)$ between data points $x_i$ and $x_j$, which can then be used to compute the probability of $x_i$ selecting $x_j$ in a fashion analogous to NCA as

$$p_{ij} = \frac{K(P_i, P_j)}{\sum_{k \neq i} K(P_i, P_k)}, \qquad p_{ii} = 0 \quad (4)$$

Since the latent embedding for a data point $x_i$ is given by $z_i = f_w(x_i)$, $w \sim q(w)$, we can rewrite Eq. (3) as

$$K(P_i, P_j) = \mathbb{E}_{w \sim q,\, w' \sim q}\big[k(f_w(x_i), f_{w'}(x_j))\big] \quad (5)$$

Thus, we can view $K(P_i, P_j)$ as a *functional* of $q$. The negative log-likelihood $L$ in Eq. (2) is then also a functional of $q$. The optimal distribution $q^*$ over the model parameters can be obtained by solving

$$q^* = \operatorname*{arg\,min}_{q \in \mathcal{Q}} L[q] \quad (6)$$

The choice of the family $\mathcal{Q}$ of candidate distributions is critical to the success of this approach.
Following [13], we choose $\mathcal{Q}$ to be the set of distributions obtained by smooth transformations $T(w) = w + \phi(w)$, $\phi \in \mathcal{H}_\kappa$, where $\mathcal{H}_\kappa$ is an RKHS given by a kernel $\kappa$ between model parameters (note that this is *different* from the RKHS into which distributions in the latent space are embedded, which is given by the kernel $k$). This choice of $\mathcal{Q}$ includes all smooth transformations of the initial distribution $q_0$, and the optimization problem in Eq. (6) now reduces to computing the optimal shift $\phi$.
Next, we provide an expression for the functional gradient of the negative log-likelihood under our model with respect to the shift $\phi$.

###### Proposition 1.

If we draw $m$ realizations of model parameters $w^{(1)}, \dots, w^{(m)} \sim q$, then

(7)

where the empirical loss is given by substituting the sample-based kernel estimates into Eq. (2).

To estimate the optimal shift $\phi$ (or, equivalently, the optimal distribution $q^*$), we draw an initial set of parameters $w^{(1)}, \dots, w^{(m)} \sim q_0$ and iteratively apply the functional gradient descent transformation as described in Algorithm 1 in Appendix B.
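The update of [13] transports a set of parameter particles by $w \leftarrow w + \epsilon\,\phi(w)$, where $\phi$ combines kernel-weighted gradients with a repulsive term that keeps the particles from collapsing. Below is a generic sketch of one such step; `grad_log_p` stands in for the gradient of the negative PNCA loss from Proposition 1, which we do not reproduce here, and all names are ours:

```python
import numpy as np

def svgd_step(W, grad_log_p, step=0.1):
    """One SVGD-style particle update (in the spirit of [13]) on m particles W (m x p).

    grad_log_p(W) -> (m x p): gradient of the log target at each particle;
    for PNCA this would come from Proposition 1. The RBF bandwidth uses the
    median heuristic. This is a sketch, not the paper's implementation.
    """
    m = W.shape[0]
    sq = ((W[:, None, :] - W[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    h = np.median(sq) / np.log(m + 1) + 1e-12             # median-heuristic bandwidth
    K = np.exp(-sq / h)                                   # kernel between particles
    G = grad_log_p(W)
    # phi_i = (1/m) * sum_j [ K_ji * G_j + (2/h) * (W_i - W_j) * K_ji ]
    repulsion = (2.0 / h) * (W * K.sum(axis=1, keepdims=True) - K @ W)
    phi = (K @ G + repulsion) / m
    return W + step * phi
```

With a single particle the repulsive term vanishes and the update reduces to plain gradient ascent on the log target; with several particles the repulsion maintains spread, which is what lets the particle set represent a distribution over network weights.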

## 3 Experiments

We consider two small-data classification tasks: (1) handwritten digit recognition and (2) COVID-19 detection from chest X-Ray images. For both tasks, we compare the proposed PNCA to four baselines: a Deep Neural Network (DNN), a Bayesian Neural Network (BNN) trained using the approach of [13], Deep Ensembles [11], and NCA [8]. For PNCA and NCA, we use a neural network to map the data to latent embeddings. For NCA and PNCA, the predicted class label for a test point is the class with the highest total probability of being selected as a neighbour. For the other models, the predicted class label is the one with the highest softmax probability (average softmax probability for BNNs and Ensembles). Following [18], we use the predicted probability of class $y$ for input $x$ as a measure of the model's confidence and show the accuracy and number of examples vs. confidence for Out-of-Distribution data to quantify the quality of uncertainties. Please refer to Appendix A for further details on the experiments and additional results.
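The confidence-based evaluation protocol of [18] can be written down directly: bucket test examples by their maximum predicted probability and report the accuracy and count of examples above each confidence threshold. A sketch (function name ours, not from the paper's code):

```python
import numpy as np

def accuracy_vs_confidence(probs, labels, thresholds):
    """For each threshold tau, report accuracy over examples whose maximum
    predicted probability is >= tau, and how many such examples there are.

    probs: (n x c) predicted class probabilities; labels: (n,) true classes.
    """
    conf = probs.max(axis=1)
    pred = probs.argmax(axis=1)
    accs, counts = [], []
    for tau in thresholds:
        keep = conf >= tau
        counts.append(int(keep.sum()))
        accs.append(float((pred[keep] == labels[keep]).mean()) if keep.any() else float("nan"))
    return accs, counts
```

A well-calibrated model should show accuracy rising with the threshold, and on Out-of-Distribution inputs the counts at high thresholds should be small.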

MNIST Classification. All neural networks in this experiment have the same architecture (2 hidden layers, 200 nodes per layer). Models are trained on a random subset of 100 labeled examples from the MNIST dataset [12] and results are averaged over 10 trials. Figures (a) and (b) show the performance comparison on the unseen rotated-MNIST dataset (rotated versions of the MNIST test images). PNCA outperforms all other models in terms of (a) accuracy vs. confidence (high-confidence examples should have high accuracy) and (b) the number of examples with high confidence (only a few examples should have high confidence). Moreover, Fig. (c) shows the performance comparison on Out-of-Distribution data, i.e., the not-MNIST dataset [2], which contains letters instead of handwritten digits. PNCA has significantly fewer examples with high confidence than the rest of the approaches on the not-MNIST dataset, illustrating its superior capability in quantifying uncertainty.

COVID-19 Detection. There has been increasing interest in using deep learning to detect COVID-19 from Chest X-Ray (CXR) images [19, 21]. Successful prediction from CXR data can effectively complement the standard RT-PCR test [22]. However, the lack of a large amount of training data and the distributional shift between train and test data are two major challenges in this task [15].

We consider two sources of COVID-19 data: [6], which has been used by most existing works to train their models for COVID-19 classification, and [4], which we use as our unseen test data since it comes from a different source than the images in [6]. We follow the transfer learning approach of [15], wherein a ResNet-50 model pre-trained on ImageNet is used as a feature extractor and the last layer of the model is re-trained on [6] using each of the aforementioned approaches (DNN, BNN, Ensemble, NCA, PNCA). We consider a binary classification problem, i.e., each model outputs the probability of the presence/absence of COVID-19 in a given CXR image.
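As a rough illustration of this transfer-learning setup (not the paper's code: the ResNet-50 feature extractor is omitted, and a plain logistic-regression head stands in for the re-trained last layer):

```python
import numpy as np

def train_last_layer(F, y, lr=0.5, epochs=500):
    """Train a logistic-regression head on frozen features F (n x d), y in {0, 1}.

    Stands in for re-training the last layer of a pre-trained network;
    the frozen backbone that produces F is not shown here.
    """
    n, d = F.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(F @ w + b)))   # predicted P(class 1 | x)
        g = p - y                                 # gradient of the cross-entropy loss
        w -= lr * (F.T @ g) / n                   # update only the head's weights
        b -= lr * g.mean()
    return w, b

def predict_proba(F, w, b):
    """Predicted probability of class 1 for each row of F."""
    return 1.0 / (1.0 + np.exp(-(F @ w + b)))
```

Freezing the backbone and fitting only the head keeps the number of trained parameters small, which is what makes this approach viable with so few labeled X-Rays.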

We use the version of [6] available on Kaggle (https://www.kaggle.com/bachrr/covid-chest-xray) as our training dataset, which contains COVID-19 and non-COVID X-Ray images. On the other hand, [4] is used as our test data and likewise contains COVID-19 and non-COVID X-Ray images. There is a distributional shift between train and test data, resulting in relatively low test accuracy for all models in Fig. (a).
We also look at the number of examples classified with high confidence for both the test data and completely Out-of-Distribution data (shoulder and hand X-Rays from [17]). As can be seen, on [4], which potentially has a different distribution, BNN has a slightly smaller number of examples classified with high confidence than the other models. Next, in Fig. (c), we can see that as the distributional shift increases, PNCA makes *significantly* fewer high-confidence predictions than *all* other models, corroborating its superior uncertainty quantification.

In summary, these experiments demonstrate that PNCA achieves much better uncertainty quantification than the baselines without losing accuracy in the small-data regime.

## 4 Conclusion and Broader Impact

This work serves as a caution to practitioners interested in applying deep learning for disease detection, especially during the current pandemic, since we find that the issues related to overconfident and inaccurate predictions of DNNs become even more severe in the small-data regime. While our approach appears to be less susceptible to making overconfident misclassifications and to have good uncertainty estimation performance, we acknowledge that there is still room for improvement, especially with respect to the accuracy of the model. With this in mind, we will explore approaches to improve the generalization capability of PNCA in future work. Further, sample-efficient uncertainty calibration approaches such as [24] and more reliable evaluation approaches for the small-data regime can be explored.

## Acknowledgement

This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344 (LLNL-CONF-811603).

## References

- [1] (2016) Tensorflow: a system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265–283. Cited by: Appendix A.
- [2] (2011) Not-mnist dataset. Note: http://yaroslavvb.blogspot.com/2011/09/notmnist-dataset.html Cited by: §3.
- [3] (2020) Anomalous instance detection in deep learning: a survey. arXiv preprint arXiv:2003.06979. Cited by: §1.
- [4] (2020) ActualMed covid-19 data. Note: https://github.com/agchung/Actualmed-COVID-chestxray-dataset Cited by: Appendix A, Figure 1, Figure 9, §3, §3.
- [5] (2020) Predicting covid-19 pneumonia severity on chest x-ray with deep learning. arXiv preprint arXiv:2005.11856. Cited by: §1.
- [6] (2020) COVID-19 image data collection. arXiv preprint arXiv:2003.11597. Cited by: Appendix A, Figure 1, §1, §3, §3.
- [7] (2016) Uncertainty in deep learning. University of Cambridge. Cited by: §1.
- [8] (2005) Neighbourhood components analysis. In Advances in neural information processing systems, pp. 513–520. Cited by: §2.1, §3.
- [9] (2013) Speech recognition with deep recurrent neural networks. In 2013 IEEE international conference on acoustics, speech and signal processing, pp. 6645–6649. Cited by: §1.
- [10] (2012) Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105. Cited by: §1.
- [11] (2017) Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, pp. 6402–6413. Cited by: §1, §3.
- [12] (2010) MNIST handwritten digit database. AT&T Labs [Online]. Available: http://yann.lecun.com/exdb/mnist 2, pp. 18. Cited by: Appendix A, §3.
- [13] (2016) Stein variational gradient descent: a general purpose bayesian inference algorithm. In Advances In Neural Information Processing Systems, pp. 2378–2386. Cited by: Appendix A, §2.3, §3.
- [14] (2019) Deep probabilistic kernels for sample-efficient learning. arXiv preprint arXiv:1910.05858. Cited by: §1.
- [15] (2020) Deep-covid: predicting covid-19 from chest x-ray images using deep transfer learning. arXiv preprint arXiv:2004.09363. Cited by: §1, §3, §3.
- [16] (2017) Kernel mean embedding of distributions: a review and beyond. Foundations and Trends® in Machine Learning 10 (1-2), pp. 1–141. Cited by: §2.2.
- [17] (2017) Mura: large dataset for abnormality detection in musculoskeletal radiographs. arXiv preprint arXiv:1712.06957. Cited by: Figure 1, §1, Figure 9, §3.
- [18] (2019) Can you trust your model’s uncertainty? evaluating predictive uncertainty under dataset shift. In Advances in Neural Information Processing Systems, pp. 13969–13980. Cited by: §1, §3.
- [19] (2020) Unveiling covid-19 from chest x-ray with deep learning: a hurdles race with small data. arXiv preprint arXiv:2004.05405. Cited by: §3.
- [20] (2016) Incorporating Nesterov momentum into Adam. In ICLR Workshop Track. Cited by: Appendix A.
- [21] (2020) COVID-net: a tailored deep convolutional neural network design for detection of covid-19 cases from chest radiography images. arXiv. Cited by: §1, §3.
- [22] (2020) Detection of sars-cov-2 in different types of clinical specimens. Jama 323 (18), pp. 1843–1844. Cited by: §3.
- [23] (2016) Orthogonal random features. In Advances in Neural Information Processing Systems, pp. 1975–1983. Cited by: Appendix A.
- [24] (2020) Mix-n-match: ensemble and compositional methods for uncertainty calibration in deep learning. arXiv preprint arXiv:2003.07329. Cited by: §4.

## Appendix A Additional Details on Experiments


All models are implemented in TensorFlow [1] on a Titan X GPU with 3072 CUDA cores. We use the Adam optimizer with Nesterov momentum [20] to train the models. For DNN, BNN, and Ensemble we use minibatches (with 1 epoch corresponding to one pass over the entire dataset), while for NCA and PNCA the entire dataset is used to compute gradients.

Following [13], we use the RBF kernel as the kernel $\kappa$ between model parameters in PNCA, with bandwidth chosen according to the median heuristic described in their work, since it causes the kernel weights at each particle to sum to approximately one, leading them to behave like a probability distribution. We also use Orthogonal Random Features [23] to approximate the kernel between probability distributions in the latent space in Eq. (3) for faster computation. The number of random features is chosen based on the dimensionality of the latent space, and we apply a ReLU activation to the approximate kernel to set any spurious negative values to zero (since the original squared exponential kernel can never be negative).
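A sketch of the random-feature approximation follows; we use plain random Fourier features as a stand-in for the orthogonal construction of [23], and the ReLU clipping mirrors the text. Names and the feature count are illustrative:

```python
import numpy as np

def rff_features(Z, n_features=512, seed=0):
    """Random Fourier features approximating k(z, z') = exp(-||z - z'||^2).

    For this kernel the frequencies are drawn from N(0, 2I). Plain RFF is
    used here in place of the orthogonal random features of [23].
    """
    d = Z.shape[1]
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=np.sqrt(2.0), size=(d, n_features))   # frequency matrix
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)         # random phases
    return np.sqrt(2.0 / n_features) * np.cos(Z @ W + b)

def approx_kernel(Z1, Z2, n_features=512, seed=0):
    """Approximate kernel matrix; clipped at zero since the exact kernel is nonnegative."""
    K = rff_features(Z1, n_features, seed) @ rff_features(Z2, n_features, seed).T
    return np.maximum(K, 0.0)
```

Once the features are computed, all pairwise kernel values reduce to a single matrix product, which is what makes the approximation faster than evaluating the exact kernel pairwise.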

Table 1 contains the accuracy of different models across experiments (MNIST test data [12], rotated MNIST test data, COVID-19 validation data [6], and COVID-19 test data [4]). For MNIST, accuracy values are averaged over 10 trials (where each trial corresponds to a different set of 100 training examples). For the COVID-19 accuracy values, the training data is split into 5 equal folds, and in each trial we use 4 folds to train the model and the 5th fold to calculate in-distribution (validation) accuracy. Since each training data point is part of the validation data only once, we do not report standard deviations for the validation accuracy. The accuracy on COVID-19 test data is averaged across all folds.

Method | MNIST Test | Rotated MNIST | COVID-19 Validation | COVID-19 Test
---|---|---|---|---
BNN | | | |
NCA | | | |
DNN | | | |
PNCA | | | |
Ensemble | | | |

## Appendix B Proof of Proposition 1

Observe that for any smooth one-to-one transform $T(w) = w + \phi(w)$, the kernel between the latent distributions corresponding to data points $x_i$ and $x_j$ under the transformed distribution can be written as

(8)

(9)

Since the above holds for infinitesimal shifts $\phi$, a tractable choice of $q_0$ (e.g., Gaussian) enables efficient approximation of the kernel by sample averages with samples $w^{(1)}, \dots, w^{(m)} \sim q_0$.

Moreover, the kernel in Eq. (9) is a *functional* of the transformation $T$ (for fixed $q_0$), i.e., a functional of the shift $\phi$ in our case. Therefore, the problem of finding $q^*$ in Eq. (6) reduces to the problem of finding the optimal shift (given $q_0$), i.e.

(10)

Since $\phi \in \mathcal{H}_\kappa$, which is the RKHS for the kernel $\kappa$, we can solve Eq. (10) via functional gradient descent.

Defining we have

(11)

Assuming that the distributions and shifts are functions in an RKHS given by the kernel $\kappa$ (note that the various kernels and RKHSs involved are all different), we have, from the definition of the functional gradient,

(12)

Thus we need to compute the difference which, from Eq. (11), is given by

(13)

We use a subscript to denote the expectation when $w \sim q_0$. The above equation can be rewritten as a sum of two terms, where

(14)

(15)

(16)

(17)

(18)

where the last line follows from the RKHS property. Similarly,

(19)

(20)

Since we transform the weights after every iteration, we only ever need to compute the gradient at $\phi = 0$. Thus, finally, we have the expression

(21)

If we draw $m$ samples of model parameters $w^{(1)}, \dots, w^{(m)}$, the empirical estimate obtained by replacing expectations with sample averages is given by

(22)

Without loss of generality, consider all the terms in the above expression that contain the gradient with respect to the first weight sample, and denote that part of the summation separately. Therefore

(23)

Recall the expression for the empirical estimate of the entries of the kernel matrix

(24)

Differentiating both sides with respect to the first weight sample,

(25)

Note that one term occurs in both summations. This is because the term in which both latent samples are generated by the differentiated weight appears in each sum.

Substituting Eq. (25) in Eq. (23),

(26)

We can apply the same argument to simplify the terms in Eq. (22) that contain gradients with respect to the other weight samples. Therefore,

(27)

(28)

From the chain rule for functional gradient descent we obtain the gradient and its corresponding empirical estimate. Switching the order of the summations gives

(29)