1. Introduction

In this article, we’ll give an overview of Robust Statistics, the area of statistics that provides methods tolerant to anomalous data or small deviations from the model. We’ll define the main measures of robustness, and we’ll illustrate the most common estimators of the central tendency and the statistical dispersion.

2. What Is Robust Statistics?

Robust statistics addresses the problem of finding estimators that are resilient to small departures from the assumed statistical model. The foundations of robust statistics were laid in the 1960s, with the fundamental works of John Tukey (1960), Peter Huber (1964), and Frank Hampel (1971).

Classical estimation methods rely on model assumptions that are often not met in the real world. For example, in data analysis, it is often assumed that errors follow a normal distribution, or that the Central Limit Theorem holds so that estimates can be treated as normally distributed. In practice, the statistical model describes the majority of the observations, but some observations follow different patterns or no pattern at all. Such anomalous data are called outliers.

Classical estimators perform poorly when even a few outliers are present in the data. The sample mean and the sample standard deviation, which are the classical estimators of the center and dispersion of the data, respectively, are highly sensitive to outliers. Given a set of N observed values x_1, x_2, \cdots, x_N, the sample mean \bar{x} and the sample standard deviation s are defined by:

    \[\bar{x} = \frac{1}{N} \sum_{i=1}^N x_i , \ \ \ s = \sqrt{\frac{1}{N-1} \sum_{i=1}^N \left(x_i - \bar{x}\right)^2}.\]

Let’s consider as an example the following set of six measurements containing a single outlier (marked in bold):

    \[3.10, \quad 3.01, \quad 3.08,  \quad 3.21,  \quad 3.11, \quad \mathbf{29.1}.\]

The values of the sample mean and the sample standard deviation for the above data are \bar{x} = 7.44 and s = 10.61, respectively. The resulting value of the sample mean is clearly very far from the majority of the observations and, as such, is not a good estimate of the center of the data. Similarly, the resulting value of the standard deviation is very large compared to the dispersion of the bulk of the observations. If we delete the outlier value 29.1, the sample mean and sample standard deviation become \bar{x} = 3.10 and s = 0.07. These values provide a good estimate of the center and dispersion of the data. Hence, a single outlier can completely upset the sample mean and the sample standard deviation.
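
We can verify these numbers with a short sketch; this is a minimal example assuming NumPy is available:

    import numpy as np

    # The six measurements from the example, with the outlier 29.1 last
    data = np.array([3.10, 3.01, 3.08, 3.21, 3.11, 29.1])

    print(np.mean(data))           # ~7.44
    print(np.std(data, ddof=1))    # ~10.61 (ddof=1 gives the sample standard deviation)

    clean = data[:-1]              # the same data with the outlier removed
    print(np.mean(clean))          # ~3.10
    print(np.std(clean, ddof=1))   # ~0.07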

3. Measures of Robustness

How can we evaluate the robustness of an estimator? Several measures of robustness have been proposed in the literature. The most relevant ones are the breakdown point, the sensitivity curve, and the influence function.

3.1. Breakdown Point

The breakdown point of an estimator is defined as the smallest fraction of observations in the dataset that, when replaced by arbitrarily large values (outliers), can cause the estimate to become arbitrarily large, hence the breakdown of the estimator.

Let’s consider the sample mean. Given a set of N observed values x_1, x_2, \cdots, x_N, if only one observation in the sample is replaced by an extremely large value, the sample mean will “explode”. Hence, the breakdown point of the sample mean is simply 1/N. In the limit N \to \infty, the breakdown point of the sample mean is 0, which is the worst possible case.

The higher the breakdown point of an estimator, the more robust it is.
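
As a quick illustration, the following sketch (again assuming NumPy) replaces a single observation with increasingly extreme values: the sample mean follows the outlier without bound, while the median, whose breakdown point is 0.5 as we’ll see later, barely moves:

    import numpy as np

    data = np.array([3.10, 3.01, 3.08, 3.21, 3.11, 3.15])

    for outlier in (10.0, 100.0, 1000.0):
        contaminated = data.copy()
        contaminated[-1] = outlier   # replace one observation with an outlier
        print(outlier, np.mean(contaminated), np.median(contaminated))

    # The mean grows with the outlier; the median stays near 3.1.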

3.2. Sensitivity Curve

The sensitivity curve measures the effect of a single outlier on the estimator. The idea is to compute the difference between the estimate for a given sample x_1, x_2, \cdots, x_N and the estimate when an observation x is added to the sample. The resulting difference is normalized by the fraction of contamination 1/(N + 1). Hence, for a given estimator \hat{\theta}, the sensitivity curve \operatorname{SC} is defined as:

    \[\operatorname{SC}\left(x ; x_{1}, x_{2}, \ldots, x_{N}, \hat{\theta}\right)=\frac{\hat{\theta}\left(x_{1}, x_{2}, \ldots, x_{N}, x\right)-\hat{\theta}\left(x_{1}, x_{2}, \ldots, x_{N}\right)}{1 /(N+1)}\]

An estimator is considered robust if its sensitivity curve is a bounded function.

For the sample mean we obtain:

    \[\operatorname{SC}\left(x ; x_{1}, x_{2}, \ldots, x_{N}, \bar{x}\right)= x - \bar{x}\left(x_{1}, x_{2}, \ldots, x_{N}\right) .\]

This is an unbounded function that increases linearly with the outlier value x, proving that the sample mean is not robust.

It is worth noting that the sensitivity curve depends strongly on the sample values x_i.
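
To make this concrete, here is a small sketch (assuming NumPy) that evaluates the sensitivity curve of the mean and of the median on the five “clean” measurements from the earlier example:

    import numpy as np

    def sensitivity_curve(estimator, sample, x):
        # SC(x) = (estimate with x added - estimate without x) / (1 / (N + 1))
        n = len(sample)
        return (n + 1) * (estimator(np.append(sample, x)) - estimator(sample))

    sample = np.array([3.10, 3.01, 3.08, 3.21, 3.11])

    for x in (3.0, 30.0, 300.0):
        print(x, sensitivity_curve(np.mean, sample, x),
              sensitivity_curve(np.median, sample, x))

    # The SC of the mean grows linearly in x, while the SC of the median is bounded.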

3.3. Influence Function

The influence function is the asymptotic version of the sensitivity curve. It does not depend on a finite set of observations; instead, it depends on the specific distribution for which the estimator is computed. The influence function measures how the estimate changes when contamination is added to a probability distribution F. The “contaminated” distribution \tilde{F} can be written as:

    \[\tilde{F} = (1 -\varepsilon) F + \varepsilon \delta_x\]

where \delta_x denotes the Dirac measure, which assigns probability 1 to the point x and 0 elsewhere. Hence, the influence function \operatorname{IF} is defined as:

    \[\operatorname{IF}\left(x ; F, \hat{\theta}\right)=\lim _{\varepsilon \rightarrow 0} \frac{\hat{\theta} (\tilde{F})- \hat{\theta}(F)}{\varepsilon}\]

It can be shown that the influence function of the mean E associated with a standard Gaussian distribution g is:

    \[\operatorname{IF} \left(x ; g, E \right)= x .\]

Hence, if the contamination value x is large, it will have a large influence on the mean estimator because the influence function is not bounded in x. This again proves that the mean is not robust.
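
To see where this result comes from, note that the mean of the contaminated distribution is linear in \varepsilon; since the standard Gaussian has zero mean, E(g) = 0, and we obtain:

    \[E(\tilde{F}) = (1 -\varepsilon) E(g) + \varepsilon x = \varepsilon x, \qquad \operatorname{IF} \left(x ; g, E \right)= \lim _{\varepsilon \rightarrow 0} \frac{\varepsilon x - 0}{\varepsilon} = x .\]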

4. Robust Estimators of the Central Tendency

In statistics, the central tendency represents the behavior of quantitative data to cluster around some central value. The classical measure of central tendency is the mean, but it is not robust. The most relevant robust estimators of the central tendency are the median and the trimmed mean.

4.1. Median

The median represents the “middle” value that occupies a central position in the list of the observations sorted from smallest to greatest.

Given a set of N values sorted in ascending order, x_{(1)} \leq x_{(2)} \leq \cdots \leq x_{(N)}, the median is defined as:

    \[\operatorname{median}(x_i)= \begin{cases}x_{((N+1) / 2)} & \text { if } N \text { is odd } \\ \frac{x_{(N / 2)}+x_{(N / 2+1)}}{2} & \text { if } N \text { is even. }\end{cases}\]

The breakdown point of the median is 0.5, meaning that the median can resist up to 50% of outliers without causing an “explosion” of the estimate.
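
Returning to the six measurements from the earlier example, a quick check (assuming NumPy) shows how little the median is affected by the outlier:

    import numpy as np

    data = np.array([3.10, 3.01, 3.08, 3.21, 3.11, 29.1])

    print(np.median(data))        # 3.105 -- close to the bulk of the data
    print(np.median(data[:-1]))   # 3.10  -- the median with the outlier removed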

4.2. Trimmed Mean

The trimmed mean is the simple arithmetic mean computed by ignoring the M smallest and the M largest observations:

    \[\hat{\mu}_{\alpha}=\frac{1}{N-2 M} \sum_{i=M+1}^{N-M} x_{(i)}\]

where M = \lfloor \alpha (N-1) \rfloor and 0 \leq \alpha < 0.5. The breakdown point of the trimmed mean is (M+1) / N.
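
A minimal implementation following the definition above might look like this:

    import numpy as np

    def trimmed_mean(x, alpha):
        # Mean after discarding the M smallest and M largest observations,
        # with M = floor(alpha * (N - 1)) and 0 <= alpha < 0.5
        x = np.sort(np.asarray(x))
        m = int(alpha * (len(x) - 1))
        return np.mean(x[m:len(x) - m])

    data = [3.10, 3.01, 3.08, 3.21, 3.11, 29.1]
    print(trimmed_mean(data, 0.2))   # discards 3.01 and 29.1, giving ~3.125

For production use, SciPy offers a ready-made version in scipy.stats.trim_mean, although its trimming convention differs slightly from the formula above.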

5. Robust Estimators of the Dispersion

The statistical dispersion represents the variability of the observations in a dataset. The standard deviation is the classical measure of the statistical dispersion, but it is not robust since it can be made arbitrarily large by a single outlier. The most common robust estimators of the dispersion are the median absolute deviation and the interquartile range.

5.1. Median Absolute Deviation

The Median Absolute Deviation (MAD) is the median of all absolute deviations from the median of the sample:

    \[\operatorname{MAD}\left(x_{i}\right)=\operatorname{median} \left(\left|x_{i}-\operatorname{median}\left(x_i\right)\right|\right).\]

The MAD of a set of Gaussian random variables does not match the standard deviation \sigma. To make the MAD a consistent estimator of the standard deviation for the normal distribution, we must multiply it by the correction factor 1.4826:

    \[\operatorname{MADN}\left(x_{i}\right)= 1.4826 \cdot \operatorname{MAD}\left(x_{i}\right).\]

This corrected estimator is called the Normalized Median Absolute Deviation (MADN).

The breakdown value of the MAD is 50%.
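
The MAD is straightforward to compute; here is a sketch on the example data (assuming NumPy; SciPy also provides a ready-made scipy.stats.median_abs_deviation):

    import numpy as np

    def mad(x):
        # Median of the absolute deviations from the sample median
        x = np.asarray(x)
        return np.median(np.abs(x - np.median(x)))

    data = [3.10, 3.01, 3.08, 3.21, 3.11, 29.1]

    print(mad(data))            # 0.06
    print(1.4826 * mad(data))   # MADN ~0.089, close to the s = 0.07 found
                                # earlier after removing the outlier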

5.2. Interquartile Range

The interquartile range (IQR) is defined as the difference between the 75th and 25th percentiles of the data:

    \[\operatorname{IQR} = P_{75} (x_i) - P_{25} (x_i) .\]

The IQR of a set of Gaussian random variables does not match the standard deviation \sigma. The normalized interquartile range (IQRN) is the corrected estimator of the standard deviation for the normal distribution. The IQRN is computed as:

    \[\operatorname{IQRN}\left(x_{i}\right)= 0.7413 \cdot \operatorname{IQR}\left(x_{i}\right) .\]

The breakdown value of the IQR is 25%.
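
A corresponding sketch for the IQR and IQRN on the same data (assuming NumPy; note that the exact value depends on the percentile interpolation rule used):

    import numpy as np

    data = [3.10, 3.01, 3.08, 3.21, 3.11, 29.1]

    iqr = np.percentile(data, 75) - np.percentile(data, 25)
    print(iqr)            # ~0.10 with NumPy's default linear interpolation
    print(0.7413 * iqr)   # IQRN ~0.074, again a reasonable estimate of sigma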

6. Conclusion

In this article, we reviewed the main concepts of Robust Statistics. We defined the main measures of robustness, and we illustrated the most common estimators of the central tendency and the statistical dispersion.
