1. Introduction

In statistics, we often encounter the assumption that some random variables are independent and identically distributed (i.i.d.). But what does this mean? And why is it so common when analyzing problems involving probabilities?

In this tutorial, we’ll define i.i.d. variables. We’ll also discuss why it’s important to know what they are, and we’ll analyze examples where it’s assumed that the variables of interest are i.i.d.

2. Relevance of i.i.d. Variables

Let’s start with an example. Let’s say we’re running a poll to estimate which candidate the US population will vote for in the presidential election. We randomly pick a group of people from across the country and ask whom they’ll vote for.

The i.i.d. assumption means that each individual in the US population has the same probability of being chosen for our survey. Additionally, we randomly select people to ensure independence. This implies that an individual’s answers don’t impact the responses of others and are not influenced by them.

Violating the i.i.d. assumption leads to biased estimation and incorrect inferences from our sample.
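
To see the effect, here’s a minimal Python sketch with made-up population figures and group labels: a simple random sample recovers the true level of support, while a convenience sample dominated by one group doesn’t:

    import random

    random.seed(42)

    # Hypothetical population of 100,000 voters: 60% of group A supports the
    # candidate, but only 40% of group B does. Both groups have the same size,
    # so the true overall support is roughly 50%.
    population = [("A", random.random() < 0.6) for _ in range(50_000)] + \
                 [("B", random.random() < 0.4) for _ in range(50_000)]

    def support(sample):
        """Fraction of people in the sample who support the candidate."""
        return sum(votes for _, votes in sample) / len(sample)

    # Simple random sample: every individual has the same chance of being picked.
    random_sample = random.sample(population, 1_000)

    # Convenience sample: we mostly reach group A, so selection probabilities
    # are no longer identical across the population.
    group_a = [p for p in population if p[0] == "A"]
    group_b = [p for p in population if p[0] == "B"]
    biased_sample = random.sample(group_a, 900) + random.sample(group_b, 100)

    print(f"true support:            {support(population):.3f}")      # ~0.50
    print(f"random-sample estimate:  {support(random_sample):.3f}")   # ~0.50
    print(f"biased-sample estimate:  {support(biased_sample):.3f}")   # ~0.58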

3. Definitions

We’ll define random i.i.d. variables step by step. We’ll start with independence.

3.1. Independent Variables

Let X be a random variable modeling the outcome of throwing a fair die. It’s a discrete uniform random variable with the sample space \Omega = \{ 1, 2, 3, 4, 5, 6\}.

Now, let Charles (C) and Mary (M) throw the die. Charles throws it first and gets the number 4, so C=4. Then, it’s Mary’s turn to throw the die. But, before she does that, what can we say about the outcome M, knowing that C=4? Nothing. We say that two variables are independent of one another when the outcome of one doesn’t influence the probability of the outcome of the other, which is exactly the case for C and M.
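
We can check this with a short Python simulation (the variable names and the number of simulated rounds are our own choices). The distribution of Mary’s roll doesn’t change when we condition on Charles having rolled a 4:

    import random
    from collections import Counter

    random.seed(0)
    N = 100_000

    # Simulate N rounds in which Charles and Mary each throw a fair die once.
    charles = [random.randint(1, 6) for _ in range(N)]
    mary = [random.randint(1, 6) for _ in range(N)]

    # Unconditional distribution of Mary's roll.
    overall = Counter(mary)

    # Distribution of Mary's roll restricted to rounds where Charles rolled a 4.
    given_c4 = Counter(m for c, m in zip(charles, mary) if c == 4)

    for face in range(1, 7):
        p_uncond = overall[face] / N
        p_cond = given_c4[face] / sum(given_c4.values())
        print(f"P(M={face}) ≈ {p_uncond:.3f},  P(M={face} | C=4) ≈ {p_cond:.3f}")
    # Both columns hover around 1/6 ≈ 0.167: knowing C=4 tells us nothing about M.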

If we have more than two variables, we say they’re mutually independent if every variable is independent of every subset of the remaining variables.
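
Formally, for random variables X_1, \dots, X_n, mutual independence means that the joint probability factorizes over every subset of indices S \subseteq \{1, \dots, n\}:

    \[Pr\left(\bigcap_{i \in S} \{X_i = x_i\}\right) = \prod_{i \in S} Pr(X_i = x_i)\]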

3.2. Identically Distributed Variables

We say two variables are identically distributed if their probability distributions are the same.

But what is a probability distribution? For a discrete variable X, the probability mass function P gives the probability of X being equal to each value x:

    \[P(x) = Pr(X=x)\]

Let the discrete random variable C_3 represent the number of times Charles rolled the number three, and let M_3 model how many times Mary rolled a 3 over the same number of throws. We say that C_3 and M_3 are identically distributed because the equality P(C_3=x) = P(M_3=x) holds for each x. The variables share the same probability mass function because the probability of rolling a three is the same for both players on every throw (P(3) =\frac{1}{6}).
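
To see this empirically, we can simulate both players in Python; the number of throws per player and the number of simulated games below are illustrative choices:

    import random
    from collections import Counter

    random.seed(1)
    ROLLS_PER_PLAYER = 10   # illustrative: each player throws the die 10 times
    GAMES = 50_000

    def count_threes(rolls):
        """Number of threes obtained in the given number of fair-die rolls."""
        return sum(1 for _ in range(rolls) if random.randint(1, 6) == 3)

    # Empirical distributions of C_3 (Charles) and M_3 (Mary) over many games.
    c3 = Counter(count_threes(ROLLS_PER_PLAYER) for _ in range(GAMES))
    m3 = Counter(count_threes(ROLLS_PER_PLAYER) for _ in range(GAMES))

    # Both empirical PMFs approximate the same Binomial(10, 1/6) distribution.
    for x in range(6):
        print(f"x={x}:  P(C3=x) ≈ {c3[x] / GAMES:.3f},  P(M3=x) ≈ {m3[x] / GAMES:.3f}")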

If X is a continuous random variable, it has an uncountable set of possible values. Therefore, we can only assign probabilities to ranges of values instead of individual values. We define the probability density function f(x), and the probability of the event a < X \leq b becomes:

    \[P(a < X \leq b) = \int_{a}^{b} f(x) \,dx\]

So, continuous variables are identically distributed if they have the same density.
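
An informal way to check this in Python is to draw two samples and compare them, for instance with SciPy’s two-sample Kolmogorov-Smirnov test (the sample sizes and distributions below are arbitrary choices):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(7)

    # Two samples drawn from the same density (standard normal) ...
    same_a = rng.normal(loc=0.0, scale=1.0, size=5_000)
    same_b = rng.normal(loc=0.0, scale=1.0, size=5_000)

    # ... and one drawn from a different density (mean shifted by 0.5).
    shifted = rng.normal(loc=0.5, scale=1.0, size=5_000)

    # Large p-value: no evidence that the two densities differ.
    print(stats.ks_2samp(same_a, same_b).pvalue)

    # Tiny p-value: the samples clearly come from different densities.
    print(stats.ks_2samp(same_a, shifted).pvalue)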

4. Misleading Examples

4.1. Stock Prices

Let’s consider the stock prices of two companies in the technology field. At first glance, we might consider them i.i.d. since the companies have distinct strategies, products, and boards. But if we look closer, we’ll find factors suggesting the opposite.

Let’s say that the biggest company in the field declares bankruptcy. The stock prices of other companies in the same field will most likely be affected. The correlation between the prices comes from the companies participating in the same market and industry, so they aren’t independent as we first thought.
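
We can mimic this dependence with a toy Python simulation; the shared market factor and the companies’ exposures to it are made-up numbers:

    import numpy as np

    rng = np.random.default_rng(3)
    DAYS = 1_000

    # A shared "market" shock hits both companies on every trading day.
    market = rng.normal(0.0, 0.02, size=DAYS)

    # Each company's daily return = its exposure to the market + firm-specific noise.
    returns_a = 1.2 * market + rng.normal(0.0, 0.01, size=DAYS)
    returns_b = 0.8 * market + rng.normal(0.0, 0.01, size=DAYS)

    # The shared factor induces a strong positive correlation (around 0.8 here),
    # so the two return series are not independent.
    print(np.corrcoef(returns_a, returns_b)[0, 1])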

4.2. Temperature Readings

On the other hand, we might think that temperatures at different stations in a region aren’t i.i.d. Maybe the altitude, vegetation, or specific geographical features mean that knowing the temperature at one station informs us about the temperatures at the other stations.

But if we have strict criteria for selecting the stations’ locations, we’ll end up with i.i.d. variables. First, we have to choose the locations randomly. Then, we need to confirm that the distance between the stations is neither too large nor too small. Selecting locations too close to each other might capture microclimatic effects, so the readings would be correlated and biased. Similarly, we shouldn’t pick locations with significant geographical differences. For example, we shouldn’t position one station in a forest and another next to a factory.

If we follow these criteria, we mitigate the effect of local features and avoid bias.

5. When Is i.i.d. Mandatory?

Statistics has several methods designed strictly for i.i.d. variables that fail when the i.i.d. assumption is violated.

One of the simplest examples involves regression analysis. When fitting a linear regression model, we assume the errors (residuals) are i.i.d., which means we consider them mutually independent and identically distributed. If there’s any correlation between them, our parameter estimates and the inferences we draw from them will be unreliable.
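
As a rough illustration on simulated data (the autocorrelation coefficient of 0.8 is an arbitrary choice), we can fit a line by least squares in Python and inspect the lag-1 correlation of the residuals; values far from zero signal a violation of independence:

    import numpy as np

    rng = np.random.default_rng(5)
    n = 500
    x = np.linspace(0, 10, n)

    # Build autocorrelated errors: each error carries over 0.8 of the previous one.
    errors = np.zeros(n)
    for t in range(1, n):
        errors[t] = 0.8 * errors[t - 1] + rng.normal(0.0, 1.0)

    y = 2.0 + 3.0 * x + errors

    # Ordinary least-squares fit of y = a + b*x.
    b, a = np.polyfit(x, y, deg=1)
    residuals = y - (a + b * x)

    # Lag-1 autocorrelation of the residuals: close to 0.8 instead of the ~0
    # we'd expect if the errors were independent.
    print(np.corrcoef(residuals[:-1], residuals[1:])[0, 1])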

Many widely used tests fail or lead to incorrect results if we violate the i.i.d. assumption. To mention only a few: ANOVA, the t-test, and the chi-squared test.
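
For instance, here’s a small Python sketch of how artificially duplicating observations, a blatant violation of independence, shrinks the p-value of a two-sample t-test even though the underlying evidence hasn’t changed:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(11)

    # Two small samples with a modest difference in means.
    group_a = rng.normal(0.0, 1.0, size=20)
    group_b = rng.normal(0.4, 1.0, size=20)

    print(stats.ttest_ind(group_a, group_b).pvalue)

    # Duplicating every observation 10 times adds no new information, but the
    # test treats the copies as independent and reports a much smaller p-value.
    print(stats.ttest_ind(np.tile(group_a, 10), np.tile(group_b, 10)).pvalue)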

6. Conclusion

In this article, we defined i.i.d. random variables. The outcomes of such variables don’t affect one another’s probabilities, and the variables are drawn from identical probability distributions. The i.i.d. assumption is a prerequisite for several statistical tools.
