1. Introduction

In this tutorial, we’ll explain Bayes’ theorem.

It forms the basis of many AI methods, such as Bayesian networks and Naïve Bayes, but it's also at the core of Bayesian statistics, a popular statistical school of thought.

2. Inverting Conditional Probabilities

Let’s suppose we have a patient with a set of symptoms S. Based on our expert knowledge, we suspect the patient may have the health condition C. So, if the conditional probability P(C \mid S) is reasonably high, we may start administering therapy for C. Therefore, the estimation of P(C \mid S) should be our first step.

To be as objective as possible, we search medical records in our archive but find no way to compute P(C \mid S) directly. The problem is that patients are sorted according to their diagnoses, not by the symptoms upon examination. So, we can compute P(S \mid C) easily by counting cases in the disease C folder, but P(C \mid S) would require going over the entire archive, which is infeasible.

Bayes’ theorem can help us compute P(C \mid S) from its converse P(S \mid C). In general, the theorem is useful when the conditional probability of interest is hard or impossible to estimate, but its converse is available.

3. Bayes’ Theorem

Here is the typical use case of Bayes’ theorem. We have some evidence (i.e., data) E, obtained as the result of experimentation or observation, and want to decide if hypothesis H explains E reasonably well. A sensible approach is to compute the probability P(H \mid E) and consider H true if P(H \mid E) is high enough.

By the definition of conditional probability, we have:

(1)   \begin{equation*} P(H \mid E) = \frac{P(H \land E)}{P(E)}\end{equation*}

The probability of an intersection P(H \land E) can be expressed as:

(2)   \begin{equation*}P(H \land E) = P(E \mid H) \times P(H)\end{equation*}

Putting the two equations together, we get Bayes’ theorem:

(3)   \begin{equation*} P(H \mid E) = \frac{P(E \mid H) \times P(H)}{P(E)}\end{equation*}
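As a quick sanity check, here's a minimal Python sketch of equation (3); the probability values are made up purely for illustration:

```python
def bayes(prior_h, lik_e_given_h, p_e):
    """Posterior P(H | E) = P(E | H) * P(H) / P(E)."""
    return lik_e_given_h * prior_h / p_e

# Hypothetical numbers: P(H) = 0.3, P(E | H) = 0.8, P(E) = 0.4
posterior = bayes(0.3, 0.8, 0.4)
print(posterior)  # ≈ 0.6
```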

3.1. Terminology and Notation

The probabilities in Bayes’ theorem have special names. P(H) is the prior probability or simply the prior. It represents our degree of belief that \boldsymbol{H} holds before we have seen evidence \boldsymbol{E}, hence the name.

Analogously, P(H \mid E) is the posterior probability (or just the posterior). It denotes our belief that H is true after seeing evidence E.

Finally, the converse conditional probability P(E \mid H) is called the likelihood of E given H.

Since P(E) = P(E \mid H) \times P(H) + P(E \mid \neg H) \times P(\neg H) by the law of total probability, we can drop P(E) and write the theorem informally:

    \[posterior \propto likelihood \times prior\]

Here, \propto means that the posterior is proportional to the product of the likelihood and the prior. Usually, we don’t need to estimate P(E) directly since we can reconstruct it from the likelihoods P(E \mid H) and P(E \mid \neg H) combined with our prior P(H).
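In code, the denominator P(E) can be recovered from the two likelihoods and the prior via the law of total probability. A minimal sketch (the numbers are illustrative):

```python
def posterior(prior_h, lik_e_given_h, lik_e_given_not_h):
    """P(H | E) with P(E) expanded by the law of total probability."""
    p_e = lik_e_given_h * prior_h + lik_e_given_not_h * (1 - prior_h)
    return lik_e_given_h * prior_h / p_e

# Hypothetical values: P(H) = 0.3, P(E | H) = 0.8, P(E | not H) = 0.2
p = posterior(0.3, 0.8, 0.2)
print(p)  # ≈ 0.632
```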

3.2. Sequential Updates

Bayes’ theorem allows sequential inference. What does that mean? Well, if we get pieces of evidence one by one, we can iteratively apply the theorem until we process the last one. The result will be the same as if we waited to collect all the evidence to use the theorem.

For example, let’s say that our data (evidence) comes in two parts: E_1 and then E_2. If we use the theorem after receiving both parts, we get the following:

    \[\begin{aligned} P(H \mid E_1 \land E_2) &=\frac{P(E_1 \land E_2 \mid H) \times P(H)}{P(E_1 \land E_2)} \\ &= \frac{P(E_2 \mid E_1 \land H) \times \boldsymbol{P(E_1 \mid H) \times P(H)}}{P(E_2 \mid E_1) \times \boldsymbol{P(E_1)}} \\ &= \frac{P(E_2 \mid E_1 \land H)}{P(E_2 \mid E_1) } P(H \mid E_1) \end{aligned}\]

In the second line, we use Bayes’ theorem to deal with the first piece of evidence and get P(H \mid E_1). The last line is equivalent to applying the theorem to E_2 conditioned on E_1 and the posterior we get with E_1.

So, we can update our belief state iteratively:

    \[\text{new belief} \propto \text{likelihood of the new evidence} \times \text{old belief}\]

In brief, one step’s posterior becomes the next step’s prior.

4. Distributional Perspective

So far, we’ve dealt only with binary events. There were only two outcomes for H: true and false. The real world can be much more complex, so priors and posteriors can be and often are continuous distributions.

Let’s look at an example: we may want to determine whether a coin is fair. By definition, it’s fair if P(head) = P(tail) = 1/2. So, we estimate the probability q = P(head) and check if it’s close to 1/2.

4.1. Prior

Before tossing the coin, we should choose the prior. Since q \in [0, 1], the prior needs to specify a continuous distribution. For example, we can use the Beta distribution with the density:

    \[f_{prior}(q) = \frac{q^{\alpha - 1}(1-q)^{\beta - 1}}{B(\alpha, \beta)}\]

where B(\alpha, \beta) is the Beta function, and \alpha and \beta are the parameters we set when specifying the prior.
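The Beta density can be evaluated with nothing but the standard library, since B(\alpha, \beta) = \Gamma(\alpha)\Gamma(\beta) / \Gamma(\alpha + \beta). A small sketch:

```python
from math import gamma

def beta_pdf(q, alpha, beta):
    """Density of the Beta(alpha, beta) distribution at q in (0, 1)."""
    b = gamma(alpha) * gamma(beta) / gamma(alpha + beta)  # Beta function B(alpha, beta)
    return q ** (alpha - 1) * (1 - q) ** (beta - 1) / b

# With alpha = beta = 1, the prior is uniform: density 1 everywhere
print(beta_pdf(0.3, 1, 1))  # 1.0
```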

4.2. Likelihood

The evidence is the number of heads. If we toss the coin n times and get k heads, the likelihood of the evidence is:

    \[likelihood(q) = \binom{n}{k}q^k(1-q)^{n-k}\]
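This is just the binomial probability mass function, which is straightforward in Python:

```python
from math import comb

def likelihood(q, n, k):
    """Probability of k heads in n tosses if P(head) = q."""
    return comb(n, k) * q ** k * (1 - q) ** (n - k)

# A fair coin (q = 0.5) giving 7 heads in 10 tosses
print(likelihood(0.5, 10, 7))  # ≈ 0.117
```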

The overall probability of the evidence comes from averaging the likelihood over the prior, i.e., over all the possible values of the head probability:

    \[P(E) = \int_{0}^{1}likelihood(u) \, f_{prior}(u) \, du = \frac{\binom{n}{k}}{B(\alpha, \beta)}\int_{0}^{1} u^{k + \alpha - 1}(1-u)^{n - k + \beta - 1} \, du = \binom{n}{k} \times \Delta\]

Here, \Delta = B(k + \alpha, n - k + \beta) / B(\alpha, \beta) is a constant: it doesn’t depend on q, so it won’t affect the shape of f_{posterior}(q).

4.3. Bayesian Update

Now, we can get our new posterior by combining the prior with the likelihood:

    \[\begin{aligned} f_{posterior}(q) &= \frac{likelihood(q) \times f_{prior}(q)}{P(E)} \\ &= \frac{\binom{n}{k}q^k(1-q)^{n-k} \times \frac{q^{\alpha - 1}(1-q)^{\beta - 1}}{B(\alpha, \beta)} }{\binom{n}{k} \times \Delta}\\ & \propto q^{k + \alpha - 1}(1-q)^{n - k+\beta - 1} \\ \end{aligned}\]

In other words, the posterior is a Beta distribution with parameters k + \alpha and n - k + \beta: the same family as the prior, just with updated parameters.
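So the whole update reduces to adding the observed counts to the prior's parameters. A minimal sketch (the prior and data are hypothetical):

```python
def beta_binomial_update(alpha, beta, n, k):
    """Posterior Beta parameters after observing k heads in n tosses."""
    return alpha + k, beta + (n - k)

# Prior Beta(2, 2); suppose we then see 7 heads in 10 tosses
a, b = beta_binomial_update(2, 2, n=10, k=7)
print(a, b)  # 9 5

# The posterior mean a / (a + b) drifts toward the observed frequency 0.7
print(a / (a + b))  # ≈ 0.643
```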

If we choose a suitable prior (a so-called conjugate prior), we’ll be able to derive the posterior analytically, as in this example. However, if no closed-form solution exists for our prior, we’ll need to use numerical methods to approximate the posterior.
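One such numerical method is grid approximation: discretize q, evaluate prior \times likelihood on the grid, and normalize. A minimal sketch; with a uniform prior, the exact posterior for k heads in n tosses is Beta(k + 1, n - k + 1), so we can check the grid estimate against the analytic posterior mean:

```python
from math import comb

def grid_posterior(prior_pdf, n, k, points=10_001):
    """Approximate posterior density of q on a uniform grid over (0, 1)."""
    qs = [(i + 0.5) / points for i in range(points)]
    unnorm = [prior_pdf(q) * comb(n, k) * q**k * (1 - q)**(n - k) for q in qs]
    z = sum(unnorm) / points  # Riemann-sum estimate of P(E)
    return qs, [u / z for u in unnorm]

# Uniform prior, 7 heads in 10 tosses: exact posterior is Beta(8, 4)
qs, post = grid_posterior(lambda q: 1.0, n=10, k=7)
mean = sum(q * p for q, p in zip(qs, post)) / len(qs)
print(mean)  # ≈ 0.667, matching the Beta(8, 4) mean 8/12
```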

5. The Prior Controversy and Other Criticisms

Bayes’ theorem gives us a rigorous mathematical tool for updating our beliefs. However, the approach drew much criticism. Doesn’t it open a door for our prejudices and subjective beliefs to override objective data and influence our judgment? Can we claim to have a sound inference procedure if a wrong choice of the prior can bias the posterior and lead to incorrect conclusions?

Bayesians argue that the choice of the prior isn’t and shouldn’t be arbitrary. If we use the prior “the probability that the hypothesis is true is 75%”, that needs to be grounded in theory or previous empirical evidence. For instance, if a medical condition C is rare and occurs only in 1% of the population, it makes sense to use P(C)=0.01 as the prior.

Additionally, the proponents of the Bayesian approach argue that, in the end, all decisions about accepting or rejecting hypotheses are subjective and based on internal belief states. As a result, the Bayesian methodology is best suited to support inference. Whether we share this view or not, Bayes’ theorem is a prerequisite for modern AI and statistics.

6. Example

Let’s say that we found that 75% of the patients with medical condition C had symptoms S, i.e., P(S \mid C)=3/4. In the literature, we find that the prevalence of C in the general population is 1%, whereas the symptoms S are estimated to hit a quarter of the population at any given time.

Then, the probability that a patient with symptoms S has the condition C is:

    \[\begin{aligned} P(C \mid S) &= \frac{P(S \mid C) \times P(C)}{P(S)} \\ &= \frac{\frac{3}{4} \times \frac{1}{100}}{\frac{1}{4}} \\ &= \frac{3}{100} \end{aligned}\]

So, the chance that a random person with symptoms S has C is only 3%. If we didn’t use Bayes’ theorem and looked only at P(S \mid C), we’d arrive at a completely wrong conclusion.
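The same computation in code, with the numbers from the example above:

```python
p_s_given_c = 0.75  # P(S | C): 75% of patients with C show symptoms S
p_c = 0.01          # P(C): prevalence of the condition
p_s = 0.25          # P(S): a quarter of the population has the symptoms

p_c_given_s = p_s_given_c * p_c / p_s
print(p_c_given_s)  # ≈ 0.03, i.e., 3%
```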

7. Conclusion

In this article, we covered Bayes’ theorem. It allows us to get the posterior probability based on the prior and data likelihood. However, if we don’t choose a suitable prior, our inference may be biased.
