1. Introduction

In this tutorial, we’ll describe information gain. We’ll explain it in terms of entropy, a concept from information theory that has found application in many scientific and engineering fields, including machine learning. Then, we’ll show how to use it to fit a decision tree.

2. Information

Intuitively, a piece of information is anything that increases our knowledge of a system, process, or phenomenon. But what is knowledge?

Putting aside philosophical details, we could say that knowledge is the mirror image of uncertainty. The more sure we are about something, the greater our knowledge is. The converse is also true, and it’s the crux of information theory: the more uncertain we are, the less we know. So, information is anything that decreases our uncertainty.

The next question is how to quantify uncertainty and information. Let’s work through an example.

2.1. Example: Uncertainty and Information

Let’s consider an opaque box containing 1 red and 2 white balls. Anyone drawing a ball randomly from it has a 1/3 chance of getting the red ball and a 2/3 chance of drawing a white one. Although we can’t be certain about the color, we can describe the outcomes and how likely they are with the following random variable:

(1)   \begin{equation*}  \begin{pmatrix} \text{red} & \text{white} \\ \frac{1}{3} & \frac{2}{3} \end{pmatrix} \end{equation*}

Now, if we find out that someone drew the red ball, there will be no uncertainty about the color we can get:

(2)   \begin{equation*}  \begin{pmatrix} \text{red} & \text{white} \\ 0 & 1 \end{pmatrix} \end{equation*}

The quantity of information we received by learning that someone drew the red ball is the difference between the probabilistic models (2) and (1). More specifically, it’s the difference in uncertainty, which we measure with entropy.

2.2. Entropy

In information theory, entropy is a function H of the probabilities p_1, p_2, \ldots, p_n. It satisfies three conditions we’d expect a measure of uncertainty to fulfill:

H is continuous in p_1, p_2, \ldots, p_n. This condition ensures that the uncertainty doesn’t fluctuate a lot if any \boldsymbol{p_i} undergoes a small change.

If the probabilities are all equal (\boldsymbol{p_1 = p_2 = \ldots = p_n = \frac{1}{n}}), the uncertainty should increase as \boldsymbol{n} grows. Mathematically, H \left( \frac{1}{n}, \frac{1}{n}, \ldots, \frac{1}{n} \right) should be a monotonically increasing function of n. This is intuitive: the more equally likely outcomes there are, the less certain we are about which one will occur.

Finally, any way of decomposing \boldsymbol{H(p_1, p_2, \ldots, p_n)} as a (weighted) sum of the uncertainties over the subgroups of \boldsymbol{p_1, p_2, \ldots, p_n} should give the same result.
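For instance, a choice among three outcomes with probabilities 1/2, 1/3, and 1/6 can be broken into a choice between two equally likely groups followed, half of the time, by a choice within the second group, and both views must yield the same uncertainty:

    \[H \left( \frac{1}{2}, \frac{1}{3}, \frac{1}{6} \right) = H \left( \frac{1}{2}, \frac{1}{2} \right) + \frac{1}{2} H \left( \frac{2}{3}, \frac{1}{3} \right)\]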

Shannon showed that the only function satisfying these conditions was:

(3)   \begin{equation*} H ( p_1, p_2, \ldots, p_n) = - K\sum_{i=1}^{n} p_i \log(p_i) \end{equation*}

Here, K is a positive constant, and all the logarithms have the same base. Usually, we take binary logarithms and set K=1, so:

(4)   \begin{equation*} H (p_1, p_2, \ldots, p_n) = - \sum_{i=1}^{n} p_i \log_2 p_i \end{equation*}
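As a quick illustration, here’s a minimal Python sketch of formula (4); the function name entropy is our own choice, not part of any library:

    import math

    def entropy(probabilities):
        """Shannon entropy in bits, as in formula (4); terms with p = 0 contribute nothing."""
        return -sum(p * math.log2(p) for p in probabilities if p > 0)

    print(entropy([1/3, 2/3]))  # the box with 1 red and 2 white balls: ~0.918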

2.3. The Amount of Information Is the Difference in Entropy

Now that we know how to compute the entropy, we can measure the amount of information. We define it as the change in uncertainty.

Formally, let’s say that S is the random variable modeling our state of belief before learning the information z. Let’s denote with S \mid z our updated state of belief (after z) and extend the notation so that H can act on S and S \mid z in addition to the probabilities.

The amount of information AI(z) of z is then:

(5)   \begin{equation*}  AI(z) = H(S) - H(S \mid z) \end{equation*}
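Continuing the sketch from above, the amount of information is just a difference of two entropies; the helper name amount_of_information is ours:

    import math

    def entropy(probabilities):
        """Shannon entropy in bits, as in formula (4)."""
        return -sum(p * math.log2(p) for p in probabilities if p > 0)

    def amount_of_information(probs_before, probs_after):
        """AI(z) = H(S) - H(S | z), as in formula (5)."""
        return entropy(probs_before) - entropy(probs_after)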

2.4. Example: Calculating the Amount of Information

Let’s calculate the amount of information we gain by finding out that someone has drawn the red ball from a box that initially held one red and two white balls.

The entropy before the information is:

    \[\begin{aligned} H \left( \frac{1}{3}, \frac{2}{3} \right) &= - \frac{1}{3} \log_2 \frac{1}{3} - \frac{2}{3} \log_2 \frac{2}{3} \\ & \approx - \frac{1}{3} \cdot (-1.585) - \frac{2}{3} \cdot (-0.585) \approx 0.918 \end{aligned}\]

The entropy after is:

    \[H \left(0, 1 \right) = - \underbrace{0 \log_2 0}_{=0} - 1 \log_2 1 = 0 + 0 = 0\]

So, the amount of information is 0.918-0=0.918. In this case, we received the maximum amount since the information eliminated our uncertainty.

What if someone told us they drew a white ball instead of the red one? The probabilities would shift to:

(6)   \begin{equation*} \begin{pmatrix} \text{red} & \text{white} \\ \frac{1}{2} & \frac{1}{2} \end{pmatrix} \end{equation*}

So, our new entropy will be:

    \[H \left( \frac{1}{2}, \frac{1}{2} \right) = - \frac{1}{2} \log_2 \frac{1}{2} - \frac{1}{2} \log_2 \frac{1}{2} = -\frac{1}{2} \cdot (-1) - \frac{1}{2} \cdot (-1) = \frac{1}{2} + \frac{1}{2} = 1\]

The amount is now 0.918-1=-0.082, a negative number. But wait, didn’t we say that information should decrease uncertainty? Ideally, that would be the case. We’d receive only the pieces of information that reduce our uncertainty. However, not all information is like that. Some can make us more uncertain, and the amount of such information will be negative.

That’s why we defined the amount as the difference in entropy before and after the information, not the other way around. That way, positive numbers get to represent an increase in knowledge (i.e., a decrease in uncertainty), which is more intuitive.

From there, we see that information is anything that affects our probabilistic model about some process or phenomenon. Its contribution is positive if it decreases uncertainty, and negative otherwise.
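As a quick numeric check, a short sketch along the same lines reproduces the numbers from this example:

    import math

    def entropy(probabilities):
        return -sum(p * math.log2(p) for p in probabilities if p > 0)

    before = [1/3, 2/3]        # 1 red and 2 white balls in the box
    after_red = [0, 1]         # the red ball is gone: only white remains
    after_white = [1/2, 1/2]   # a white ball is gone: 1 red and 1 white remain

    print(entropy(before) - entropy(after_red))    # ~0.918: uncertainty eliminated
    print(entropy(before) - entropy(after_white))  # ~-0.082: uncertainty increased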

3. The Information Gain

Now that we know how to calculate the entropy and the amount of information, we can define the information gain. We’ll do it in the context of decision trees.

3.1. Decision Trees

At each internal node in a decision tree, we inspect a feature of an object. Depending on its value, we visit one of the node’s sub-trees.

We repeat the step until we hit a leaf. The leaf nodes don’t check any feature, and each contains a subset of the training data. Once we get to a leaf, we assign the object to the majority class in the leaf’s data subset.
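To make the procedure concrete, here’s a minimal Python sketch of such a traversal; the Node class and its field names are purely illustrative and don’t correspond to any particular library:

    from collections import Counter

    class Node:
        def __init__(self, feature=None, children=None, labels=None):
            self.feature = feature    # the feature an internal node checks
            self.children = children  # maps each feature value to a sub-tree
            self.labels = labels      # class labels of the training subset in a leaf

        def classify(self, obj):
            if self.children is None:  # a leaf: return the majority class
                return Counter(self.labels).most_common(1)[0][0]
            return self.children[obj[self.feature]].classify(obj)  # descend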

3.2. The Information Gain

To build a decision tree, we need to decide which feature to check at which node. For instance, let’s suppose that we have two unused features: a and b, both binary. We also have five objects, three of which are positive:

    \[S : \qquad \begin{bmatrix} a & b & class \\ \hline 0 & 0 & positive \\ 0 & 1 & positive \\ 1 & 0 & negative \\ 1 & 1 & positive \\ 0 & 0 & negative \end{bmatrix}\]

Which feature should we test to add a new node? The information gain can help us decide. It’s the expected amount of information we get by inspecting the feature. Intuitively, the feature with the largest expected amount is the best choice. That’s because it will reduce our uncertainty the most on average.

3.3. First, We Calculate the Entropies

In our example, the entropy before adding a new node is:

    \[H(S) = - \frac{2}{5} \log_2 \frac{2}{5} - \frac{3}{5} \log_2 \frac{3}{5} = 0.971\]

That’s the measure of our uncertainty in a random object’s class (assuming the previous checks got it to this point in the tree). If we choose a as the new node’s test feature, we’ll get two sub-trees covering two subsets of S:

    \[S \mid a = 0: \quad \begin{bmatrix} 2 \text{ positive objects} \\ 1 \text{ negative object} \end{bmatrix} \qquad \qquad S \mid a = 1: \quad \begin{bmatrix} 1 \text{ positive object} \\ 1 \text{ negative object} \end{bmatrix}\]

Their entropies are:

    \[\begin{aligned} H(S \mid a = 0) &= H \left( \frac{2}{3}, \frac{1}{3} \right) &&= 0.918 \\ H(S \mid a = 1) & = H \left( \frac{1}{2}, \frac{1}{2}\right) &&= 1 \end{aligned}\]

3.4. Then, We Compute the Gain

The information gain IG(a) is the expected amount of information we get by checking feature a:

    \[\begin{aligned} IG(a) &= P(a=0)\cdot AI(a=0) + P(a=1) \cdot AI(a=1) \\ &= P(a=0) \left( H(S) - H(S \mid a = 0) \right) + P(a=1) \cdot \left( H(S) - H(S \mid a = 1) \right) \\ &= \left( P(a = 0) + P(a = 1) \right) H(S) -  P(a=0) H(S \mid a = 0) - P(a=1) H(S \mid a = 1) \\ &= H(S) -  P(a=0) H(S \mid a = 0) - P(a=1) H(S \mid a = 1) \\ &= 0.971 - 0.6 \cdot 0.918 - 0.4 \cdot 1 \\ &= 0.0202 \end{aligned}\]

We define P(a=0) and P(a=1) to be the relative frequencies of a=0 and a=1 in S. Since three of the five objects have a=0, we get P(a=0) = 3/5 = 0.6 and P(a=1) = 2/5 = 0.4.

The same calculation works for b: the subset with b=0 contains 1 positive and 2 negative objects (entropy 0.918), while the subset with b=1 contains 2 positive objects and no negatives (entropy 0). So, b’s gain is:

    \[\begin{aligned} IG(b) &= H(S) -  P(b=0) H(S \mid b = 0) - P(b=1) H(S \mid b = 1) \\ &= 0.971 - 0.6 \cdot 0.918 - 0.4 \cdot 0 = 0.4202 \end{aligned}\]

Since IG(b) > IG(a), we choose b to create a new node.

As we can see, \boldsymbol{IG} favors the splits that maximize the purity of the subsets, so that each subset’s majority class accounts for as high a fraction as possible.

3.5. Formula

The calculation of IG(a) shows us the formula for the gain in the general case. Let S denote the dataset to be split by creating a new node. Let’s suppose that the attribute a can take m values: a_1, a_2, \ldots, a_m, and that p_i is the fraction of the objects in S with a=a_i. Then, the information gain of \boldsymbol{a} is:

(7)   \begin{equation*} IG(a) = H(S) - \sum_{i=1}^{m}p_i H(S | a=a_i) \end{equation*}
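To tie formula (7) back to the example, here’s a short Python sketch (the function names are our own) that computes the gain for both features of the dataset S from before:

    import math
    from collections import Counter

    def entropy(labels):
        """Shannon entropy (in bits) of the class distribution in a list of labels."""
        total = len(labels)
        return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

    def information_gain(rows, labels, feature):
        """IG(a) = H(S) - sum of p_i * H(S | a = a_i), as in formula (7)."""
        gain = entropy(labels)
        for value in set(row[feature] for row in rows):
            subset = [label for row, label in zip(rows, labels) if row[feature] == value]
            gain -= (len(subset) / len(labels)) * entropy(subset)
        return gain

    rows = [{'a': 0, 'b': 0}, {'a': 0, 'b': 1}, {'a': 1, 'b': 0},
            {'a': 1, 'b': 1}, {'a': 0, 'b': 0}]
    labels = ['positive', 'positive', 'negative', 'positive', 'negative']

    print(information_gain(rows, labels, 'a'))  # ~0.02
    print(information_gain(rows, labels, 'b'))  # ~0.42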

The gain cannot be negative even though individual pieces of information can have a negative contribution. This follows from the concavity of entropy: the expected entropy after the split, \sum_{i=1}^{m} p_i H(S \mid a=a_i), can never exceed H(S).

4. Conclusion

In this article, we showed how to calculate the information gain. We also talked about entropy in information theory and its connection to uncertainty.
