## 1. Introduction

In this tutorial, we’ll talk about node impurity in decision trees. A decision tree is a model we use for supervised machine learning tasks such as classification and regression, and we typically train it with a greedy algorithm that picks the best split at each node.

## 2. Splitting in Decision Trees

During the training phase, the data are passed from the root node down to the leaves. At each node, a decision tree algorithm considers candidate splits over the available attributes and decides whether to split the node into two or more sub-nodes. It chooses the partition maximizing the purity of the split (i.e., minimizing the impurity). Informally, impurity measures how mixed the class labels are at the node at hand: the more homogeneous the labels, the lower the impurity.

There are different ways to define impurity. In classification tasks, we frequently use the Gini Index and entropy.

## 3. Gini Impurity

The Gini Index is related to the probability of misclassifying a randomly chosen sample.

Let’s assume that a dataset $S$ contains examples from $n$ classes. Its Gini Index, $Gini(S)$, is defined as:

$$Gini(S) = 1 - \sum_{i=1}^{n} p_i^2 \qquad (1)$$

where $p_i$ is the relative frequency of class $i$ in $S$, i.e., the probability that a randomly selected object belongs to class $i$.

As we can observe from the above equation, the Gini Index takes values in the interval $[0, 1 - 1/n]$. The minimum value of zero corresponds to a node containing only elements of the same class. In case this occurs, the node is called pure. For binary classification ($n = 2$), the maximum value of 0.5 corresponds to the highest impurity of a node.

### 3.1. Example: Calculating Gini Impurity

In this example, we’ll compute the Gini Indices for 3 different cases of a set with 4 balls of two different colors, red and blue:

• 4 red & 0 blue balls: $Gini = 1 - \left(\frac{4}{4}\right)^2 - \left(\frac{0}{4}\right)^2 = 0$

• 2 red & 2 blue balls: $Gini = 1 - \left(\frac{2}{4}\right)^2 - \left(\frac{2}{4}\right)^2 = 0.5$

• 3 red & 1 blue balls: $Gini = 1 - \left(\frac{3}{4}\right)^2 - \left(\frac{1}{4}\right)^2 = 0.375$
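We can verify these values with a few lines of Python (a minimal sketch; the `gini` helper and its counts-per-class input are our own convention, not part of any library):

```python
def gini(counts):
    """Gini Index of a node, given the number of elements per class."""
    total = sum(counts)
    # 1 minus the sum of squared relative class frequencies (Equation 1)
    return 1 - sum((c / total) ** 2 for c in counts)

print(gini([4, 0]))  # pure node -> 0.0
print(gini([2, 2]))  # maximally impure binary node -> 0.5
print(gini([3, 1]))  # -> 0.375
```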

## 4. Entropy

In statistics, entropy is a measure of uncertainty, i.e., of the expected amount of information.

Let’s assume that the dataset $S$ associated with a node contains examples from $n$ classes. Then, its entropy is:

$$E(S) = -\sum_{i=1}^{n} p_i \log_2 p_i \qquad (2)$$

where $p_i$ is the relative frequency of class $i$ in $S$. Entropy takes values from $[0, \log_2 n]$. As is the case with the Gini Index, a node is pure when $E(S)$ takes its minimum value, zero, and maximally impure when it takes its highest value, which is 1 in the binary case.

### 4.1. Example: Calculating Entropy

Let’s compute the entropies for the same examples as above:

• 4 red & 0 blue balls: $E = -\frac{4}{4}\log_2\frac{4}{4} = 0$

• 2 red & 2 blue balls: $E = -\frac{2}{4}\log_2\frac{2}{4} - \frac{2}{4}\log_2\frac{2}{4} = 1$

• 3 red & 1 blue balls: $E = -\frac{3}{4}\log_2\frac{3}{4} - \frac{1}{4}\log_2\frac{1}{4} \approx 0.811$
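A matching Python sketch (again, the helper name and counts-based interface are our own convention); note that empty classes are skipped, since $0 \log_2 0$ is taken to be zero:

```python
from math import log2

def entropy(counts):
    """Entropy (in bits) of a node, given the number of elements per class."""
    total = sum(counts)
    # minus the sum of p * log2(p) over non-empty classes (Equation 2)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

print(entropy([4, 0]))  # pure node: zero entropy
print(entropy([2, 2]))  # maximally impure binary node -> 1.0
print(round(entropy([3, 1]), 3))  # -> 0.811
```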

## 5. Splitting the Data

The quality of a split is very important during the training phase. When splitting, we choose the attribute whose partition yields the smallest weighted impurity of the new nodes, i.e., the largest reduction in impurity.

We’ll show how to split the data using entropy and the Gini index.

### 5.1. Information Gain

The information gain is the difference between a parent node’s entropy and the weighted sum of its child node entropies.

Let’s assume a dataset $S$ with $N$ objects is partitioned into two datasets, $S_1$ and $S_2$, of sizes $N_1$ and $N_2$. Then, the split’s Information Gain ($IG$) is:

$$IG = E(S) - \frac{N_1}{N} E(S_1) - \frac{N_2}{N} E(S_2) \qquad (3)$$

In general, if splitting $S$ into $m$ subsets $S_1, S_2, \ldots, S_m$ with $N_1, N_2, \ldots, N_m$ objects, respectively, the split’s Information Gain ($IG$) is:

$$IG = E(S) - \sum_{j=1}^{m} \frac{N_j}{N} E(S_j) \qquad (4)$$
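Equation 4 translates directly into code. In this sketch (the helper names are our own, not a library API), each child node is represented by its per-class counts:

```python
from math import log2

def entropy(counts):
    """Entropy of a node from its per-class counts (Equation 2)."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropies of the children (Equation 4)."""
    n = sum(parent)
    weighted = sum(sum(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted
```

For instance, splitting a node with class counts `[9, 5]` into children `[2, 3]`, `[4, 0]`, and `[3, 2]` yields a gain of about 0.247, while a "split" that keeps everything in one child gains nothing.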

### 5.2. Example: Splitting by Information Gain

Let’s consider dataset $S$, which shows if we’re going to play tennis or not based on the weather (the classic play-tennis data):

| Day | Outlook | Temperature | Humidity | Wind | Play Tennis |
|-----|----------|-------------|----------|--------|-------------|
| 1 | Sunny | Hot | High | Weak | No |
| 2 | Sunny | Hot | High | Strong | No |
| 3 | Overcast | Hot | High | Weak | Yes |
| 4 | Rain | Mild | High | Weak | Yes |
| 5 | Rain | Cool | Normal | Weak | Yes |
| 6 | Rain | Cool | Normal | Strong | No |
| 7 | Overcast | Cool | Normal | Strong | Yes |
| 8 | Sunny | Mild | High | Weak | No |
| 9 | Sunny | Cool | Normal | Weak | Yes |
| 10 | Rain | Mild | Normal | Weak | Yes |
| 11 | Sunny | Mild | Normal | Strong | Yes |
| 12 | Overcast | Mild | High | Strong | Yes |
| 13 | Overcast | Hot | Normal | Weak | Yes |
| 14 | Rain | Mild | High | Strong | No |

It contains 9 positive outcomes (Yes) and 5 negatives (No). So, the starting structure is $S = [9+, 5-]$, with entropy $E(S) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} \approx 0.94$. We want to determine the attribute that offers the highest Information Gain.

Let’s see what happens if we split by Outlook. It has three different values: Sunny, Overcast, and Rain. As we see, the possible splits are $S_{Sunny} = [2+, 3-]$, $S_{Overcast} = [4+, 0-]$, and $S_{Rain} = [3+, 2-]$ ($E$ and $S$ stand for entropy and split):

Firstly, we calculate the entropies. They are: $E(S_{Sunny}) \approx 0.971$, $E(S_{Overcast}) = 0$, and $E(S_{Rain}) \approx 0.971$. Therefore, the Information Gain based on the Outlook, $IG_{Outlook}$, is:

$$IG_{Outlook} = 0.94 - \frac{5}{14} \cdot 0.971 - \frac{4}{14} \cdot 0 - \frac{5}{14} \cdot 0.971 \approx 0.247 \qquad (5)$$

Next, we split the tree based on the Wind feature. It can take two values: Weak and Strong, with the possible splits $S_{Weak} = [6+, 2-]$ and $S_{Strong} = [3+, 3-]$:

The corresponding entropies are: $E(S_{Weak}) \approx 0.811$ and $E(S_{Strong}) = 1$. Therefore, $IG_{Wind}$ is:

$$IG_{Wind} = 0.94 - \frac{8}{14} \cdot 0.811 - \frac{6}{14} \cdot 1 \approx 0.048 \qquad (6)$$

Following this idea, we calculate the gains for Humidity ($S_{High} = [3+, 4-]$, $S_{Normal} = [6+, 1-]$) and Temperature ($S_{Hot} = [2+, 2-]$, $S_{Mild} = [4+, 2-]$, $S_{Cool} = [3+, 1-]$) as well:

$$IG_{Humidity} = 0.94 - \frac{7}{14} \cdot 0.985 - \frac{7}{14} \cdot 0.592 \approx 0.151 \qquad (7)$$

$$IG_{Temperature} = 0.94 - \frac{4}{14} \cdot 1 - \frac{6}{14} \cdot 0.918 - \frac{4}{14} \cdot 0.811 \approx 0.029 \qquad (8)$$

We see that the feature with the maximum Information Gain is Outlook. Thus, we conclude that Outlook is the best attribute to split the data at the tree’s root.
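A short script reproduces all four gains from the class counts stated above; the `splits` dictionary is our own summary of the table, not something a library provides:

```python
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def information_gain(parent, children):
    n = sum(parent)
    return entropy(parent) - sum(sum(c) / n * entropy(c) for c in children)

parent = [9, 5]  # 9 Yes, 5 No
splits = {
    "Outlook":     [[2, 3], [4, 0], [3, 2]],  # Sunny, Overcast, Rain
    "Wind":        [[6, 2], [3, 3]],          # Weak, Strong
    "Humidity":    [[3, 4], [6, 1]],          # High, Normal
    "Temperature": [[2, 2], [4, 2], [3, 1]],  # Hot, Mild, Cool
}
gains = {name: information_gain(parent, children) for name, children in splits.items()}
best = max(gains, key=gains.get)  # the attribute to split on at the root
print(best, gains)
```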

### 5.3. Example: Splitting by Gini Index

We can split the data by the Gini Index too. Let’s compute the required probabilities:

$$P(Sunny) = \frac{5}{14}, \quad P(Overcast) = \frac{4}{14}, \quad P(Rain) = \frac{5}{14} \qquad (9)$$

Out of the 14 days in the above example, Sunny, Overcast, and Rain occur 5, 4, and 5 times, respectively. Then, we compute the probabilities of a Sunny day and playing tennis or not. Out of the 5 times when Outlook = Sunny, we played tennis on 2 and didn’t play it on 3 days:

$$P(Yes \mid Sunny) = \frac{2}{5}, \quad P(No \mid Sunny) = \frac{3}{5}$$

Having calculated the required probabilities, we can compute the Gini Index of Sunny:

$$Gini(Sunny) = 1 - \left(\frac{2}{5}\right)^2 - \left(\frac{3}{5}\right)^2 = 0.48$$

We follow the same steps for Overcast and Rain:

$$Gini(Overcast) = 1 - \left(\frac{4}{4}\right)^2 - \left(\frac{0}{4}\right)^2 = 0, \quad Gini(Rain) = 1 - \left(\frac{3}{5}\right)^2 - \left(\frac{2}{5}\right)^2 = 0.48$$

Therefore, the weighted sum of the above Gini Indices is:

$$Gini_{Outlook} = \frac{5}{14} \cdot 0.48 + \frac{4}{14} \cdot 0 + \frac{5}{14} \cdot 0.48 \approx 0.343 \qquad (10)$$

Similarly, we compute the Gini Indices of Temperature, Humidity, and Wind:

$$Gini_{Temperature} \approx 0.440, \quad Gini_{Humidity} \approx 0.367, \quad Gini_{Wind} \approx 0.429 \qquad (11)$$

We conclude that Outlook has the lowest Gini Index. Hence, we’ll choose it as the root node.
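The same comparison can be sketched in Python (helper names are ours; the per-attribute class counts summarize the table above):

```python
def gini(counts):
    """Gini Index of a node from its per-class counts (Equation 1)."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def weighted_gini(children):
    """Size-weighted sum of the children's Gini Indices."""
    n = sum(sum(c) for c in children)
    return sum(sum(c) / n * gini(c) for c in children)

splits = {
    "Outlook":     [[2, 3], [4, 0], [3, 2]],  # Sunny, Overcast, Rain
    "Temperature": [[2, 2], [4, 2], [3, 1]],  # Hot, Mild, Cool
    "Humidity":    [[3, 4], [6, 1]],          # High, Normal
    "Wind":        [[6, 2], [3, 3]],          # Weak, Strong
}
scores = {name: weighted_gini(children) for name, children in splits.items()}
best = min(scores, key=scores.get)  # lowest weighted impurity wins
print(best, scores)
```

Note that Information Gain is maximized while the weighted Gini Index is minimized, yet both criteria pick Outlook here.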

## 6. Conclusion

In this article, we talked about how we can compute the impurity of a node while training a decision tree. In particular, we talked about the Gini Index and entropy as common measures of impurity. By choosing, at each step, the split that minimizes the weighted impurity of the resulting nodes, we grow a tree whose leaves are as pure as possible.
