1. Overview

In this tutorial, we’ll study the definition of cross-entropy for machine learning.

We’ll first discuss the idea of entropy in information theory and its relationship to supervised learning.

Then, we’ll see how to derive cross-entropy for bivariate distributions from the definition of entropy for univariate distributions. This will give us a good understanding of how the former generalizes the latter.

Finally, we’ll see how to use cross-entropy as a loss function, and how to optimize the parameters of a model through gradient descent over it.

2. Entropy

2.1. Entropy and Labels in Supervised Learning

In our article on the computer science definition of entropy, we discussed the idea that the information entropy of a binary variable relates to the combinatorial entropy of a sequence of symbols.

Let’s first define x as a binary random variable. Then, we can compute its Shannon measure of entropy H as the combinatorial entropy of the two symbols, the 0 and 1 bits, that the variable can assume. The formula for H is this:

H = - \sum_{i=1}^{|x|} p(x_i) \times \text{log}_{2} p(x_i)
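
For instance, for a fair binary variable with p(x_1) = p(x_2) = 0.5, the entropy reaches its maximum of one bit:

H = - \left( 0.5 \times \text{log}_{2}\ 0.5 + 0.5 \times \text{log}_{2}\ 0.5 \right) = 1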

When we work on problems of classification in supervised machine learning, we try to learn a function that assigns one label among a finite set of labels to the features of an observation. The set C = \{c_1, c_2, ... , c_n\} of labels or classes c, then, comprises several distinct symbols which we can treat as possible values assumed by the output of a model. It follows that we can compute a measure of entropy for the class labels output by a predictive model for classification.

2.2. Probabilistic Rather Than Deterministic Classification

There are two ways to transition to a probabilistic definition of entropy, one that lets us work with probabilities rather than with the discrete distribution of the labels. The first way is to interpret the relative frequency of occurrence of the classes as the probability of their occurrence. This means that we can consider p(c_i) to be the number of times the class c_i occurs in the distribution of classes, divided by the length of that distribution.
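
As a minimal sketch of this first interpretation (the label sequence below is purely illustrative), we can estimate each p(c_i) from relative frequencies and plug the estimates into the entropy formula:

import numpy as np
from collections import Counter

# Illustrative sequence of observed class labels
labels = [1, 0, 0, 1, 1, 1, 0, 1]

# Relative frequency of each class as an estimate of p(c_i)
counts = Counter(labels)
probs = np.array([counts[c] / len(labels) for c in sorted(counts)])

# Shannon entropy in bits, as in the formula above
entropy = -np.sum(probs * np.log2(probs))
print(entropy)  # ~0.954 bits for three 0s and five 1s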

The second relates to the consideration that some classification models are intrinsically probabilistic and don’t output single-point predictions but, rather, probability distributions. This has to do with the activation function that’s used in the output layer of a classification model. The most common probabilistic functions for the output layers of machine learning models are the sigmoid (logistic) function, for binary classification, and the softmax function, for multiclass classification.

These functions output a value or set of values between 0 and 1 that we can therefore interpret as the probability distribution of the class affiliation of the observations.

2.3. Entropy and Probabilistic Distributions of Labels

The softmax function in particular, rather than outputting a single class c \in C as the most likely label for a given input, returns a probability distribution P(C) over the whole set C. This distribution consists of the individual probabilities p(c_i) assigned to each possible label c_i.

We can subsequently use them in order to calculate the entropy for the distribution of class labels C and their associated probabilities P(C):

H = - \sum_{i=1}^n p(c_i) \times \text{log}\ p(c_i)
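
As a small sketch, assuming a hypothetical three-class problem with an arbitrary vector of scores, we can turn the scores into a probability distribution with softmax and then compute the entropy of that distribution:

import numpy as np

def softmax(z):
    # Shift by the maximum for numerical stability
    e = np.exp(z - np.max(z))
    return e / e.sum()

# Hypothetical, illustrative scores for a three-class problem
scores = np.array([2.0, 1.0, 0.1])
p = softmax(scores)                # ~[0.659, 0.242, 0.099]
entropy = -np.sum(p * np.log(p))   # ~0.85
print(p, entropy)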

2.4. Practical Example of Entropy in Classification

Let’s imagine, for example, that we’re conducting binary classification by using logistic regression. The output of the logistic model is a value between 0 and 1, which we normally interpret as the probability P(y=1|x) = P(1) of the input being affiliated with the first class. This implies that the second possible class has a corresponding probability P(y=0 | x) = P(0) = 1- P(1): tertium non datur in binary classification.

We can initially assume that the logistic model has an input with a single feature, no bias term, and a parameter for the unique input equal to 1. In this sense, the model corresponds perfectly to the sigmoid function \sigma_1(x) = \frac{1} {1+e^{- (\beta_0 + \beta_1 \times x)} } with \beta_0 = 0 and \beta_1 = 1.

We can then interpret the two probabilities \sigma_1(x) and 1-\sigma_1(x) as the probability distribution for a binary random variable, and compute the entropy measure H accordingly:

H(\sigma_1(x)) = -\ \sigma_1(x) \times \text{log}\ \sigma_1(x) - (1- \sigma_1(x)) \times \text{log} (1- \sigma_1(x))

Not surprisingly, the entropy of \sigma_1(x) is maximized when the output of the classification is undecided. This happens when the probability assigned to each class is identical.
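
A quick numerical check, a sketch assuming the \sigma_1 model defined above, confirms this: the entropy peaks exactly where \sigma_1(x) = 0.5, i.e. at x = 0:

import numpy as np

def sigmoid(x, b0=0.0, b1=1.0):
    return 1.0 / (1.0 + np.exp(-(b0 + b1 * x)))

def binary_entropy(p):
    # Entropy of a Bernoulli distribution with parameter p (natural logarithm)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

xs = np.linspace(-5, 5, 101)
h = binary_entropy(sigmoid(xs))
print(xs[np.argmax(h)])  # ~0.0: the entropy is maximal when sigma_1(x) = 0.5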

2.5. Working With Multiple Probability Distributions

We can, however, also work with multiple probability distributions and their respective models. This is, for example, the case if we’re comparing the outputs of multiple logistic regression models, like the one we defined above.

Let’s imagine we want to compare the previous model \sigma_1(x) with a second model \sigma_2(x). One way to conduct this comparison is to study the differences between the two respective probability distributions and their entropies.

If we imagine that for \sigma_2(x) the two parameters are \beta_0 = 1 and \beta_1 = 2, we then obtain a model \sigma_2(x) = \frac{1} {1+e^{(-1-2 \times x)}} with this associated entropy:

H(\sigma_2(x)) = -\ \sigma_2(x) \times \text{log}\ \sigma_2(x) - (1- \sigma_2(x)) \times \text{log} (1- \sigma_2(x))

Notice how the entropies of the two models don’t correspond. This means that, as a general rule, the entropy of two different probability distributions is different.

2.6. Some Entropies Are More Equal Than Others

Finally, if we compared the two entropies H(\sigma_1(x)) and H(\sigma_2(x)) with a third entropy H(\sigma_3(x)), originating from a logistic model with parameters \beta_0 = -10 and \beta_1 = -5, we’d observe this:

[Plot: the entropies H(\sigma_1(x)), H(\sigma_2(x)), and H(\sigma_3(x)) as functions of x]

It instinctively appears to us that the first two probability distributions, associated with the classifiers \sigma_1(x) and \sigma_2(x), have entropies that are more similar to one another than to the entropy of the third classifier \sigma_3(x).
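
We can make this impression concrete with a small sketch that evaluates the three entropies at a single, arbitrarily chosen point (x = 0 here is only illustrative):

import numpy as np

def sigmoid(x, b0, b1):
    return 1.0 / (1.0 + np.exp(-(b0 + b1 * x)))

def binary_entropy(p):
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

x = 0.0  # arbitrary evaluation point, for illustration only
for b0, b1 in [(0, 1), (1, 2), (-10, -5)]:  # sigma_1, sigma_2, sigma_3
    print(b0, b1, binary_entropy(sigmoid(x, b0, b1)))
# The first two entropies (~0.69 and ~0.58) are much closer to each other
# than either is to the third (~0.0005)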

This gives us the intuitive idea that, if we want to compare the predictions between probabilistic models, or even between a probabilistic model and some known probability distribution, we need to use some dimensionless measure for the comparison of their respective entropies.

3. Cross-Entropy

3.1. The Definition of Cross-Entropy

On these bases, we can extend the idea of entropy in a univariate random distribution to that of cross-entropy for bivariate distributions. Or, if we use the probabilistic terminology, we can expand from the entropy of a probability distribution to a measure of cross-entropy for two distinct probability distributions.

The cross-entropy H(p, q) of two probability distributions p and q is defined by this formula:

H(p, q) = -\sum_{i=1}^n p(x_i) \times \text{log}\ q(x_i)
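
As a minimal sketch, the formula translates directly into code for two discrete distributions defined over the same outcomes (the distributions below are illustrative):

import numpy as np

def cross_entropy(p, q):
    # H(p, q) = -sum_i p(x_i) * log q(x_i), with p and q over the same outcomes
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return -np.sum(p * np.log(q))

p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]
print(cross_entropy(p, q))  # ~0.89
print(cross_entropy(p, p))  # ~0.80: reduces to the entropy H(p) when q = p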

3.2. Cross-Entropy for Model Comparison

We can apply this formula to compare the output of the two models, \sigma_1(x) and \sigma_2(x), from the previous section:

H(\sigma_1(x), \sigma_2(x)) = -\ \sigma_1(x) \times \text{log}\ \sigma_2(x) - (1- \sigma_1(x)) \times \text{log} (1- \sigma_2(x))

This is the graph of the cross-entropy for these two particular models:

[Plot: the cross-entropy H(\sigma_1(x), \sigma_2(x)) as a function of x]

Notice that the cross-entropy is never lower than the entropy of the first distribution, and it’s generally (but not necessarily) higher than the entropies of both. An intuitive understanding that we can have of this phenomenon is to imagine the cross-entropy as some kind of total entropy of the two distributions. More accurately, though, we can consider the cross-entropy of two distributions to move further away from their individual entropies as the two distributions differ more from one another.

3.3. Pair Ordering Matters

Notice also that the order in which we insert the terms into the H operator matters. The two functions H(p(x), q(x)) and H(q(x), p(x)) are generally different. This is, for example, the graph that compares the cross-entropies of the two logistic regression models with the terms swapped:

[Plot: H(\sigma_1(x), \sigma_2(x)) compared with H(\sigma_2(x), \sigma_1(x)) as functions of x]

This is particularly important when we compute the cross-entropy between an observed probability distribution, say the predictions of a classification model, and a target class distribution. In that case, the true probability distribution is always the first term \boldsymbol{p(x)}, and the predictions of the model are always the second, \boldsymbol{q(x)}.
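
A tiny numerical example with illustrative values makes the asymmetry evident:

import numpy as np

p = np.array([0.8, 0.2])  # hypothetical target distribution
q = np.array([0.6, 0.4])  # hypothetical model predictions

print(-np.sum(p * np.log(q)))  # H(p, q) ~ 0.59
print(-np.sum(q * np.log(p)))  # H(q, p) ~ 0.78: swapping the terms changes the result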

4. Model Optimization Through Cross-Entropy

4.1. Cross-Entropy as a Loss Function

The most important application of cross-entropy in machine learning is its use as a loss function. In that context, minimizing the cross-entropy, i.e., minimizing the loss function, allows us to optimize the parameters of a model. For model optimization, we normally use the average of the cross-entropy between all training observations and the respective predictions.

Let’s use as a model for prediction the logistic regression model \hat{y} = \sigma(x). Then, cross-entropy as its loss function is:

L(x, y) = -\frac {1} {n} \sum_{i=1}^n \left[ y_i \times \text{log}\ \sigma(x_i) + (1- y_i) \times \text{log} (1- \sigma(x_i)) \right]
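
As a sketch, assuming labels y_i \in \{0, 1\} and predicted probabilities \sigma(x_i) (the values below are illustrative), the averaged loss can be computed like this:

import numpy as np

def cross_entropy_loss(y, y_hat, eps=1e-12):
    # Average binary cross-entropy between labels y and predictions y_hat
    y = np.asarray(y, dtype=float)
    y_hat = np.clip(y_hat, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

y = np.array([1, 0, 1, 1])               # true labels
y_hat = np.array([0.9, 0.2, 0.7, 0.6])   # predicted probabilities sigma(x_i)
print(cross_entropy_loss(y, y_hat))      # ~0.30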

4.2. Algorithmic Minimization of Cross-Entropy

We can then minimize the loss function by optimizing the parameters that constitute the predictions of the model. The typical algorithmic way to do so is by means of gradient descent over the parameter space spanned by \boldsymbol{\beta}.

We discussed above how to compute the predictions for a logistic model. Specifically, we stated that predictions are computed as the logistic function of a linear combination of input and parameters:

\sigma(x, \beta) = \frac {1} {1+ e^{- x \times \beta} }

We also know that the derivative of the logistic function is:

\sigma'(x, \beta) = \sigma(x, \beta) \times (1- \sigma(x, \beta) )

From this, and writing \sigma_i for \sigma(x_i, \beta), we can derive the gradient with respect to the parameters \beta as:

\nabla_{\beta} \sigma(x_i) = \sigma_i \times (1- \sigma_i ) \times x_i

And finally, we can calculate the gradient of the loss function as:

\nabla_{\beta} L(x, y) = - \frac {1} {n} \sum_{i=1}^n (y_i - \sigma_i) \times x_i
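
Making the intermediate step explicit, this follows from applying the chain rule to the loss above and substituting the gradient of \sigma(x_i):

\nabla_{\beta} L(x, y) = - \frac {1} {n} \sum_{i=1}^n \left( \frac {y_i} {\sigma_i} - \frac {1-y_i} {1-\sigma_i} \right) \times \sigma_i \times (1- \sigma_i ) \times x_i = - \frac {1} {n} \sum_{i=1}^n (y_i - \sigma_i) \times x_i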

This, lastly, lets us optimize the model through gradient descent.
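
As a minimal sketch with a single feature, an intercept, synthetic data, and arbitrary hyperparameters, the full loop might look like this:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Synthetic, illustrative data: an intercept column plus one feature
rng = np.random.default_rng(0)
x = np.column_stack([np.ones(100), rng.normal(size=100)])
y = (x[:, 1] + rng.normal(scale=0.5, size=100) > 0).astype(float)

beta = np.zeros(2)
learning_rate = 0.1

for _ in range(1000):
    sigma = sigmoid(x @ beta)                               # predictions sigma(x_i)
    gradient = -np.mean((y - sigma)[:, None] * x, axis=0)   # gradient of the loss
    beta -= learning_rate * gradient                        # descent step

print(beta)  # learned parameters: intercept and slope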

5. Conclusion

In this article, we studied the definition of cross-entropy. We started from the formalization of entropy for univariate probability distributions. Then, we generalized to bivariate probability distributions and their comparison.

Further, we analyzed the role of cross-entropy as a loss function for classification models.

In relation to that, we also studied the problem of its minimization through gradient descent for parameter optimization.
