1. Introduction

In this tutorial, we’ll review the differences between cross-entropy and Kullback-Leibler (KL) divergence.

2. Entropy

Before we dive into cross-entropy and KL divergence, let’s review an important concept related to these two terms: entropy.

In information theory, entropy quantifies uncertainty, i.e., the information contained in the probability distribution of a random variable. A high entropy shows unpredictability (uncertainty), while a low entropy indicates predictability.

Let’s inspect a probability distribution S over \{1, 2, 3 \}:

[Figure: probability distribution showing low entropy]

It’s a distribution where data is concentrated around the mean. In contrast, a distribution whose data is spread out represents a high-entropy distribution:

[Figure: probability distribution showing high entropy]
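
To make this concrete, here’s a minimal NumPy sketch that computes the entropy of two made-up distributions over \{1, 2, 3\} (the numbers are illustrative and not read off the plots above):

    import numpy as np

    def entropy(p):
        """Shannon entropy of a discrete distribution p, in nats (natural log)."""
        p = np.asarray(p, dtype=float)
        p = p[p > 0]                       # terms with zero probability contribute nothing
        return -np.sum(p * np.log(p))

    concentrated = [0.05, 0.90, 0.05]      # most of the mass on one value -> low entropy
    spread_out   = [0.34, 0.33, 0.33]      # mass spread almost evenly     -> high entropy

    print(entropy(concentrated))           # ~0.39 nats
    print(entropy(spread_out))             # ~1.10 nats, close to log(3) ~ 1.099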

3. Cross-Entropy

Cross-entropy is the expected number of bits (or nats, if we use the natural logarithm) we need to encode events drawn from one probability distribution when using a code optimized for the other distribution:

    \[H(R, S) = -\mathbb{E}_R[\log(S)] \]

For discrete variables (distributions), R and S assign probabilities to specific discrete values, and the cross-entropy H(R, S) is the negative sum, over all values x, of R(x) multiplied by the logarithm of S(x):

    \[ H(R, S) = -\sum_{x} R(x) \log S(x) \]

We use integrals for continuous variables, where S and R represent probability density functions:

    \[ H(R, S) = -\int_{-\infty}^{\infty} R(x) \log\left(S(x)\right) \, dx \]
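
As a quick illustration, the discrete version of this formula can be sketched in a few lines of NumPy (cross_entropy here is our own helper, not a library function):

    import numpy as np

    def cross_entropy(r, s):
        """Cross-entropy H(R, S) of two discrete distributions, using the natural log."""
        r = np.asarray(r, dtype=float)
        s = np.asarray(s, dtype=float)
        return -np.sum(r * np.log(s))      # -sum_x R(x) * log(S(x))

    print(cross_entropy([0.5, 0.5], [0.9, 0.1]))   # ~1.20 nats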

4. KL Divergence

The KL divergence (D_{\text{KL}}), also known as relative entropy, quantifies the dissimilarity between two probability distributions.

Going back to our two probability distributions R and S, D_{\text{KL}}(R \parallel S) is the expected value, under R, of the log-ratio between the probabilities that R and S assign to each possible event:

    \[ D_{\text{KL}}(R\parallel S) = \mathbb{E}_R\left[\log\left(\frac{R(x)}{S(x)}\right)\right] \]

We use summation for discrete variables:

    \[ D_{\text{KL}}(R \parallel S) = \sum_{x} R(x) \log\left(\frac{R(x)}{S(x)}\right) \]

On the other hand, for continuous variables, we take the integral:

    \[D_{\text{KL}}(R \parallel S) = \int_{-\infty}^{\infty} R(x) \log\left(\frac{R(x)}{S(x)}\right) \, dx \]
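
Here’s a discrete-case sketch, analogous to the cross-entropy helper above (again, the function name is our own):

    import numpy as np

    def kl_divergence(r, s):
        """D_KL(R || S) for discrete distributions, using the natural log."""
        r = np.asarray(r, dtype=float)
        s = np.asarray(s, dtype=float)
        mask = r > 0                       # terms with R(x) = 0 contribute nothing
        return np.sum(r[mask] * np.log(r[mask] / s[mask]))

If SciPy is available, scipy.stats.entropy(r, s) returns the same quantity for discrete distributions.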

5. An Example

Let’s assume the probability distributions R and S assign the following probabilities to the values \{1, 2, 3\}:

    \[ R = \begin{pmatrix} 1 & 2 & 3 \\ 0.42 & 0.31 & 0.27 \end{pmatrix} \qquad S = \begin{pmatrix} 1 & 2 & 3 \\ 0.25 & 0.5 & 0.25 \end{pmatrix} \]

Using the natural logarithm, the cross-entropy H(R, S) is:

    \[ \begin{aligned} H(R, S) & = -\sum_{x} R(x) \log S(x) \\ & = -\left(0.42 \cdot \log(0.25) + 0.31 \cdot \log(0.5) + 0.27 \cdot \log(0.25)\right) \\ & = -\left(0.42 \cdot (-1.39) + 0.31 \cdot (-0.69) + 0.27 \cdot (-1.39)\right) \\ & = -\left(-0.58 - 0.21 - 0.38\right) \\ & = -\left(-1.17\right) \\ & = 1.17 \end{aligned} \]

The KL divergence D_{\text{KL}}(R \parallel S) is:

    \[ \begin{aligned} D_{\text{KL}}(R \parallel S) & = \sum_{x} R(x) \log\left(\frac{R(x)}{S(x)}\right)\\ & = 0.42 \cdot \log\left(\frac{0.42}{0.25}\right) + 0.31 \cdot \log\left(\frac{0.31}{0.5}\right) + 0.27 \cdot \log\left(\frac{0.27}{0.25}\right) \\ & \approx 0.42 \cdot 0.52 + 0.31 \cdot (-0.48) + 0.27 \cdot 0.08 \\ & \approx 0.22 - 0.15 + 0.02 \\ & = 0.09 \end{aligned} \]
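
We can double-check these numbers (rounded to two decimals above) with a few lines of NumPy:

    import numpy as np

    R = np.array([0.42, 0.31, 0.27])
    S = np.array([0.25, 0.50, 0.25])

    H_RS = -np.sum(R * np.log(S))          # cross-entropy H(R, S)  ~ 1.17
    D_KL = np.sum(R * np.log(R / S))       # D_KL(R || S)           ~ 0.09
    H_R  = -np.sum(R * np.log(R))          # entropy H(R)           ~ 1.08, used in the next section

    print(H_RS, D_KL, H_R)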

6. Comparison

There are a few similarities between cross-entropy and KL divergence:

  • both measures aren’t necessarily symmetric, so D_{\text{KL}}(R \parallel S) \neq D_{\text{KL}}(S \parallel R), and H(R,S) \neq H(S,R)
  • both reach their minimum value when the two probability distributions are an exact match: the KL divergence drops to zero, while the cross-entropy reduces to the entropy of R

However, these metrics aren’t the same. The two are linked by a simple relationship:

    \[ H(R, S) = H(R) + D_{\text{KL}}(R \parallel S) \]

In other words, the cross-entropy carries the entropy of R as a baseline, while the KL divergence isolates the extra cost of encoding R with a code optimized for S. In our example, H(R) \approx 1.08, and indeed 1.08 + 0.09 \approx 1.17.

In machine learning, cross-entropy is used as a loss function in classification tasks: it measures how far the predicted class probabilities are from the observed labels, so minimizing it pushes the predictions closer to the observed data. The KL divergence is used in generative models to encourage the model’s distribution to match the distribution of the actual data.
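
For instance, here’s a minimal sketch of cross-entropy as a classification loss; the labels and predicted probabilities below are made up purely for illustration:

    import numpy as np

    def cross_entropy_loss(y_true, y_pred, eps=1e-12):
        """Average cross-entropy between one-hot labels and predicted class probabilities."""
        y_pred = np.clip(y_pred, eps, 1.0)             # avoid log(0)
        return -np.mean(np.sum(y_true * np.log(y_pred), axis=1))

    # Two samples, three classes; the true classes are 0 and 2
    y_true = np.array([[1, 0, 0],
                       [0, 0, 1]])
    y_pred = np.array([[0.7, 0.2, 0.1],
                       [0.1, 0.3, 0.6]])

    print(cross_entropy_loss(y_true, y_pred))          # ~0.43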

7. Conclusion

In this article, we provided an overview of cross-entropy and KL divergence.

There are subtle differences between them. Cross-entropy quantifies the difference between two probability distributions by measuring the expected number of bits (or nats) needed to encode events from one probability distribution using the other distribution. In contrast, KL divergence quantifies only the extra cost, i.e., how much one probability distribution diverges from the other.
