Before we dive into cross-entropy and KL divergence, let’s review an important concept related to these two terms: entropy.
In information theory, entropy quantifies uncertainty, i.e., the information contained in the probability distribution of a random variable. A high entropy shows unpredictability (uncertainty), while a low entropy indicates predictability.
Let’s inspect a probability distribution whose data is concentrated around the mean. Such a distribution has low entropy, because most outcomes fall in a narrow, predictable range. In contrast, a distribution whose data is spread out is a high-entropy distribution.
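To make the contrast concrete, here is a minimal sketch that computes the Shannon entropy of two discrete distributions. The function name `entropy` and both example distributions are assumptions chosen for illustration, and the logarithm is taken in base 2 so the result is in bits.

```python
import math

def entropy(probs, base=2.0):
    """Shannon entropy of a discrete distribution (in bits for base 2)."""
    # Outcomes with zero probability contribute nothing (0 * log 0 is taken as 0).
    return -sum(p * math.log(p, base) for p in probs if p > 0)

# Hypothetical distributions over four outcomes.
peaked = [0.85, 0.05, 0.05, 0.05]   # mass concentrated on one outcome
uniform = [0.25, 0.25, 0.25, 0.25]  # mass spread out evenly

print("peaked :", entropy(peaked))
print("uniform:", entropy(uniform))
```

The concentrated distribution yields the smaller value, while the uniform one attains the maximum possible for four outcomes.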
3. Cross-Entropy
Cross-entropy is the expected number of bits we need to encode events drawn from one probability distribution, $R$, when we use a code optimized for another distribution, $S$:
$$H(R, S) = -\mathbb{E}_{x \sim R}\left[\log S(x)\right]$$
For discrete variables, $R$ and $S$ represent probabilities of specific discrete values, and the cross-entropy is the negative sum of the probabilities of one distribution multiplied by the logarithm of the probabilities of the other:
$$H(R, S) = -\sum_{x} R(x) \log S(x)$$
We use integrals for continuous variables, where $R$ and $S$ represent probability density functions:
$$H(R, S) = -\int R(x) \log S(x) \, dx$$
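The discrete case can be sketched in a few lines of Python. This is an illustrative sketch, not a reference implementation: the function name `cross_entropy`, the convention that the first argument is the distribution the events are drawn from, and the example values are all assumptions.

```python
import math

def cross_entropy(r, s, base=2.0):
    """Cross-entropy -sum_x R(x) * log S(x) of two discrete distributions."""
    if len(r) != len(s):
        raise ValueError("distributions must cover the same outcomes")
    total = 0.0
    for r_x, s_x in zip(r, s):
        if r_x == 0:
            continue  # outcomes that never occur under R cost nothing
        if s_x == 0:
            return math.inf  # S rules out an event that R can produce
        total -= r_x * math.log(s_x, base)
    return total

# Hypothetical distributions over three outcomes.
r = [0.5, 0.25, 0.25]
s = [0.25, 0.25, 0.5]

print(cross_entropy(r, s))  # cost of encoding R's events with S's code
print(cross_entropy(r, r))  # reduces to the entropy of R
```

Encoding with the mismatched distribution costs more bits than encoding with the matching one, which is the behavior the definition describes.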
4. KL Divergence
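The KL divergence ($D_{KL}$), also known as relative entropy, quantifies the dissimilarity between two probability distributions.
Going back to our two probability distributions $R$ and $S$, $D_{KL}(R \parallel S)$ is the expected value, taken under $R$, of the log-ratio of the probabilities that $R$ and $S$ assign to each possible event:
$$D_{KL}(R \parallel S) = \mathbb{E}_{x \sim R}\left[\log \frac{R(x)}{S(x)}\right]$$
We use summation for discrete variables:
$$D_{KL}(R \parallel S) = \sum_{x} R(x) \log \frac{R(x)}{S(x)}$$
On the other hand, for continuous variables, we take the integral:
$$D_{KL}(R \parallel S) = \int R(x) \log \frac{R(x)}{S(x)} \, dx$$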
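As a rough sketch of the discrete form, the following Python function sums the weighted log-ratios. The name `kl_divergence`, the argument order (reference distribution first), and the two-outcome example values are assumptions made for illustration.

```python
import math

def kl_divergence(r, s, base=2.0):
    """KL divergence sum_x R(x) * log(R(x) / S(x)) of two discrete distributions."""
    if len(r) != len(s):
        raise ValueError("distributions must cover the same outcomes")
    total = 0.0
    for r_x, s_x in zip(r, s):
        if r_x == 0:
            continue  # terms with R(x) = 0 contribute nothing
        if s_x == 0:
            return math.inf  # S rules out an event that R can produce
        total += r_x * math.log(r_x / s_x, base)
    return total

# Hypothetical distributions over two outcomes.
r = [0.5, 0.5]
s = [0.9, 0.1]

print(kl_divergence(r, s))
print(kl_divergence(s, r))  # swapping the arguments gives a different value
print(kl_divergence(r, r))  # identical distributions give zero
```

Swapping the arguments changes the result because the expectation is taken under a different distribution in each case.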
5. An Example
Let’s assume the probability distributions $R$ and $S$ take on the following values:
The cross-entropy is:
The KL divergence:
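As a separate illustration, the sketch below computes both quantities for two hypothetical three-outcome distributions. The values of `r` and `s` are assumptions chosen only for this sketch, as are the helper names; logarithms are in base 2, so results are in bits.

```python
import math

def entropy(p):
    return -sum(p_x * math.log2(p_x) for p_x in p if p_x > 0)

def cross_entropy(p, q):
    return -sum(p_x * math.log2(q_x) for p_x, q_x in zip(p, q) if p_x > 0)

def kl_divergence(p, q):
    return sum(p_x * math.log2(p_x / q_x) for p_x, q_x in zip(p, q) if p_x > 0)

# Hypothetical values, strictly positive so every logarithm is defined.
r = [0.1, 0.4, 0.5]
s = [0.8, 0.15, 0.05]

h_r = entropy(r)
ce = cross_entropy(r, s)
kl = kl_divergence(r, s)

print("entropy of r       :", h_r)
print("cross-entropy(r, s):", ce)
print("KL divergence(r, s):", kl)
print("cross-entropy minus entropy:", ce - h_r)  # agrees with the KL divergence
```

The last line shows that, for these values, the cross-entropy and the KL divergence differ exactly by the entropy of `r`.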
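There are a few similarities between cross-entropy and KL divergence:
- neither measure is necessarily symmetric, so in general $H(R, S) \neq H(S, R)$ and $D_{KL}(R \parallel S) \neq D_{KL}(S \parallel R)$, and
- for a fixed $R$, both reach their minimum if and only if $S$ matches $R$ exactly: the KL divergence is then zero, while the cross-entropy equals the entropy $H(R)$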
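However, these metrics aren’t the same: the cross-entropy exceeds the KL divergence by the entropy of the first distribution, $H(R, S) = H(R) + D_{KL}(R \parallel S)$.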
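The gap between the two measures can be derived from the discrete definitions. The short derivation below assumes the convention $H(R, S) = -\sum_x R(x) \log S(x)$, with $R$ as the distribution the events are drawn from, and writes $H(R)$ for the entropy of $R$:

```latex
\begin{align*}
D_{KL}(R \parallel S)
  &= \sum_{x} R(x) \log \frac{R(x)}{S(x)} \\
  &= \sum_{x} R(x) \log R(x) - \sum_{x} R(x) \log S(x) \\
  &= -H(R) + H(R, S)
\end{align*}
```

Since $H(R)$ does not depend on $S$, minimizing the cross-entropy over $S$ is equivalent to minimizing the KL divergence over $S$.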
In machine learning, cross-entropy is used as a loss function in classification tasks to measure the difference between the observed labels and the predicted class probabilities: the closer the predictions are to the observed data, the lower the loss. The KL divergence is used in generative models to encourage a model to generate samples more similar to the actual data.
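The classification use case can be sketched as follows. This is a simplified illustration, not any particular library's loss: the function name `cross_entropy_loss`, the clipping constant `eps`, and the sample labels and predictions are assumptions, and the natural logarithm is used, so the loss is in nats.

```python
import math

def cross_entropy_loss(true_labels, predicted_probs):
    """Mean cross-entropy between integer class labels and predicted probabilities."""
    eps = 1e-12  # arbitrary small constant that guards against log(0)
    losses = []
    for label, probs in zip(true_labels, predicted_probs):
        # With a one-hot target, only the true class's term survives the sum.
        losses.append(-math.log(max(probs[label], eps)))
    return sum(losses) / len(losses)

# Hypothetical labels and two sets of predictions for three samples.
labels = [0, 2, 1]
confident = [[0.9, 0.05, 0.05], [0.1, 0.1, 0.8], [0.05, 0.9, 0.05]]
unsure = [[0.4, 0.3, 0.3], [0.3, 0.4, 0.3], [0.3, 0.4, 0.3]]

print(cross_entropy_loss(labels, confident))
print(cross_entropy_loss(labels, unsure))
```

Predictions that place more probability on the correct class receive the lower loss, which is what makes the measure usable as a training objective.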
In this article, we provided an overview of cross-entropy and KL divergence.
There are subtle differences between them. Cross-entropy measures the expected number of bits needed to encode events from one probability distribution using a code optimized for the other distribution. In contrast, KL divergence quantifies how much one probability distribution differs from another, i.e., the extra bits incurred by using the mismatched distribution.