
Last updated: February 28, 2025
In this tutorial, we’ll review the differences between cross-entropy and Kullback-Leibler (KL) divergence.
Before we dive into cross-entropy and KL divergence, let’s review an important concept related to these two terms: entropy.
In information theory, entropy quantifies uncertainty, i.e., the information contained in the probability distribution of a random variable. A high entropy shows unpredictability (uncertainty), while a low entropy indicates predictability.
Consider a probability distribution whose data is concentrated around the mean. Such a distribution has low entropy because its outcomes are easy to predict. In contrast, a distribution whose data is spread out has high entropy.
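The contrast between a concentrated and a spread-out distribution can be checked numerically. The sketch below assumes the standard Shannon definition of entropy for a discrete distribution, $H = -\sum_x p(x) \log_2 p(x)$; the two distributions and the `entropy` helper are hypothetical and chosen only for illustration.

```python
import math

def entropy(dist):
    """Shannon entropy in bits of a discrete distribution given as a list of probabilities."""
    # Outcomes with zero probability contribute nothing (0 * log 0 is taken as 0).
    return -sum(p * math.log2(p) for p in dist if p > 0)

# Hypothetical distributions over five outcomes (illustrative values only)
concentrated = [0.02, 0.08, 0.80, 0.08, 0.02]  # most of the mass sits on the middle outcome
spread_out = [0.2, 0.2, 0.2, 0.2, 0.2]         # uniform: every outcome is equally likely

print(entropy(concentrated))
print(entropy(spread_out))
```

The uniform distribution should report the larger value, since it is the least predictable distribution over a fixed number of outcomes.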
Cross-entropy is the expected number of bits we need to encode events drawn from one probability distribution, $S$, using a code optimized for another distribution, $R$:

$$H(S, R) = -\mathbb{E}_{x \sim S}\left[\log R(x)\right]$$

With a base-2 logarithm, the result is measured in bits.

For discrete variables, $S(x)$ and $R(x)$ are the probabilities of a specific value $x$, and the cross-entropy is the negative sum of one distribution's probabilities multiplied by the logarithm of the other's:

$$H(S, R) = -\sum_{x} S(x) \log R(x)$$

We use integrals for continuous variables, where $S$ and $R$ represent probability density functions:

$$H(S, R) = -\int S(x) \log R(x) \, dx$$
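For the discrete case, the negative sum can be written in a few lines. This is a minimal sketch, assuming $S$ plays the role of the distribution that generates the events and $R$ the one used for encoding, with base-2 logarithms so the result is in bits; the function name and the sample values are hypothetical.

```python
import math

def cross_entropy(s, r):
    """Cross-entropy H(S, R) in bits: the negative sum over x of S(x) * log2 R(x)."""
    total = 0.0
    for s_x, r_x in zip(s, r):
        if s_x == 0:
            continue  # events that never occur under S add nothing
        if r_x == 0:
            return math.inf  # R rules out an event that S can produce
        total -= s_x * math.log2(r_x)
    return total

# Hypothetical distributions over three events
s = [0.5, 0.25, 0.25]
r = [0.25, 0.25, 0.5]

print(cross_entropy(s, s))  # 1.5, which is the entropy of S itself
print(cross_entropy(s, r))  # 1.75
```

Encoding events from `s` with a code built for `r` costs more bits on average than encoding them with a code built for `s`.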
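The KL divergence ($D_{KL}$), also known as relative entropy, quantifies the dissimilarity between two probability distributions.
Going back to our two probability distributions $S$ and $R$, $D_{KL}(S \parallel R)$ is the expected value, under $S$, of the log-ratio of the probabilities that $S$ and $R$ assign to each possible event:

$$D_{KL}(S \parallel R) = \mathbb{E}_{x \sim S}\left[\log \frac{S(x)}{R(x)}\right]$$

We use summation for discrete variables:

$$D_{KL}(S \parallel R) = \sum_{x} S(x) \log \frac{S(x)}{R(x)}$$

On the other hand, for continuous variables, we take the integral:

$$D_{KL}(S \parallel R) = \int S(x) \log \frac{S(x)}{R(x)} \, dx$$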
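The discrete summation translates directly into code. The sketch below assumes base-2 logarithms and uses hypothetical values; `kl_divergence` is an illustrative helper, not a library function. It also shows two well-known properties: the divergence of a distribution from itself is zero, and swapping the arguments generally changes the result.

```python
import math

def kl_divergence(s, r):
    """KL divergence D_KL(S || R) in bits: the sum over x of S(x) * log2(S(x) / R(x))."""
    total = 0.0
    for s_x, r_x in zip(s, r):
        if s_x == 0:
            continue  # events that never occur under S add nothing
        if r_x == 0:
            return math.inf  # S allows an event that R considers impossible
        total += s_x * math.log2(s_x / r_x)
    return total

# Hypothetical distributions over three events
s = [0.5, 0.25, 0.25]
r = [0.8, 0.1, 0.1]

print(kl_divergence(s, s))  # 0.0: a distribution does not diverge from itself
print(kl_divergence(s, r))
print(kl_divergence(r, s))  # differs from the previous line: KL is not symmetric
```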
Let’s assume the probability distributions $S$ and $R$ take on the following values:
The cross-entropy is:
The KL divergence:
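To make the calculation concrete, here is a minimal sketch with hypothetical values for $S$ and $R$, chosen purely for illustration so that every logarithm is a whole number. It assumes base-2 logarithms and also checks the standard identity $H(S, R) = H(S) + D_{KL}(S \parallel R)$.

```python
import math

# Hypothetical distributions over four events (illustrative values only)
s = [0.5, 0.25, 0.125, 0.125]
r = [0.25, 0.25, 0.25, 0.25]

# All probabilities are positive here, so no special handling of zeros is needed.
entropy_s = -sum(p * math.log2(p) for p in s)
cross_entropy_sr = -sum(p * math.log2(q) for p, q in zip(s, r))
kl_sr = sum(p * math.log2(p / q) for p, q in zip(s, r))

print(entropy_s)         # 1.75
print(cross_entropy_sr)  # 2.0
print(kl_sr)             # 0.25
```

In this example the cross-entropy (2 bits) equals the entropy of $S$ (1.75 bits) plus the KL divergence (0.25 bits), so the divergence can be read as the extra cost of encoding with the wrong distribution.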
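There are a few similarities between cross-entropy and KL divergence: both compare two probability distributions defined over the same events, neither is symmetric in its arguments, and, for a fixed $S$, both are minimized when $R = S$.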
However, these metrics aren’t the same:
Criterion | Cross-entropy | KL Divergence |
---|---|---|
Meaning | The expected number of bits to encode events from one distribution using another | The difference between two distributions |
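Formulation | $H(S, R) = -\sum_{x} S(x) \log R(x)$ | $D_{KL}(S \parallel R) = \sum_{x} S(x) \log \frac{S(x)}{R(x)}$ |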
Application | Loss function in classification tasks | Used in generative models |
In machine learning, cross-entropy is used as a loss function in classification tasks to calculate the difference between the observed data and the predicted output. It quantifies how close the predicted data is to the observed data. The KL divergence is used in generative models to encourage a model to generate samples more similar to the actual data.
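As a rough illustration of the classification use case, the sketch below computes the cross-entropy between a one-hot label and two sets of predicted class probabilities. The values and the `cross_entropy_loss` helper are hypothetical, and natural logarithms are assumed here, so the result is in nats rather than bits.

```python
import math

def cross_entropy_loss(true_dist, predicted_dist):
    """Cross-entropy in nats between a target distribution and predicted class probabilities.

    Assumes every predicted probability for a class with nonzero target mass is strictly positive.
    """
    return -sum(t * math.log(p) for t, p in zip(true_dist, predicted_dist) if t > 0)

# Hypothetical three-class example: the observed class is the second one
one_hot_label = [0.0, 1.0, 0.0]
confident_correct = [0.05, 0.90, 0.05]
confident_wrong = [0.90, 0.05, 0.05]

print(cross_entropy_loss(one_hot_label, confident_correct))  # small loss
print(cross_entropy_loss(one_hot_label, confident_wrong))    # much larger loss
```

With a one-hot label, the loss reduces to the negative log of the probability assigned to the observed class. Because a one-hot distribution has zero entropy, the cross-entropy and the KL divergence coincide in this special case.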
In this article, we provided an overview of cross-entropy and KL divergence.
There are subtle differences between them. Cross-entropy measures the expected number of bits needed to encode events from one probability distribution using a code optimized for the other. In contrast, KL divergence quantifies how much one probability distribution differs from the other, that is, the extra bits incurred beyond the entropy of the distribution that generates the events.