Introduction

In this tutorial, we’ll take a look at the concept of entropy and its applications within different branches of computer science.

Entropy is connected to the information theory developed by Shannon, which arose from problems related to communication.

Communication

We can define communication as all the processes through which a mechanism actively enters into a relationship with another mechanism.

A communication system contains three basic elements: source, communication channel, and receiver.

Information theory is about answering questions regarding the measurement of information and its transmission, with or without noise.

Combinatorial Entropy

The derivation of the expression for information entropy allows us to understand its meaning.

Consider the possible fixed-length sequences of three different symbols: A, B, and C. Suppose that we have two instances of the symbol A, five of the symbol B, and three of the symbol C. Two possible sequences are:

B C B B C C A B A B

B A C B A C B B C B

Each of these arrangements is called a permutation. Since symbols of the same type are indistinguishable, the number of distinct permutations is given by the multinomial coefficient. For our example, the number of permutations is:

    \[P=\left(\begin{array}{c} 10\\ 2,5,3 \end{array}\right)=\frac{10!}{2!5!3!}=2520\]

In general, for a sequence of length N formed by L distinct symbols of multiplicity n_{i}, we have:

    \[P=\left(\begin{array}{c} N\\ n_{1},n_{2},\ldots,n_{L} \end{array}\right)=\frac{N!}{\prod_{i=1}^{L}n_{i}!}\]
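As a quick sanity check, here is a minimal Python sketch (the function name is ours) that evaluates the multinomial coefficient directly from its factorial definition:

```python
from math import factorial

def multinomial(*counts):
    """Number of distinct arrangements of a multiset with the given symbol counts."""
    total = factorial(sum(counts))
    for c in counts:
        total //= factorial(c)
    return total

# Two As, five Bs, and three Cs in a sequence of length 10
print(multinomial(2, 5, 3))  # 2520
```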

Suppose we want to uniquely encode each of the possible P permutations using a binary string. A string of length b bits can encode 2^{b} different possibilities.

In order to encode all P permutations, we need 2^{b}\geq P. Taking the base-2 logarithm (and ignoring the rounding to a whole number of bits), the length of the binary string must be:

    \[b=\log_{2}P\]

Spreading this string over the N symbols that make up our sequence, the average number of bits per symbol is:

    \[H_{c}=\frac{1}{N}\log_{2}P=\frac{1}{N}\log_{2}\frac{N!}{\prod_{i=1}^{L}n_{i}!}\]

H_{c} is the combinatorial entropy and has units of bits/symbol. It expresses the average length of a binary string needed to encode each symbol of the original sequence.
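To make this concrete, a short Python sketch (the printed value is approximate) that computes P and H_{c} for the example above:

```python
from math import factorial, log2

counts = [2, 5, 3]            # n_i: multiplicities of A, B, C
N = sum(counts)               # sequence length, 10

P = factorial(N)
for n in counts:
    P //= factorial(n)        # multinomial coefficient, 2520

H_c = log2(P) / N             # combinatorial entropy in bits/symbol
print(round(H_c, 3))          # ~1.13
```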

Entropy in Information Theory

Entropy in information theory, also called information entropy or Shannon entropy, is denoted by H and can be derived from combinatorial entropy.

Expanding the logarithm in the expression for H_{c}, we can write:

    \[H_{c}=\frac{1}{N}\left(\log_{2}N!-\sum_{i=1}^{L}\log_{2}n_{i}!\right)\]

We can approximate the logarithm of the factorial with Stirling's formula for sufficiently large N:

    \[\log_{2}N!=\frac{1}{\ln2}\ln N!=\frac{1}{\ln2}\left(N\ln N-N+\mathcal{O}(\ln N)\right)\]

Considering that N=\sum_{i=1}^{L}n_{i}, we have:

    \[\begin{aligned} H&\simeq\frac{1}{\ln2}\frac{1}{N}\left(N\ln N-N-\sum_{i=1}^{L}n_{i}\ln n_{i}+\sum_{i=1}^{L}n_{i}\right)\\ &=\frac{1}{\ln2}\frac{1}{N}\left(\left(\sum_{i=1}^{L}n_{i}\right)\ln N-\sum_{i=1}^{L}n_{i}\ln n_{i}\right)\\ &=\frac{1}{\ln2}\frac{1}{N}\sum_{i=1}^{L}n_{i}\left(\ln N-\ln n_{i}\right)\\ &=-\sum_{i=1}^{L}\frac{n_{i}}{N}\log_{2}\frac{n_{i}}{N} \end{aligned}\]

Taking into account that the ratio \frac{n_{i}}{N} is nothing more than the probability of occurrence of the i-th symbol in the sequence of length N, we obtain the final general expression for entropy:

    \[H=-\sum_{i}p_{i}\log_{2}p_{i}\]

Given the approximation that we have introduced, the information entropy is an upper bound on the combinatorial entropy: H\geq H_{c}.

H is often expressed using the natural logarithm:

    \[H=-\sum_{i}p_{i}\ln p_{i}\]

In this case, the units are of nats/symbol. Both bits and nats are dimensionless units.
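The following Python sketch computes H in bits and in nats from the relative frequencies of the earlier example (two As, five Bs, three Cs); the printed values are approximate:

```python
from math import log, log2

counts = {"A": 2, "B": 5, "C": 3}
N = sum(counts.values())
probs = [n / N for n in counts.values()]   # p_i = n_i / N

H_bits = -sum(p * log2(p) for p in probs)  # ~1.49 bits/symbol
H_nats = -sum(p * log(p) for p in probs)   # ~1.03 nats/symbol, i.e. H_bits * ln(2)
print(H_bits, H_nats)
```

Note that this value (about 1.49 bits/symbol) is larger than the combinatorial entropy of the same counts (about 1.13 bits/symbol), consistent with H\geq H_{c}.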

Entropy in the Continuous Case

In the passage from the discrete case to the continuous case, Shannon simply replaced the summation with an integral:

    \[h(x)=-\int p(x)\ln p(x)\,dx\]

This quantity, generally denoted by the symbol h, is called differential entropy. The integral implies the use of a probability density function instead of a discrete probability distribution.

There is a problem here. Consider that p( x ) is given by a normal probability density of mean \mu and variance \sigma^{2}:

    \[p(x)=\mathcal{N}(\mu;\sigma^{2})=\frac{1}{\sqrt{2\pi\sigma^{2}}}\exp\left\{ -\frac{(x-\mu)^{2}}{2\sigma^{2}}\right\}\]

Then, calculating the integral, the differential entropy is:

    \[h_{\mathcal{N}}=\ln(\sqrt{2\pi e}\sigma)\]

Note that the above expression is less than zero for \sigma<\sqrt{\frac{1}{2\pi e}}. Hence, differential entropy, unlike discrete entropy, can be negative. Moreover, h is not generally invariant under change of variable.

Care should, therefore, be taken in applying differential entropy. Its uncritical use as a simple generalization of discrete entropy can give rise to unexpected results.
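As a quick numeric illustration of this point, the sketch below evaluates h_{\mathcal{N}}=\ln(\sqrt{2\pi e}\sigma) for two arbitrary values of \sigma:

```python
from math import e, log, pi, sqrt

def gaussian_diff_entropy(sigma):
    """Differential entropy of a normal density, in nats: ln(sqrt(2*pi*e) * sigma)."""
    return log(sqrt(2 * pi * e) * sigma)

print(gaussian_diff_entropy(1.0))  # ~ 1.42 nats (positive)
print(gaussian_diff_entropy(0.1))  # ~ -0.88 nats (negative: 0.1 < 1/sqrt(2*pi*e) ~ 0.242)
```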

Jaynes proposed a corrected version of differential entropy, known as relative entropy, to solve these problems:

    \[h(x)=-\int p(x)\ln\frac{p(x)}{m(x)}\,dx\]

The expression above is scale-invariant, as long as we measure p( x ) and m( x ) on the same interval. We can consider m( x ) as an a priori probability, often in the form of the uniform probability density.

The problems of differential entropy arise from the fact that an absolute measure of entropy in the continuum is not mathematically justified. Relative entropy measures the differential entropy relative to an arbitrary reference level, given by m( x ).

Entropy and Information

Shannon’s work originates, as Von Neumann noted (he also suggested the name “entropy” to Shannon), from Boltzmann’s observation in statistical physics that entropy is “missing information”, since it is related to the number of possible alternatives for a physical system.

Information in information theory is not related to meaning. Shannon wrote:

“The semantic aspects of communication are irrelevant to the technical ones.”

Information is a measure of the freedom we have when choosing a message. So, it is not about what is being transmitted, but what could be transmitted. It is not related to individual messages.

If we have a choice between two possible messages, we can arbitrarily take the choice of one of the two as a unit quantity of information (\log_{2}2, or one bit, a term proposed by John W. Tukey). As we have seen, the bit arises when a message is encoded as a binary string.

6.1. Ergodic Processes

The symbol sequences of a message have, in general, different probabilities. We compose a message by choosing among the possible symbols of an alphabet. In real processes, the probability of choosing a symbol is not independent of previous choices.

Think of a message written in English, in which we compose the words with the symbols of the usual alphabet. The probability that after a j there is a vowel is much higher than the probability that there is an x, for example.

Processes in which we choose symbols according to a set of probabilities are stochastic processes. When the choice of a symbol depends on the symbols or events previously chosen, we have a Markov process. If a Markov process produces statistics that become independent of the particular sample as the number of events grows large, we have an ergodic process.

Entropy is, therefore, a measure of uncertainty, surprise, or information related to a choice between a certain number of possibilities when we consider ergodic processes.

Maximum Entropy

H is maximum if the possibilities involved are equally likely. For two options, we have:

    \[H=-p\ln p-(1-p)\ln(1-p)\]

which, plotted as a function of p, shows a maximum at p=0.5.


We can generalize this result to cases with more than two probabilities involved. It is possible to demonstrate that this maximum is unique.

If we use concrete functional forms of probability density, we have:

  1. The entropy of the normal density that we have seen previously, h_{\mathcal{N}}, is maximum among all densities with the same variance. In other words, the maximum entropy distribution under constraints of mean and variance is the Gaussian.
  2. The entropy of the uniform density is the greatest among all probability densities defined on the same finite support. Since a density that is constant over the whole real line cannot be normalized (the integral \int_{-\infty}^{\infty}\ldots\,dx does not exist), we compute its entropy on a finite range [a,b]. In this case, the density is p=\frac{1}{b-a}, and the differential entropy is:

    \[h_{\max}=h_{U}=-\int_{a}^{b}\frac{1}{b-a}\ln\frac{1}{b-a}\,dx=\ln(b-a)\]

which depends on the extent of the selected range. In the discrete case, the maximum entropy, for N events and entropy expressed in nats, is:

    \[H_{\max}=H_{U}=H\left(\frac{1}{N},\ldots,\frac{1}{N}\right)=\ln(N)\]
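A brief sketch comparing, for N=4 outcomes, the entropy of the uniform distribution with that of a skewed one (the skewed probabilities are arbitrary), confirming that the uniform case reaches \ln(N):

```python
from math import log

def entropy_nats(probs):
    """Discrete entropy in nats; zero-probability outcomes contribute nothing."""
    return -sum(p * log(p) for p in probs if p > 0)

N = 4
uniform = [1 / N] * N
skewed = [0.7, 0.1, 0.1, 0.1]

print(entropy_nats(uniform), log(N))  # both ~1.386 nats
print(entropy_nats(skewed))           # ~0.940 nats, strictly lower
```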

The Inevitable Physical Analogy

We have highlighted in the previous sections the influence of thermodynamics in the early stages of the development of information theory. However, information entropy is of enormous importance in other areas of physics as well.

8.1. Thermodynamic Entropy

The thermodynamic entropy for an ideal gas is given by the Boltzmann-Planck equation:

    \[S=k_{B}\ln W\]

where k_{B} is the Boltzmann constant, and W is the number of microstates corresponding to the gas macrostate. The Boltzmann formula shows the relationship between entropy and the number of ways the atoms or molecules of a thermodynamic system can be arranged.

Note the similarity between the meaning of the Boltzmann equation and the meaning we gave to combinatorial entropy, starting from the concept of permutation.

8.2. Order, Probability, and Entropy

We know that, in an isolated system, the disorder, or entropy, increases with every irreversible physical process until it reaches a maximum. The same result holds for irreversible processes in adiabatic systems, in which there is no heat transfer with the outside. This fact has important macroscopic consequences.

If we inject a gas into a container full of air and wait long enough, the gas spontaneously diffuses into the air until it reaches the same concentration at all points.

Another example is the contact between two bodies at different temperatures. In this case, heat flows between the two bodies until their temperatures, which we can consider a measure of the concentration or density of energy, become equal. If the two bodies have different masses, they will have different amounts of energy at the end of the process, but the energy per unit volume will be the same.

The second principle of thermodynamics often manifests itself through physical processes that tend to equalize some property across the system. The result is the vanishing of the gradient of some physical observable. In isolated systems, the processes leading to an increase in entropy are spontaneous. The maximum entropy corresponds to thermodynamic equilibrium.

If we regard the universe as an adiabatic, isolated system, then once maximum entropy is reached there will no longer be any energy gradient to drive a spontaneous process. This conjecture is known as the heat death, or entropic death, of the universe.

Note the parallel between maximum thermodynamic entropy, with physical properties equal at all points of the system, and maximum information entropy, which derives from equality of the probabilities.

8.3. Uncertainty and Quantum Mechanics

The uncertainty principle is a fundamental natural limit. It states that pairs of physical observables (conjugate variables) cannot both be measured with arbitrary precision or certainty.

Since entropy is a measure of uncertainty, it is no coincidence that we can formulate the uncertainty principle in terms of information entropy (entropic formulations).

Starting from the inequality proposed independently by Bourret, Everett, Hirschman, and Leipnik, satisfied by every function \psi and its Fourier transform \widetilde{\psi}, which represent the coordinate-space and momentum-space descriptions respectively:

    \[-\int\left|\psi(x)\right|^{2}\ln\left|\psi(x)\right|^{2}\,dx-\int\left|\widetilde{\psi}(p_{x})\right|^{2}\ln\left|\widetilde{\psi}(p_{x})\right|^{2}\,dp_{x}=h(x)+h(p_{x})\geq1+\ln\pi\]

Beckner, and Białynicki-Birula and Mycielski, demonstrated:

    \[h(x)+h(p_{x})\geq\ln(\pi e\hbar)\]

where \hbar is the reduced Planck constant.

Since the differential entropy of the normal probability density, as we saw above, is maximum among all distributions with the same variance, for a generic probability density p(z) we can write the inequality

    \[h(z)\leq\ln(\sqrt{2\pi e}\Delta z)\]

Substituting the last equation into the previous one, we have:

    \[\Delta x\Delta p\geq\frac{1}{2\pi e}\exp\left\{ h(x)+h(p_{x})\right\} \geq\frac{\hbar}{2}\]

That is the formulation of the uncertainty principle.

Entropy Properties

Some fundamental properties of entropy are:

  1. Continuity: changing the values of the probabilities by a minimal amount should only change the entropy by a small amount.
  2. Symmetry: the measure is invariant under re-ordering of the outcomes.
  3. Additivity: the amount of entropy should be independent of how we consider the process and how we divide it into parts.

Other Forms of Entropy

Given two random variables x and y, we can define their marginal probabilities p( x ) and p( y ), the joint probability p( x,y ), and the conditional probabilities p( x|y ) and p( y|x ), linked by the relationship:

    \[p(x,y)=p(x|y)p(y)=p(y|x)p(x)\]

Since entropy is a function of distributions or probability densities, we can define analogous variants.

Joint entropy is a measure of the uncertainty associated with a set of variables. For discrete and continuous cases, it takes the form:

    \[H(x,y)=-\sum_{x}\sum_{y}p(x,y)\ln p(x,y)\]

    \[h(x,y)=-\iint p(x,y)\ln p(x,y)\,dxdy\]

where:

    \[H(x,y)\leq H(x)+H(y)\]

    \[h(x,y)\leq h(x)+h(y)\]

Equality is valid only if x and y are statistically independent.

Conditional entropy or equivocation quantifies the amount of information needed to describe the outcome of a random variable y given the value of another random variable x:

    \[H(x|y)=-\sum_{x}\sum_{y}p(x,y)\ln\frac{p(x,y)}{p(y)},\,H(y|x)=-\sum_{x}\sum_{y}p(x,y)\ln\frac{p(x,y)}{p(x)}\]

    \[h(x|y)=-\iint p(x,y)\ln p(x|y)\,dxdy,\,h(y|x)=-\iint p(x,y)\ln p(y|x)\,dxdy\]

Some important properties are:

    \[h(y|x)=h(x,y)-h(x)\]

    \[h(x|y)=h(x,y)-h(y)\]

with equivalent properties for the discrete case.

A very important quantity in information theory is mutual information (MI), which quantifies the amount of information obtained about one random variable by observing another random variable. It is a measure of the mutual dependence between the two variables:

    \[I(x,y)=\sum_{x}\sum_{y}p(x,y)\ln\frac{p(x,y)}{p(x)p(y)}\]

    \[I(x,y)=\iint p(x,y)\ln\frac{p(x,y)}{p(x)p(y)}\,dxdy\]

MI is non-negative and symmetric, I(x,y)=I(y,x), and is linked to the different forms of entropy through the relationship:

    \[I(x,y)=h(x)-h(x|y)=h(y)-h(y|x)\]
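The sketch below builds these quantities from a small, arbitrary 2x2 joint probability table (the numbers are purely illustrative) and numerically checks the relation I(x,y)=H(x)-H(x|y):

```python
from math import log

# Illustrative joint distribution p(x, y) over two binary variables
p_xy = [[0.3, 0.2],
        [0.1, 0.4]]

p_x = [sum(row) for row in p_xy]                # marginal p(x)
p_y = [sum(col) for col in zip(*p_xy)]          # marginal p(y)

H_x = -sum(p * log(p) for p in p_x)
H_y = -sum(p * log(p) for p in p_y)
H_joint = -sum(p * log(p) for row in p_xy for p in row)
H_x_given_y = H_joint - H_y                     # H(x|y) = H(x,y) - H(y)

I = sum(p_xy[i][j] * log(p_xy[i][j] / (p_x[i] * p_y[j]))
        for i in range(2) for j in range(2))    # mutual information

print(abs(I - (H_x - H_x_given_y)) < 1e-12)     # True
```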

Communication and Transmission of Information

The ratio between the actual entropy of a source and its maximum possible entropy is the relative entropy of the source. One minus the relative entropy is the redundancy.

For a communication channel of capacity C bits/s and a source with entropy H bits/symbol, the average transmission speed over an undisturbed channel is bounded above by \frac{C}{H} symbols per second. The closer we get to this limit, the longer the encoding process becomes.

A paradoxical fact is that in the presence of noise in the communication channel, the entropy of the received message is greater than that transmitted by the source. So, the received message has more information than the one sent because the disturbance introduces an entropy added to that of the pure source. The additional information is unwanted (noise), but from a formal point of view, it contributes to the total entropy of the message received.

Data Compression

Data compression is a notable example of the application of entropy and information theory concepts.

To transmit a message from a source to a receiver, we use a communication channel. Transmission involves a prior process of encoding the message, and we can try to reduce the original size (compression) through an algorithm.

Compression can be lossy or lossless. Taking audio file compression as an example, MP3 is a lossy format, while FLAC is a lossless one.

We will not study any concrete algorithm, but we will see the general laws governing data compression mechanisms.

12.1. Entropy and Coding

In coding, each symbol of the message is identified by a code or codeword, which can be of fixed or variable length. For example, we can encode the first 4 letters of the alphabet with A = 00, B = 01, C = 10, D = 11.

If each symbol is identified by probability p_{i} and codeword length l_{i}, the average codeword length is:

    \[L=\sum_{i=0}^{N-1}p_{i}l_{i}\]

The Source Coding Theorem or Shannon’s First Theorem establishes that it’s not possible to represent the outputs of an information source S by a source code whose average length is less than the source entropy:

    \[H(S)+1>L\geq H(S)\]

The average codeword length is limited below by the entropy of the source. The lower H( S ) is, the lower the value of L that we can theoretically reach. Hence, messages with lower entropy are more compressible.

12.2. Data Compression

Coding efficiency is:

    \[\eta=\frac{H(S)}{L}\]

The value of H( S ) is the limit to aim for when developing a compression algorithm.
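As an illustration, consider a hypothetical source whose symbol probabilities are exact negative powers of two, encoded with the codewords 0, 10, 110, 111. A short sketch comparing L with H( S ):

```python
from math import log2

probs   = [0.5, 0.25, 0.125, 0.125]   # p_i
lengths = [1, 2, 3, 3]                # l_i for codewords 0, 10, 110, 111

L = sum(p * l for p, l in zip(probs, lengths))   # average codeword length
H = -sum(p * log2(p) for p in probs)             # source entropy, bits/symbol

print(L, H, H / L)   # 1.75, 1.75, efficiency = 1.0
```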

An intuitive result is that if every encoded string has only one possible source string producing it, then we have unique decodability. An example of this type of encoding is the prefix code.

A prefix code is a code in which no codeword is the prefix of any other codeword. For example, the variable-length code A = 110, B = 11 is not a prefix code, because the codeword of B is a prefix of the codeword of A (the substring 11).

In this context, an important result is the Kraft-McMillan inequality:

    \[\sum_{i=0}^{N-1}2^{-l_{i}}\leq1\]

If codeword lengths satisfy the Kraft-McMillan inequality, then we can construct a prefix code with these codeword lengths. This ensures the uniqueness of the decoding.
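A minimal check of the Kraft-McMillan inequality for two illustrative sets of codeword lengths:

```python
def kraft_sum(lengths):
    """Kraft-McMillan sum for binary codewords of the given lengths."""
    return sum(2 ** -l for l in lengths)

print(kraft_sum([1, 2, 3, 3]))  # 1.0  -> a prefix code with these lengths exists
print(kraft_sum([1, 1, 2]))     # 1.25 -> no uniquely decodable code has these lengths
```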

If, for each symbol i, we have p_{i}=2^{-l_{i}}, then we can prove that L=H( S ) and \eta=1. When this condition does not hold, we can increase efficiency by extending the source, for example using a binary Shannon-Fano code.

The Shannon-Fano code for blocks of n symbols has an expected codeword length L_{n} less than 1+H( S^{n} )=1+nH( S ), the latter equality stemming from the additivity of entropy over independent symbols. The expected codeword length per original source symbol will be less than:

    \[\frac{L_{n}}{n}=\frac{1+H(S^{n})}{n}=\frac{1+nH(S)}{n}=\frac{1}{n}+H(S)\]

By choosing n large enough, we can make this as close to the entropy H( S ) as we like, so that \eta approaches 1. The price to pay is an increase in the complexity of the decoding process.

12.3. A Compression Scheme

As a conclusion of this section, we report a classic compression scheme.

Suppose we have an alphabet \Sigma of symbols, each with probability p_{i}, and consider the problem of encoding with the binary alphabet \Gamma\equiv\{0,1\}. For an integer n>0 and real numbers \alpha>0 and \delta\in(0,1), an encoding scheme f and a decoding scheme g can be formalized as follows:

    \[f:\Sigma^{n}\rightarrow\Gamma^{m}\]

    \[g:\Gamma^{m}\rightarrow\Sigma^{n}\]

This pair forms an (n,\alpha,\delta) compression scheme for the p_{i} if the following conditions are met:

    \[m=\alpha n\]

    \[\Pr\left[g(f(x_{1},x_{2},\ldots,x_{n}))=(x_{1},x_{2},\ldots,x_{n})\right]>1-\delta\]

In practice, the last condition requires that, for a sequence of n symbols drawn according to the probabilities p_{i}, decoding the encoded sequence recovers the original sequence with probability greater than 1-\delta.

Entropy and Machine Learning

13.1. Cross-Entropy Error Function

In classifiers such as neural networks, we can use error functions other than the quadratic error.

It is usual to minimize error functions E constructed from the negative logarithm of the likelihood function, starting from independent conditional probabilities of the target given the input p( t | x ):

    \[p(\mathbf{t}|\mathbf{x})=\prod_{k=1}^{C}p(t_{k}|\mathbf{x})\]

    \[E=-\sum_{n}\sum_{k=1}^{C}\ln p(t_{k}|\mathbf{x})\]

where \mathbf{x},\mathbf{t} are vectors of the n records of the dataset and C is the number of classes (network output units).

If we consider that the target belongs to only two mutually exclusive classes, C_{1} and C_{2}, the conditional probability is:

    \[p(\mathbf{t}|\mathbf{x})=\prod_{k=1}^{2}y_{k}^{t_{k}}(1-y_{k})^{1-t_{k}}\]

where y_{k} is the network output for unit k, a function only of the input \mathbf{x}, and t_{k}=1 if the input belongs to class C_{1}, t_{k}=0 if it belongs to class C_{2}. The error function becomes:

    \[E=-\sum_{n}\sum_{k=1}^{2}\left[t_{k}\ln y_{k}+(1-t_{k})\ln(1-y_{k})\right]\]

E is the cross-entropy error function and has a form that we have already seen in the discussion on the maximum value of entropy.
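A minimal sketch of this error function in its common single-output form, where y is the network's estimate of the probability of class C_{1} (the target and output values are illustrative, and the clipping constant is a detail we add to avoid \ln 0):

```python
from math import log

def cross_entropy_error(targets, outputs, eps=1e-12):
    """E = -sum_n [ t_n ln y_n + (1 - t_n) ln(1 - y_n) ]."""
    total = 0.0
    for t, y in zip(targets, outputs):
        y = min(max(y, eps), 1 - eps)   # keep y strictly inside (0, 1)
        total -= t * log(y) + (1 - t) * log(1 - y)
    return total

targets = [1, 0, 1, 1]            # t_n: 1 for class C1, 0 for class C2
outputs = [0.9, 0.2, 0.7, 0.99]   # y_n: network outputs
print(cross_entropy_error(targets, outputs))  # ~0.70
```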

By analyzing the continuous case, we can demonstrate that the minimum of this error function is the same as that for the quadratic error.

13.2. Principle of Maximum Entropy

The Maximum Entropy or MaxEnt principle, proposed by Jaynes, establishes that the probability distribution that best represents knowledge within a problem is the one with the maximum entropy. Since the output of a neural network is, under certain conditions, an approximation to the conditional probability of the target given the input, the MaxEnt principle constitutes potentially an optimization criterion.

The MaxEnt principle is not without criticism. This led to the proposal of two similar principles:

  1. The principle of maximum mutual information (MaxMI), in which the optimal probability density is the one with the maximum MI between network outputs and targets.
  2. The principle of minimum mutual information (MinMI), in which the optimal probability density is the one with the minimum MI between network outputs and inputs.

In the latter case, if we know the marginal probabilities of inputs and targets (reasonable condition), MinMI is equivalent to the MaxEnt principle.

13.3. The Error of an Estimator

Considering a target t and an estimator given, for example, by the output y of a neural network that depends on an input x, it is possible to obtain a lower bound on the expected quadratic error that depends on the conditional differential entropy:

    \[\left\langle (t-y(x))^{2}\right\rangle \geq\frac{1}{2\pi e}\exp\left\{ 2h(y|x)\right\}\]

This expression is important in quantum mechanics, which again underlines the importance of the connection between physics and information theory.

Conclusion

In this tutorial, we’ve made an overview of the concept of entropy and the implications of its use in different branches of computer science. Given the historical development of information theory, the analysis of entropy cannot ignore the meaning of thermodynamic entropy.

The topic is extensive and cannot, of course, be dealt with exhaustively in a single article. Nevertheless, the themes dealt with in the text can give ideas for further insights.
