1. Introduction

In this tutorial, we’ll study the Information Bottleneck Principle (IB). This principle allows for a qualitative understanding and provides quantitative evidence of how a multilevel neural network (DNN) works internally. The result is the clarification of a limit that can be used as a guiding principle in the training of a DNN.

IB is directly related to another principle that we can consider of an even more qualitative nature: the principle of Minimum Mutual Information (MinMI).

2. MinMI Principle

The basic problem in any predictive system based on neural networks is the identification of an unknown function that realizes an optimal mapping between input (X) and output (Y) of a dataset. The training process consists of the identification of a series of internal parameters of the neural network that allow reaching this optimum. We’ll call in the next h_i each intermediate or hidden layer of the network:

img 625a4367a699c

What happens in the hidden levels of the network and why this process works so well is largely unknown. This is the reason why they’re called black-box models.

The MinMI principle or Minimum Information Principle has been applied in the context of neural coding. It considers a basic quantity of interest, relevant for the identification of the optimum, the mutual information between input and output, that we can define respectively for the discrete and continuous cases as:


    \[I(X;Y)=\int\int p(x,y)\ln\frac{p(x,y)}{p(x)p(y)}\,dxdy\]

where x, y are random variables, p (x, y) is the joint probability, and p (x) and p (y) are the marginal probabilities. Mutual information is always positive.

I (X; Y) is a measure of the mutual dependence between the two variables. More specifically, it quantifies the amount of information that we obtain about one random variable by observing the other random variable.

Suppose a set of DNNs consistent with the observations and compatible with the problem. Each is characterized by a set of internal parameters that are the subject of the training procedure. MinMI establishes that the optimal structure is given by the one with the minimum mutual information.

2.1. MinMI Principle. Why?

The question in the title of this section is justified. At a superficial glance, since I (X; Y) is a measure of the dependence between X and Y, we might expect to maximize mutual information and not minimize it.

However, this is not the case. Of all the possible DNNs that we can build compatible with the problem, most structures explicit a map between inputs and outputs that contain additional superstructures to the real relationship in the data. Effects such as noise and collinearity are obstacles to achieving optimum.

This fact can be further clarified if we refer to the dimensionality of the dataset. Normally, the input X has a high dimensionality while the output Y has a low dimensionality. This means that, in general, most of X entropy is not very informative about Y. The relevant features of X are highly distributed and difficult to extract.

These aberrations contribute to I (X; Y) since, being forms of information (albeit not desired) that bind X and Y, they increase the value of I (X; Y). The minimization of mutual information brings our neural network closer, therefore, to the identification of the mapping that contains only the relevant information in order to build an efficient predictive system, that is, the true relationship present in the data.

2.2. Compression

One way of putting these considerations into practice, which also explains the efficiency of DNNs and other similar predictive systems, is to put the network in the condition of having to do compression.

Using non-formal language, suppose that the structure of a DNN is, in the hidden levels, somehow “neuron deficient”. In these conditions, it is not possible, in general, to transmit all the information present in the data from one hidden level to the next. The training process leads the DNN to seek a compromise, which is expressed as a compression of the original information.

Compression means loss of information, but it is a “controlled” loss since our control parameter is, in general, a measure of the deviation of the prediction with respect to the measured data, which exerts continuous pressure during the training process. The final result is a decrease in the value of I (X; Y), a value that is given by a structure that contains a relationship that is closer to the true relationship present in the data than the one before compression and where many of the superstructures have been eliminated.  Repeating this process on all hidden layers further calibrates the whole process.

In this discussion, it is implicitly assumed that there’s, of course, some relationship between X and Y. In other words, we know that I (X; Y)> 0. If X and Y are independent, then p (x, y) = p (x) p (y) and we’ve:


Under these conditions, there’s no relationship to find, and it is not possible to build a predictive system.

The concepts of compression and minimization of I (X; Y) lead us directly to the IB principle.

3. IB Principle

3.1. Data Processing Inequality and Markov Chains

Data Processing Inequality (DPI) is an information theory concept that states that no processing of data can increase its entropy. From the point of view of a dataset and of the predictive systems we’re considering, it can be translated as “post-processing on data cannot increase information”.

When we’ve processes in which we choose symbols by a set of probabilities, we deal with stochastic processes. When the choice of symbols for a stochastic process depends on the symbols or events previously chosen, we’ve got a Markov process.

If three random variables form a Markov chain, X \rightarrow Y \rightarrow Z, then the conditional probability of Z depends only on Y and is conditionally independent of X. Under these conditions, no process on Y can increase the information that Y contains about X, and the DPI can be formalized as:

    \[I (X; Y) \geq I (X; Z)\]

If we denote by I (X; Y | Z) the residual information between X and Y, i.e., the relevant information not captured by Z, then the preceding expression achieves equality for I (X; Y | Z) = 0, that is when it is verified that Y and Z contain the same amount of information about X.

3.2. Minimal Sufficient Statistics for the Input

The compression of input X allows capturing the relevant features, eliminating those irrelevant for the purposes of the prediction of Y. The MinMI principle states that this process leads to a decrease of I (X; Y). The minimum of this quantity allows us to identify the simplest mapping of X, which we’ll call \hat {X}, which captures the mutual information I (X; Y). \hat {X} is the minimum sufficient statistics of X with respect to Y.

The DPI allows us to qualitatively understand the reason for the MinMI principle since:

    \[I (X; Y) \geq I (\hat {X}; Y)\]

If we denote the output prediction with \hat {Y}, the DPI also provides another important relationship:

    \[I (X; Y) \geq I (Y; \hat {Y})\]

with equality if and only if \hat {X} is sufficient statistics.

We can view the process of identifying \hat {X} and prediction as a Markov chain:

    \[X \rightarrow \hat {X} \rightarrow Y\]

This approach is problematic. For a generic distribution p (X, Y) may not exist an exact minimal sufficient statistics, resulting in an incorrect Markov chain. It is possible, however, to identify \hat {X} by an alternate way.

3.3. Minimum Condition for Minimal Sufficient Statistics

Let us consider the Markov chain:

    \[Y \rightarrow X \rightarrow \hat {X}\]

We can consider the search for \hat {X} as the minimization of I (X; \hat {X}). This criterion is not sufficient on its own, as we can reduce this quantity by eliminating relevant information from the input. We must have another condition.

On the other hand, even if the identification of \hat {X} is generally given by the compression process and the minimization of I (X; Y), it is also true that sufficient statistics must be the most informative possible, that is, I (\hat {X}; Y) must be maximum.

These two conditions allow us to construct the following Lagrangian:

    \[\mathcal {L} [p (\hat {x} | x)] = I (X; \hat {X}) - \beta I (\hat {X}; Y )\]

with \beta> 0, being a problem-dependent parameter that balances the complexity of the representation, I (X; \hat {X}), and the amount of preserved relevant information, I (\hat {X}; Y). This function has a minimum that can be found through a variational procedure. If we consider the relevant information not captured by \hat {X}, it is possible to write the equivalent expression:

    \[\mathcal {L} [p (\hat {x} | x)] = I (X; \hat {X} ) + \beta I (X; Y | \hat {X})\]

We thus have a minimum criterion that can be applied to the optimization of a DNN.

3.4. IB in the DNNs

The previous discussion allows us to increase the understanding of some well-established heuristics in the training of neural networks, such as the search for architectures that are as compact as possible. In fact, the IB principle teaches us that a DNN learns by extracting the most informative features, approximating the minimal sufficient statistics of X.

In a DNN, each level depends solely on the output of the previous level. We can study it, therefore, as a Markov process:

    \[X \rightarrow h_ {j} \rightarrow h_ {i} \rightarrow Y\]

with i> j.

Since, according to the DPI, passing one level to the next cannot increase entropy, we can write:

    \[I (Y; X) \geq I (Y; h_ {j}) \geq I (Y; h_ { i}) \geq I (Y; \hat {Y})\]

We achieve equality at each pass if each level is a sufficient statistic of its input.

Each level, therefore, must convey the greatest possible amount of relevant information, while minimizing the complexity of the representation. That is, each level must maximize I (Y; h_ {i}) while minimizing I (h_ {i-1}; h_ {i}) (again, the MinMI principle. Note that this last quantity is the equivalent of I (X; Y) within a single network layer).

4. Conclusion

In this tutorial, we present a brief overview of the fundamental issues underlying the IB principle. It is a formalism with great explanatory potential on the internal functioning of DNNs, but at the same time, allows us to quantify what happens during the training process.

The complexity of the topic does not allow a complete discussion. Among the issues that we haven’t considered are the equations relating to the limits of generalization, which we can derive from the formalism, and the analysis of the BI distortion curve, from which it is possible to identify the bifurcation points, which can assimilate to phase transitions between different network topologies.

These aspects can be the starting point for further study for the interested reader.

Comments are open for 30 days after publishing a post. For any issues past this date, use the Contact form on the site.