1. Introduction

In this tutorial, we’ll study weight initialization techniques in artificial neural networks and why they’re important.

Initialization has a great influence on the speed and quality of the optimization achieved by the network training process.

2. Basic Notation

To illustrate the discussion, we’ll refer to classic fully connected feed-forward networks, such as the one illustrated in the following figure:

[Figure: a fully connected feed-forward neural network]

Each unit of the network applies a non-linear transformation (activation function) to a weighted sum of the outputs x_{i} of the units of the previous layer to generate its own output y:

    \[y=\mathcal{F}\left(w_{0}+\sum_{i}w_{i}x_{i}\right)\]

The bias is treated as an additional unit with a constant output of 1 and weight w_{0}, and it acts as a y-intercept. Without it, the model generated by the network is forced to pass through the origin of the problem space, that is, the point (\mathbf{x}=0, y=0). The bias adds flexibility and allows modeling datasets for which this condition does not hold.
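To make this concrete, here is a minimal Python sketch of the computation performed by a single unit; the \tanh activation and the numerical values are arbitrary illustrative choices, not part of any specific network:

    import numpy as np

    # Output of a single unit: a non-linear activation of the bias plus
    # the weighted sum of the previous layer's outputs.
    def unit_output(x, w, w0):
        return np.tanh(w0 + np.dot(w, x))   # y = F(w0 + sum_i w_i * x_i)

    x = np.array([0.2, -1.0, 0.7])   # outputs of the previous layer
    w = np.array([0.5, 0.1, -0.3])   # connection weights
    w0 = 0.1                         # bias weight

    print(unit_output(x, w, w0))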

3. Breaking the Symmetry

There are essentially two extreme choices for initializing the weights of a neural network: assign a single value to all the weights, or generate them randomly within a certain range.

Best practice recommends using a random set of weights, with the biases initialized to zero. The reason lies in the need to “break the symmetry”, that is, the need to make each neuron perform a different computation. Under symmetric conditions, training can be severely penalized or even impossible.

Breaking the symmetry has two different aspects, depending on the scale at which we consider the question: from the point of view of the connections of a single network or the point of view of different networks.

3.1. Breaking the Symmetry Within the Units of a Network

If all units of the network have the same initial parameters, then a deterministic learning algorithm applied to a deterministic cost and model will keep updating all of these units in the same way. Let’s see why.

In our article on nonlinear activation functions, we studied the classic learning mechanism based on the Delta rule (gradient descent), which provides a procedure for updating weights based on the presentation of examples.

Assuming for simplicity a single-layer network with linear activation functions, and taking as a measure of the goodness of the prediction the quadratic error between the network output, y, and the target, t, over the P records of a dataset of measured data:

    \[E=\sum_{p=1}^{P}E^{p}=\frac{1}{2}\sum_{p=1}^{P}(t^{p}-y^{p})^{2}\]

the Delta rule provides the following expression for updating the weights:

    \[\Delta_{p}w_{j}=-\gamma\frac{\partial E^{p}}{\partial w_{j}}=-\gamma\frac{\partial E^{p}}{\partial y^{p}}\frac{\partial y^{p}}{\partial w_{j}}=\gamma x_{j}(t^{p}-y^{p})\]

where \gamma is the learning rate.
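As a concrete illustration, here is a minimal NumPy sketch of a single Delta-rule update for one linear unit; the learning rate, inputs, weights, and target are made-up values:

    import numpy as np

    gamma = 0.1                       # learning rate
    x = np.array([1.0, 0.5, -0.3])    # inputs x_j of the unit
    w = np.array([0.2, -0.1, 0.4])    # current weights
    t = 1.0                           # target for this pattern

    y = np.dot(w, x)                  # linear activation: weighted sum
    delta_w = gamma * x * (t - y)     # Delta rule: gamma * x_j * (t - y)
    w = w + delta_w

    print(y, delta_w, w)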

Suppose we initialize all weights in the network with the same value. Then, regardless of the functional form chosen for the activation function, the difference t^{p}-y^{p} will be identical for all units, and the updated weights will all have the same numerical value.

We can think of this “symmetrical situation” as a constraint: practice shows that it is harmful and prevents optimal training.
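The following toy sketch makes this concrete: in a small two-layer linear network where both hidden units start with identical weights, their weights remain identical after every gradient update (the data, layer sizes, and learning rate are arbitrary choices):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(8, 2))        # toy inputs
    t = rng.normal(size=(8, 1))        # toy targets

    W1 = np.full((2, 2), 0.5)          # both hidden units start identical
    W2 = np.full((2, 1), 0.5)
    gamma = 0.01

    for _ in range(100):
        h = X @ W1                      # hidden layer (linear, for simplicity)
        y = h @ W2                      # network output
        err = y - t
        grad_W2 = h.T @ err             # gradient of the quadratic error
        grad_W1 = X.T @ (err @ W2.T)
        W2 -= gamma * grad_W2
        W1 -= gamma * grad_W1

    # The two hidden units (the columns of W1) never diverge from each other.
    print(np.allclose(W1[:, 0], W1[:, 1]))   # True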

3.2. Breaking the Symmetry in Different Networks

Identifying the optimal neural network for a problem requires a test campaign, trying different structures and parameterizations to identify the network that produces the smallest error. The procedure can be automated with, for example, a genetic algorithm, which proposes different solutions and puts them in competition.

Let’s suppose instead that we carry out different trials using the same structure, the same parameters, and the same initial weights. In this case, all networks will have the same starting point in the error space of the problem.

As we saw in the previous section, many training algorithms follow the variation of the error gradient as the weights change. Starting from the same point means that the gradient’s direction will always be the same, or very similar, between different trials, and the weights will then be updated in the same way.

This is another aspect of a symmetrical situation. Choosing different initial weights allows each trial to explore the error space in different ways and increases the probability of finding optimal solutions.

4. Random Initialization

The previous sections showed why we need to initialize the weights randomly, but within what interval? The answer largely depends on the activation functions that our neural network uses.

Let’s consider the \tanh as an example:

[Figure: the hyperbolic tangent activation function]

The curve flattens out for extreme values of its argument: when x is very large or very small, variations of x lead to only small variations in \tanh(x) (the vanishing gradient problem).

This fact gives us a criterion for the initialization range of the weights, which should lie in an intermediate range. Some authors recommend [-1, 1], others an interval as small as [-0.1, 0.1]. If we use the logistic activation function or the \tanh, a range of [-0.5, 0.5] is adequate for most uses.
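A minimal sketch of this plain random initialization, assuming a fixed range of [-0.5, 0.5] and arbitrary layer sizes, could look like this:

    import numpy as np

    rng = np.random.default_rng(42)

    # Draw a weight matrix uniformly from [-limit, limit].
    def init_uniform(n_in, n_out, limit=0.5):
        return rng.uniform(-limit, limit, size=(n_in, n_out))

    W = init_uniform(64, 32)
    print(W.min(), W.max())   # all values fall inside [-0.5, 0.5]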

5. Advanced Random Initialization Techniques

The random initialization illustrated in the previous section considers the generated weights equally probable within the selected range. This is equivalent to random generation according to a uniform distribution.

Other probability laws can be used, such as the Gaussian distribution. In this last case, the weights are not generated within an interval but normally distributed with a certain variance.

The techniques illustrated below give an estimate of these limits of variability: the half-width of the interval for the uniform distribution, \mathcal{U}, and the standard deviation for the Gaussian one, \mathcal{N}.

5.1. Xavier-Bengio Initialization

Xavier-Bengio initialization, also known as Xavier initialization or Glorot initialization (after Xavier Glorot and Yoshua Bengio), can be used for the logistic activation function and the hyperbolic tangent. The authors derived it under the assumption of linear activation functions.

The logic of Xavier’s initialization method is to keep the variance of the inputs and outputs of each layer equal, in order to avoid the vanishing gradient problem and other aberrations.

We’ll call 2\Delta the width of the variability interval for weights following a uniform distribution (the interval [-\Delta, \Delta]) and \sigma the standard deviation in the case of a normal distribution with zero mean:

    \[\mathbf{W}\in\mathcal{U}\left[-\Delta,\Delta\right],\,\mathbf{W}\sim\mathcal{N}\left(0,\sigma\right)\]

For the logistic function, Young-Man Kwon, Yong-woo Kwon, Dong-Keun Chung, and Myung-Jae Lim give the following expressions:

    \[\mathbf{W}\in\mathcal{U}\left[-\sqrt{\frac{6}{n_{i}+n_{o}}},\sqrt{\frac{6}{n_{i}+n_{o}}}\right],\,\mathbf{W}\sim\mathcal{N}\left(0,\sqrt{\frac{6}{n_{i}+n_{o}}}\right)\]

where \mathbf{W} is the weight matrix, and n_{i} and n_{o} are the number of input and output connections for a given network layer, also called \mathrm{fan_{in}} and \mathrm{fan_{out}} in the technical literature.

For the \tanh we have:

    \[\mathbf{W}\in\mathcal{U}\left[-\sqrt[4]{\frac{6}{n_{i}+n_{o}}},\sqrt[4]{\frac{6}{n_{i}+n_{o}}}\right],\,\mathbf{W}\sim\mathcal{N}\left(0,\sqrt[4]{\frac{6}{n_{i}+n_{o}}}\right)\]

Note that the \Delta and \sigma parameters act as scale parameters applied to a specific probability distribution.

However, other expressions are more common in the technical literature. In particular, for the normal distribution:

    \[\mathbf{W}\sim\mathcal{N}\left(0,\sqrt{\frac{1}{n_{i}}}\right)\]

with a variant given by:

    \[\mathbf{W}\sim\mathcal{N}\left(0,\sqrt{\frac{2}{n_{i}+n_{o}}}\right)\]
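A minimal sketch of Xavier-style initialization, following the uniform interval quoted above and the normal variant with standard deviation \sqrt{2/(n_{i}+n_{o})}; the function names and layer sizes are arbitrary illustrative choices:

    import numpy as np

    rng = np.random.default_rng(0)

    # Uniform Xavier initialization: limit = sqrt(6 / (n_i + n_o)).
    def xavier_uniform(n_in, n_out):
        limit = np.sqrt(6.0 / (n_in + n_out))
        return rng.uniform(-limit, limit, size=(n_in, n_out))

    # Normal Xavier initialization: std = sqrt(2 / (n_i + n_o)).
    def xavier_normal(n_in, n_out):
        std = np.sqrt(2.0 / (n_in + n_out))
        return rng.normal(0.0, std, size=(n_in, n_out))

    W_u = xavier_uniform(256, 128)
    W_n = xavier_normal(256, 128)
    print(W_u.std(), W_n.std())   # both close to sqrt(2 / (n_i + n_o))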

5.2. He Initialization

Also called Kaiming initialization, this method is named after a famous paper by Kaiming He et al. published in 2015. It is similar to Xavier’s initialization, except that it uses a different scaling factor for the weights.

He et al. derived their initialization method by explicitly modeling the non-linearity of ReLUs, which otherwise makes extremely deep models (more than 30 layers) difficult to converge. The method is therefore associated with these activation functions.

Young-Man Kwon, Yong-woo Kwon, Dong-Keun Chung, and Myung-Jae Lim give the expressions:

    \[\mathbf{W}\in\mathcal{U}\left[-\sqrt{2}\sqrt{\frac{6}{n_{i}+n_{o}}},\sqrt{2}\sqrt{\frac{6}{n_{i}+n_{o}}}\right],\,\mathbf{W}\sim\mathcal{N}\left(0,\sqrt{2}\sqrt{\frac{6}{n_{i}+n_{o}}}\right)\]

Here, too, it is more common to use the following expression, suitable for the normal distribution:

    \[\mathbf{W}\sim\mathcal{N}\left(0,\sqrt{\frac{2}{n_{i}}}\right)\]

There are solid theoretical justifications for this technique. Given that a proper initialization method should avoid reducing or magnifying the magnitudes of the input signals exponentially (the vanishing and exploding gradient problems), He et al. arrived in their work at the following condition to avoid this type of aberration:

    \[\frac{1}{2}n_{i}\sigma^{2}=1\]
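Solving this condition for the standard deviation of the weights of a layer with n_{i} input connections gives:

    \[\sigma^{2}=\frac{2}{n_{i}}\;\Rightarrow\;\sigma=\sqrt{\frac{2}{n_{i}}}\]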

which is exactly the expression given above.

It is also possible to obtain a more general expression, given by:

    \[\mathbf{W}\sim\mathcal{N}\left(0,\sqrt{\frac{2}{(1+a^{2})n_{i}}}\right)\]

where a is the negative slope of the rectifier used after the current layer; a = 0 for the standard ReLU, which leads back to the expression above.
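A minimal sketch of He-style normal initialization, including the more general form with the negative slope a (a = 0 recovers the standard ReLU case; the Leaky ReLU slope of 0.01 and the layer sizes below are just illustrative values):

    import numpy as np

    rng = np.random.default_rng(0)

    # He normal initialization: std = sqrt(2 / ((1 + a^2) * n_i)).
    def he_normal(n_in, n_out, a=0.0):
        std = np.sqrt(2.0 / ((1.0 + a**2) * n_in))
        return rng.normal(0.0, std, size=(n_in, n_out))

    W_relu = he_normal(256, 128)             # ReLU: std = sqrt(2 / n_i)
    W_leaky = he_normal(256, 128, a=0.01)    # illustrative Leaky ReLU slope
    print(W_relu.std(), W_leaky.std())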

6. Other Forms of Initialization

Many other methods have been proposed, and scientific packages make many of these techniques available. For example, Keras offers the following initializers:

  • Zeros: initialization to 0
  • Ones: initialization to 1
  • Constant: initialization to a constant value
  • RandomNormal: initialization with a normal distribution
  • RandomUniform: initialization with a uniform distribution
  • TruncatedNormal: initialization with a truncated normal distribution
  • VarianceScaling: initialization capable of adapting its scale to the shape of weights
  • Orthogonal: initialization that generates a random orthogonal matrix
  • Identity: initialization that generates the identity matrix
  • lecun_uniform: LeCun uniform initializer
  • glorot_normal: Xavier normal initializer
  • glorot_uniform: Xavier uniform initializer
  • he_normal: He normal initializer
  • lecun_normal: LeCun normal initializer
  • he_uniform: He uniform variance scaling initializer
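As a usage sketch, assuming a working TensorFlow/Keras installation, an initializer can be selected per layer through the kernel_initializer and bias_initializer arguments; the layer sizes and activations here are arbitrary:

    from tensorflow import keras

    model = keras.Sequential([
        keras.layers.Dense(64, activation="tanh",
                           kernel_initializer="glorot_uniform",  # Xavier uniform
                           bias_initializer="zeros",
                           input_shape=(20,)),
        keras.layers.Dense(64, activation="relu",
                           kernel_initializer="he_normal"),      # He normal
        keras.layers.Dense(1),
    ])

    model.summary()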

7. Conclusion

In this article, we gave an overview of some weight initialization techniques for neural networks. What may appear to be a secondary topic actually affects the quality of the results and the speed of convergence of the training process.

All these techniques have solid theoretical justifications and are aimed at mitigating or solving highly studied technical problems, such as the vanishing gradient problem.
