1. Introduction

In this tutorial, we’ll study the nonlinear activation functions most commonly used in backpropagation algorithms and other learning procedures.

The reasons that led to the use of nonlinear functions have been analyzed in a previous article.

2. Feed-Forward Neural Networks

Backpropagation algorithms operate in fully interconnected Feed-Forward Neural Networks (FFNN):

(Figure: a fully interconnected feed-forward neural network)

with units that have the structure:

(Figure: the structure of an artificial unit)

The \phi function performs a transformation of the weighted sum of the inputs:

    \[\Sigma=w_{0}+\sum_{i=1}^{N}w_{i}x_{i}\]
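In code, a unit is just a few lines. Here is a minimal sketch in Python with NumPy; the function names and the example logistic activation are illustrative choices of ours, not a fixed API:

    import numpy as np

    def unit_output(x, w, w0, phi):
        """Output of a single unit: phi applied to the weighted input sum."""
        s = w0 + np.dot(w, x)  # s = w_0 + sum_i w_i * x_i
        return phi(s)

    # For example, with a logistic activation (one of the functions we discuss below):
    phi = lambda s: 1.0 / (1.0 + np.exp(-s))
    y = unit_output(np.array([0.5, -1.0]), np.array([0.8, 0.2]), w0=0.1, phi=phi)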

We discuss FFNNs in more detail in our article on linear models.

3. Guide to the Activation Function Family

Let’s consider as an example the following nonlinear function:

(Figure: an example of a nonlinear function)

From the considerations made in our article on linear functions, it is clear that a plane (a linear model) cannot approximate the function in the figure.

In the introduction, we used the term “unit” and not “neuron”. The biological analogy is misleading in many cases.

Neural networks are mathematical and statistical procedures and have limited similarity to biological neural networks. This analogy, however, may be useful in some contexts, and we’ll discuss the output of the neuron to justify the choice of certain families of nonlinear functions as activation functions.

Biological neurons transmit an electrical signal. This propagates via the cell membrane through the axon to the downstream neurons. The potential difference and the electrical signal are the result of the neuron's attempt to re-establish electrical equilibrium when the concentrations of K^+ and Na^+ ions change (action potential).

The characteristics and variation over time of this mechanism were explained in 1952 by Hodgkin and Huxley, who received the Nobel Prize for Medicine in 1963 for this work.

3.1. Synapses

The transmission of the signal from one neuron to another takes place through synaptic transmission, which can be of two types. If the two neurons are in direct contact, the electrical signal passes directly between them (electrical synapse). If the two neurons don't touch, the signal propagates through a chemical neurotransmitter (chemical synapse).

An electrical synapse is generally faster, given the physical contact between the two cells, and is bidirectional; it is traditionally associated with reflex responses. However, a signal of this type is not modulated and cannot adapt to changes in environmental conditions.

Evolution has therefore produced the chemical synapse, typical of higher animals, which is unidirectional and allows the signal to be enhanced or inhibited.

Chemical synapses have lower transmission rates than electrical ones, but they allow plasticity and adaptation to environmental conditions, which leads to learning, that is, the calibration of neuronal connections.

If we want to force an analogy, the adjustable weights in an artificial neural network are a mathematical model analogous to biological chemical synapses.

3.2. Step Function

The output of a biological neuron is binary: it either fires or it doesn't.

The membrane potential reaches a maximum value for a very limited time. The inclusion of the concept of time in the modeling of the signal of units in artificial neural networks has led to spiking neural networks, a very active research field. Signal generation occurs when the membrane potential exceeds a threshold, and therefore has a discontinuous nature.

Below we illustrate this behavior, in other words, a step function, which we can consider the equivalent of an activation function in biological neurons:

(Figure: the spiking behavior of a biological neuron, approximated by a step function)
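As a minimal sketch of this all-or-nothing behavior, we can write the step function in Python with NumPy (the threshold value and names are arbitrary illustrative choices):

    import numpy as np

    def step(s, threshold=0.0):
        """The unit fires (1) only when the input sum exceeds the threshold."""
        return np.where(s > threshold, 1.0, 0.0)

    step(np.array([-0.3, 0.0, 0.7]))  # -> array([0., 0., 1.])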

4. Backpropagation

Let's now briefly treat the theory behind backpropagation. Excellent references are the classic text by Krose and Van der Smagt and Bishop's book.

4.1. The Universal Approximation Theorem

Backpropagation is an algorithm for training multilayer neural networks.

However, the need for this type of network seems to contradict the Universal Approximation Theorem. This result establishes that a network with a single hidden layer and nonlinear activation functions can approximate practically any continuous function. The first version was proved by Cybenko in 1989 for sigmoid activation functions.

Unfortunately, theoretical results are one thing and application is another:

  • It's possible to show that the number of units in a single-layer network can grow exponentially with the size of the problem, which in some cases makes the network inapplicable, prevents learning, or yields poor generalization
  • Furthermore, the Universal Approximation Theorem tells us nothing about the learning process

Further research has refined the original Cybenko result.

In 1991, Hornik showed that it's not the specific choice of the activation function but rather the multilayer feed-forward architecture itself that gives neural networks the potential to be universal approximators. These considerations justify the search for algorithms, such as backpropagation, that are applicable to multilayer neural networks.

4.2. Learning Rules

In supervised neural networks, the measured data for the target can be used to calibrate the output of the network given a certain input.

The learning process consists of varying the weights of the net so as to obtain the best possible prediction of that target. This variation occurs through the application of a rule.

Virtually all learning rules for models of this type are a variant of the Hebbian learning rule suggested by Hebb in 1949. Some examples are:

    \[\Delta w_{jk}=\gamma y_{j}y_{k}\]

    \[\Delta w_{jk}=\gamma y_{j}(t_{k}-y_{k})\]

where the weight w_{jk} refers to the connections between the units j and k.

Note that the output of a unit in the layer preceding the current one becomes the input of the current unit (y_j = x_k).

\gamma is a proportionality constant called the learning rate, and it is one of the hyperparameters of the network to be tuned.

t_k is the target measured for the output unit k.

The last rule is particularly intuitive, in that it produces a variation of the weight that depends on the difference between the output and the desired value; it can be derived with a little differential calculus using linear activation functions.
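As a hedged sketch, both rules above can be written as one-line weight updates; here we assume scalar unit outputs, and the function names are our own:

    def hebb_update(w_jk, y_j, y_k, gamma=0.1):
        """Plain Hebbian rule: strengthen the connection when both units are active."""
        return w_jk + gamma * y_j * y_k

    def error_driven_update(w_jk, y_j, y_k, t_k, gamma=0.1):
        """Second rule: change the weight in proportion to the error (t_k - y_k)."""
        return w_jk + gamma * y_j * (t_k - y_k)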

4.3. The Delta Rule

Suppose a network with a single layer that uses the quadratic error between the network output y and the target t as a measure of the goodness of the prediction, summed over the P records of a dataset of measured data:

    \[E=\sum_{p=1}^{P}E^{p}=\frac{1}{2}\sum_{p=1}^{P}(t^{p}-y^{p})^{2}\]

with the network output given by a linear activation function, that is, a weighted sum of the inputs:

    \[y=w_{0}+\sum_{j}w_{j}x_{j}\]

The factor 1/2 in the expression of the error is arbitrary: it cancels the factor of 2 that appears during differentiation.

For a pattern p, the Delta rule connects the weight variation with the error gradient:

    \[\Delta_{p}w_{j}=-\gamma\frac{\partial E^{p}}{\partial w_{j}}\]

The negative sign indicates that we vary the weights so as to minimize the error; for this reason, this approach is also called the gradient descent method.

The derivative can be calculated with the chain rule:

    \[\frac{\partial E^{p}}{\partial w_{j}}=\frac{\partial E^{p}}{\partial y^{p}}\frac{\partial y^{p}}{\partial w_{j}}\]

From the previous expressions for error and output, we have:

    \[\frac{\partial y^{p}}{\partial w_{j}}=x_{j}\]

    \[\frac{\partial E^{p}}{\partial y^{p}}=-(t^{p}-y^{p})\]

which leads to the final expression:

    \[\Delta_{p}w_{j}=\gamma x_{j}(t^{p}-y^{p})\]
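The following toy training loop illustrates the rule on synthetic data; the dataset, learning rate, and number of epochs are arbitrary choices for this sketch:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))             # P = 100 patterns, 3 inputs (toy data)
    t = X @ np.array([1.5, -2.0, 0.5]) + 0.3  # synthetic linear target

    w, w0, gamma = np.zeros(3), 0.0, 0.01
    for epoch in range(200):
        for x_p, t_p in zip(X, t):
            y_p = w0 + w @ x_p                # linear output
            w += gamma * (t_p - y_p) * x_p    # Delta rule: gamma * x_j * (t - y)
            w0 += gamma * (t_p - y_p)         # bias as a weight with constant input 1

    # After training, w should be close to [1.5, -2.0, 0.5] and w0 to 0.3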

4.4. The Credit Assignment Problem

In the previous article mentioned above, we saw that a multilayer network with linear activation functions reduces to a single-layer network with linear activation functions. In this case, the Delta rule derived in the previous section is an adequate learning rule.

If we want to increase the expressive power of the network using multiple layers and nonlinear activation functions, we need to find a learning rule for the weights connecting two arbitrary layers. For the weights connected to the output layer, it's possible in principle to hypothesize a procedure similar to that of the previous section.

The comparison between the unit output and the target allows us to obtain a learning rule. However, for the layers before the output, how do we vary the weights to minimize the error?

In this case, we don't have the target as a guide to direct the search. Furthermore, a variation of the weights in a hidden layer changes that layer's output, which becomes the input of the downstream layers, and so on, up to the output layer. This is problematic because the output layer is the only place where we can use the target to check whether the overall variation of all weights has produced a smaller error.

This difficulty is known as the Credit Assignment Problem, and it posed a major obstacle to the development of what we now call deep learning, up until 1986, when Rumelhart, Hinton, and Williams published a now-classic article describing the backpropagation algorithm. A similar algorithm had been known since the 1970s, but these authors showed how it could be used as a learning procedure.

4.5. The Generalized Delta Rule

The idea is to generalize the Delta rule and use the chain rule to obtain an expression for the variation of the weights even of the hidden layers:

    \[\Delta_{p}w_{jk}=-\gamma\frac{\partial E^{p}}{\partial w_{jk}}\]

For two interconnected layers, with unit indices j and k where layer k follows layer j, the weighted sum generalizes to:

    \[s_{k}^{p}=w_{0k}+\sum_{j}w_{jk}y_{j}^{p}\]

Using generic nonlinear activation functions for network units:

    \[y_{k}^{p}=\phi\left(s_{k}^{p}\right)\]

the chain rule allows obtaining the following expressions for the units of the output layer o and of a hidden layer h:

    \[\frac{\partial E^{p}}{\partial s_{o}^{p}}=-(t_{o}^{p}-y_{o}^{p})\phi_{o}'(s_{o}^{p})\]

    \[\frac{\partial E^{p}}{\partial s_{h}^{p}}=\phi'_{h}(s_{h}^{p})\sum_{o=1}^{N_{o}}\frac{\partial E^{p}}{\partial s_{o}^{p}}w_{ho}\]

The above expressions involve the derivative of the activation function, \phi', and therefore require differentiable (and hence continuous) activation functions.
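Here is a minimal sketch of these two expressions for a single pattern, assuming a network with one hidden layer and tanh units (so that \phi'(s)=1-\tanh^2(s)); the names and array shapes are illustrative:

    import numpy as np

    phi_prime = lambda s: 1.0 - np.tanh(s) ** 2  # derivative of tanh

    def backprop_deltas(y_o, t_o, s_o, s_h, W_ho):
        """Deltas dE/ds for the output and hidden layers (single pattern).

        W_ho has shape (n_hidden, n_output), matching the sum over o of w_ho."""
        delta_o = -(t_o - y_o) * phi_prime(s_o)      # output layer
        delta_h = phi_prime(s_h) * (W_ho @ delta_o)  # hidden layer
        return delta_o, delta_h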

4.6. From Biological Neuron to Nonlinear Artificial Neural Networks

The considerations we've made so far give us a criterion for choosing nonlinear mathematical functions as activation functions. They must be continuous and differentiable, as required by backpropagation, and they should reproduce the trend of the output of a biological neuron.

We’ll study two possible categories: sigmoid functions and the ReLU family.

5. Sigmoid Activation Functions

Sigmoid functions are bounded, differentiable, real functions that are defined for all real input values, and have a non-negative derivative at each point. Here are some important sigmoid functions and their main features.

5.1. Logistic

The logistic function has an output in the range [0:1]:

    \[\phi(s_{k})=\frac{1}{1+e^{-s_{k}}}=\frac{e^{s_{k}}}{1+e^{s_{k}}}\]

It has the advantage of having a smooth gradient but the disadvantage of being computationally expensive.
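A short sketch of the logistic function and its derivative; the closed form \phi'(s)=\phi(s)(1-\phi(s)) is what makes it convenient for backpropagation:

    import numpy as np

    def logistic(s):
        return 1.0 / (1.0 + np.exp(-s))

    def logistic_prime(s):
        """Closed form of the derivative: phi'(s) = phi(s) * (1 - phi(s))."""
        p = logistic(s)
        return p * (1.0 - p)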

5.2. Hyperbolic Tangent

The hyperbolic tangent shares the advantages and disadvantages of the logistic function, but its output is in the range [-1:1]:

    \[\phi(s_{k})=\tanh(s_{k})\]

5.3. Softmax

We can use Softmax in classification problems, typically in the output layer:

    \[\phi(s_{k})=\frac{e^{s_{k}}}{\sum_{j}e^{s_{j}}}\]
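In practice, Softmax is usually computed with the maximum subtracted from the inputs, which avoids overflow without changing the result; a minimal sketch:

    import numpy as np

    def softmax(s):
        """Softmax over a vector of weighted sums; the outputs sum to 1."""
        e = np.exp(s - np.max(s))  # shift for numerical stability
        return e / e.sum()

    softmax(np.array([1.0, 2.0, 3.0]))  # -> array([0.09003057, 0.24472847, 0.66524096])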

5.4. Inverse Hyperbolic Tangent (arctanh)

arctanh is similar to the above but less commonly used:

    \[\phi(s_{k})=\tanh^{-1}(s_{k})=\frac{1}{2}\ln\left(\frac{1+s_{k}}{1-s_{k}}\right)\]

5.5. Gudermannian

The Gudermannian function relates circular functions and hyperbolic functions without explicitly using complex numbers:

    \[\phi(s_{k})=\int_{0}^{s_{k}}\frac{dt}{\cosh t}=2\tan^{-1}\left[\tanh\left(\frac{s_{k}}{2}\right)\right]\]

5.6. Error Function

The error function is also called the Gauss error function. In statistics, for non-negative values of x and a random variable y that is normally distributed with mean 0 and variance 1/2, \mathrm{erf}(x) is the probability that y falls in the range [-x:x]:

    \[\phi(s_{k})=\mathrm{erf}(s_{k})=\frac{2}{\sqrt{\pi}}\int_{0}^{s_{k}}e^{-t^{2}}dt\]

5.7. Generalised Logistic

The generalised logistic function reduces to the logistic function for a=1:

    \[\phi(s_{k})=\left(1+e^{-s_{k}}\right)^{-a},\,a>0\]

5.8. Kumaraswamy Function

The Kumaraswamy function is another generalization of the logistic function. It reduces to the logistic function for a=b=1:

    \[\phi(s_{k})=1-\left[1-\left(\frac{1}{1+e^{-s_{k}}}\right)^{a}\right]^{b}\]

5.9. Smoothstep Function

The smoothstep function is:

    \[\phi(s_{k})=\left\{ \begin{array}{ll} \left(\int_{0}^{1}(1-t^{2})^{N}\,dt\right)^{-1}\int_{0}^{s_{k}}(1-t^{2})^{N}\,dt,\, & |s_{k}|\leq1\\ \mathrm{sgn}(s_{k}), & |s_{k}|\geq1 \end{array}\right.,\,N\geq1\]

5.10. Algebraic Function

And finally the algebraic function:

    \[\phi(s_{k})=\frac{s_{k}}{\sqrt{1+s_{k}^{2}}}\]
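For reference, here is a hedged sketch of several of the listed functions in NumPy; the error function comes from SciPy, which we assume is available:

    import numpy as np
    from scipy.special import erf  # error function (Section 5.6)

    logistic     = lambda s: 1.0 / (1.0 + np.exp(-s))
    gudermannian = lambda s: 2.0 * np.arctan(np.tanh(s / 2.0))
    algebraic    = lambda s: s / np.sqrt(1.0 + s ** 2)
    gen_logistic = lambda s, a=1.0: (1.0 + np.exp(-s)) ** (-a)
    kumaraswamy  = lambda s, a=1.0, b=1.0: 1.0 - (1.0 - logistic(s) ** a) ** b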

5.11. Problems of the Sigmoid Activation Functions

Applying the functions listed above as activation functions generally requires rescaling the dataset of the problem under consideration.

If we use the logistic function, for example, our target must be normalized to the range [0:1] so that the values of the function can approximate it. This need is common to all activation functions, not only sigmoid ones.

However, these functions suffer from a saturation effect for large and small values of s_k, which decreases the resolving power of the network:

(Figure: sigmoid activation functions)

This mechanism leads to the so-called vanishing gradient problem: the gradient shrinks toward zero under certain conditions, which can prevent the learning process. This drawback can be mitigated by using a narrower normalization interval, for example [0.1:0.9] for the logistic function or [-0.9:0.9] for the \tanh.

Furthermore, sigmoid functions are, in general, computationally intensive.
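A quick numeric check makes the saturation effect concrete: the derivative of the logistic function peaks at 0.25 and decays rapidly as |s_k| grows:

    import numpy as np

    def logistic_prime(s):
        p = 1.0 / (1.0 + np.exp(-s))
        return p * (1.0 - p)

    for s in (0.0, 2.0, 5.0, 10.0):
        print(f"s = {s:5.1f}   phi'(s) = {logistic_prime(s):.2e}")
    # phi'(0) = 2.50e-01, while phi'(10) is about 4.54e-05:
    # gradients through saturated units all but vanish.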

6. Rectified Linear Units: the ReLU Family

6.1. General Characteristics

The ReLU family has numerous advantages:

  • biological plausibility
  • better gradient propagation: fewer vanishing gradient problems compared to sigmoid activation functions that saturate in both directions
  • computationally efficient: only comparison, addition, and multiplication
  • scale-invariant

The following figure shows the graphic representation of some of the functions mentioned below:

(Figure: ReLU-family activation functions)

6.2. ReLU

Although it looks like a linear function, ReLU is nonlinear, is differentiable everywhere except at zero, and allows for backpropagation:

    \[\phi(s_{k})=\max(0,s_{k})\]

However, it suffers from some problems. First, the dying ReLU problem: for negative inputs, the gradient of the function is zero, so the network cannot perform backpropagation through the affected units and cannot learn. This is a form of the vanishing gradient problem.

In some cases, large numbers of units in a network can become stuck in dead states, effectively decreasing the model's capacity. This problem typically arises when the learning rate is set too high. It may be mitigated by using Leaky ReLU instead, which assigns a small positive slope for s_{k}<0.

Also, it's unbounded above, which means that it doesn't have a maximum value.
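A sketch of ReLU and its (sub)derivative makes the dying ReLU problem visible: the gradient is exactly zero over the whole negative half-axis:

    import numpy as np

    def relu(s):
        return np.maximum(0.0, s)

    def relu_prime(s):
        """Subgradient of ReLU; the zero branch is where units can 'die'."""
        return np.where(s > 0, 1.0, 0.0)

    relu_prime(np.array([-2.0, 3.0]))  # -> array([0., 1.]): no gradient for s < 0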

6.3. Leaky ReLU

Leaky ReLU mitigates the dying ReLU problem by setting a small slope in the negative part, which permits the application of backpropagation for negative values:

    \[\phi(s_{k})=\left\{ \begin{array}{ll} s_{k}, & s_{k}>0\\ 0.01s_{k}, & \mathrm{otherwise} \end{array}\right.\]

However, it does not allow consistent predictions for s_k < 0.

6.4. Parametric ReLU

The parametric ReLU makes the parameter a part of the learning process, instead of fixing it to an arbitrary value as the Leaky ReLU does:

    \[\phi(s_{k})=\left\{ \begin{array}{ll} s_{k}, & s_{k}>0\\ as_{k}, & \mathrm{otherwise} \end{array}\right.\]
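Both variants (Sections 6.3 and 6.4) differ from ReLU only in the negative branch; a minimal sketch, where for the parametric version a would be updated by the learning procedure rather than fixed:

    import numpy as np

    def leaky_relu(s, slope=0.01):
        """Fixed small negative-side slope keeps a nonzero gradient everywhere."""
        return np.where(s > 0, s, slope * s)

    def prelu(s, a):
        """Same form, but a is a parameter learned during training."""
        return np.where(s > 0, s, a * s)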

6.5. Noisy ReLU

Another extension of ReLU is the Noisy ReLU. The main difference is that the output contains noise drawn from a Gaussian distribution \mathcal{N} with zero mean and standard deviation \sigma(s_k):

    \[\phi(s_{k})=\max(0,s_{k}+\mathcal{N}(0,\sigma(s_{k})))\]

6.6. ELU

Exponential linear units (ELU) try to make the mean activations closer to zero, which speeds up learning. ELUs can obtain higher classification accuracy than ReLUs:

    \[\phi(s_{k})=\left\{ \begin{array}{ll} s_{k}, & s_{k}>0\\ a(e^{s_{k}}-1), & \mathrm{otherwise} \end{array}\right.,\,a>0\]

6.7. Softplus or SmoothReLU

Softplus (or SmoothReLU) is a smooth approximation to the ReLU function with a remarkable feature: its derivative is the logistic function:

    \[\phi(s_{k})=\ln\left(1+e^{s_{k}}\right)\]
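We can verify this derivative property numerically; the sketch below compares a finite-difference derivative of Softplus with the logistic function at an arbitrarily chosen point:

    import numpy as np

    softplus = lambda s: np.log1p(np.exp(s))  # ln(1 + e^s)
    logistic = lambda s: 1.0 / (1.0 + np.exp(-s))

    s, h = 1.3, 1e-6
    numeric = (softplus(s + h) - softplus(s - h)) / (2 * h)
    print(numeric, logistic(s))  # the two values agree to about 6 decimal places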

6.8. Swish Function

The Swish function was developed by Google, and it offers superior performance at the same level of computational efficiency as the ReLU function:

    \[\phi(s_{k})=\frac{s_{k}}{1+e^{-s_{k}}}\]
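A one-line sketch; with the scaling parameter beta set to 1 (our assumption here), this form matches the expression above and is also known as SiLU:

    import numpy as np

    def swish(s, beta=1.0):
        """Swish: s * logistic(beta * s); beta = 1 gives the expression above."""
        return s / (1.0 + np.exp(-beta * s))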

7. Complex Nonlinear Activation Functions

7.1. Complex Problems

All the activation functions that we considered previously are real. But there are application fields that use models with complex functions.

A typical example is electromagnetic systems, many of which model some type of wave phenomenon with amplitude and phase. The complex character derives from the fact that waves are periodic functions that can be represented by complex exponentials, thanks to Euler's formula:

    \[e^{ix}=\cos x+i\sin x\]

where i is the imaginary unit.

Other application fields where we can use complex activation functions include electromagnetic and optical waves; electrical signals in analog and digital circuits; electron waves; superconductors; quantum computation; sonic and ultrasonic waves; periodic topology and metrics; high stability in recurrent dynamics; chaos and fractals; and quaternions.

7.2. Activation Functions in Complex-Valued Neural Networks

We can treat intrinsically complex problems with a neural network in several ways. For example, we can design a network that contains the real and imaginary components of the model as separate output units, using only real functions; however, it is also possible to treat the complex values directly.

The topic is very broad. The important issue is that it is possible to derive an extension to backpropagation that makes use of complex nonlinear activation functions, resulting in the so-called Complex-Valued Neural Networks (CVNN).

These activation functions use the expressions of some of the sigmoid functions that we analyzed in the previous sections. Each unit of the network produces a complex output, which is aggregated and becomes the complex input of the units in the next layer.

Normally, we consider two main forms of functions, the real-imaginary-type activation functions:

    \[\phi(s_{k})=f_{\Re}(s_{k})+if_{\Im}(s_{k})=\tanh(\Re(s_{k}))+i\tanh(\Im(s_{k}))\]

and amplitude-phase-type activation functions:

    \[\phi(s_{k})=\tanh(|s_{k}|)e^{i\arg(s_{k})}\]

where \arg is the argument of the complex number s_k, and \Re and \Im are its real and imaginary components.
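Both forms are easy to express with NumPy's native complex arithmetic; a minimal sketch with illustrative names:

    import numpy as np

    def real_imaginary_type(s):
        """Apply tanh separately to the real and imaginary parts."""
        return np.tanh(s.real) + 1j * np.tanh(s.imag)

    def amplitude_phase_type(s):
        """Squash the amplitude with tanh while preserving the phase."""
        return np.tanh(np.abs(s)) * np.exp(1j * np.angle(s))

    z = np.array([1.0 + 2.0j, -0.5 + 0.1j])
    real_imaginary_type(z), amplitude_phase_type(z)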

8. Conclusion

In this tutorial, we’ve given an overview of the nonlinear activation functions used in the backpropagation algorithm.

Instead of simply listing the mathematical functions and their characteristics, we've tried to give a coherent approach to the question, highlighting the problems and needs that, from a logical point of view, lead to the structure of this treatment.

Some lesser-known extensions, such as complex activation functions, have more recently made it possible to extend the application of feed-forward neural networks and the backpropagation algorithm to new problems.

We can also discover many other nonlinear activation functions to train networks with algorithms other than backpropagation. For example, Radial Basis Functions (RBF) use Gaussian functions; however, in this article, we focused on the backpropagation mechanism.
