1. Introduction

In this tutorial, we’ll go through maxout, a widely used extension of the ReLU activation function in deep learning. We’ll present its mathematical formulation, illustrate it with a concrete example, and discuss its primary advantages and limitations.

2. What Is Maxout?

In an effort to develop a more reliable activation function than ReLU and improve neural network performance, Ian Goodfellow and his co-authors first proposed the maxout activation function in the 2013 paper “Maxout Networks”. The authors develop an activation that applies multiple ReLU activation functions to the input and takes the maximum value among them as the output.

Mathematically, maxout is defined as:

(1)   \begin{equation*} f(x) = \max(w_1 x + b_1, w_2 x + b_2, \dots, w_k x + b_k) \end{equation*}

where x is the input, and w_1, w_2, \dots, w_k and b_1, b_2, \dots, b_k are the weights and biases of the k ReLU activation functions.

It should be noted that the network learns the weight and bias values during the training phase by employing a method termed backpropagation. In contrast, \textbf{k} is a hyperparameter: it isn’t learned, but must be set before the training process starts. The choice of k is crucial in the architecture of the neural network since it also determines the network’s complexity. A model with a higher k can capture more features of the input data, but it also runs a greater risk of overfitting.
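To make equation (1) concrete, here’s a minimal NumPy sketch of a single maxout unit. The function name maxout and the tensor shapes are illustrative assumptions for this tutorial, not part of the original paper:

```python
import numpy as np

def maxout(x, W, b):
    """Maxout over k affine pieces, as in equation (1).

    x: input vector of shape (d,)
    W: weight tensor of shape (k, m, d), one (m, d) matrix per piece
    b: bias matrix of shape (k, m)
    Returns an output vector of shape (m,).
    """
    # Compute all k affine transformations w_i x + b_i at once: shape (k, m)
    z = np.einsum('kmd,d->km', W, x) + b
    # For each of the m output units, keep the maximum over the k pieces
    return z.max(axis=0)
```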

3. Example of Maxout Algorithm

Let’s say we have an input vector x  = \begin{bmatrix} 1 & 2 & 3 & 4 \end{bmatrix}^{T}. We’ll apply k = 2 ReLU activation functions. Also, suppose that

w_1 = \begin{bmatrix} 1 & 2 & 3 & 4 \\ 5 & 6 & 7 & 8 \end{bmatrix} , b_1 = \begin{bmatrix} -1 & -1 \end{bmatrix}^{T}  and  w_2 = \begin{bmatrix} 2 & 3 & 4 & 5 \\ 6 & 7 & 8 & 9 \end{bmatrix} , b_2 = \begin{bmatrix} 1 & 1 \end{bmatrix}^{T}.

Each ReLU function replaces any negative values of the corresponding product w_i x + b_i with zero:

ReLU_1(x) = max(0, w_1 x + b_1) = max(0, \begin{bmatrix} 29 & 69 \end{bmatrix}^{T}) = \begin{bmatrix} 29 & 69 \end{bmatrix}^{T}

and ReLU_2(x) = max(0, w_2 x + b_2) = max(0, \begin{bmatrix} 41 & 81 \end{bmatrix}^{T}) = \begin{bmatrix} 41 & 81 \end{bmatrix}^{T}

To obtain the maxout output, we apply the element-wise max function over ReLU_1 and ReLU_2:

MaxOut(x) = max(ReLU_1(x), ReLU_2(x)) = \begin{bmatrix} 41 & 81 \end{bmatrix}^{T}.

Note that in real-world applications, x, w, and b have much larger dimensions, which depend mainly on the complexity of the problem and the deep learning architecture.
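For completeness, here’s a short NumPy snippet that reproduces the steps of the example above; the variable names are chosen only for this illustration:

```python
import numpy as np

x = np.array([1, 2, 3, 4])

w1 = np.array([[1, 2, 3, 4],
               [5, 6, 7, 8]])
b1 = np.array([-1, -1])

w2 = np.array([[2, 3, 4, 5],
               [6, 7, 8, 9]])
b2 = np.array([1, 1])

# Each ReLU unit: max(0, w_i x + b_i)
relu1 = np.maximum(0, w1 @ x + b1)     # [29 69]
relu2 = np.maximum(0, w2 @ x + b2)     # [41 81]

# Maxout output: element-wise maximum over the two ReLU outputs
print(np.maximum(relu1, relu2))        # [41 81]
```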

4. Advantages and Disadvantages

Maxout activation comes with some benefits as well as some limitations. First of all, using maxout as the activation function allows the network to learn multiple features of the input, which improves overall performance. Moreover, maxout improves the model’s robustness and generalization, while its complexity can be controlled with the k hyperparameter.

On the other hand, maxout is computationally expensive due to the application of multiple ReLU activation functions. Another limitation is the hyperparameter tuning of the network: selecting \textbf{k} is time-consuming and computationally demanding. Lastly, the interpretability of the network is reduced. As the complexity of the model increases, it becomes difficult to debug and understand how the deeper parts of the network work and make predictions.

5. Conclusion

Although maxout has a number of benefits, the choice of activation function ultimately depends on the task and the design of the particular problem.

In this tutorial, we introduced the maxout activation function, discussed an example, and analyzed its main advantages and disadvantages.
