1. Introduction

In this tutorial, we’ll go through maxout, a widely used extension of the ReLU activation function in deep learning. We’ll present its mathematical formulation, illustrate it with a concrete example, and discuss its primary advantages and limitations.

2. What Is Maxout?

In an effort to develop a more reliable activation function than ReLU and improve neural network performance, Ian Goodfellow and his co-authors first proposed the maxout activation function in the 2013 paper “Maxout Networks”. The authors develop an activation that applies multiple ReLU activation functions to the input and takes the maximum value among them as the output.

Mathematically, maxout is defined as:

(1)   \begin{equation*} f(x) = \max(w_1 x + b_1, w_2 x + b_2, \dots, w_k x + b_k) \end{equation*}

where x is the input, and w_1, w_2, \dots, w_k and b_1, b_2, \dots, b_k are the weights and biases of the k ReLU activation functions.

It should be noted that the network learns the weight and bias values during the training phase by employing a method termed backpropagation. In contrast, \textbf{k} is a hyperparameter: it isn’t learned, but must be set before the training process starts. The choice of k is crucial in the architecture of the neural network since it also determines the network’s complexity. A model with a higher k can capture more features of the input data, but it also runs a greater risk of overfitting.
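To make equation (1) concrete, here’s a minimal NumPy sketch of a single maxout unit. The function name maxout and the tensor shapes are illustrative assumptions for this tutorial, not part of the original paper:

```python
import numpy as np

def maxout(x, W, b):
    """Maxout over k affine pieces, as in equation (1).

    x: input vector of shape (d,)
    W: weight tensor of shape (k, m, d), one (m, d) matrix per piece
    b: bias matrix of shape (k, m)
    Returns an output vector of shape (m,).
    """
    # Compute all k affine transformations w_i x + b_i at once: shape (k, m)
    z = np.einsum('kmd,d->km', W, x) + b
    # For each of the m output units, keep the maximum over the k pieces
    return z.max(axis=0)
```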

3. Example of Maxout Algorithm

Let’s say we have an input vector x  = \begin{bmatrix} 1 & 2 & 3 & 4 \end{bmatrix}^{T}. We’ll apply k = 2 ReLU activation functions. Also, suppose that

w_1 = \begin{bmatrix} 1 & 2 & 3 & 4 \\ 5 & 6 & 7 & 8 \end{bmatrix} , b_1 = \begin{bmatrix} -1 & -1 \end{bmatrix}^{T}  and  w_2 = \begin{bmatrix} 2 & 3 & 4 & 5 \\ 6 & 7 & 8 & 9 \end{bmatrix} , b_2 = \begin{bmatrix} 1 & 1 \end{bmatrix}^{T}.

Each ReLU function replaces any negative values of the corresponding product w_i x + b_i with zero:

ReLU_1(x) = max(0, w_1 x + b_1) = max(0, \begin{bmatrix} 29 & 69 \end{bmatrix}^{T}) = \begin{bmatrix} 29 & 69 \end{bmatrix}^{T}

and ReLU_2(x) = max(0, w_2 x + b_2) = max(0, \begin{bmatrix} 41 & 81 \end{bmatrix}^{T}) = \begin{bmatrix} 41 & 81 \end{bmatrix}^{T}

To obtain the maxout output, we apply the element-wise max function over ReLU_1 and ReLU_2:

MaxOut(x) = max(ReLU_1(x), ReLU_2(x)) = \begin{bmatrix} 41 & 81 \end{bmatrix}^{T}.

Note that in real-world applications, x, w, and b have much larger dimensions, which depend mainly on the complexity of the problem and the deep learning architecture.
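For completeness, here’s a short NumPy snippet that reproduces the steps of the example above; the variable names are chosen only for this illustration:

```python
import numpy as np

x = np.array([1, 2, 3, 4])

w1 = np.array([[1, 2, 3, 4],
               [5, 6, 7, 8]])
b1 = np.array([-1, -1])

w2 = np.array([[2, 3, 4, 5],
               [6, 7, 8, 9]])
b2 = np.array([1, 1])

# Each ReLU unit: max(0, w_i x + b_i)
relu1 = np.maximum(0, w1 @ x + b1)     # [29 69]
relu2 = np.maximum(0, w2 @ x + b2)     # [41 81]

# Maxout output: element-wise maximum over the two ReLU outputs
print(np.maximum(relu1, relu2))        # [41 81]
```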

4. Advantages and Disadvantages

Maxout activation comes with some benefits as well as some limitations. First of all, using maxout as the activation function allows the network to learn multiple features of the input, which improves overall performance. Moreover, maxout improves the model’s robustness and generalization, while its complexity can be controlled with the k hyperparameter.

On the other hand, maxout is computationally expensive due to the application of multiple ReLU activation functions. Another limitation is the hyperparameter tuning of the network: selecting \textbf{k} is time-consuming and computationally demanding. Lastly, the interpretability of the network is reduced. As the complexity of the model increases, it becomes difficult to debug and understand how the deeper parts of the network work and make predictions.

5. Conclusion

Although maxout has a number of benefits, the choice of activation function ultimately depends on the task and the design of the particular problem.

In this tutorial, we introduced the maxout activation function, discussed an example, and analyzed its main advantages and disadvantages.
