What Is the Time Complexity for Training a Neural Network Using Backpropagation?

1. Overview

In this tutorial, we’ll explore the time complexity of training a neural network using backpropagation.

We’ll discuss the mathematical foundations, analyze the factors affecting the time complexity, and provide examples to illustrate the concepts.

2. Understanding Neural Networks and Backpropagation

Before we analyze the time complexity, let’s briefly revisit the basics of backpropagation. We’ll assume that all training data is used in each training iteration – that is, we’re using batch gradient descent, and the descent is not stochastic.

Backpropagation is an algorithm that trains neural networks by adjusting the weights to minimize the error between the predicted and actual outputs. In our neural network, the weights are associated with layers, so we denote the weight connecting the neuron $i$ in layer $\ell$ to neuron $j$ in layer $\ell+1$ as $w_{ij}^{(\ell)$ .

The weights are updated using gradient descent:

(1) $\begin{equation*} w_{ij}^{(\ell)\ \text{new}} = w_{ij}^{(\ell)\ \text{old}} - \eta \frac{\partial J}{\partial w_{ij}^{(\ell)}} \end{equation*}$

Where:

$\eta$ is the learning rate – it controls the step size during the weight update; a lower learning rate means smaller updates and potentially more precise convergence but may require more iterations
$J$ is the loss function (e.g., mean squared error or cross-entropy) – it quantifies the difference between the network’s predictions and the actual target values; the goal of training is to minimize this loss
$\frac{\partial J}{\partial w_{ij}^{(\ell)}}$ is the partial derivative of the loss for the weight $w_{ij}^{(\ell)}$ – it measures how sensitive the loss function is to changes in the weight; this gradient tells us the direction to adjust $w_{ij}^{(\ell)}$ to decrease the loss

2.1. Backpropagation

The backpropagation algorithm involves main phases: the forward and backward passes.

In a forward pass, we compute the network outputs by passing the input object to the input layer, through each hidden layer, and to the output layer. This phase uses the current weights and biases to generate predictions.

The second step is a backward pass. Starting from the output layer, we compute the gradient $\frac{\partial J}{\partial w_{ij}^{(\ell)}}$ of the loss function concerning each weight $w_{ij}^{\ell}$ (and bias). In doing so, we apply the chain rule of calculus.

Then, we adjust each weight $w_{ij}^{(\ell)}$ by subtracting the product of the learning rate $\eta$ and the gradient. This step moves the weights in the direction that reduces the loss.

By iteratively performing these two phases over multiple epochs and training samples, the network learns the optimal weights and biases that minimize the loss function, improving its predictions on the training data.

Understanding these parameters and their roles in the network is crucial because they directly impact the computational complexity of training. The number of weights and biases determines the computations required during forward and backward passes, influencing the overall time complexity.

3. Time Complexity in Training Neural Networks

Training a neural network can be computationally intensive. It’s important to distinguish between time complexity and actual training time.

Time complexity is the computational time required for training scales with the input size, typically expressed using Big O notation. It describes the asymptotic behavior as the input size grows.

Actual training time is the duration of training. It’s influenced by hardware capabilities, software optimizations, and implementation efficiency. While these factors affect the actual time, they don’t change the underlying time complexity class.

3.1. Factors Affecting Time Complexity

Several factors impact the asymptotic time complexity of training a neural network:

Number of layers: more layers mean more computations during both the forward and backward passes
Number of neurons per layer: having more neurons increases the number of connections and computation; in many cases, this affects the time complexity quadratically
Number of training samples: more samples require more iterations through the network per epoch, leading to linear growth in time complexity with respect to training samples
Number of epochs: more epochs mean the dataset is processed multiple times, resulting in linear growth in time complexity
Complexity of activation functions: Some activation functions are computationally more expensive than others; while they may not change the complexity class, they affect the constant factors in the time complexity

3.2. Factors Influencing Actual Time

While these factors determine the asymptotic growth of the computational time, hardware and software optimizations also influence the actual training time.

Utilizing GPUs, TPUs, parallel processing, and optimized libraries can significantly reduce the actual training time.

However, these optimizations affect the constants in the time calculation and don’t change the asymptotic time complexity class (Big O notation). They make computations faster but don’t alter the fundamental way training time scales with input size.

4. Analyzing Time Complexity

In this section, we’ll analyze the time complexity of training a neural network using backpropagation. We’ll focus on the number of basic computational operations required, such as multiplications and additions, which are fundamental to weight updates and activation computations.

To simplify the analysis, we’ll assume that the neural network is rectangular, meaning all layers have the same number of neurons ( $N$ ).

4.1. Time Complexity of the Forward Pass

Each neuron computes its output during the forward pass based on inputs from the previous layer.

For a single layer, the time complexity is proportional to the number of computations required:

(2) $\begin{equation*} T_{forward} = O(N^2) \end{equation*}$

For the entire network with $L$ layers, the time complexity per sample is:

(3) $\begin{equation*} T_{forward} = O(L \times N^2) \end{equation*}$

We compute the weighted sum of inputs for each neuron and apply an activation function to it, typically a single operation.

4.2. Time Complexity of the Backward Pass

The backward pass involves computing gradients and updating weights based on the errors of the neurons in the next layer. The computations are similar to the forward pass but in reverse order.

The time complexity per sample is also quadratic:

(4) $\begin{equation*} T_{backward} = O(L \times N^2) \end{equation*}$

4.3. Total Time Complexity

Now that we know the time complexity of both the forward and backward passes, the total time complexity per sample is:

(5) $\begin{equation*} T_{sample} = T_{forward} + T_{backward} = O(L \times N^2) \end{equation*}$

If we consider all samples over all epochs, the total time complexity is:

(6) $\begin{equation*} T_{total} = E \times S \times T_{sample} = O(E \times S \times L \times N^2) \end{equation*}$

Where:

$E$ is the number of epochs
$S$ is the number of training samples

This formula shows that the training time increases linearly with the number of epochs and samples but quadratically with the number of neurons per layer.

5. Example Calculation of Number of Operations

Now that we know the time complexity, let’s play and check how many computing operations it will take to classify digits from 0 to 9 based on the provided image.

5.1. Network Architecture

Suppose we have:

Input Layer: 784 neurons (e.g., for 28×28 pixel images in the MNIST dataset)
Hidden Layer: 128 neurons
Output Layer: 10 neurons (for digit classification from 0 to 9)

5.2. Parameters

Number of Layers: 2 (since we have one hidden layer between input and output)
Number of Training Samples: 60,000 samples
Number of Epochs: 10 epochs

5.3. Calculating the Number of Operations

We can summarize the computations required during training in the following table:

Computation Step	Calculation	Number of Operations
Forward Pass: Input to Hidden Layer	$T_{forward\_1} = N_{input} \times N_{hidden} = 784 \times 128$	$100,352$
Forward Pass: Hidden to Output Layer	$T_{forward\_2} = N_{hidden} \times N_{output} = 128 \times 10$	$1,280$
Total Forward Pass per Sample	$T_{forward\_sample} = T_{forward\_1} + T_{forward\_2} = 100,352 + 1,280$	$101,632$
Backward Pass per Sample	Similar computations as Forward Pass	$101,632$
Total Computation per Sample	$T_{sample} = T_{forward\_sample} + T_{backward\_sample} = 101,632 + 101,632$	$203,264$
Total Number of Operations	$T_{total} = E \times S \times T_{sample} = 10 \times 60,000 \times 203,264$	$121,958,400,000$

This calculation demonstrates the significant computational effort, emphasizing the importance of optimizing the network and algorithms to reduce the carbon footprint in more complex problems.

6. Strategies to Optimize Training Time

To manage and reduce training time, we can implement several strategies.

6.1. Mini-Batch Gradient Descent

Instead of processing one sample at a time (Stochastic Gradient Descent), we can process mini-batches ( $B$ ) of samples:

(7) $\begin{equation*} \text{Number of Batches} = \frac{S}{B} \end{equation*}$

Each batch processes $B$ samples. The time complexity per batch is:

(8) $\begin{equation*} T_{\text{batch}} = B \times T_{\text{sample}} = O(B \times L \times N^2) \end{equation*}$

The total time complexity per epoch is:

(9) $\begin{equation*} T_{\text{epoch}} = \frac{S}{B} \times T_{\text{batch}} = \frac{S}{B} \times O(B \times L \times N^2) = O(S \times L \times N^2) \end{equation*}$

As we can see, the total time complexity per epoch remains $O(S \times L \times N^2)$ , similar to processing the data sample by sample.

However, mini-batch gradient descent allows for computational optimizations:

Processing batches enable vectorized operations, which are optimized in numerical libraries and can be executed efficiently on modern CPUs and GPUs
Batches can be processed in parallel, taking advantage of multi-core processors and GPUs, reducing the actual training time

By leveraging these optimizations, mini-batch gradient descent can significantly reduce the actual training time compared to processing samples individually, even though the asymptotic time complexity remains the same.

6.2. Using Efficient Activation Functions

Activation functions like ReLU (Rectified Linear Unit) are computationally less expensive than functions like sigmoid or tanh.

Let’s compare ReLU and sigmoid:

(10) $\begin{equation*} \text{ReLU: } f_{\text{ReLU}}(x) = \max(0, x) \quad \text{vs} \quad \text{Sigmoid: } f_{\text{sigmoid}}(x) = \frac{1}{1 + e^{-x}} \end{equation*}$

Computing the sigmoid function involves calculating the exponential function, which is computationally more intensive than the simple comparison in ReLU.

By using ReLU, we reduce the computational cost per neuron, which can significantly decrease the actual training time, especially in deep networks with many neurons.

6.3. Network Pruning and Regularization

Reducing the number of neurons and layers can decrease the computational load. However, penalizing large weights through regularization techniques like L1/L2 regularization doesn’t remove neurons or layers; it only encourages the weights to be small, possibly near zero. Even if weights become very small, the associated neurons and connections are still present in the network, and computations involving them still occur. We can use techniques like dropout or pruning to reduce the network size and computational cost-effectively.

Dropout randomly deactivates a subset of neurons during training. The fraction of deactivated neurons is usually a constant, so this strategy doesn’t reduce the computational complexity of training but can lead to sparser models.

With pruning, we remove neurons or connections with weights below a certain threshold. This can reduce the model size and computational cost during inference, but the training time remains largely unaffected.

Thus, while regularization techniques help prevent overfitting and can lead to simpler models, they don’t necessarily reduce the time complexity during training.

6.4. Parallel and Distributed Computing

We can also utilize specialized GPUs or distributed systems designed for ML training, which can significantly speed up training.

GPUs are designed for the parallel processing of large data blocks, which makes them ideal for performing matrix operations in neural networks. They can handle thousands of computations simultaneously, greatly accelerating training.

On the other hand, tools like TensorFlow and PyTorch support distributed training across multiple devices or machines. By distributing the computational workload, we can train large models more efficiently.

6.5. Optimized Algorithms

Finally, we can use advanced optimization algorithms that accelerate convergence, reducing the number of epochs required for training and, in turn, the total computational time.

For example, algorithms like Adam and RMSProp adjust the learning rate during training based on the historical gradients. They often converge faster than standard gradient descent methods.

Another one is momentum, which helps accelerate gradients in the relevant direction and dampens oscillations. It adds a fraction of the previous update to the current update, which can lead to faster convergence.

7. Conclusion

In this article, we explored the time complexity of training neural networks using backpropagation. It’s significantly influenced by the network’s architecture, the dataset’s size, and the algorithms’ computational efficiency.

Core Concepts

Operating Systems

Artificial Intelligence

Graph Theory

Latex

Full Archive

About Baeldung