
Last updated: October 20, 2024
In this tutorial, we’ll explore the time complexity of training a neural network using backpropagation.
We’ll discuss the mathematical foundations, analyze the factors affecting the time complexity, and provide examples to illustrate the concepts.
Before we analyze the time complexity, let’s briefly revisit the basics of backpropagation. We’ll assume that all training data is used in each training iteration – that is, we’re using batch gradient descent, and the descent is not stochastic.
Backpropagation is an algorithm that trains neural networks by adjusting the weights to minimize the error between the predicted and actual outputs. In our neural network, the weights are associated with layers, so we denote the weight connecting neuron $i$ in layer $l-1$ to neuron $j$ in layer $l$ as $w_{ij}^{(l)}$.
The weights are updated using gradient descent:

$$w_{ij}^{(l)} \leftarrow w_{ij}^{(l)} - \eta \frac{\partial L}{\partial w_{ij}^{(l)}} \quad (1)$$

where $\eta$ is the learning rate and $\frac{\partial L}{\partial w_{ij}^{(l)}}$ is the gradient of the loss function $L$ with respect to the weight.
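As a minimal NumPy sketch of this update rule (the weight values, gradients, and learning rate below are illustrative assumptions):

```python
import numpy as np

# Hypothetical weights and gradients for one layer (3 inputs, 2 neurons)
w = np.array([[0.5, -0.2],
              [0.1,  0.4],
              [-0.3, 0.8]])
grad = np.array([[0.05,  0.01],
                 [-0.02, 0.03],
                 [0.04, -0.01]])

eta = 0.1            # learning rate (assumed value)
w = w - eta * grad   # step against the gradient to reduce the loss
```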
The backpropagation algorithm involves two main phases: the forward and backward passes.
In a forward pass, we compute the network outputs by passing the input object to the input layer, through each hidden layer, and to the output layer. This phase uses the current weights and biases to generate predictions.
The second step is a backward pass. Starting from the output layer, we compute the gradient of the loss function with respect to each weight $w_{ij}^{(l)}$ (and bias). In doing so, we apply the chain rule of calculus.
Then, we adjust each weight by subtracting the product of the learning rate $\eta$ and the gradient. This step moves the weights in the direction that reduces the loss.
By iteratively performing these two phases over multiple epochs and training samples, the network learns the optimal weights and biases that minimize the loss function, improving its predictions on the training data.
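These two phases can be sketched for a tiny one-hidden-layer network in NumPy (the shapes, learning rate, and data below are illustrative assumptions, and tanh stands in for a generic activation):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))    # 4 samples, 3 features (assumed sizes)
y = rng.normal(size=(4, 1))
W1 = rng.normal(size=(3, 5))   # input -> hidden weights
W2 = rng.normal(size=(5, 1))   # hidden -> output weights
eta = 0.01                     # learning rate (assumed value)

losses = []
for epoch in range(100):
    # forward pass: propagate the input through both layers
    h = np.tanh(X @ W1)
    y_hat = h @ W2
    losses.append(np.mean((y_hat - y) ** 2))

    # backward pass: apply the chain rule from the output layer back
    d_out = 2 * (y_hat - y) / len(X)
    dW2 = h.T @ d_out
    d_h = (d_out @ W2.T) * (1 - h ** 2)   # derivative of tanh
    dW1 = X.T @ d_h

    # gradient descent: subtract learning rate times gradient
    W1 -= eta * dW1
    W2 -= eta * dW2
```

Over the epochs, the mean squared error steadily decreases as the weights move toward values that fit the training data.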
Understanding these parameters and their roles in the network is crucial because they directly impact the computational complexity of training. The number of weights and biases determines the computations required during forward and backward passes, influencing the overall time complexity.
Training a neural network can be computationally intensive. It’s important to distinguish between time complexity and actual training time.
Time complexity is how the computational time required for training scales with the input size, typically expressed using Big O notation. It describes the asymptotic behavior as the input size grows.
Actual training time is the duration of training. It’s influenced by hardware capabilities, software optimizations, and implementation efficiency. While these factors affect the actual time, they don’t change the underlying time complexity class.
Several factors impact the asymptotic time complexity of training a neural network: the number of layers, the number of neurons per layer, the number of training samples, and the number of epochs.
While these factors determine the asymptotic growth of the computational time, hardware and software optimizations also influence the actual training time.
Utilizing GPUs, TPUs, parallel processing, and optimized libraries can significantly reduce the actual training time.
However, these optimizations affect the constants in the time calculation and don’t change the asymptotic time complexity class (Big O notation). They make computations faster but don’t alter the fundamental way training time scales with input size.
In this section, we’ll analyze the time complexity of training a neural network using backpropagation. We’ll focus on the number of basic computational operations required, such as multiplications and additions, which are fundamental to weight updates and activation computations.
To simplify the analysis, we’ll assume that the neural network is rectangular, meaning all layers have the same number of neurons ($n$).
Each neuron computes its output during the forward pass based on inputs from the previous layer.
For a single layer, the time complexity is proportional to the number of computations required:

$$T_{\text{layer}} = O(n \times n) = O(n^2) \quad (2)$$
For the entire network with $L$ layers, the time complexity per sample is:

$$T_{\text{forward}} = O(L \cdot n^2) \quad (3)$$
We compute the weighted sum of inputs for each neuron and apply an activation function to it, typically a single operation.
The backward pass involves computing gradients and updating weights based on the errors of the neurons in the next layer. The computations are similar to the forward pass but in reverse order.
The time complexity per sample is also quadratic:

$$T_{\text{backward}} = O(L \cdot n^2) \quad (4)$$
Now that we know the time complexity of both the forward and backward passes, the total time complexity per sample is:

$$T_{\text{sample}} = T_{\text{forward}} + T_{\text{backward}} = O(L \cdot n^2) \quad (5)$$
If we consider all samples over all epochs, the total time complexity is:

$$T_{\text{total}} = O(E \cdot N \cdot L \cdot n^2) \quad (6)$$

where $E$ is the number of epochs, $N$ is the number of training samples, $L$ is the number of layers, and $n$ is the number of neurons per layer.
This formula shows that the training time increases linearly with the number of epochs, samples, and layers, but quadratically with the number of neurons per layer.
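A quick way to see this scaling is to count multiply-add operations directly. The helper below is a rough sketch that ignores constant factors such as activation functions:

```python
def training_ops(epochs, samples, layers, neurons):
    """Rough multiply-add count for a rectangular network:
    each of the `layers` layers costs about neurons**2 operations,
    and the backward pass roughly doubles the forward-pass cost."""
    per_sample = 2 * layers * neurons ** 2
    return epochs * samples * per_sample

# Doubling the neurons per layer quadruples the cost;
# doubling the epochs (or samples) only doubles it.
base = training_ops(10, 1000, 3, 100)
assert training_ops(10, 1000, 3, 200) == 4 * base
assert training_ops(20, 1000, 3, 100) == 2 * base
```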
Now that we know the time complexity, let’s check how many computing operations it takes to train a network that classifies digits from 0 to 9 based on an input image.
Suppose we have a network with $n_{in}$ input neurons, one hidden layer of $n_h$ neurons, and 10 output neurons (one per digit), trained on $N$ samples for $E$ epochs.

We can summarize the computations required during training in the following table:

| Computation Step | Calculation | Number of Operations |
|---|---|---|
| Forward Pass: Input to Hidden Layer | $n_{in} \times n_h$ multiply-adds | $n_{in} n_h$ |
| Forward Pass: Hidden to Output Layer | $n_h \times 10$ multiply-adds | $10 n_h$ |
| Total Forward Pass per Sample | sum of the two layers | $n_{in} n_h + 10 n_h$ |
| Backward Pass per Sample | similar computations as the forward pass | $\approx n_{in} n_h + 10 n_h$ |
| Total Computation per Sample | forward + backward | $\approx 2 (n_{in} n_h + 10 n_h)$ |
| Total Number of Operations | per sample $\times\ N \times E$ | $\approx 2 E N (n_{in} n_h + 10 n_h)$ |
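Plugging in some assumed numbers (a 28×28 input image, so 784 input neurons, a hypothetical hidden layer of 128 neurons, 60,000 training samples, and 10 epochs) gives a feel for the scale:

```python
n_in, n_hidden, n_out = 784, 128, 10   # assumed layer sizes (28x28 image, 10 digits)
samples, epochs = 60_000, 10           # assumed dataset size and training schedule

forward = n_in * n_hidden + n_hidden * n_out   # multiply-adds per forward pass
total = 2 * forward * samples * epochs         # backward pass roughly doubles the cost

print(f"{total:,} operations")  # 121,958,400,000 operations
```

Even this small, single-hidden-layer network requires on the order of $10^{11}$ operations to train under these assumptions.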
This calculation demonstrates the significant computational effort, emphasizing the importance of optimizing the network and algorithms to reduce the carbon footprint in more complex problems.
To manage and reduce training time, we can implement several strategies.
Instead of processing one sample at a time (stochastic gradient descent), we can process mini-batches of $B$ samples, so each epoch consists of:

$$\frac{N}{B} \text{ batches} \quad (7)$$

Each batch processes $B$ samples. The time complexity per batch is:

$$T_{\text{batch}} = O(B \cdot L \cdot n^2) \quad (8)$$

The total time complexity per epoch is:

$$T_{\text{epoch}} = \frac{N}{B} \cdot O(B \cdot L \cdot n^2) = O(N \cdot L \cdot n^2) \quad (9)$$

As we can see, the total time complexity per epoch remains $O(N \cdot L \cdot n^2)$, similar to processing the data sample by sample.
However, mini-batch gradient descent allows for computational optimizations:
By leveraging these optimizations, mini-batch gradient descent can significantly reduce the actual training time compared to processing samples individually, even though the asymptotic time complexity remains the same.
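The effect is easy to demonstrate: a single matrix-matrix multiplication over a batch replaces a Python-level loop of matrix-vector products, producing the same result with far less interpreter and memory overhead (the sizes below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(64, 32))        # one layer's weights
batch = rng.normal(size=(128, 64))   # mini-batch of 128 samples

# per-sample processing: one matrix-vector product per sample
per_sample = np.stack([x @ W for x in batch])

# batched processing: one matrix-matrix product for the whole batch
batched = batch @ W

assert np.allclose(per_sample, batched)
```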
Activation functions like ReLU (Rectified Linear Unit) are computationally less expensive than functions like sigmoid or tanh.
Let’s compare ReLU and sigmoid:

$$\text{ReLU}(x) = \max(0, x), \qquad \sigma(x) = \frac{1}{1 + e^{-x}} \quad (10)$$
Computing the sigmoid function involves calculating the exponential function, which is computationally more intensive than the simple comparison in ReLU.
By using ReLU, we reduce the computational cost per neuron, which can significantly decrease the actual training time, especially in deep networks with many neurons.
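Both functions are one line in NumPy; the difference is that ReLU needs only a comparison per element, while sigmoid needs an exponential:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)        # a comparison per element

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))  # an exponential per element

x = np.array([-2.0, 0.0, 3.0])
print(relu(x))       # [0. 0. 3.]
print(sigmoid(0.0))  # 0.5
```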
Reducing the number of neurons and layers can decrease the computational load. However, penalizing large weights through regularization techniques like L1/L2 regularization doesn’t remove neurons or layers; it only encourages the weights to be small, possibly near zero. Even if weights become very small, the associated neurons and connections are still present in the network, and computations involving them still occur. We can use techniques like dropout or pruning to reduce the network size and computational cost effectively.
Dropout randomly deactivates a subset of neurons during training. The fraction of deactivated neurons is usually a constant, so this strategy doesn’t reduce the computational complexity of training but can lead to sparser models.
With pruning, we remove neurons or connections with weights below a certain threshold. This can reduce the model size and computational cost during inference, but the training time remains largely unaffected.
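A minimal magnitude-pruning sketch (the weights and threshold below are arbitrary): weights whose magnitude falls below the threshold are zeroed, shrinking the effective model.

```python
import numpy as np

def prune(weights, threshold=0.1):
    """Zero out connections whose weight magnitude is below the threshold."""
    mask = np.abs(weights) >= threshold
    return weights * mask

W = np.array([[0.50, -0.02],
              [0.03, -0.70]])
pruned = prune(W)   # only the two large weights survive
```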
Thus, while regularization techniques help prevent overfitting and can lead to simpler models, they don’t necessarily reduce the time complexity during training.
We can also utilize specialized GPUs or distributed systems designed for ML training, which can significantly speed up training.
GPUs are designed for the parallel processing of large data blocks, which makes them ideal for performing matrix operations in neural networks. They can handle thousands of computations simultaneously, greatly accelerating training.
On the other hand, tools like TensorFlow and PyTorch support distributed training across multiple devices or machines. By distributing the computational workload, we can train large models more efficiently.
Finally, we can use advanced optimization algorithms that accelerate convergence, reducing the number of epochs required for training and, in turn, the total computational time.
For example, algorithms like Adam and RMSProp adjust the learning rate during training based on the historical gradients. They often converge faster than standard gradient descent methods.
Another technique is momentum, which accelerates gradients in the relevant direction and dampens oscillations. It adds a fraction of the previous update to the current update, which can lead to faster convergence.
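The momentum update keeps a running velocity that accumulates a fraction of past updates. A toy sketch on a simple quadratic loss (the learning rate, momentum factor of 0.9, and starting point are assumed values):

```python
import numpy as np

eta, beta = 0.1, 0.9           # learning rate and momentum factor (assumed values)
w = np.array([1.0, -2.0])      # starting point on a toy quadratic loss
velocity = np.zeros_like(w)

def grad(w):
    return 2 * w               # gradient of f(w) = w_1^2 + w_2^2, minimized at 0

for _ in range(200):
    velocity = beta * velocity + grad(w)  # keep a fraction of the previous update
    w = w - eta * velocity                # step along the accumulated velocity
```

After a few hundred iterations, the weights converge toward the minimum at the origin.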
In this article, we explored the time complexity of training neural networks using backpropagation. It’s significantly influenced by the network’s architecture, the dataset’s size, and the algorithms’ computational efficiency.