## 1. Introduction

The emergence of deep learning has changed the landscape of artificial intelligence, revolutionising domains like computer vision and natural language processing. Amid the core challenges encountered when training deep neural networks, the vanishing gradient problem has emerged as a critical challenge. This issue significantly impacts the convergence and performance of models with multiple layers.

The vanishing gradient problem becomes notably pronounced in the context of a Recurrent Neural Network (RNN), a neural network designed for processing sequential data. As the word “recurrent” indicates, RNNs have connections that loop back on themselves. This allows them to uphold a hidden state that captures insights from previous time steps within the sequence. However, this architectural design can also cause the vanishing gradient problem.

In this tutorial, we’ll delve into the concept of the vanishing gradient problem. Furthermore, we will explore how the Long Short-Term Memory network (LSTM), a variant of RNN, tries to improve this problem.

## 2. Recap: what is RNN?

A Recurrent Neural Network (RNN) is a type of artificial neural network designed to process sequential data by taking into account the order and time dependencies of the input data. **Compared to traditional feedforward neural networks, where information flows in one direction from input to output, RNNs contain an RNN cell that loops back on itself.** This allows them to maintain a hidden state that captures information from previous time steps in the sequence.

Unlike the conventional feedforward neural networks with a linear information flow from input to output, RNNs possess a distinct characteristic: a looping RNN cell. This mechanism enables them to retain a hidden state, effectively absorbing insights from prior moments in the sequence.

The computation of the hidden state at a given time step follows the equation below.

Hidden state:

Here, represents the input at time step , and stand for the input and hidden state weights, denotes the bias, and represents the activation function.

Furthermore, we calculate the output based on the hidden state.

Output:

Throughout the training process, the weights undergo updates via backpropagation through time. This iterative process entails the computation of gradients and the subsequent weight adjustments, all tailored to train RNNs on sequences of data. This technique empowers RNNs to capture the temporal dependencies. However, it also presents challenges with gradient vanishing and the efficiency of computations when dealing with long sequences.

## 3. Vanishing Gradient Problem

In deep neural networks, gradients serve as indicators of the weight adjustments to minimise the gap between predicted and actual outcomes. Yet, they often get weaker as they traverse through layers, potentially causing a slowdown or even a halt in learning within the initial layers.

This phenomenon is what’s referred to as the vanishing gradient problem. This challenge grows more pronounced in deep architectures and networks employing activation functions that squash input values.

### 3.1. Vanishing Gradient in RNN

**In the context of RNNs, the vanishing gradient problem arises from the recurrent connections and the propagation of gradients through different time steps.** Let’s consider a vanilla RNN cell that processes a sequence of inputs over time steps denoted as .

- At every time step , the RNN cell takes both the current input and the hidden state from the previous time step to calculate a new hidden state and produce an output.
- The hidden state at time step influences the hidden state at time step . This dependency on previous time steps causes the gradient propagation through a chain of recurrent connections.
- In the process of backpropagation through time, we calculate gradients by iteratively employing the chain rule across the various time steps. This iterative process begins from the final time step and proceeds in reverse.

**The vanishing gradient problem in RNNs occurs because, as the gradients are propagated backwards through time, they can become very small due to the repeated multiplication of gradients in each time step.**

If initialising the RNN cell with small weights or if the sequence is long, the gradients can diminish to the point where they are effectively zero. As a result, the earlier time steps in the sequence receive very weak gradient signals during training. Moreover, their corresponding weights are updated slowly, or not at all. This leads to the loss of important information and slow convergence of learning.

## 4. How Does LSTM Improve the Vanishing Gradient Problem?

LSTMs, introduced by Sepp Hochreiter and Jürgen Schmidhuber in 1997, were designed to focus on overcoming the vanishing gradient problem. **To do so, LSTM leverages gating mechanisms to control the flow of information and gradients.** This helps prevent the vanishing gradient problem and allows the network to learn and retain information over longer sequences.

The figure below shows the structure of an LSTM cell:

There are three gates included in LSTMs: the input gate, the forget gate, and the output gate. These gates control the flow of information through the LSTM cell, allowing it to decide what to remember, what to forget, and what to output. Furthermore, these gates allow LSTMs to control the flow of gradients through time, effectively addressing the vanishing gradient problem.

### 4.1. Forget Gate

The forget gate decides which information to forget from the previous hidden state and the current input. Using a sigmoid activation function, it produces values between 0 and 1, ideally indicating the fraction of data that passes through the cell state.

In addition, the forget gate lets LSTM hold onto crucial data and let go of the unnecessary, ensuring the gradients remain relevant.

### 4.2. Input Gate and Output Gate

In contrast to the forget gate, the input gate establishes which new details should be added to the cell state. It has two parts: a sigmoid activation function that selects values to update the cell and a tanh activation function that produces new candidate values for the cell state.

Meanwhile, the output gate defines what data should be presented as the LSTM cell’s current output. It processes the updated cell state and the present hidden state to produce the output. The output gate delivers meaningful results by refining the data inside the cell state. These gates are crucial in pinpointing the essential data for the given task, preserving significant information across extended sequences, and thus combating the vanishing gradient issue.

Both the input and output gates multiply their results with other factors. Their nature ensures that if the output is near 1, gradients pass unhindered, but if it’s around 0, the flow is stopped. These gates help select which data is essential for the task at hand. Therefore, they help maintain only crucial information over longer sequences, which relieves the vanishing gradient problem.

### 4.3. Memory Cell State

Combining results from the input gate and the previous cell state refines the cell’s state at the current time step. This memory cell state holds data across various time steps, allowing the network to capture longer-range dependencies. Due to the additive update mechanism, the LSTM’s memory cell ensures gradients remain consistent over lengthy sequences.

With these gate mechanisms in place, LSTMs better handle the vanishing gradient issue. They excel in tasks with an understanding of long-term sequences and dependencies.

## 5. Limitation of LSTM

**While Long Short-Term Memory (LSTM) networks effectively address the vanishing gradient problem and recognise long-term patterns in sequences, they come with their own set of challenges.**

**5.1. Gradient Explosion**

Even though LSTMs are better at addressing the vanishing gradient problem than traditional RNNs, they can still suffer from gradient explosion in certain cases. Gradient explosion happens when the gradients become extremely large during backpropagation, especially with the very long input sequence or initialising the network with large weights.

**5.2. Limited Contextual Information**

LSTMs can capture dependencies over longer sequences than simple RNNs, but they still have a limited context window. When processing extremely long sequences, LSTMs might struggle to remember earlier information and replace the older data with newer inputs. This happens when the forget gate becomes overly dominant.

**5.3. Complexity and Training Time**

LSTMs, with their gating mechanisms and multiple components, are more complex than the vanilla RNNs. As a result, they possess about four times the parameters of a basic RNN. This complexity can lead to longer training times and require more careful hyperparameter tuning. Additionally, training deep LSTM networks remains challenging due to vanishing gradients in very deep architectures.

**5.4. Gating Mechanism Sensitivity**

LSTM’s performance relies on the precise configuration of its gating mechanism. Inappropriate gate weight setups or poorly chosen hyperparameters might degrade the model’s learning efficiency, giving rise to gradient-related problems.

**5.5. ****Adapting to Task Complexity**

The extent to which LSTMs can address the vanishing gradient issue and identify long-range dependencies is influenced by the complexity of the task, the chosen architectural design, and the size and quality of training data.

In practice, we often experiment with different variations of LSTM architectures, initialisation strategies, and training techniques to achieve the best performance on specific tasks while mitigating the vanishing gradient problem. Additionally, newer architectures like Transformer-based models, such as the GPT series, have gained popularity for their ability to capture dependencies across long sequences using self-attention mechanisms. These new models can provide an alternative to LSTM-based solutions in certain scenarios.

## 6. Conclusion

The vanishing gradient problem has been a longstanding obstacle in training deep neural networks since gradients tend to diminish as they pass through layers.

In this article, we explained how the Long Short-Term Memory (LSTM) network is a remedy for this problem in RNNs. By using specialised gates like the forget, input, and output gates, LSTMs effectively regulate gradient flow. Furthermore, they empower the network to retain vital information and recognise long-term patterns.

However, LSTMs are still facing their challenges, including gradient explosions, costly training processes, and constraints in handling more complicated tasks.