Common Causes of NaNs During Training | Baeldung on Computer Science

1. Introduction

While training neural networks, we can sometimes get NaNs (Not-a-Number) as output.

In this tutorial, we’ll review the common causes of NaNs in training neural networks and discuss how to prevent this.

2. What Does NaN Mean in Training?

Training a neural network means adjusting weights to optimize performance, usually through error backpropagation. Training depends on several factors, including the network architecture parameters and the hyperparameters, such as the loss function, optimizer, and learning rates.

If these different factors are not configured correctly, they can often cause erroneous outputs, such as NaNs. The appearance of NaNs usually means that the neural network has encountered an error during training.

3. Causes of NaNs

There are several causes of NaNs, which we discuss in detail here.

3.1. Errors in the Input Data

Real-world data often contains erroneous, missing, and inconsistent values due to human errors, data-collection issues, and other factors. These errors can sometimes manifest as NaNs in the input data.

During neural network training, a neuron takes values from the input data, weights, and bias and applies a mathematical operation that yields an output. The mathematical operation yields a NaN output if a NaN value is passed as input to a neuron. Let’s consider a dataset of food prices in different cities, expressed in USD:

City	Eggs	Milk
London	2.5	7.9
Venice	NaN	8.9
Prague	4.6	NaN

The second and third rows (Venice and Prague) each have a NaN in one of the two price columns. Any arithmetic operation that involves a NaN, such as a neuron’s weighted sum, also yields NaN as output.
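As a rough illustration of this propagation, the sketch below feeds the London and Venice rows into a single neuron. The `neuron` function, the weights, and the bias are hypothetical values chosen for this example, not part of any particular library:

```python
import math

def neuron(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias (no activation, for simplicity)."""
    return sum(x * w for x, w in zip(inputs, weights)) + bias

# Hypothetical weights and bias, chosen only for illustration
weights = [0.5, 0.25]
bias = 0.1

london = [2.5, 7.9]
venice = [math.nan, 8.9]  # the egg price is missing

london_out = neuron(london, weights, bias)
venice_out = neuron(venice, weights, bias)

print(math.isnan(london_out))  # False
print(math.isnan(venice_out))  # True: the NaN input propagates to the output
```

A single NaN in one feature is enough: every later layer that consumes this output receives a NaN as well.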

3.2. Learning Rate

In machine learning, the learning rate is a hyperparameter that determines the size of the step taken at each iteration of gradient descent. The goal of gradient descent is to minimize the loss function and thereby reduce the output error.

Setting the optimal learning rate is crucial for performance. If the learning rate is too high, it can lead to very large parameter updates during gradient descent, so the values grow with each iteration instead of settling, potentially causing what is known as ‘exploding gradients.’ Once these values exceed the range that floating-point numbers can represent, they overflow to infinity, and further operations on infinities, such as subtracting one from another, produce NaNs.
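To make this concrete, here is a minimal sketch of plain gradient descent on the toy function f(w) = w², whose gradient is 2w. The function name `gradient_descent` and the two learning rates are assumptions picked for the demonstration:

```python
import math

def gradient_descent(learning_rate, steps=2000, w=1.0):
    """Minimize f(w) = w**2 with plain gradient descent; the gradient is 2 * w."""
    for _ in range(steps):
        grad = 2.0 * w
        w = w - learning_rate * grad
    return w

small = gradient_descent(learning_rate=0.1)
large = gradient_descent(learning_rate=1.5)

print(abs(small) < 1e-6)  # True: converges toward the minimum at w = 0
print(math.isnan(large))  # True: w overflows to infinity, then inf - inf gives NaN
```

With a learning rate of 1.5, each update multiplies the weight by -2, so its magnitude doubles at every step until it overflows.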

3.3. Activation Function

The activation function transforms an input into an output through a mathematical operation during the forward pass. Let’s consider the softmax activation as an example:

$\text{Softmax}(x_i) = \frac{e^{x_i}}{\sum_{j=1}^{N} e^{x_j}}$

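Softmax takes input values and outputs a set of probabilities. However, when computed in floating-point arithmetic, the exponentials can underflow to 0 for very negative inputs, so the sum in the denominator can become 0 and cause a division by 0. For very large inputs, the exponentials can instead overflow. Both cases can lead to NaNs during training.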
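To see how the denominator can reach zero, consider the short sketch below, which computes only the sum of exponentials. The helper name `softmax_denominator` and the input values are assumptions made for illustration:

```python
import math

def softmax_denominator(logits):
    """Sum of exponentials, i.e. the denominator of the softmax formula."""
    return sum(math.exp(x) for x in logits)

moderate = softmax_denominator([1.0, 2.0])
extreme = softmax_denominator([-1000.0, -1001.0])

print(moderate > 0.0)  # True
print(extreme == 0.0)  # True: both exponentials underflow to exactly 0.0
```

Dividing by this zero denominator raises `ZeroDivisionError` in plain Python, while array libraries that follow IEEE 754 arithmetic generally evaluate 0/0 to NaN.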

3.4. Loss Function

The loss function computes the magnitude of the error between the model’s predicted output and the actual (target) output. Similar to the activation function, it can result in numerical errors, which can cause NaNs.

For example, the error calculated by the loss function could be so large that it exceeds the range of values the floating-point format can represent. The result then overflows to infinity, and later computations on it can yield NaNs.

4. How to Avoid NaNs

Avoiding NaNs requires addressing their underlying causes. Let’s explore how we can avoid NaNs based on the causes we looked at:

4.1. Data Preprocessing

Data preprocessing is a crucial step before training that involves any techniques applied to data to transform it from its raw state to a usable format. This includes handling categorical values, standardizing values, removing outliers, and handling missing and inconsistent values to prepare the data before training.

To address errors such as NaNs in the input data, we can employ an imputation step during preprocessing that replaces the NaNs with zeros:

City	Eggs	Milk
London	2.5	7.9
Venice	0	8.9
Prague	4.6	0

This ensures the input data has no occurrence of NaNs before being passed to the model for training. Instead of zeros, we can use column means, medians, or any value considered neutral.
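The imputation above can be sketched in a few lines of standard-library Python. The `impute` helper and its `strategy` parameter are hypothetical names introduced for this example, and the mean strategy assumes each column has at least one observed value:

```python
import math

# The example table, with NaN marking the missing prices (USD)
prices = {
    "Eggs": [2.5, math.nan, 4.6],
    "Milk": [7.9, 8.9, math.nan],
}

def impute(column, strategy="zero"):
    """Replace NaNs with 0.0 or with the mean of the observed values."""
    observed = [v for v in column if not math.isnan(v)]
    # Assumes at least one observed value when strategy is "mean"
    fill = 0.0 if strategy == "zero" else sum(observed) / len(observed)
    return [fill if math.isnan(v) else v for v in column]

zero_filled = {name: impute(col, "zero") for name, col in prices.items()}
mean_filled = {name: impute(col, "mean") for name, col in prices.items()}

print(zero_filled["Eggs"])  # [2.5, 0.0, 4.6]
```

Zero is only a sensible fill value when it is neutral for the feature. For prices, the column mean (about 3.55 for eggs here) is usually less distorting.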

4.2. Hyperparameter Tuning

Hyperparameter optimization, or tuning, refers to finding the hyperparameters for a neural network model that minimize the loss function. These include settings such as the learning rate and batch size. Tuning is often an iterative process that involves testing different combinations of hyperparameters to find the values that yield the best performance from the neural network.

In cases where NaNs result from the learning rate, extensive hyperparameter tuning can help set the optimal learning rate for the model’s training process. An adaptive learning rate or a learning rate scheduling technique can automatically adjust the learning rate during training. Learning rate schedulers automatically modify the learning rate according to a predefined schedule, while adaptive techniques automatically adjust the learning rate to optimize the loss function.
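As one possible example of a predefined schedule, the sketch below implements step decay, which multiplies the learning rate by a constant factor after a fixed number of epochs. The function name `step_decay` and its default values are assumptions for illustration, not a recommendation:

```python
def step_decay(initial_lr, epoch, drop_factor=0.5, epochs_per_drop=10):
    """Multiply the learning rate by drop_factor every epochs_per_drop epochs."""
    return initial_lr * drop_factor ** (epoch // epochs_per_drop)

# The rate is held for 10 epochs, then halved, then halved again
for epoch in (0, 9, 10, 25):
    print(epoch, step_decay(0.1, epoch))
```

Starting with a moderate rate and lowering it over time reduces the chance that late updates overshoot and grow without bound.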

4.3. Activation Function Robustness

To handle NaNs caused by activation functions, we can implement activation functions whose computations are robust enough to handle numerical errors.

In the case of division-by-zero errors, implementing error-handling mechanisms can help to ensure that NaNs are not propagated through the network. For example, we can modify the computation of softmax by adding a small value $\epsilon$ to the denominator to prevent division by zero:

$\text{Softmax}(x_i) = \frac{e^{x_i}}{\epsilon + \sum_{j=1}^{N} e^{x_j}} \qquad (\epsilon > 0, \epsilon \approx 0)$
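A minimal sketch of this modified softmax follows. The names `softmax_eps`, `eps`, and `shift` are hypothetical. The optional `shift` step, which subtracts the largest input before exponentiating, is not part of the formula above. It is included as an assumption because it is a widely used complement that guards against overflow and leaves the softmax result mathematically unchanged:

```python
import math

def softmax_eps(logits, eps=1e-12, shift=True):
    """Softmax with a small eps added to the denominator.

    shift=True also subtracts the largest logit before exponentiating. That
    step is an addition to the formula above, used here to avoid overflow.
    """
    offset = max(logits) if shift else 0.0
    exps = [math.exp(x - offset) for x in logits]
    denominator = eps + sum(exps)
    return [e / denominator for e in exps]

# Exponentials underflow to 0.0, but eps keeps the division defined
print(softmax_eps([-1000.0, -1001.0], shift=False))  # [0.0, 0.0]

# With the shift, the same inputs give probabilities that sum to about 1
probs = softmax_eps([-1000.0, -1001.0])
print(abs(sum(probs) - 1.0) < 1e-9)  # True
```

Note that with $\epsilon$ alone, the outputs for extreme inputs are defined but no longer sum to 1, which is why the two safeguards are often combined.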

4.4. Loss Function Stability

To address loss function instability, we can employ numerically stable loss functions. As with the activation functions discussed previously, we can modify the mathematical operations within loss functions to ensure they do not yield NaNs. This reduces the risk of NaNs being propagated throughout the network.

Additionally, we can scale the results of the loss function computations approaching extremely large or small values, as these can manifest as NaNs. This involves setting a predefined range for the loss function’s outputs.
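As a hedged illustration, the sketch below applies this idea to binary cross-entropy, a loss chosen here only as an example. Clipping the predicted probability to a small interval keeps the logarithm away from log(0). The function name and the value of `eps` are assumptions:

```python
import math

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Binary cross-entropy with predictions clipped to [eps, 1 - eps]."""
    p = min(max(y_pred, eps), 1.0 - eps)
    return -(y_true * math.log(p) + (1.0 - y_true) * math.log(1.0 - p))

# Without clipping, a prediction of exactly 0.0 for a positive label needs log(0)
loss = binary_cross_entropy(y_true=1.0, y_pred=0.0)
print(math.isfinite(loss))  # True
```

The clipped loss is large for a confidently wrong prediction, but it stays finite, so the gradients computed from it stay finite too.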

4.5. Gradient Clipping

Gradient clipping is a technique that limits the gradients to a predefined range. A common approach is to set a threshold for the gradients computed during training and to scale down any values beyond that threshold so that they fall within the range. This helps to prevent the very large values that manifest as NaNs during training.
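One common variant, clipping by norm, can be sketched as follows. The helper name `clip_by_norm` and the threshold are assumptions for this example. Clipping each value independently to a fixed interval is another option:

```python
import math

def clip_by_norm(gradients, max_norm):
    """Rescale the gradient vector so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in gradients))
    if norm <= max_norm:
        return list(gradients)
    scale = max_norm / norm
    return [g * scale for g in gradients]

clipped = clip_by_norm([3.0, 4.0], max_norm=1.0)    # norm 5.0, scaled down
unchanged = clip_by_norm([0.3, 0.4], max_norm=1.0)  # norm 0.5, left as is
```

Rescaling the whole vector preserves the direction of the update and only shortens the step. Clipping helps only while the gradients are still finite, so it has to be applied before the values overflow.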

5. Recap: Causes and Actions

In the previous sections, we reviewed the different causes of NaNs in training and the measures that can be taken to prevent NaNs. Now, let’s link the causes with the measures we can take to address them:

Causes	Actions to take
Input data errors	Extensive data preprocessing
Learning rate	Hyperparameter tuning and gradient clipping
Activation function	Implement robust activation functions
Loss function	Stabilize the loss function

By understanding the causes of NaNs and the appropriate measures that can be applied, we can minimize the occurrence of NaNs and effectively train neural networks.

6. Conclusion

In this article, we provided an overview of the common causes of NaNs during the training of neural networks. These sources include data errors, learning-rate issues, activation function abnormalities, and problems with loss function computations.

To avoid NaNs, we can clean data, tune learning rates, clip the gradients, and use robust activation and loss functions that aren’t susceptible to NaNs.
