Normalization is a widely used technique for improving the performance of machine learning models. But when is the proper time to apply it – before or after splitting our data set?
In the following lines, we’ll see an overview of normalization and splitting to frame the context of this question.
Once we know the basics, we’ll learn the correct order to follow and why we should proceed that way.
2. Normalization Overview
2.1. What We Mean by Normalization
Normalization is a transformation applied to the numeric features of a dataset to help a model perform better. The term is often confused with some of the specific methods that implement it. However, in this tutorial, we'll use it in the broad sense of "feature scaling".
The most popular implementations are:
- Rescaling (min-max normalization): scales your values to the range [0, 1]: x' = (x - min) / (max - min)
- Mean normalization: centers your values around 0 while keeping them roughly in the range [-0.5, +0.5]: x' = (x - mean) / (max - min)
- Standardization: scales your values to have mean 0 and standard deviation 1: x' = (x - mean) / std
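The three scalings above can be sketched in a few lines of numpy (the sample values are made up for illustration):

```python
import numpy as np

# Hypothetical sample feature values
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

# Rescaling (min-max normalization): values end up in [0, 1]
rescaled = (x - x.min()) / (x.max() - x.min())

# Mean normalization: values centered around 0
mean_normalized = (x - x.mean()) / (x.max() - x.min())

# Standardization: mean 0, standard deviation 1
standardized = (x - x.mean()) / x.std()

print(rescaled)         # [0.   0.25 0.5  0.75 1.  ]
print(mean_normalized)  # [-0.5  -0.25  0.    0.25  0.5 ]
```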
2.2. A Two-Way Street
The parameters used to normalize data during training (min, max, mean, and standard deviation) are required to use the model, both in the input and output directions:
- Input: The model was trained with normalized data, so any input will have to be normalized onto the training scale before being fed to the model
- Output: The model will return a normalized prediction, so we must denormalize it onto the original scale before sending it to the user
3. Why Splitting
The main purpose of a test dataset is to have the least biased evaluation of the model’s performance. To achieve that, the evaluation process must be done using examples never seen by the model during training.
We split the data into two datasets:
- Training data: examples used to fit the model parameters (the weights of a neural network, for instance)
- Test data: an independent set of examples used to evaluate the model's performance
We can't use test data for training because the test set should be as close as possible to new data the model has never seen.
4. Normalization: Before or After Splitting?
4.1. Doing It Before
Let's imagine we apply min-max normalization before splitting: the min and max values are then computed over the entire dataset, test examples included, so the normalized values of test examples will always fall in the range [0, 1]. That also means we'll never encounter normalization issues in the evaluation step, so the evaluation can't reveal how the model handles out-of-scale data. Knowing the behavior of the test data in advance like this should raise a red flag: it's a form of data leakage.
With new data, however, we can’t guarantee those values will remain in the range [0, 1] once normalized. Considering the model was trained using examples in the range [0, 1], data in a different range could reasonably make the model perform worse.
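A tiny numpy example (with made-up values) makes the problem concrete: normalizing with statistics from the full dataset guarantees the test values land in [0, 1], yet genuinely new data can still fall outside that range:

```python
import numpy as np

train = np.array([3.0, 5.0, 7.0])
test = np.array([1.0, 9.0])

# Leaky approach: compute min and max over the full dataset (train + test)
full = np.concatenate([train, test])
lo, hi = full.min(), full.max()

# Test values are guaranteed to land in [0, 1] -- the scale already "knows" them
test_normalized = (test - lo) / (hi - lo)
print(test_normalized)  # [0. 1.]

# A genuinely new value can still fall outside that scale
new_normalized = (12.0 - lo) / (hi - lo)
print(new_normalized)  # 1.375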
5. Doing Things Right
So, what are the steps to do it right?
- Split your data to obtain the test set, and don’t use it for normalization or training
- Normalize using the training data, and save the obtained parameters
- Normalize your input using the parameters obtained during training
- Denormalize predictions using those same parameters to return the user output in the original scale
- Predict on your test set, applying the previous two steps to obtain real-scale predictions
- Compare those predictions with your reference examples to obtain the evaluation metrics
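The steps above can be sketched end to end in numpy. This is a toy pipeline under invented data (a noisy linear target) with a least-squares line standing in for the model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: one feature, linear target with a little noise
X = rng.uniform(0, 100, size=40)
y = 3.0 * X + 5.0 + rng.normal(0, 1, size=40)

# 1. Split first: hold out the last 10 examples as the test set
X_train, X_test = X[:30], X[30:]
y_train, y_test = y[:30], y[30:]

# 2. Compute normalization parameters on the training data only, and save them
x_min, x_max = X_train.min(), X_train.max()
y_min, y_max = y_train.min(), y_train.max()
Xn = (X_train - x_min) / (x_max - x_min)
yn = (y_train - y_min) / (y_max - y_min)

# Fit a simple least-squares line on the normalized training data
w, b = np.polyfit(Xn, yn, deg=1)

# 3. Normalize test inputs with the saved training parameters
Xt = (X_test - x_min) / (x_max - x_min)

# 4-5. Predict, then denormalize back onto the original scale
y_pred = (w * Xt + b) * (y_max - y_min) + y_min

# 6. Compare real-scale predictions with the reference examples
mae = np.abs(y_pred - y_test).mean()
print(mae)  # small, since the data is nearly linear
```

With scikit-learn, the same discipline means calling `fit` on a scaler with the training data only, then `transform` on the test data.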
6. Conclusion
In this tutorial, we explained why we should split the data before normalizing it to avoid a biased evaluation.
Finally, we learned a few simple steps to correctly use normalization and splitting in a machine learning project.