Last updated: February 28, 2025
Normalization is a widely used technique for improving the performance of machine learning models. But when is the proper time to apply it – before or after splitting our data set?
In the following lines, we’ll see an overview of normalization and splitting to frame the context of this question.
Once we know the basics, we’ll learn the correct order to follow and why we should proceed that way.
Normalization is the transformation applied to numeric features of a model to make it perform better. The term is often confused with some of the specific methods to implement it. However, in this tutorial, we’ll regard it as “feature scaling”.
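The most popular implementations include min-max normalization, which rescales each feature using its minimum and maximum values, and standardization (z-score normalization), which rescales each feature using its mean and standard deviation.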
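The parameters used to normalize data during training (min, max, mean, and standard deviation) are required to use the model in both directions: on the input side, to scale new examples the same way the training examples were scaled, and on the output side, to convert predictions of a normalized target back to the original scale.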
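As a minimal sketch of how those parameters are learned once and then reused in both directions, the following uses only the Python standard library. The function names and the sample values are hypothetical, chosen for illustration, and the code assumes the feature is not constant (otherwise the divisor would be zero).

```python
from statistics import mean, pstdev


def fit_min_max(values):
    """Learn the min-max parameters (min, max) from training values."""
    return min(values), max(values)


def min_max_scale(x, lo, hi):
    """Input direction: map x into the scale defined by (lo, hi)."""
    return (x - lo) / (hi - lo)  # assumes hi != lo


def min_max_unscale(z, lo, hi):
    """Output direction: map a scaled value back to the original units."""
    return z * (hi - lo) + lo


def fit_standard(values):
    """Learn the standardization parameters (mean, standard deviation)."""
    return mean(values), pstdev(values)


def standardize(x, mu, sigma):
    """Input direction for z-score normalization."""
    return (x - mu) / sigma  # assumes sigma != 0


def unstandardize(z, mu, sigma):
    """Output direction for z-score normalization."""
    return z * sigma + mu


# Hypothetical training feature.
train_feature = [10.0, 20.0, 30.0, 40.0]

# The parameters are computed once, from training data only, and kept.
lo, hi = fit_min_max(train_feature)
mu, sigma = fit_standard(train_feature)

scaled = [min_max_scale(x, lo, hi) for x in train_feature]
standardized = [standardize(x, mu, sigma) for x in train_feature]

# A round trip through scale and unscale recovers the original value.
restored = min_max_unscale(min_max_scale(25.0, lo, hi), lo, hi)
```

Libraries such as scikit-learn offer the same idea through `MinMaxScaler` and `StandardScaler`, whose `fit`, `transform`, and `inverse_transform` methods play the roles of the hypothetical functions above.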
The main purpose of a test dataset is to have the least biased evaluation of the model’s performance. To achieve that, the evaluation process must be done using examples never seen by the model during training.
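The train and test datasets are obtained by splitting the available data into two disjoint sets: one used to fit the model and one held out for evaluation.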
We can’t use test data for training because test data should resemble, as closely as possible, new data the model has never seen.
Let’s imagine we apply min-max normalization before splitting. The min and max values will then be computed over the test data as well, so the normalized values of test examples will always fall in the range [0, 1]. That also means the evaluation step will never encounter out-of-range inputs, which could otherwise affect our model’s performance. Knowing in advance how the test data will behave should raise a red flag: information from the test set has leaked into the training process.
With new data, however, we can’t guarantee those values will remain in the range [0, 1] once normalized. Considering the model was trained using examples in the range [0, 1], data in a different range could reasonably make the model perform worse.
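A small sketch can make the difference concrete. The feature values below are hypothetical, picked so that the test set happens to contain the largest value, and the helper names are illustrative only.

```python
def fit_min_max(values):
    """Return the (min, max) parameters of a list of values."""
    return min(values), max(values)


def scale(values, lo, hi):
    """Apply min-max scaling with the given parameters."""
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical feature values; the test set holds the largest one.
train = [2.0, 4.0, 6.0, 8.0]
test = [5.0, 12.0]

# Normalizing before splitting: the parameters also see the test data,
# so every scaled test value is guaranteed to land in [0, 1].
lo_all, hi_all = fit_min_max(train + test)
leaky_test = scale(test, lo_all, hi_all)

# Splitting first: the parameters come from the training data only,
# so a test value can fall outside [0, 1], just as new data could.
lo_train, hi_train = fit_min_max(train)
honest_test = scale(test, lo_train, hi_train)

print(leaky_test)
print(honest_test)
```

In the second case the test set exposes the model to an out-of-range input, which is exactly the situation it may face with new data, so the evaluation is less optimistic.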
So, what are the steps to do it right?
Training
Prediction
Evaluation
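The three stages above can be sketched end to end as follows. This is an illustrative outline rather than a definitive implementation: it assumes standardization as the scaling method, a one-feature linear model fitted by least squares, and synthetic data; all function names are hypothetical.

```python
import random
from statistics import mean, pstdev


def split(rows, test_ratio, seed):
    """Shuffle a copy of the rows and return (train, test)."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_ratio)
    return rows[n_test:], rows[:n_test]


def fit_scaler(values):
    """Learn the standardization parameters (mean, standard deviation)."""
    return mean(values), pstdev(values)


def scale(v, params):
    mu, sigma = params
    return (v - mu) / sigma


def unscale(z, params):
    mu, sigma = params
    return z * sigma + mu


def fit_line(xs, ys):
    """Ordinary least squares for y = a * x + b."""
    mx, my = mean(xs), mean(ys)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return a, my - a * mx


# Synthetic data following y = 3x + 5 (an assumption for illustration).
data = [(float(x), 3.0 * x + 5.0) for x in range(20)]

# Training: split first, then learn the scaling parameters from the
# training set only, and fit the model on normalized training data.
train, test = split(data, test_ratio=0.25, seed=0)
x_params = fit_scaler([x for x, _ in train])
y_params = fit_scaler([y for _, y in train])
a, b = fit_line(
    [scale(x, x_params) for x, _ in train],
    [scale(y, y_params) for _, y in train],
)


# Prediction: normalize the input with the training parameters, run the
# model, then convert the output back to the original scale.
def predict(x):
    z = a * scale(x, x_params) + b
    return unscale(z, y_params)


# Evaluation: the test set goes through the same prediction path, so it
# is scaled with the training parameters, never with its own.
mae = mean(abs(predict(x) - y) for x, y in test)
```

The key design choice is that `x_params` and `y_params` are computed exactly once, during training, and every later step reuses them unchanged.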
In this tutorial, we explained why we should split before normalization to avoid a biased evaluation.
Finally, we learned a few simple steps to correctly use normalization and splitting in a machine learning project.