Last updated: February 28, 2025
In this tutorial, we’ll explain the Scikit-learn (Sklearn) Pipeline class and how to use it.
Scikit-learn, or Sklearn, is a popular machine learning library for the Python programming language. It provides algorithms and utilities for classification, regression, clustering, model selection, data preprocessing, and more. Sklearn is well-documented and user-friendly, making it a common choice for both beginners and experienced developers.
One of its useful but perhaps less commonly utilized classes is Pipeline, which we’ll explain further below.
The Pipeline class in Sklearn is a utility that helps automate the process of transforming data and applying models. Often in machine learning modeling, we need to apply the same sequence of steps to both the training and test data. For example, we might want to standardize the input features, apply PCA, and predict with logistic regression.
With the Pipeline class, these steps can be easily combined into one object and then applied to training and test data.
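To make the standardize, PCA, and logistic regression example concrete, here is a minimal sketch of how those three steps might be combined into one object. The step names ('scaler', 'pca', 'classifier'), the choice of n_components=2, and the use of the Iris data are assumptions made for illustration only:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative data; any feature matrix X and label vector y would work
X, y = load_iris(return_X_y=True)

# Each step is a (name, estimator) pair; the names are arbitrary labels
pca_pipeline = Pipeline([
    ('scaler', StandardScaler()),          # standardize the input features
    ('pca', PCA(n_components=2)),          # assumed: keep 2 principal components
    ('classifier', LogisticRegression()),  # final estimator
])

# fit() fits every step in order; predict() reuses the fitted transformers
pca_pipeline.fit(X, y)
predictions = pca_pipeline.predict(X[:5])
print(predictions)
```

Every step except the last must be a transformer, while the last step can be any estimator, such as the classifier used here.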
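Key features of the Sklearn Pipeline include reduced code complexity, consistent preprocessing of training and test data, and a lower risk of errors.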
To help better understand the Pipeline class, we will present a few examples below.
As an example, we’ll use the Iris dataset bundled with Sklearn for multi-class classification. We’ll load the data, split it into training and test sets, select the two best features based on the ANOVA F-value, standardize the features, and use logistic regression to predict classes.
One approach without the Pipeline class would look like this:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# Load and split dataset
iris = load_iris()
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Feature selection
selector = SelectKBest(score_func=f_classif, k=2)
X_train_selected = selector.fit_transform(X_train, y_train)
X_test_selected = selector.transform(X_test)
# Feature scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_selected)
X_test_scaled = scaler.transform(X_test_selected)
# Train the model
model = LogisticRegression()
model.fit(X_train_scaled, y_train)
# Make predictions
y_pred = model.predict(X_test_scaled)
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')
The same example with the Pipeline class looks like this:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
# Load and split dataset
iris = load_iris()
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create a pipeline with feature selection, scaling, and model training
pipeline = Pipeline([
    ('feature_selection', SelectKBest(score_func=f_classif, k=2)),
    ('scaler', StandardScaler()),
    ('logistic_regression', LogisticRegression())
])
# Train the pipeline
pipeline.fit(X_train, y_train)
# Make predictions
y_pred = pipeline.predict(X_test)
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')
Notice that in the first example, we need to apply feature selection and scaling to the training and test sets separately, calling fit_transform() on the training set and transform() on the test set. We also need a separate variable to store the output of every preprocessing step (X_train_selected, X_test_selected, X_train_scaled, and X_test_scaled).
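A pipeline does not expose those intermediate variables, but the fitted steps and the preprocessed data can still be inspected when needed. The following is a minimal sketch, assuming the same Iris setup and step names as the pipeline example; the variable names selected_mask and X_test_preprocessed are hypothetical:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

pipeline = Pipeline([
    ('feature_selection', SelectKBest(score_func=f_classif, k=2)),
    ('scaler', StandardScaler()),
    ('logistic_regression', LogisticRegression())
])
pipeline.fit(X_train, y_train)

# Look up a fitted step by its name and see which features it kept
selector = pipeline.named_steps['feature_selection']
selected_mask = selector.get_support()

# pipeline[:-1] is a sub-pipeline holding every step except the final estimator,
# so transform() returns the selected and scaled test features
X_test_preprocessed = pipeline[:-1].transform(X_test)

print(selected_mask)
print(X_test_preprocessed.shape)
```

Slicing returns a new Pipeline that shares the already-fitted steps, so nothing is refitted here.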
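In the second example, the Pipeline object handles this for us: fit() fits each transformer on the training data only, and predict() applies the already-fitted transformers to the test data before calling the model. If we need to change a preprocessing step, we only need to modify the Pipeline initialization.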
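As a rough illustration of how small such a change can be, the sketch below swaps StandardScaler for MinMaxScaler in the initialization and then adjusts the number of selected features with set_params(). The choice of MinMaxScaler and k=3 is an assumption made purely for demonstration, not a recommendation:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Only the 'scaler' entry differs from the original pipeline
pipeline = Pipeline([
    ('feature_selection', SelectKBest(score_func=f_classif, k=2)),
    ('scaler', MinMaxScaler()),
    ('logistic_regression', LogisticRegression())
])

# Parameters of a step can also be changed afterwards using the
# '<step name>__<parameter>' convention
pipeline.set_params(feature_selection__k=3)

# The training and evaluation code stays exactly the same
pipeline.fit(X_train, y_train)
accuracy = accuracy_score(y_test, pipeline.predict(X_test))
print(f'Accuracy: {accuracy}')
```

The same naming convention is what lets tools such as GridSearchCV tune the parameters of individual pipeline steps.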
In this article, we explained the Sklearn Pipeline class and walked through an example of its use. The Pipeline class reduces code complexity, ensures that the same preprocessing is applied to training and test data, and minimizes the risk of errors, making it useful for both beginners and experienced developers. The example shows how Pipeline can turn a multi-step workflow into a single, more manageable object.