1. Introduction

Bagging, boosting, and stacking belong to a class of machine learning algorithms known as ensemble learning algorithms. Ensemble learning involves combining the predictions of multiple models into one to increase prediction performance.

In this tutorial, we’ll review the differences between bagging, boosting, and stacking.

2. Bagging

Bagging, also known as bootstrap aggregation, is an ensemble learning technique that combines bootstrapping and aggregation to yield a more stable model and improve prediction performance.

In bagging, we first sample equal-sized subsets of data from a dataset with bootstrapping, i.e., we sample with replacement. Then, we use those subsets to train several weak models independently. A weak model is one with low prediction accuracy. In contrast, strong models are very accurate. To get a strong model, we aggregate the predictions from all the weak models:

Steps in bagging: the training data is split into three different subsets, each used to train a different model in parallel.

So, there are three steps:

  1. Sample equal-sized subsets with replacement
  2. Train weak models on each of the subsets independently and in parallel
  3. Combine the results from each of the weak models by averaging or voting to get a final result

The results are aggregated by averaging the results for regression tasks or by picking the majority class in classification tasks.
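
To make the aggregation step concrete, here's a minimal sketch with made-up prediction arrays, showing majority voting for classification and averaging for regression:

import numpy as np
from scipy.stats import mode

# Hypothetical predictions from three weak classifiers for four inputs
clf_preds = np.array([[0, 1, 1, 2],
                      [0, 1, 2, 2],
                      [1, 1, 2, 2]])

# Classification: take the majority vote across the models (axis 0)
majority = mode(clf_preds, axis=0).mode
print(np.ravel(majority))  # [0 1 2 2]

# Hypothetical predictions from three weak regressors for three inputs
reg_preds = np.array([[2.1, 3.0, 4.2],
                      [1.9, 3.2, 4.0],
                      [2.0, 3.1, 4.1]])

# Regression: average the predictions across the models
print(reg_preds.mean(axis=0))  # approximately [2.0, 3.1, 4.1]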

2.1. Algorithms That Use Bagging

The main idea behind bagging is to reduce the variance of the final model, ensuring that it is robust and not overly influenced by specific samples in the training data.

For this reason, bagging is mainly applied to high-variance, tree-based models such as decision trees. Random forests are themselves a bagging-based method: they bag decision trees and add random feature selection at each split.
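
As a quick illustration (a minimal sketch, assuming the Iris dataset we use throughout this tutorial), scikit-learn's RandomForestClassifier gives us this behavior out of the box:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# A random forest bags many decision trees, each trained on a bootstrap sample
forest = RandomForestClassifier(n_estimators=100, random_state=42)
print(cross_val_score(forest, X, y, cv=5).mean())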

2.2. Pros and Cons of Bagging

Here’s a quick summary of bagging:

Pros:

  • Reduces overall variance
  • Increases the model's robustness to noise in the data

Cons:

  • A high number of weak models may reduce model interpretability

2.3. Implementing Bagging (Almost) From Scratch

In this section, we’ll implement bagging. For simplicity’s sake, we’ll use Scikit-Learn to access a well-known learning dataset, the base learner model, and some utility functions for tasks like splitting the dataset. We’ll also use NumPy to deal with the data as arrays, including randomly selecting a subset of the data to train the weak learners:

from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
import numpy as np
from scipy.stats import mode

# Define the SimpleBag class without inheritance
class SimpleBag:
    def __init__(self, base_estimator=None, n_estimators=10, subset_size=0.8):
        self.base_estimator = base_estimator if base_estimator else DecisionTreeClassifier(max_depth=1, max_features=1)
        self.n_estimators = n_estimators
        self.subset_size = subset_size
        self.base_learners = []
        self.is_fitted = False

    def fit(self, X, y):
        n_samples = X.shape[0]
        subset_size = int(n_samples * self.subset_size)
        self.base_learners = []

        for _ in range(self.n_estimators):
            indices = np.random.choice(range(n_samples), size=subset_size, replace=True)
            X_subset, y_subset = X[indices], y[indices]
            cloned_estimator = clone(self.base_estimator)
            cloned_estimator.fit(X_subset, y_subset)
            self.base_learners.append(cloned_estimator)
        
        self.is_fitted = True

    def predict(self, X):
        if not self.is_fitted:
            raise Exception("This SimpleBag instance is not fitted yet.")
        
        predictions = np.array([learner.predict(X) for learner in self.base_learners]).T
        final_predictions, _ = mode(predictions, axis=1)
        return final_predictions.ravel()

We initialize the SimpleBag with the following parameters:

  • base_estimator: we use DecisionTreeClassifier by default, but any weak learner adhering to the scikit-learn estimator interface can be used. Decision trees are known for their high variance; still, by training multiple models on different subsets of the data and aggregating their outputs, we reduce the overall variance of the ensemble.
  • n_estimators: the number of base estimators in the ensemble. More estimators typically lead to more stable predictions, at an increased computational cost.
  • subset_size: The fraction of the training dataset used to bootstrap each weak learner. This controls the size of the subsets and is a key parameter in bagging, as it affects the diversity of the models in the ensemble.

In the fit() method, we use random sampling with replacement (np.random.choice()) to create a bootstrapped subset of the training data. Then, we clone the base estimator and train the new instance on that subset. Finally, we add the newly trained model to the base_learners list.

In the predict() method, we use each trained model to predict the input and then aggregate the predictions into a final prediction. In this case, we collect all predictions in an array and choose the most common one as the overall result, using SciPy's mode function to simplify finding the most frequent prediction.

Now, using the Iris dataset, we’ll compare the performance of our completed ensemble model with that of a single decision tree.

# Load data
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train and evaluate a single Decision Tree
single_tree = DecisionTreeClassifier(max_depth=1, max_features=1)
single_tree.fit(X_train, y_train)
single_tree_predictions = single_tree.predict(X_test)
single_tree_accuracy = accuracy_score(y_test, single_tree_predictions)

# Initialize, fit, and evaluate the SimpleBag model
simple_bag = SimpleBag(n_estimators=100, subset_size=0.5)
simple_bag.fit(X_train, y_train)
simple_bag_predictions = simple_bag.predict(X_test)
simple_bag_accuracy = accuracy_score(y_test, simple_bag_predictions)

print(f'Accuracy of the single Decision Tree model: {single_tree_accuracy:.2f}')
print(f'Accuracy of the SimpleBag ensemble model: {simple_bag_accuracy:.2f}')

The single DecisionTreeClassifier model, configured with max_depth=1 and max_features=1, yields a modest accuracy of 0.63. Such performance is expected for a model this constrained, and decision trees are also known for their high variance and tendency to overfit.

In our SimpleBag ensemble, we train multiple decision trees on different bags of the training data and aggregate their outputs, which reduces the overall variance of our predictions. If we run the code multiple times, we'll notice that our results vary significantly, from matching the weak learner to achieving perfect accuracy. This is due to the small dataset and our simplistic sampling process, but it's enough to illustrate the bagging approach.
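
To see that variability for ourselves, we can repeat the comparison several times and look at the spread of the accuracies. This is a rough sketch that reuses the SimpleBag class and the train/test split from the snippets above:

import numpy as np

# Repeat the comparison to observe how much the accuracies spread out
tree_scores, bag_scores = [], []
for _ in range(20):
    tree = DecisionTreeClassifier(max_depth=1, max_features=1)
    tree.fit(X_train, y_train)
    tree_scores.append(accuracy_score(y_test, tree.predict(X_test)))

    bag = SimpleBag(n_estimators=100, subset_size=0.5)
    bag.fit(X_train, y_train)
    bag_scores.append(accuracy_score(y_test, bag.predict(X_test)))

print(f'Single tree: mean={np.mean(tree_scores):.2f}, std={np.std(tree_scores):.2f}')
print(f'SimpleBag:   mean={np.mean(bag_scores):.2f}, std={np.std(bag_scores):.2f}')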

2.4. Using Existing Bagging Models

Instead of implementing the algorithm from scratch, we should leverage models from well-established libraries. Not only is it less work, but a library with a broad user base is usually better tested, faster, and less buggy than in-house code.

Scikit-learn is the most popular Python library for classical ML algorithms. It offers a comprehensive suite of tools and algorithm implementations, including one for bagging known as BaggingClassifier:

from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the Iris dataset
X, y = load_iris(return_X_y=True)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the BaggingClassifier with a DecisionTreeClassifier as the base estimator
bagging_model = BaggingClassifier(
    estimator=DecisionTreeClassifier(max_depth=2, max_features=1),
    n_estimators=10,
    random_state=42
)

# Fit the model on the training data
bagging_model.fit(X_train, y_train)

# Predict on the test data
predictions = bagging_model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, predictions)

print(f'Accuracy of scikit-learn BaggingClassifier: {accuracy:.2f}')

In this example, we replicate the setup of our custom bagging approach but use the BaggingClassifier provided by scikit-learn. We specify a shallow DecisionTreeClassifier (max_depth=2, max_features=1) as the base estimator, similar to our earlier experiment. The BaggingClassifier handles the complexities of training each base estimator on bootstrapped samples and aggregating their predictions, streamlining the process into a few lines of code.
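
BaggingClassifier also exposes knobs for the sampling itself, such as max_samples, max_features, and bootstrap, and it can estimate generalization performance on the out-of-bag samples via oob_score. Here's a sketch that reuses the data split from the snippet above:

# A variant that mirrors our earlier subset_size=0.5 setting and adds out-of-bag scoring
tuned_bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(max_depth=1, max_features=1),
    n_estimators=100,
    max_samples=0.5,   # fraction of the training set drawn for each estimator
    bootstrap=True,    # sample with replacement
    oob_score=True,    # estimate accuracy on the out-of-bag samples
    random_state=42
)
tuned_bagging.fit(X_train, y_train)
print(f'Out-of-bag score: {tuned_bagging.oob_score_:.2f}')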

3. Boosting

In boosting, we train a sequence of models. Each model is trained on a weighted training set. We assign weights based on the errors of the previous models in the sequence. 

The main idea behind sequential training is to have each model correct the errors of its predecessor. This continues until we reach a predefined number of trained models or meet some other stopping criterion.

During training, instances that are classified incorrectly are assigned higher weights so that the next model in the sequence gives them more priority:

Steps in boosting: the weighted training data is used to train three different models in sequence.

Additionally, weaker models are assigned lower weights than strong models when combining their predictions into the final output.

So, we first initialize data weights to the same value and then perform the following steps iteratively:

  1. Train a model on all instances
  2. Calculate the error on model output over all instances
  3. Assign a weight to the model (high for good performance and vice-versa)
  4. Update data weights: give higher weights to samples with high errors
  5. Repeat the previous steps until the performance is satisfactory or another stopping condition is met

Finally, we combine the models into the one we use for prediction.
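
To make the weight updates concrete, here's a tiny numeric sketch of a single boosting iteration on a made-up binary problem, using the classic AdaBoost formulas (the multiclass variant we implement later differs slightly):

import numpy as np

# Toy labels and one weak model's predictions (binary task, labels in {-1, +1})
y_true = np.array([ 1, -1,  1,  1, -1])
y_pred = np.array([ 1, -1, -1,  1, -1])   # the third sample is misclassified

# Step 1: uniform data weights
w = np.full(5, 1 / 5)

# Step 2: weighted error of the model
err = np.sum(w * (y_pred != y_true))      # 0.2

# Step 3: model weight (higher for better-performing models)
alpha = 0.5 * np.log((1 - err) / err)     # about 0.69

# Step 4: increase the weights of misclassified samples, then normalize
w = w * np.exp(-alpha * y_true * y_pred)
w = w / w.sum()
print(alpha, w)   # the misclassified sample now carries half the total weight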

3.1. Algorithms That Use Boosting

Boosting generally improves the accuracy of a machine learning model by improving the performance of weak learners. We typically use XGBoost, CatBoost, and AdaBoost.

These algorithms apply different boosting techniques and are most noted for achieving excellent performance.

3.2. Pros and Cons of Boosting

Boosting has many advantages but isn’t without shortcomings:

Pros:

  • Improves overall accuracy
  • Reduces overall bias by improving on the weaknesses of the previous model

Cons:

  • Can be computationally expensive
  • Sensitive to noisy data
  • Model dependency may allow errors to propagate through the sequence

The decision to use boosting depends on how noisy the data are and on our computational capabilities.

3.3. Implementing Boosting (Almost) From Scratch

There are various boosting algorithms, all based on adjusting the training of each new learner according to the performance of the previously trained learners.

Gradient Boosting, eXtreme Gradient Boosting (XGBoost), and Light Gradient Boosting (LightGBM) fit each new learner to the residual errors of the current ensemble, using a gradient-descent approach on the loss function. However, they make different tradeoffs between speed and performance, regularization, and their ability to deal with sparse or very large datasets. They can be used for both classification and regression.

Categorical Boosting (CatBoost) specializes in datasets with categorical features. It can be used without extensive data preprocessing, such as converting the categories to one-hot encoding, and it's also resistant to overfitting.

Adaptive Boosting (AdaBoost) was initially developed for classification tasks, but it has been adapted for regression problems. It uses changing sample weights to direct the training of new learners toward the training samples on which the previous learners performed poorly. We'll now implement a basic adaptive boosting algorithm on top of scikit-learn models.

Because we need to pass sample weights to the underlying models, we can only use base models whose fitting method accepts them; the Decision Tree, Logistic Regression, Ridge Classifier, and Support Vector Machine implementations in scikit-learn all support a sample_weight argument:

from sklearn.base import BaseEstimator, ClassifierMixin, clone
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import LabelEncoder
import numpy as np

class SimpleMultiClassBoosting(BaseEstimator, ClassifierMixin):
    def __init__(self, base_estimator=None, n_estimators=50):
        self.base_estimator = base_estimator if base_estimator is not None else DecisionTreeClassifier(max_depth=1)
        self.n_estimators = n_estimators
        self.learners = []
        self.learner_weights = []
        self.label_encoder = LabelEncoder()

    def fit(self, X, y):
        # Convert labels to [0, n_classes-1]
        y_encoded = self.label_encoder.fit_transform(y)
        n_classes = len(self.label_encoder.classes_)
        
        # Initialize sample weights uniformly
        sample_weights = np.full(X.shape[0], 1 / X.shape[0])

        # Reset the ensemble in case fit() is called more than once
        self.learners = []
        self.learner_weights = []

        for _ in range(self.n_estimators):
            learner = clone(self.base_estimator)
            learner.fit(X, y_encoded, sample_weight=sample_weights)
            learner_pred = learner.predict(X)

            # Compute the weighted error rate (misclassification rate)
            incorrect = (learner_pred != y_encoded)
            learner_error = np.average(incorrect, weights=sample_weights)

            # Stop if the learner is no better than random guessing
            if learner_error >= 1 - (1 / n_classes):
                break

            # Compute the learner weight using the SAMME algorithm
            learner_weight = np.log((1 - learner_error) / (learner_error + 1e-10)) + np.log(n_classes - 1)

            # Increase the weights of misclassified samples and renormalize
            sample_weights *= np.exp(learner_weight * incorrect)
            sample_weights /= np.sum(sample_weights)

            # Save the current learner and its weight
            self.learners.append(learner)
            self.learner_weights.append(learner_weight)

        return self
    
    def predict(self, X):
        # Collect predictions from each learner
        learner_preds = np.array([learner.predict(X) for learner in self.learners])
        
        # Weighted vote for each sample's prediction across all learners
        weighted_preds = np.zeros((X.shape[0], len(self.label_encoder.classes_)))
        for i in range(len(self.learners)):
            weighted_preds[np.arange(X.shape[0]), learner_preds[i]] += self.learner_weights[i]
        
        # Final prediction is the one with the highest weighted vote
        y_pred = np.argmax(weighted_preds, axis=1)
        # Convert back to original class labels
        return self.label_encoder.inverse_transform(y_pred)

In the above implementation, we start by encoding the labels using scikit-learn’s LabelEncoder, and then we proceed to train the weak learners, adjusting the weights at each step:

  1. At the start, we assign an equal weight to each sample: for a dataset with N samples, each sample's initial weight is set to 1/N.
  2. On each iteration, we train a weak learner, calculate its learner weight, and recalculate the sample weights:
    • The current weak learner is trained on the training data using the current sample weights.
    • After training, we evaluate the learner's performance by calculating its weighted error rate.
    • We then use this error rate to calculate the weight we'll give to this learner's predictions when making the overall prediction of our ensemble. If the error shows a performance worse than random guessing, we stop the training and don't add more learners to the ensemble.
    • Finally, we update the sample weights based on the predictions of the current learner: samples the learner misclassified are given more weight, while correctly classified samples have their relative weight reduced.

To find the final prediction of our ensemble, we aggregate the individual predictions of the weak learners, but we have to weight the votes rather than simply count them as we did in the bagging example:

  1. We make predictions using all the weak learners and store the result in a NumPy array.
  2. Then, we calculate the overall prediction. We start by creating an array with NxC dimensions, where N is the number of entries we’re classifying, and C is the number of possible categories. For each sample, we add the learner weight to the category it predicted.
  3. We choose the class that receives the highest total weight across all learners as the final prediction for the input.

When comparing our ensemble to a single weak learner, we can appreciate the improved performance. Another interesting point is that the results are more stable across multiple runs than with our simple bagging implementation:

# Load data
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train and evaluate a single Decision Tree
single_tree = DecisionTreeClassifier(max_depth=1, max_features=1)
single_tree.fit(X_train, y_train)
single_tree_predictions = single_tree.predict(X_test)
single_tree_accuracy = accuracy_score(y_test, single_tree_predictions)

# Initialize, fit, and evaluate the SimpleMultiClassBoosting model
simple_boost = SimpleMultiClassBoosting(n_estimators=100)
simple_boost.fit(X_train, y_train)
simple_boost_predictions = simple_boost.predict(X_test)
simple_boost_accuracy = accuracy_score(y_test, simple_boost_predictions)

print(f'Accuracy of the single Decision Tree model: {single_tree_accuracy:.2f}')
print(f'Accuracy of the SimpleMultiClassBoosting ensemble model: {simple_boost_accuracy:.2f}')

The base learner achieves an accuracy between 0.43 and 0.63, while our simple boosting ensemble achieves a consistent accuracy of 0.93. This consistency is a stark contrast to the variability of our bagging implementation; in simple terms, our boosting implementation doesn't rely on randomness when training the model.

3.4. Using Existing Boosting Models

As with the bagging algorithm, it's almost always better to use a library that implements these algorithms. Scikit-learn includes implementations of different boosting strategies: AdaBoostClassifier, AdaBoostRegressor, GradientBoostingClassifier, and GradientBoostingRegressor.

Since our simple implementation is a simplified Adaptive Boosting algorithm, we can compare it to scikit-learn's AdaBoostClassifier:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset
X, y = load_iris(return_X_y=True)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize a shallow Decision Tree Classifier
shallow_decision_tree = DecisionTreeClassifier(max_depth=1)

# Initialize the AdaBoost Classifier using the shallow Decision Tree as the base estimator
ada_clf = AdaBoostClassifier(estimator=shallow_decision_tree, n_estimators=100, random_state=42, algorithm="SAMME")

# Train the AdaBoost model on the training set
ada_clf.fit(X_train, y_train)

# Make predictions on the test set
predictions = ada_clf.predict(X_test)

# Evaluate and print the model's accuracy on the test set
print("Accuracy:", accuracy_score(y_test, predictions))

Running it repeatedly, we obtain a stable accuracy of 1.0. Besides the higher accuracy, we also get an extensively tested model and more options for tweaking how we train it.
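
For example, we can shrink each learner's contribution with learning_rate and inspect the fitted ensemble through attributes such as estimator_weights_. This sketch reuses the data split from the snippet above:

# A variant with a smaller learning rate
ada_tuned = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),
    n_estimators=200,
    learning_rate=0.5,   # shrinks each learner's contribution
    algorithm="SAMME",
    random_state=42
)
ada_tuned.fit(X_train, y_train)

print("Accuracy:", accuracy_score(y_test, ada_tuned.predict(X_test)))
print("First few learner weights:", ada_tuned.estimator_weights_[:5])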

4. Stacking

In stacking, the predictions of base models are fed as input to a meta-model (or meta-learner). The job of the meta-model is to take the predictions of the base models and make a final prediction:

Steps in stacking: the predictions of three different models trained on the training data are aggregated by a meta-model.

The base and meta-models don’t have to be of the same type. For example, we can pair a decision tree with a support vector machine (SVM).

Here are the steps:

  1. Construct base models on different portions of the training data
  2. Train a meta-model on the predictions from the base models

4.1. Pros and Cons of Stacking

We can summarize stacking as follows:

Pros:

  • Combines the benefits of different models into one
  • Increases overall accuracy

Cons:

  • May take longer to train and aggregate the predictions of different types of models
  • Training several base models and a meta-model increases complexity

4.2. Implementing Stacking (Almost) From Scratch

As before, we can use the base models in scikit-learn to write a simple stacking implementation:

import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

class SimpleStacking:
    def __init__(self, base_learners, meta_learner):
        self.base_learners = base_learners
        self.meta_learner = meta_learner
        self.fitted_base_learners = []

    def fit(self, X, y):
        meta_features = []
        self.fitted_base_learners = []
        
        # Train each base learner on the full training data and collect its in-sample predictions as meta-features
        for base_learner in self.base_learners:
            fitted_learner = clone(base_learner).fit(X, y)
            self.fitted_base_learners.append(fitted_learner)
            preds = fitted_learner.predict(X) 
            meta_features.append(preds)

        # Stack meta-features horizontally
        meta_features = np.array(meta_features).T
        
        # Train the meta-learner on the meta-features
        self.meta_learner.fit(meta_features, y)

    def predict(self, X):
        # Generate meta-features for new data
        meta_features = [learner.predict(X) for learner in self.fitted_base_learners]
        meta_features = np.array(meta_features).T
        # Final prediction from meta-learner
        return self.meta_learner.predict(meta_features)

As with the previous ensembles, we train a set of base learners. In this case, we require the user of our class to pass a list of instantiated base learners, and we train each learner independently, as we did in bagging, but here we use all the data to train each model. The main difference from before is that we use the predictions of the base learners as features to train a meta-learner.

In this approach, each base learner and the meta-learner can use different algorithms; for example, we can use a Decision Tree and Logistic Regression as base models and a Support Vector Machine as the meta-learner:

# Load the Iris dataset
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define base learners
base_learners = [
    DecisionTreeClassifier(max_depth=1, max_features=1),
    LogisticRegression(random_state=42)
]

# Define meta-learner
meta_learner = SVC(probability=True, random_state=42)

# Initialize and train the SimpleStacking model
stacking_model = SimpleStacking(base_learners, meta_learner)
stacking_model.fit(X_train, y_train)

# Make predictions and evaluate the model
predictions = stacking_model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f"Stacking Model Accuracy: {accuracy}")

Our simple stacking model achieves 1.0 accuracy on the Iris dataset, but it has weaknesses that would limit its performance on more complex problems. Since we train the meta-learner directly on the in-sample predictions of the base learners, we might end up overfitting the training data. A common way to address this problem is to use cross-validated predictions (cross_val_predict) so that the meta-learner only sees out-of-fold predictions, preventing the base learners' training data from leaking into its features.

Another issue is that we use the final class prediction from the base learners; this drops uncertainty information because it assumes the base learners are certain about their predictions. This is easy to fix: we can use predict_proba(), which assigns a probability to each possible class, and use those probabilities as input for the meta-learner.
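
As a rough sketch of both improvements, assuming every base learner implements predict_proba(), we could build the meta-features from out-of-fold probability predictions. The hypothetical LessLeakyStacking class below replaces only the fit() and predict() logic of our SimpleStacking:

from sklearn.base import clone
from sklearn.model_selection import cross_val_predict
import numpy as np

class LessLeakyStacking:
    def __init__(self, base_learners, meta_learner, cv=5):
        self.base_learners = base_learners
        self.meta_learner = meta_learner
        self.cv = cv
        self.fitted_base_learners = []

    def fit(self, X, y):
        # Out-of-fold probability predictions serve as meta-features,
        # so the meta-learner never sees in-sample predictions
        meta_features = np.hstack([
            cross_val_predict(clone(learner), X, y, cv=self.cv, method='predict_proba')
            for learner in self.base_learners
        ])
        self.meta_learner.fit(meta_features, y)

        # Refit every base learner on the full training set for use at prediction time
        self.fitted_base_learners = [clone(learner).fit(X, y) for learner in self.base_learners]
        return self

    def predict(self, X):
        # Use the probability outputs of the fitted base learners as meta-features
        meta_features = np.hstack([learner.predict_proba(X) for learner in self.fitted_base_learners])
        return self.meta_learner.predict(meta_features)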

Even this small sketch shows how these improvements gradually complicate our code, which is why, in practice, we should favor using the models from a library.

4.3. Stacking With SciKit-Learn

Using the StackingClassifier is as simple as using any of the other ensembles: we instantiate the base and meta-learners and then pass them to the stacking class:

from sklearn.datasets import load_iris
from sklearn.ensemble import StackingClassifier
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load the Iris dataset
X, y = load_iris(return_X_y=True)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define base learners
base_learners = [
    ('decision_tree', DecisionTreeClassifier(max_depth=1)),
    ('lr', LogisticRegression())
]

# Define the meta-learner
meta_learner = SVC(probability=True, random_state=42)

# Initialize the Stacking Classifier with the base learners and the meta-learner
stack_clf = StackingClassifier(estimators=base_learners, final_estimator=meta_learner, cv=5)

# Train the stacking classifier
stack_clf.fit(X_train, y_train)

# Make predictions on the test set
predictions = stack_clf.predict(X_test)

# Evaluate and print the accuracy of the model
print("Stacking Model Accuracy:", accuracy_score(y_test, predictions))

Not only is this classifier more sophisticated than our straightforward approach, but it's also flexible. We can use the constructor arguments to choose whether the base learners pass prediction probabilities or predicted classes to the meta-learner. On top of that, we can feed the raw input features alongside the base learners' predictions to the meta-learner, effectively using the base learners' outputs as extra features.
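
For instance, reusing the base and meta-learners defined above, a variant could pass class probabilities explicitly and forward the raw features to the meta-learner:

# Same base and meta-learners as above, but with explicit stacking options
stack_clf_v2 = StackingClassifier(
    estimators=base_learners,
    final_estimator=meta_learner,
    cv=5,
    stack_method='predict_proba',  # feed class probabilities to the meta-learner
    passthrough=True               # also feed the original input features
)
stack_clf_v2.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, stack_clf_v2.predict(X_test)))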

5. Differences Between Bagging, Boosting, and Stacking

The main differences between bagging, boosting, and stacking are in the approach, base models, subset selection, goals, and model combination:

  • Approach: bagging trains weak models in parallel; boosting trains weak models sequentially; stacking aggregates the predictions of multiple models with a meta-model
  • Base models: homogeneous in bagging and boosting; can be heterogeneous in stacking
  • Subset selection: random sampling with replacement in bagging; subsets aren't required in boosting or stacking
  • Goal: bagging reduces variance; boosting reduces bias; stacking reduces both variance and bias
  • Model combination: majority voting or averaging in bagging; weighted majority voting or averaging in boosting; an ML model in stacking

The selection of the technique to use depends on the overall objective and task at hand. Bagging is best when the goal is to reduce variance, whereas boosting is the choice for reducing bias. If the goal is to reduce variance and bias and improve overall performance, we should use stacking.

6. Other Libraries and Implementations

Scikit-learn is the most popular library implementing foundational machine learning models in Python. Besides those foundational models, it also implements several ensemble methods, including bagging, different boosting strategies, and stacking.

Other popular Python libraries, like XGBoost, LightGBM, and CatBoost, focus on gradient boosting models and don't have a stand-alone implementation of bagging. However, they all include parameters that control subsampling when training weak learners, adding the core benefit of bagging to boosting algorithms.
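
For example, XGBoost lets us subsample both rows and columns for each tree; here's a minimal sketch, assuming the xgboost package is installed (LightGBM and CatBoost offer similar parameters):

from xgboost import XGBClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Row and column subsampling bring a bagging-like effect to boosting
model = XGBClassifier(
    n_estimators=100,
    subsample=0.8,          # fraction of rows sampled for each tree
    colsample_bytree=0.8,   # fraction of features sampled for each tree
    random_state=42
)
print(cross_val_score(model, X, y, cv=5).mean())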

H2O is a more general platform that implements multiple algorithms on the JVM. Besides providing interfaces to train and use models and a native interface for Spark, it also provides a REST API that can be accessed from Python or R. Although it doesn't have a general bagging model, it implements Random Forest, and it offers subsampling controls for its other ensemble models.

Weka is another Java ML library that is mainly used for academic purposes. The project was started at the University of Waikato.

7. Conclusions

In this article, we provided an overview of bagging, boosting, and stacking. Bagging trains multiple weak models in parallel on bootstrapped subsets of the data. Boosting trains multiple homogeneous weak models in sequence, with each successor improving on its predecessor's errors. Stacking trains multiple (possibly heterogeneous) base models and combines their predictions with a meta-model.

The choice of ensemble technique depends on the goal and task, as all three techniques aim to improve the overall performance by combining the power of multiple models.
