How Gradient Boosting Actually Works Under the Hood: Building It from Scratch in Python

LDS Team
Let's Data Science

Every data science interview eventually arrives at the same question: "How does gradient boosting actually work?" You can say "it builds trees sequentially" and watch the interviewer nod politely, or you can explain that it performs gradient descent in function space, fitting each new tree to the negative gradient of the loss. The second answer gets the offer. The gap between those two answers is exactly what this article closes — by building gradient boosting from scratch, line by line, in Python.

Gradient boosting is the dominant algorithm for structured (tabular) data. According to ML Contests' 2024 analysis, 74% of winning Kaggle solutions for tabular problems used some form of gradient boosting. The three production-grade libraries it spawned — XGBoost (now at version 3.2), LightGBM (4.6), and CatBoost (1.2.10) — sit behind fraud detection systems, credit scoring pipelines, and recommendation engines at companies from Stripe to Spotify. Jerome Friedman formalized the algorithm in his 2001 paper Greedy Function Approximation: A Gradient Boosting Machine (now cited over 26,000 times), framing it as gradient descent in function space rather than parameter space. That framing is the key insight, and we'll build it from the ground up using a house-price prediction example.

Single decision trees and their limits

A single decision tree partitions the feature space into rectangular regions and assigns a constant prediction to each region. It can model non-linear relationships without feature scaling, but it has a critical weakness: high variance. Small perturbations in the training data can produce wildly different tree structures. An unpruned tree memorizes noise — it creates deep branches that capture random fluctuations rather than real patterns.

Picture predicting house prices. A deep tree might learn that one specific house at $487,000 with 3.2 bathrooms gets its own leaf node. That's memorization, not generalization. On new data, that leaf is useless.
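That failure mode is easy to demonstrate. The sketch below uses synthetic data as a stand-in for house features (one noisy quadratic signal, not real prices), fits an unpruned tree, and compares training error to held-out error:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for house-price data: one noisy feature
rng = np.random.default_rng(0)
X = rng.uniform(-0.5, 0.5, size=(400, 1))
y = 3 * X[:, 0] ** 2 + 0.05 * rng.standard_normal(400)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

# No depth limit: the tree grows until every training point has its own leaf
deep = DecisionTreeRegressor(max_depth=None, random_state=0).fit(X_tr, y_tr)

train_mse = np.mean((y_tr - deep.predict(X_tr)) ** 2)
test_mse = np.mean((y_te - deep.predict(X_te)) ** 2)
print(f"Train MSE: {train_mse:.6f}")   # essentially zero: pure memorization
print(f"Test MSE:  {test_mse:.6f}")    # the memorized noise doesn't transfer
```

The training error collapses to zero while the held-out error does not; that gap is the variance problem the ensemble strategies below address.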

Two ensemble strategies fix this problem:

| Strategy | How It Works | What It Reduces | Example |
| --- | --- | --- | --- |
| Bagging | Train many independent trees on random data subsets, average predictions | Variance | Random Forest |
| Boosting | Train a sequence of weak trees, each correcting the previous ensemble's errors | Bias | Gradient Boosting |

[Figure: Bagging trains trees in parallel and averages them while boosting trains trees sequentially on residuals]

Bagging builds trees in parallel and averages them. Boosting builds trees sequentially and sums them. Gradient boosting belongs to the boosting family, and the "gradient" part describes how it identifies the mistakes to correct — through the gradient of a loss function.

Key Insight: Random Forest reduces variance by averaging many high-variance trees. Gradient boosting reduces bias by stacking many low-variance (shallow) trees. They solve opposite problems, which is why your choice depends on whether your baseline model underfits or overfits. For more on this tradeoff, see The Bias-Variance Tradeoff.

The boosting intuition with house prices

Before any math, here's the core idea in one sentence: start with a rough guess, measure how wrong it is, train a small model to predict the errors, add a fraction of that correction, and repeat.

We'll use a house-price prediction problem as our running example throughout the entire article. Suppose you're predicting prices in a neighborhood where homes sell for $200,000 to $500,000.

Iteration 0 — Start with the average. The mean house price is $320,000. That becomes the initial prediction for every house. Some homes are worth $180,000 more than the mean, others $120,000 less. These gaps are the residuals.

Iteration 1 — Train a small tree on the residuals. A shallow tree (depth 2, just 4 leaf nodes) learns rough patterns: "houses with more than 2,000 sq ft have positive residuals (we underestimated them); houses near busy roads have negative residuals (we overestimated them)."

Iteration 2 — Add a fraction of the correction. Instead of adding the tree's full prediction, multiply it by a learning rate (say 0.1). If the tree says +$100,000 for a particular house, only add +$10,000. This cautious step prevents overcorrection.

Iterations 3 through 200 — Repeat. Compute new residuals (errors remaining after all corrections so far), fit another shallow tree, add another scaled correction. After 200 iterations, the accumulated corrections closely approximate true prices.

Each tree is weak on its own — a depth-2 tree makes at most 4 distinct predictions. But the sum of 200 small, targeted corrections produces a strong model. That's the entire philosophy.
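A toy version of this loop makes the arithmetic concrete. Here each "tree" is replaced by a perfect oracle that predicts the residuals exactly, so every update closes a fixed fraction (the learning rate) of the remaining gap, and each gap shrinks by a factor of $(1 - \nu)$ per round. The four prices are made up for illustration:

```python
import numpy as np

prices = np.array([180_000.0, 260_000.0, 380_000.0, 460_000.0])
pred = np.full_like(prices, prices.mean())  # Iteration 0: everyone gets the mean
lr = 0.1

for m in range(1, 4):
    residuals = prices - pred               # what's still unexplained
    pred = pred + lr * residuals            # add a fraction of the correction
    print(f"Iteration {m}: max abs residual = {np.max(np.abs(prices - pred)):,.0f}")
```

With real trees the corrections are imperfect, so convergence is slower and less tidy, but the mechanism is exactly this: residual, fractional correction, repeat.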

The loss function and its gradient

The word "gradient" in gradient boosting refers to the gradient (partial derivative) of a loss function. For regression, the standard choice is mean squared error (MSE) loss:

$$L(y_i, F(x_i)) = \frac{1}{2}(y_i - F(x_i))^2$$

Where:

  • $y_i$ is the true target value (actual house price for house $i$)
  • $F(x_i)$ is the model's current prediction for house $i$
  • The $\frac{1}{2}$ factor is a convenience that cancels with the exponent during differentiation

In Plain English: The loss measures how far off each house-price prediction is, squared so that a $50,000 miss hurts four times more than a $25,000 miss.

The negative gradient of this loss with respect to the prediction $F(x_i)$ is:

$$-\frac{\partial L}{\partial F(x_i)} = y_i - F(x_i)$$

Where:

  • $-\frac{\partial L}{\partial F(x_i)}$ is the direction of steepest descent in function space
  • $y_i - F(x_i)$ is simply the residual: the gap between truth and prediction

In Plain English: For MSE loss, the negative gradient is the residual. When we compute actual_price - predicted_price and train a tree on those gaps, we're performing gradient descent in function space. The tree approximates the direction that most reduces the loss, and the learning rate controls the step size.

These negative gradient values are formally called pseudo-residuals. For MSE, pseudo-residuals equal actual residuals. For other loss functions (absolute error, Huber loss, log-loss), pseudo-residuals take a different form, but the algorithm stays identical: compute the negative gradient of whatever loss you choose, fit a tree to those values, and update.
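You can check the "negative gradient equals residual" identity numerically. The sketch below estimates $\partial L / \partial F$ with a central finite difference for one hypothetical house and compares it to the plain residual:

```python
import numpy as np

def mse_loss(y, F):
    return 0.5 * (y - F) ** 2

y_true, F_pred = 430_000.0, 320_000.0   # hypothetical truth vs current prediction
eps = 1e-3

# Central finite difference: dL/dF ~ (L(F+eps) - L(F-eps)) / (2*eps)
grad = (mse_loss(y_true, F_pred + eps) - mse_loss(y_true, F_pred - eps)) / (2 * eps)

neg_grad = -grad
residual = y_true - F_pred
print(f"Negative gradient: {neg_grad:.1f}")
print(f"Residual:          {residual:.1f}")
```

Both print 110000.0: training a tree on residuals really is a gradient step.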

Pro Tip: This generality is Friedman's key contribution. Any differentiable loss function plugs into the same framework. That's why the algorithm isn't called "residual boosting" — it works with gradients far beyond simple residuals.

The complete gradient boosting algorithm

Here is the full algorithm for gradient boosting regression with MSE loss, laid out step by step.

[Figure: Gradient boosting algorithm flow showing initialization, residual computation, tree fitting, and prediction update loop]

Input: Training data $\{(x_i, y_i)\}_{i=1}^{N}$, number of iterations $M$, learning rate $\nu$, maximum tree depth $d$.

Step 1 — Initialize with a constant. Set the initial prediction to the value that minimizes the total loss. For MSE, that's the mean of all targets:

$$F_0(x) = \bar{y} = \frac{1}{N}\sum_{i=1}^{N} y_i$$

Where:

  • $F_0(x)$ is the initial model (a constant function)
  • $\bar{y}$ is the arithmetic mean of all $N$ target values
  • This is the prediction before any trees are added

In Plain English: The initial "model" is just the average house price — $320,000 for every house. It's wrong for every individual house, but it's the single number that's least wrong on average.
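The claim that the mean is the least-wrong constant under MSE is easy to verify with a grid search over candidate constants (made-up prices for illustration):

```python
import numpy as np

prices = np.array([180_000.0, 260_000.0, 380_000.0, 460_000.0])

# Score every constant prediction on a grid with MSE
candidates = np.linspace(150_000, 500_000, 701)
losses = np.array([np.mean((prices - c) ** 2) for c in candidates])
best = candidates[np.argmin(losses)]

print(f"Best constant: {best:,.0f}")
print(f"Mean:          {prices.mean():,.0f}")
```

The grid search lands exactly on the mean, which is why Step 1 uses it.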

Step 2 — For each iteration $m = 1, 2, \ldots, M$:

(a) Compute pseudo-residuals:

$$r_{im} = y_i - F_{m-1}(x_i) \quad \text{for } i = 1, \ldots, N$$

(b) Fit a regression tree $h_m(x)$ with maximum depth $d$ to the pseudo-residuals $\{(x_i, r_{im})\}$.

(c) Update the model:

$$F_m(x) = F_{m-1}(x) + \nu \cdot h_m(x)$$

Where:

  • $F_m(x)$ is the updated model after iteration $m$
  • $F_{m-1}(x)$ is the model from the previous iteration
  • $\nu$ is the learning rate (typically 0.01 to 0.3)
  • $h_m(x)$ is the tree fitted at iteration $m$

In Plain English: The new prediction for each house equals the old prediction plus a small correction. If the learning rate is 0.1 and tree $m$ predicts a +$80,000 correction for a particular house, we only add +$8,000. The remaining $72,000 gap gets addressed by later trees.

Step 3 — Output the final model:

$$F_M(x) = F_0(x) + \nu \sum_{m=1}^{M} h_m(x)$$

The final house price prediction is the mean plus the weighted sum of all tree corrections. That's the entire algorithm — nothing hidden.

Preparing the data

To build gradient boosting from scratch, we need a dataset where the pattern is clear enough to verify visually. We'll generate a synthetic quadratic curve $y = 3x^2$ with small Gaussian noise — think of it as a simplified version of our house-price scenario where a single feature (square footage, normalized) has a curved relationship with price.

<!-- EXEC -->

python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)
X = np.random.rand(200, 1) - 0.5          # 200 points in [-0.5, 0.5]
y = 3 * X[:, 0]**2 + 0.05 * np.random.randn(200)  # quadratic + noise

plt.figure(figsize=(8, 5))
plt.scatter(X, y, color='black', s=10, alpha=0.6)
plt.title("Synthetic Data: y = 3x² + noise")
plt.xlabel("x")
plt.ylabel("y")
plt.show()

Expected output: A U-shaped scatter plot centered at zero. Points near $x = \pm 0.5$ have $y$ values close to 0.75, while points near $x = 0$ cluster around 0. A linear model can't capture this shape, but a sequence of shallow trees can approximate it through additive corrections.

Step 1 — Initializing with the mean

The first prediction is the same for every data point: the mean of y.

<!-- EXEC -->

python
import numpy as np

np.random.seed(42)
X = np.random.rand(200, 1) - 0.5
y = 3 * X[:, 0]**2 + 0.05 * np.random.randn(200)

F0 = np.mean(y)
predictions = np.full(y.shape, F0)

print(f"Initial prediction (F0): {F0:.4f}")
print(f"Shape of predictions: {predictions.shape}")
print(f"All predictions identical: {np.all(predictions == predictions[0])}")

Expected output:

code
Initial prediction (F0): 0.2637
Shape of predictions: (200,)
All predictions identical: True

At this point, every house gets the same prediction. All the information about individual differences sits in the residuals — the gaps between these flat predictions and the actual targets.

Step 2 — Computing pseudo-residuals

Pseudo-residuals measure how far each prediction is from the truth. For MSE loss, they're simply y - predictions:

<!-- EXEC -->

python
import numpy as np

np.random.seed(42)
X = np.random.rand(200, 1) - 0.5
y = 3 * X[:, 0]**2 + 0.05 * np.random.randn(200)

F0 = np.mean(y)
predictions = np.full(y.shape, F0)

residuals = y - predictions
print(f"Residual stats:")
print(f"  Mean: {np.mean(residuals):.6f}")
print(f"  Min:  {np.min(residuals):.4f}")
print(f"  Max:  {np.max(residuals):.4f}")
print(f"  Std:  {np.std(residuals):.4f}")

Expected output:

code
Residual stats:
  Mean: -0.000000
  Min:  -0.3309
  Max:  0.5420
  Std:  0.2266

Data points near the parabola's edges (large $|x|$, high $y$) have large positive residuals — the mean underestimates them. Points near $x = 0$ (low $y$) have negative residuals — the mean overestimates them. The residuals encode the entire structure the mean missed.

Step 3 — Fitting the first weak learner

We train a shallow decision tree (depth 2, at most 4 leaf nodes) on the residuals, not on the original target:

<!-- EXEC -->

python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

np.random.seed(42)
X = np.random.rand(200, 1) - 0.5
y = 3 * X[:, 0]**2 + 0.05 * np.random.randn(200)

F0 = np.mean(y)
predictions = np.full(y.shape, F0)
residuals = y - predictions

tree_1 = DecisionTreeRegressor(max_depth=2, random_state=42)
tree_1.fit(X, residuals)

print(f"Tree 1 unique predictions: {np.unique(tree_1.predict(X)).shape[0]}")
print(f"Tree 1 leaf values: {np.sort(np.unique(tree_1.predict(X)))}")

Expected output:

code
Tree 1 unique predictions: 4
Tree 1 leaf values: [-0.14618237  0.18851751  0.2368173   0.38563456]

The tree splits the 200 data points into 4 groups and assigns each group an average residual. Those 4 values are crude corrections — but they're pointed in the right direction.

Pro Tip: We use scikit-learn's DecisionTreeRegressor as the base learner because implementing a tree-splitting algorithm from scratch would double this article's length without adding insight into the boosting loop itself. The boosting logic — the residuals, the learning rate, the sequential accumulation — is what we build by hand.

Step 4 — Updating predictions with the learning rate

The tree's predictions approximate the residuals. We add a fraction of those predictions to our current model:

<!-- EXEC -->

python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

np.random.seed(42)
X = np.random.rand(200, 1) - 0.5
y = 3 * X[:, 0]**2 + 0.05 * np.random.randn(200)

F0 = np.mean(y)
predictions = np.full(y.shape, F0)
residuals = y - predictions

tree_1 = DecisionTreeRegressor(max_depth=2, random_state=42)
tree_1.fit(X, residuals)

learning_rate = 0.1
mse_before = np.mean((y - predictions)**2)
predictions = predictions + learning_rate * tree_1.predict(X)
mse_after = np.mean((y - predictions)**2)

print(f"MSE before tree 1: {mse_before:.6f}")
print(f"MSE after tree 1:  {mse_after:.6f}")
print(f"Improvement:        {((mse_before - mse_after) / mse_before * 100):.1f}%")

Expected output:

code
MSE before tree 1: 0.051353
MSE after tree 1:  0.043774
Improvement:        14.8%

One tree with a cautious learning rate of 0.1 reduced error by about 15%. Not bad for a model that can only make 4 distinct predictions. The remaining 85% gets chipped away by the next 199 trees.

The learning rate and number of trees tradeoff

The learning rate $\nu$ controls how aggressively each tree's correction is applied. This creates a fundamental tradeoff:

[Figure: Learning rate tradeoff in gradient boosting showing high vs low learning rate consequences]

A learning rate of 1.0 means each tree's full correction is applied. The model tries to fix all remaining error in one shot, which memorizes noise and overfits. A learning rate of 0.01 means only 1% of each correction is applied — the model takes tiny steps, needs thousands of trees, but those tiny steps average out noise and produce better generalization.

Friedman called this technique shrinkage and demonstrated in his 2001 paper that even dramatic shrinkage (learning rates of 0.01–0.1) with hundreds or thousands of trees consistently outperformed aggressive learning rates with fewer trees.

| Learning Rate | Trees Needed | Training Speed | Generalization |
| --- | --- | --- | --- |
| 0.01 | 1000–5000 | Slow | Best |
| 0.1 | 200–500 | Moderate | Very good |
| 0.3 | 100–200 | Fast | Good |
| 1.0 | 50–100 | Very fast | Poor (overfits) |

Pro Tip: In practice, set the learning rate between 0.01 and 0.1, then increase n_estimators until validation error stops improving. Use early stopping (n_iter_no_change parameter) to find the optimal number automatically. This is standard practice with XGBoost 3.2, LightGBM 4.6, and CatBoost 1.2.10 as of March 2026.
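A minimal sketch of that workflow with scikit-learn's GradientBoostingRegressor on our synthetic data (the exact stopping point depends on the internal validation split):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

np.random.seed(42)
X = np.random.rand(500, 1) - 0.5
y = 3 * X[:, 0] ** 2 + 0.05 * np.random.randn(500)

# Low learning rate, generous tree budget; early stopping picks the count
gbm = GradientBoostingRegressor(
    n_estimators=2000,
    learning_rate=0.05,
    max_depth=2,
    validation_fraction=0.2,   # held-out fraction used for early stopping
    n_iter_no_change=10,       # stop after 10 rounds without improvement
    random_state=42,
)
gbm.fit(X, y)

print(f"Budgeted trees: 2000, trees actually fit: {gbm.n_estimators_}")
```

The fitted `n_estimators_` attribute reports how many trees early stopping actually kept.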

Building gradient boosting from scratch

Here is the complete implementation. The class stores the initial mean and the list of fitted trees, then reconstructs predictions by replaying the additive sequence:

<!-- EXEC -->

python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


class GradientBoostingFromScratch:
    """Gradient Boosting Regressor built from scratch using MSE loss."""

    def __init__(self, n_estimators=200, learning_rate=0.1, max_depth=2):
        self.n_estimators = n_estimators
        self.learning_rate = learning_rate
        self.max_depth = max_depth
        self.trees = []
        self.initial_prediction = None
        self.train_losses = []

    def fit(self, X, y):
        # Step 1: Initialize with the target mean
        self.initial_prediction = np.mean(y)
        predictions = np.full(y.shape, self.initial_prediction)

        for m in range(self.n_estimators):
            # Step 2a: Compute pseudo-residuals (negative gradient of MSE)
            residuals = y - predictions

            # Record training MSE before this iteration's correction
            mse = np.mean(residuals ** 2)
            self.train_losses.append(mse)

            # Step 2b: Fit a shallow tree to pseudo-residuals
            tree = DecisionTreeRegressor(
                max_depth=self.max_depth, random_state=42
            )
            tree.fit(X, residuals)
            self.trees.append(tree)

            # Step 2c: Update predictions with shrinkage
            predictions += self.learning_rate * tree.predict(X)

            if m % 50 == 0:
                print(f"  Iteration {m:3d} | MSE: {mse:.6f}")

        final_mse = np.mean((y - predictions) ** 2)
        print(f"  Final        | MSE: {final_mse:.6f}")

    def predict(self, X):
        # Replay the additive sequence: mean + sum of scaled tree predictions
        predictions = np.full(X.shape[0], self.initial_prediction)
        for tree in self.trees:
            predictions += self.learning_rate * tree.predict(X)
        return predictions


# Generate synthetic data
np.random.seed(42)
X = np.random.rand(200, 1) - 0.5
y = 3 * X[:, 0]**2 + 0.05 * np.random.randn(200)

# Train from scratch
gbm = GradientBoostingFromScratch(
    n_estimators=200, learning_rate=0.1, max_depth=2
)
gbm.fit(X, y)

Expected output:

code
  Iteration   0 | MSE: 0.051353
  Iteration  50 | MSE: 0.001558
  Iteration 100 | MSE: 0.001239
  Iteration 150 | MSE: 0.001028
  Final        | MSE: 0.000842

Every line maps directly to the algorithm defined earlier. The fit method initializes with the mean, loops through iterations computing residuals and fitting trees, and accumulates scaled corrections. The predict method replays the same additive sequence on new data. The whole thing is about 30 lines of actual logic.

Watching boosting learn iteration by iteration

One of the most instructive views is watching the model's approximation improve at specific checkpoints:

<!-- EXEC -->

python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor


class GradientBoostingFromScratch:
    def __init__(self, n_estimators=200, learning_rate=0.1, max_depth=2):
        self.n_estimators = n_estimators
        self.learning_rate = learning_rate
        self.max_depth = max_depth
        self.trees = []
        self.initial_prediction = None
        self.train_losses = []

    def fit(self, X, y):
        self.initial_prediction = np.mean(y)
        predictions = np.full(y.shape, self.initial_prediction)
        for m in range(self.n_estimators):
            residuals = y - predictions
            self.train_losses.append(np.mean(residuals ** 2))
            tree = DecisionTreeRegressor(max_depth=self.max_depth, random_state=42)
            tree.fit(X, residuals)
            self.trees.append(tree)
            predictions += self.learning_rate * tree.predict(X)

    def predict(self, X):
        predictions = np.full(X.shape[0], self.initial_prediction)
        for tree in self.trees:
            predictions += self.learning_rate * tree.predict(X)
        return predictions


np.random.seed(42)
X = np.random.rand(200, 1) - 0.5
y = 3 * X[:, 0]**2 + 0.05 * np.random.randn(200)

gbm = GradientBoostingFromScratch(n_estimators=200, learning_rate=0.1, max_depth=2)
gbm.fit(X, y)

fig, axes = plt.subplots(1, 4, figsize=(20, 4), sharey=True)
checkpoints = [1, 5, 20, 200]
X_plot = np.linspace(-0.5, 0.5, 300).reshape(-1, 1)

for ax, n_trees in zip(axes, checkpoints):
    preds = np.full(X_plot.shape[0], gbm.initial_prediction)
    for tree in gbm.trees[:n_trees]:
        preds += gbm.learning_rate * tree.predict(X_plot)

    ax.scatter(X, y, color='black', s=5, alpha=0.3)
    ax.plot(X_plot, preds, color='red', linewidth=2)
    ax.set_title(f"{n_trees} tree{'s' if n_trees > 1 else ''}")
    ax.set_xlabel("x")

axes[0].set_ylabel("y")
plt.suptitle("Gradient Boosting: Progressive Fit", fontsize=14)
plt.tight_layout()
plt.show()

Expected output: Four panels showing the model's evolution. With 1 tree, it's a rough step function (4 steps). With 5 trees, the U-shape starts emerging. By 20 trees, the parabola is clearly recognizable. At 200 trees, the red curve tightly follows the data. This progression is the boosting principle made visible — each tree adds a small correction, and the accumulation produces a smooth, accurate approximation.

The training loss curve

<!-- EXEC -->

python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor


class GradientBoostingFromScratch:
    def __init__(self, n_estimators=200, learning_rate=0.1, max_depth=2):
        self.n_estimators = n_estimators
        self.learning_rate = learning_rate
        self.max_depth = max_depth
        self.trees = []
        self.initial_prediction = None
        self.train_losses = []

    def fit(self, X, y):
        self.initial_prediction = np.mean(y)
        predictions = np.full(y.shape, self.initial_prediction)
        for m in range(self.n_estimators):
            residuals = y - predictions
            self.train_losses.append(np.mean(residuals ** 2))
            tree = DecisionTreeRegressor(max_depth=self.max_depth, random_state=42)
            tree.fit(X, residuals)
            self.trees.append(tree)
            predictions += self.learning_rate * tree.predict(X)

    def predict(self, X):
        predictions = np.full(X.shape[0], self.initial_prediction)
        for tree in self.trees:
            predictions += self.learning_rate * tree.predict(X)
        return predictions


np.random.seed(42)
X = np.random.rand(200, 1) - 0.5
y = 3 * X[:, 0]**2 + 0.05 * np.random.randn(200)

gbm = GradientBoostingFromScratch(n_estimators=200, learning_rate=0.1, max_depth=2)
gbm.fit(X, y)

plt.figure(figsize=(10, 5))
plt.plot(range(len(gbm.train_losses)), gbm.train_losses, color='blue', linewidth=1.5)
plt.title("Training Loss Over Boosting Iterations")
plt.xlabel("Iteration (number of trees)")
plt.ylabel("Mean Squared Error")
plt.grid(True, alpha=0.3)
plt.show()

print(f"MSE at iteration 0:   {gbm.train_losses[0]:.6f}")
print(f"MSE at iteration 50:  {gbm.train_losses[50]:.6f}")
print(f"MSE at iteration 199: {gbm.train_losses[199]:.6f}")

Expected output:

code
MSE at iteration 0:   0.051353
MSE at iteration 50:  0.001558
MSE at iteration 199: 0.000845

The loss drops steeply in the first 30–50 iterations as the trees capture the quadratic shape, then flattens as remaining error is mostly irreducible noise. This "elbow" shape is characteristic of gradient boosting. When the validation loss starts rising while training loss keeps falling, you've hit overfitting — that's your signal to stop.
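scikit-learn's staged_predict makes that signal easy to compute: it yields the ensemble's predictions after each boosting iteration, so you can trace training and validation curves on a held-out split (a sketch on the same synthetic data):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

np.random.seed(42)
X = np.random.rand(400, 1) - 0.5
y = 3 * X[:, 0] ** 2 + 0.05 * np.random.randn(400)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.25, random_state=42)

gbm = GradientBoostingRegressor(
    n_estimators=500, learning_rate=0.1, max_depth=2, random_state=42
).fit(X_tr, y_tr)

# staged_predict yields predictions after each iteration, 1..500
val_mse = [np.mean((y_va - p) ** 2) for p in gbm.staged_predict(X_va)]
train_mse = [np.mean((y_tr - p) ** 2) for p in gbm.staged_predict(X_tr)]

best = int(np.argmin(val_mse))
print(f"Best validation MSE at iteration {best + 1}: {val_mse[best]:.6f}")
print(f"Validation MSE at iteration 500:           {val_mse[-1]:.6f}")
```

The iteration where validation MSE bottoms out is where early stopping would cut the ensemble.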

Verifying against scikit-learn

The real test of a from-scratch implementation: does it match the production library? We'll train scikit-learn's GradientBoostingRegressor with identical hyperparameters and compare predictions head to head.

<!-- EXEC -->

python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error


class GradientBoostingFromScratch:
    def __init__(self, n_estimators=200, learning_rate=0.1, max_depth=2):
        self.n_estimators = n_estimators
        self.learning_rate = learning_rate
        self.max_depth = max_depth
        self.trees = []
        self.initial_prediction = None

    def fit(self, X, y):
        self.initial_prediction = np.mean(y)
        predictions = np.full(y.shape, self.initial_prediction)
        for m in range(self.n_estimators):
            residuals = y - predictions
            tree = DecisionTreeRegressor(max_depth=self.max_depth, random_state=42)
            tree.fit(X, residuals)
            self.trees.append(tree)
            predictions += self.learning_rate * tree.predict(X)

    def predict(self, X):
        predictions = np.full(X.shape[0], self.initial_prediction)
        for tree in self.trees:
            predictions += self.learning_rate * tree.predict(X)
        return predictions


np.random.seed(42)
X = np.random.rand(200, 1) - 0.5
y = 3 * X[:, 0]**2 + 0.05 * np.random.randn(200)

# Our implementation
gbm = GradientBoostingFromScratch(n_estimators=200, learning_rate=0.1, max_depth=2)
gbm.fit(X, y)

# Scikit-learn with matching hyperparameters
sklearn_gbm = GradientBoostingRegressor(
    n_estimators=200,
    learning_rate=0.1,
    max_depth=2,
    loss='squared_error',
    random_state=42
)
sklearn_gbm.fit(X, y)

custom_pred = gbm.predict(X)
sklearn_pred = sklearn_gbm.predict(X)

print(f"From-scratch MSE:  {mean_squared_error(y, custom_pred):.6f}")
print(f"Scikit-learn MSE:  {mean_squared_error(y, sklearn_pred):.6f}")
print(f"Max absolute diff: {np.max(np.abs(custom_pred - sklearn_pred)):.6f}")
print(f"Mean absolute diff: {np.mean(np.abs(custom_pred - sklearn_pred)):.6f}")

Expected output:

code
From-scratch MSE:  0.000842
Scikit-learn MSE:  0.000842
Max absolute diff: 0.000000
Mean absolute diff: 0.000000

Here the two models agree to the precision shown. When small differences do appear, they come from scikit-learn's default friedman_mse splitting criterion (an optimized criterion introduced in Section 3.4 of Friedman's 2001 paper), which weights candidate splits by the number of samples in each child node. The core additive algorithm is identical.

Key Insight: Our 30-line implementation and scikit-learn's production-grade code produce nearly identical predictions because the math is the same. The difference is engineering: scikit-learn handles edge cases, supports multiple loss functions, and includes early stopping. But the gradient descent in function space is exactly what we coded.

Extending to classification

Gradient boosting handles classification through a simple change: swap the loss function from MSE to log-loss (cross-entropy). The pseudo-residuals become the gap between the true label and the predicted probability.

For binary classification with log-loss:

$$L(y_i, F(x_i)) = -\left[y_i \log(p_i) + (1 - y_i)\log(1 - p_i)\right]$$

Where:

  • $y_i \in \{0, 1\}$ is the true class label
  • $p_i = \sigma(F(x_i)) = \frac{1}{1 + e^{-F(x_i)}}$ is the predicted probability via the sigmoid function
  • $F(x_i)$ is the raw model output (log-odds) before the sigmoid is applied

The pseudo-residual for log-loss turns out to be:

$$r_{im} = y_i - p_i$$

In Plain English: The pseudo-residual for classification looks exactly like the regression case — actual minus predicted — except "predicted" is now a probability. If a house is truly expensive (class 1) and the model predicts 0.3 probability of being expensive, the pseudo-residual is $1 - 0.3 = 0.7$. The next tree gets trained to push that probability higher.

Everything else stays the same: initialize, compute pseudo-residuals, fit trees, update with a learning rate. The only addition is applying the sigmoid function to convert raw scores into probabilities at prediction time.
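A condensed from-scratch sketch of that loop follows, with synthetic binary labels. Note one simplification: this version adds each regression tree's raw leaf means directly, whereas production implementations additionally apply a Newton-style adjustment to the leaf values.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy binary labels: "expensive" when the quadratic signal clears a threshold
np.random.seed(42)
X = np.random.rand(300, 1) - 0.5
y = (3 * X[:, 0] ** 2 + 0.05 * np.random.randn(300) > 0.25).astype(float)

# Initialize with the log-odds of the base rate
p0 = y.mean()
F = np.full(y.shape, np.log(p0 / (1 - p0)))
lr = 0.1

for m in range(100):
    p = sigmoid(F)                  # raw scores -> probabilities
    residuals = y - p               # pseudo-residuals for log-loss
    tree = DecisionTreeRegressor(max_depth=2, random_state=42).fit(X, residuals)
    F += lr * tree.predict(X)       # update the raw scores, not the probabilities

accuracy = np.mean((sigmoid(F) > 0.5) == y)
print(f"Training accuracy after 100 trees: {accuracy:.3f}")
```

The loop is character-for-character the regression loop except for the sigmoid and the label-minus-probability residual.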

Extending to other loss functions

Swap the loss function, and only the pseudo-residual computation changes. The rest of the algorithm is identical:

| Loss Function | Formula | Pseudo-Residual | Best For |
| --- | --- | --- | --- |
| Squared Error | $\frac{1}{2}(y - F)^2$ | $y - F$ | Standard regression |
| Absolute Error | $\lvert y - F \rvert$ | $\text{sign}(y - F)$ | Outlier-heavy data |
| Huber Loss | MSE if $\lvert r \rvert \leq \delta$, MAE otherwise | Blend of both | Best of both worlds |
| Log-Loss | $-[y\log(p) + (1-y)\log(1-p)]$ | $y - p$ | Binary classification |
| Quantile Loss | Asymmetric absolute error | $\alpha$ if $r > 0$, $-(1-\alpha)$ if $r < 0$ | Prediction intervals |

To make our from-scratch class robust to outliers, you'd replace one line:

python
# Instead of: residuals = y - predictions
# Use MAE pseudo-residuals:
residuals = np.sign(y - predictions)

Everything else stays the same. This plug-and-play flexibility is why Friedman's framework became the foundation for XGBoost, LightGBM, and CatBoost.
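The quantile-loss pseudo-residuals from the table are an equally small swap. A sketch (the helper name is ours, not a library function):

```python
import numpy as np

def quantile_pseudo_residuals(y, predictions, alpha=0.9):
    """Negative gradient of quantile loss: alpha where the model
    under-predicts, -(1 - alpha) where it over-predicts."""
    r = y - predictions
    return np.where(r > 0, alpha, -(1 - alpha))

y_true = np.array([1.0, 1.5, 3.0, 4.0])
pred = np.array([2.0, 2.0, 2.0, 2.0])

# Under-predicted points get +0.9, over-predicted points get -0.1
print(quantile_pseudo_residuals(y_true, pred, alpha=0.9))
```

Boosting on these asymmetric residuals pushes the model toward the 90th percentile instead of the mean, which is how gradient boosting produces prediction intervals.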

Key hyperparameters and how they interact

Gradient boosting has more hyperparameters than random forests, and they interact in non-obvious ways. Here's the decision framework:

| Parameter | Typical Range | Effect | How to Tune |
| --- | --- | --- | --- |
| learning_rate | 0.01–0.3 | Shrinks each tree's contribution | Lower is safer; compensate with more trees |
| n_estimators | 100–5000 | Number of boosting iterations | Use early stopping on validation set |
| max_depth | 2–8 | Complexity of each tree | 3–5 for most problems; lower for more features |
| subsample | 0.5–1.0 | Fraction of data per tree | <1.0 adds randomness, reduces overfitting |
| min_samples_leaf | 1–50 | Minimum samples in leaf nodes | Higher values smooth predictions |
| max_features | 0.5–1.0 | Fraction of features per split | <1.0 reduces correlation between trees |

Common Pitfall: Tuning learning_rate and n_estimators independently is a mistake. They're linked: halving the learning rate roughly doubles the trees needed for the same accuracy. Always tune them together, or fix the learning rate low (0.05) and let early stopping find the right number of trees.
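The linkage is easy to see empirically. This sketch counts how many trees each learning rate needs to push training MSE below an arbitrary target on our synthetic data (the exact counts will vary with the data):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

np.random.seed(42)
X = np.random.rand(300, 1) - 0.5
y = 3 * X[:, 0] ** 2 + 0.05 * np.random.randn(300)

def trees_to_reach(lr, target_mse=0.003, max_trees=2000):
    """Count boosting iterations until training MSE drops below target."""
    gbm = GradientBoostingRegressor(
        n_estimators=max_trees, learning_rate=lr, max_depth=2, random_state=42
    ).fit(X, y)
    for i, pred in enumerate(gbm.staged_predict(X), start=1):
        if np.mean((y - pred) ** 2) <= target_mse:
            return i
    return max_trees

n_fast = trees_to_reach(lr=0.10)
n_slow = trees_to_reach(lr=0.05)
print(f"learning_rate=0.10 needed {n_fast} trees")
print(f"learning_rate=0.05 needed {n_slow} trees")
```

The lower learning rate needs noticeably more trees to reach the same error, which is exactly why the two parameters must be tuned as a pair.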

For serious hyperparameter tuning, skip GridSearchCV — the search space is too large. Use Optuna or a Bayesian optimizer instead. Our guide on hyperparameter tuning covers the differences in depth.

When to use gradient boosting

Gradient boosting is the right choice when:

  1. Your data is tabular (rows and columns, not images or text). It consistently outperforms neural networks on structured data unless the dataset exceeds millions of rows.
  2. You need high accuracy and can afford training time. It's the go-to for competitions and any project where a 0.5% improvement matters.
  3. Feature interactions exist that linear models can't capture. The sequential trees naturally model interactions.
  4. Missing data is present (with HistGradientBoostingRegressor or XGBoost/LightGBM, which handle NaN natively).
  5. You need feature importance rankings. Gradient boosting produces reliable importance scores.

When NOT to use gradient boosting

  1. Real-time inference with strict latency constraints. Predicting with 500 trees is slower than a linear model or a single shallow tree. If you need sub-millisecond predictions, consider a simpler model or distillation.
  2. Very small datasets (under 100 rows). Boosting can memorize small datasets quickly. A regularized linear model often works better.
  3. Unstructured data (images, audio, long text). Convolutional networks, transformers, and other deep learning architectures are the right tool here.
  4. When interpretability is paramount. While feature importance helps, the additive combination of 200+ trees is hard to explain to non-technical stakeholders. A single decision tree or logistic regression may be required for regulatory reasons.
  5. When training time is a hard constraint. Gradient boosting is sequential by nature (each tree depends on the previous). Random forests parallelize trivially.

Pro Tip: As of scikit-learn 1.8, use HistGradientBoostingRegressor (not GradientBoostingRegressor) for any dataset with more than 10,000 samples. It uses histogram-based binning inspired by LightGBM and is orders of magnitude faster. It also handles missing values and categorical features natively — something the older class can't do.

Production considerations

Training complexity. Each of the M iterations fits a decision tree on N samples with P features. For the standard implementation, each tree takes O(N · P · d) time, where d is the tree depth, giving a total training complexity of O(M · N · P · d). Histogram-based variants (LightGBM, XGBoost's tree_method='hist') reduce the per-tree cost to O(N_bins · P · d), where N_bins = 256 by default, a massive speedup for large N.

Inference complexity. Predicting a single sample requires traversing M trees. Each traversal is O(d), so total inference is O(M · d). With 500 trees of depth 6, that's 3,000 comparisons per sample, fast enough for batch scoring but worth benchmarking for real-time APIs.
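If latency matters, measure it rather than estimate it. A rough benchmark sketch (the dataset size is illustrative, and absolute timings depend entirely on your hardware):

```python
import time
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# Train the 500-trees-of-depth-6 configuration from the text above
X, y = make_regression(n_samples=1000, n_features=10, random_state=0)
model = GradientBoostingRegressor(n_estimators=500, max_depth=6, random_state=0)
model.fit(X, y)

# Average single-sample prediction latency over repeated calls
sample = X[:1]
n_calls = 200
start = time.perf_counter()
for _ in range(n_calls):
    model.predict(sample)
latency_ms = (time.perf_counter() - start) / n_calls * 1000
print(f"Mean single-sample latency: {latency_ms:.3f} ms")
```

For real-time APIs, also benchmark batched prediction: traversing 500 trees once for 1,000 rows is far cheaper than 1,000 separate single-row calls.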

Memory. Each tree stores split thresholds, feature indices, and leaf values. A depth-6 binary tree has at most 63 internal nodes, so 500 trees store around 31,500 split rules, which fits comfortably in memory on any modern system. LightGBM and XGBoost both support model compression for deployment.

Scaling behavior. Gradient boosting handles 1M rows well with histogram-based methods. At 100M+ rows, distributed training (XGBoost's Spark integration, LightGBM's MPI mode) becomes necessary. At that scale, you'll also want proper data splitting to avoid temporal leakage.

From scratch to production libraries

The class we built is the mathematical core. Production libraries add engineering that makes it practical at scale:

| Feature | Our Scratch Code | Scikit-learn 1.8 | XGBoost 3.2 | LightGBM 4.6 |
|---|---|---|---|---|
| Core algorithm | MSE only | MSE, MAE, Huber, Quantile | Custom objectives | Custom objectives |
| Splitting | Standard | Standard + Hist | Hist + exact | Hist (GOSS + EFB) |
| Regularization | None | None | L1 + L2 on leaf weights | L1 + L2 |
| Missing values | Crash | Hist variant only | Native | Native |
| Categorical features | Manual encoding | Hist variant only | Native | Native |
| GPU support | No | No | Yes | Yes |
| Distributed training | No | No | Yes (Spark, Dask) | Yes (MPI, Dask) |
| Early stopping | No | Yes | Yes | Yes |

The jump from our scratch code to XGBoost or LightGBM is an engineering jump, not a conceptual one. If you understand the loop — initialize, compute residuals, fit tree, update with shrinkage — you understand what every production library is doing under the hood.
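That loop fits in a few lines. Here is a minimal sketch for squared-error loss (the function names `fit_gbm` and `predict_gbm` are illustrative, not a library API):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gbm(X, y, n_trees=100, learning_rate=0.1, max_depth=3):
    """Initialize, compute residuals, fit tree, update with shrinkage."""
    baseline = y.mean()                          # constant initial model
    pred = np.full(len(y), baseline)
    trees = []
    for _ in range(n_trees):
        residuals = y - pred                     # negative gradient of MSE
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)                   # weak learner on residuals
        pred += learning_rate * tree.predict(X)  # shrunken correction
        trees.append(tree)
    return baseline, trees

def predict_gbm(baseline, trees, X, learning_rate=0.1):
    pred = np.full(X.shape[0], baseline)
    for tree in trees:
        pred += learning_rate * tree.predict(X)
    return pred
```

Everything a production library adds (histograms, regularized leaf weights, GPU kernels) is an optimization of one of these four lines.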

Conclusion

Gradient boosting does one thing over and over: it measures how wrong the current model is, trains a small tree to approximate those errors, and adds a fraction of that correction. The "gradient" means we're doing gradient descent in function space — each tree steps in the direction that most reduces the loss. The "boosting" means we're combining many weak learners into a strong one.

The from-scratch implementation we built is 30 lines of actual logic, and it produces predictions within 0.1% of scikit-learn's production code. That's because the math is simple — the complexity in production libraries comes from engineering: histogram binning, GPU kernels, distributed communication, native missing-value handling. Understanding the core loop lets you reason about why specific hyperparameters matter, why overfitting happens, and how to fix it.

Gradient boosting connects naturally to several other topics worth exploring. Ensemble Methods covers the broader family of model combination strategies. XGBoost for Regression shows how adding L1/L2 regularization directly into the objective function prevents overfitting more effectively than shrinkage alone. And Random Forest provides the contrasting perspective — reducing variance through parallel averaging rather than reducing bias through sequential correction.

The next time someone asks you how gradient boosting works, don't say "it builds trees sequentially." Say: "It performs gradient descent in function space. Each tree fits the negative gradient of the loss function, and the learning rate controls the step size. The final model is the sum of a constant baseline and scaled corrections from every tree."

Frequently Asked Interview Questions

Q: What's the difference between gradient boosting and random forest?

Random forest trains many deep, independent trees in parallel on bootstrapped data subsets and averages their predictions — this reduces variance. Gradient boosting trains a sequence of shallow trees, where each tree corrects the ensemble's current errors — this reduces bias. Random forest is embarrassingly parallel and harder to overfit. Gradient boosting typically achieves higher accuracy but requires careful tuning of the learning rate and number of trees.

Q: Why do we use shallow trees (depth 2–4) as base learners instead of deep trees?

Deep trees have high variance — they memorize noise in the training data. In gradient boosting, each tree only needs to capture the rough direction of the remaining error, not solve the entire problem. Shallow trees are low-variance weak learners. The boosting loop accumulates many of these small corrections, and the summation produces a strong model. Using deep trees would cause each iteration to overcorrect, leading to rapid overfitting.

Q: What are pseudo-residuals, and why aren't they just called residuals?

For MSE loss, pseudo-residuals equal actual residuals (y - F(x)). But gradient boosting works with any differentiable loss function. For absolute error, the pseudo-residual is sign(y - F(x)). For log-loss in classification, it's y - p, where p is the predicted probability. The term "pseudo-residual" covers all these cases because it is formally defined as the negative gradient of the loss with respect to the current prediction, which only simplifies to the actual residual for squared error.
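The cases differ only in the gradient formula. A toy numeric check (the values are chosen for illustration):

```python
import numpy as np

y = np.array([3.0, -1.0, 2.0])   # targets
F = np.array([2.5, 0.0, 2.0])    # current model predictions

# MSE loss 0.5*(y - F)^2: negative gradient is the plain residual
mse_pseudo = y - F               # -> [ 0.5, -1.0, 0.0 ]

# MAE loss |y - F|: negative gradient is the sign of the residual
mae_pseudo = np.sign(y - F)      # -> [ 1.0, -1.0, 0.0 ]

print(mse_pseudo, mae_pseudo)
```

Note how MAE only tells the next tree the direction of each error, not its magnitude, which is exactly why MAE-boosted models are robust to outliers.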

Q: Your gradient boosting model has low training error but high validation error. What's happening and how do you fix it?

This is classic overfitting — the model has too much capacity relative to the signal in the data. Three fixes, in order of effectiveness: (1) Lower the learning rate and use early stopping on a validation set, which lets the model take smaller steps and stop before memorizing noise. (2) Reduce max_depth to constrain individual tree complexity. (3) Add stochastic elements via subsample < 1.0 (use a random fraction of data per tree) or max_features < 1.0 (use a random fraction of features per split). For a deeper exploration of this tradeoff, see The Bias-Variance Tradeoff.
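The three fixes map directly onto scikit-learn parameters. A configuration sketch (the specific values are reasonable starting points, not tuned results):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=2000, n_features=20, noise=10.0, random_state=42)

model = GradientBoostingRegressor(
    learning_rate=0.05,       # (1) smaller steps...
    n_estimators=1000,        # ...with an upper bound early stopping can trim
    validation_fraction=0.2,  # internal holdout set for early stopping
    n_iter_no_change=20,      # stop after 20 rounds without improvement
    max_depth=3,              # (2) constrain individual tree complexity
    subsample=0.8,            # (3) random row fraction per tree
    max_features=0.8,         #     random feature fraction per split
    random_state=42,
)
model.fit(X, y)
print(f"Trees actually fit: {model.n_estimators_}")
```

After fitting, `n_estimators_` reports how many trees early stopping actually kept, which is often far fewer than the budget.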

Q: Explain what "gradient descent in function space" means in one sentence.

Instead of adjusting a fixed set of parameters (like weights in a neural network), each iteration adds an entirely new function (a decision tree) to the model, chosen to point in the direction that most reduces the loss — just as gradient descent on parameters moves in the direction of steepest descent, but in the space of functions rather than the space of numbers.

Q: When would you choose LightGBM over XGBoost in production?

LightGBM is typically faster on large datasets (100K+ rows) because of its histogram-based binning and leaf-wise tree growth strategy, which grows deeper on the most error-prone leaves rather than growing level-by-level. It also handles high-cardinality categorical features natively without manual encoding. XGBoost is more mature in distributed settings (better Spark integration) and has wider community support for custom objectives. In my experience, LightGBM is the default choice for speed-sensitive production pipelines, while XGBoost is the safer bet when team familiarity matters.

Q: Why does lowering the learning rate improve generalization even though it requires more trees?

A high learning rate means each tree's full correction is applied, so a single noisy tree can push predictions far off course. A low learning rate scales each tree's contribution down to a small fraction (say 10%), so any individual noisy tree barely affects the ensemble. The aggregate of many small, slightly different corrections averages out noise — similar to how averaging many noisy measurements gives a more accurate estimate than any single measurement. Friedman showed empirically that shrinkage factors of 0.01–0.1 with hundreds of trees consistently outperformed learning rates near 1.0 with fewer trees.

Q: How does gradient boosting handle multi-class classification?

For K classes, gradient boosting fits K separate regression models in parallel at each boosting iteration, one per class. Each model predicts the raw log-odds for its class, and a softmax function converts the K raw scores into probabilities that sum to 1. The pseudo-residuals for class k are y_k - p_k, where y_k is the one-hot indicator and p_k is the current predicted probability. This means a gradient boosting model with 200 iterations and 3 classes actually contains 600 trees.
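The 600-tree claim is easy to verify on any 3-class dataset; this sketch uses scikit-learn's built-in iris data as a stand-in:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier

X, y = load_iris(return_X_y=True)  # 3 classes
clf = GradientBoostingClassifier(n_estimators=200, random_state=0)
clf.fit(X, y)

# estimators_ is an array of shape (n_iterations, n_classes):
# one regression tree per class per boosting round
print(clf.estimators_.shape)  # (200, 3)
print(clf.estimators_.size)   # 600
```

Each entry is a DecisionTreeRegressor, not a classifier: even for classification, the base learners are regression trees fit to pseudo-residuals.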

<!-- PLAYGROUND_START data-dataset="lds_classification_multi" -->

Hands-On Practice

Gradient Boosting is a powerful ensemble technique that builds models sequentially, where each new model corrects the errors of its predecessors. While libraries like XGBoost are industry standards, building a Gradient Boosting classifier from scratch (or using its foundational components) is the best way to understand the 'gradient' descent logic behind the magic. You'll implement a Gradient Boosting Classifier using scikit-learn on a multi-class species dataset to see how weak learners combine into a highly accurate predictor.

Dataset: Species Classification (Multi-class) Iris-style species classification with 3 well-separated classes. Perfect for multi-class algorithms. Expected accuracy ≈ 95%+.

python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.preprocessing import LabelEncoder

# ============================================
# STEP 1: LOAD AND PREPARE THE DATA
# ============================================
# We'll use the Species Classification dataset, which has 3 distinct classes.
# This clean dataset is perfect for visualizing how boosting boundaries form.

df = pd.read_csv("/datasets/playground/lds_classification_multi.csv")

print("Dataset Structure:")
print(f"Shape: {df.shape}")
# Expected output: Shape: (150, 5)

print("\nFirst 5 rows:")
print(df.head())
# Expected output:
#    sepal_length  sepal_width  petal_length  petal_width species
# 0           5.1          3.5           1.4          0.2  setosa
#...

# Separate features and target
X = df.drop('species', axis=1)
y = df['species']

# Encode target labels (Gradient Boosting internally handles classes, but encoding is safer for analysis)
le = LabelEncoder()
y_encoded = le.fit_transform(y)

# Split into training and testing sets
# We use a standard 80/20 split. random_state ensures reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y_encoded, test_size=0.2, random_state=42
)

print(f"\nTraining samples: {X_train.shape[0]}")
print(f"Testing samples: {X_test.shape[0]}")
# Expected output:
# Training samples: 120
# Testing samples: 30

# ============================================
# STEP 2: TRAIN GRADIENT BOOSTING CLASSIFIER
# ============================================
# We use sklearn's GradientBoostingClassifier.
# Under the hood, this fits a sequence of Decision Trees.
# Key parameters:
# - n_estimators: The number of boosting stages (trees) to perform.
# - learning_rate: Shrinks the contribution of each tree.
#   (Lower rate requires more trees but generalizes better).
# - max_depth: Limits the complexity of individual trees (weak learners).

gbm_model = GradientBoostingClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=3,
    random_state=42
)

print("\nTraining Gradient Boosting Model...")
gbm_model.fit(X_train, y_train)
print("Model training complete.")

# ============================================
# STEP 3: EVALUATE PERFORMANCE
# ============================================
# Evaluate how well our ensemble of weak learners performs.

y_pred = gbm_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print(f"\nModel Accuracy: {accuracy:.4f}")
# Expected output: Model Accuracy: ~0.9667 (High accuracy expected for this dataset)

print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=le.classes_))
# Expected output:
#               precision    recall  f1-score   support
#       setosa       1.00      1.00      1.00        10
#   versicolor       1.00      0.90      0.95        10
#    virginica       0.91      1.00      0.95        10
#     accuracy                           0.97        30

# NOTE: High accuracy is expected with this educational dataset!
# The patterns are intentionally clear to help you learn the algorithm.
# Real-world datasets typically have more noise and lower accuracy.

# ============================================
# STEP 4: VISUALIZING STAGED PREDICTIONS (HOW IT LEARNS)
# ============================================
# One of the coolest features of Gradient Boosting is watching the error drop
# as we add more trees. We can access the loss at each stage.

train_loss = gbm_model.train_score_

plt.figure(figsize=(10, 5))
plt.plot(np.arange(len(train_loss)) + 1, train_loss, 'b-', label='Training Loss')
plt.title('Gradient Boosting: Loss Reduction over Iterations')
plt.xlabel('Boosting Iterations (Number of Trees)')
plt.ylabel('Loss (Deviance)')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# ============================================
# STEP 5: FEATURE IMPORTANCE
# ============================================
# Gradient Boosting calculates importance based on how often a feature
# is used to split data across all trees.

feature_importance = gbm_model.feature_importances_
features = X.columns

# Create a DataFrame for easier plotting
fi_df = pd.DataFrame({'Feature': features, 'Importance': feature_importance})
fi_df = fi_df.sort_values(by='Importance', ascending=False)

print("\nFeature Importance:")
print(fi_df)
# Expected output:
#          Feature  Importance
# 3    petal_width    0.4xxxxx
# 2   petal_length    0.3xxxxx
#...

plt.figure(figsize=(10, 5))
sns.barplot(x='Importance', y='Feature', data=fi_df, palette='viridis')
plt.title('Feature Importance in Gradient Boosting')
plt.tight_layout()
plt.show()

Try modifying learning_rate and n_estimators inversely (e.g., lower the learning rate to 0.01 and increase the estimators to 500) to see whether the loss curve becomes smoother. You can also experiment with max_depth: setting it to 1 creates 'decision stumps', the simplest possible weak learners, which forces the boosting algorithm to work harder to correct errors. Finally, try adding noise to the features to observe how robust Gradient Boosting is compared to a single Decision Tree.

<!-- PLAYGROUND_END -->
