Regression Trees and Random Forest: From Single Splits to Ensemble Power

LDS Team · Let's Data Science · 12 min read

Linear models often feel like trying to fit a square peg into a round hole. While algorithms like Linear Regression provide a solid foundation for simple relationships, real-world data is rarely a straight line. Data is messy, non-linear, and filled with complex interactions that defy simple equations.

If you have ever played "20 Questions," you intuitively understand Decision Trees. Instead of drawing a line through data, trees slice the data into smaller, more homogeneous groups based on specific rules. When we combine hundreds of these trees, we get a Random Forest—one of the most versatile and robust algorithms in the machine learning arsenal.

This guide explores the mechanics of Regression Trees, why single trees often fail, and how Random Forests use the power of ensembles to deliver state-of-the-art performance.

What is a regression tree?

A regression tree is a non-parametric model that predicts a continuous value by splitting the dataset into distinct subsets. The algorithm recursively partitions the data based on feature values until the subsets are sufficiently homogeneous, predicting the average target value of the samples in each final leaf node.

The Intuition: Divide and Conquer

Unlike Polynomial Regression, which tries to fit a curved line to the data points, a regression tree approximates the function using a staircase pattern. It breaks the feature space into rectangles (or hyper-rectangles in higher dimensions) and assigns a single prediction value to every observation that falls inside that rectangle.

Think of estimating the price of a house. A linear model creates a formula. A regression tree asks questions:

  1. Is the house larger than 2,000 sq ft?
    • Yes: Is it in the downtown area?
    • No: Does it have a garage?

At the end of these questions, you arrive at a "leaf" containing a subset of houses. The prediction for a new house following that same path is simply the average price of the training houses in that leaf.

🔑 Key Insight: Regression trees are "piecewise constant." They do not produce a smooth curve; they produce a step function. This allows regression trees to model arbitrary non-linear relationships without complex feature engineering.
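To see the piecewise-constant behavior concretely, here is a minimal sketch (the toy data and variable names are illustrative) that fits a depth-1 tree, i.e., a single split, and confirms that it only ever predicts two values: the mean target of each leaf.

python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Tiny toy dataset: low targets on the left, high targets on the right
X_toy = np.arange(10).reshape(-1, 1)                       # feature values 0..9
y_toy = np.array([1, 1, 2, 2, 1, 8, 9, 8, 9, 8], dtype=float)

# A depth-1 tree (a "stump") makes exactly one split
stump = DecisionTreeRegressor(max_depth=1).fit(X_toy, y_toy)

# Only two distinct predictions exist: the mean of each leaf (1.4 and 8.4 here)
print(np.unique(stump.predict(X_toy)))

Every input that lands in the same leaf receives the same prediction, which is exactly what produces the staircase shape you will see in the plots below.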

How does the algorithm decide where to split?

The algorithm selects the optimal split point by iterating through every unique value of every feature and calculating which separation results in the greatest reduction of impurity—specifically, the Mean Squared Error (MSE) or Variance.

Recursive Binary Splitting

The process is "greedy," meaning the algorithm makes the best decision at the current step without worrying about future steps. Here is the mathematical logic:

  1. The Goal: We want to split a node containing $N$ samples into two child nodes (Left and Right) such that the errors in the children are lower than the error in the parent.
  2. The Metric: In regression, we use Variance or MSE (Mean Squared Error) as the measure of impurity.

For a node $t$ with $N_t$ samples, the MSE is $MSE_t = \frac{1}{N_t} \sum_{i \in t} (y_i - \hat{y}_t)^2$, where $\hat{y}_t$ is the mean target value of the samples in node $t$.

When evaluating a split, the algorithm calculates the weighted average MSE of the two potential child nodes: $Cost = \frac{N_{left}}{N_{total}} MSE_{left} + \frac{N_{right}}{N_{total}} MSE_{right}$

The algorithm calculates this cost for every possible split on every feature and chooses the one with the lowest cost.
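The sketch below spells out that greedy search for a single feature on a toy dataset: it scores every midpoint between consecutive feature values using the weighted-MSE cost above and reports the cheapest threshold (the data and names like weighted_split_cost are illustrative).

python
import numpy as np

def weighted_split_cost(x, y, threshold):
    """Weighted average MSE of the two children produced by the split x <= threshold."""
    left, right = y[x <= threshold], y[x > threshold]
    cost = 0.0
    for child in (left, right):
        if len(child) > 0:
            cost += (len(child) / len(y)) * np.mean((child - child.mean()) ** 2)
    return cost

x_split = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y_split = np.array([1.1, 0.9, 1.0, 5.2, 4.8, 5.0])

# Candidate thresholds: midpoints between consecutive sorted feature values
candidates = (x_split[:-1] + x_split[1:]) / 2
costs = {t: weighted_split_cost(x_split, y_split, t) for t in candidates}
best = min(costs, key=costs.get)
print(f"Best split: x <= {best} with weighted MSE = {costs[best]:.4f}")

On this data the search settles on x <= 3.5, which separates the low targets from the high ones. A real tree repeats this search for every feature, then recurses into each child node.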

Python Implementation of a Single Tree

Let's visualize how a Decision Tree fits non-linear data compared to a linear model.

python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression

# 1. Generate synthetic non-linear data (Sine wave)
np.random.seed(42)
X = np.sort(5 * np.random.rand(80, 1), axis=0)
y = np.sin(X).ravel()
# Add some noise
y[::5] += 3 * (0.5 - np.random.rand(16))

# 2. Fit Regression Models
linear_model = LinearRegression()
tree_model_1 = DecisionTreeRegressor(max_depth=2) # Simple tree
tree_model_2 = DecisionTreeRegressor(max_depth=5) # Complex tree

linear_model.fit(X, y)
tree_model_1.fit(X, y)
tree_model_2.fit(X, y)

# 3. Predict for plotting
X_test = np.arange(0.0, 5.0, 0.01)[:, np.newaxis]
y_lin = linear_model.predict(X_test)
y_1 = tree_model_1.predict(X_test)
y_2 = tree_model_2.predict(X_test)

# 4. Visualization
plt.figure(figsize=(10, 6))
plt.scatter(X, y, s=20, edgecolor="black", c="darkorange", label="Data")
plt.plot(X_test, y_lin, color="red", label="Linear Regression", linewidth=2)
plt.plot(X_test, y_1, color="cornflowerblue", label="Tree (depth=2)", linewidth=2)
plt.plot(X_test, y_2, color="yellowgreen", label="Tree (depth=5)", linewidth=2)
plt.xlabel("data")
plt.ylabel("target")
plt.title("Linear Regression vs. Decision Trees")
plt.legend()
plt.show()

What to observe in the output:

  1. Linear Regression: Fails completely to capture the sine wave pattern. It just draws a straight line through the middle.
  2. Tree (depth=2): Creates a coarse "staircase" with only a few steps. It captures the general trend but lacks detail (Underfitting).
  3. Tree (depth=5): Fits the curve much better but starts to react to the random noise points (Overfitting).

Why do single decision trees often fail?

Single decision trees often fail because they suffer from high variance, meaning they are incredibly sensitive to small changes in the training data. A minor change in one data point can result in a completely different tree structure, leading to poor generalization on new data.

The Overfitting Trap

Decision trees are enthusiastic learners. If we don't stop them (using hyperparameters like max_depth), a tree will keep splitting until every single leaf contains only one data point. The training error becomes zero, but the model has memorized the noise rather than the signal.
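You can watch this happen with a quick sketch that reuses the sine-wave X and y generated above: grow a tree with no depth limit on a training split and compare its training and test errors.

python
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

# Hold out 30% of the sine-wave data for testing
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=42)

# No max_depth: the tree keeps splitting until every leaf is pure
full_tree = DecisionTreeRegressor(random_state=42).fit(X_train, y_train)

print("Train MSE:", mean_squared_error(y_train, full_tree.predict(X_train)))  # essentially 0 (memorized)
print("Test MSE: ", mean_squared_error(y_val, full_tree.predict(X_val)))      # noticeably higher

The near-zero training error is the memorization; the gap between it and the test error is the variance we want to remove.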

This is the classic Bias-Variance Tradeoff:

  • Linear Regression: High Bias (too simple), Low Variance.
  • Deep Decision Tree: Low Bias (fits perfectly), High Variance (unstable).

To solve this, we need a way to keep the low bias of the tree but reduce the high variance. Enter the Random Forest.

What is a Random Forest?

A Random Forest is an ensemble learning method that constructs a multitude of decision trees at training time and outputs the average prediction of the individual trees. By combining the predictions of many uncorrelated models, the forest reduces variance and produces a more stable, accurate result than any single tree.

The Power of "Wisdom of the Crowd"

Imagine you are guessing the weight of a cow at a fair. If you ask one person (one tree), they might be way off. If you ask 100 people and take the average, the errors tend to cancel each other out, and the average is remarkably close to the truth.

However, this only works if the people (trees) have different perspectives. If everyone talks to each other and agrees on the same logic, the crowd makes the same mistake. Random Forest ensures diversity through two key mechanisms:

1. Bagging (Bootstrap Aggregating)

Instead of training all trees on the exact same dataset, Random Forest trains each tree on a random subset of the data; a manual sketch follows the two steps below.

  • Bootstrap: We sample $N$ examples from the training data with replacement. Some rows appear multiple times; others not at all.
  • Aggregating: We average the predictions of all trees.
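To make both halves of the name concrete, here is a minimal manual sketch (reusing X, y, and X_test from the earlier example) that draws bootstrap samples by hand, fits one depth-limited tree per sample, and averages their predictions:

python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(42)
n_trees, n_samples = 25, len(X)
all_preds = []

for _ in range(n_trees):
    # Bootstrap: draw row indices with replacement (some rows repeat, some are left out)
    idx = rng.randint(0, n_samples, size=n_samples)
    tree = DecisionTreeRegressor(max_depth=5).fit(X[idx], y[idx])
    all_preds.append(tree.predict(X_test))

# Aggregating: the ensemble prediction is the average across all trees
y_bagged = np.mean(all_preds, axis=0)
print(y_bagged[:5])

This is bagging alone; a full Random Forest adds the per-split feature sampling described next.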

2. Feature Randomness

This is the "Random" in Random Forest. When a normal decision tree splits a node, it looks at every feature to find the best split. In a Random Forest, at each split, the algorithm creates a random subset of features (e.g., only looking at 3 out of 10 columns) and searches for the best split within that subset.

QUOTABLE: "Feature randomness forces trees to be different. It prevents a single dominant feature from dictating the structure of every tree in the forest, thereby decorrelating the models and improving the ensemble's robustness."
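Our sine-wave example has only one feature, so feature subsetting has nothing to act on there. The sketch below uses a synthetic multi-feature dataset instead (built with make_regression) to show the effect: a smaller max_features lowers the average correlation between the predictions of the individual trees.

python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data with 10 features so feature subsetting has room to matter
X_demo, y_demo = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)

for mf in [1.0, "sqrt"]:
    rf_demo = RandomForestRegressor(n_estimators=50, max_features=mf, random_state=0)
    rf_demo.fit(X_demo, y_demo)

    # Predictions of each individual tree (one row per tree)
    per_tree = np.array([t.predict(X_demo) for t in rf_demo.estimators_])
    corr = np.corrcoef(per_tree)
    n = len(corr)
    mean_corr = (corr.sum() - n) / (n * n - n)  # average off-diagonal correlation
    print(f"max_features={mf}: mean pairwise tree correlation = {mean_corr:.3f}")

Lower correlation between trees is exactly what makes the averaged prediction more stable.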

How do we implement Random Forest in Python?

We use Scikit-Learn's RandomForestRegressor. Below, we apply it to the same sine wave data to see how it smooths out the "steps" of the single tree.

python
from sklearn.ensemble import RandomForestRegressor

# 1. Configure the Random Forest
# n_estimators=100: Build 100 trees
# max_depth=5: Constrain individual tree complexity
regr_rf = RandomForestRegressor(n_estimators=100, max_depth=5, random_state=42)

# 2. Fit the model
regr_rf.fit(X, y)

# 3. Predict
y_rf = regr_rf.predict(X_test)

# 4. Visualization Comparison
plt.figure(figsize=(10, 6))
plt.scatter(X, y, s=20, edgecolor="black", c="darkorange", label="Data")
plt.plot(X_test, y_2, color="yellowgreen", alpha=0.5, label="Single Tree (depth=5)", linewidth=2)
plt.plot(X_test, y_rf, color="blue", label="Random Forest (100 trees)", linewidth=2)
plt.xlabel("data")
plt.ylabel("target")
plt.title("Single Tree vs. Random Forest")
plt.legend()
plt.show()

Expected Outcome: You will notice the Random Forest prediction line (Blue) is much smoother than the Single Tree (Green). It doesn't have the jagged, sharp steps. By averaging 100 slightly different "staircases," the Random Forest approximates the curve much more naturally.
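To see where the smoothing comes from, this sketch (reusing regr_rf, X, y, y_rf, and X_test from above) overlays a handful of the forest's individual trees against the averaged prediction:

python
# Plot a few individual trees from the fitted forest next to the ensemble average
plt.figure(figsize=(10, 6))
plt.scatter(X, y, s=20, edgecolor="black", c="darkorange", label="Data")
for i, tree in enumerate(regr_rf.estimators_[:5]):
    plt.plot(X_test, tree.predict(X_test), color="gray", alpha=0.4,
             label="Individual tree" if i == 0 else None)
plt.plot(X_test, y_rf, color="blue", linewidth=2, label="Forest average")
plt.xlabel("data")
plt.ylabel("target")
plt.title("Individual Trees vs. Forest Average")
plt.legend()
plt.show()

Each gray line is still a jagged staircase; the smooth blue line is simply their average.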

What are the critical hyperparameters to tune?

The performance of a Random Forest depends heavily on n_estimators, max_depth, min_samples_leaf, and max_features. Properly tuning these parameters balances the model's ability to learn patterns against the risk of overfitting and computational cost; a tuning sketch follows the list below.

1. n_estimators (Number of Trees)

  • What it does: The number of trees in the forest.
  • Impact: Generally, more is better. More trees increase stability and reduce variance, but returns diminish after a certain point and computational cost increases linearly.
  • Recommendation: Start with 100 or 200.

2. max_features (Feature Subset Size)

  • What it does: The number of features to consider when looking for the best split.
  • Impact: Smaller numbers make trees more different (less correlated) but might cause trees to miss important signals.
  • Recommendation: For regression, scikit-learn's default is 1.0 (use all features); sqrt (square root of the total feature count) or log2 are common tuning values that introduce more randomness.

3. max_depth

  • What it does: The maximum number of levels in each tree.
  • Impact: Deeper trees capture more complex patterns but overfit easily.
  • Recommendation: While Random Forests are harder to overfit than single trees, limiting depth (e.g., 10-20) can speed up training and reduce model size.

4. min_samples_leaf

  • What it does: The minimum number of samples required to be at a leaf node.
  • Impact: Increasing this number smooths the model, as the tree cannot isolate small groups of noise points.
  • Recommendation: A value of 3-5 is often better than the default of 1 for noisy data.
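As promised above, here is a sketch of tuning these knobs together with a cross-validated grid search; the grid values are illustrative rather than prescriptive, and X and y are the sine-wave data from earlier.

python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Illustrative grid over the hyperparameters discussed above
param_grid = {
    "n_estimators": [100, 200],
    "max_depth": [10, 20, None],
    "min_samples_leaf": [1, 3, 5],
    "max_features": [1.0, "sqrt"],
}

search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    scoring="neg_mean_squared_error",
    cv=5,
    n_jobs=-1,
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best CV MSE:", -search.best_score_)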

What are the limitations of tree-based regression?

Tree-based regression models cannot extrapolate predictions beyond the range of the training data, and they tend to underperform linear models on very high-dimensional, sparse data.

1. The Extrapolation Problem

This is the most significant weakness. If your training data has house prices between $100k and $500k, and you ask the model to predict a mansion that should cost $2M, a Random Forest will likely predict close to $500k.

  • Why? The model predicts the average of the leaf node. It cannot continue a trend line (slope) outside the bounds it has seen.
  • Contrast: A Linear Regression model would simply continue the fitted line upward, extrapolating the trend (see the sketch below).
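The sketch below makes the failure visible on a clean upward trend (synthetic, illustrative data): both models are trained on x between 0 and 10, then asked to predict at x = 20, far outside that range.

python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# A clean linear trend: y = 10 * x for x in [0, 10)
X_trend = np.arange(0, 10, 0.25).reshape(-1, 1)
y_trend = 10 * X_trend.ravel()

forest = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_trend, y_trend)
line = LinearRegression().fit(X_trend, y_trend)

X_out = np.array([[20.0]])  # far outside the training range
print("Random Forest:    ", forest.predict(X_out)[0])  # stuck near the training maximum (~97.5)
print("Linear Regression:", line.predict(X_out)[0])    # continues the trend (~200)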

2. Sparse Data

Random Forests do not perform as well as linear models on very high-dimensional, sparse data (like text data with TF-IDF), where the relationship is often linear.

Feature Importance: A "Free" Insight

One massive advantage of Random Forests is the insight they give into which features matter. Since the algorithm tracks how much the impurity decreases for each feature's splits, we can calculate Feature Importance.

python
# Accessing feature importance
import pandas as pd

# Assuming X was a DataFrame with column names
# feats = pd.DataFrame(index=X.columns) 
# feats['Importance'] = regr_rf.feature_importances_

# For our numpy example:
print("Feature Importances:", regr_rf.feature_importances_)

This score tells you exactly which variables are driving your predictions, acting as a powerful tool for feature selection and business insight.

Conclusion

Regression Trees provide a flexible way to model non-linear relationships by partitioning data into manageable chunks. While a single tree is prone to high variance and overfitting, the Random Forest solves this by aggregating the wisdom of hundreds of diverse trees.

Random Forest is often the "first choice" algorithm for tabular data because it requires minimal data preparation (no scaling needed) and generally performs well out of the box.

However, remember the golden rule: Trees cannot extrapolate. If your problem involves predicting trends into the future (time series) or values outside your training range, you may need to combine trees with linear trends or look into other algorithms.

To better understand the foundations of model fitting, review our guide on Linear Regression. If you are dealing with complex curves but want to stick to equation-based modeling, check out Polynomial Regression.


Hands-On Practice

Understanding the theoretical differences between single Regression Trees and Random Forest ensembles is crucial, but seeing them perform side-by-side on real data solidifies that knowledge. In this hands-on tutorial, you will build both a single Decision Tree Regressor and a Random Forest Regressor to predict house prices, directly observing how ensemble methods reduce variance and improve stability compared to single trees. We will use a housing dataset containing features like square footage and lot size to demonstrate how these non-linear models capture complex pricing patterns that simple linear equations might miss.

Dataset: House Prices (Linear). 500 house records with clear linear relationships; square footage strongly predicts price (R² ≈ 0.87). Perfect for demonstrating linear regression fundamentals.

Try It Yourself


Try increasing max_depth in the single tree model to 10 or 15 and observe how the training accuracy might increase while test accuracy often drops—a classic sign of overfitting. Conversely, try changing n_estimators in the Random Forest to 10 and then 200 to see how adding more trees stabilizes the prediction but yields diminishing returns after a certain point. Experimenting with min_samples_split is also highly recommended to understand how restricting leaf size changes the model's granularity.