Regression Trees and Random Forest: From Single Splits to Ensemble Power

LDS Team
Let's Data Science

You have a dataset of house prices. Square footage, bedroom count, age of the property. A linear regression model draws one straight line through the data and calls it a day. But what if a fourth bedroom adds $80,000 in some neighborhoods and only $15,000 in others? What if homes over 3,000 square feet follow a completely different pricing curve than smaller ones? A regression tree handles these situations naturally, without you engineering a single interaction term. A random forest takes that idea and turns it into one of the most reliable predictors in applied machine learning.

We'll use a single running example throughout: predicting house prices from square footage, bedroom count, and property age. Every formula, every code block, and every diagram ties back to this scenario.

[Figure: Recursive binary splitting process for regression tree construction]

Regression trees split data into homogeneous regions

A regression tree is a non-parametric model that predicts a continuous value by recursively partitioning the feature space into rectangular regions. Each final region (called a leaf) stores a single prediction: the average target value of the training samples that landed there.

Think of it as playing a game of twenty questions with your data. Instead of fitting a global equation, the tree asks a series of yes/no questions about feature values. "Is square footage above 2,000?" If yes, "Are there more than 3 bedrooms?" Each answer narrows the search until you reach a group of similar houses, and the prediction is simply their average price.

Unlike polynomial regression, which fits a single curved function to the entire dataset, a regression tree creates a piecewise constant approximation. It doesn't produce a smooth curve; it produces a step function. That distinction matters: trees capture arbitrary non-linear relationships and feature interactions without any manual feature engineering.

Key Insight: A regression tree's prediction for any new observation is just the average target value of the training samples in the same leaf. The tree's entire job is to group similar training examples together as efficiently as possible.

Recursive binary splitting finds the best questions to ask

Recursive binary splitting is the greedy algorithm that builds a regression tree from top to bottom. At each node, it tests every possible split on every feature and picks the one that reduces prediction error the most.

The process works like this: given a node containing $N$ samples, the algorithm evaluates splitting on feature $j$ at threshold $s$. It sends all samples with $x_j \leq s$ to the left child and the rest to the right. The quality of each candidate split is measured by the weighted MSE of the two resulting groups.

$$\text{Cost}(j, s) = \frac{N_L}{N} \cdot \text{MSE}_L + \frac{N_R}{N} \cdot \text{MSE}_R$$

Where:

  • $N_L$ and $N_R$ are the number of samples in the left and right child nodes
  • $N$ is the total number of samples in the parent node
  • $\text{MSE}_L = \frac{1}{N_L} \sum_{i \in L}(y_i - \bar{y}_L)^2$ is the mean squared error in the left child
  • $\text{MSE}_R = \frac{1}{N_R} \sum_{i \in R}(y_i - \bar{y}_R)^2$ is the mean squared error in the right child
  • $\bar{y}_L$ and $\bar{y}_R$ are the mean target values in the left and right children

In Plain English: The algorithm tries every possible way to divide the houses into two groups. For each candidate split (say, "sqft <= 2,100"), it checks how tightly grouped the prices are within each resulting group. The split that creates the most internally consistent groups wins.
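The split-cost search described above can be sketched in a few lines. This is a minimal illustration, not scikit-learn's implementation; the six houses and their prices are made-up numbers chosen so the best split is easy to verify by hand.

```python
import numpy as np

def split_cost(x, y, threshold):
    """Weighted MSE of the two groups created by splitting x at threshold."""
    n = len(y)
    cost = 0.0
    for group in (y[x <= threshold], y[x > threshold]):
        if len(group) > 0:
            cost += len(group) / n * np.mean((group - group.mean()) ** 2)
    return cost

# Six toy houses: square footage and price in thousands (hypothetical numbers)
sqft  = np.array([1000, 1200, 1500, 2200, 2600, 3000])
price = np.array([150.0, 165.0, 190.0, 260.0, 300.0, 340.0])

# Candidate thresholds: midpoints between consecutive sorted sqft values
candidates = (sqft[:-1] + sqft[1:]) / 2
costs = [split_cost(sqft, price, t) for t in candidates]
best = candidates[int(np.argmin(costs))]
print(f"Best split: sqft <= {best:.0f}")  # separates the three cheap houses from the three expensive ones
```

On this toy data the winning threshold is 1,850 sqft, the midpoint that cleanly separates the low-priced and high-priced groups.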

The algorithm is "greedy" because it picks the locally optimal split at each step without considering whether a different choice might lead to a better tree overall. This makes the process fast (training time is roughly $O(n \cdot p \cdot \log n)$ for $n$ samples and $p$ features) but doesn't guarantee a globally optimal tree.

A single tree on house prices

Let's build a shallow regression tree and examine its performance on our house price dataset.
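The article's exact dataset isn't reproduced here, so the sketch below generates synthetic house data with a hypothetical price formula (price driven mainly by square footage, plus smaller bedroom and age effects) and fits a depth-3 `DecisionTreeRegressor`; the printed numbers will differ from the expected output below, but the structure of the run is the same.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Synthetic stand-in for the article's dataset; the price formula is hypothetical
rng = np.random.default_rng(42)
n = 200
sqft = rng.uniform(1000, 3000, n)
bedrooms = rng.integers(1, 6, n).astype(float)
age = rng.uniform(0, 50, n)
price = 50_000 + 100 * sqft + 10_000 * bedrooms - 300 * age + rng.normal(0, 15_000, n)

X = np.column_stack([sqft, bedrooms, age])
X_train, X_test, y_train, y_test = train_test_split(X, price, random_state=42)

tree = DecisionTreeRegressor(max_depth=3, random_state=42).fit(X_train, y_train)
pred = tree.predict(X_test)
rmse = float(np.sqrt(np.mean((y_test - pred) ** 2)))

print(f"RMSE: ${rmse:,.0f}")
print(f"R²:   {r2_score(y_test, pred):.4f}")
print(f"Tree depth: {tree.get_depth()}")
print(f"Number of leaves: {tree.get_n_leaves()}")
```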

Expected output:

```
=== Single Decision Tree (max_depth=3) ===
RMSE: $49,979
R²:   0.9452
Tree depth: 3
Number of leaves: 8
```

With just 8 leaf nodes, the tree captures 94.5% of the variance in house prices. That's strong for a tree only three levels deep. But this shallow tree can only output 8 distinct price predictions, a coarse approximation at best. Making the tree deeper helps, but introduces a critical problem.

Single trees suffer from high variance

A single decision tree without depth constraints will keep splitting until every leaf contains a single training sample. Training error drops to zero, but the model has memorized noise rather than learned signal.

This is the bias-variance tradeoff at its most visible:

| Model | Bias | Variance | Typical Behavior |
|---|---|---|---|
| Linear Regression | High | Low | Misses non-linear patterns, stable across datasets |
| Shallow Tree (depth 3) | Moderate | Low | Captures major splits, misses fine structure |
| Deep Tree (no limit) | Low | High | Fits training data perfectly, unstable on new data |
| Random Forest | Low | Low | Averages away instability, keeps flexibility |

Remove one training house from the dataset, retrain the tree, and you might get a completely different tree structure. That instability is the core weakness. We need a way to keep the tree's ability to model complex patterns while reducing its sensitivity to individual data points.

Common Pitfall: Setting max_depth=None and hoping for the best is the most common mistake with decision trees. An unrestricted tree on 200 samples can grow 13 levels deep with 160 leaves, essentially creating a lookup table for the training data.

Random forest reduces variance through ensemble averaging

A random forest builds multiple decision trees on different views of the same training data and averages their predictions (Breiman, 2001). By combining many unstable but accurate trees, the ensemble produces predictions that are both accurate and stable.

The mathematical foundation is straightforward. If you have $B$ trees, each producing prediction $\hat{f}_b(x)$, the forest's prediction is:

$$\hat{f}_{RF}(x) = \frac{1}{B}\sum_{b=1}^{B}\hat{f}_b(x)$$

Where:

  • $\hat{f}_{RF}(x)$ is the random forest's prediction for input $x$
  • $B$ is the total number of trees (the n_estimators parameter)
  • $\hat{f}_b(x)$ is the prediction from tree $b$

In Plain English: Each tree makes its own price estimate for a house. Some will guess too high, others too low. When you average 100 guesses from trees that each looked at slightly different data and features, the errors tend to cancel out and the average lands closer to the true price.
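You can verify the averaging directly in scikit-learn: a fitted forest exposes its trees through the `estimators_` attribute, and `forest.predict` is exactly the mean of their individual predictions. The one-feature data below is a hypothetical stand-in.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy data: price driven by square footage alone (hypothetical formula)
rng = np.random.default_rng(0)
sqft = rng.uniform(1000, 3000, 300)
price = 50_000 + 100 * sqft + rng.normal(0, 15_000, 300)
X = sqft.reshape(-1, 1)

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, price)

# estimators_ holds the 50 fitted trees; the forest prediction is their mean
x_new = np.array([[2400.0]])
per_tree = np.array([t.predict(x_new)[0] for t in forest.estimators_])
print(f"Mean of 50 trees: ${per_tree.mean():,.0f}")
print(f"forest.predict(): ${forest.predict(x_new)[0]:,.0f}")
```

The two printed values match, and the spread across `per_tree` shows how much the individual trees disagree before averaging.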

This only works if the trees disagree enough. If every tree makes the same mistake, averaging won't help. Random forest forces diversity through two mechanisms.

[Figure: Random forest pipeline showing bagging and feature randomness]

Bagging creates different training sets

Bootstrap aggregating (bagging) trains each tree on a different random sample drawn with replacement from the training data. A bootstrap sample of $N$ observations contains roughly 63.2% of the unique original rows, with some rows appearing multiple times. Each tree sees a slightly different version of reality.
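The 63.2% figure comes from the limit $1 - (1 - 1/N)^N \to 1 - e^{-1} \approx 0.632$, and a quick simulation confirms it:

```python
import numpy as np

rng = np.random.default_rng(7)
N = 100_000

# A bootstrap sample: N draws with replacement from N row indices
sample = rng.integers(0, N, size=N)
unique_fraction = len(np.unique(sample)) / N
print(f"Unique rows in bootstrap sample: {unique_fraction:.1%}")  # ≈ 63.2%
```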

Feature randomness decorrelates the trees

At every split in every tree, the algorithm selects a random subset of $m$ features (out of $p$ total) and searches for the best split only within that subset. For regression, scikit-learn defaults to max_features=1.0 (all features), but setting max_features="sqrt" or max_features=0.33 forces trees to find different splitting strategies. If one feature dominates (like square footage in our house price data), feature randomness prevents every tree from splitting on it first, producing genuinely different tree structures.

Pro Tip: The combination of bagging and feature randomness is what makes random forest strictly better than simple bagging. Bagging alone still produces correlated trees when one feature is much stronger than the rest. Feature randomness breaks that correlation.

Random forest outperforms single trees on the same data

Let's compare an unrestricted single tree against a 100-tree random forest on our house price dataset.
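A sketch of the comparison, again on synthetic data with a hypothetical price formula, so the exact numbers won't match the expected output below, but the forest should beat the unrestricted tree:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Synthetic house data (hypothetical price formula)
rng = np.random.default_rng(42)
n = 200
sqft = rng.uniform(1000, 3000, n)
bedrooms = rng.integers(1, 6, n).astype(float)
age = rng.uniform(0, 50, n)
price = 50_000 + 100 * sqft + 10_000 * bedrooms - 300 * age + rng.normal(0, 15_000, n)
X = np.column_stack([sqft, bedrooms, age])
X_tr, X_te, y_tr, y_te = train_test_split(X, price, random_state=42)

tree = DecisionTreeRegressor(random_state=42).fit(X_tr, y_tr)      # unrestricted
forest = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_tr, y_tr)

rmse = lambda m: float(np.sqrt(np.mean((y_te - m.predict(X_te)) ** 2)))
rmse_tree, rmse_forest = rmse(tree), rmse(forest)
print(f"Single Tree   | RMSE: ${rmse_tree:,.0f} | R²: {r2_score(y_te, tree.predict(X_te)):.4f}")
print(f"Random Forest | RMSE: ${rmse_forest:,.0f} | R²: {r2_score(y_te, forest.predict(X_te)):.4f}")
```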

Expected output:

```
=== Unrestricted Single Tree vs Random Forest ===
Single Tree  | RMSE: $39,793  | R²: 0.9653
Random Forest| RMSE: $29,401  | R²: 0.9810

Variance reduction: 26.1% lower RMSE
Deep tree depth: 13
Deep tree leaves: 160
```

The random forest cuts RMSE by 26.1% compared to the single tree, pushing R-squared from 0.965 to 0.981. The single tree grew 13 levels deep and created 160 leaves for just 160 training samples, essentially a lookup table. The forest achieves better accuracy without memorizing the training data.

Feature importance reveals what drives predictions

Random forests track how much each feature reduces the total MSE across all splits in all trees. This metric, called Mean Decrease in Impurity (MDI), provides a built-in ranking of feature relevance.
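In scikit-learn, MDI is exposed as the fitted model's `feature_importances_` attribute, already normalized to sum to 1. The sketch below uses synthetic data with a hypothetical price formula, so the exact percentages will differ from the expected output that follows.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data where sqft carries most of the signal (hypothetical formula)
rng = np.random.default_rng(42)
n = 300
sqft = rng.uniform(1000, 3000, n)
bedrooms = rng.integers(1, 6, n).astype(float)
age = rng.uniform(0, 50, n)
price = 50_000 + 100 * sqft + 10_000 * bedrooms - 300 * age + rng.normal(0, 15_000, n)
X = np.column_stack([sqft, bedrooms, age])

forest = RandomForestRegressor(n_estimators=100, random_state=42).fit(X, price)

# feature_importances_ is the MDI ranking, normalized to sum to 1
ranked = sorted(zip(["sqft", "bedrooms", "age"], forest.feature_importances_),
                key=lambda pair: -pair[1])
for i, (name, imp) in enumerate(ranked, 1):
    print(f"{i}. {name:9s} {imp:.4f}")
```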

Expected output:

```
=== Feature Importance (MDI) ===
1. sqft         0.8745
2. bedrooms     0.1151
3. age          0.0104
```

Square footage dominates at 87.5%, which makes sense given our price formula. Bedrooms contribute 11.5%, and property age barely registers at 1%. This kind of instant interpretability is why random forests remain popular even when gradient boosting methods like XGBoost might squeeze out slightly better accuracy. For deeper analysis, permutation importance (available via sklearn.inspection.permutation_importance) is more reliable than MDI, especially when features have different scales or cardinalities. See feature selection strategies for a thorough comparison.

Pro Tip: MDI feature importance is biased toward high-cardinality features (continuous variables get favored over binary ones). For production feature selection, always confirm MDI results with permutation importance, which measures how much the model's test-set accuracy drops when each feature is shuffled.
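The permutation check mentioned above is one call in scikit-learn. This sketch reuses the same hypothetical synthetic setup; `permutation_importance` shuffles each feature on held-out data and reports the resulting drop in the model's score.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic house data (hypothetical formula); importance measured on held-out rows
rng = np.random.default_rng(42)
n = 300
sqft = rng.uniform(1000, 3000, n)
bedrooms = rng.integers(1, 6, n).astype(float)
age = rng.uniform(0, 50, n)
price = 50_000 + 100 * sqft + 10_000 * bedrooms - 300 * age + rng.normal(0, 15_000, n)
X = np.column_stack([sqft, bedrooms, age])
X_tr, X_te, y_tr, y_te = train_test_split(X, price, random_state=42)

forest = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# Shuffle each feature in turn and record the drop in test-set R²
result = permutation_importance(forest, X_te, y_te, n_repeats=10, random_state=42)
for name, mean in zip(["sqft", "bedrooms", "age"], result.importances_mean):
    print(f"{name:9s} R² drop: {mean:.4f}")
```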

Hyperparameter tuning controls the bias-variance balance

The four parameters that matter most for RandomForestRegressor are n_estimators, max_depth, max_features, and min_samples_leaf. Each one controls a different aspect of the bias-variance tradeoff.

| Parameter | Default | Recommended Range | Effect |
|---|---|---|---|
| n_estimators | 100 | 100 - 500 | More trees = lower variance. Diminishing returns past ~200 |
| max_depth | None | 10 - 30, or None | Deeper trees = lower bias, higher per-tree variance |
| max_features | 1.0 | "sqrt", 0.33, 1.0 | Smaller = more decorrelation, sometimes higher bias |
| min_samples_leaf | 1 | 1 - 10 | Higher = smoother predictions, prevents fitting noise |
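A sketch of the comparison using `cross_val_score`, on the same kind of hypothetical synthetic data; the specific scores will differ from the expected output below, but the relative patterns should hold.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic house data (hypothetical formula), scored with 5-fold CV
rng = np.random.default_rng(42)
n = 200
sqft = rng.uniform(1000, 3000, n)
bedrooms = rng.integers(1, 6, n).astype(float)
age = rng.uniform(0, 50, n)
price = 50_000 + 100 * sqft + 10_000 * bedrooms - 300 * age + rng.normal(0, 15_000, n)
X = np.column_stack([sqft, bedrooms, age])

configs = [
    dict(n_estimators=10, max_depth=5, max_features=1.0),
    dict(n_estimators=100, max_depth=5, max_features=1.0),
    dict(n_estimators=100, max_depth=None, max_features="sqrt"),
]
results = []
for cfg in configs:
    scores = cross_val_score(RandomForestRegressor(random_state=42, **cfg),
                             X, price, cv=5, scoring="r2")
    results.append((scores.mean(), scores.std()))
    print(cfg, f"-> mean R²: {scores.mean():.4f}, std: {scores.std():.4f}")
```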

Expected output:

```
=== Hyperparameter Comparison (5-Fold CV) ===
n_est | depth | max_feat |  Mean R² | Std R²
--------------------------------------------------
   10 |     5 |      1.0 |   0.9749 | 0.0060
  100 |     5 |      1.0 |   0.9766 | 0.0054
  100 |  None |      1.0 |   0.9761 | 0.0051
  100 |  None |     sqrt |   0.9574 | 0.0109
  300 |  None |     sqrt |   0.9587 | 0.0088
```

Two patterns emerge. First, adding more trees (10 to 100) improves stability (lower std) more than accuracy. Second, max_features="sqrt" actually hurts performance here because we only have 3 features, so sqrt(3) rounds to 1, forcing each split to pick from a single random feature. With many features (50+), sqrt becomes essential for decorrelation. For systematic tuning, consider automated hyperparameter search or validate with cross-validation.

Key Insight: The max_features parameter needs to match your feature count. With 3 features, sqrt or log2 forces trees to ignore most features at each split, increasing bias without enough decorrelation benefit. With 100+ features, sqrt or 0.33 is usually the best starting point.

Trees cannot extrapolate beyond training data

The most significant limitation of any tree-based model is that predictions are bounded by the target values seen during training. A regression tree's prediction is always the mean of some leaf's training samples, so it can never output a value outside the range of the training set.
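The flatline is easy to reproduce. This sketch trains on hypothetical data covering 1,000-3,000 sqft only, then queries 4,000 and 5,000 sqft: the linear model keeps climbing while the forest returns the same capped value for both.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

# Train only on 1,000-3,000 sqft (hypothetical formula), then query beyond that range
rng = np.random.default_rng(0)
sqft_train = rng.uniform(1000, 3000, 300)
price_train = 50_000 + 100 * sqft_train + rng.normal(0, 5_000, 300)
X_train = sqft_train.reshape(-1, 1)

linreg = LinearRegression().fit(X_train, price_train)
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, price_train)

X_new = np.array([[4000.0], [5000.0]])
lin_pred = linreg.predict(X_new)
rf_pred = forest.predict(X_new)
print("Linear regression:", [f"${p:,.0f}" for p in lin_pred])
print("Random forest:    ", [f"${p:,.0f}" for p in rf_pred])  # same leaf in every tree
```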

Expected output:

```
=== Extrapolation Comparison ===
  Sqft | True Price | Linear Reg | Rand Forest
--------------------------------------------------
  1000 | $  150,000 | $  152,077 | $  150,705
  2000 | $  250,000 | $  249,203 | $  243,272
  3000 | $  350,000 | $  346,330 | $  350,068
  4000 | $  450,000 | $  443,456 | $  350,068
  5000 | $  550,000 | $  540,583 | $  350,068
```

Within the training range (1,000 to 3,000 sqft), both models perform well. Beyond 3,000 sqft, the random forest flatlines at $350,068, the maximum average leaf value from its training data. Linear regression continues the trend accurately. This is not a bug; it's a fundamental property of how trees make predictions.

Common Pitfall: If your production data might contain values outside the training range (new product categories, price inflation over time, unusual house sizes), a pure tree model will silently produce bad predictions. Always check your input distributions against training data ranges.

When to use regression trees and random forest

[Figure: Decision guide for choosing between tree-based and linear models]

When random forest is the right choice

  • Tabular data with non-linear relationships and feature interactions. This is where trees shine. No feature engineering required.
  • You need a model in 30 minutes. Random forests work well with default parameters, need no feature scaling, and handle missing values (in some implementations). They're the fastest path from raw data to a decent model.
  • Interpretability matters at the feature level. Feature importance gives stakeholders a clear ranking of what drives predictions.
  • The target range in production matches training data. No extrapolation needed.

When to choose something else

  • Extrapolation is required. Time series forecasting, predicting prices that will exceed historical ranges, any setting where the target will grow beyond observed values. Use linear models or hybrid approaches instead.
  • High-dimensional sparse data. Text (TF-IDF), one-hot encoded categories with thousands of levels. Linear models or neural networks are better fits.
  • Maximum predictive accuracy is the goal. Gradient boosting (XGBoost, LightGBM) usually beats random forest by 2-5% on structured data because it fits residuals sequentially rather than averaging parallel trees.
  • Training data is tiny. With fewer than 100 samples, a single regularized tree or linear model is more appropriate. Forests need enough data to populate diverse bootstrap samples.

Production considerations

| Concern | Random Forest | Single Tree |
|---|---|---|
| Training time | $O(B \cdot n \cdot p \cdot \log n)$ | $O(n \cdot p \cdot \log n)$ |
| Inference time | $O(B \cdot \log n)$ per sample | $O(\log n)$ per sample |
| Memory | Stores $B$ trees | Stores 1 tree |
| Parallelization | Embarrassingly parallel (n_jobs=-1) | Not applicable |

For 1M training rows and 100 features, training 100 trees takes roughly 30 to 60 seconds on modern hardware. Inference is fast enough for real-time serving: each tree traversal is $O(\log n)$, and all trees can be evaluated in parallel.

Conclusion

Regression trees bring a fundamentally different approach to prediction. Instead of fitting a global equation, they partition the feature space into regions and assign a constant prediction to each region. That simplicity makes them powerful for non-linear data but unable to extrapolate beyond what they've seen.

Random forests solve the single tree's variance problem through a deceptively simple idea: train many trees on different data subsets with random feature restrictions, then average their predictions. The result is a model that's accurate, stable, and interpretable through feature importance. Breiman's original 2001 paper remains one of the most cited works in machine learning for good reason.

For problems where random forest falls short, the natural next step is gradient boosting, which builds trees sequentially to correct each other's errors rather than averaging independent trees. And when your data is fundamentally linear, a well-tuned linear regression will always be more appropriate. To understand how ensemble methods compare more broadly, including stacking and voting, see our ensemble learning guide.

The best model is the one that matches your data's structure. Trees for interactions. Lines for trends. And forests when you need both accuracy and reliability.

Frequently Asked Interview Questions

Q: Why can't a random forest extrapolate beyond its training data range?

Every prediction from a tree is the average of training samples in a leaf node. Leaves are bounded by the minimum and maximum target values in the training set. When a new input falls outside the training feature range, it gets routed to the nearest leaf, producing a capped prediction. Linear models don't have this limitation because they follow a slope that extends indefinitely.

Q: How does feature randomness in random forest differ from bagging alone?

Bagging trains each tree on a different bootstrap sample but lets every tree consider all features at every split. If one feature is much stronger than the rest, every bagged tree will split on it first, producing highly correlated trees that average to nearly the same prediction. Feature randomness forces each split to choose from a random subset of features, so different trees discover different splitting patterns. This decorrelation is what drives the variance reduction.

Q: What happens if you set n_estimators too high in a random forest?

Unlike deep learning, more trees in a random forest do not cause overfitting. Each additional tree reduces variance (or leaves it unchanged). The only cost is computation: training time and memory scale linearly with tree count. In practice, performance gains plateau around 200 to 300 trees for most datasets.

Q: When would you pick a single decision tree over a random forest?

When full interpretability is the priority. A single shallow tree can be printed and explained to non-technical stakeholders in minutes. Random forests sacrifice this transparency for accuracy. Regulatory environments (credit scoring, healthcare) sometimes require models where every prediction can be traced through explicit rules.

Q: How do you handle missing values with random forest in scikit-learn?

Scikit-learn's RandomForestRegressor historically did not handle missing values natively, so the standard approach is to impute first (using SimpleImputer or KNNImputer); recent scikit-learn releases have added native NaN support to tree-based models, so check your version's documentation. Implementations like LightGBM and XGBoost handle missing values internally by learning an optimal direction for missing data at each split.
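A version-independent way to combine the two steps is a pipeline, sketched here on toy data with a hypothetical formula and NaNs punched into random positions:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestRegressor

# Two-feature toy data (hypothetical formula) with holes punched in it
rng = np.random.default_rng(1)
X = rng.uniform(1000, 3000, size=(100, 2))
y = 100 * X[:, 0] + 5 * X[:, 1] + rng.normal(0, 5_000, 100)

X_missing = X.copy()
X_missing[rng.integers(0, 100, 15), rng.integers(0, 2, 15)] = np.nan

# Impute medians first, then fit the forest on the completed matrix
model = make_pipeline(SimpleImputer(strategy="median"),
                      RandomForestRegressor(n_estimators=50, random_state=1))
model.fit(X_missing, y)
print(f"Training R²: {model.score(X_missing, y):.3f}")
```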

Q: What is the out-of-bag (OOB) error, and why does it matter?

Each bootstrap sample excludes roughly 36.8% of the training data. The OOB error is the model's prediction error on those excluded samples, aggregated across all trees. It provides a free validation estimate without needing a separate holdout set. Enable it with oob_score=True in scikit-learn. OOB error closely approximates leave-one-out cross-validation error.
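A minimal sketch of the OOB estimate on hypothetical one-feature data; after fitting with oob_score=True, the score is available as the `oob_score_` attribute.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# One-feature toy data (hypothetical formula)
rng = np.random.default_rng(2)
sqft = rng.uniform(1000, 3000, 300)
price = 50_000 + 100 * sqft + rng.normal(0, 15_000, 300)
X = sqft.reshape(-1, 1)

# Each sample is scored only by the trees whose bootstrap sample excluded it
forest = RandomForestRegressor(n_estimators=200, oob_score=True, random_state=2)
forest.fit(X, price)
print(f"OOB R²: {forest.oob_score_:.3f}")
```

No holdout set was needed: the 36.8% of rows each tree never saw serve as its validation data.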

Q: How does random forest compare to gradient boosting on tabular data?

Gradient boosting methods (XGBoost, LightGBM) typically achieve 2-5% lower error than random forest because they build trees sequentially, with each new tree focusing on the residuals of the previous ensemble. Random forest builds trees independently in parallel. The tradeoff: gradient boosting is more prone to overfitting and requires careful hyperparameter tuning, while random forest is more forgiving with defaults.

Hands-On Practice

Understanding the theoretical differences between single Regression Trees and Random Forest ensembles is crucial, but seeing them perform side-by-side on real data solidifies that knowledge. In this hands-on tutorial, you will build both a single Decision Tree Regressor and a Random Forest Regressor to predict house prices, directly observing how ensemble methods reduce variance and improve stability compared to single trees. We will use a housing dataset containing features like square footage and lot size to demonstrate how these non-linear models capture complex pricing patterns that simple linear equations might miss.

Dataset: House Prices (Linear) House pricing data with clear linear relationships. Square footage strongly predicts price (R² ≈ 0.87). Perfect for demonstrating linear regression fundamentals.

Try increasing max_depth in the single tree model to 10 or 15 and observe how the training accuracy might increase while test accuracy often drops, a classic sign of overfitting. Conversely, try changing n_estimators in the Random Forest to 10 and then 200 to see how adding more trees stabilizes the prediction but yields diminishing returns after a certain point. Experimenting with min_samples_split is also highly recommended to understand how restricting leaf size changes the model's granularity.

Explore all career paths