<!-- slug: gradient-boosting-the-definitive-guide-to-boosting-weak-learners --> <!-- excerpt: Master gradient boosting from residual fitting to regularization. Build house-price models with scikit-learn, understand the math behind loss functions, and learn when boosting beats other ensembles. -->
Gradient boosting builds accurate predictions by stacking small corrections on top of each other. The first model guesses the average house price. The second model looks at the leftover errors and tries to predict those. A third model tackles whatever errors remain. After a few hundred of these tiny corrections, the ensemble predicts house prices with startling accuracy.
This idea, fitting each new model to the mistakes of the previous ensemble, is what makes gradient boosting one of the most effective algorithms for structured data. It powers fraud detection at PayPal, click-through prediction at Yandex, and a large share of winning solutions on Kaggle. Jerome Friedman formalized the method in his 2001 paper "Greedy Function Approximation: A Gradient Boosting Machine", and the core algorithm hasn't changed since.
Figure: Gradient boosting iteratively fits shallow trees to residual errors to build a strong ensemble model.
How gradient boosting fits residuals
Gradient boosting is a sequential ensemble method that reduces prediction error by training each new weak learner on the residual errors of the current ensemble. Where Random Forest trains hundreds of deep trees independently and averages them to reduce variance, gradient boosting trains hundreds of shallow trees sequentially, each one correcting the specific mistakes its predecessors left behind.
Think of it as an archery coach and student. The student shoots and misses 30 centimeters to the left. The coach says "adjust 30 centimeters right." The student shoots again, overshoots by 5 centimeters. The coach corrects again. Each round focuses exclusively on the remaining error, and the corrections shrink rapidly.
Here is how this plays out with house prices:
- Initialize: Predict every house at the training set's mean price (say, $340,000).
- Compute residuals: A $500,000 house has a residual of +$160,000. A $200,000 house has a residual of -$140,000.
- Fit a shallow tree to residuals: The tree learns that large square footage maps to positive residuals and older homes map to negative ones.
- Update: Add a fraction of the tree's prediction to the current estimate. That $500,000 house might now be predicted at $356,000.
- Repeat: Compute new residuals from the updated predictions and fit the next tree.
After 200 rounds, the ensemble's combined prediction nails most house prices within a few thousand dollars.
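The update arithmetic above is easy to check by hand. Here is a small sketch with made-up prices, assuming a learning rate of 0.1 and a best-case tree that predicts every residual exactly:

```python
import numpy as np

# Four hypothetical houses; the mean is the step-1 initial prediction
prices = np.array([500_000.0, 200_000.0, 340_000.0, 320_000.0])
f0 = prices.mean()  # 340,000

# Step 2: residuals are actual minus predicted
residuals = prices - f0  # [+160000, -140000, 0, -20000]

# Step 4: apply only a fraction (learning rate 0.1) of the correction,
# assuming an idealized tree that predicts each residual perfectly
lr = 0.1
f1 = f0 + lr * residuals

print(f"Initial prediction: {f0:,.0f}")
print(f"Updated prediction for the $500k house: {f1[0]:,.0f}")  # 356,000
```

The $500,000 house moves from $340,000 to $356,000 in one round, exactly the numbers in the list above; the remaining gap is what the next tree sees.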
Key Insight: Each tree in gradient boosting is deliberately weak (typically depth 3 to 5). The power comes from the accumulation of hundreds of small, targeted corrections, not from any single tree being smart.
The loss function drives everything
A loss function measures how far the model's prediction $F(x_i)$ is from the true value $y_i$. Gradient boosting minimizes this loss by performing gradient descent in function space, which is a dense way of saying: at each step, it asks "in which direction should I adjust my predictions to reduce the loss the fastest?"
For regression with squared error:

$$L(y_i, F(x_i)) = \frac{1}{2}\left(y_i - F(x_i)\right)^2$$

Where:
- $L(y_i, F(x_i))$ is the loss for a single house
- $y_i$ is the actual house price
- $F(x_i)$ is the model's current predicted price
- The $\frac{1}{2}$ is a convenience factor that simplifies the derivative
The negative gradient (the direction of steepest descent) is:

$$-\frac{\partial L}{\partial F(x_i)} = y_i - F(x_i)$$
In Plain English: For squared error loss, the negative gradient is just the residual: the actual house price minus what we predicted. When we say "fit the next tree to the negative gradient," we literally mean "fit the next tree to the errors." The math and the intuition line up perfectly.
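You can check this equivalence numerically: approximate the derivative of the squared-error loss with a finite difference and compare it to the residual. The prices below are illustrative, not from any real dataset:

```python
def loss(y, f):
    # Squared error with the 1/2 convenience factor
    return 0.5 * (y - f) ** 2

y, f = 500_000.0, 340_000.0  # actual price, current prediction

# Central finite-difference estimate of dL/dF at the current prediction
eps = 1e-3
grad = (loss(y, f + eps) - loss(y, f - eps)) / (2 * eps)

print(f"Negative gradient: {-grad:,.2f}")
print(f"Residual y - F:    {y - f:,.2f}")  # the two agree
```

The negative gradient and the residual coincide, which is why "fit the next tree to the negative gradient" and "fit the next tree to the errors" describe the same operation for squared error.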
This flexibility is the core insight of Friedman's framework. Swap in a different loss function and the algorithm adjusts automatically:
| Loss Function | Use Case | Negative Gradient |
|---|---|---|
| Squared error | Regression | $y_i - F(x_i)$ (the residual) |
| Absolute error | Outlier-heavy regression | $\text{sign}(y_i - F(x_i))$ |
| Huber loss | Balanced regression | Residual if small, sign if large |
| Log loss (deviance) | Binary classification | $y_i - p_i$ (label minus predicted probability) |
The algorithm doesn't change. Only the definition of "what error looks like" changes with each loss function.
The gradient boosting algorithm step by step
Here is the full procedure, applied to our house-price problem.
Step 1: Initialize with a constant

$$F_0(x) = \arg\min_{\gamma} \sum_{i=1}^{n} L(y_i, \gamma)$$

Where:
- $F_0(x)$ is the initial prediction for every house
- $\gamma$ is the constant value being optimized
- $n$ is the number of training houses
- For squared error, $\gamma$ equals the mean of all training prices
Step 2: For each round $m = 1, 2, \dots, M$:
Compute pseudo-residuals:

$$r_{im} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F(x) = F_{m-1}(x)}$$

Where:
- $r_{im}$ is the pseudo-residual for house $i$ at round $m$
- $F_{m-1}(x_i)$ is the ensemble's current prediction for house $i$
Fit a shallow decision tree $h_m(x)$ to the pseudo-residuals $r_{im}$.
Step 3: Update the ensemble

$$F_m(x) = F_{m-1}(x) + \nu \, h_m(x)$$

Where:
- $F_m(x)$ is the updated ensemble prediction after round $m$
- $\nu$ is the learning rate (typically 0.01 to 0.3)
- $h_m(x)$ is the tree fitted at round $m$
In Plain English: Start by predicting every house at the average price. Measure how far off you are for each house. Train a small tree to predict those errors. Add a small fraction of that tree's prediction to your running estimate. Repeat until the residuals are tiny.
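Those steps translate almost line for line into code. A minimal from-scratch sketch for squared error (where pseudo-residuals are plain residuals), using scikit-learn only for the individual trees and synthetic prices as a stand-in for real data:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
n = 200
X = rng.uniform(800, 3500, (n, 1))            # square footage
y = X[:, 0] * 150 + rng.normal(0, 20_000, n)  # synthetic price

lr, n_rounds = 0.1, 100
pred = np.full(n, y.mean())  # step 1: constant initial prediction
trees = []

for m in range(n_rounds):
    residuals = y - pred                   # step 2: pseudo-residuals
    tree = DecisionTreeRegressor(max_depth=3)
    tree.fit(X, residuals)                 # fit a shallow tree to the errors
    pred += lr * tree.predict(X)           # step 3: damped update
    trees.append(tree)

rmse_start = np.sqrt(np.mean((y - y.mean()) ** 2))
rmse_end = np.sqrt(np.mean((y - pred) ** 2))
print(f"RMSE with constant model:   {rmse_start:,.0f}")
print(f"RMSE after {n_rounds} rounds: {rmse_end:,.0f}")
```

The training error collapses as the loop runs; a production implementation adds regularization, early stopping, and a proper train/validation split on top of this core.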
Learning rate and shrinkage control overfitting
The learning rate (also called shrinkage) is the single most important hyperparameter in gradient boosting. It scales down each tree's contribution before adding it to the ensemble.
$$F_m(x) = F_{m-1}(x) + \nu \, h_m(x), \qquad 0 < \nu \le 1$$

Where:
- $\nu$ is the learning rate controlling each tree's contribution
- Setting $\nu = 1$ means each tree fully corrects the residuals
- Setting $\nu = 0.1$ means each tree corrects only 10% of the residuals
In Plain English: If the model thinks a house's price is $20,000 too low, a learning rate of 0.1 means it corrects by only $2,000 this round. That restraint prevents the model from overreacting to noise in individual training examples. It takes more rounds to converge, but the final model generalizes far better.
Setting $\nu = 1$ lets each tree fully correct all residuals, which sounds efficient but actually memorizes training noise. Small values (0.01 to 0.1) force the model to spread corrections across many trees, and the averaging effect washes out noise.
Pro Tip: There's a direct tradeoff between learning rate and tree count. Cut the learning rate by 10x and you need roughly 10x more trees to reach the same training loss. The sweet spot for most problems: learning_rate=0.05 with n_estimators=500-2000, tuned via early stopping.
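A quick sketch of that tradeoff on synthetic data: each pair below keeps the product of learning rate and tree count roughly constant, so all three should land at a similar test error (the specific pairs are illustrative, not a recommendation):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(42)
n = 400
X = rng.uniform(800, 3500, (n, 1))
y = X[:, 0] * 150 + rng.normal(0, 20_000, n)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=42)

# Cutting the learning rate needs roughly proportionally more trees:
# 0.5 * 40 == 0.1 * 200 == 0.02 * 1000
results = {}
for lr, n_trees in [(0.5, 40), (0.1, 200), (0.02, 1000)]:
    gbr = GradientBoostingRegressor(learning_rate=lr, n_estimators=n_trees,
                                    max_depth=3, random_state=42)
    gbr.fit(X_tr, y_tr)
    results[lr] = np.sqrt(mean_squared_error(y_te, gbr.predict(X_te)))
    print(f"lr={lr:<5} trees={n_trees:<5} test RMSE: {results[lr]:,.0f}")
```

Lower learning rates pay their compute cost in tree count, not in accuracy.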
Regularization beyond the learning rate
Shrinkage alone isn't enough. Gradient boosting has several regularization knobs, and using them together produces models that generalize well on unseen data.
Figure: How learning rate, tree depth, subsampling, and early stopping each prevent overfitting in gradient boosting.
| Technique | Parameter | Typical Range | What It Does |
|---|---|---|---|
| Learning rate | learning_rate | 0.01 to 0.3 | Scales each tree's contribution down |
| Tree depth | max_depth | 3 to 6 | Limits interaction complexity per tree |
| Min samples per leaf | min_samples_leaf | 5 to 50 | Prevents leaves from fitting single outliers |
| Subsampling | subsample | 0.5 to 0.8 | Each tree trains on a random subset of rows |
| Column subsampling | max_features | 0.5 to 1.0 | Each tree considers a random subset of features |
| Early stopping | n_iter_no_change | 10 to 50 | Stops adding trees when validation loss plateaus |
Tree depth is the second most important parameter after learning rate. A depth-3 tree captures up to 3-way feature interactions (e.g., square footage AND bedrooms AND age). Going deeper captures higher-order interactions but also fits noise. For most tabular problems, depth 3 to 5 hits the right balance.
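A small sketch of the depth effect on synthetic data (exact RMSE values depend on the noise seed, but the pattern holds): deeper trees drive training error toward zero much faster than they improve test error.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
n = 300
X = rng.uniform(800, 3500, (n, 2))  # hypothetical sqft and lot size
y = X[:, 0] * 150 + rng.normal(0, 30_000, n)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

train_rmse, test_rmse = {}, {}
for depth in [1, 3, 8]:
    gbr = GradientBoostingRegressor(max_depth=depth, n_estimators=200,
                                    learning_rate=0.1, random_state=0)
    gbr.fit(X_tr, y_tr)
    train_rmse[depth] = np.sqrt(mean_squared_error(y_tr, gbr.predict(X_tr)))
    test_rmse[depth] = np.sqrt(mean_squared_error(y_te, gbr.predict(X_te)))
    print(f"depth={depth} | train RMSE: {train_rmse[depth]:>9,.0f} "
          f"| test RMSE: {test_rmse[depth]:>9,.0f}")
```

Watch the gap between the two columns: when training RMSE keeps falling while test RMSE stalls or rises, the extra depth is fitting noise.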
Subsampling (also called stochastic gradient boosting) trains each tree on a random 50-80% of the training data. This introduces randomness that decorrelates successive trees and reduces overfitting, similar to how bagging works in Random Forest. Friedman showed in his original paper that subsample=0.5 often outperforms using the full dataset.
Early stopping monitors a validation set after each boosting round and halts training once the validation loss stops improving for a set number of rounds. This is the most practical safeguard against overfitting because you don't need to guess the right number of trees upfront.
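In scikit-learn, early stopping is driven by n_iter_no_change together with validation_fraction: the estimator carves off a validation slice itself and stops when the score stops improving. A sketch on synthetic data:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(7)
n = 500
X = rng.uniform(800, 3500, (n, 1))
y = X[:, 0] * 150 + rng.normal(0, 20_000, n)

# Request far more trees than needed; let validation loss decide when to stop
gbr = GradientBoostingRegressor(
    n_estimators=2000,
    learning_rate=0.05,
    max_depth=3,
    validation_fraction=0.2,  # held-out slice used only for the stopping check
    n_iter_no_change=20,      # stop after 20 rounds without improvement
    tol=1e-4,
    random_state=7,
)
gbr.fit(X, y)
print(f"Trees requested: 2000, trees actually fit: {gbr.n_estimators_}")
```

The fitted model exposes the round it stopped at via the n_estimators_ attribute, so you never have to guess the right tree count upfront.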
Common Pitfall: Don't confuse learning_rate with training speed. A lower learning rate makes training slower because you need more trees. But smaller learning rates almost always produce better generalization. Budget the extra compute.
Gradient boosting with scikit-learn
Let's build a house-price model end to end. The data is synthetic so every code block runs in the browser without external files.
<!-- EXEC -->
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
np.random.seed(42)
n = 200
sqft = np.random.randint(800, 3500, n).astype(float)
bedrooms = np.random.randint(1, 6, n).astype(float)
age = np.random.randint(0, 50, n).astype(float)
price = (sqft * 150) + (bedrooms * 10000) - (age * 2000) + np.random.randn(n) * 20000
X = np.column_stack([sqft, bedrooms, age])
y = price
feature_names = ["sqft", "bedrooms", "age"]
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
gbr = GradientBoostingRegressor(
n_estimators=200,
learning_rate=0.1,
max_depth=3,
subsample=0.8,
random_state=42
)
gbr.fit(X_train, y_train)
y_pred = gbr.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"MSE: {mse:,.0f}")
print(f"RMSE: {np.sqrt(mse):,.0f}")
print(f"R2: {r2:.4f}")
print("\nFeature importances:")
for name, imp in zip(feature_names, gbr.feature_importances_):
print(f" {name:>10}: {imp:.4f}")
Expected Output:
MSE: 396,486,659
RMSE: 19,912
R2: 0.9530
Feature importances:
sqft: 0.7279
bedrooms: 0.0413
age: 0.2308
The model explains over 95% of price variance. Square footage dominates the feature importances (as expected for a dataset where price scales linearly with area), followed by age and bedrooms.
Now watch the residuals shrink iteration by iteration:
<!-- EXEC -->
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
np.random.seed(42)
n = 200
sqft = np.random.randint(800, 3500, n).astype(float)
bedrooms = np.random.randint(1, 6, n).astype(float)
age = np.random.randint(0, 50, n).astype(float)
price = (sqft * 150) + (bedrooms * 10000) - (age * 2000) + np.random.randn(n) * 20000
X = np.column_stack([sqft, bedrooms, age])
y = price
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
for n_trees in [1, 5, 20, 50, 200]:
gbr = GradientBoostingRegressor(
n_estimators=n_trees, learning_rate=0.1,
max_depth=3, subsample=0.8, random_state=42
)
gbr.fit(X_train, y_train)
pred = gbr.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, pred))
print(f"Trees: {n_trees:>3} | RMSE: ${rmse:>10,.0f}")
Expected Output:
Trees: 1 | RMSE: $ 79,698
Trees: 5 | RMSE: $ 55,648
Trees: 20 | RMSE: $ 31,015
Trees: 50 | RMSE: $ 21,887
Trees: 200 | RMSE: $ 19,912
With just 1 tree, RMSE is nearly $80,000. By 200 trees, it drops below $20,000. Each additional tree chips away at the remaining errors, and the improvements get smaller as fewer errors remain to correct.
Gradient boosting for classification
For binary classification, gradient boosting swaps squared error for log loss (cross-entropy). The update mechanism is identical: compute negative gradients, fit a tree, update predictions. The only difference is that the "residuals" now represent the gap between predicted probabilities and actual class labels.
<!-- EXEC -->
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score
np.random.seed(42)
n = 300
sqft = np.random.randint(800, 3500, n).astype(float)
bedrooms = np.random.randint(1, 6, n).astype(float)
age = np.random.randint(0, 50, n).astype(float)
score = 0.3 * (sqft / 3500) + 0.3 * (bedrooms / 5) - 0.4 * (age / 50)
prob = 1 / (1 + np.exp(-5 * (score - 0.15)))
sold_fast = (np.random.rand(n) < prob).astype(int)
X = np.column_stack([sqft, bedrooms, age])
X_train, X_test, y_train, y_test = train_test_split(
X, sold_fast, test_size=0.2, random_state=42
)
gbc = GradientBoostingClassifier(
n_estimators=100,
learning_rate=0.1,
max_depth=3,
random_state=42
)
gbc.fit(X_train, y_train)
y_pred = gbc.predict(X_test)
y_proba = gbc.predict_proba(X_test)[:, 1]
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"AUC-ROC: {roc_auc_score(y_test, y_proba):.4f}")
Expected Output:
Accuracy: 0.7833
AUC-ROC: 0.8545
Same algorithm, different loss function. The classifier predicts whether a house sells within 30 days based on the same features. An AUC above 0.85 is strong for synthetic data with deliberate noise.
When to use gradient boosting (and when not to)
Gradient boosting excels in specific situations and fails in others. Here is an honest decision framework.
Use gradient boosting when:
- You have structured/tabular data with mixed feature types
- Accuracy matters more than training speed
- Your dataset is moderate size (1K to 1M rows)
- Feature interactions matter (price depends on sqft AND location AND age together)
- You need built-in feature importance rankings
Skip gradient boosting when:
- Your dataset has fewer than ~100 rows (not enough signal for sequential correction)
- You need real-time predictions with sub-millisecond latency (hundreds of tree traversals add up)
- Your data is unstructured (images, text, audio belong to deep learning)
- Interpretability is paramount (a single decision tree or linear model is far more explainable)
- Training speed is critical and marginal accuracy differences don't matter (Random Forest parallelizes much better)
Gradient boosting versus other boosting variants
Figure: Comparison of AdaBoost, Gradient Boosting, XGBoost, and LightGBM across key dimensions.
Gradient boosting spawned a family of optimized variants. Here is how they compare:
| Criterion | AdaBoost | Gradient Boosting | XGBoost | LightGBM |
|---|---|---|---|---|
| Error signal | Reweights misclassified samples | Negative gradient of loss | 2nd-order gradient (Newton) | 2nd-order gradient |
| Tree growth | Stumps (depth 1) | Depth-limited | Depth-limited | Leaf-wise |
| Regularization | Sample weights | Shrinkage, depth, subsample | L1/L2 on leaf weights | L1/L2, feature/row subsample |
| Split finding | Exact | Exact | Exact + approx histogram | Histogram only |
| Speed on 1M rows | Minutes | Minutes | Seconds | Seconds |
| Categorical features | Needs encoding | Needs encoding | Needs encoding | Native support |
AdaBoost was the first practical boosting algorithm but only adjusts sample weights. Gradient boosting generalized this to arbitrary differentiable loss functions. XGBoost added second-order gradients, regularized leaf weights, and system-level optimizations. LightGBM pushed further with histogram binning and leaf-wise growth for massive speed gains.
For a deeper look at how gradient boosting builds trees from scratch, see our under-the-hood implementation guide.
Production considerations
Training complexity: $O(M \cdot d \cdot n \log n)$, where $M$ is the number of trees, $n$ is the number of samples, and $d$ is the number of features. Training is sequential; you cannot parallelize across trees (only within each tree's split search).
Inference complexity: $O(M \cdot \text{depth})$ per sample. With 500 trees of depth 5, that's 2,500 comparisons per prediction. Fast enough for batch scoring but worth benchmarking for real-time serving.
Memory: Scikit-learn stores all trees in memory. A 1,000-tree model on a dataset with 50 features can consume 100-500 MB. XGBoost and LightGBM are more memory-efficient due to histogram binning.
Scaling to large data: Scikit-learn's GradientBoostingRegressor scans all feature values for each split, which becomes slow past ~50K rows. For larger datasets, XGBoost and LightGBM bin features into 256 buckets (histogram-based splitting), reducing split-finding from $O(n \log n)$ to $O(n)$ per feature. See the scikit-learn ensemble documentation for implementation details and additional parameter guidance.
Conclusion
Gradient boosting earns its reputation through a simple but powerful idea: don't try to be right all at once. Start with a crude guess, measure the errors, and train the next model to correct exactly those errors. Shrinkage ensures each correction is conservative, regularization prevents the model from memorizing noise, and the gradient framework lets you swap loss functions for regression, classification, or ranking without changing the algorithm itself.
The bias-variance tradeoff is central to understanding why gradient boosting works: it starts with high bias (the mean prediction) and reduces it incrementally while controlling variance through shrinkage and tree constraints. If you want to see the algorithm built from raw Python with no library calls, read our gradient boosting from scratch walkthrough. And when you're ready for production-grade implementations, XGBoost and LightGBM take everything in this article and optimize it for speed and scale.
Master the learning rate and tree depth tradeoff, use early stopping religiously, and gradient boosting will be the hardest model to beat on any tabular dataset you throw at it.
Interview Questions
Q: Why does gradient boosting fit trees to residuals instead of the original target?
Each tree corrects the specific mistakes left by all previous trees rather than trying to learn the full target from scratch. This sequential error-correction converts a collection of weak learners (shallow trees) into a strong learner. Mathematically, the residuals are the negative gradient of the loss function with respect to the current predictions, so each new tree moves the ensemble in the direction of steepest loss reduction.
Q: What happens if you set the learning rate to 1.0?
Each tree fully corrects the residuals in one step, which causes the model to memorize training data including noise. This leads to severe overfitting. Values between 0.01 and 0.3 force the model to spread corrections across many trees, and the averaging effect of many small updates improves generalization on unseen data.
Q: How does gradient boosting differ from AdaBoost?
AdaBoost reweights misclassified samples to focus subsequent weak learners on hard examples, but it only works with exponential loss. Gradient boosting generalizes this by fitting trees to the negative gradient of any differentiable loss function. You can use squared error for regression, log loss for classification, or Huber loss for outlier-resistant regression without changing the algorithm.
Q: Your gradient boosting model has much lower training error than validation error. What do you do?
This is classic overfitting. First, reduce the learning rate and increase the number of trees while enabling early stopping on a validation set. If overfitting persists, reduce max_depth to limit tree complexity, increase min_samples_leaf to prevent small leaves from fitting outliers, and lower subsample to 0.5-0.8 to add stochasticity.
Q: When would you choose Random Forest over gradient boosting?
Random Forest trains trees in parallel and is much faster on large datasets. It's also harder to overfit because bagging reduces variance naturally. Choose Random Forest when you need a quick baseline with minimal tuning, when training speed is a constraint, or when your dataset is very noisy and you want stability out of the box. Gradient boosting typically wins on accuracy but demands more careful hyperparameter tuning.
Q: What is stochastic gradient boosting and why does it help?
Stochastic gradient boosting trains each tree on a random subset of the training data (typically 50-80%). This randomness decorrelates successive trees and reduces overfitting, similar to how bootstrap sampling helps Random Forest. Friedman showed in his 2001 paper that subsampling often improves generalization even though each individual tree sees less data.
Q: How do you choose between squared error and Huber loss for regression?
Squared error penalizes large errors quadratically, so a single house priced $1M above its prediction dominates the loss. Huber loss behaves like squared error for small residuals but switches to absolute error for large ones, making it resistant to outliers. If your data has extreme values you can't remove, Huber loss produces a model that doesn't chase outliers at the expense of the majority.
<!-- PLAYGROUND_START data-dataset="lds_regression_linear" -->
Hands-On Practice
Gradient Boosting is a powerhouse technique that builds a strong predictive model by sequentially combining 'weak learners' (simple decision trees), where each new tree corrects the errors of the previous ones. You'll master this concept by applying a Gradient Boosting Regressor to a housing dataset, observing how the model iteratively reduces error, just like a golfer correcting their shots. You will visualize the training process, analyze feature importance, and see firsthand how boosting transforms simple rules into high-accuracy predictions.
Dataset: House Prices (Linear) House pricing data with clear linear relationships. Square footage strongly predicts price (R² ≈ 0.87). Perfect for demonstrating linear regression fundamentals.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score
# ============================================
# STEP 1: LOAD AND EXPLORE THE DATA
# ============================================
# We load the educational dataset containing house prices.
# This dataset has clear patterns, making it ideal for visualizing regression.
df = pd.read_csv("/datasets/playground/lds_regression_linear.csv")
# Display basic dataset information
print(f"Dataset Shape: {df.shape}")
# Expected output: Dataset Shape: (500, 6)
print("\nFirst 5 rows:")
print(df.head())
# Expected output:
# square_feet bedrooms bathrooms age_years lot_size price
# 0 1500 3 2 10 5000 250000.0
#... (values vary)
# ============================================
# STEP 2: DATA PREPROCESSING
# ============================================
# Define features (X) and target (y)
feature_cols = ['square_feet', 'bedrooms', 'bathrooms', 'age_years', 'lot_size']
X = df[feature_cols]
y = df['price']
# Split into training and testing sets (80% train, 20% test)
# We use a fixed random_state for reproducibility in this educational setting.
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
print(f"\nTraining set size: {X_train.shape[0]} samples")
print(f"Testing set size: {X_test.shape[0]} samples")
# Expected output:
# Training set size: 400 samples
# Testing set size: 100 samples
# ============================================
# STEP 3: TRAIN GRADIENT BOOSTING MODEL
# ============================================
# Initialize the Gradient Boosting Regressor
# Key Parameters:
# - n_estimators=100: We will build 100 sequential trees.
# - learning_rate=0.1: Each tree contributes 10% to the final correction (prevents overfitting).
# - max_depth=3: Each tree is shallow (a 'weak learner').
# - random_state=42: Ensures consistent results.
gb_model = GradientBoostingRegressor(
n_estimators=100,
learning_rate=0.1,
max_depth=3,
random_state=42
)
print("\nTraining Gradient Boosting Model...")
gb_model.fit(X_train, y_train)
print("Model training complete.")
# ============================================
# STEP 4: EVALUATE PERFORMANCE
# ============================================
# Make predictions on the test set
y_pred = gb_model.predict(X_test)
# Calculate metrics
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"\nModel Performance Metrics:")
print(f"R² Score: {r2:.4f}")
# Expected output: R² Score: ~0.87 (High accuracy is expected for this clean dataset)
print(f"Mean Squared Error: {mse:,.0f}")
# Expected output: Mean Squared Error: (Large number depending on price scale)
# NOTE: High accuracy (~0.87) is expected with this educational dataset!
# The patterns are intentionally clear to help you learn the algorithm.
# Real-world datasets typically have more noise and lower accuracy.
# ============================================
# STEP 5: VISUALIZE TRAINING PROGRESS (THE 'GOLF' ANALOGY)
# ============================================
# We visualize how the model improved with each additional tree (estimator).
# The 'train_score_' attribute stores the loss (error) at each stage.
test_score = np.zeros((100,), dtype=np.float64)
# We can also track test error at each stage to check for overfitting
for i, y_pred_stage in enumerate(gb_model.staged_predict(X_test)):
test_score[i] = mean_squared_error(y_test, y_pred_stage)
plt.figure(figsize=(10, 5))
plt.title('Gradient Boosting: Error Reduction per Iteration')
plt.plot(np.arange(100) + 1, gb_model.train_score_, 'b-', label='Training Set Error')
plt.plot(np.arange(100) + 1, test_score, 'r--', label='Test Set Error')
plt.legend(loc='upper right')
plt.xlabel('Number of Boosting Iterations (Trees)')
plt.ylabel('Squared Error (Loss)')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# ============================================
# STEP 6: FEATURE IMPORTANCE & PREDICTION PLOT
# ============================================
# 1. Feature Importance Plot
# Shows which features the model relied on most to make corrections.
feature_importance = gb_model.feature_importances_
sorted_idx = np.argsort(feature_importance)
pos = np.arange(sorted_idx.shape[0]) + 0.5
plt.figure(figsize=(12, 5))
# Subplot 1: Feature Importance
plt.subplot(1, 2, 1)
plt.barh(pos, feature_importance[sorted_idx], align='center')
plt.yticks(pos, np.array(feature_cols)[sorted_idx])
plt.title('Feature Importance (What drives price?)')
plt.xlabel('Relative Importance')
# Subplot 2: Actual vs Predicted
plt.subplot(1, 2, 2)
plt.scatter(y_test, y_pred, alpha=0.6, color='green')
plt.plot([y.min(), y.max()], [y.min(), y.max()], 'k--', lw=2) # Perfect prediction line
plt.xlabel('Actual Price')
plt.ylabel('Predicted Price')
plt.title(f'Actual vs Predicted (R²={r2:.2f})')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print("\nAnalysis Complete. The plots above show:")
print("1. How the error decreased as we added more trees.")
print("2. Which features (likely square_feet) were most important.")
print("3. How close our predictions were to the actual prices.")
Now that you've built a gradient boosting model, try experimenting with the learning_rate. Decrease it to 0.01 and see if you need to increase n_estimators to achieve the same accuracy (this is the classic trade-off). You can also try increasing max_depth to see if the model begins to overfit by memorizing the training data instead of learning general patterns.
<!-- PLAYGROUND_END -->