AdaBoost: The Definitive Guide to Adaptive Boosting

LDS Team
Let's Data Science

A single shallow decision tree classifying credit card transactions will miss most of the tricky cases. It might catch obvious fraud (a $9,000 purchase from a new country at 3 AM) but completely whiff on the borderline ones: a $47 grocery charge that falls slightly outside the cardholder's usual radius. What if, instead of building one smarter tree, you trained fifty simple ones in sequence, and each new tree focused specifically on the transactions that tripped up its predecessors?

That is exactly what AdaBoost (Adaptive Boosting) does. Proposed by Yoav Freund and Robert Schapire in their 1997 paper, AdaBoost was the first practical boosting algorithm and proved a remarkable theoretical result: any collection of classifiers that each perform just slightly better than a coin flip can be combined into an arbitrarily accurate predictor. The work earned Freund and Schapire the 2003 Gödel Prize for outstanding contributions to theoretical computer science.

We will build AdaBoost from the ground up using a single running example: classifying credit card transactions as fraudulent or legitimate, where hard-to-classify borderline cases receive progressively more weight with each boosting round.

The AdaBoost Algorithm

AdaBoost is a sequential ensemble method that builds a strong classifier by chaining together many weak learners, typically decision stumps (trees with a single split, depth = 1). Unlike Random Forest, which trains trees in parallel on bootstrap samples, AdaBoost trains them one after another. Each new stump inherits a modified version of the dataset where previously misclassified samples carry heavier weight.

Think of it as a fraud investigation team. The first analyst reviews all transactions equally and catches the easy cases. The second analyst receives the same pile, but every transaction the first analyst got wrong is now highlighted in bold. The third analyst sees an even more lopsided stack, dominated by the cases both predecessors missed. At the end, the team votes on each transaction, and each analyst's vote is weighted by their overall track record.

[Figure: AdaBoost iteration cycle showing how weak learners are trained, errors computed, weights updated, and the process repeated]

Key Insight: A "weak learner" only needs accuracy above 50% for binary classification. A decision stump that splits transactions on a single feature (say, transaction amount > $500) qualifies. AdaBoost's power comes from stitching dozens of these weak rules into a strong composite.

The AdaBoost Algorithm Step by Step

The full algorithm operates on a dataset of $N$ labeled samples where each label is $y_i \in \{-1, +1\}$. In our fraud example, $+1$ marks fraud and $-1$ marks legitimate.

Step 1: Initialize Equal Weights

Every sample starts with the same weight:

$$w_i = \frac{1}{N}$$

Where:

  • $w_i$ is the weight assigned to sample $i$
  • $N$ is the total number of training samples

In Plain English: If your fraud dataset has 1,000 transactions, each one starts with a weight of 0.001. The algorithm pays the same attention to an obvious $8,000 overseas purchase as to a borderline $47 grocery charge.
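
Step 1 is a single line of NumPy. A minimal sketch (the variable names are illustrative):

```python
import numpy as np

N = 1000                   # number of transactions in our fraud example
w = np.full(N, 1.0 / N)    # every sample starts with weight 1/N

print(w[:3])               # first few weights, each 0.001
print(w.sum())             # the weights form a probability distribution summing to 1
```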

Step 2: Train a Weak Learner on Weighted Data

Fit a decision stump $h_t(x)$ that minimizes the weighted classification error:

$$\epsilon_t = \frac{\sum_{i=1}^{N} w_i \cdot \mathbb{1}(y_i \neq h_t(x_i))}{\sum_{i=1}^{N} w_i}$$

Where:

  • $\epsilon_t$ is the weighted error rate of the $t$-th weak learner
  • $w_i$ is the current weight of sample $i$
  • $\mathbb{1}(y_i \neq h_t(x_i))$ is 1 when the prediction is wrong, 0 when correct
  • $h_t(x_i)$ is the stump's prediction for sample $i$

In Plain English: The stump picks one feature and one threshold (e.g., "transaction amount > $500"). The error isn't a simple count of wrong answers; misclassifying a heavily weighted borderline transaction penalizes the score far more than misclassifying a lightly weighted obvious one.
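
The difference between a plain error count and the weighted error is easy to see numerically. A small hypothetical example (labels, predictions, and weights invented for illustration):

```python
import numpy as np

# True labels and one stump's predictions (+1 fraud, -1 legitimate)
y      = np.array([ 1, -1,  1, -1,  1])
h_pred = np.array([ 1, -1, -1, -1,  1])        # the stump misses only sample 2
w      = np.array([0.1, 0.1, 0.4, 0.1, 0.3])   # sample 2 carries heavy weight

miss = (y != h_pred)
plain_error = miss.mean()                      # unweighted: 1 wrong out of 5
eps = np.sum(w * miss) / np.sum(w)             # weighted: dominated by the 0.4-weight miss

print(f"plain error    = {plain_error:.2f}")   # 0.20
print(f"weighted error = {eps:.2f}")           # 0.40
```

One mistake out of five looks like a 20% error rate, but because that one sample carries 40% of the total weight, the stump is scored as if it had missed nearly half the data.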

Step 3: Compute Learner Influence

Each stump earns a vote weight $\alpha_t$ proportional to its accuracy:

$$\alpha_t = \frac{1}{2} \ln\left(\frac{1 - \epsilon_t}{\epsilon_t}\right)$$

Where:

  • $\alpha_t$ is the influence (vote weight) of the $t$-th learner
  • $\epsilon_t$ is the weighted error from Step 2
  • $\ln$ is the natural logarithm

In Plain English: A stump that correctly classifies 90% of the weighted fraud transactions earns a loud vote. One that barely beats 50% gets a whisper. If a stump performs worse than random ($\epsilon_t > 0.5$), its $\alpha_t$ goes negative, effectively flipping its predictions in the final ensemble.

| Weighted Error ($\epsilon_t$) | $\alpha_t$ | Practical Meaning |
| --- | --- | --- |
| 0.01 | 2.30 | Near-perfect stump, dominant vote |
| 0.10 | 1.10 | Strong stump |
| 0.30 | 0.42 | Decent stump |
| 0.49 | 0.02 | Barely useful |
| 0.50 | 0.00 | Random guessing, zero influence |
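
The values in the table above follow directly from the formula. A quick check (the helper name `alpha` is illustrative):

```python
import numpy as np

def alpha(eps):
    """Learner influence: 0.5 * ln((1 - eps) / eps)."""
    return 0.5 * np.log((1 - eps) / eps)

# Reproduce the table rows
for eps in [0.01, 0.10, 0.30, 0.49, 0.50]:
    print(f"eps = {eps:.2f} -> alpha = {alpha(eps):.2f}")
```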

Step 4: Update Sample Weights

This is the core of adaptive boosting. Misclassified samples get heavier; correctly classified ones get lighter:

$$w_i^{(t+1)} = w_i^{(t)} \cdot e^{-\alpha_t \, y_i \, h_t(x_i)}$$

Where:

  • $w_i^{(t+1)}$ is the updated weight for sample $i$ at round $t+1$
  • $w_i^{(t)}$ is the current weight
  • $\alpha_t$ is the learner's influence from Step 3
  • $y_i$ is the true label ($-1$ or $+1$)
  • $h_t(x_i)$ is the stump's prediction ($-1$ or $+1$)

In Plain English: When the stump correctly labels a $47 grocery charge as legitimate, $y_i$ and $h_t(x_i)$ share the same sign, so the exponent is negative and the weight shrinks. When it wrongly calls a fraudulent transaction legitimate, the signs disagree, the exponent becomes positive, and the weight balloons. After this update, the next stump is forced to focus on the transactions the ensemble still gets wrong.

[Figure: How sample weights shift after each AdaBoost round as correctly classified samples shrink and misclassified samples grow]

After updating, all weights are normalized to sum to 1, keeping the distribution stable for the next round.
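
One round of the update and normalization looks like this. A toy example with four samples (the numbers are invented for illustration):

```python
import numpy as np

y       = np.array([1, -1, 1, -1])           # true labels
h_pred  = np.array([1, -1, -1, -1])          # stump misses sample 2
w       = np.full(4, 0.25)                   # uniform weights from Step 1

eps     = 0.25                               # one miss out of four equal weights
alpha_t = 0.5 * np.log((1 - eps) / eps)      # learner influence from Step 3

# Correct predictions (y * h = +1) shrink; the miss (y * h = -1) grows
w_new = w * np.exp(-alpha_t * y * h_pred)
w_new /= w_new.sum()                         # normalize so weights sum to 1

print(np.round(w_new, 3))                    # [0.167 0.167 0.5   0.167]
```

After a single round, the one misclassified sample jumps from a quarter of the attention to half of it, which is exactly the pressure that forces the next stump to handle it.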

Step 5: Repeat and Combine

Steps 2 through 4 repeat for $T$ rounds (the n_estimators hyperparameter). The final prediction is a weighted vote:

$$H(x) = \text{sign}\left(\sum_{t=1}^{T} \alpha_t \, h_t(x)\right)$$

Where:

  • $H(x)$ is the final ensemble prediction
  • $\alpha_t$ is the vote weight of learner $t$
  • $h_t(x)$ is the prediction of learner $t$
  • $\text{sign}(\cdot)$ returns $+1$ if the sum is positive, $-1$ otherwise

In Plain English: All fifty stumps examine a new transaction and vote fraud ($+1$) or legitimate ($-1$). Each vote is amplified by that stump's $\alpha$ score. If the weighted sum is positive, the ensemble flags it as fraud. The more accurate stumps dominate the decision.
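
The five steps can be stitched together in a compact from-scratch implementation. This is an educational sketch (the function names and the two-blob dataset are invented for illustration), not a replacement for scikit-learn's AdaBoostClassifier:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=20):
    """From-scratch AdaBoost for labels in {-1, +1}."""
    N = len(y)
    w = np.full(N, 1.0 / N)                      # Step 1: uniform weights
    stumps, alphas = [], []
    for _ in range(T):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)         # Step 2: weighted fit
        pred = stump.predict(X)
        eps = np.sum(w * (pred != y)) / np.sum(w)
        eps = np.clip(eps, 1e-10, 1 - 1e-10)     # guard against log(0)
        alpha = 0.5 * np.log((1 - eps) / eps)    # Step 3: influence
        w = w * np.exp(-alpha * y * pred)        # Step 4: re-weight
        w /= w.sum()                             # keep a probability distribution
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    """Step 5: sign of the alpha-weighted vote."""
    votes = sum(a * s.predict(X) for s, a in zip(stumps, alphas))
    return np.sign(votes)

# Tiny synthetic check: two Gaussian blobs labeled -1 / +1
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (100, 2)), rng.normal(2.5, 1.0, (100, 2))])
y = np.array([-1] * 100 + [1] * 100)

stumps, alphas = adaboost_fit(X, y, T=20)
acc = (adaboost_predict(stumps, alphas, X) == y).mean()
print(f"training accuracy: {acc:.2f}")
```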

Exponential Loss and Why It Matters

AdaBoost's weight update rule isn't arbitrary. The algorithm implicitly minimizes the exponential loss function:

$$L(y, f(x)) = e^{-y \, f(x)}$$

Where:

  • $y$ is the true label
  • $f(x) = \sum_{t=1}^{T} \alpha_t h_t(x)$ is the ensemble's raw output before applying sign

This loss grows exponentially when $f(x)$ disagrees with $y$. The practical consequence: AdaBoost punishes confident wrong predictions far more severely than logistic loss does. This aggressiveness is a double-edged sword. It drives the algorithm to fix mistakes fast, but it also makes AdaBoost sensitive to outliers and label noise. A single mislabeled fraud transaction can accumulate enormous weight across rounds, pulling the entire ensemble off course.
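
A quick numerical comparison makes the point. Evaluating both losses over a few margin values m = y * f(x) (the margin values are arbitrary illustrations; positive means correct, negative means confidently wrong):

```python
import numpy as np

margins = np.array([2.0, 0.5, 0.0, -0.5, -2.0])

exp_loss = np.exp(-margins)                 # AdaBoost's exponential loss
log_loss = np.log(1 + np.exp(-margins))     # logistic loss on the same margins

for m, e, l in zip(margins, exp_loss, log_loss):
    print(f"margin {m:+.1f}: exponential = {e:6.2f}, logistic = {l:5.2f}")
```

At a margin of -2 the exponential loss is roughly 7.4 versus about 2.1 for logistic loss, and the gap widens rapidly as predictions become more confidently wrong.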

Common Pitfall: If your fraud labels are noisy (chargebacks later reversed, disputed transactions with ambiguous ground truth), AdaBoost will latch onto those mislabeled samples and distort the ensemble. Clean your labels first, or switch to Gradient Boosting, which supports more forgiving loss functions like Huber or log loss.

AdaBoost vs. Gradient Boosting vs. Random Forest

All three are ensemble methods, but they approach the "combine weak models" problem from different angles.

| Criterion | AdaBoost | Gradient Boosting | Random Forest |
| --- | --- | --- | --- |
| Training order | Sequential | Sequential | Parallel |
| Correction mechanism | Re-weights misclassified samples | Fits residuals (gradients) | Independent trees, majority vote |
| Typical weak learner | Stump (depth 1) | Shallow tree (depth 3-8) | Full or deep tree |
| Loss function | Exponential only | Any differentiable loss | N/A (voting/averaging) |
| Outlier sensitivity | High (weights explode) | Moderate (depends on loss) | Low (outliers diluted across trees) |
| Overfitting risk | Low on clean data | Moderate (tune learning rate) | Low (bagging reduces variance) |
| Best for | Clean data, interpretability | Flexible objectives, tabular data | Noisy data, fast baseline |

[Figure: Side-by-side comparison of AdaBoost's sequential weighted approach vs Random Forest's parallel unweighted approach]

For a deep look at residual fitting, see How Gradient Boosting Actually Works Under the Hood. For extreme gradient boosting with regularization, see the XGBoost for Classification guide.

Key Insight: AdaBoost re-weights samples. Gradient Boosting fits residuals. This is the fundamental difference. Gradient Boosting is actually a generalization: plug exponential loss into gradient boosting with depth-1 trees, and you recover AdaBoost exactly.

When to Use AdaBoost (and When NOT To)

Choose AdaBoost when:

  1. Your dataset is clean with reliable labels and few outliers
  2. You want an interpretable ensemble (each stump inspects one feature, easy to audit)
  3. Binary classification is the primary task
  4. You need a quick baseline with minimal hyperparameter tuning

Avoid AdaBoost when:

  1. Your labels are noisy or contain significant mislabeling. Exponential loss amplifies these errors across rounds
  2. You need regression or custom loss functions. AdaBoost is locked to exponential loss for classification
  3. Your dataset exceeds 500K rows and speed matters. Sequential training with full-dataset weight updates can't compete with histogram-based splitters in XGBoost
  4. You're optimizing for Kaggle or competition leaderboards where every tenth of a percent matters

Pro Tip: AdaBoost is an excellent teaching algorithm. Once you understand its re-weighting mechanics and alpha voting, the leap to gradient boosting and XGBoost becomes much shorter. Think of it as the conceptual foundation of the entire boosting family.

AdaBoost for Fraud Detection in Python

Let's build an AdaBoost classifier on a synthetic credit card fraud dataset. The code generates all data inline and runs in any Python environment, including Pyodide.

<!-- EXEC -->

python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score

np.random.seed(42)

# Synthetic fraud dataset: 1000 transactions, 5 features
n_samples = 1000
X = np.column_stack([
    np.random.exponential(200, n_samples),       # transaction amount
    np.random.exponential(30, n_samples),         # distance from home (km)
    np.random.uniform(0, 24, n_samples),          # time of day (0-24)
    np.random.poisson(2, n_samples),              # transactions in last hour
    np.random.binomial(1, 0.15, n_samples)        # is_international flag
])

# Generate fraud labels based on feature combinations
fraud_score = (
    0.35 * (X[:, 0] > 400).astype(float) +
    0.25 * (X[:, 1] > 40).astype(float) +
    0.2 * ((X[:, 2] > 22) | (X[:, 2] < 4)).astype(float) +
    0.2 * X[:, 4]
)
noise = np.random.normal(0, 0.1, n_samples)
y = (fraud_score + noise > 0.3).astype(int)

print(f"Dataset: {n_samples} transactions, {y.sum()} fraudulent ({y.mean()*100:.1f}%)")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# AdaBoost with 50 decision stumps
ada = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),
    n_estimators=50,
    learning_rate=1.0,
    random_state=42
)
ada.fit(X_train, y_train)
y_pred = ada.predict(X_test)

print(f"\nAccuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=["Legitimate", "Fraud"]))

Expected Output:

code
Dataset: 1000 transactions, 310 fraudulent (31.0%)

Accuracy: 0.8767

Classification Report:
              precision    recall  f1-score   support

  Legitimate       0.89      0.94      0.91       207
       Fraud       0.84      0.74      0.79        93

    accuracy                           0.88       300
   macro avg       0.87      0.84      0.85       300
weighted avg       0.87      0.88      0.87       300

The ensemble hits 87.7% accuracy on a 31% fraud rate dataset. Let's watch how test accuracy builds across boosting rounds:

<!-- EXEC -->

python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

np.random.seed(42)

# Recreate the same fraud dataset
n_samples = 1000
X = np.column_stack([
    np.random.exponential(200, n_samples),
    np.random.exponential(30, n_samples),
    np.random.uniform(0, 24, n_samples),
    np.random.poisson(2, n_samples),
    np.random.binomial(1, 0.15, n_samples)
])
fraud_score = (
    0.35 * (X[:, 0] > 400).astype(float) +
    0.25 * (X[:, 1] > 40).astype(float) +
    0.2 * ((X[:, 2] > 22) | (X[:, 2] < 4)).astype(float) +
    0.2 * X[:, 4]
)
noise = np.random.normal(0, 0.1, n_samples)
y = (fraud_score + noise > 0.3).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

ada = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),
    n_estimators=50,
    learning_rate=1.0,
    random_state=42
)
ada.fit(X_train, y_train)

# Track how accuracy improves with each added stump
staged_scores = list(ada.staged_score(X_test, y_test))
for t in [1, 5, 10, 25, 50]:
    print(f"After {t:>2} stumps: test accuracy = {staged_scores[t-1]:.4f}")

Expected Output:

code
After  1 stumps: test accuracy = 0.7833
After  5 stumps: test accuracy = 0.8100
After 10 stumps: test accuracy = 0.8767
After 25 stumps: test accuracy = 0.8867
After 50 stumps: test accuracy = 0.8767

Most of the gain arrives in the first 10 rounds. The slight dip from 25 to 50 stumps hints at the ensemble starting to overfit borderline cases where the noise in our synthetic labels is unavoidable. In practice, you would use cross-validation to pick the optimal n_estimators.

Hyperparameter Reference

| Parameter | Default | Typical Range | Effect |
| --- | --- | --- | --- |
| n_estimators | 50 | 25-500 | More stumps reduce bias; too many risk overfitting noisy data |
| learning_rate | 1.0 | 0.01-1.0 | Shrinks each stump's contribution; lower values need more estimators |
| estimator | DecisionTreeClassifier(max_depth=1) | depth 1-3 | Deeper stumps increase capacity but weaken the boosting effect |

There's a well-known tradeoff between learning_rate and n_estimators. Halving the learning rate roughly requires doubling the number of estimators to reach equivalent performance. In practice, learning_rate=0.1 with n_estimators=200 often outperforms learning_rate=1.0 with n_estimators=50 because the slower learning produces smoother decision boundaries. This connects directly to the bias-variance tradeoff: a lower learning rate accepts slightly more bias in exchange for reduced variance.

Pro Tip: As of scikit-learn 1.8, the algorithm parameter has been removed entirely. The older SAMME.R variant was deprecated in 1.4 and dropped in 1.6. Only the SAMME discrete algorithm remains. If you see tutorials referencing algorithm='SAMME.R', they're outdated.

Production Considerations

Time complexity: Training runs in $O(T \cdot N \cdot D)$ where $T$ is the number of estimators, $N$ is the sample count, and $D$ is the number of features. Each stump evaluates all features across all samples, and the rounds are strictly sequential. For 50 stumps on 10K samples with 20 features, training finishes in under a second. On 1M samples, expect noticeable wait times.

Prediction complexity: $O(T)$ per sample. Each stump makes one comparison and contributes a weighted vote. With 50 stumps, inference is essentially instant.

Memory: AdaBoost stores $T$ stumps (each is a single split: one feature index, one threshold, two leaf values) plus a weight vector of size $N$ during training. Memory footprint is negligible compared to deep-tree ensembles like Random Forest.

Scaling advice: Beyond 100K rows, gradient boosting implementations with histogram-based splitting (LightGBM, XGBoost) are faster and more accurate. AdaBoost's sweet spot is small-to-medium clean datasets where simplicity and interpretability are genuine advantages.

Conclusion

AdaBoost proves a counterintuitive point: you don't need powerful individual models to build a powerful ensemble. Fifty decision stumps, each barely better than a coin flip, combine into a classifier that rivals far more complex architectures on clean data. The math is elegant, and the intuition transfers directly to every boosting algorithm that followed.

Where AdaBoost falls short is noise tolerance. The exponential loss that makes it aggressive at fixing mistakes also makes it brittle when labels are unreliable. For real-world datasets with messy labels, Gradient Boosting and XGBoost handle the mess more gracefully with configurable loss functions and built-in regularization.

If you're building intuition for how ensembles work more broadly, start with Ensemble Methods: Why Teams of Models Beat Solo Geniuses for the big picture. For the theoretical grounding behind combining weak learners, see The Bias-Variance Tradeoff.

Interview Questions

Q: What makes AdaBoost "adaptive"?

The "adaptive" refers to how each new weak learner adapts to the mistakes of previous ones. After every round, the algorithm increases the weight of misclassified samples so the next learner is forced to concentrate on those harder cases. This distinguishes AdaBoost from simple averaging ensembles where each model trains independently on the same distribution.

Q: Can AdaBoost overfit, and under what conditions?

On clean data, AdaBoost is remarkably resistant to overfitting even with many estimators, which aligns with Schapire's original theoretical analysis. However, AdaBoost overfits aggressively on noisy data. Mislabeled samples accumulate exponentially increasing weight across rounds, eventually dominating the training signal and degrading generalization.

Q: Why does AdaBoost use decision stumps rather than deeper trees?

Stumps are high-bias, low-variance learners, exactly what boosting needs. Each round of AdaBoost reduces bias by targeting remaining errors, while the shallowness of stumps prevents variance from creeping up. A deep tree already has low bias, leaving boosting little room to improve and creating a higher overfitting risk on re-weighted distributions.

Q: Explain the mathematical relationship between AdaBoost and Gradient Boosting.

AdaBoost is a special case of gradient boosting. Gradient Boosting fits each new learner to the negative gradient of the loss function. When you set that loss to exponential loss and use depth-1 trees, the gradient descent update reduces to AdaBoost's multiplicative weight update rule. Gradient Boosting's generality (any differentiable loss) is why it replaced AdaBoost in most production settings.

Q: Your AdaBoost model has 95% training accuracy but 78% test accuracy. What do you investigate?

The 17-point gap signals overfitting, most likely caused by noisy labels. With exponential loss, AdaBoost amplifies the weight of repeatedly misclassified samples, causing later stumps to memorize noise. I would inspect the highest-weight training samples for mislabeling, reduce n_estimators, lower learning_rate, or switch to gradient boosting with log loss for more graceful handling of label noise.

Q: When would you pick AdaBoost over XGBoost in a production system?

Almost never for raw accuracy. XGBoost dominates on performance, speed, and noise handling. But AdaBoost wins on interpretability: 50 stumps are trivially inspectable, each being a single if-else rule. For regulated industries (finance, healthcare) that require model explainability, or as a quick interpretable baseline before investing in more complex models, AdaBoost still earns its place.

<!-- PLAYGROUND_START data-dataset="lds_regression_linear" -->

Hands-On Practice

AdaBoost (Adaptive Boosting) is a powerful ensemble technique that combines multiple 'weak learners' (simple models that are only slightly better than random guessing) into a highly accurate 'strong learner.' You'll build an AdaBoost Regressor on the House Prices dataset to understand how boosting sequentially corrects errors. By visualizing the relationship between square footage and price, you will see how AdaBoost improves predictions by focusing on the hardest-to-predict data points.

Dataset: House Prices (Linear) House pricing data with clear linear relationships. Square footage strongly predicts price (R² ≈ 0.87). Perfect for demonstrating linear regression fundamentals.

python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

# ============================================
# STEP 1: LOAD AND EXPLORE THE DATA
# ============================================

# Load the dataset from the specified URL
df = pd.read_csv("/datasets/playground/lds_regression_linear.csv")

# Display basic information to understand the structure
print("Dataset Information:")
print(f"Shape: {df.shape}")
print(f"Columns: {list(df.columns)}")
print("\nFirst 5 rows:")
print(df.head())
# Expected output:
# Shape: (500, 6)
# Columns: ['square_feet', 'bedrooms', 'bathrooms', 'age_years', 'lot_size', 'price']
# First 5 rows: [table showing house data]

# Visualize the primary relationship: Square Feet vs Price
# This helps us see the linearity that the model will learn
plt.figure(figsize=(10, 6))
sns.scatterplot(data=df, x='square_feet', y='price', alpha=0.6)
plt.title('House Prices vs Square Footage')
plt.xlabel('Square Feet')
plt.ylabel('Price')
plt.grid(True, alpha=0.3)
plt.show()

# ============================================
# STEP 2: PREPARE DATA FOR TRAINING
# ============================================

# Define features (X) and target (y)
# We'll use multiple features to help the model learn complex patterns
X = df[['square_feet', 'bedrooms', 'bathrooms', 'age_years', 'lot_size']]
y = df['price']

# Split into training and testing sets (80% train, 20% test)
# This separation prevents overfitting and allows fair evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"\nTraining set size: {X_train.shape}")
print(f"Test set size: {X_test.shape}")
# Expected output:
# Training set size: (400, 5)
# Test set size: (100, 5)

# ============================================
# STEP 3: TRAIN ADABOOST REGRESSOR
# ============================================

# Define the base estimator (weak learner)
# AdaBoost typically uses Decision Stumps (depth=1), but for regression
# we often use slightly deeper trees (max_depth=4) to capture more nuance.
base_estimator = DecisionTreeRegressor(max_depth=4)

# Initialize AdaBoost
# n_estimators=50: The model will sequentially build 50 trees
# learning_rate=0.1: Shrinks the contribution of each tree to prevent overfitting
model = AdaBoostRegressor(
    estimator=base_estimator,
    n_estimators=50,
    learning_rate=0.1,
    random_state=42
)

print("\nTraining AdaBoost Regressor...")
model.fit(X_train, y_train)
print("Training complete.")

# ============================================
# STEP 4: EVALUATE MODEL PERFORMANCE
# ============================================

# Generate predictions
y_pred = model.predict(X_test)

# Calculate metrics
r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)

print("\nModel Performance Metrics:")
print(f"R² Score: {r2:.4f}")
print(f"Mean Absolute Error: ${mae:,.2f}")
print(f"Root Mean Squared Error: ${rmse:,.2f}")

# NOTE: High accuracy is expected with this educational dataset!
# The patterns are intentionally clear to help you learn the algorithm.
# Real-world datasets typically have more noise and lower accuracy.
# Expected output:
# R² Score: ~0.87 (may vary slightly)
# Mean Absolute Error: ~$xxxx.xx
# Root Mean Squared Error: ~$xxxx.xx

# ============================================
# STEP 5: VISUALIZE RESULTS VS ACTUALS
# ============================================

plt.figure(figsize=(10, 6))
# Plot perfect prediction line
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()],
         'r--', lw=2, label='Perfect Prediction')
# Plot actual vs predicted
plt.scatter(y_test, y_pred, alpha=0.6, label='Model Predictions')

plt.title('Actual vs Predicted House Prices (AdaBoost)')
plt.xlabel('Actual Price')
plt.ylabel('Predicted Price')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# ============================================
# STEP 6: FEATURE IMPORTANCE ANALYSIS
# ============================================
# AdaBoost calculates importance based on how much each feature contributes
# to reducing the error across all trees in the ensemble.

feature_importance = pd.DataFrame({
    'feature': X.columns,
    'importance': model.feature_importances_
}).sort_values('importance', ascending=False)

print("\nFeature Importances:")
print(feature_importance)
# Expected output:
#        feature  importance
# 0  square_feet    ~0.85
#... (other features with lower importance)

plt.figure(figsize=(10, 4))
sns.barplot(data=feature_importance, x='importance', y='feature', hue='feature')
plt.title('Feature Importance in AdaBoost Model')
plt.xlabel('Importance Score')
plt.tight_layout()
plt.show()

Try adjusting the n_estimators parameter from 50 to 200 to see if adding more learners improves performance or leads to overfitting. You can also experiment with the learning_rate (try 0.01 vs 1.0); a lower rate usually requires more estimators but often yields a more generalized model. Finally, try changing the max_depth of the base DecisionTreeRegressor to see how the complexity of individual weak learners affects the overall ensemble.

<!-- PLAYGROUND_END -->

Explore all career paths