Why Your Model Is Failing: Diagnosing with Learning Curves

LDS Team · Let's Data Science

Your house price model scores 0.82 R2 on training data and 0.81 on validation. Decent, but not good enough. Should you collect more listings, engineer new features, or swap in a more complex algorithm? Pick wrong and you burn weeks of effort with zero improvement.

Learning curves answer this question in a single plot. They graph your model's training and validation scores as the training set grows, revealing whether the bottleneck is insufficient data, insufficient model complexity, or something else entirely. Think of them as an X-ray for your model: the shape of the curves tells you exactly what's broken and what to fix.

Every code block and example in this article uses one running scenario: predicting house prices from square footage, bedrooms, age, and neighborhood. We'll generate learning curves for models that underfit, overfit, and fit well, then read each curve like a diagnostic report.

Anatomy of a Learning Curve

A learning curve plots model performance (y-axis) against training set size (x-axis). Two lines appear: the training score and the validation score, both computed via cross-validation.

With a tiny training set, any model can memorize the examples perfectly, so training score starts high. As data grows, memorization becomes harder, and training score drifts down. Meanwhile, validation score starts low (the model hasn't seen enough patterns) and climbs as the model generalizes better.

The vertical distance between these two lines is the generalization gap. Its size and trajectory are the diagnostic signal.

| Curve Feature | What It Tells You |
|---|---|
| Gap size | How much the model overfits (large gap) or underfits (small gap, low scores) |
| Gap trajectory | Whether more data will help (gap shrinking) or not (gap flat) |
| Score level | Whether performance is acceptable, even if the gap is small |
| Convergence point | The approximate sample size where adding more data stops helping |

Common Pitfall: Learning curves (x-axis = training set size) are not the same as training loss curves (x-axis = epochs or iterations). A loss curve tells you if the optimizer has converged. A learning curve tells you if your model's fundamental capacity matches the problem. They answer different questions.

[Figure: Learning curve diagnostic patterns showing high bias, high variance, and good fit side by side]

The Error Decomposition Behind the Curves

Every learning curve reflects the bias-variance tradeoff at different data sizes. The expected prediction error decomposes into three terms:

$$\text{Error} = \text{Bias}^2 + \text{Variance} + \sigma^2$$

Where:

  • $\text{Bias}^2$ measures how far the model's average prediction is from the true function (systematic error from wrong assumptions)
  • $\text{Variance}$ measures how much predictions fluctuate across different training sets (sensitivity to specific training samples)
  • $\sigma^2$ is the irreducible noise in the data (the Bayes error floor that no model can beat)

In Plain English: Imagine predicting house prices. Bias is like using a straight line for a curved relationship: no matter how many houses you see, the line stays wrong in the same way. Variance is like fitting a wiggly curve that nails the training houses but wobbles wildly on new ones. Irreducible noise is the inherent randomness in sale prices that even a perfect model can't predict, such as a buyer's mood or a competing offer.

Learning curves make this decomposition visible. High bias shows up as both lines converging at a low score. High variance shows up as a wide gap between them.
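The decomposition can also be estimated directly by simulation. The sketch below uses a hypothetical quadratic price function and noise level (all numbers illustrative): it fits a straight line (a high-bias model) to many freshly resampled training sets and measures bias squared and variance at a single query point.

```python
import numpy as np

rng = np.random.default_rng(42)

def true_price(sqft):
    # Hypothetical ground-truth relationship (mildly nonlinear)
    return 50 + 0.1 * sqft + 0.00005 * sqft ** 2

sigma = 20.0      # assumed irreducible noise (std of price, in $1000s)
x_query = 2500.0  # decompose the error at one query point
n_train, n_sims = 50, 500

preds = np.empty(n_sims)
for i in range(n_sims):
    # Fresh training set each round; fluctuation across fits = variance
    sqft = rng.uniform(500, 4000, n_train)
    price = true_price(sqft) + rng.normal(0, sigma, n_train)
    slope, intercept = np.polyfit(sqft, price, 1)  # straight-line fit
    preds[i] = slope * x_query + intercept

bias_sq = (preds.mean() - true_price(x_query)) ** 2
variance = preds.var()
print(f"Bias^2   ~ {bias_sq:8.1f}  (systematic miss; won't shrink with data)")
print(f"Variance ~ {variance:8.1f}  (shrinks as n_train grows)")
print(f"Noise    = {sigma**2:8.1f}  (floor no model can beat)")
```

For this deliberately misspecified model, the bias term dwarfs the variance term, which is exactly the regime where collecting more data buys nothing.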

Generating Learning Curves in scikit-learn

The learning_curve() function in scikit-learn handles the mechanics: it trains the model on progressively larger subsets of the training data, computes cross-validated scores at each size, and returns arrays ready for plotting.

Here is a direct comparison of a high-bias model (linear regression on nonlinear data) against a high-variance model (unbounded decision tree) on the same house price dataset:
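Output like the tables below can be generated with code along these lines. This is a sketch on synthetic stand-in data (the coefficients and noise level are hypothetical), so exact values will differ:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the house price data (hypothetical coefficients)
rng = np.random.default_rng(42)
n = 800
sqft = rng.uniform(500, 4000, n)
beds = rng.integers(1, 6, n).astype(float)
age = rng.uniform(0, 60, n)
X = np.column_stack([sqft, beds, age])
y = (50 + 0.1 * sqft + 0.00005 * sqft ** 2 + 8 * beds
     - 0.5 * age - 0.05 * age * beds + rng.normal(0, 25, n))

models = [
    ("HIGH BIAS (Linear Regression on Nonlinear Data)", LinearRegression()),
    ("HIGH VARIANCE (Decision Tree, no max_depth)",
     DecisionTreeRegressor(random_state=42)),
]
for name, model in models:
    # Train on progressively larger subsets, score with 5-fold CV
    sizes, train_scores, val_scores = learning_curve(
        model, X, y, train_sizes=np.linspace(0.1, 1.0, 8),
        cv=5, scoring="r2", shuffle=True, random_state=42)
    print(f"=== {name} ===")
    print(f"{'Size':>6} | {'Train R2':>9} | {'Val R2':>9} | {'Gap':>9}")
    for s, tr, va in zip(sizes, train_scores.mean(axis=1),
                         val_scores.mean(axis=1)):
        print(f"{s:>6} | {tr:>9.4f} | {va:>9.4f} | {tr - va:>9.4f}")
```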

```text
=== HIGH BIAS (Linear Regression on Nonlinear Data) ===
  Size |   Train R2 |     Val R2 |      Gap
------------------------------------------
    64 |     0.8130 |     0.8006 |   0.0124
   146 |     0.8180 |     0.8138 |   0.0042
   228 |     0.8100 |     0.8181 |  -0.0081
   310 |     0.8186 |     0.8195 |  -0.0009
   393 |     0.8225 |     0.8189 |   0.0035
   475 |     0.8268 |     0.8192 |   0.0076
   557 |     0.8203 |     0.8192 |   0.0011
   640 |     0.8222 |     0.8193 |   0.0029

=== HIGH VARIANCE (Decision Tree, no max_depth) ===
  Size |   Train R2 |     Val R2 |      Gap
------------------------------------------
    64 |     1.0000 |     0.8124 |   0.1876
   146 |     1.0000 |     0.8630 |   0.1370
   228 |     1.0000 |     0.8858 |   0.1142
   310 |     1.0000 |     0.8953 |   0.1047
   393 |     1.0000 |     0.9060 |   0.0940
   475 |     1.0000 |     0.9063 |   0.0937
   557 |     1.0000 |     0.9102 |   0.0898
   640 |     1.0000 |     0.9095 |   0.0905
```

Look at these two tables side by side. The linear model's training and validation scores sit around 0.82 with almost no gap, even at 640 samples. Adding more data won't change this. The model can't capture the nonlinear pricing patterns. Meanwhile, the decision tree achieves perfect training R2 (1.0000) every time, but validation lags behind by 0.09. This model memorizes training data instead of learning general rules. More data does help it (the gap narrows from 0.19 to 0.09), but it would take a very large dataset to close it fully.

Reading the X-Ray: High Bias Diagnosis

High bias means the model is too simple to capture the data's true patterns. On a learning curve, the signature is unmistakable: both training and validation scores plateau at a mediocre level with a tiny gap between them.

What the curves look like:

  1. Training score starts moderate and stays moderate
  2. Validation score converges quickly to nearly the same moderate level
  3. The gap between them is negligible
  4. Neither curve improves meaningfully as data grows

Why it happens in our house price example: Linear regression assumes price scales linearly with each feature. But the real relationship includes squared terms, interactions (age times bedrooms), and neighborhood effects. No amount of additional house listings fixes this mismatch. The model's assumptions are wrong.

How to fix it:

  • Increase model complexity. Switch to a random forest, gradient boosted trees (XGBoost), or a neural network.
  • Engineer better features. Add polynomial terms, interaction features, or domain-specific transformations. Our guide on feature engineering covers this in depth.
  • Reduce regularization. If you're using Ridge or Lasso with a heavy penalty, the model might be over-constrained.
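
As a sketch of the feature-engineering fix, here is a hypothetical dataset where price deliberately depends on sqft squared; adding a polynomial term lets the same linear model capture the curvature:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical data where price depends on sqft and sqft^2
rng = np.random.default_rng(0)
sqft = rng.uniform(500, 4000, 500)
price = 50 + 0.1 * sqft + 0.00005 * sqft ** 2 + rng.normal(0, 20, 500)
X = sqft.reshape(-1, 1)

plain = LinearRegression()
quadratic = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),  # adds sqft^2
    LinearRegression())

plain_r2 = cross_val_score(plain, X, price, cv=5, scoring="r2").mean()
quad_r2 = cross_val_score(quadratic, X, price, cv=5, scoring="r2").mean()
print(f"plain linear : {plain_r2:.4f}")
print(f"with sqft^2  : {quad_r2:.4f}")
```

The model family is unchanged; only the features got richer, which is why this counts as a bias fix rather than a complexity increase.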

Pro Tip: If your learning curve shows high bias, stop collecting more data immediately. More data will not help. This is the most expensive mistake in applied ML, spending months gathering data for a model that's already plateaued due to insufficient complexity.

Reading the X-Ray: High Variance Diagnosis

High variance means the model is too complex relative to the training set size. It memorizes noise instead of learning signal. The learning curve signature: training score is near-perfect, validation score is substantially lower, and a visible gap persists.

What the curves look like:

  1. Training score stays very high (often 1.0 for tree-based models)
  2. Validation score is noticeably lower
  3. A persistent gap separates them
  4. The gap may narrow slowly as data grows, but doesn't close

Why it happens in our house price example: An unbounded decision tree creates leaves specific to individual houses. A 3,200 sqft house in neighborhood 3 with 4 bedrooms gets its own prediction node. This works great on the training set, but a new 3,200 sqft house with slightly different features gets a poor prediction because the tree memorized the training-specific noise.

How to fix it:

  • Get more data. This is the primary fix. More samples make memorization harder, forcing the model to learn general patterns.
  • Simplify the model. Limit tree depth, reduce the number of trees in an ensemble, or use a simpler architecture.
  • Add regularization. Increase L1/L2 penalties, add dropout (for neural nets), or use min_samples_leaf constraints.
  • Remove noisy features. Irrelevant features give the model extra noise to memorize.
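
A quick sketch of the simplify-and-regularize fixes on hypothetical noisy data: constraining the tree's depth and leaf size trades a little training score for a much smaller gap.

```python
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeRegressor

# Hypothetical data: simple linear signal plus heavy noise invites memorization
rng = np.random.default_rng(1)
X = rng.uniform(500, 4000, (400, 1))
y = 50 + 0.1 * X[:, 0] + rng.normal(0, 40, 400)

gaps = {}
for label, tree in [
    ("unbounded", DecisionTreeRegressor(random_state=0)),
    ("constrained", DecisionTreeRegressor(max_depth=4, min_samples_leaf=10,
                                          random_state=0)),
]:
    cv = cross_validate(tree, X, y, cv=5, scoring="r2",
                        return_train_score=True)
    gaps[label] = cv["train_score"].mean() - cv["test_score"].mean()
    print(f"{label:>11}: train={cv['train_score'].mean():.3f} "
          f"val={cv['test_score'].mean():.3f} gap={gaps[label]:.3f}")
```

The constrained tree usually loses its perfect training score but gains validation score, shrinking the generalization gap.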

Key Insight: The learning curve tells you whether more data will help before you spend the effort collecting it. If the validation curve is still climbing and the gap is narrowing, more data is worth the investment. If both curves have flattened, you've saturated the model's capacity and more data is wasted.

[Figure: Decision tree for diagnosing learning curve patterns and choosing the right fix]

Validation Curves: The Other Diagnostic Tool

Learning curves hold the model fixed and vary data size. Validation curves do the opposite: they hold the dataset fixed and vary a single hyperparameter. Together, the two tools give you a complete diagnostic picture.

A validation curve sweeps a hyperparameter across a range and plots training and validation scores at each value. The point where validation score peaks is your optimal setting. Before that peak, you're underfitting. After it, you're overfitting.
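
scikit-learn's validation_curve() function performs this sweep. Here is a sketch on synthetic stand-in data (hypothetical coefficients; exact numbers will differ from the table that follows):

```python
import numpy as np
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeRegressor

# Hypothetical nonlinear house price data
rng = np.random.default_rng(42)
X = rng.uniform(500, 4000, (600, 1))
y = 50 + 0.1 * X[:, 0] + 0.00005 * X[:, 0] ** 2 + rng.normal(0, 20, 600)

depths = [2, 3, 5, 8, 12, 20, 30, 50]
train_scores, val_scores = validation_curve(
    DecisionTreeRegressor(random_state=0), X, y,
    param_name="max_depth", param_range=depths, cv=5, scoring="r2")

for d, tr, va in zip(depths, train_scores.mean(axis=1),
                     val_scores.mean(axis=1)):
    print(f"depth={d:>2}  train={tr:.4f}  val={va:.4f}  gap={tr - va:.4f}")
best = depths[int(np.argmax(val_scores.mean(axis=1)))]
print(f"Best max_depth: {best}")
```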

```text
=== VALIDATION CURVE: Decision Tree max_depth ===
 Depth |   Train R2 |     Val R2 |      Gap
------------------------------------------
     2 |     0.8050 |     0.7851 |   0.0199
     3 |     0.8640 |     0.8386 |   0.0254
     5 |     0.9400 |     0.8901 |   0.0499
     8 |     0.9850 |     0.9155 |   0.0695
    12 |     0.9991 |     0.9088 |   0.0904
    20 |     1.0000 |     0.9095 |   0.0905
    30 |     1.0000 |     0.9095 |   0.0905
    50 |     1.0000 |     0.9095 |   0.0905

Best max_depth: 8 (Val R2 = 0.9155)
At depth=2, the model underfits: both scores are low.
At depth=50, the model overfits: training is perfect but validation drops.
The sweet spot is in between, balancing bias and variance.
```

This validation curve tells a clear story. At max_depth=2, the tree is too shallow: both scores hover around 0.80. At max_depth=8, validation peaks at 0.9155. Beyond that, training keeps climbing to 1.0 while validation actually drops. The optimal depth for our house price data is 8.

| Diagnostic Tool | X-axis | Answers | Holds Fixed |
|---|---|---|---|
| Learning curve | Training set size | "Do I need more data?" | Model and hyperparameters |
| Validation curve | Hyperparameter value | "What's the best setting?" | Full dataset |

Pro Tip: Always run a learning curve first, then a validation curve. The learning curve tells you if the problem is bias or variance. The validation curve helps you tune the specific hyperparameters to address it. Running them in reverse order means you might tune a hyperparameter perfectly for a model that's fundamentally wrong.

[Figure: Comparison of learning curves and validation curves showing different x-axes and diagnostic questions]

Advanced Patterns Beyond Bias and Variance

Real-world learning curves don't always fall neatly into "high bias" or "high variance." Several advanced patterns are worth recognizing.

Data Leakage Curves

If your validation score is suspiciously close to 1.0 and shows almost no gap from the start, suspect data leakage. Information from the validation set has bled into training, making the model "cheat." Common causes include scaling before splitting, target-derived features, and time-series data shuffled randomly instead of split chronologically. A genuine 0.99 R2 on a real-world regression task is extremely rare; a leaked model routinely produces it.

The Bayes Error Floor

Sometimes both curves converge and flatten, but at a score lower than you want. This might be the irreducible noise floor (Bayes error). For our house price example, sale prices depend on buyer psychology, competing bids, and market timing that no feature captures. If your best model tops out at 0.92 R2 on clean data, that remaining 0.08 may be the noise floor. No algorithm or dataset increase will push past it.

Convergence Stalls

The validation curve rises but then stalls, while the gap remains. This happens when the model needs both more data and more complexity simultaneously. Collecting data alone won't close the gap. Increasing complexity alone will worsen the gap. You need a staged approach: add complexity, then verify with a new learning curve on the larger dataset.

Label Noise Detection

Label noise (incorrect target values) creates a characteristic pattern: the training score drops as data grows (more noise samples make fitting harder), but the validation score plateaus well below what clean data would achieve. Tools like Cleanlab can identify mislabeled samples. Cleaning even 2% to 3% of labels sometimes moves the validation score more than doubling the dataset.

When More Data Helps (and When It Doesn't)

This is the question every learning curve was designed to answer. The table below summarizes the decision framework:

| Learning Curve Pattern | Diagnosis | Will More Data Help? | Better Fix |
|---|---|---|---|
| Small gap, both scores low | High bias | No | More complex model, better features |
| Large gap, training near perfect | High variance | Yes (if gap still narrowing) | Also try regularization, simpler model |
| Small gap, both scores high | Good fit | Marginal improvement at best | Deploy or fine-tune |
| Gap narrowing but still wide | Variance, converging | Yes, significantly | Keep collecting data |
| Gap flat, scores flat | Saturated | No | Change model, add features, clean labels |
| Both scores near 1.0 from the start | Data leakage | No (results are fake) | Fix the data pipeline |

Key Insight: The slope of the validation curve at the rightmost data point is the most actionable signal. If validation score is still climbing with a meaningful slope, each additional 10% of data will produce measurable improvement. If the slope is near zero, stop collecting and start rethinking.

Production Considerations

Computational Cost

Generating a learning curve trains your model k * t times, where k is the number of CV folds and t is the number of training sizes. With 5-fold CV and 10 training sizes, that's 50 model fits. For a model that takes 30 seconds to train, the learning curve takes 25 minutes.

| Model Type | Fit Time (1K samples) | Learning Curve (5-fold, 10 sizes) | Practical? |
|---|---|---|---|
| Linear Regression | ~0.01s | ~0.5s | Always |
| Random Forest (100 trees) | ~0.5s | ~25s | Always |
| XGBoost (500 rounds) | ~2s | ~100s | Yes |
| Deep Neural Network | ~60s | ~50 min | Only early in development |

Sampling Strategies for Large Datasets

With millions of rows, training on every size from 10% to 100% is impractical. Two strategies help:

  1. Log-spaced sizes. Use np.logspace instead of np.linspace. The difference between 900K and 1M samples tells you less than the difference between 10K and 100K.
  2. Subsample first. Take a random 50K subsample, run the learning curve on that. If the curve shows high bias at 50K, it will still show high bias at 500K. If it shows high variance with a narrowing gap, you know more data will help.
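
A sketch of the first strategy: consecutive log-spaced sizes differ by a constant ratio, so measurements concentrate where the curve changes fastest (the small sizes).

```python
import numpy as np

# Eight log-spaced training-set fractions from 5% to 100% of the data
sizes = np.logspace(np.log10(0.05), np.log10(1.0), num=8)
print(np.round(sizes, 3))

# These plug straight into learning_curve(..., train_sizes=sizes)
```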

The LearningCurveDisplay API

Since scikit-learn 1.2, the LearningCurveDisplay class wraps the entire curve generation and plotting workflow into a single call. In scikit-learn 1.8.0 (December 2025), this API is stable and is the recommended approach:

```python
from sklearn.model_selection import LearningCurveDisplay
from sklearn.tree import DecisionTreeRegressor
import matplotlib.pyplot as plt
import numpy as np

# One call to generate and plot (X, y are your features and targets)
LearningCurveDisplay.from_estimator(
    DecisionTreeRegressor(max_depth=8, random_state=42),
    X, y,
    train_sizes=np.linspace(0.1, 1.0, 10),
    cv=5, scoring='r2',
    score_type='both',
    std_display_style='fill_between',
    negate_score=False,
    random_state=42
)
plt.title("Learning Curve: Decision Tree (max_depth=8)")
plt.show()
```

The ValidationCurveDisplay class works similarly for validation curves. Both accept score_type='both' to show training and validation together, and std_display_style='fill_between' for confidence bands.

Pro Tip: For quick debugging, LearningCurveDisplay.from_estimator() saves you from writing the boilerplate loop-and-plot code. For detailed analysis where you need the raw score arrays (like the tables above), stick with the learning_curve() function directly.

A Complete Diagnostic Workflow

Here's the practical sequence I use when a model underperforms:

  1. Generate a learning curve with your current model and data.
  2. Read the gap. Large gap = variance problem. Small gap = check the score level.
  3. Read the level. Both scores low with small gap = bias problem. Both scores high = good fit.
  4. Check convergence. If validation is still rising, collect more data before changing anything.
  5. Run a validation curve on the most impactful hyperparameter (usually model complexity: max_depth, n_estimators, C, alpha).
  6. Apply the fix. Bias: add features or complexity. Variance: regularize, simplify, or get data.
  7. Re-generate the learning curve to confirm the fix worked.

This loop takes 10 minutes with scikit-learn and saves you from weeks of aimless experimentation.
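Steps 1 through 4 can be roughed out in a small helper. The thresholds below are illustrative defaults, not universal rules, and the function name is my own:

```python
import numpy as np
from sklearn.model_selection import learning_curve

def diagnose(model, X, y, good_score=0.85, max_gap=0.10):
    """Read the gap and the level from a learning curve (steps 1-4)."""
    sizes, tr, va = learning_curve(
        model, X, y, train_sizes=np.linspace(0.2, 1.0, 5),
        cv=5, scoring="r2", shuffle=True, random_state=0)
    train_final = tr.mean(axis=1)[-1]
    val_curve = va.mean(axis=1)
    gap = train_final - val_curve[-1]
    still_rising = (val_curve[-1] - val_curve[-2]) > 0.005
    if gap > max_gap:  # large gap: variance problem
        return "high variance" + (", more data may help" if still_rising else "")
    if val_curve[-1] < good_score:  # small gap, low level: bias problem
        return "high bias"
    return "good fit"
```

Calling it with your current model and data returns a label that points you at the right fix in steps 5 and 6.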

Conclusion

Learning curves turn model debugging from guesswork into science. A single plot reveals whether your model needs more data, more complexity, or something else entirely. The training-validation gap is the signal: large gap means high variance (try more data or regularization), small gap at low scores means high bias (change the model or add features), and small gap at high scores means you're ready to ship.

Pair learning curves with validation curves for the full diagnostic picture. The learning curve tells you what's wrong; the validation curve tells you how to tune the specific hyperparameter that fixes it. Both techniques build directly on the theory in the Bias-Variance Tradeoff article, which explains why these patterns emerge mathematically.

For reliable scores on the y-axis, make sure you're using proper cross-validation rather than a single train/test split. And once your learning curve says the model architecture is right, move to hyperparameter tuning to squeeze out the last few percentage points.

Before investing weeks in data collection, spend 10 minutes on a learning curve. It's the cheapest experiment in machine learning.

Frequently Asked Interview Questions

Q: Your model achieves 0.95 accuracy on training data but only 0.78 on validation. What does the learning curve likely show, and what's your first move?

The learning curve would show a large gap between training and validation scores, indicating high variance (overfitting). My first move would be checking whether the validation curve is still climbing as training size increases. If yes, collecting more data is the easiest fix. If the gap has plateaued, I'd add regularization (increase L2 penalty, limit tree depth, or add dropout for neural nets) and remove features that might be contributing noise.

Q: Both training and validation accuracy are stuck at 0.65 despite having 100K samples. What does this tell you about the model?

This is classic high bias. The learning curve would show both lines converged with a small gap at a low level. More data won't help because the model has already plateaued. The fix is increasing model complexity: switch from a linear model to a tree-based ensemble, add polynomial or interaction features, or reduce any regularization that's constraining the model too aggressively.

Q: What is the difference between a learning curve and a validation curve?

A learning curve keeps the model and hyperparameters fixed while varying the training set size. It answers "do I need more data?" A validation curve keeps the dataset fixed while sweeping one hyperparameter across a range. It answers "what's the optimal value for this hyperparameter?" Both plot training and validation scores, but they vary different things on the x-axis.

Q: You see a learning curve where the validation score is nearly identical to the training score from the very start, both close to 0.99. Is this a good sign?

Not necessarily. This pattern often indicates data leakage rather than a genuinely great model. If validation scores are suspiciously perfect from tiny training sizes onward, check for information leaking from validation into training: scaling before splitting, features derived from the target, or time-series data that was shuffled randomly. Run a sanity check by permuting the target labels; if the model still scores well, something is definitely leaking.

Q: When should you NOT use learning curves?

Learning curves are less useful when the model takes hours to train (50 fits in a learning curve becomes impractical), when the data distribution shifts over time (historical data volumes don't predict future performance), or when you've already identified the problem through other means. They're also misleading if the evaluation metric on the y-axis isn't aligned with the actual business objective.

Q: How do learning curves relate to the bias-variance tradeoff?

Learning curves visualize the bias-variance tradeoff at different data sizes. High bias (underfitting) appears as both lines converging low with a small gap, reflecting a large $\text{Bias}^2$ term and a low $\text{Variance}$ term in the error decomposition. High variance (overfitting) appears as a large gap, reflecting low bias but high variance. As training size grows, variance decreases (the model sees more representative data), which is why the gap narrows. Bias doesn't change with data size because it's a property of the model's functional form, not the amount of data.

Q: Your learning curve shows the validation score is still climbing at your maximum dataset size. How do you estimate how much more data you need?

Plot the validation scores on a log-scale x-axis and look for the rate of improvement. If validation R2 improved by 0.02 going from 5K to 10K samples (doubling), a rough extrapolation suggests another doubling to 20K would yield about 0.01 improvement, following a power-law decay. In practice, I'd also benchmark a simpler model (which may plateau sooner) against the current one to decide if complexity or data is the binding constraint.

Q: A colleague suggests always collecting more data when the model isn't good enough. How do you respond?

I'd generate a learning curve to check. If it shows high bias (small gap, both scores low), more data won't move the needle. The model's assumptions are wrong, not its data supply. Only when the curve shows high variance with a narrowing gap is "collect more data" the right recommendation. This 10-minute diagnostic prevents months of wasted data collection effort.

Hands-On Practice

See learning curves in action! We'll compare a simple model (high bias) vs a complex model (high variance) and watch how the training/validation gap reveals the problem.

Dataset: ML Fundamentals (Loan Approval). We'll diagnose why models fail by examining their learning curves.

Try this: Change max_depth=2 to max_depth=5 for the "Simple Model" and watch the gap and validation score improve as the model gains just enough complexity!
