Your house price model scores 0.82 R2 on training data and 0.81 on validation. Decent, but not good enough. Should you collect more listings, engineer new features, or swap in a more complex algorithm? Pick wrong and you burn weeks of effort with zero improvement.
Learning curves answer this question in a single plot. They graph your model's training and validation scores as the training set grows, revealing whether the bottleneck is insufficient data, insufficient model complexity, or something else entirely. Think of them as an X-ray for your model: the shape of the curves tells you exactly what's broken and what to fix.
Every code block and example in this article uses one running scenario: predicting house prices from square footage, bedrooms, age, and neighborhood. We'll generate learning curves for models that underfit, overfit, and fit well, then read each curve like a diagnostic report.
Anatomy of a Learning Curve
A learning curve plots model performance (y-axis) against training set size (x-axis). Two lines appear: the training score and the validation score, both computed via cross-validation.
With a tiny training set, any model can memorize the examples perfectly, so training score starts high. As data grows, memorization becomes harder, and training score drifts down. Meanwhile, validation score starts low (the model hasn't seen enough patterns) and climbs as the model generalizes better.
The vertical distance between these two lines is the generalization gap. Its size and trajectory are the diagnostic signal.
| Curve Feature | What It Tells You |
|---|---|
| Gap size | How much the model overfits (large gap) or underfits (small gap, low scores) |
| Gap trajectory | Whether more data will help (gap shrinking) or not (gap flat) |
| Score level | Whether performance is acceptable, even if the gap is small |
| Convergence point | The approximate sample size where adding more data stops helping |
Common Pitfall: Learning curves (x-axis = training set size) are not the same as training loss curves (x-axis = epochs or iterations). A loss curve tells you if the optimizer has converged. A learning curve tells you if your model's fundamental capacity matches the problem. They answer different questions.
[Figure: Learning curve diagnostic patterns showing high bias, high variance, and good fit side by side]
The Error Decomposition Behind the Curves
Every learning curve reflects the bias-variance tradeoff at different data sizes. The expected prediction error decomposes into three terms:

Expected Error = Bias² + Variance + Irreducible Error

Where:
- Bias² measures how far the model's average prediction is from the true function (systematic error from wrong assumptions)
- Variance measures how much predictions fluctuate across different training sets (sensitivity to specific training samples)
- Irreducible Error is the noise inherent in the data (the Bayes error floor that no model can beat)
In Plain English: Imagine predicting house prices. Bias is like using a straight line for a curved relationship: no matter how many houses you see, the line stays wrong in the same way. Variance is like fitting a wiggly curve that nails the training houses but wobbles wildly on new ones. Irreducible noise is the inherent randomness in sale prices that even a perfect model can't predict, such as a buyer's mood or a competing offer.
Learning curves make this decomposition visible. High bias shows up as both lines converging at a low score. High variance shows up as a wide gap between them.
Generating Learning Curves in scikit-learn
The learning_curve() function in scikit-learn handles the mechanics: it trains the model on progressively larger subsets of the training data, computes cross-validated scores at each size, and returns arrays ready for plotting.
Here is a direct comparison of a high-bias model (linear regression on nonlinear data) against a high-variance model (unbounded decision tree) on the same house price dataset:
```
=== HIGH BIAS (Linear Regression on Nonlinear Data) ===
Size | Train R2 | Val R2 | Gap
------------------------------------------
  64 | 0.8130 | 0.8006 |  0.0124
 146 | 0.8180 | 0.8138 |  0.0042
 228 | 0.8100 | 0.8181 | -0.0081
 310 | 0.8186 | 0.8195 | -0.0009
 393 | 0.8225 | 0.8189 |  0.0035
 475 | 0.8268 | 0.8192 |  0.0076
 557 | 0.8203 | 0.8192 |  0.0011
 640 | 0.8222 | 0.8193 |  0.0029

=== HIGH VARIANCE (Decision Tree, no max_depth) ===
Size | Train R2 | Val R2 | Gap
------------------------------------------
  64 | 1.0000 | 0.8124 |  0.1876
 146 | 1.0000 | 0.8630 |  0.1370
 228 | 1.0000 | 0.8858 |  0.1142
 310 | 1.0000 | 0.8953 |  0.1047
 393 | 1.0000 | 0.9060 |  0.0940
 475 | 1.0000 | 0.9063 |  0.0937
 557 | 1.0000 | 0.9102 |  0.0898
 640 | 1.0000 | 0.9095 |  0.0905
```
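Tables like these can be reproduced with a short sketch. The dataset below is synthetic and hypothetical (the price formula, coefficients, and noise level are invented for illustration), so the exact scores will differ, but with 800 rows and 5-fold CV the evaluated training sizes match the ones above (64 through 640):

```python
import numpy as np
from sklearn.model_selection import learning_curve
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Hypothetical synthetic house-price data: price depends nonlinearly on
# sqft, bedrooms, age, and neighborhood (formula invented for illustration).
rng = np.random.default_rng(42)
n = 800
sqft = rng.uniform(500, 4000, n)
beds = rng.integers(1, 6, n)
age = rng.uniform(0, 50, n)
hood = rng.integers(0, 5, n)
X = np.column_stack([sqft, beds, age, hood])
price = (100 * sqft + 0.02 * sqft**2 + 5000 * beds * (50 - age) / 50
         + 20000 * hood + rng.normal(0, 30000, n))

models = [("HIGH BIAS (Linear Regression)", LinearRegression()),
          ("HIGH VARIANCE (Decision Tree)", DecisionTreeRegressor(random_state=42))]
for name, model in models:
    # 8 sizes from 10% to 100% of the 640-sample CV training folds
    sizes, train_scores, val_scores = learning_curve(
        model, X, price, train_sizes=np.linspace(0.1, 1.0, 8),
        cv=5, scoring="r2", shuffle=True, random_state=42)
    print(f"=== {name} ===")
    print("Size | Train R2 | Val R2 | Gap")
    for s, tr, va in zip(sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
        print(f"{s:4d} | {tr:.4f} | {va:.4f} | {tr - va:.4f}")
```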
Look at these two tables side by side. The linear model's training and validation scores sit around 0.82 with almost no gap, even at 640 samples. Adding more data won't change this. The model can't capture the nonlinear pricing patterns. Meanwhile, the decision tree achieves perfect training R2 (1.0000) every time, but validation lags behind by 0.09. This model memorizes training data instead of learning general rules. More data does help it (the gap narrows from 0.19 to 0.09), but it would take a very large dataset to close it fully.
Reading the X-Ray: High Bias Diagnosis
High bias means the model is too simple to capture the data's true patterns. On a learning curve, the signature is unmistakable: both training and validation scores plateau at a mediocre level with a tiny gap between them.
What the curves look like:
- Training score starts moderate and stays moderate
- Validation score converges quickly to nearly the same moderate level
- The gap between them is negligible
- Neither curve improves meaningfully as data grows
Why it happens in our house price example: Linear regression assumes price scales linearly with each feature. But the real relationship includes squared terms, interactions (age times bedrooms), and neighborhood effects. No amount of additional house listings fixes this mismatch. The model's assumptions are wrong.
How to fix it:
- Increase model complexity. Switch to a random forest, gradient boosted trees (XGBoost), or a neural network.
- Engineer better features. Add polynomial terms, interaction features, or domain-specific transformations. Our guide on feature engineering covers this in depth.
- Reduce regularization. If you're using Ridge or Lasso with a heavy penalty, the model might be over-constrained.
Pro Tip: If your learning curve shows high bias, stop collecting more data immediately. More data will not help. This is the most expensive mistake in applied ML, spending months gathering data for a model that's already plateaued due to insufficient complexity.
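The "engineer better features" fix can be seen paying off in a quick sketch. The data here is hypothetical (a quadratic sqft-price relationship with invented coefficients): adding a squared feature lets the same linear learner capture the curvature a straight line misses.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Hypothetical data: price curves upward with sqft, so a straight line
# stays wrong in the same way no matter how many listings it sees.
rng = np.random.default_rng(0)
sqft = rng.uniform(500, 4000, 500)
X = sqft.reshape(-1, 1)
y = 50 * sqft + 0.05 * sqft**2 + rng.normal(0, 40000, 500)

plain = LinearRegression()
curved = make_pipeline(PolynomialFeatures(degree=2, include_bias=False),
                       LinearRegression())

plain_r2 = cross_val_score(plain, X, y, cv=5, scoring="r2").mean()
curved_r2 = cross_val_score(curved, X, y, cv=5, scoring="r2").mean()
print(f"plain linear:      {plain_r2:.3f}")
print(f"+ squared feature: {curved_r2:.3f}")
```

The gain comes from removing bias, not from more data: both models see the same 500 rows.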
Reading the X-Ray: High Variance Diagnosis
High variance means the model is too complex relative to the training set size. It memorizes noise instead of learning signal. The learning curve signature: training score is near-perfect, validation score is substantially lower, and a visible gap persists.
What the curves look like:
- Training score stays very high (often 1.0 for tree-based models)
- Validation score is noticeably lower
- A persistent gap separates them
- The gap may narrow slowly as data grows, but doesn't close
Why it happens in our house price example: An unbounded decision tree creates leaves specific to individual houses. A 3,200 sqft house in neighborhood 3 with 4 bedrooms gets its own prediction node. This works great on the training set, but a new 3,200 sqft house with slightly different features gets a poor prediction because the tree memorized the training-specific noise.
How to fix it:
- Get more data. This is the primary fix. More samples make memorization harder, forcing the model to learn general patterns.
- Simplify the model. Limit tree depth, reduce the number of trees in an ensemble, or use a simpler architecture.
- Add regularization. Increase L1/L2 penalties, add dropout (for neural nets), or use min_samples_leaf constraints.
- Remove noisy features. Irrelevant features give the model extra noise to memorize.
Key Insight: The learning curve tells you whether more data will help before you spend the effort collecting it. If the validation curve is still climbing and the gap is narrowing, more data is worth the investment. If both curves have flattened, you've saturated the model's capacity and more data is wasted.
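The "simplify the model" fix can be sketched on invented data: capping `max_depth` and `min_samples_leaf` shrinks the train-validation gap at a small cost in training score.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_validate

# Hypothetical house-price data: price driven mostly by square footage,
# with enough noise for an unbounded tree to memorize.
rng = np.random.default_rng(1)
X = rng.uniform(500, 4000, size=(600, 4))
y = 100 * X[:, 0] + rng.normal(0, 20000, 600)

gaps = {}
for name, tree in [
    ("unbounded", DecisionTreeRegressor(random_state=0)),
    ("regularized", DecisionTreeRegressor(max_depth=6, min_samples_leaf=10,
                                          random_state=0)),
]:
    res = cross_validate(tree, X, y, cv=5, scoring="r2",
                         return_train_score=True)
    gaps[name] = res["train_score"].mean() - res["test_score"].mean()
    print(f"{name:11s} train {res['train_score'].mean():.3f} "
          f"val {res['test_score'].mean():.3f} gap {gaps[name]:.3f}")
```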
[Figure: Decision tree for diagnosing learning curve patterns and choosing the right fix]
Validation Curves: The Other Diagnostic Tool
Learning curves hold the model fixed and vary data size. Validation curves do the opposite: they hold the dataset fixed and vary a single hyperparameter. Together, the two tools give you a complete diagnostic picture.
A validation curve sweeps a hyperparameter across a range and plots training and validation scores at each value. The point where validation score peaks is your optimal setting. Before that peak, you're underfitting. After it, you're overfitting.
```
=== VALIDATION CURVE: Decision Tree max_depth ===
Depth | Train R2 | Val R2 | Gap
------------------------------------------
    2 | 0.8050 | 0.7851 | 0.0199
    3 | 0.8640 | 0.8386 | 0.0254
    5 | 0.9400 | 0.8901 | 0.0499
    8 | 0.9850 | 0.9155 | 0.0695
   12 | 0.9991 | 0.9088 | 0.0904
   20 | 1.0000 | 0.9095 | 0.0905
   30 | 1.0000 | 0.9095 | 0.0905
   50 | 1.0000 | 0.9095 | 0.0905

Best max_depth: 8 (Val R2 = 0.9155)
At depth=2, the model underfits: both scores are low.
At depth=50, the model overfits: training is perfect but validation drops.
The sweet spot is in between, balancing bias and variance.
```
This validation curve tells a clear story. At max_depth=2, the tree is too shallow: both scores hover around 0.80. At max_depth=8, validation peaks at 0.9155. Beyond that, training keeps climbing to 1.0 while validation actually drops. The optimal depth for our house price data is 8.
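A table like the one above can be produced with scikit-learn's `validation_curve`. The synthetic dataset in this sketch is hypothetical, so the exact scores will differ from those shown:

```python
import numpy as np
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeRegressor

# Hypothetical nonlinear house-price data (invented formula).
rng = np.random.default_rng(42)
X = rng.uniform(500, 4000, size=(800, 4))
y = 100 * X[:, 0] + 0.02 * X[:, 0]**2 + rng.normal(0, 30000, 800)

depths = [2, 3, 5, 8, 12, 20, 30, 50]
train_scores, val_scores = validation_curve(
    DecisionTreeRegressor(random_state=42), X, y,
    param_name="max_depth", param_range=depths,
    cv=5, scoring="r2")

print("Depth | Train R2 | Val R2 | Gap")
for d, tr, va in zip(depths, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"{d:5d} | {tr:.4f} | {va:.4f} | {tr - va:.4f}")
best = depths[int(np.argmax(val_scores.mean(axis=1)))]
print(f"Best max_depth: {best}")
```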
| Diagnostic Tool | X-axis | Answers | Holds Fixed |
|---|---|---|---|
| Learning curve | Training set size | "Do I need more data?" | Model and hyperparameters |
| Validation curve | Hyperparameter value | "What's the best setting?" | Full dataset |
Pro Tip: Always run a learning curve first, then a validation curve. The learning curve tells you if the problem is bias or variance. The validation curve helps you tune the specific hyperparameters to address it. Running them in reverse order means you might tune a hyperparameter perfectly for a model that's fundamentally wrong.
[Figure: Comparison of learning curves and validation curves showing different x-axes and diagnostic questions]
Advanced Patterns Beyond Bias and Variance
Real-world learning curves don't always fall neatly into "high bias" or "high variance." Several advanced patterns are worth recognizing.
Data Leakage Curves
If your validation score is suspiciously close to 1.0 and shows almost no gap from the start, suspect data leakage. Information from the validation set has bled into training, making the model "cheat." Common causes include scaling before splitting, target-derived features, and time-series data shuffled randomly instead of split chronologically. A genuine 0.99 R2 on a real-world regression task is extremely rare; a leaked model routinely produces it.
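One leakage source named above, scaling before splitting, is easy to demonstrate and to fix with a pipeline. In this small sketch (invented data) the numeric effect happens to be tiny because the features are already roughly standardized; the pattern is the point: preprocessing belongs inside the CV loop, which `make_pipeline` guarantees.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Hypothetical regression data with known coefficients.
rng = np.random.default_rng(7)
X = rng.normal(size=(300, 5))
y = X @ np.array([3.0, -2.0, 1.0, 0.0, 0.5]) + rng.normal(0, 0.5, 300)

# LEAKY: scaler fit on ALL rows, including future validation folds.
X_leaky = StandardScaler().fit_transform(X)
leaky = cross_val_score(Ridge(), X_leaky, y, cv=5, scoring="r2").mean()

# SAFE: the pipeline refits the scaler inside each CV training fold only.
safe = cross_val_score(make_pipeline(StandardScaler(), Ridge()),
                       X, y, cv=5, scoring="r2").mean()
print(f"leaky: {leaky:.4f}  safe: {safe:.4f}")
```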
The Bayes Error Floor
Sometimes both curves converge and flatten, but at a score lower than you want. This might be the irreducible noise floor (Bayes error). For our house price example, sale prices depend on buyer psychology, competing bids, and market timing that no feature captures. If your best model tops out at 0.92 R2 on clean data, that remaining 0.08 may be the noise floor. No algorithm or dataset increase will push past it.
Convergence Stalls
The validation curve rises but then stalls, while the gap remains. This happens when the model needs both more data and more complexity simultaneously. Collecting data alone won't close the gap. Increasing complexity alone will worsen the gap. You need a staged approach: add complexity, then verify with a new learning curve on the larger dataset.
Label Noise Detection
Label noise (incorrect target values) creates a characteristic pattern: the training score drops as data grows (more noise samples make fitting harder), but the validation score plateaus well below what clean data would achieve. Tools like Cleanlab can identify mislabeled samples. Cleaning even 2% to 3% of labels sometimes moves the validation score more than doubling the dataset.
When More Data Helps (and When It Doesn't)
This is the question every learning curve was designed to answer. The table below summarizes the decision framework:
| Learning Curve Pattern | Diagnosis | Will More Data Help? | Better Fix |
|---|---|---|---|
| Small gap, both scores low | High bias | No | More complex model, better features |
| Large gap, training near perfect | High variance | Yes (if gap still narrowing) | Also try regularization, simpler model |
| Small gap, both scores high | Good fit | Marginal improvement at best | Deploy or fine-tune |
| Gap narrowing but still wide | Variance, converging | Yes, significantly | Keep collecting data |
| Gap flat, scores flat | Saturated | No | Change model, add features, clean labels |
| Both scores near 1.0 from the start | Data leakage | No (results are fake) | Fix the data pipeline |
Key Insight: The slope of the validation curve at the rightmost data point is the most actionable signal. If validation score is still climbing with a meaningful slope, each additional 10% of data will produce measurable improvement. If the slope is near zero, stop collecting and start rethinking.
Production Considerations
Computational Cost
Generating a learning curve trains your model k * t times, where k is the number of CV folds and t is the number of training sizes. With 5-fold CV and 10 training sizes, that's 50 model fits. For a model that takes 30 seconds to train, the learning curve takes 25 minutes.
| Model Type | Fit Time (1K samples) | Learning Curve (5-fold, 10 sizes) | Practical? |
|---|---|---|---|
| Linear Regression | ~0.01s | ~0.5s | Always |
| Random Forest (100 trees) | ~0.5s | ~25s | Always |
| XGBoost (500 rounds) | ~2s | ~100s | Yes |
| Deep Neural Network | ~60s | ~50 min | Only early in development |
Sampling Strategies for Large Datasets
With millions of rows, training on every size from 10% to 100% is impractical. Two strategies help:
- Log-spaced sizes. Use `np.logspace` instead of `np.linspace`. The difference between 900K and 1M samples tells you less than the difference between 10K and 100K.
- Subsample first. Take a random 50K subsample and run the learning curve on that. If the curve shows high bias at 50K, it will still show high bias at 500K. If it shows high variance with a narrowing gap, you know more data will help.
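The log-spaced sizes take one line; the resulting fractions can be passed directly as `train_sizes` to `learning_curve()`:

```python
import numpy as np

# Fractions of the training set to evaluate: log-spaced from 1% to 100%,
# so early sizes are dense and late sizes are sparse.
sizes = np.logspace(-2, 0, num=8)
print(np.round(sizes, 3))
```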
The LearningCurveDisplay API
Since scikit-learn 1.2, the LearningCurveDisplay class wraps the entire curve generation and plotting workflow into a single call. In scikit-learn 1.8.0 (December 2025), this API is stable and is the recommended approach:
```python
from sklearn.model_selection import LearningCurveDisplay
from sklearn.tree import DecisionTreeRegressor
import matplotlib.pyplot as plt
import numpy as np

# One call to generate and plot (X, y are the house-price features and target)
LearningCurveDisplay.from_estimator(
    DecisionTreeRegressor(max_depth=8, random_state=42),
    X, y,
    train_sizes=np.linspace(0.1, 1.0, 10),
    cv=5, scoring='r2',
    score_type='both',
    std_display_style='fill_between',
    negate_score=False,
    shuffle=True, random_state=42
)
plt.title("Learning Curve: Decision Tree (max_depth=8)")
plt.show()
```
The ValidationCurveDisplay class works similarly for validation curves. Both accept score_type='both' to show training and validation together, and std_display_style='fill_between' for confidence bands.
Pro Tip: For quick debugging, LearningCurveDisplay.from_estimator() saves you from writing the boilerplate loop-and-plot code. For detailed analysis where you need the raw score arrays (like the tables above), stick with the learning_curve() function directly.
A Complete Diagnostic Workflow
Here's the practical sequence I use when a model underperforms:
- Generate a learning curve with your current model and data.
- Read the gap. Large gap = variance problem. Small gap = check the score level.
- Read the level. Both scores low with small gap = bias problem. Both scores high = good fit.
- Check convergence. If validation is still rising, collect more data before changing anything.
- Run a validation curve on the most impactful hyperparameter (usually model complexity: `max_depth`, `n_estimators`, `C`, `alpha`).
- Apply the fix. Bias: add features or complexity. Variance: regularize, simplify, or get data.
- Re-generate the learning curve to confirm the fix worked.
This loop takes 10 minutes with scikit-learn and saves you from weeks of aimless experimentation.
Conclusion
Learning curves turn model debugging from guesswork into science. A single plot reveals whether your model needs more data, more complexity, or something else entirely. The training-validation gap is the signal: large gap means high variance (try more data or regularization), small gap at low scores means high bias (change the model or add features), and small gap at high scores means you're ready to ship.
Pair learning curves with validation curves for the full diagnostic picture. The learning curve tells you what's wrong; the validation curve tells you how to tune the specific hyperparameter that fixes it. Both techniques build directly on the theory in the Bias-Variance Tradeoff article, which explains why these patterns emerge mathematically.
For reliable scores on the y-axis, make sure you're using proper cross-validation rather than a single train/test split. And once your learning curve says the model architecture is right, move to hyperparameter tuning to squeeze out the last few percentage points.
Before investing weeks in data collection, spend 10 minutes on a learning curve. It's the cheapest experiment in machine learning.
Frequently Asked Interview Questions
Q: Your model achieves 0.95 accuracy on training data but only 0.78 on validation. What does the learning curve likely show, and what's your first move?
The learning curve would show a large gap between training and validation scores, indicating high variance (overfitting). My first move would be checking whether the validation curve is still climbing as training size increases. If yes, collecting more data is the easiest fix. If the gap has plateaued, I'd add regularization (increase L2 penalty, limit tree depth, or add dropout for neural nets) and remove features that might be contributing noise.
Q: Both training and validation accuracy are stuck at 0.65 despite having 100K samples. What does this tell you about the model?
This is classic high bias. The learning curve would show both lines converged with a small gap at a low level. More data won't help because the model has already plateaued. The fix is increasing model complexity: switch from a linear model to a tree-based ensemble, add polynomial or interaction features, or reduce any regularization that's constraining the model too aggressively.
Q: What is the difference between a learning curve and a validation curve?
A learning curve keeps the model and hyperparameters fixed while varying the training set size. It answers "do I need more data?" A validation curve keeps the dataset fixed while sweeping one hyperparameter across a range. It answers "what's the optimal value for this hyperparameter?" Both plot training and validation scores, but they vary different things on the x-axis.
Q: You see a learning curve where the validation score is nearly identical to the training score from the very start, both close to 0.99. Is this a good sign?
Not necessarily. This pattern often indicates data leakage rather than a genuinely great model. If validation scores are suspiciously perfect from tiny training sizes onward, check for information leaking from validation into training: scaling before splitting, features derived from the target, or time-series data that was shuffled randomly. Run a sanity check by permuting the target labels; if the model still scores well, something is definitely leaking.
Q: When should you NOT use learning curves?
Learning curves are less useful when the model takes hours to train (50 fits in a learning curve becomes impractical), when the data distribution shifts over time (historical data volumes don't predict future performance), or when you've already identified the problem through other means. They're also misleading if the evaluation metric on the y-axis isn't aligned with the actual business objective.
Q: How do learning curves relate to the bias-variance tradeoff?
Learning curves visualize the bias-variance tradeoff at different data sizes. High bias (underfitting) appears as both lines converging low with a small gap, reflecting a large bias term and a low variance term in the error decomposition. High variance (overfitting) appears as a large gap, reflecting low bias but high variance. As training size grows, variance decreases (the model sees more representative data), which is why the gap narrows. Bias doesn't change with data size because it's a property of the model's functional form, not the amount of data.
Q: Your learning curve shows the validation score is still climbing at your maximum dataset size. How do you estimate how much more data you need?
Plot the validation scores on a log-scale x-axis and look for the rate of improvement. If validation R2 improved by 0.02 going from 5K to 10K samples (doubling), a rough extrapolation suggests another doubling to 20K would yield about 0.01 improvement, following a power-law decay. In practice, I'd also benchmark a simpler model (which may plateau sooner) against the current one to decide if complexity or data is the binding constraint.
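That rough extrapolation fits in a few lines. The scores below are invented for illustration: model the remaining error `1 - R2` as a power law in the sample size (a straight line in log-log space) and evaluate it at the next doubling.

```python
import numpy as np

# Hypothetical validation R2 measured at successive dataset doublings.
sizes = np.array([2500, 5000, 10000])
val_r2 = np.array([0.84, 0.86, 0.88])

# Fit (1 - R2) ~ k * n^(-c): linear in log-log coordinates.
slope, intercept = np.polyfit(np.log(sizes), np.log(1 - val_r2), 1)
pred_20k = 1 - np.exp(intercept + slope * np.log(20000))
print(f"extrapolated R2 at 20K samples: {pred_20k:.3f}")
```

The predicted gain for the next doubling comes out smaller than the last measured one, which is the power-law decay the answer describes.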
Q: A colleague suggests always collecting more data when the model isn't good enough. How do you respond?
I'd generate a learning curve to check. If it shows high bias (small gap, both scores low), more data won't move the needle. The model's assumptions are wrong, not its data supply. Only when the curve shows high variance with a narrowing gap is "collect more data" the right recommendation. This 10-minute diagnostic prevents months of wasted data collection effort.
Hands-On Practice
See learning curves in action! We'll compare a simple model (high bias) vs a complex model (high variance) and watch how the training/validation gap reveals the problem.
Dataset: ML Fundamentals (Loan Approval). We'll diagnose why models fail by examining their learning curves.
Try this: Change max_depth=2 to max_depth=5 for the "Simple Model" and watch the gap and validation score improve as the model gains just enough complexity!