
Quantile Regression: Beyond the Average

LDS Team
Let's Data Science

Your food delivery app says "estimated arrival: 35 minutes." Forty-five minutes later, you're still waiting. The app modeled the average delivery time, but you needed the worst-case estimate. That 35-minute prediction was the mean; what you actually cared about was the 90th percentile.

Standard linear regression fits a single line through the center of the data, modeling the conditional mean. Quantile regression fits lines through any percentile you choose: the 10th, the 50th (median), the 90th, or anywhere in between. Instead of one answer for "what's the expected value?", you get a full picture of where future observations are likely to fall.

Throughout this article, we'll build a single running example: predicting delivery times from distance, where the spread of actual times grows wider as orders travel farther. This is exactly the kind of heteroscedastic data where mean-only predictions fall apart.

The Problem with Modeling Only the Mean

Ordinary Least Squares (OLS) regression minimizes the sum of squared residuals to find one line that represents the conditional mean of the response variable. Two structural weaknesses make this insufficient for many real problems.

Outlier sensitivity. Squaring residuals amplifies extreme values. A single delayed delivery of 120 minutes when the rest cluster around 40 will drag the entire regression line upward. The mean tries to stay "fair" to every point, including the anomalies.

Heteroscedasticity. In many datasets, variance isn't constant. Delivery times for nearby restaurants (2 km) might cluster tightly between 10 and 20 minutes, while deliveries across town (20 km) spread anywhere from 30 to 90 minutes. OLS draws one line through the middle but can't capture the widening cone of uncertainty. Its confidence intervals assume constant variance, so they're too narrow on one end and too wide on the other.

Quantile regression addresses both problems. It models specific percentiles independently, naturally capturing the expanding spread without assuming constant variance.

OLS regression fits one line through the mean, while quantile regression fits multiple lines at different percentiles.

The Pinball Loss Function

The pinball loss (also called the check function or quantile loss) is the mathematical engine behind quantile regression. Introduced by Koenker and Bassett in their 1978 Econometrica paper, it replaces squared error with an asymmetric penalty that weights over-predictions and under-predictions differently depending on the target quantile $\tau$.

$$\rho_\tau(u) = \begin{cases} \tau \cdot u & \text{if } u \geq 0 \\ (\tau - 1) \cdot u & \text{if } u < 0 \end{cases}$$

Where:

  • $\rho_\tau(u)$ is the pinball loss for quantile $\tau$
  • $u = y_i - \hat{y}_i$ is the residual (actual minus predicted)
  • $\tau \in (0, 1)$ is the target quantile (e.g., 0.9 for the 90th percentile)
  • When $u \geq 0$ (under-prediction), the penalty weight is $\tau$
  • When $u < 0$ (over-prediction), the penalty weight is $1 - \tau$

In Plain English: For the 90th percentile ($\tau = 0.9$), under-predicting a delivery time costs 9x more than over-predicting. If the actual delivery took 80 minutes but you predicted 60 (under by 20), the loss is $0.9 \times 20 = 18$. If you predicted 100 (over by 20), the loss is only $0.1 \times 20 = 2$. The optimizer pushes the line upward to avoid the expensive under-prediction penalties, settling where roughly 90% of deliveries finish below the predicted time.
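The piecewise definition translates directly into a few lines of code. Here is a minimal sketch (the `pinball_loss` helper is ours, not a library function) that reproduces the worked example above:

```python
import numpy as np

def pinball_loss(y_true, y_pred, tau):
    """Asymmetric quantile (pinball) loss: weight tau on under-predictions,
    weight (1 - tau) on over-predictions."""
    u = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(np.where(u >= 0, tau * u, (tau - 1) * u)))

# Actual delivery took 80 minutes, tau = 0.9
print(round(pinball_loss([80], [60], 0.9), 2))   # under by 20 -> 18.0
print(round(pinball_loss([80], [100], 0.9), 2))  # over by 20  -> 2.0
```

Note how the same 20-minute miss is penalized 9x harder in one direction than the other.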

The asymmetric pinball loss penalizes over-prediction and under-prediction differently at each quantile level.

The full objective function sums the pinball loss across all $n$ observations:

$$\min_{\beta} \sum_{i=1}^{n} \rho_\tau(y_i - \mathbf{x}_i^T \beta)$$

Where:

  • $\beta$ is the coefficient vector (intercept and slopes)
  • $\mathbf{x}_i$ is the feature vector for observation $i$
  • $y_i$ is the actual response value
  • $n$ is the number of training examples

Key Insight: At $\tau = 0.5$, the pinball loss reduces to a scaled absolute error, $\tfrac{1}{2}|u|$, giving median regression (Least Absolute Deviations). Unlike squared errors, absolute errors don't explode with outliers, making the median inherently resistant to extreme values.

Interpreting Quantile Regression Coefficients

Coefficients in quantile regression measure the marginal effect of a predictor on a specific percentile of the response, not the mean. This distinction is what makes the technique so informative.

In our delivery example, the OLS slope says "each additional kilometer adds about 2.6 minutes to the average delivery time." But the quantile slopes tell a richer story:

| Quantile | Slope (min/km) | Interpretation |
|----------|----------------|----------------|
| $\tau = 0.1$ | ~1.5 | Each extra km adds ~1.5 min to fast deliveries (best case) |
| $\tau = 0.5$ | ~2.6 | Each extra km adds ~2.6 min to typical deliveries (median) |
| $\tau = 0.9$ | ~3.8 | Each extra km adds ~3.8 min to slow deliveries (worst case) |

The slopes diverge because variance grows with distance. For short trips, traffic and prep time are fairly predictable. For long trips, a single red light sequence, a wrong turn, or a restaurant delay can balloon the delivery time. The 90th percentile slope is more than double the 10th percentile slope, quantifying exactly how much uncertainty expands per kilometer.

Key Insight: When quantile slopes differ substantially across $\tau$ values, your data is heteroscedastic. If all slopes were nearly identical, the conditional distribution would be roughly symmetric and constant-variance, and OLS would suffice.

Linear Quantile Regression with statsmodels

The statsmodels library provides QuantReg for linear quantile regression. It minimizes the pinball loss via iteratively reweighted least squares, following the formulation introduced by Koenker and Bassett (1978).

Let's generate delivery data where noise scales with distance, then fit OLS alongside three quantile models.
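The original code listing isn't reproduced here, so the sketch below simulates data with this shape; the seed and noise parameters are our assumptions, and the exact numbers will differ slightly from the expected output shown.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # assumed seed
n = 300

# Distances between 1 and 25 km
distance = rng.uniform(1, 25, size=n)

# ~9 min fixed overhead + ~2.6 min/km, with noise that grows with
# distance -- this is what makes the data heteroscedastic
noise = rng.normal(0, 1, size=n) * (1 + 0.6 * distance)
delivery_time = 9 + 2.6 * distance + noise

df = pd.DataFrame({"distance_km": distance.round(1),
                   "delivery_time": delivery_time.round(1)})

print("Sample delivery data (first 5 rows):")
print(df.head().to_string(index=False))
print(f"\nDataset: {len(df)} deliveries, "
      f"distance {df.distance_km.min():.1f}-{df.distance_km.max():.1f} km")
```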

Expected output:

```text
Sample delivery data (first 5 rows):
 distance_km  delivery_time
        10.0           35.4
        23.8           56.5
        18.6           90.4
        15.4           56.8
         4.7           12.1

Dataset: 300 deliveries, distance 1.1-24.8 km
```

Now fit the models and compare coefficients.
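A minimal sketch of the fitting step follows. It regenerates illustrative data inline (our assumed parameters), so the coefficients will approximate rather than match the expected output.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Recreate heteroscedastic delivery data (illustrative parameters)
rng = np.random.default_rng(42)
distance = rng.uniform(1, 25, size=300)
delivery_time = 9 + 2.6 * distance + rng.normal(0, 1, 300) * (1 + 0.6 * distance)
df = pd.DataFrame({"distance_km": distance, "delivery_time": delivery_time})

# OLS models the conditional mean; QuantReg models chosen percentiles
ols = smf.ols("delivery_time ~ distance_km", data=df).fit()
models = {tau: smf.quantreg("delivery_time ~ distance_km", data=df).fit(q=tau)
          for tau in (0.1, 0.5, 0.9)}

print(f"{'Model':<14}{'Intercept':>10}{'Slope':>8}")
print(f"{'OLS (Mean)':<14}{ols.params['Intercept']:>10.2f}"
      f"{ols.params['distance_km']:>8.2f}")
for tau, m in models.items():
    print(f"{'Quantile ' + str(tau):<14}{m.params['Intercept']:>10.2f}"
          f"{m.params['distance_km']:>8.2f}")

# Predictions for a 20 km delivery
x20 = pd.DataFrame({"distance_km": [20.0]})
print(f"\n20 km -> OLS {ols.predict(x20)[0]:.1f} min, "
      f"Q10 {models[0.1].predict(x20)[0]:.1f} min, "
      f"Q90 {models[0.9].predict(x20)[0]:.1f} min")
```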

Expected output:

```text
Model coefficients (Intercept + Slope per km):

Model                 Intercept  Slope (min/km)
-----------------------------------------------
OLS (Mean)                 9.10            2.59
Quantile 0.1               8.48            1.51
Quantile 0.5               9.01            2.60
Quantile 0.9               9.39            3.78

Predicted delivery time for 20 km:
  OLS (average):    60.9 min
  10th percentile:  38.7 min  (best case)
  Median:           61.1 min  (typical)
  90th percentile:  84.9 min  (worst case)
```

The 20 km delivery shows the practical stakes. OLS says "about 61 minutes." But 10% of deliveries at that distance take 85 minutes or longer. If your app shows the mean, one in ten customers waits 25 minutes past the estimate. Showing the 90th percentile sets realistic expectations and reduces complaints.

Pro Tip: For customer-facing ETAs, many logistics companies use $\tau = 0.8$ or $\tau = 0.9$. The slight over-estimate builds trust: customers are pleasantly surprised by early deliveries rather than frustrated by late ones.

Building Prediction Intervals

Prediction intervals are one of the most practical outputs of quantile regression. By fitting models at two symmetric quantiles, you construct an interval that captures a known fraction of future observations without assuming normality or constant variance.

For an 80% prediction interval, fit models at $\tau = 0.1$ and $\tau = 0.9$. The gap between their predictions is the interval width, and it naturally widens where the data is more spread out.
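Under the same simulated-data assumptions as earlier, the interval construction can be sketched like this (the table values below are approximate, so your widths will differ slightly):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Illustrative heteroscedastic delivery data
rng = np.random.default_rng(0)
d = rng.uniform(1, 25, 300)
t = 9 + 2.6 * d + rng.normal(0, 1, 300) * (1 + 0.6 * d)
df = pd.DataFrame({"distance_km": d, "delivery_time": t})

# Lower and upper bounds of an 80% prediction interval
lo = smf.quantreg("delivery_time ~ distance_km", data=df).fit(q=0.1)
hi = smf.quantreg("delivery_time ~ distance_km", data=df).fit(q=0.9)

grid = pd.DataFrame({"distance_km": [5.0, 10.0, 20.0]})
lower, upper = lo.predict(grid), hi.predict(grid)
for km, l, u in zip(grid.distance_km, lower, upper):
    print(f"{km:>4.0f} km: 80% interval [{l:.0f}, {u:.0f}] min, "
          f"width {u - l:.0f} min")
```

The interval is distribution-free: no normality assumption was used to build it.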

Quantile regression creates prediction intervals from a lower bound through the median to an upper bound.

| Distance | 10th pctl (lower) | Median | 90th pctl (upper) | 80% Interval Width |
|----------|-------------------|--------|-------------------|--------------------|
| 5 km | ~16 min | ~22 min | ~28 min | ~12 min |
| 10 km | ~23 min | ~35 min | ~47 min | ~24 min |
| 20 km | ~39 min | ~61 min | ~85 min | ~46 min |

The interval width roughly quadruples from 5 km to 20 km. This is exactly the heteroscedastic structure that OLS prediction intervals, which assume constant-width bands, completely miss.

Common Pitfall: Don't confuse quantile-based prediction intervals with confidence intervals. Confidence intervals describe uncertainty about the regression line itself (the parameter estimates). Prediction intervals describe where individual future observations will fall. Prediction intervals are always wider.

Non-Linear Quantile Regression with Gradient Boosting

Linear quantile regression assumes each quantile follows a straight line. Real delivery times have non-linear patterns: traffic congestion creates step-function jumps at certain distances, and rush-hour effects interact with distance in complex ways.

Scikit-learn's GradientBoostingRegressor supports quantile loss natively through the loss='quantile' parameter (documentation). Each tree in the ensemble optimizes the pinball loss at the specified alpha (the quantile level), allowing the model to capture non-linear conditional quantiles.

Expected output:

```text
alpha=0.1  pinball_loss=1.65
alpha=0.5  pinball_loss=3.68
alpha=0.9  pinball_loss=1.67

Non-linear quantile predictions:
Distance          Q10      Q50      Q90   Spread
5 km             13.3     22.5     29.2     15.9
20 km            36.8     48.5     75.3     38.5
```

The "Spread" column confirms heteroscedasticity: the gap between the 10th and 90th percentile widens from 15.9 minutes at 5 km to 38.5 minutes at 20 km. Gradient boosting captures this non-linear widening naturally because the tree structure adapts to local data density.

For larger datasets, LightGBM (objective='quantile') and XGBoost (objective='reg:quantileerror', available since XGBoost 2.0) both offer quantile objectives and run significantly faster than scikit-learn's implementation on 100K+ rows.

When to Use Quantile Regression

Financial risk (Value at Risk). Banks and hedge funds model the 1st or 5th percentile of portfolio returns to quantify tail risk. The 2008 financial crisis exposed how mean-based models catastrophically underestimated downside risk. Quantile regression at $\tau = 0.01$ directly estimates the worst 1% of scenarios.

Logistics and delivery ETAs. As our running example demonstrates, customers and dispatchers need worst-case estimates, not averages. Amazon, DoorDash, and Uber all use quantile-based prediction intervals for their delivery time estimates.

Healthcare growth charts. The CDC's pediatric growth charts plot the 5th, 25th, 50th, 75th, and 95th percentiles of weight-for-age. These are smoothed quantile curves fit on population data. A child at the 3rd percentile triggers a malnutrition screening; one at the 97th triggers an obesity evaluation.

Real estate pricing. Bayesian regression gives you uncertainty around the mean, but quantile regression tells you something different: how the effect of square footage varies across the price distribution. Luxury homes (90th percentile) see a price-per-sqft premium 2-3x higher than budget homes (10th percentile).

Robust central tendency. Even if you only want the middle of the distribution, median regression ($\tau = 0.5$) is more resistant to outliers than OLS. In datasets with measurement errors or data entry mistakes, the median provides a safer baseline.

When NOT to Use Quantile Regression

Not every regression problem benefits from quantile modeling. Skip it when:

  1. Your data is well-behaved. If residuals are normally distributed with constant variance, OLS already gives you valid prediction intervals through standard formulas. Quantile regression won't add much insight.
  2. You have small samples. Estimating the mean requires far less data than estimating tails. Modeling the 99th percentile with 100 observations means your estimate hinges on a single data point. You need several hundred observations at a minimum for extreme quantiles.
  3. Computational budget is tight. OLS has a closed-form solution, $\hat{\beta} = (X^TX)^{-1}X^Ty$, computed in one matrix operation. Quantile regression uses iterative linear programming, roughly 5-10x slower for moderate datasets and worse for very large ones.

Practical Pitfalls and Production Considerations

The Quantile Crossing Problem

Each quantile is estimated independently. Nothing in the math prevents the 90th percentile prediction from falling below the 50th percentile for certain input values. This creates a logical impossibility where your "worst case" is better than your "typical case."

Common Pitfall: Quantile crossing usually happens at the edges of the feature space where data is sparse. If you're predicting delivery times and only have 3 observations for 25+ km distances, the 10th and 90th percentile lines may cross there. Solutions include joint quantile estimation or simply flagging predictions in sparse regions as unreliable.
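Post-hoc sorting is the simplest repair, and it can be sketched in a few lines (`fix_crossing` is a hypothetical helper, not a library function):

```python
import numpy as np

def fix_crossing(pred_matrix):
    """Post-hoc fix: sort each row's quantile predictions so lower
    quantiles never exceed higher ones (monotone rearrangement)."""
    return np.sort(np.asarray(pred_matrix, dtype=float), axis=1)

# Columns are predictions at tau = 0.1, 0.5, 0.9.
# The second row crosses: the 'Q10' prediction exceeds 'Q90'.
preds = np.array([[20.0, 35.0, 50.0],
                  [42.0, 38.0, 36.0]])

fixed = fix_crossing(preds)
print(fixed)  # second row becomes [36. 38. 42.]
```

Sorting guarantees monotone quantiles but doesn't address the underlying data sparsity, so flagging sparse-region predictions remains good practice.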

Computational Scaling

| Approach | Training Complexity | 10K rows | 100K rows | 1M rows |
|----------|---------------------|----------|-----------|---------|
| OLS | $O(np^2)$, closed-form | <1 ms | ~10 ms | ~100 ms |
| Linear QR (statsmodels) | $O(n \cdot p)$ per iteration | ~50 ms | ~500 ms | ~5 s |
| GB Quantile (sklearn) | $O(n \cdot p \cdot T \cdot d)$ | ~1 s | ~10 s | ~2 min |

Where $n$ is the number of observations, $p$ the number of features, $T$ the number of trees, and $d$ the maximum tree depth.

Evaluation with Pinball Loss

Standard metrics like RMSE or MAE don't measure what a quantile model is optimizing, since each model targets a different part of the distribution. Use mean_pinball_loss from scikit-learn. A lower pinball loss at a given $\tau$ means the model's quantile predictions are better calibrated.

Pro Tip: To validate quantile calibration, check the empirical coverage. If you fit $\tau = 0.9$, roughly 90% of test-set observations should fall below the predicted values. If only 75% do, your model is underestimating the upper tail.

| Technique | What It Models | Uncertainty Type | Key Assumption |
|-----------|----------------|------------------|----------------|
| OLS | Conditional mean | Prediction intervals (assumes normality) | Homoscedastic, normal errors |
| Quantile Regression | Conditional quantiles | Distribution-free intervals | Minimal; no distributional assumption |
| Bayesian Regression | Posterior over parameters | Credible intervals | Prior specification needed |
| Ridge/Lasso | Regularized mean | Shrinkage, not distributional | Same as OLS + regularization |
| Conformal Prediction | Any model | Coverage-guaranteed intervals | Exchangeability |

Quantile regression's unique advantage is that it makes no assumptions about the error distribution. It doesn't assume normality, homoscedasticity, or any parametric form. The price you pay is fitting a separate model for each quantile of interest and potentially dealing with crossing.

Conclusion

Quantile regression replaces the single-line summary of OLS with a full distributional view of how predictors affect the response. The pinball loss function, with its asymmetric penalties, is the mathematical engine that makes this possible: by tuning the quantile parameter $\tau$, you control which part of the conditional distribution your model targets.

In our delivery time example, the difference was stark. OLS said a 20 km delivery takes about 61 minutes. Quantile regression revealed the full story: best-case 39 minutes, worst-case 85 minutes, with the uncertainty cone widening proportionally to distance. That spread is exactly the kind of operational intelligence that confidence intervals from OLS fail to capture when variance is non-constant.

For Python implementations, statsmodels.QuantReg gives you statistical inference (p-values, standard errors) for linear quantile models, while scikit-learn's GradientBoostingRegressor(loss='quantile') handles non-linear relationships. For large-scale production systems, LightGBM with objective='quantile' offers the best speed-to-accuracy tradeoff.

The next time you're about to call .predict() and return a single number, ask yourself: does your user need the average, or do they need to know what happens in the tails?

Interview Questions

Q: What is the key difference between OLS regression and quantile regression?

OLS minimizes the sum of squared residuals to estimate the conditional mean of the response variable. Quantile regression minimizes the pinball loss to estimate any conditional quantile (median, 10th percentile, 90th percentile, etc.). This means OLS gives you one summary line through the center, while quantile regression gives you a family of lines that describe the entire conditional distribution.

Q: Why is median regression more resistant to outliers than OLS?

Median regression minimizes the sum of absolute residuals rather than squared residuals. Squaring amplifies large errors: a residual of 100 contributes 10,000 to the OLS loss but only 100 to the median regression loss. This means a single extreme outlier has far less pull on the median regression line than on OLS.

Q: Explain the pinball loss function and why it produces the desired quantile.

The pinball loss assigns asymmetric penalties to positive and negative residuals. For quantile $\tau$, under-predictions (positive residuals) are penalized by weight $\tau$ and over-predictions by weight $1 - \tau$. For $\tau = 0.9$, the model is penalized 9x more for leaving points above the line than below. The optimizer pushes the line up until the marginal cost of additional over-predictions balances the cost of reducing under-predictions, which occurs precisely at the 90th percentile.

Q: What is the quantile crossing problem, and how do you handle it?

Since each quantile model is fit independently, the predicted 90th percentile can fall below the predicted 50th percentile for certain inputs, creating a logical contradiction. This typically occurs in regions with sparse data. Solutions include joint quantile estimation methods that enforce non-crossing constraints, post-hoc sorting of quantile predictions, or conformal quantile regression which provides distribution-free coverage guarantees.

Q: When would you choose quantile regression over standard prediction intervals from OLS?

Choose quantile regression when the homoscedasticity assumption is violated (variance changes across the predictor range), when the error distribution is skewed or heavy-tailed, or when you specifically need estimates at non-central quantiles (risk analysis, worst-case planning). OLS prediction intervals assume normally distributed, constant-variance errors and become unreliable when those assumptions fail.

Q: How would you evaluate a quantile regression model?

Use the mean pinball loss (also called the quantile loss or check loss) at the target quantile. Additionally, check empirical coverage: if you fit $\tau = 0.9$, verify that approximately 90% of held-out observations fall below the predicted values. Poor calibration (e.g., only 75% coverage for a 90th percentile model) signals the model is underestimating the upper tail, possibly due to distribution shift or insufficient training data in the tails.

Q: Your delivery time model predicts the average well, but customers keep complaining about late orders. What regression technique would you recommend?

This is a textbook case for quantile regression at a high quantile like $\tau = 0.85$ or $\tau = 0.9$. Showing customers the 90th percentile delivery time rather than the mean sets expectations that are met or exceeded 90% of the time. The slight over-estimation for typical deliveries is far less costly than the customer frustration from repeated under-estimates.

Q: Can quantile regression be applied to non-linear models?

Yes. Gradient boosting, random forests, and neural networks all support quantile loss functions. Scikit-learn's GradientBoostingRegressor accepts loss='quantile' with an alpha parameter. LightGBM supports objective='quantile'. For deep learning, you replace MSE with the pinball loss in your training objective. The principle is the same: asymmetric penalties push predictions toward the desired quantile.

Hands-On Practice

While theoretical understanding of Quantile Regression helps you grasp how we can model different parts of a distribution beyond just the average, hands-on practice is essential to seeing these 'hidden' relationships in actual data. You'll implement Quantile Regression to analyze how flower dimensions relate to each other across different percentiles, rather than just looking at the mean trends. We will use the Species Classification dataset, focusing on the continuous relationships between petal and sepal dimensions, to demonstrate how regression lines for the 10th, 50th, and 90th percentiles reveal specific structural insights that a standard OLS model would miss.

Dataset: Species Classification (Multi-class) Iris-style species classification with 3 well-separated classes. Perfect for multi-class algorithms. Expected accuracy ≈ 95%+.

Try experimenting with different quantiles, such as 0.05 and 0.95, to see how the model behaves at the extreme edges of the data. You might also try swapping the variables (predicting sepal width from petal width) to see if different biological features exhibit stronger heteroscedasticity. Observing where the quantile slopes diverge significantly from the OLS slope will reveal exactly where the 'average' model is misleading you.

Explore all career paths