Quantile Regression: Beyond the Average

LDS Team
Let's Data Science

Imagine Bill Gates walks into a crowded dive bar.

Statistically, the average net worth of everyone in that bar just skyrocketed to over a billion dollars. If you used a standard predictive model (like Ordinary Least Squares) to describe the patrons based on that average, you would conclude that everyone in the bar is a billionaire.

But the median patron? They’re still just trying to pay for their beer.

This illustrates the fundamental flaw of standard linear regression: it is obsessed with averages. It models the conditional mean of the data. But in the real world, data is often messy, skewed, or plagued by outliers. Sometimes, we don't care about the average case; we care about the extremes—the starving artists, the top 1% of earners, or the worst-case scenarios in financial risk.

This is where Quantile Regression shines. It allows us to model the median—or the 10th percentile, or the 99th—giving us a complete view of the distribution that averages simply cannot see.

What is Quantile Regression?

Quantile Regression is a statistical technique that estimates the conditional quantiles of a response variable rather than the conditional mean. While standard linear regression draws a line through the "middle" of the data to minimize squared errors, Quantile Regression draws lines that divide the data into specific percentiles (e.g., 50% above and 50% below for the median).

In simpler terms, if Ordinary Least Squares (OLS) Regression asks, "What is the average value of Y for a given X?", Quantile Regression asks, "What is the value of Y that exceeds $\tau \times 100\%$ of the data for a given X?"

By varying the target quantile (denoted $\tau$), we can scan the entire distribution of the data, creating a comprehensive picture of how the relationship between variables changes for high, medium, and low values.

🔑 Key Insight: Linear Regression gives you a summary of the central tendency. Quantile Regression gives you a summary of the entire distribution, allowing you to see relationships that exist only at the extremes.

Why do standard averages fail us?

Standard averages fail when data violates the strict assumptions of normality or homoscedasticity (constant variance). Averages are highly sensitive to outliers and hide the variation in the data, masking critical insights about risk or inequality.

To understand this, we need to revisit the two biggest weaknesses of the standard Linear Regression model:

  1. Sensitivity to Outliers: As seen in the Bill Gates example, one extreme value can pull the regression line wildly off course. The "Mean" tries to balance the distance to all points, so it must move toward the outlier to minimize the squared error. The "Median," however, doesn't care how far away the outlier is—only that it is on one side of the line.
  2. Heteroscedasticity: This is a fancy word for "changing variance." In many real-world datasets (like income vs. age), the spread of data points increases as the value of X increases.
    • Homoscedastic (Good for OLS): The data looks like a uniform tube around the line.
    • Heteroscedastic (Bad for OLS): The data looks like a cone or a megaphone.

When data is heteroscedastic, the OLS line might still go through the middle, but its prediction intervals become useless because they assume the error is the same everywhere. Quantile Regression solves this naturally by modeling the upper and lower boundaries of the "cone" separately.

How does the math work? (The Pinball Loss)

Quantile Regression minimizes the "Pinball Loss" function, a tilted absolute value function that assigns different penalties to positive and negative errors depending on the target quantile.

To understand the math, let's look at the intuition of the Median ($\tau = 0.5$).

In OLS, we minimize the Sum of Squared Errors (SSE): $\text{Loss}_{\text{OLS}} = \sum_i (y_i - \hat{y}_i)^2$

Squared errors punish large mistakes heavily. This is why outliers pull the line so hard.

In Median Regression (Least Absolute Deviations), we minimize the Sum of Absolute Errors: $\text{Loss}_{\text{Median}} = \sum_i |y_i - \hat{y}_i|$

Absolute errors don't explode when the difference is large. A point 1,000 units away pulls the line with the same "force" as a point 10 units away, provided they are on the same side. This makes the median robust.
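
As a quick illustrative sketch (my own addition, not part of the original example), the snippet below shows how a single extreme outlier drags the mean while barely moving the median:

python
import numpy as np

# A small sample with one extreme outlier (the "Bill Gates" effect)
values = np.array([30, 35, 40, 45, 50, 1_000_000], dtype=float)

# The mean minimizes squared error, so the outlier drags it far upward
print(f"Mean:   {values.mean():,.1f}")      # ~166,700

# The median minimizes absolute error, so it barely moves
print(f"Median: {np.median(values):,.1f}")  # 42.5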

Generalizing to Any Quantile

Now, what if we want the 90th percentile ($\tau = 0.9$)? We want a line where 90% of the actual data points are below the prediction and only 10% are above.

To achieve this, we use an asymmetric penalty. We punish the model differently for overestimating vs. underestimating. The Pinball Loss function $\rho_\tau$ is defined as:

$$\rho_\tau(u) = \begin{cases} \tau u & \text{if } u \ge 0 \\ (\tau - 1) u & \text{if } u < 0 \end{cases}$$

Where $u$ is the residual, $u = y - \hat{y}$.

Let's break down the intuition for $\tau = 0.9$:

  • Underestimation ($y > \hat{y}$): The residual $u$ is positive. The point is above the line. We apply a weight of 0.9 (heavy penalty).
  • Overestimation ($y < \hat{y}$): The residual $u$ is negative. The point is below the line. We apply a weight of $|0.9 - 1| = 0.1$ (light penalty).

Because the penalty for leaving points above the line is so high (0.9), the algorithm aggressively pushes the regression line upward. It keeps pushing up until the "cost" of pushing further (incurring many small 0.1 penalties for points now below the line) balances out the benefit of reducing the expensive 0.9 penalties.

This equilibrium naturally settles exactly where 90% of the points are below the line.
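
To make this concrete, here is a minimal sketch of the pinball loss (my own illustration, not code from the article), plus a brute-force check that the constant prediction minimizing it for $\tau = 0.9$ lands near the empirical 90th percentile:

python
import numpy as np

def pinball_loss(y_true, y_pred, tau):
    """Average pinball (quantile) loss for a given tau."""
    u = y_true - y_pred  # residuals
    return np.mean(np.where(u >= 0, tau * u, (tau - 1) * u))

rng = np.random.default_rng(0)
y = rng.normal(loc=50, scale=10, size=5_000)

# Brute-force search: which constant prediction minimizes the tau = 0.9 loss?
candidates = np.linspace(y.min(), y.max(), 2_000)
losses = [pinball_loss(y, c, tau=0.9) for c in candidates]
best = candidates[int(np.argmin(losses))]

print(f"Loss-minimizing constant:  {best:.2f}")
print(f"Empirical 90th percentile: {np.percentile(y, 90):.2f}")  # should be close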

How do we interpret the coefficients?

The coefficients in Quantile Regression represent the change in the specified quantile of the response variable for a one-unit change in the predictor variable.

This is the most powerful part of the technique. In standard regression, a coefficient $\beta = 5$ means "If X increases by 1, the average Y increases by 5."

In Quantile Regression with $\tau = 0.9$, a coefficient $\beta = 15$ means "If X increases by 1, the 90th percentile of Y increases by 15."

Real-World Interpretation Example

Imagine we are modeling House Prices based on Square Footage.

  • OLS Coefficient: 100. This means adding 1 sq ft adds $100 to the average house price.
  • $\tau = 0.1$ Coefficient: 50. For cheaper, entry-level homes (10th percentile), extra space is less valuable. Adding 1 sq ft only adds $50.
  • $\tau = 0.9$ Coefficient: 250. For luxury mansions (90th percentile), extra space is premium. Adding 1 sq ft adds $250.

If you only used OLS, you would tell a luxury developer that space is worth $100/sq ft, leading them to underprice their property significantly. Quantile Regression reveals the differing value of space across the market spectrum.
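
Plugging the article's illustrative coefficients into a quick back-of-the-envelope calculation (the numbers below are hypothetical, taken from the bullets above):

python
# Illustrative $/sq-ft slopes from the example above (hypothetical values)
slopes = {'OLS (mean)': 100, 'Quantile 0.1': 50, 'Quantile 0.9': 250}
extra_sqft = 500

for name, slope in slopes.items():
    print(f"{name}: adding {extra_sqft} sq ft changes the price by ${slope * extra_sqft:,}")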

How do we implement Quantile Regression in Python?

We will use the statsmodels library, which offers the robust QuantReg class. While scikit-learn also supports quantile regression (via QuantileRegressor), statsmodels provides comprehensive statistical summaries similar to standard regression outputs.
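
For reference, a minimal scikit-learn sketch might look like the following (assuming scikit-learn 1.0+; the data here is synthetic and purely illustrative, and alpha=0.0 disables the estimator's default L1 penalty):

python
import numpy as np
from sklearn.linear_model import QuantileRegressor

# Synthetic, heteroscedastic data just for illustration
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(500, 1))
y = 5 + 2 * X.ravel() + rng.normal(0, 1 + 0.5 * X.ravel())

# Fit the 90th-percentile line; alpha=0.0 turns off regularization
qr_90 = QuantileRegressor(quantile=0.9, alpha=0.0).fit(X, y)
print(f"Slope: {qr_90.coef_[0]:.2f}, Intercept: {qr_90.intercept_:.2f}")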

In this example, we will generate "heteroscedastic" data (the cone shape) and show how Quantile Regression captures the spreading boundaries while OLS gets stuck in the middle.

Step 1: Generate Synthetic Data

We'll create data where the variance of Y increases as X increases.

python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt

# Set seed for reproducibility
np.random.seed(42)

# Generate X values
n_samples = 500
x = np.linspace(0, 10, n_samples)

# Generate heteroscedastic noise (noise increases as x increases)
# The standard deviation of the noise grows with x
noise = np.random.normal(0, 1 + x * 0.5, n_samples)

# Generate Y values: Linear trend + growing noise
y = 5 + 2 * x + noise

# Create a DataFrame
data = pd.DataFrame({'x': x, 'y': y})

# Add a constant for statsmodels (intercept)
X_const = sm.add_constant(x)

print(data.head())

Step 2: Fit OLS and Quantile Models

We will fit four models:

  1. OLS (Mean)
  2. QR 0.05 (5th Percentile - Lower Bound)
  3. QR 0.50 (Median)
  4. QR 0.95 (95th Percentile - Upper Bound)

python
# 1. Ordinary Least Squares (OLS)
ols_model = sm.OLS(y, X_const).fit()

# 2. Quantile Regression Models
# We use the formula API for cleaner syntax
quantiles = [0.05, 0.5, 0.95]
models = {}

for qt in quantiles:
    # 'q' is the parameter for the quantile
    qr = smf.quantreg('y ~ x', data)
    res = qr.fit(q=qt)
    models[qt] = res

# Display coefficients
print(f"OLS Slope: {ols_model.params[1]:.2f}")
print("-" * 30)
for qt, res in models.items():
    print(f"Quantile {qt} Slope: {res.params['x']:.2f}")

Expected Output:

text
OLS Slope: 1.96
------------------------------
Quantile 0.05 Slope: 0.98
Quantile 0.5 Slope: 1.94
Quantile 0.95 Slope: 3.12

💡 Analysis: Notice the slopes!

  • The Median (0.5) slope (1.94) is close to the OLS slope (1.96), which makes sense as the trend is linear.
  • The 0.05 Quantile slope is much lower (~0.98). The lower bound of the data grows slowly.
  • The 0.95 Quantile slope is much higher (~3.12). The upper bound of the data shoots up rapidly.

This confirms the "cone" shape: the gap between the rich (0.95) and the poor (0.05) widens as X increases. OLS completely misses this structural change in variance.
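
As an optional sanity check (my own addition, reusing the fitted models dictionary from Step 2), you can quantify how the spread between the 0.05 and 0.95 lines widens with x:

python
import pandas as pd

# Compare the predicted 5%-95% spread at small, medium, and large x
for x_val in [1, 5, 9]:
    point = pd.DataFrame({'x': [x_val]})
    low = float(models[0.05].predict(point)[0])
    high = float(models[0.95].predict(point)[0])
    print(f"x = {x_val}: predicted 5%-95% spread = {high - low:.2f}")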

Step 3: Visualizing the Result

python
plt.figure(figsize=(10, 6))

# Plot raw data
plt.scatter(data['x'], data['y'], alpha=0.3, label='Data', color='gray')

# Plot OLS line
plt.plot(data['x'], ols_model.predict(X_const), color='red', linestyle='--', linewidth=2, label='OLS (Mean)')

# Plot Quantile lines
colors = ['green', 'blue', 'purple']
for i, qt in enumerate(quantiles):
    plt.plot(data['x'], models[qt].predict({'x': data['x']}), 
             color=colors[i], linewidth=2, label=f'Quantile {qt}')

plt.title('Quantile Regression vs. OLS on Heteroscedastic Data')
plt.xlabel('X')
plt.ylabel('Y')
plt.legend()
plt.show()

The resulting plot will show the Green (0.05) and Purple (0.95) lines expanding outward like a funnel, capturing the full range of the data, while the Red OLS line sits oblivious in the center.

When should you use Quantile Regression?

You should use Quantile Regression whenever you suspect that the predictor variables affect different parts of the distribution differently than they affect the mean, or when standard regression assumptions are violated.

1. Financial Risk Management (Value at Risk)

In finance, nobody loses sleep over the "average" daily return of a stock (which is usually near zero). They worry about the tail risk—the worst 1% or 5% of days. Quantile regression at $\tau = 0.05$ allows analysts to model Value at Risk (VaR) and understand how market indicators affect the potential for catastrophic losses.
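
A hedged sketch of the idea (entirely synthetic data, not a production VaR model) might regress daily returns on a volatility indicator at $\tau = 0.05$:

python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic returns whose downside widens as volatility rises
rng = np.random.default_rng(7)
vol = rng.uniform(0.5, 3.0, 1_000)           # hypothetical volatility index
ret = rng.normal(0, 0.01 * vol)              # spread grows with volatility
df = pd.DataFrame({'vol': vol, 'ret': ret})

# 5th-percentile model: how bad is a "bad day" given current volatility?
var_model = smf.quantreg('ret ~ vol', df).fit(q=0.05)
print(var_model.params)  # expect a negative slope: higher volatility => deeper 5% tail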

2. Real Estate and Pricing

As mentioned in the interpretation section, high-end buyers value features differently than budget buyers. Quantile regression allows platforms like Zillow or Airbnb to build segmented pricing models that are accurate for both luxury villas and studio apartments.

3. Medical Growth Charts

Pediatricians don't just check if a child's weight is "average." They check if a child is in the 95th percentile (risk of obesity) or the 5th percentile (risk of malnourishment). Quantile regression is the standard method for constructing these growth curves based on age.

4. Robustness to Outliers

Even if you only care about the central trend, if your data is messy and full of errors, Median Regression ($\tau = 0.5$) is far safer than OLS. It refuses to be bullied by extreme anomalies.

What are the limitations?

While powerful, Quantile Regression requires careful application and comes with its own set of challenges.

The Quantile Crossing Problem

Ideally, the 90th percentile line should always be above the 50th percentile line. However, because each quantile line is estimated independently, it is mathematically possible for the lines to cross. For example, the model might predict that the 90th percentile is lower than the median for certain values of X. This is a logical impossibility known as quantile crossing.

  • Solution: This usually happens where data is sparse (at the edges of the X range). Constrained quantile regression methods exist to prevent this, or you can simply acknowledge the model is invalid in that specific region; a quick diagnostic for it is sketched below.
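
As a rough sketch (my own addition, reusing the models dictionary fitted earlier), you can check for crossing on a grid of x values:

python
import numpy as np
import pandas as pd

# Predict each quantile on a grid and flag points where a lower quantile
# ends up above a higher one (i.e., quantile crossing)
grid = pd.DataFrame({'x': np.linspace(0, 10, 200)})
pred_05 = np.asarray(models[0.05].predict(grid))
pred_50 = np.asarray(models[0.5].predict(grid))
pred_95 = np.asarray(models[0.95].predict(grid))

crossing = (pred_05 > pred_50) | (pred_50 > pred_95)
print(f"Grid points with quantile crossing: {crossing.sum()} / {len(grid)}")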

Data Requirements

Estimating the mean (OLS) is "cheap" in terms of data—you can get a decent average with a small sample. Estimating extremes (like the 99th percentile) is "expensive." You need a lot of data to statistically justify what happens in the far tails of the distribution. If you try to model the 99th percentile with only 100 data points, your model will be extremely unstable.

Computational Cost

OLS has a closed-form solution (matrix algebra) that is instant. Quantile regression requires Linear Programming or iterative optimization algorithms, making it significantly slower for massive datasets.

Conclusion

Quantile Regression transforms our view of data from a 2D line drawing into a 3D landscape. It acknowledges that the "average" experience is rarely the only one that matters.

By using the Pinball Loss function to focus on specific percentiles, we can model scenarios where variance is unequal, outliers are rampant, or the most valuable insights lie at the edges of the distribution. Whether you are modeling catastrophic financial risk or simply trying to predict housing prices across a diverse market, Quantile Regression provides the precision that OLS lacks.

The next time you're about to default to model.fit() on a standard Linear Regression, plot your residuals first. If you see a megaphone shape, or if you realize your business problem is actually about the tails rather than the average, it's time to import QuantReg.

Ready to explore more advanced regression techniques?


Hands-On Practice

While theoretical understanding of Quantile Regression helps you grasp how we can model different parts of a distribution beyond just the average, hands-on practice is essential to seeing these 'hidden' relationships in actual data. In this tutorial, you will implement Quantile Regression to analyze how flower dimensions relate to each other across different percentiles, rather than just looking at the mean trends. We will use the Species Classification dataset, focusing on the continuous relationships between petal and sepal dimensions, to demonstrate how regression lines for the 10th, 50th, and 90th percentiles reveal specific structural insights that a standard OLS model would miss.

Dataset: Species Classification (Multi-class) Iris-style species classification with 3 well-separated classes. Perfect for multi-class algorithms. Expected accuracy ≈ 95%+.

Try It Yourself


Try experimenting with different quantiles, such as 0.05 and 0.95, to see how the model behaves at the extreme edges of the data. You might also try swapping the variables (predicting sepal width from petal width) to see if different biological features exhibit stronger heteroscedasticity. Observing where the quantile slopes diverge significantly from the OLS slope will reveal exactly where the 'average' model is misleading you.