
Solving the "What If": A Practical Guide to Causal Inference

LDS Team
Let's Data Science

Your company rolls out a new training program and productivity jumps 14%. The VP of People sends a celebratory email. But six months later, a junior analyst notices something uncomfortable: the employees who received the training were disproportionately senior staff, people who were already more productive before the program started. The true effect of the training? Closer to 8%. Almost half the reported gain was an illusion created by confounding.

This is the central problem of causal inference, and it shows up everywhere data scientists make decisions. Causal inference is the set of statistical methods that separate genuine cause-and-effect relationships from misleading correlations. Standard machine learning asks "What is likely to happen?" Causal inference asks a harder question: "What would happen if we intervened?" It's the difference between predicting rain and knowing whether a rain dance actually works.

Throughout this article, we'll use a single running example: does a corporate training program cause higher employee productivity? The treatment is training enrollment, the outcome is a productivity score, and the confounder is experience level. Every formula, every code block, and every diagram maps back to this scenario.

[Figure: Confounding variable diagram showing experience level affecting both training assignment and productivity]

Correlation Measures Association, Causation Measures Intervention

Correlation tells you two variables move together. Causation tells you one variable moves because of the other. The mathematical distinction is precise.

Standard conditional probability gives us P(Y | T): the probability of outcome Y given that we observe treatment T. Causal inference asks for something different: P(Y | do(T)), the probability of outcome Y given that we force treatment T to occur.

Pro Tip: The do(·) notation comes from Judea Pearl's do-calculus framework. It represents an intervention that breaks the natural relationship between the treatment and its causes. Observing that trained employees are more productive is P(Y | T). Forcing a random employee into training and measuring the result is P(Y | do(T)).

Pearl's Ladder of Causation (The Book of Why, Pearl & Mackenzie, 2018) organizes these reasoning levels into three tiers:

| Level | Query Type | Example | Typical Tool |
| --- | --- | --- | --- |
| 1. Association | P(Y \| X) | "Trained employees score higher" | Regression, correlation |
| 2. Intervention | P(Y \| do(X)) | "Training causes higher scores" | RCTs, stratification, IPW |
| 3. Counterfactual | P(Y_x \| X', Y') | "Would this employee have scored higher if trained?" | Structural causal models |

Most data science operates at Level 1. Causal inference pushes us to Level 2 and sometimes Level 3.

[Figure: Pearl's causal hierarchy showing association, intervention, and counterfactual levels]

The Fundamental Problem: You Can Never Observe the Counterfactual

The central challenge in causal inference is that you can never observe both outcomes for the same individual. If Employee A completes training and her productivity is 75, we observe her outcome under treatment (Y_1). We can never observe what her productivity would have been without training (Y_0) at the same moment. This missing data problem, formalized by Donald Rubin in the Rubin Causal Model, defines the entire field.

We define the Individual Treatment Effect (ITE) as:

ITE_i = Y_{1,i} - Y_{0,i}

Where:

  • Y_{1,i} is the outcome for individual i under treatment (with training)
  • Y_{0,i} is the outcome for individual i under control (without training)
  • ITE_i is the causal effect of the training program on individual i

Since we can only observe one of these two outcomes, we estimate the Average Treatment Effect (ATE) across a population:

ATE = E[Y_1 - Y_0] = E[Y_1] - E[Y_0]

Where:

  • E[Y_1] is the expected outcome if everyone received training
  • E[Y_0] is the expected outcome if no one received training
  • ATE is the average causal effect across the entire population

In Plain English: The ATE says "the true effect of the training program is the average productivity difference between a world where everyone got trained and a world where nobody got trained." Since we can't clone the world, we have to estimate these averages using statistical methods that make the treated and untreated groups as comparable as possible.

Confounding Variables Create Spurious Associations

A confounding variable influences both the treatment assignment and the outcome, creating a false association between them. In our training example, experience level is a textbook confounder:

  1. Experience affects who gets trained. Managers nominate senior employees for professional development at higher rates than juniors.
  2. Experience affects productivity. Senior employees produce more output regardless of training.

If we ignore experience and just compare trained vs. untreated employees, training looks more effective than it really is. The senior employees were going to be productive anyway; training gets credit it doesn't deserve. This is selection bias.

To identify confounders, practitioners draw Directed Acyclic Graphs (DAGs). The arrow Z → T means experience affects who gets trained. The arrow Z → Y means experience affects productivity. The path T ← Z → Y is a "backdoor path" that generates a spurious association. To isolate the true causal effect (T → Y), we must block this backdoor path.

Key Insight: A variable qualifies as a confounder only if it causes both the treatment and the outcome. A variable that is caused by the treatment (a mediator) or caused by both treatment and outcome (a collider) should NOT be controlled for. Controlling for a collider actually introduces bias.

The Naive Estimate Falls for Confounding

Let's see how badly a naive comparison misleads us. We'll generate a synthetic dataset where the true causal effect of training on productivity is exactly 8 points, but experience level confounds the relationship.
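The simulation below is a plausible reconstruction of that setup, not the article's exact script: the variable names (`exp_code`, `treated`, `productivity`), group proportions, and random seed are assumptions, so the printed numbers will differ slightly from the output shown.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # arbitrary seed; exact numbers vary with it
n = 2000

# Confounder: experience level (0 = Junior, 1 = Mid, 2 = Senior)
exp_code = rng.choice([0, 1, 2], size=n, p=[0.36, 0.38, 0.26])
experience = pd.Categorical.from_codes(exp_code, ["Junior", "Mid", "Senior"])

# Experience drives treatment assignment: seniors get trained far more often
p_train = np.array([0.25, 0.50, 0.80])[exp_code]
treated = rng.binomial(1, p_train)

# Experience also drives productivity; the TRUE training effect is exactly 8
productivity = 50 + 9 * exp_code + 8 * treated + rng.normal(0, 5, size=n)

df = pd.DataFrame({"experience": experience, "exp_code": exp_code,
                   "treated": treated, "productivity": productivity})

print("Treatment Rate by Experience:")
print(df.groupby("experience", observed=True)["treated"].mean())

# Naive comparison: ignore experience entirely
naive_ate = (df.loc[df.treated == 1, "productivity"].mean()
             - df.loc[df.treated == 0, "productivity"].mean())
print(f"Naive ATE: {naive_ate:.2f}")
print("True Causal Effect: 8")
```

Because seniors are both more likely to be treated and more productive at baseline, the naive difference lands well above the true effect of 8.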

Expected Output:

```text
Treatment Rate by Experience:
experience
Junior    0.240223
Mid       0.515789
Senior    0.816794

Mean Productivity (Treated): 70.97
Mean Productivity (Control): 56.82
Naive ATE: 14.15
True Causal Effect: 8
```

The naive estimate says training boosts productivity by 14.15 points. The real effect is 8. That's a 77% overestimate. Why? Senior employees (treatment rate: 82%) were both more likely to be trained and inherently more productive. The naive comparison lumps these two effects together.

Stratification Blocks the Backdoor Path

Stratification (also called conditioning or subclassification) fixes confounding by comparing treated and control groups within each level of the confounder. If we look only at Junior employees, experience is held constant, so any productivity difference must come from training itself.

The stratified ATE formula weights each stratum by its share of the population:

ATE_stratified = Σ_z ( E[Y | T=1, Z=z] − E[Y | T=0, Z=z] ) · P(Z=z)

Where:

  • E[Y | T=1, Z=z] is the average productivity of trained employees at experience level z
  • E[Y | T=0, Z=z] is the average productivity of untrained employees at experience level z
  • P(Z=z) is the proportion of employees at experience level z in the full population
  • The sum runs over all experience levels: Junior, Mid, Senior

In Plain English: Calculate the training effect separately for Junior, Mid, and Senior employees. Then combine those effects into a single number, weighting each group by how common it is. This forces an apples-to-apples comparison within each experience bracket.
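The weighted-stratum formula takes only a few lines of code. This sketch re-creates the simulated training data used throughout (an assumed setup with a true effect of 8, not the article's exact script):

```python
import numpy as np
import pandas as pd

# Re-create the simulated training data (assumed setup; true effect = 8)
rng = np.random.default_rng(42)
n = 2000
exp_code = rng.choice([0, 1, 2], size=n, p=[0.36, 0.38, 0.26])
treated = rng.binomial(1, np.array([0.25, 0.50, 0.80])[exp_code])
productivity = 50 + 9 * exp_code + 8 * treated + rng.normal(0, 5, size=n)
df = pd.DataFrame({"exp_code": exp_code, "treated": treated,
                   "productivity": productivity})

# Stratified ATE: within-stratum treated-vs-control difference,
# weighted by each stratum's share of the population
labels = {0: "Junior", 1: "Mid", 2: "Senior"}
stratified_ate = 0.0
for code, grp in df.groupby("exp_code"):
    diff = (grp.loc[grp.treated == 1, "productivity"].mean()
            - grp.loc[grp.treated == 0, "productivity"].mean())
    weight = len(grp) / len(df)
    stratified_ate += diff * weight
    print(f"Experience: {labels[code]:<6} | Diff: {diff:.2f} | Weight: {weight:.3f}")

print(f"Stratified ATE: {stratified_ate:.2f}")
```

Within each stratum, experience is constant, so the per-stratum difference is an unconfounded estimate; the weighting just averages those estimates back up to the population.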

Expected Output:

```text
Experience: Junior   | Diff: 7.20 | Weight: 0.358
Experience: Mid      | Diff: 8.55 | Weight: 0.380
Experience: Senior   | Diff: 5.23 | Weight: 0.262

Stratified ATE: 7.20
True Effect: 8
```

The stratified estimate (7.20) is dramatically closer to the true effect (8.00) than the naive estimate (14.15). Within each experience group, the treated and control employees are more comparable, so the estimate captures the actual training effect rather than the confounded mix.

Common Pitfall: Stratification breaks down when you have many confounders. With 10 binary confounders, you'd need 2^10 = 1024 strata. Most strata would contain zero or one observation. This is the "curse of dimensionality" for stratification.

Propensity Scores Compress Many Confounders Into One

Propensity Score methods solve the dimensionality problem. Instead of stratifying on every confounder separately, we compress all confounders into a single number: the propensity score.

e(x) = P(T=1 | X=x)

Where:

  • e(x) is the propensity score for an individual with characteristics x
  • T=1 indicates receiving treatment (training)
  • X is the vector of all observed confounders (experience, tenure, department, etc.)

In Plain English: The propensity score is the probability that an employee would have been selected for training based on their characteristics, regardless of whether they actually were. If two employees have the same propensity score (say both had a 60% chance of being trained), they're "comparable," and any productivity difference between them can be attributed to the training itself.

Inverse Probability Weighting (IPW)

Rather than matching employees into pairs, Inverse Probability Weighting reweights the entire dataset to create a pseudo-population where the confounders are balanced between treated and control groups.

The IPW weight for each individual is:

w_i = T_i / e(x_i) + (1 - T_i) / (1 - e(x_i))

Where:

  • T_i is the treatment indicator (1 = trained, 0 = not trained)
  • e(x_i) is the propensity score of individual ii
  • The first term applies to treated individuals, the second to controls

In Plain English: If a Junior employee (low probability of training) actually received training, they get a large weight because they represent many similar Junior employees who typically don't get trained. This artificially balances the dataset so that both groups look identical in terms of experience distribution. It's like creating a synthetic randomized experiment from observational data.
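An IPW sketch on the simulated data might look like this. Because the only confounder here is categorical, the stratum treatment rate is a valid nonparametric propensity estimate; with many covariates you would fit a logistic regression instead (the setup and names are assumptions, not the article's script):

```python
import numpy as np
import pandas as pd

# Re-create the simulated training data (assumed setup; true effect = 8)
rng = np.random.default_rng(42)
n = 2000
exp_code = rng.choice([0, 1, 2], size=n, p=[0.36, 0.38, 0.26])
treated = rng.binomial(1, np.array([0.25, 0.50, 0.80])[exp_code])
productivity = 50 + 9 * exp_code + 8 * treated + rng.normal(0, 5, size=n)
df = pd.DataFrame({"exp_code": exp_code, "treated": treated,
                   "productivity": productivity})

# Propensity score e(x) = P(T=1 | X): here, the stratum treatment rate
e = df.groupby("exp_code")["treated"].transform("mean")

# IPW weights: T/e(x) for treated, (1-T)/(1-e(x)) for controls
w = df.treated / e + (1 - df.treated) / (1 - e)

# Weighted means in the reweighted pseudo-population
mu1 = np.average(df.productivity, weights=w * df.treated)
mu0 = np.average(df.productivity, weights=w * (1 - df.treated))
print(f"IPW Weighted Mean (Treated): {mu1:.2f}")
print(f"IPW Weighted Mean (Control): {mu0:.2f}")
print(f"IPW ATE: {mu1 - mu0:.2f}")
```

Rarely-trained juniors who did get trained carry large weights, which is exactly what rebalances the experience distribution across the two groups.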

Expected Output:

```text
IPW Weighted Mean (Treated): 67.54
IPW Weighted Mean (Control): 60.28
IPW ATE: 7.26
True Effect: 8
```

The IPW estimate (7.26) is close to the stratified result (7.20) and both are near the true effect of 8. This convergence across different methods is reassuring: when multiple adjustment techniques agree, the causal estimate is credible.

Regression Adjustment Offers a Direct Alternative

Regression adjustment is the simplest causal method conceptually: fit a regression model that includes both the treatment and the confounders, then read the treatment coefficient. The coefficient represents the treatment effect while "holding constant" the confounders.
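A minimal regression-adjustment sketch on the simulated data follows. The article's output includes a confidence interval and p-value, which a full OLS package (e.g. statsmodels) would report; this stripped-down version recovers just the coefficients with numpy, and the setup is an assumption, not the article's script:

```python
import numpy as np

# Re-create the simulated training data (assumed setup; true effect = 8)
rng = np.random.default_rng(42)
n = 2000
exp_code = rng.choice([0, 1, 2], size=n, p=[0.36, 0.38, 0.26])
treated = rng.binomial(1, np.array([0.25, 0.50, 0.80])[exp_code])
productivity = 50 + 9 * exp_code + 8 * treated + rng.normal(0, 5, size=n)

# Design matrix: intercept, treatment indicator, experience code.
# Including exp_code in the fit "holds experience constant".
X = np.column_stack([np.ones(n), treated, exp_code])
beta, *_ = np.linalg.lstsq(X, productivity, rcond=None)

print(f"Treatment coefficient (ATE): {beta[1]:.2f}")
print(f"exp_code coefficient: {beta[2]:.2f}")
```

The treatment coefficient lands near the true effect of 8 because the outcome really is linear in both regressors here; when that functional form is wrong, so is the coefficient.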

Expected Output:

```text
Treatment coefficient (ATE): 7.46
95% CI: [6.34, 8.58]
p-value: 0.000000
True Effect: 8
```

The regression coefficient for treatment is 7.46, and the 95% confidence interval of [6.34, 8.58] contains the true effect of 8. Notice that the exp_code coefficient (~9.53) captures how much each step up in experience contributes to productivity, independent of training.

Pro Tip: Regression adjustment assumes the relationship between confounders and outcome is correctly specified (linear, in this case). If the true relationship is nonlinear, the treatment estimate can be biased. IPW doesn't make this assumption about the outcome model, which is why combining both approaches ("doubly robust" estimation) is popular in practice.
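The doubly robust idea can be sketched by combining the two models we already have, in the AIPW (augmented IPW) form. This is a simplified illustration on the simulated data, using stratum means as both the propensity and outcome models, not a production estimator:

```python
import numpy as np
import pandas as pd

# Re-create the simulated training data (assumed setup; true effect = 8)
rng = np.random.default_rng(42)
n = 2000
exp_code = rng.choice([0, 1, 2], size=n, p=[0.36, 0.38, 0.26])
treated = rng.binomial(1, np.array([0.25, 0.50, 0.80])[exp_code])
productivity = 50 + 9 * exp_code + 8 * treated + rng.normal(0, 5, size=n)
df = pd.DataFrame({"exp_code": exp_code, "treated": treated, "y": productivity})

# Propensity model: stratum treatment rate
e = df.groupby("exp_code")["treated"].transform("mean")

# Outcome models: predicted productivity under treatment / control per stratum
m1 = df.exp_code.map(df[df.treated == 1].groupby("exp_code")["y"].mean())
m0 = df.exp_code.map(df[df.treated == 0].groupby("exp_code")["y"].mean())

# AIPW: outcome-model prediction plus an IPW correction of its residuals.
# The estimate stays consistent if EITHER the e-model or the m-models are right.
t, y = df.treated, df.y
aipw = (m1 - m0 + t * (y - m1) / e - (1 - t) * (y - m0) / (1 - e)).mean()
print(f"Doubly robust (AIPW) ATE: {aipw:.2f}")
```

The correction terms vanish in expectation when the outcome models are right, and the outcome-model errors cancel in expectation when the propensity model is right, which is where the "either model" guarantee comes from.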

When to Use Each Method

| Method | Best For | Limitations |
| --- | --- | --- |
| Stratification | 1-2 categorical confounders | Curse of dimensionality with many confounders |
| IPW | Many confounders, flexible | Extreme weights when propensity scores near 0 or 1 |
| Regression Adjustment | Continuous confounders, quick | Assumes correct functional form |
| Doubly Robust | Production use cases | More complex to implement |
| A/B Testing (RCT) | When randomization is possible | Expensive, sometimes unethical |

[Figure: Causal inference method selection flowchart showing when to use each technique]

When Causal Methods Fail: Three Assumptions That Must Hold

Every observational causal method rests on assumptions. Violate them, and your "causal" estimate is just a dressed-up correlation.

1. No Unobserved Confounding (Ignorability)

If a hidden variable affects both treatment and outcome, no amount of adjustment fixes the bias. In our training example, suppose "motivation" drives both who volunteers for training and who performs well. Since we didn't measure motivation, our estimates are biased by omitted variable bias. This is the single biggest threat to observational causal studies.

2. Positivity (Overlap)

Every subgroup must have both treated and untreated individuals. If every Senior employee receives training (treatment probability = 1.0), the treated weight 1/e(x) stays finite, but the control weight 1/(1 − e(x)) explodes to infinity. You can't compare treated to control Seniors if control Seniors don't exist.

3. No Collider Bias (Bad Controls)

Never control for a variable that is caused by the treatment. If the training program causes "certification" and certification affects productivity, controlling for certification blocks part of the causal pathway. Even worse, if two variables both cause a third (a collider), conditioning on the collider creates a spurious association between the two causes. This is Berkson's paradox, and it trips up experienced analysts.

Common Pitfall: In the hypothesis testing framework, a low p-value on a treatment coefficient does NOT guarantee causation. Statistical significance tests whether an association is likely due to chance; it says nothing about whether the association is causal. A confounded estimate can be highly "significant" and completely wrong.

Production Considerations for Causal Analysis

Real-world causal inference involves considerations beyond choosing the right estimator.

Computational cost. Stratification and regression adjustment are O(n) in the number of observations. IPW with logistic regression is O(np), where p is the number of confounders. For datasets above 1M rows, consider approximate methods or sampling.

Sensitivity analysis. Since unobserved confounding can never be fully ruled out, report how strong an unmeasured confounder would need to be to overturn your result. The DoWhy library (actively maintained as of early 2026 under the PyWhy organization) automates sensitivity analysis with its refutation API.

Standard errors for IPW. The standard errors from weighted regression understate uncertainty because they ignore the estimation error in the propensity scores. Bootstrap standard errors (resample entire pipeline: estimate propensity scores, compute weights, calculate ATE) give correct coverage.
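The resample-everything bootstrap described above might look like this on the simulated training data (the setup, the `ipw_ate` helper name, the propensity-clipping bounds, and the 200-replicate count are all illustrative choices):

```python
import numpy as np
import pandas as pd

# Re-create the simulated training data (assumed setup; true effect = 8)
rng = np.random.default_rng(42)
n = 2000
exp_code = rng.choice([0, 1, 2], size=n, p=[0.36, 0.38, 0.26])
treated = rng.binomial(1, np.array([0.25, 0.50, 0.80])[exp_code])
productivity = 50 + 9 * exp_code + 8 * treated + rng.normal(0, 5, size=n)
df = pd.DataFrame({"exp_code": exp_code, "treated": treated, "y": productivity})

def ipw_ate(d):
    """Re-run the WHOLE pipeline: propensity scores, weights, weighted means."""
    # Clip guards against resamples where a stratum is all-treated or all-control
    e = d.groupby("exp_code")["treated"].transform("mean").clip(0.02, 0.98)
    w = d.treated / e + (1 - d.treated) / (1 - e)
    mu1 = np.average(d.y, weights=w * d.treated)
    mu0 = np.average(d.y, weights=w * (1 - d.treated))
    return mu1 - mu0

# Each replicate resamples rows with replacement, then re-estimates everything
boot = [ipw_ate(df.sample(n, replace=True, random_state=b).reset_index(drop=True))
        for b in range(200)]
print(f"IPW ATE: {ipw_ate(df):.2f} +/- {np.std(boot):.2f} (bootstrap SE)")
```

The key detail is that the propensity model is refit inside every replicate; bootstrapping only the final weighted means would reproduce the same understated uncertainty the paragraph warns about.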

Sample size requirements. Causal inference needs larger samples than prediction tasks. Statistical power drops when confounders absorb variation, so plan for 2-3x the sample size you'd need for a simple A/B test with comparable effect size.

Conclusion

Causal inference answers the question that matters most for decision-making: not "what patterns exist?" but "what actions work?" The gap between a naive estimate (14.15) and an adjusted one (7.20 to 7.46 in our example) is exactly the kind of mistake that leads organizations to double down on programs that don't deliver what they think.

The practical workflow is straightforward. First, draw a DAG to map out which variables confound the treatment-outcome relationship. Second, choose an adjustment method that fits your data: stratification for simple cases with few confounders, IPW or regression adjustment for more complex settings. Third, validate your assumptions, particularly the positivity condition and the possibility of unmeasured confounders.

If you can randomize treatment assignment, an A/B test eliminates confounding entirely and remains the gold standard. When randomization isn't possible, the methods in this article are your best tools for extracting causal signal from observational noise. For a deeper dive into the statistical foundations underpinning these tests, explore our guides on hypothesis testing and linear regression.

Frequently Asked Interview Questions

Q: What is the difference between correlation and causation, and why does it matter for data science?

Correlation measures statistical association: two variables move together. Causation means one variable directly influences the other. The distinction matters because business decisions based on correlations can backfire. If you observe that users who use a premium feature churn less, recommending the feature to all users won't reduce churn if the real driver is that loyal users both adopt features and stay longer. Confounding creates the illusion of a causal link where none exists.

Q: What is the Average Treatment Effect, and why can't we compute it directly?

The ATE is the expected difference in outcomes between treating everyone and treating no one: E[Y_1] - E[Y_0]. We can't compute it directly because of the fundamental problem of causal inference: each individual is either treated or not, never both simultaneously. We only observe one of the two potential outcomes for each person, making the individual treatment effect unidentifiable. Population-level methods like stratification and IPW estimate the ATE by making treated and control groups statistically comparable.

Q: How does a confounder differ from a mediator and a collider?

A confounder causes both the treatment and the outcome, creating a spurious association that must be adjusted for. A mediator sits on the causal pathway from treatment to outcome (treatment causes mediator causes outcome); controlling for it blocks the very effect you're trying to measure. A collider is caused by both the treatment and the outcome; conditioning on a collider opens a non-causal path and introduces bias. Drawing a DAG before any analysis helps you classify each variable correctly.

Q: When would you use Inverse Probability Weighting over stratification?

Stratification works well with one or two categorical confounders but breaks down with many confounders due to the curse of dimensionality (too many strata, too few observations per stratum). IPW compresses all confounders into a single propensity score, making it scalable to dozens of covariates. However, IPW is sensitive to extreme propensity scores near 0 or 1, which produce unstable weights. Trimming or stabilizing the weights is standard practice in production.

Q: Your company can't run an A/B test for a pricing change. How would you estimate the causal effect of the new price on revenue?

I'd use observational causal methods. First, I'd draw a DAG to identify confounders that affect both which customers saw the new price and their purchase behavior (e.g., geography, customer segment, acquisition channel). Then I'd estimate propensity scores using logistic regression and apply IPW or doubly robust estimation. If there's a clear rollout date, difference-in-differences with a comparable control group provides another angle. I'd run sensitivity analysis (e.g., via DoWhy) to check how strong an unmeasured confounder would need to be to invalidate the result.

Q: What is the positivity assumption, and what happens when it's violated?

Positivity requires that every combination of confounder values has a non-zero probability of receiving both treatment and control. When violated (e.g., every senior employee receives training), IPW weights become infinite or extremely large, making the ATE estimate unreliable and high-variance. Practical fixes include trimming extreme propensity scores, restricting analysis to the region of overlap, or using different estimation methods like regression adjustment that don't directly invert the propensity score.

Q: Explain doubly robust estimation and its advantage over IPW alone.

Doubly robust estimation combines a propensity score model with an outcome regression model. The key advantage is that the ATE estimate is consistent if either model is correctly specified. With IPW alone, a misspecified propensity model biases the result. With regression alone, a wrong functional form biases the result. The doubly robust estimator hedges both bets. In practice, you still want both models to be reasonable, but the theoretical guarantee provides extra insurance against misspecification.

Hands-On Practice

Correlation is not causation. We move beyond simple statistics to estimate the true Causal Effect of a medical treatment. We will start with a naive comparison that falls victim to Simpson's Paradox, and then use Stratification and Propensity Score weighting (Inverse Probability of Treatment Weighting) to adjust for confounding variables like disease severity, revealing the true impact of the drug.

Dataset: Clinical Trial (Statistics & Probability) Clinical trial dataset with 1000 patients designed for statistics and probability tutorials. Contains treatment groups (4 levels) for ANOVA, binary outcomes for hypothesis testing and A/B testing, confounders for causal inference, time-to-event data for survival analysis, and pre-generated distribution samples.

By adjusting for the confounding variable 'disease_severity', we found that the naive calculation slightly overestimated the drug's effectiveness. While the difference here (5.21 naive vs 4.93 adjusted) seems modest, it reveals an important bias: sicker patients were more likely to receive treatment and also showed larger improvements. In real-world scenarios with stronger confounding (like socioeconomic status in public health), failing to adjust can flip the sign of the result entirely (Simpson's Paradox). Techniques like Stratification and Propensity Scores are essential tools for answering 'What If?'

Explore all career paths