It is a well-known statistical trope that ice cream sales correlate with shark attacks. Does this mean banning Rocky Road would save swimmers? Of course not. Both are caused by a third factor: summer heat.
Most data scientists know this intuitive example, yet in professional settings, we often make the exact same mistake. We see that users who click a specific "Buy Now" button spend more money, so we assume the button causes the spending. But what if high-intent buyers were just more likely to click it?
This is the domain of Causal Inference. While standard machine learning asks "What is likely to happen next based on correlations?", causal inference asks "What would happen if we intervened?"
It is the science of the counterfactual—estimating the unobservable "what if" to make decisions that actually drive results.
What separates correlation from causation?
Correlation measures association; causation measures the effect of an intervention. Correlation tells you that two things move together. Causation tells you that one thing moves because of the other.
Mathematically, standard probability helps us calculate P(Y | T = t) — the probability of an outcome Y given that we observe treatment T. Causal inference asks for P(Y | do(T = t)) — the probability of outcome Y given that we force treatment T to occur.
💡 Pro Tip: The do(·) notation comes from Pearl’s do-calculus. It signifies an intervention that breaks the natural dependence between the treatment and the environment, effectively simulating a controlled experiment.
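To make the distinction concrete, here is a minimal synthetic simulation of the ice cream example (toy probabilities, not the tutorial dataset). Observing T inflates the apparent risk because of the confounder, while intervening on T does not, since forcing ice cream sales leaves the attack mechanism untouched:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Confounder: a hot day (Z) raises both ice cream sales (T) and shark attacks (Y)
z = rng.random(n) < 0.5                          # hot day?
t_obs = rng.random(n) < np.where(z, 0.9, 0.1)    # ice cream mostly sold on hot days
y_obs = rng.random(n) < np.where(z, 0.3, 0.05)   # attacks depend only on heat, not on T

# Observational quantity: P(Y=1 | T=1) -- inflated by the confounder
p_y_given_t = y_obs[t_obs].mean()

# Interventional quantity: force T=1 for everyone; Y's mechanism is unchanged,
# so P(Y=1 | do(T=1)) is just the marginal attack rate
p_y_do_t = y_obs.mean()

print(f"P(Y=1 | T=1)     = {p_y_given_t:.3f}")   # spurious, ~0.28
print(f"P(Y=1 | do(T=1)) = {p_y_do_t:.3f}")      # true, ~0.18
```

The gap between the two numbers is exactly the spurious association that the backdoor path through the hot-day variable creates.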
The Fundamental Problem of Causal Inference
The central challenge in causal inference is that we can never observe the counterfactual for a specific individual.
If Patient A takes a drug and recovers, we observe their outcome under treatment, Y(1). We absolutely cannot observe what would have happened if they hadn't taken the drug, Y(0), at that exact same moment in time. This missing data problem is the fundamental problem of causal inference.
We define the Individual Treatment Effect (ITE) for patient i as:
ITE_i = Y_i(1) − Y_i(0)
Since we can never see both potential outcomes for the same patient, we settle for the Average Treatment Effect (ATE) across a population:
ATE = E[Y(1)] − E[Y(0)]
In Plain English: This formula says "The true effect of the drug is the average difference between the world where everyone took it and the world where no one took it." Since we can't clone the world, we have to estimate these averages using statistical tricks to make the groups look as similar as possible.
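The "cloned world" idea can be sketched directly: in a synthetic setting we can generate both potential outcomes per patient — something real data never allows — and verify that the ATE is just the mean of the unobservable ITEs. The numbers below are illustrative, not from the tutorial dataset:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000

# Synthetic "cloned world": we generate BOTH potential outcomes per patient,
# which is impossible with real data
y0 = rng.normal(loc=0.0, scale=1.0, size=n)        # outcome without the drug
y1 = y0 + rng.normal(loc=5.0, scale=1.0, size=n)   # outcome with the drug (true effect ~5)

ite = y1 - y0          # Individual Treatment Effect (never observable in practice)
ate = ite.mean()       # Average Treatment Effect = E[Y(1)] - E[Y(0)]

print(f"True ATE: {ate:.2f}")   # close to 5.00
```

Real-world methods are all, in one way or another, attempts to estimate this quantity without ever seeing both columns for the same row.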
What is a confounding variable?
A confounding variable (or confounder) is a third variable that influences both the treatment and the outcome. It creates a "backdoor path" that generates a spurious association between cause and effect.
In our clinical trial dataset, disease_severity is a classic confounder:
- Influences Treatment: Doctors are ethical; they give stronger drugs (is_treatment=1) to patients with "Severe" conditions.
- Influences Outcome: Patients with "Severe" conditions generally have a harder time recovering, regardless of the drug.
If we ignore this, the drug might look ineffective (or even harmful!) simply because the sickest people took it. This is selection bias — in clinical settings, often called confounding by indication.
Visualizing Confounding with DAGs
We use Directed Acyclic Graphs (DAGs) to visualize these relationships:
Severity (Z)
↙ ↘
Treatment (T) → Outcome (Y)
The arrow Z → T means severity affects who gets treated. The arrow Z → Y means severity affects recovery. To find the true causal effect, P(Y | do(T)), we must "block" the backdoor path T ← Z → Y flowing through Z.
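A quick synthetic simulation of this exact DAG (toy coefficients, not the tutorial data) shows why blocking the backdoor matters — here, ignoring Z even flips the sign of the estimated effect:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Synthetic DAG: Z -> T, Z -> Y, T -> Y (true causal effect of T on Y is +2)
z = rng.binomial(1, 0.5, n)                       # severity (0 = mild, 1 = severe)
t = rng.binomial(1, np.where(z == 1, 0.8, 0.2))   # severe patients treated more often
y = 2.0 * t - 4.0 * z + rng.normal(0, 1, n)       # severity hurts the outcome

# Naive difference in means: the backdoor through Z is wide open
naive = y[t == 1].mean() - y[t == 0].mean()

# Block the backdoor: compare within each Z level, then average over P(Z)
adjusted = sum(
    (z == v).mean() * (y[(t == 1) & (z == v)].mean() - y[(t == 0) & (z == v)].mean())
    for v in (0, 1)
)

print(f"Naive diff:   {naive:.2f}")     # negative -- the sign has flipped!
print(f"Adjusted ATE: {adjusted:.2f}")  # close to the true effect of +2
```

This sign flip is Simpson's Paradox in miniature, and it is exactly the failure mode the stratification section below guards against on the real dataset.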
How do we estimate causal effects with Python?
We will use the lds_stats_probability.csv dataset to demonstrate how naive analysis fails and how causal methods fix it.
Setup and Data Loading
import pandas as pd
import numpy as np
import scipy.stats as stats
# Load the dataset
url = "https://letsdatascience.com/datasets/playground/lds_stats_probability.csv"
df = pd.read_csv(url)
# Relevant columns
# T: is_treatment (0 = Placebo, 1 = Drug A/B/C)
# Y: improvement (final_score - baseline_score)
# Z: disease_severity (Confounder)
print(df[['is_treatment', 'improvement', 'disease_severity']].head())
The Naive Comparison (Bias in Action)
First, let's calculate the simple difference in means. This is what you get if you just run df.groupby('is_treatment')['improvement'].mean().
# Naive Average Treatment Effect (ATE)
treated_mean = df[df['is_treatment'] == 1]['improvement'].mean()
control_mean = df[df['is_treatment'] == 0]['improvement'].mean()
naive_ate = treated_mean - control_mean
print(f"Mean Improvement (Treated): {treated_mean:.2f}")
print(f"Mean Improvement (Control): {control_mean:.2f}")
print(f"Naive ATE: {naive_ate:.2f}")
Expected Output:
Mean Improvement (Treated): 5.36
Mean Improvement (Control): 0.15
Naive ATE: 5.21
Is 5.21 the true causal effect? Only if the treatment was randomly assigned (Randomized Controlled Trial). However, in this observational scenario, we suspect disease_severity is messing things up.
How does stratification fix confounding?
Stratification (or conditioning) involves breaking the population into subgroups based on the confounder and calculating the effect within each group. If we look at "Severe" patients only, the severity bias is removed because everyone in that group has the same severity. The adjusted estimate is:
ATE = Σ_z P(Z = z) × ( E[Y | T = 1, Z = z] − E[Y | T = 0, Z = z] )
In Plain English: This formula says "Calculate the drug's effect separately for Mild, Moderate, and Severe patients. Then, average those effects together, but weight them by how common each severity level is in the total population." This forces an apples-to-apples comparison.
Applying Stratification in Python
# 1. Check if severity confounds treatment assignment
print("Treatment Rate by Severity:")
print(df.groupby('disease_severity')['is_treatment'].mean())
# 2. Calculate ATE within each severity stratum
strata_ate = []
weights = []
# Get total count for weighting
N = len(df)
for severity in df['disease_severity'].unique():
    subset = df[df['disease_severity'] == severity]
    # Weight = this stratum's share of the total population
    weight = len(subset) / N
    weights.append(weight)
    # Mean difference (treated vs. control) within this stratum
    mean_treated = subset[subset['is_treatment'] == 1]['improvement'].mean()
    mean_control = subset[subset['is_treatment'] == 0]['improvement'].mean()
    diff = mean_treated - mean_control
    strata_ate.append(diff)
    print(f"Severity: {severity:<10} | Diff: {diff:.2f} | Weight: {weight:.2f}")
# 3. Weighted Average
adjusted_ate = np.average(strata_ate, weights=weights)
print("-" * 40)
print(f"Naive ATE: {naive_ate:.2f}")
print(f"Adjusted ATE: {adjusted_ate:.2f}")
Expected Output:
Treatment Rate by Severity:
disease_severity
Mild 0.602817
Moderate 0.728070
Severe 0.883598
Name: is_treatment, dtype: float64
Severity: Moderate | Diff: 5.83 | Weight: 0.46
Severity: Mild | Diff: 3.30 | Weight: 0.35
Severity: Severe | Diff: 5.82 | Weight: 0.19
----------------------------------------
Naive ATE: 5.21
Adjusted ATE: 4.93
🔑 Key Insight: In this specific dataset, the Adjusted ATE (4.93) is slightly lower than the Naive ATE (5.21). Why? Because "Severe" patients (who show larger treatment effects, ~5.8 points) were more likely to be treated (88% vs 60% for Mild). The naive estimate was inflated by this imbalance. When we properly weight each stratum by its population share, the effect of "Mild" patients (who show smaller improvements of ~3.3 points but comprise 35% of the population) pulls the true average down.
What is Propensity Score Matching?
Stratification works great for one or two categorical confounders. But what if you have 50 confounders (age, gender, BMI, history, income...)? You can't slice the data into millions of tiny buckets—you'll run out of data points.
Propensity Score Matching (PSM) solves this by compressing all confounders into a single number: the Propensity Score.
The propensity score is the probability that a patient would have received treatment based on their characteristics, regardless of whether they actually did.
If Patient A (Treated) and Patient B (Control) both have a 65% probability of receiving treatment (based on age, severity, etc.), they are "comparable." Comparing them mimics a randomized experiment.
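The matching step itself can be sketched with a greedy nearest-neighbor pass over the scores. Everything below uses illustrative synthetic scores and outcomes, not the tutorial dataset; production implementations (e.g., with calipers and without replacement) are more careful:

```python
import numpy as np

rng = np.random.default_rng(7)

# Illustrative propensity scores and outcomes (true effect on the treated ~5)
ps_treated = rng.uniform(0.3, 0.9, 50)
ps_control = rng.uniform(0.1, 0.7, 200)
y_treated = 5 + rng.normal(0, 1, 50)
y_control = rng.normal(0, 1, 200)

# Greedy 1-nearest-neighbor matching on the propensity score (with replacement):
# for each treated unit, find the control with the closest score
matched_idx = np.abs(ps_control[None, :] - ps_treated[:, None]).argmin(axis=1)
att = (y_treated - y_control[matched_idx]).mean()   # effect on the treated

print(f"Matched estimate of the effect on the treated: {att:.2f}")
```

Note that matching with replacement estimates the effect *on the treated* (ATT) rather than the population ATE, which is one reason the tutorial uses weighting instead.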
Inverse Probability Weighting (IPW)
A modern way to use propensity scores is Inverse Probability Weighting. Instead of matching pairs, we weight the data samples to create a "pseudo-population" where treatment and control groups look identical.
The weight for an individual with propensity score e(X) is:
w = 1 / e(X) if treated, and w = 1 / (1 − e(X)) if control.
In Plain English: If a treated person had a very low probability of being treated (a "rare" case), they get a huge weight—they represent all the other people like them who usually don't get treated. This artificially balances the dataset so that the distribution of covariates (X) is the same for both treated and control groups.
Implementing IPW in Python
We will use logistic regression to estimate the propensity scores e(X) = P(T = 1 | X) and then calculate the weighted effect.
import statsmodels.api as sm
# 1. Prepare data for Logistic Regression
# We'll use age and disease_severity (encoded) to predict treatment
df['severity_code'] = df['disease_severity'].map({'Mild': 0, 'Moderate': 1, 'Severe': 2})
X_confounders = df[['age', 'severity_code']]
X_confounders = sm.add_constant(X_confounders)
T_treatment = df['is_treatment']
# 2. Fit Logistic Regression to get Propensity Scores
logit_model = sm.Logit(T_treatment, X_confounders).fit(disp=0)
df['propensity_score'] = logit_model.predict(X_confounders)
# 3. Calculate Inverse Probability Weights
# If Treated: 1 / PS
# If Control: 1 / (1 - PS)
df['ipw'] = np.where(
    df['is_treatment'] == 1,
    1 / df['propensity_score'],
    1 / (1 - df['propensity_score'])
)
# 4. Calculate Weighted Means
# Weighted mean of outcome for Treated
weighted_sum_T = (df[df['is_treatment'] == 1]['improvement'] * df[df['is_treatment'] == 1]['ipw']).sum()
total_weight_T = df[df['is_treatment'] == 1]['ipw'].sum()
mu_1_ipw = weighted_sum_T / total_weight_T
# Weighted mean of outcome for Control
weighted_sum_C = (df[df['is_treatment'] == 0]['improvement'] * df[df['is_treatment'] == 0]['ipw']).sum()
total_weight_C = df[df['is_treatment'] == 0]['ipw'].sum()
mu_0_ipw = weighted_sum_C / total_weight_C
ipw_ate = mu_1_ipw - mu_0_ipw
print(f"IPW Adjusted ATE: {ipw_ate:.2f}")
Expected Output:
IPW Adjusted ATE: 4.94
The IPW result (4.94) is very close to our stratified result (4.93), confirming that when we adjust for confounders properly using different methods, the effect estimate is robust and consistent.
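A point estimate alone conveys no uncertainty. One common way to attach a rough standard error to a weighted estimate like this is the bootstrap. The sketch below uses synthetic stand-in columns (with a known propensity of 0.5 for simplicity) rather than the tutorial's df:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 2000

# Synthetic stand-ins for the tutorial's columns (true effect = 5)
t = rng.binomial(1, 0.5, n)
y = 5.0 * t + rng.normal(0, 3, n)
e = np.full(n, 0.5)                        # known propensity here for simplicity
w = np.where(t == 1, 1 / e, 1 / (1 - e))   # IPW weights

def ipw_ate(idx):
    """Weighted difference in mean outcomes on the rows indexed by idx."""
    ti, yi, wi = t[idx], y[idx], w[idx]
    mu1 = np.average(yi[ti == 1], weights=wi[ti == 1])
    mu0 = np.average(yi[ti == 0], weights=wi[ti == 0])
    return mu1 - mu0

# Bootstrap: resample rows with replacement, recompute the estimate each time
boot = [ipw_ate(rng.integers(0, n, n)) for _ in range(500)]
print(f"IPW ATE: {ipw_ate(np.arange(n)):.2f} ± {np.std(boot):.2f} (bootstrap SE)")
```

Resampling whole rows (including the treatment indicator) keeps the treated/control ratio honest; in a full analysis you would also refit the propensity model inside each bootstrap replicate.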
When should you NOT use Causal Inference methods?
While powerful, these methods are dangerous if assumptions are violated.
- The Positivity Violation: If propensity_score is exactly 0 or 1 for anyone, IPW explodes. This happens if, for example, every severe patient gets the drug. You cannot compare severe treated vs. severe control if severe control patients don't exist.
- Unobserved Confounding: If there is a hidden variable (e.g., a genetic mutation) that we didn't measure but affects both treatment and outcome, all these adjustments fail. This is "Omitted Variable Bias."
- Bad Controls: Never control for a variable that is caused by the treatment. For example, if the drug causes side_effects and side effects cause improvement to drop, controlling for side_effects blocks part of the causal path (here side_effects is a mediator) and biases your results. Conditioning on a variable caused by both treatment and outcome is the related mistake known as Collider Bias.
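As a practical guard against positivity problems, it is common to inspect the overlap of scores between groups and trim extreme scores before weighting. The snippet below uses illustrative synthetic scores; in the tutorial you would apply the same checks to the propensity_score column computed earlier:

```python
import numpy as np
import pandas as pd

# Illustrative data standing in for the tutorial's df
rng = np.random.default_rng(3)
df = pd.DataFrame({
    "is_treatment": rng.binomial(1, 0.5, 1000),
    "propensity_score": rng.uniform(0.02, 0.98, 1000),
})

# 1. Check overlap: both groups should span a common range of scores
print(df.groupby("is_treatment")["propensity_score"].agg(["min", "max"]))

# 2. Trim extreme scores so the weights 1/e and 1/(1-e) cannot explode
eps = 0.05
trimmed = df[df["propensity_score"].between(eps, 1 - eps)]
print(f"Kept {len(trimmed)} of {len(df)} rows after trimming to [{eps}, {1 - eps}]")
```

Trimming trades a little bias (you change the population you describe) for a large reduction in variance, which is usually a good deal when scores sit near 0 or 1.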
Conclusion
Causal inference moves us from "What patterns exist?" to "What actions work?". By understanding that correlation is just the surface level of data, we can dig deeper into the mechanisms that drive outcomes.
The journey from naive observation to causal truth requires three steps:
- Identify potential confounders (Draw your DAG!).
- Choose an adjustment method (Stratification for simple cases, IPW/Matching for complex ones).
- Validate assumptions (Check positivity and unobserved confounders).
To deepen your understanding of statistical relationships, explore our guide on A/B Testing Design and Analysis, which is the "Gold Standard" way to bypass these observational problems entirely. If you're dealing with categorical data analysis, check out Understanding Chi-Square Tests.
Hands-On Practice
Correlation is not causation. In this tutorial, we move beyond simple statistics to estimate the true Causal Effect of a medical treatment. We will start with a naive comparison that falls victim to Simpson's Paradox, and then use Stratification and Propensity Score weighting (Inverse Probability of Treatment Weighting) to adjust for confounding variables like disease severity, revealing the true impact of the drug.
Dataset: Clinical Trial (Statistics & Probability) Clinical trial dataset with 1000 patients designed for statistics and probability tutorials. Contains treatment groups (4 levels) for ANOVA, binary outcomes for hypothesis testing and A/B testing, confounders for causal inference, time-to-event data for survival analysis, and pre-generated distribution samples.
By adjusting for the confounding variable 'disease_severity', we found that the naive calculation slightly overestimated the drug's effectiveness. While the difference here (5.21 naive vs 4.93 adjusted) seems modest, it reveals an important bias: sicker patients were more likely to receive treatment and also showed larger improvements. In real-world scenarios with stronger confounding (like socioeconomic status in public health), failing to adjust can flip the sign of the result entirely (Simpson's Paradox). Techniques like Stratification and Propensity Scores are essential tools for answering 'What If?'