Your e-commerce team just spent six weeks running an A/B test on a redesigned checkout flow. The new design felt faster, customer complaints dropped, and the UX team was confident. Then the results came back: p = 0.12. "Not statistically significant." The project gets shelved.
Three months later, a competitor launches an almost identical checkout redesign and reports a 15% lift in conversions. Your design worked too. The experiment just couldn't detect it.
This is what happens when you ignore statistical power. The test had too few users to catch a real but modest improvement, so it declared "no effect" when the effect was sitting right there. In hypothesis testing, most teams obsess over avoiding false positives (Type I errors) but completely overlook the other failure mode: missing effects that actually exist.
This article uses one running example throughout: an A/B test comparing a new streamlined checkout flow against the old checkout on an e-commerce site. Every formula, every code block, and every diagram ties back to this scenario.
Statistical Power Defined
Statistical power is the probability that a hypothesis test correctly rejects the null hypothesis when a real effect exists. If the new checkout truly reduces completion time, power tells you how likely your test is to detect that improvement.
$$\text{Power} = 1 - \beta$$

Where:
- $1 - \beta$ is the probability of correctly detecting a real effect
- $\beta$ is the Type II error rate (the probability of missing a real effect)
In Plain English: Power answers a simple question: "If the new checkout really is faster, what are the chances my A/B test actually picks that up?" At 80% power, you'll catch the improvement 4 out of 5 times. At 50% power, it's a coin flip.
The widely accepted minimum is 80% power, meaning you accept a 20% risk of missing real effects to keep experiment costs reasonable. Clinical trials and high-stakes decisions often target 90% or higher.
*Diagram: Type I and Type II error decision matrix for hypothesis testing outcomes.*
The Decision Matrix
Every hypothesis test produces one of four outcomes:
| | H0 True (No real effect) | H0 False (Real effect exists) |
|---|---|---|
| Reject H0 (Significant) | Type I Error ($\alpha$): False alarm | Power ($1 - \beta$): Correct detection |
| Fail to Reject H0 (Not significant) | Correct: True negative | Type II Error ($\beta$): Missed effect |
Key Insight: Most statistics courses hammer Type I errors. But in practice, Type II errors cause more damage. A false positive gets corrected when you try to replicate. A false negative kills a good idea permanently because nobody ever finds out it worked.
The Four Levers That Control Power
Statistical power depends on four interconnected variables. Fix any three, and the fourth is mathematically determined. Understanding these levers is the difference between designing experiments that find answers and ones that waste time.
*Diagram: The four inputs to statistical power and their effects on experiment outcomes.*
1. Sample Size ($n$)
The number of observations in each group. More users in the A/B test means narrower confidence intervals, less overlap between distributions, and higher power. This is usually the lever teams have the most control over.
2. Effect Size (Cohen's $d$)
How large the real difference is. A checkout redesign that saves 30 seconds is easier to detect than one saving 3 seconds. You often can't control effect size directly, but you can decide the minimum effect worth detecting (the Minimum Detectable Effect, or MDE).
3. Significance Level ($\alpha$)
The threshold for declaring a result "significant," typically 0.05. Lowering $\alpha$ (stricter about false positives) reduces power. Raising $\alpha$ (more tolerant of false positives) increases power. This is the direct tradeoff between Type I and Type II errors.
4. Variance ($\sigma^2$)
How noisy the data is. High variance in checkout times makes it harder to spot real differences because the signal drowns in noise. Variance reduction techniques (stratification, CUPED, pre-experiment filtering) can boost power without increasing sample size.
| Lever | Direction to Increase Power | Tradeoff |
|---|---|---|
| Sample size ($n$) | Increase | Costs time and money |
| Effect size ($d$) | Look for larger effects | May miss small but important effects |
| Alpha ($\alpha$) | Raise from 0.05 to 0.10 | More false positives |
| Variance ($\sigma^2$) | Reduce via better measurement | Not always possible |
Measuring Effect Size with Cohen's d
Cohen's $d$ quantifies "how different are the two groups" in a standardized way that doesn't depend on sample size. Jacob Cohen introduced this metric in his 1988 book Statistical Power Analysis for the Behavioral Sciences, and it remains the standard for comparing means.

$$d = \frac{\bar{x}_1 - \bar{x}_2}{s_p}$$

Where:
- $d$ is Cohen's effect size
- $\bar{x}_1$ is the mean of the control group (old checkout time)
- $\bar{x}_2$ is the mean of the treatment group (new checkout time)
- $s_p$ is the pooled standard deviation across both groups

In Plain English: Cohen's $d$ asks "how many standard deviations apart are these two groups?" If the old checkout averages 185 seconds and the new one averages 175 seconds, that 10-second gap sounds meaningful. But if the standard deviation is 45 seconds, both groups overlap heavily and the difference is hard to spot statistically. Cohen's $d$ captures this relationship.
Cohen's benchmarks for interpreting $d$:
| Value | Interpretation | Example |
|---|---|---|
| 0.2 | Small effect | Subtle UX tweak |
| 0.5 | Medium effect | Meaningful redesign |
| 0.8+ | Large effect | Completely different flow |
Let's calculate Cohen's $d$ for our checkout A/B test using synthetic data.
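A sketch of that calculation (the article's random seed isn't given, so the seed of 42 below is an assumption and the exact means and $d$ will differ slightly from the output shown):

```python
import numpy as np

def cohens_d(control, treatment):
    """Cohen's d for two independent samples, using the pooled standard deviation."""
    n1, n2 = len(control), len(treatment)
    s1, s2 = np.std(control, ddof=1), np.std(treatment, ddof=1)
    # Pool the variances, weighting each group by its degrees of freedom
    s_pooled = np.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    return (np.mean(control) - np.mean(treatment)) / s_pooled

rng = np.random.default_rng(42)  # assumed seed
control = rng.normal(loc=185, scale=42, size=120)    # old checkout times (seconds)
treatment = rng.normal(loc=175, scale=42, size=120)  # new checkout times (seconds)

d = cohens_d(control, treatment)
print(f"Control (old checkout): n={len(control)}, "
      f"Mean={control.mean():.1f}s, Std={control.std(ddof=1):.1f}s")
print(f"Treatment (new checkout): n={len(treatment)}, "
      f"Mean={treatment.mean():.1f}s, Std={treatment.std(ddof=1):.1f}s")
print(f"Cohen's d: {d:.4f}")
```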
Expected Output:
Control (old checkout): n=120, Mean=181.4s, Std=41.6s
Treatment (new checkout): n=120, Mean=171.1s, Std=42.5s
Mean difference: 10.3 seconds faster
Pooled Std Dev: 42.1s
Cohen's d: 0.2451
Interpretation: Small effect
Common Pitfall: A 10-second improvement in checkout time sounds operationally significant. But with a standard deviation of 42 seconds, the groups overlap almost entirely. Cohen's $d$ of 0.245 puts this squarely in "small effect" territory, meaning we need a large sample to detect it reliably. Never confuse practical significance with statistical detectability.
Post-Hoc Power Analysis
Post-hoc power analysis answers the question: "Given the sample size we collected and the effect we observed, how likely were we to detect this result?" The TTestIndPower class from statsmodels handles this calculation.
You supply three of the four parameters, and solve_power computes the missing one.
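A minimal sketch of that call, plugging in the observed values from the checkout test ($d = 0.2451$, $n = 120$ per group):

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

effect_size = 0.2451  # Cohen's d observed in the checkout test
n_per_group = 120
alpha = 0.05

# Leave `power` unspecified; solve_power fills in the missing parameter
power = analysis.solve_power(effect_size=effect_size, nobs1=n_per_group,
                             alpha=alpha, alternative='two-sided')

print(f"Effect size (Cohen's d): {effect_size}")
print(f"Sample size per group: {n_per_group}")
print(f"Alpha: {alpha}")
print(f"Statistical Power: {power:.4f}")
if power < 0.80:
    print("This test is UNDERPOWERED (power < 0.80).")
    print(f"There is a {1 - power:.0%} chance of missing a real effect.")
```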
Expected Output:
Effect size (Cohen's d): 0.2451
Sample size per group: 120
Alpha: 0.05
Statistical Power: 0.4725
This test is UNDERPOWERED (power < 0.80).
There is a 53% chance of missing a real effect.
There it is. With 120 users per group and a small effect size, our checkout A/B test has only 47% power. That's worse than a coin flip. More than half the time, the test would declare "no significant difference" even when the new checkout genuinely works.
Pro Tip: Post-hoc power analysis is most useful for diagnosing past experiments. For future experiments, always run power analysis before collecting data. Computing power after seeing a non-significant p-value is circular reasoning because the observed effect size is itself uncertain.
Calculating Required Sample Size
This is the most valuable application of power analysis: determining how many observations you need before running the experiment. For our checkout A/B test with a small effect ($d \approx 0.25$), how many users do we need per group?
*Diagram: Power analysis workflow from defining effect size through sample size calculation.*
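This time `power` is the known quantity and `nobs1` is the unknown; a sketch (rounding up, since `solve_power` returns a fractional sample size):

```python
import math
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

print("Required Sample Size Per Group (alpha=0.05, power=0.80)")
print("=" * 55)
for label, d in [("Small", 0.25), ("Medium", 0.50), ("Large", 0.80)]:
    n = analysis.solve_power(effect_size=d, alpha=0.05, power=0.80,
                             alternative='two-sided')
    n = math.ceil(n)  # round up: you can't enroll a fraction of a user
    print(f"{label} effect (d={d:.2f}): {n} per group ({2 * n} total)")
```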
Expected Output:
Required Sample Size Per Group (alpha=0.05, power=0.80)
=======================================================
Small effect (d=0.25): 253 per group (506 total)
Medium effect (d=0.50): 64 per group (128 total)
Large effect (d=0.80): 26 per group (52 total)
Quadrupling the effect size cuts sample needs by ~16x.
The relationship is stark. Detecting a subtle checkout improvement ($d = 0.25$) requires 506 total users. Detecting an obvious overhaul ($d = 0.8$) needs only 52. This is why choosing a realistic Minimum Detectable Effect is arguably the most important decision in A/B test design.
How Power Works Under the Hood
To understand power mechanically, picture two probability distributions laid on top of each other. The null distribution ($H_0$) shows what results look like if the new checkout has zero effect. The alternative distribution ($H_1$) shows results when the checkout truly saves time.
We draw a vertical line (the critical value) on the null distribution based on $\alpha$. Any test statistic past that line is "significant." Power is simply the fraction of the alternative distribution that falls beyond that line.
$$\text{Power} \approx \Phi\!\left(d\sqrt{\frac{n}{2}} - z_{1-\alpha/2}\right)$$

Where:
- $\Phi$ is the standard normal CDF, and $z$ is the standard normal test statistic
- $z_{1-\alpha/2}$ is the critical value from the null distribution (1.96 for $\alpha = 0.05$, two-tailed)
- $d$ is Cohen's effect size
- $n$ is the sample size per group (assuming balanced groups)
In Plain English: Imagine two overlapping bell curves. The critical value is a fence built from the first curve (null). Power is the percentage of the second curve (alternative) that spills over that fence. To push more of the alternative curve past the fence, either shove the curves further apart (bigger effect size) or make them narrower (larger sample size). Both strategies reduce the overlap, making the real effect easier to catch.
This formula reveals why sample size has such a dramatic effect: $n$ appears under a square root, so you need to quadruple the sample to double the separation between distributions. That's the mathematical reason small effects are so expensive to detect.
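The approximation is easy to check directly; a small sketch using scipy (note it ignores the negligible lower-tail rejection region and uses the normal rather than the exact noncentral-t distribution, so it lands slightly above the statsmodels figure of 0.4725):

```python
from scipy.stats import norm

def approx_power(d, n, alpha=0.05):
    """Normal approximation: Power ≈ Φ(d·sqrt(n/2) − z_{1−α/2})."""
    z_crit = norm.ppf(1 - alpha / 2)   # 1.96 for alpha = 0.05, two-tailed
    separation = d * (n / 2) ** 0.5    # how far apart the two distributions sit
    return norm.cdf(separation - z_crit)

# Our checkout test: d = 0.2451, n = 120 per group
print(f"Approximate power: {approx_power(0.2451, 120):.4f}")
```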
Power Curves Across Effect Sizes
A power curve plots statistical power as a function of sample size for a given effect size. These curves are the single most useful visual tool for experiment planning.
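A table like the one below can be produced by looping `solve_power` over a grid of sample and effect sizes; a sketch (the `>99.9%` formatting in the output shown is cosmetic and omitted here):

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
effect_sizes = [0.2, 0.5, 0.8]
sample_sizes = [10, 20, 50, 100, 200, 300, 400, 500]

print("Power by Sample Size and Effect Size (alpha=0.05)")
print("=" * 60)
print(f"{'n/group':>8}" + "".join(f"{'d=' + str(d):>10}" for d in effect_sizes))
print("-" * 60)
for n in sample_sizes:
    row = f"{n:>8}"
    for d in effect_sizes:
        p = analysis.solve_power(effect_size=d, nobs1=n, alpha=0.05)
        row += f"{p:>9.1%}"
    print(row)
```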
Expected Output:
Power by Sample Size and Effect Size (alpha=0.05)
============================================================
n/group d=0.2 d=0.5 d=0.8
------------------------------------------------------------
10 7.1% 18.5% 39.5%
20 9.5% 33.8% 69.3%
50 16.8% 69.7% 97.7%
100 29.1% 94.0% 100.0%
200 51.4% 99.9% >99.9%
300 68.6% >99.9% >99.9%
400 80.6% >99.9% >99.9%
500 88.5% >99.9% >99.9%
Three patterns jump out from this table:
- Large effects are cheap to detect. At $d = 0.8$, just 50 observations per group gives 97.7% power. You barely need to think about sample size.
- Small effects are expensive. At $d = 0.2$, even 300 users per group only gets you to 68.6% power. You need 400+ per group to cross 80%.
- Diminishing returns kick in fast. Going from 100 to 200 per group at $d = 0.5$ only moves power from 94% to 99.9%. Past the 80% threshold, extra data yields marginal gains.
The Alpha-Power Tradeoff
Adjusting the significance level ($\alpha$) directly trades false positive risk against detection ability. This tradeoff matters when the cost of missing a real effect differs from the cost of a false alarm.
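Both halves of the output below come from the same `solve_power` call, just with a different unknown each time; a sketch:

```python
import math
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# How alpha moves power with a fixed sample (d=0.25, n=200 per group)
print("Impact of Alpha on Power (d=0.25, n=200 per group)")
print("=" * 50)
for alpha in [0.01, 0.05, 0.10]:
    p = analysis.solve_power(effect_size=0.25, nobs1=200, alpha=alpha)
    print(f"alpha={alpha:.2f} -> Power = {p:.1%}")

# How the power target moves the required n (d=0.25, alpha=0.05)
print("\nRequired n Per Group at Different Power Levels (d=0.25, alpha=0.05)")
print("=" * 55)
for target in [0.70, 0.80, 0.90, 0.95]:
    n = math.ceil(analysis.solve_power(effect_size=0.25, alpha=0.05, power=target))
    print(f"Power={target:.0%} -> n={n} per group ({2 * n} total)")
```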
Expected Output:
Impact of Alpha on Power (d=0.25, n=200 per group)
==================================================
alpha=0.01 -> Power = 46.6%
alpha=0.05 -> Power = 70.3%
alpha=0.10 -> Power = 80.3%
Required n Per Group at Different Power Levels (d=0.25, alpha=0.05)
=======================================================
Power=70% -> n=199 per group (398 total)
Power=80% -> n=253 per group (506 total)
Power=90% -> n=338 per group (676 total)
Power=95% -> n=417 per group (834 total)
Pro Tip: In exploratory A/B testing where false positives are cheap to catch (you can always run a follow-up test), bumping $\alpha$ to 0.10 can save substantial traffic. At $\alpha = 0.10$ with $n = 200$ per group, you hit 80.3% power for a small effect. At the stricter $\alpha = 0.01$, that same sample gives only 46.6% power. Pick the alpha that matches the consequences of each error type.
Confirming Power Through Simulation
Analytical formulas tell you the theoretical power. Monte Carlo simulation lets you verify it empirically by running thousands of synthetic experiments and counting how many correctly reject $H_0$.
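A sketch of such a simulation: draw both groups with a true difference of $d = 0.25$ standard deviations, run a t-test, and count rejections. The exact percentages will wobble a bit with the seed, which is why they hover near (rather than exactly at) the analytical values.

```python
import numpy as np
from scipy import stats

def simulate_power(n_per_group, d=0.25, alpha=0.05, trials=10_000, seed=0):
    """Estimate power by simulating `trials` experiments where the effect is real."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(trials):
        control = rng.normal(0.0, 1.0, n_per_group)
        treatment = rng.normal(d, 1.0, n_per_group)  # true effect of d SDs
        _, p = stats.ttest_ind(control, treatment)
        rejections += p < alpha
    return rejections / trials

print("Monte Carlo Power Simulation (10,000 trials, d=0.25)")
print("=" * 55)
for n in [50, 120, 253]:
    print(f"n={n} per group: Power = {simulate_power(n):.1%}")
```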
Expected Output:
Monte Carlo Power Simulation (10,000 trials, d=0.25)
=======================================================
n=50 per group: Power = 24.0%
n=120 per group: Power = 48.9%
n=253 per group: Power = 80.9%
With only 50 per group, 76% of real effects go undetected.
At n=253, we cross the 80% threshold by design.
The simulation confirms the analytical result: 253 per group delivers approximately 80% power for $d = 0.25$. At $n = 50$, three out of four experiments would miss the real effect. At $n = 120$ (the sample size from our checkout test), it's essentially a coin flip.
This is why simulation matters. It grounds abstract probability in something tangible: out of 10,000 parallel universes where the effect is real, how many experiments catch it?
When Experiments Have Too Much Power
Overpowered experiments create a subtle but real problem. With enormous samples, even trivially small effects become statistically significant.
Consider an A/B test with 1,000,000 users per group. The standard error becomes microscopic. A 0.3-second difference in checkout time (completely irrelevant operationally) will yield a p-value below 0.001. The result is "highly significant" but practically meaningless. No product team should redesign a checkout flow for a third of a second.
| P-value | Effect Size | Interpretation |
|---|---|---|
| Low ($p < 0.05$) | Large | Important finding worth acting on |
| Low ($p < 0.05$) | Tiny | Statistically real but practically irrelevant |
| High ($p \geq 0.05$) | Large | Underpowered study; increase sample size |
| High ($p \geq 0.05$) | Small | Probably no meaningful effect |
Key Insight: Always report effect size alongside p-values. A significant p-value without an effect size is an incomplete answer. In the checkout example, "p = 0.002, Cohen's $d$ = 0.01" tells you the effect is real but too small to justify the engineering cost of deploying the new flow.
When to Use Power Analysis (and When Not To)
Use power analysis when:
- Designing A/B tests to determine how long to run the experiment
- Planning clinical trials where underpowered studies waste patient enrollment
- Budgeting data collection to decide between sample sizes of 500 vs. 5,000
- Evaluating past studies to understand why a promising result came back non-significant
- Comparing study designs to see whether a paired design outperforms an independent one
Skip power analysis when:
- You have the full population (census data). No sampling error means no power calculation needed.
- The analysis is purely descriptive. Reporting means and percentages doesn't involve hypothesis testing.
- Effect size is unknown and unguessable. Power analysis requires an assumed effect size. If you have zero basis for choosing one, the calculation produces meaningless numbers. Run a pilot study first.
- The cost of data collection is near zero. If you can trivially collect millions of rows (web analytics, log data), just collect everything and focus on effect size interpretation.
*Diagram: Comparison of underpowered and adequately powered study outcomes.*
Production Considerations
Computational Complexity
Power analysis itself is computationally trivial. The solve_power function runs in microseconds. Monte Carlo simulation with 10,000 iterations takes about 2 seconds for simple t-tests.
Practical Pitfalls at Scale
Multiple comparisons. If your A/B test has 5 variants, you're running $\binom{5}{2} = 10$ pairwise comparisons. Each needs its own power calculation, and you must apply a correction like Bonferroni or Benjamini-Hochberg to control the family-wise error rate. See the ANOVA guide for the multi-group approach.
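To see what a correction costs in power terms, here's a sketch using Bonferroni (the 10-comparison count and $n = 200$ per group are illustrative assumptions):

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
k = 10                      # e.g., all pairwise comparisons among 5 variants
alpha_corrected = 0.05 / k  # Bonferroni: split the error budget across tests

uncorrected = analysis.solve_power(effect_size=0.25, nobs1=200, alpha=0.05)
corrected = analysis.solve_power(effect_size=0.25, nobs1=200,
                                 alpha=alpha_corrected)
print(f"Power at alpha=0.05 (single test):      {uncorrected:.1%}")
print(f"Power at alpha=0.005 (10 comparisons):  {corrected:.1%}")
```

The stricter per-test alpha sharply reduces power, which is why multi-variant tests need larger samples than a simple A/B split.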
Sequential testing. Many teams peek at A/B test results daily. Each peek inflates the false positive rate. If you plan to monitor results continuously, use sequential analysis methods (group sequential designs or always-valid p-values) and adjust your power calculation accordingly.
Non-normal distributions. Cohen's $d$ and TTestIndPower assume normally distributed data. For conversion rates (binary outcomes), use proportions-based power analysis. For heavily skewed revenue data, consider bootstrap-based power estimation or transform the outcome variable.
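A sketch of the proportions-based workflow with statsmodels (the 10% baseline and 12% target conversion rates are hypothetical):

```python
import math
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Baseline conversion of 10%, minimum detectable lift to 12% (assumed numbers)
h = proportion_effectsize(0.12, 0.10)  # Cohen's h via the arcsine transform

n = math.ceil(NormalIndPower().solve_power(effect_size=h, alpha=0.05, power=0.80))
print(f"Cohen's h: {h:.4f}")
print(f"Required n per group: {n}")
```

Note how small the standardized effect is for a 2-point absolute lift, and how large the required sample becomes as a result.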
Unequal group sizes. Balanced groups ($n_1 = n_2$) maximize power for a given total sample. A 70/30 split wastes about 4% of your power compared to 50/50. The ratio parameter in solve_power handles unequal allocations, but aim for balance when possible.
Conclusion
Statistical power determines whether your experiment can actually answer the question you're asking. An underpowered test is worse than no test at all because it gives you false confidence in a null result.
The core insight is that sample size, effect size, significance level, and variance form a closed system. You can't independently choose all four. The practical workflow is straightforward: decide the smallest effect worth detecting, set your alpha and power targets, and compute the required sample size. If you can't afford that many observations, either accept a larger MDE or find ways to reduce variance.
For deeper coverage of the hypothesis testing framework that power analysis sits within, see the complete guide to hypothesis testing. To understand the probability distributions underlying these calculations, the probability distributions guide covers the normal and non-central t-distributions used by power analysis in detail. And if you're applying power analysis specifically to A/B tests, the A/B testing guide walks through the full lifecycle from power calculation through analysis.
Frequently Asked Interview Questions
Q: What is statistical power, and why does it matter?
Statistical power is the probability that a hypothesis test correctly rejects the null when a real effect exists. It equals $1 - \beta$, where $\beta$ is the Type II error rate. It matters because underpowered experiments frequently miss real effects, leading teams to abandon interventions that actually work.
Q: What are the four factors that determine statistical power?
Sample size, effect size, significance level ($\alpha$), and variance. These four form a closed mathematical system: fixing any three determines the fourth. In practice, teams typically fix $\alpha$ at 0.05, target power at 0.80, estimate an effect size, and solve for the required sample size.
Q: How does sample size affect power, and why is the relationship nonlinear?
Larger samples produce narrower sampling distributions, which reduces overlap between the null and alternative distributions. The relationship is nonlinear because the test statistic scales with $\sqrt{n}$, not $n$ itself. To double the separation between distributions (and roughly double your ability to detect effects), you need four times the sample size.
Q: What is Cohen's $d$, and how do you interpret it?
Cohen's $d$ measures the difference between two group means in units of pooled standard deviations. Values of 0.2, 0.5, and 0.8 are conventionally called small, medium, and large effects. A $d$ of 0.5 means the groups differ by half a standard deviation, which is typically noticeable in visual inspection of the data.
Q: When would you run a post-hoc power analysis, and what are its limitations?
Post-hoc power analysis is useful for diagnosing why a past study failed to find significant results, particularly when you suspect it was underpowered. Its main limitation is circularity: computing power from the observed effect size after seeing a non-significant p-value always yields low power (since a non-significant result implies a small observed effect). Pre-study power analysis using a meaningful target effect size is far more informative.
Q: Your A/B test shows p = 0.03 with a very small effect size. What do you tell the product team?
The result is statistically significant but may not be practically significant. Report both the p-value and the effect size. If Cohen's $d$ is 0.02, the effect is real but likely too small to justify deployment costs. The test was probably overpowered (too many observations), which made a trivial difference detectable. The product team should weigh the engineering cost of deploying the change against the expected business impact.
Q: How would you design a power analysis for an A/B test with a binary outcome like conversion rate?
For proportions rather than means, use the NormalIndPower class or proportions_effectsize from statsmodels. Convert the baseline conversion rate and the minimum detectable lift into an effect size (using Cohen's $h$ for proportions), then solve for the required sample size at your target power and alpha. Binary outcomes typically require larger samples than continuous outcomes for the same relative effect size.
Q: What strategies can you use to increase power without increasing sample size?
Reduce variance through paired designs, stratified randomization, or CUPED (using pre-experiment covariates). Switch from a two-tailed to a one-tailed test if the direction of the effect is known a priori. Relax alpha from 0.05 to 0.10 in exploratory contexts. Use a more sensitive test statistic (for example, ANCOVA instead of a simple t-test). Each approach has tradeoffs, but variance reduction is generally the safest because it doesn't inflate error rates.
Hands-On Practice
In experimental design, finding a 'statistically significant' result is only half the battle. The other half is ensuring your experiment is sensitive enough to detect an effect if one actually exists. This is called Statistical Power.
We'll use a clinical trial dataset to understand the relationship between sample size, effect size, and power. We will manually implement power calculations using scipy.stats (revealing the math often hidden behind 'black box' libraries) and visualize how sample size impacts our ability to discover the truth.
Dataset: Clinical Trial (Statistics & Probability) Clinical trial dataset with 1000 patients designed for statistics and probability tutorials. Contains treatment groups (4 levels) for ANOVA, binary outcomes for hypothesis testing and A/B testing, confounders for causal inference, time-to-event data for survival analysis, and pre-generated distribution samples.
By calculating Cohen's $d$ and analyzing the Power Curve, we confirmed that Drug B has a massive effect compared to the Placebo. The visualization demonstrates a crucial concept: simply increasing sample size yields diminishing returns once power approaches 1.0.
In this case, the clinical trial was well-powered, almost guaranteed to find the effect. In real-world scenarios with smaller effects (lower Cohen's $d$), this same math helps you avoid launching expensive experiments that are doomed to fail due to insufficient data.