Mastering Hypothesis Testing: The Science of Making Data-Driven Decisions

LDS Team
Let's Data Science

Picture a courtroom. The defendant stands accused, but the judge doesn't start by assuming guilt. The entire legal system rests on one principle: innocent until proven guilty. The prosecution must produce evidence so overwhelming that maintaining the presumption of innocence becomes absurd.

Hypothesis testing works the same way. You don't glance at a chart and declare that a new drug works or that a marketing campaign moved the needle. Instead, you begin with the assumption that nothing changed, and you only abandon that assumption when the data makes it unreasonable to hold. This framework sits at the core of clinical trials, A/B tests, and every published scientific finding that claims statistical significance. Without it, you're guessing.

Throughout this article, we'll follow a single running example: a clinical trial comparing a new drug (treatment group) to a placebo (control group), measuring patient improvement scores. Every formula, code block, and table will reference this scenario so the concepts stay concrete.

The hypothesis testing framework

Hypothesis testing is a statistical procedure that uses sample data to evaluate whether a claim about a population is supported by evidence. It formalizes the question "Is this effect real, or could random chance explain what I'm seeing?" into a structured decision process with quantifiable risk.

The process moves you from subjective observations ("it looks like the treatment group improved more") to objective conclusions with an explicit error rate ("there's less than a 5% chance this difference is random noise"). According to the American Statistical Association's 2016 statement on p-values, this framework remains the most widely used approach for statistical inference across every scientific discipline.

Step-by-step hypothesis testing workflow from hypothesis to decision

Here's the workflow at a high level:

  1. State a null hypothesis ($H_0$) and an alternative hypothesis ($H_1$)
  2. Choose a significance level ($\alpha$)
  3. Collect data and compute a test statistic
  4. Calculate the p-value
  5. Compare the p-value to $\alpha$ and make a decision

Each of these steps has real consequences if done carelessly. Let's walk through them.

Null and alternative hypotheses

The null hypothesis ($H_0$) and alternative hypothesis ($H_1$) define the two competing realities your test will adjudicate. Getting these right is the first and most important step.

The null hypothesis ($H_0$)

This is the boring default. It states that any difference you observe in your sample is the product of random variation, not a genuine effect. In the courtroom analogy, $H_0$ is the presumption of innocence.

For our clinical trial:

$H_0$: The mean improvement in the treatment group equals the mean improvement in the control group. The drug has no effect.

The alternative hypothesis ($H_1$)

This is the claim you hope to support. It says the observed difference reflects something real.

$H_1$: The mean improvement in the treatment group exceeds the mean improvement in the control group. The drug works.

Key Insight: You never "prove" that $H_0$ is true. A hypothesis test can only reject $H_0$ or fail to reject it. "Fail to reject" is not the same as "accept." A court verdict of "not guilty" doesn't mean the defendant is innocent; it means the evidence wasn't strong enough to convict.

Formulating good hypotheses

Strong hypotheses share three qualities:

| Quality | Good Example | Bad Example |
|---|---|---|
| Specific and testable | "Mean improvement differs between groups" | "The drug is better" |
| Measurable | "Treatment mean > control mean" | "Patients feel better" |
| Defined before seeing data | Pre-registered hypothesis | Changed after peeking at results |

Getting the hypothesis wrong is like asking the wrong question in court. No amount of statistical rigor fixes a badly framed question.

The p-value decoded

The p-value is the probability of observing a test statistic at least as extreme as the one calculated from your data, assuming $H_0$ is true. A small p-value means the data is very unlikely under the null hypothesis, which pushes you toward rejecting $H_0$.

The coin flip intuition

Suppose you suspect a coin is unfair. Your hypotheses are:

  • $H_0$: The coin is fair (50/50)
  • $H_1$: The coin is biased

You flip it 10 times and get 10 heads in a row. If the coin were truly fair, the probability of this happening is $(0.5)^{10} \approx 0.00098$, or about 1 in 1,024. That number is your p-value.

Faced with this, you have two options: believe a 1-in-1,024 event just happened by pure luck, or conclude the coin probably isn't fair. Most people pick the second option. That's hypothesis testing in a nutshell.
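
The coin-flip arithmetic can be checked directly with SciPy (a quick sketch; `binomtest` is available in SciPy 1.7+):

```python
from scipy.stats import binomtest

# One-sided test: is the coin biased toward heads?
# H0: p = 0.5, H1: p > 0.5; observed 10 heads in 10 flips
result = binomtest(k=10, n=10, p=0.5, alternative="greater")
print(f"p-value: {result.pvalue:.6f}")  # 1/1024 ≈ 0.000977
```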

What the p-value is NOT

Common Pitfall: The p-value is NOT the probability that $H_0$ is true. It's NOT the probability that your result is due to chance. It's the probability of seeing data this extreme (or more extreme) if $H_0$ were true. This distinction trips up even experienced practitioners. The p-value is a statement about the data given the hypothesis, not a statement about the hypothesis given the data.

For our clinical trial, a p-value of 0.002 means: "If the drug truly had zero effect, there's only a 0.2% chance we'd see an improvement gap this large between groups." It does not mean there's a 0.2% chance the drug doesn't work.

Choosing a significance level

The significance level $\alpha$ is the threshold you set before running the test. If the p-value falls below $\alpha$, you reject $H_0$. It represents the maximum false positive rate you're willing to tolerate.

| $\alpha$ Level | False Positive Risk | Typical Use Case |
|---|---|---|
| 0.10 (10%) | Higher risk, more discoveries | Exploratory business analysis, early screening |
| 0.05 (5%) | Standard balance | Most scientific research, A/B testing |
| 0.01 (1%) | Conservative, fewer false alarms | Medical trials, safety-critical decisions |
| 0.001 (0.1%) | Very conservative | Particle physics (actually uses $5\sigma$), genomics |

The decision rule is straightforward:

  • If p-value $\leq \alpha$: Reject $H_0$ (the result is statistically significant)
  • If p-value $> \alpha$: Fail to reject $H_0$ (insufficient evidence)

Pro Tip: Don't pick $\alpha$ after seeing your results. That's p-hacking. Lock in $\alpha = 0.05$ (or whatever threshold fits your domain) before collecting any data. Particle physics uses $\alpha = 0.0000003$ (the famous $5\sigma$ threshold) because the cost of a false discovery announcement is enormous. In an A/B test on button color, $\alpha = 0.05$ is usually fine.

Type I and Type II errors

Every hypothesis test carries two kinds of risk. You can't eliminate both simultaneously; reducing one increases the other (given fixed sample size).

A Type I error ($\alpha$) happens when you reject a true $H_0$. You declare the drug works when it doesn't. A Type II error ($\beta$) happens when you fail to reject a false $H_0$. You conclude the drug has no effect when it actually does.

Four outcomes of hypothesis testing showing Type I and Type II error regions

| | Reject $H_0$ | Fail to Reject $H_0$ |
|---|---|---|
| $H_0$ is actually true (drug doesn't work) | Type I Error (false positive), probability = $\alpha$ | Correct decision (true negative) |
| $H_0$ is actually false (drug works) | Correct decision (true positive), probability = $1 - \beta$ (power) | Type II Error (false negative), probability = $\beta$ |

Real-world consequences

  • Type I (false positive): A pharmaceutical company approves a drug that doesn't work. Patients endure side effects for zero benefit. The company faces lawsuits and recalls.
  • Type II (false negative): A company shelves a website redesign that actually would have boosted conversions by 12%. They leave millions in revenue on the table.

Statistical power

Statistical power is $1 - \beta$, the probability of correctly detecting a real effect. Most studies aim for power of 0.80 (80% chance of catching a true effect). Three factors increase power:

  1. Larger sample size — more data reduces noise
  2. Larger effect size — bigger real differences are easier to spot
  3. Higher $\alpha$ — accepting more false positives lets you catch more true positives

In Plain English: Power answers "If the drug really does improve outcomes by 5 points, how likely is my study to detect that?" With 50 patients per group, you might have only 40% power. With 200 per group, you might reach 90%. Running an underpowered study is one of the most common (and most wasteful) mistakes in applied statistics.
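
The numbers quoted above can be reproduced approximately with statsmodels (a sketch; the effect size of $d = 0.35$ is an assumption chosen for illustration, not a value from the article's trial):

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Assumed effect size (Cohen's d = 0.35) -- illustrative only
d = 0.35
power_50 = analysis.power(effect_size=d, nobs1=50, alpha=0.05)
power_200 = analysis.power(effect_size=d, nobs1=200, alpha=0.05)

print(f"n =  50 per group -> power = {power_50:.2f}")
print(f"n = 200 per group -> power = {power_200:.2f}")
```

More patients per group means the same true effect is far more likely to be detected.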

The mathematics of the t-test

When comparing the means of two groups — like our treatment and control groups — the Student's t-test is the workhorse. William Sealy Gosset published it in 1908 under the pseudonym "Student" while working at the Guinness brewery (yes, the beer company). The original paper remains one of the most influential in statistics.

The t-statistic measures the ratio of signal (difference between group means) to noise (variability within groups):

$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$$

Where:

  • $\bar{x}_1$ is the mean improvement score of the treatment group
  • $\bar{x}_2$ is the mean improvement score of the control group
  • $s_1^2$ is the variance of the treatment group
  • $s_2^2$ is the variance of the control group
  • $n_1$ is the number of patients in the treatment group
  • $n_2$ is the number of patients in the control group

In Plain English: The numerator captures how far apart the two group averages are — the "signal." The denominator captures how spread out the data is within each group — the "noise." If the treatment group's average improvement is much higher than the control's, and neither group has wild internal variation, the t-value will be large. A large t-value pushes the p-value toward zero, which means the difference is hard to explain by chance alone.
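
The formula translates directly into a few lines of NumPy (a sketch; the improvement scores below are made-up illustrative numbers, not the article's trial data):

```python
import numpy as np

def welch_t_statistic(x1, x2):
    """Signal (mean difference) divided by noise (combined standard error)."""
    x1, x2 = np.asarray(x1, dtype=float), np.asarray(x2, dtype=float)
    signal = x1.mean() - x2.mean()
    noise = np.sqrt(x1.var(ddof=1) / len(x1) + x2.var(ddof=1) / len(x2))
    return signal / noise

# Illustrative improvement scores (hypothetical mini-sample)
treatment = [6.1, 7.3, 5.8, 8.0, 6.9]
control = [1.2, 0.8, 2.1, 1.5, 0.9]
print(f"t = {welch_t_statistic(treatment, control):.2f}")
```

Note the `ddof=1` for sample (rather than population) variance, matching the $s^2$ terms in the formula.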

Welch's t-test versus Student's t-test

The classic Student's t-test assumes both groups have equal variance. In practice, that's often wrong. Welch's t-test drops the equal-variance assumption and adjusts the degrees of freedom accordingly. Scipy's ttest_ind uses Welch's version when you set equal_var=False, and it's almost always the safer default.

Python implementation with scipy

Let's put the theory into practice. We'll generate synthetic clinical trial data — 150 patients in a control group (placebo) and 150 in a treatment group — then run a two-sample t-test using scipy.stats.ttest_ind.

Hypotheses for our trial:

  • $H_0$: Mean improvement (treatment) $=$ Mean improvement (control)
  • $H_1$: Mean improvement (treatment) $>$ Mean improvement (control)
  • $\alpha = 0.05$

Expected output:

code
=== Descriptive Statistics ===
           mean   std  count
group
control    0.91  3.30    150
treatment  6.08  4.09    150

=== Welch's T-Test Results ===
T-statistic: 12.0640
P-value (two-sided): 2.35e-27
P-value (one-sided): 1.17e-27

Significance level: alpha = 0.05
Decision: REJECT H0 — the treatment significantly improves outcomes.

Cohen's d (effect size): 1.39

The one-sided p-value is astronomically small ($1.17 \times 10^{-27}$), far below our $\alpha = 0.05$ threshold. We reject $H_0$ and conclude the treatment produces a statistically significant improvement. The Cohen's d of 1.39 indicates a large effect size — the treatment doesn't just work, it works substantially.

Visualizing the distributions

A good visualization confirms what the numbers tell you. When the two group distributions barely overlap, the difference is likely real.
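
One way to produce those plots with matplotlib (a sketch; the synthetic data parameters are assumptions, so your plot will vary):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)  # illustrative synthetic trial data
control = rng.normal(1.0, 3.5, 150)
treatment = rng.normal(6.0, 4.0, 150)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

# Histograms: visually check how much the two distributions overlap
ax1.hist(control, bins=20, alpha=0.6, label="Control")
ax1.hist(treatment, bins=20, alpha=0.6, label="Treatment")
ax1.set_xlabel("Improvement score")
ax1.set_ylabel("Number of patients")
ax1.legend()

# Box plots: compare medians and interquartile ranges side by side
ax2.boxplot([control, treatment])
ax2.set_xticks([1, 2], ["Control", "Treatment"])
ax2.set_ylabel("Improvement score")

fig.tight_layout()
fig.savefig("group_distributions.png", dpi=150)
```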

The histogram shows two clearly separated distributions. The treatment group's scores cluster around 6.1, while the control group clusters near 0.9. The box plot confirms the same story — the treatment median is visibly higher, with minimal overlap between the interquartile ranges.

For a deeper look at the mathematical distributions behind these shapes, see our guide on Probability Distributions.

One-tailed versus two-tailed tests

The choice between one-tailed and two-tailed tests depends on your research question — specifically, whether you care about the direction of the effect.

Two-tailed test (the default)

Use this when you want to detect any difference, regardless of direction.

  • $H_1$: $\mu_{\text{treatment}} \neq \mu_{\text{control}}$
  • The rejection region is split across both tails of the distribution
  • More conservative; requires a larger effect to reach significance

For our clinical trial, a two-tailed test asks: "Is the drug different from placebo?" It could be better or worse.

One-tailed test (directional)

Use this when you only care about a difference in one specific direction.

  • $H_1$: $\mu_{\text{treatment}} > \mu_{\text{control}}$
  • All rejection probability is concentrated in one tail
  • More powerful for detecting effects in the hypothesized direction

For our trial, a one-tailed test asks: "Is the drug better than placebo?" If it's worse, we don't care; it's a failure regardless.

| Aspect | One-Tailed | Two-Tailed |
|---|---|---|
| Hypothesis | Directional ($>$ or $<$) | Non-directional ($\neq$) |
| Rejection region | One side only | Both sides |
| P-value | Half of two-tailed (if direction matches) | Full |
| Power | Higher (for the specified direction) | Lower |
| Common use | Clinical trials, quality control | General research, exploratory analysis |

Common Pitfall: Never switch from a two-tailed to a one-tailed test after seeing your results just to get a smaller p-value. This is a textbook form of p-hacking. You must commit to the test direction before collecting data. Pre-registration of your analysis plan prevents this temptation entirely.

T-test assumptions

Parametric tests like the t-test make mathematical assumptions about your data. When these assumptions break down, your p-values become unreliable. Here's what to check.

1. Independence. Each observation must be independent of every other observation. In our clinical trial, one patient's recovery shouldn't influence another's. This breaks down in clustered designs (patients in the same hospital, students in the same classroom). If independence is violated, consider a mixed-effects model instead.

2. Normality. The data (or the sampling distribution of the mean) should be approximately normal. The Central Limit Theorem rescues you with large samples (typically $n > 30$ per group), because the distribution of sample means becomes normal regardless of the underlying data shape. With small samples, check a QQ plot or run a Shapiro-Wilk test.

3. Homogeneity of variance. The two groups should have roughly similar spread. When variances differ substantially, use Welch's t-test (set equal_var=False in scipy). We did exactly this in our code above.

What to do when assumptions fail
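
Before trusting the t-test, run the checks. The sketch below uses illustrative synthetic data (the seed and parameters are assumptions, so exact p-values will differ slightly from the output shown). It applies Shapiro-Wilk for normality, Levene's test for equal variances, and Mann-Whitney U as a non-parametric fallback:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)  # illustrative synthetic trial data
control = rng.normal(1.0, 3.5, 150)
treatment = rng.normal(6.0, 4.0, 150)

print("=== Normality Check (Shapiro-Wilk) ===")
for name, data in [("Control", control), ("Treatment", treatment)]:
    stat, p = stats.shapiro(data)
    verdict = "normal" if p > 0.05 else "non-normal"
    print(f"{name} p-value: {p:.4f} ({verdict})")

print("\n=== Variance Equality (Levene's Test) ===")
stat, p = stats.levene(control, treatment)
print(f"P-value: {p:.4f} ({'equal' if p > 0.05 else 'unequal'} variances)")

print("\n=== Non-Parametric Alternative (Mann-Whitney U) ===")
u_stat, p_u = stats.mannwhitneyu(treatment, control, alternative="greater")
print(f"U-statistic: {u_stat}")
print(f"P-value (one-sided): {p_u:.2e}")
print(f"Conclusion: {'Reject H0' if p_u <= 0.05 else 'Fail to reject H0'}")
```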

Expected output:

code
=== Normality Check (Shapiro-Wilk) ===
Control p-value: 0.9480 (normal)
Treatment p-value: 0.1150 (normal)

=== Variance Equality (Levene's Test) ===
P-value: 0.0283 (unequal variances)

=== Non-Parametric Alternative (Mann-Whitney U) ===
U-statistic: 18974.0
P-value (one-sided): 4.30e-25
Conclusion: Reject H0

Both groups pass the normality check, but Levene's test flags unequal variances (p = 0.028), confirming that Welch's t-test was the right choice. Notice that the Mann-Whitney U test — which makes no normality assumption at all — reaches the same conclusion with a similarly tiny p-value. When assumptions are in doubt, non-parametric tests provide a useful safety net.

If your data contains extreme values that might distort the t-test, consider cleaning them first. Our article on Statistical Outlier Detection covers techniques for identifying and handling anomalous data points.

Choosing the right statistical test

Not every comparison calls for a t-test. The right test depends on your data type, number of groups, and whether observations are paired or independent.

Decision tree for selecting the appropriate statistical test

| Scenario | Data Type | Groups | Recommended Test |
|---|---|---|---|
| Two independent group means | Continuous | 2 | Independent t-test (Welch's) |
| Before/after measurements on same subjects | Continuous | 2 (paired) | Paired t-test |
| Three or more group means | Continuous | 3+ | One-way ANOVA |
| Association between categorical variables | Categorical | 2+ | Chi-square test |
| Proportions between groups | Binary | 2 | Z-test for proportions |
| Non-normal continuous data, two groups | Ordinal/continuous | 2 | Mann-Whitney U |
| Non-normal continuous data, three+ groups | Ordinal/continuous | 3+ | Kruskal-Wallis |

For categorical comparisons — like whether response rates differ between treatment groups — see our full guide on Chi-Square Tests. When you need to model the relationship between continuous variables rather than test group differences, Linear Regression is the appropriate framework.

When to use hypothesis testing (and when not to)

Hypothesis testing is powerful, but it's not always the right tool. Understanding its boundaries is as important as understanding the mechanics.

When hypothesis testing fits

  • Confirming a specific claim. "Does the new drug improve outcomes?" "Did the redesign increase conversions?" Binary yes/no questions with a clear baseline.
  • Controlled experiments. A/B tests, randomized clinical trials, and lab experiments where you can isolate variables and control for confounders.
  • Publishing or regulatory contexts. Journal papers, FDA submissions, and any setting where the community expects frequentist significance testing.

When to choose something else

  • You need to quantify the size of an effect, not just its existence. A hypothesis test tells you "the difference is real" but not "the difference is big enough to matter." Confidence intervals give you a range for the true effect size, which is often more useful for business decisions.
  • You have prior information you want to incorporate. Bayesian inference lets you combine prior knowledge with observed data. If you've run 10 similar clinical trials before, that information shouldn't be thrown away. Bayesian approaches produce posterior distributions rather than binary reject/fail-to-reject decisions.
  • Your sample size is enormous. With millions of data points, even trivially small effects become "statistically significant." A $0.01 improvement in average order value might be significant with $n = 10$ million, but it's meaningless for business decisions. In big-data settings, focus on effect sizes and practical significance instead.
  • You're exploring rather than confirming. Hypothesis testing is a confirmatory tool. If you're fishing through 500 features looking for interesting patterns, you'll find "significant" results by pure chance (at $\alpha = 0.05$, roughly 25 out of 500). Exploratory analysis requires different tools and strict multiple-comparison corrections (like Bonferroni or Benjamini-Hochberg).

Key Insight: The most common misuse of hypothesis testing isn't getting the math wrong; it's applying the framework to questions it wasn't designed to answer. Always ask yourself: "Do I need a yes/no decision, or do I need to understand the magnitude and uncertainty of an effect?" If it's the latter, confidence intervals or Bayesian methods will serve you better.

Effect size and practical significance

Statistical significance alone isn't enough. A result can be statistically significant (tiny p-value) while being practically meaningless (tiny effect). Effect size metrics quantify how big the difference actually is.

Cohen's d is the most common effect size for comparing two means:

$$d = \frac{\bar{x}_1 - \bar{x}_2}{s_p}$$

Where:

  • $\bar{x}_1$ is the treatment group mean
  • $\bar{x}_2$ is the control group mean
  • $s_p$ is the pooled standard deviation of both groups

In Plain English: Cohen's d expresses the difference between groups in standard deviation units. In our clinical trial, $d = 1.39$ means the treatment group's average improvement is 1.39 standard deviations above the control group's. That's a very large effect — the average treated patient improved more than about 92% of untreated patients.
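
The formula is straightforward to implement by hand (a sketch; the synthetic data parameters are illustrative assumptions):

```python
import numpy as np

def cohens_d(x1, x2):
    """Difference in means, scaled by the pooled standard deviation."""
    x1, x2 = np.asarray(x1, dtype=float), np.asarray(x2, dtype=float)
    n1, n2 = len(x1), len(x2)
    # Pooled SD weights each group's variance by its degrees of freedom
    pooled_var = ((n1 - 1) * x1.var(ddof=1) + (n2 - 1) * x2.var(ddof=1)) / (n1 + n2 - 2)
    return (x1.mean() - x2.mean()) / np.sqrt(pooled_var)

rng = np.random.default_rng(42)  # illustrative synthetic trial data
treatment = rng.normal(6.0, 4.0, 150)
control = rng.normal(1.0, 3.5, 150)
d_value = cohens_d(treatment, control)
print(f"Cohen's d: {d_value:.2f}")
```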

Cohen's conventional benchmarks:

| Cohen's d | Interpretation | Clinical Trial Analogy |
|---|---|---|
| 0.2 | Small effect | Drug barely moves the needle |
| 0.5 | Medium effect | Noticeable improvement for most patients |
| 0.8 | Large effect | Clear, clinically meaningful improvement |
| 1.2+ | Very large effect | Hard to miss even without statistics |

Pro Tip: Always report effect size alongside your p-value. A paper that says "p < 0.001" without mentioning effect size is hiding half the story. Reviewers and practitioners increasingly demand both. The p-value tells you whether the effect is real; Cohen's d tells you whether it matters.

Multiple testing and the Bonferroni correction

When you run multiple hypothesis tests on the same dataset, the probability of at least one false positive snowballs. This is the multiple comparisons problem, and ignoring it is one of the most common sources of spurious "discoveries."

If you test 20 independent hypotheses at $\alpha = 0.05$, the probability of at least one false positive is:

$$P(\text{at least one false positive}) = 1 - (1 - \alpha)^m$$

Where:

  • $\alpha$ is the significance level for each individual test
  • $m$ is the number of tests performed

In Plain English: With 20 tests at $\alpha = 0.05$, the overall false positive rate is $1 - (1 - 0.05)^{20} \approx 0.64$. That's a 64% chance of at least one spurious significant result. Run 20 uncorrected tests and, more often than not, you'll get at least one "finding" that isn't real.

The simplest fix is the Bonferroni correction: divide your significance level by the number of tests. Testing 20 hypotheses? Use $\alpha = 0.05 / 20 = 0.0025$ for each individual test. It's conservative (you'll miss some true effects), but it controls the family-wise error rate strictly.
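
A quick simulation makes the problem concrete. In the sketch below (the seed and sample sizes are arbitrary assumptions, so counts may differ slightly from the output shown), the null hypothesis is true in all 20 tests, so any significant result is a false positive:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)  # arbitrary seed; results vary with it
n_tests, alpha = 20, 0.05

# Every test compares two samples drawn from the SAME distribution,
# so H0 is true in all 20 tests
p_values = []
for _ in range(n_tests):
    a = rng.normal(0, 1, 50)
    b = rng.normal(0, 1, 50)
    p_values.append(stats.ttest_ind(a, b).pvalue)
p_values = np.array(p_values)

print(f"Without correction (alpha = {alpha}):")
print(f"  False positives: {(p_values < alpha).sum()} out of {n_tests}")

bonferroni_alpha = alpha / n_tests
print(f"\nWith Bonferroni correction (alpha = {bonferroni_alpha}):")
print(f"  False positives: {(p_values < bonferroni_alpha).sum()} out of {n_tests}")

print(f"\nSmallest 5 p-values: {np.round(np.sort(p_values)[:5], 4)}")
```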

Expected output:

code
=== Multiple Testing Demonstration ===
Number of tests: 20
Tests where H0 is actually true: ALL 20

Without correction (alpha = 0.05):
  False positives: 1 out of 20

With Bonferroni correction (alpha = 0.0025):
  False positives: 0 out of 20

Smallest 5 p-values: [0.0126 0.0658 0.0768 0.1379 0.3312]

Without correction, 1 of the 20 tests falsely flags significance even though there's genuinely no effect in any of them. The Bonferroni correction eliminates all false positives. This matters enormously in genomics (testing thousands of genes), A/B testing with multiple metrics, and any analysis where you're testing many hypotheses simultaneously.

Common mistakes to avoid

After years of seeing hypothesis testing applied (and misapplied) across clinical research, tech companies, and academic papers, these are the mistakes that cause the most damage.

1. P-hacking. Trying multiple analyses, subgroups, and variable transformations until something reaches $p < 0.05$, then reporting only the significant result. This is dishonest and inflates the false positive rate far beyond the nominal $\alpha$.

2. Confusing statistical significance with practical significance. A p-value of 0.001 with Cohen's d of 0.02 means the effect is real but tiny. Don't make $10 million decisions based on a statistically significant but practically irrelevant result.

3. Ignoring power before running the study. An underpowered study (say, 20 patients per group) has a high probability of missing real effects. Always compute the required sample size before data collection.

4. Treating "fail to reject $H_0$" as "no effect exists." Absence of evidence is not evidence of absence. A non-significant result with a wide confidence interval means you don't know, not that the effect is zero.

5. Violating independence. Measuring the same patient twice and treating both measurements as independent observations artificially inflates your sample size. Use paired tests or mixed-effects models when observations are related.

Proper data splitting practices help prevent these pitfalls by keeping your confirmation analysis on truly unseen data.

Conclusion

Hypothesis testing converts vague claims about data into structured decisions with explicit error rates. The framework forces you to state what you believe ($H_0$), define how much risk you'll accept ($\alpha$), and let the evidence speak through the p-value. It's not perfect — it can't tell you how big an effect is, and it breaks down when applied carelessly to large datasets or multiple comparisons — but when used correctly, it remains the foundation of empirical decision-making.

The most important takeaway isn't the math; it's the discipline. Formulate your hypothesis before looking at data. Choose your significance level before running the test. Report effect sizes alongside p-values. And when you "fail to reject," resist the temptation to call it proof of nothing.

If you want to go deeper into the distributional mathematics behind these tests, our guide on Probability Distributions covers the normal, t, and chi-square distributions that underpin every calculation we've discussed. For hands-on practice with controlled experiments, the A/B Testing guide walks through the full experimental design process end to end. And when your hypothesis testing involves categorical data rather than continuous measurements, Chi-Square Tests is the natural next step.

Frequently Asked Interview Questions

Q: What is the difference between a Type I and Type II error? Which is worse?

A Type I error means rejecting a true null hypothesis (false positive); a Type II error means failing to reject a false null hypothesis (false negative). Which is worse depends on context. In medical testing, a Type I error (approving an ineffective drug) can harm patients, while in fraud detection, a Type II error (missing actual fraud) can be more costly. There's no universal answer — you must reason about the specific domain.

Q: A colleague says "the p-value is 0.03, so there's a 97% chance the treatment works." What's wrong with this statement?

The p-value doesn't give the probability that the hypothesis is true or false. It gives the probability of seeing data this extreme or more extreme assuming the null hypothesis is true. A p-value of 0.03 means "if the treatment had zero effect, there's only a 3% chance of observing this large a difference." To compute the probability a treatment works, you'd need Bayesian inference with a prior.

Q: When would you choose a one-tailed test over a two-tailed test?

Use a one-tailed test when you have a strong directional hypothesis specified before data collection, and effects in the opposite direction are irrelevant. For example, testing whether a new drug improves (not just changes) patient outcomes. The critical rule: you must commit to the direction before seeing data. Switching to one-tailed after observing results is p-hacking.

Q: Your A/B test on 10 million users shows p = 0.001 but the conversion rate improvement is 0.01%. What do you recommend?

With 10 million users, even minuscule differences become statistically significant. The 0.01% improvement is real but practically meaningless for most businesses. I'd recommend looking at effect size (Cohen's d or relative lift) and computing the expected revenue impact. Statistical significance doesn't equal business significance — and the engineering cost to ship the change might exceed the revenue gain.

Q: How do you determine the right sample size for a hypothesis test?

Run a power analysis before data collection. Specify four inputs: desired significance level ($\alpha$, usually 0.05), target power ($1 - \beta$, usually 0.80), the minimum effect size you want to detect, and the expected variance. Python's statsmodels.stats.power module computes the required sample size. Underpowered studies waste resources because they can't reliably detect real effects.
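
For instance, the classic "medium effect" scenario looks like this (a sketch; the effect size of $d = 0.5$ is an assumed input for illustration):

```python
from statsmodels.stats.power import TTestIndPower

# Inputs: minimum detectable effect (Cohen's d), alpha, and target power
n_per_group = TTestIndPower().solve_power(
    effect_size=0.5, alpha=0.05, power=0.80, alternative="two-sided"
)
print(f"Required sample size per group: {n_per_group:.0f}")  # ~64
```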

Q: What assumptions does a two-sample t-test make, and what do you do when they're violated?

The t-test assumes independence between observations, approximate normality (relaxed with large $n$ by the CLT), and homogeneity of variance (relaxed by using Welch's t-test). When normality fails with small samples, switch to a non-parametric test like Mann-Whitney U. When independence fails (repeated measures or clustered data), use paired tests or mixed-effects models.

Q: Explain the multiple comparisons problem. How would you handle testing 50 hypotheses on the same dataset?

Testing 50 hypotheses at $\alpha = 0.05$ gives roughly a 92% chance of at least one false positive. The simplest correction is Bonferroni: set $\alpha = 0.05 / 50 = 0.001$ per test. This is conservative but controls the family-wise error rate. For less conservative control, use Benjamini-Hochberg (FDR control), which limits the expected proportion of false discoveries rather than the probability of any false discovery.
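
Both corrections are available in statsmodels (a sketch; the p-values below are hypothetical numbers chosen for illustration):

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

# Hypothetical p-values from 10 tests (illustrative numbers)
p_values = np.array([0.0004, 0.009, 0.012, 0.019, 0.031,
                     0.044, 0.060, 0.180, 0.420, 0.810])

# Bonferroni: controls the family-wise error rate (strict)
reject_bonf, _, _, _ = multipletests(p_values, alpha=0.05, method="bonferroni")

# Benjamini-Hochberg: controls the false discovery rate (less conservative)
reject_bh, _, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")

print(f"Bonferroni rejections: {reject_bonf.sum()}")
print(f"Benjamini-Hochberg rejections: {reject_bh.sum()}")
```

Benjamini-Hochberg keeps more discoveries than Bonferroni at the cost of a small, controlled proportion of false ones.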

Hands-On Practice

The following code implements the hypothesis testing framework discussed in the article. We will act as the 'data judge,' determining if the new treatment significantly improves patient health compared to a placebo. Using the scipy.stats library, we perform a T-test to analyze the continuous 'improvement' score and use matplotlib to visually inspect the evidence before rendering a verdict based on the p-value.

Dataset: Clinical Trial (Statistics & Probability) Clinical trial dataset with 1000 patients designed for statistics and probability tutorials. Contains treatment groups (4 levels) for ANOVA, binary outcomes for hypothesis testing and A/B testing, confounders for causal inference, time-to-event data for survival analysis, and pre-generated distribution samples.

By running this code, we successfully applied the 'Innocent Until Proven Guilty' framework to data. The low p-value obtained from the T-test provides the overwhelming evidence required to reject the null hypothesis, confirming that the observed improvement in the treatment group is not just a statistical fluke. Additionally, the Chi-Square test reinforced this by showing a significant difference in response rates.