<!-- slug: chi-square-tests-categorical-data-python --> <!-- excerpt: Master Chi-Square tests for categorical data. Covers Goodness of Fit, Independence tests, expected frequencies, effect size, and Python scipy implementation. -->
You can compute the mean height of a basketball team or the standard deviation of stock prices. But what happens when your data is categorical? There's no "average" of {Placebo, Drug A, Drug B}. You can't run a t-test on treatment labels. Categorical data demands its own toolkit, and the Chi-Square ($\chi^2$) test sits at the center of it.
Karl Pearson introduced the Chi-Square test in his 1900 paper, and it remains one of the most frequently used tests in applied statistics. It answers a simple question: does the pattern you see in your categorical data reflect a real relationship, or could random chance alone produce it? In a clinical trial comparing four treatment arms, for example, the Chi-Square test tells you whether the variation in recovery rates across groups is statistically meaningful or just noise.
Throughout this article, we'll work with a single running example: a 1,000-patient clinical trial comparing Placebo, Drug A, Drug B, and Drug C. Every formula, code block, and table will reference this dataset so the concepts stay grounded in one concrete scenario.
The Chi-Square test and its two main variants
The Chi-Square test is a hypothesis test that measures how far observed categorical counts deviate from expected counts under a null hypothesis. If the observed data matches expectations, the statistic is near zero. The bigger the gap between what you observe and what you'd expect if nothing interesting were happening, the larger the statistic and the stronger the evidence against the null.
Data scientists rely on two main variants:
| Variant | Question it answers | Input | Example |
|---|---|---|---|
| Goodness of Fit | Does a single categorical variable follow a specific distribution? | One variable, known expected proportions | "Are patients equally split across 4 treatment groups?" |
| Test of Independence | Are two categorical variables related? | Two variables in a contingency table | "Does the treatment group affect the recovery rate?" |
A third variant, the Homogeneity test, checks whether different populations have the same distribution of a categorical variable. It uses the same math as the Test of Independence but differs in study design. If you recruited patients from three hospitals and want to know whether recovery rates differ by hospital, that's a homogeneity test.
*Figure: Which Chi-Square test to use based on your research question*
Key Insight: Both the Independence and Homogeneity tests use identical calculations (same formula, same degrees of freedom). The distinction is conceptual: Independence tests sample from one population and ask whether two variables are related. Homogeneity tests sample from multiple populations and ask whether a single variable has the same distribution everywhere.
The Chi-Square formula
The Chi-Square statistic accumulates squared differences between observed and expected counts, normalized by the expected counts. This normalization is important: a discrepancy of 10 patients matters far more when you expected 20 than when you expected 10,000.
$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$$
Where:
- $\chi^2$ is the Chi-Square test statistic
- $O_i$ is the observed frequency (the actual count) in category $i$
- $E_i$ is the expected frequency (the count predicted by the null hypothesis) in category $i$
- $k$ is the total number of categories (or cells in a contingency table)
- $\sum$ denotes summation across all categories
In Plain English: For each cell in your data, ask: "How far off was my prediction?" Square the difference (so negative and positive gaps both count as misses), then divide by the expected count to put things in perspective. A shortfall of 35 patients is alarming when you expected 151 but trivial when you expected 10,000. Add up all those standardized misses, and you get a single "surprise score." A higher score means your data looks less and less like what you'd see if the null hypothesis were true.
The resulting $\chi^2$ value follows a Chi-Square probability distribution, and you compare it against a critical value (or compute a p-value) to decide whether to reject the null.
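Before reaching for scipy, it helps to compute the formula once by hand. The sketch below uses the treatment-group sizes from our running example against an equal-split expectation (the same comparison is run formally in the Goodness of Fit section later):

```python
import numpy as np

# Observed group sizes from the 1,000-patient trial (Placebo, Drug A, Drug B, Drug C)
observed = np.array([287, 256, 242, 215])
# Expected counts if patients were split equally across the 4 groups
expected = np.array([250, 250, 250, 250])

# Chi-Square: sum of squared gaps, each scaled by its expected count
chi_square = np.sum((observed - expected) ** 2 / expected)
print(f"Chi-Square statistic: {chi_square:.4f}")  # 10.7760
```

Each term is one cell's standardized "miss"; the Placebo gap of 37 patients contributes $37^2/250 \approx 5.5$ of the total on its own.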
Contingency tables and crosstabs
A contingency table (also called a crosstab) is a frequency matrix showing how observations distribute across the categories of two variables. It's the required input for the Test of Independence.
In our clinical trial, we have patients assigned to four treatment groups (Placebo, Drug A, Drug B, Drug C) and a binary outcome: responded (Yes or No). The contingency table looks like this:
| Treatment Group | No Response | Responded | Row Total |
|---|---|---|---|
| Placebo | 171 | 116 | 287 |
| Drug A | 122 | 134 | 256 |
| Drug B | 85 | 157 | 242 |
| Drug C | 96 | 119 | 215 |
| Column Total | 474 | 526 | 1,000 |
Notice the response rates: Placebo sits at roughly 40%, while Drug B reaches about 65%. The Chi-Square test will tell us whether this spread is statistically significant.
Pro Tip: In Python, pd.crosstab() generates these tables in one line. Always inspect the crosstab visually before running the test. Catch data issues early: a column with all zeros, a missing category, or a mislabeled value will silently poison your results.
Calculating expected frequencies
To compute $\chi^2$, you first need the expected counts: what the table would look like if the two variables were completely independent. If the treatment truly had zero effect, each group's response rate would match the overall study-wide rate.
The expected frequency for any cell is:
$$E_{ij} = \frac{R_i \times C_j}{N}$$
Where:
- $E_{ij}$ is the expected count for the cell in row $i$ and column $j$
- $R_i$ is the total number of observations in row $i$
- $C_j$ is the total number of observations in column $j$
- $N$ is the total number of observations in the entire table
In Plain English: Take the Placebo + Responded cell. There are 287 Placebo patients and 526 total responders out of 1,000. If the treatment does nothing, the Placebo group should respond at the same rate as everyone else (52.6%). So we expect $287 \times 526 / 1000 = 150.96$ responders in the Placebo group. We actually observed only 116. That gap of about 35 patients is one of the biggest contributors to the Chi-Square statistic.
*Figure: How observed and expected frequencies are computed from marginal totals in a contingency table*
Here's the full expected frequency table for our clinical trial:
| Treatment Group | Expected No Response | Expected Responded |
|---|---|---|
| Placebo | $287 \times 474 / 1000 = 136.04$ | $287 \times 526 / 1000 = 150.96$ |
| Drug A | $256 \times 474 / 1000 = 121.34$ | $256 \times 526 / 1000 = 134.66$ |
| Drug B | $242 \times 474 / 1000 = 114.71$ | $242 \times 526 / 1000 = 127.29$ |
| Drug C | $215 \times 474 / 1000 = 101.91$ | $215 \times 526 / 1000 = 113.09$ |
Every expected count is well above 5, so the Chi-Square approximation is valid. The Placebo group shows the biggest discrepancy between observed (116) and expected (150.96) respondents, while Drug B overperforms its expectation (157 observed vs 127.29 expected).
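The entire expected table can be produced in one vectorized step from the marginal totals, since $E_{ij} = R_i \times C_j / N$ for every cell at once. A sketch with NumPy, using the observed counts from the table above:

```python
import numpy as np

# Observed counts: rows = Placebo, Drug A, Drug B, Drug C; cols = No Response, Responded
observed = np.array([
    [171, 116],
    [122, 134],
    [85, 157],
    [96, 119],
])

row_totals = observed.sum(axis=1)  # [287, 256, 242, 215]
col_totals = observed.sum(axis=0)  # [474, 526]
n = observed.sum()                 # 1000

# Outer product computes R_i * C_j for every (i, j) pair; divide by N
expected = np.outer(row_totals, col_totals) / n
print(np.round(expected, 2))
```

The outer product is exactly the "row total times column total" rule applied to all eight cells simultaneously.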
Degrees of freedom and the Chi-Square distribution
Degrees of freedom (df) determine which Chi-Square distribution to compare your test statistic against. The formula differs by test type.
For the Goodness of Fit test:
$$df = k - 1$$
Where:
- $k$ is the number of categories
For the Test of Independence:
$$df = (r - 1) \times (c - 1)$$
Where:
- $r$ is the number of rows in the contingency table
- $c$ is the number of columns
In Plain English: Degrees of freedom measure how much "wiggle room" the data has once you fix the row and column totals. In our 4-row by 2-column table, once you know the totals and fill in any 3 cells, you can calculate every remaining cell by subtraction. So $df = (4 - 1) \times (2 - 1) = 3$. A $\chi^2$ value of 32.4 is extremely unlikely under 3 degrees of freedom (the distribution is concentrated near small values), but would be far less surprising with 20 degrees of freedom (the distribution spreads wider).
The Chi-Square distribution is right-skewed and only takes non-negative values. As degrees of freedom increase, the distribution shifts right and becomes more symmetric, approaching a normal distribution for large df.
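scipy.stats.chi2 exposes this distribution directly, which makes the degrees-of-freedom comparison above concrete. A sketch, using the 32.37 statistic our trial produces later in the article:

```python
from scipy.stats import chi2

statistic = 32.37  # Test of Independence result from the running example

# Critical value at alpha = 0.05 for df = 3: any statistic above this rejects H0
critical = chi2.ppf(0.95, df=3)
print(f"Critical value (df=3): {critical:.3f}")  # 7.815

# The same statistic is judged very differently under different df
print(f"p-value at df=3:  {chi2.sf(statistic, df=3):.2e}")
print(f"p-value at df=20: {chi2.sf(statistic, df=20):.4f}")
```

`chi2.sf` is the survival function (1 minus the CDF), which is exactly the right-tail p-value.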
When to use Chi-Square (and when not to)
The Chi-Square test is the right tool in specific situations. Using it where its assumptions don't hold produces meaningless p-values.
Use Chi-Square when:
- Both variables are categorical (nominal or ordinal)
- Each observation is independent (no repeated measures on the same subject)
- All expected cell counts are at least 5
- Your sample is reasonably large (typically $n \geq 50$)
Do NOT use Chi-Square when:
- Expected counts are below 5 in any cell. Use Fisher's Exact Test instead, which computes exact probabilities rather than relying on the Chi-Square approximation.
- Observations are paired or dependent. If you measured the same patient before and after treatment, use McNemar's test.
- Your data is continuous. Don't bin numerical data just to force it into a Chi-Square framework. Use a t-test, ANOVA, or a non-parametric alternative.
- You want to measure effect size. Chi-Square tells you that a relationship exists, not how strong it is. Follow up with Cramer's V or odds ratios.
- You have ordinal data and care about ordering. Chi-Square ignores the ordering of categories. Use the Cochran-Armitage trend test instead.
*Figure: Decision guide for checking Chi-Square assumptions before running the test*
Common Pitfall: Researchers sometimes split continuous variables into arbitrary bins (e.g., "low," "medium," "high" income) just so they can run a Chi-Square test. This throws away information. If your variables are numerical, stick with correlation, regression, or non-parametric tests. Only use Chi-Square when your data is genuinely categorical.
Python implementation with scipy
Let's put all the theory into practice. We'll generate a synthetic clinical trial dataset that matches the counts from our running example, then run both tests using scipy.stats.
Generating the clinical trial data
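A sketch of the data-generation step. Rather than simulating random assignment (which would produce slightly different counts each run), this version rebuilds an equivalent DataFrame deterministically from the cell counts of the running example; the column names `treatment_group` and `responded` follow the printed output:

```python
import pandas as pd

# Cell counts from the running example: (group, responded flag, count)
cells = [
    ("Placebo", 0, 171), ("Placebo", 1, 116),
    ("Drug_A", 0, 122), ("Drug_A", 1, 134),
    ("Drug_B", 0, 85),  ("Drug_B", 1, 157),
    ("Drug_C", 0, 96),  ("Drug_C", 1, 119),
]

# Expand counts into one row per patient
rows = [(group, resp) for group, resp, count in cells for _ in range(count)]
df = pd.DataFrame(rows, columns=["treatment_group", "responded"])

print(f"Total patients: {len(df)}")
print("Group sizes:")
print(df["treatment_group"].value_counts().sort_index())
print(f"Overall response rate: {df['responded'].mean():.1%}")
```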
Expected output:
Total patients: 1000
Group sizes:
treatment_group
Drug_A 256
Drug_B 242
Drug_C 215
Placebo 287
Name: count, dtype: int64
Overall response rate: 52.6%
Goodness of Fit test: balanced group assignment
Before testing drug effectiveness, we should verify the study design. Were patients assigned roughly equally across the four groups? With 1,000 patients, perfect balance would put 250 in each arm.
Null Hypothesis ($H_0$): Patients are equally distributed (250 per group). Alternative Hypothesis ($H_1$): The distribution is not equal.
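A sketch of this test with scipy.stats.chisquare; the observed counts follow the alphabetical group order printed above:

```python
import numpy as np
from scipy.stats import chisquare

# Observed group sizes (Drug_A, Drug_B, Drug_C, Placebo) vs. the equal-split expectation
observed = np.array([256, 242, 215, 287])
expected = np.array([250, 250, 250, 250])

statistic, p_value = chisquare(f_obs=observed, f_exp=expected)

print("Goodness of Fit Test: Equal Group Assignment")
print(f"Observed: {observed}")
print(f"Expected: {expected.tolist()}")
print(f"Chi-Square statistic: {statistic:.4f}")
print(f"Degrees of freedom: {len(observed) - 1}")  # k - 1 = 3
print(f"P-value: {p_value:.4f}")

alpha = 0.05
if p_value < alpha:
    print(f"P-value ({p_value:.4f}) < alpha ({alpha}): Reject H0.")
    print("The groups are NOT equally distributed.")
```

Note that `chisquare` requires the observed and expected totals to match (both sum to 1,000 here), otherwise it raises an error.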
Expected output:
Goodness of Fit Test: Equal Group Assignment
Observed: [256 242 215 287]
Expected: [250, 250, 250, 250]
Chi-Square statistic: 10.7760
Degrees of freedom: 3
P-value: 0.0130
P-value (0.0130) < alpha (0.05): Reject H0.
The groups are NOT equally distributed.
The p-value is about 0.013, which falls below our 0.05 threshold. Technically, the groups aren't perfectly balanced. But look at the actual numbers: the biggest gap is 287 vs 215, a difference of 72 patients out of 1,000. In randomized clinical trials, some imbalance is expected. A statistician would note the imbalance and might adjust for it in a regression model, but it wouldn't invalidate the trial.
Test of Independence: does the drug work?
Now for the critical question: is there a statistically significant relationship between the treatment group and recovery? This is where the Test of Independence comes in.
Null Hypothesis ($H_0$): Treatment group and response are independent (the drugs don't affect recovery). Alternative Hypothesis ($H_1$): Treatment group and response are dependent (the drugs do affect recovery).
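A sketch using scipy.stats.chi2_contingency. In the full pipeline the table would come from `pd.crosstab(df['treatment_group'], df['responded'])`; here it's built directly from the counts so the block runs standalone:

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Observed counts (in practice: pd.crosstab on the raw patient DataFrame)
contingency = pd.DataFrame(
    {"No Response": [122, 85, 96, 171], "Responded": [134, 157, 119, 116]},
    index=pd.Index(["Drug_A", "Drug_B", "Drug_C", "Placebo"], name="treatment_group"),
)

# Returns the statistic, p-value, degrees of freedom, and the expected-count table
statistic, p_value, dof, expected = chi2_contingency(contingency)

print("Contingency Table (Observed):")
print(contingency)
print(f"\nChi-Square statistic: {statistic:.4f}")
print(f"P-value: {p_value:.2e}")
print(f"Degrees of freedom: {dof}")
print("\nExpected Frequencies (if H0 were true):")
print(pd.DataFrame(expected, index=contingency.index,
                   columns=contingency.columns).round(2))
```

`chi2_contingency` computes the expected frequencies for you, which is a convenient cross-check against the hand-calculated table from earlier. (Yates' continuity correction is applied only to 2x2 tables, so it doesn't affect this 4x2 result.)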
Expected output:
Contingency Table (Observed):
No Response Responded
treatment_group
Drug_A 122 134
Drug_B 85 157
Drug_C 96 119
Placebo 171 116
Chi-Square statistic: 32.3680
P-value: 4.38e-07
Degrees of freedom: 3
Expected Frequencies (if H0 were true):
No Response Responded
treatment_group
Drug_A 121.34 134.66
Drug_B 114.71 127.29
Drug_C 101.91 113.09
Placebo 136.04 150.96
Response rates:
Drug_A: 134/256 = 52.3% (expected: 134.7)
Drug_B: 157/242 = 64.9% (expected: 127.3)
Drug_C: 119/215 = 55.3% (expected: 113.1)
Placebo: 116/287 = 40.4% (expected: 151.0)
The p-value is $4.38 \times 10^{-7}$, far below any reasonable significance level. We reject the null hypothesis with extreme confidence: treatment group and recovery rate are not independent. Drug B's 64.9% response rate versus Placebo's 40.4% is not a coincidence.
But here's the catch: the Chi-Square test tells us that a relationship exists. It doesn't tell us which drug is better or how much better. For that, we need post-hoc analysis and effect size measures.
Measuring effect size with Cramer's V
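Cramer's V isn't returned by `chi2_contingency`, but it's a one-liner on top of it: $V = \sqrt{\chi^2 / (n \cdot \min(r-1,\ c-1))}$. A sketch:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts: rows = Drug_A, Drug_B, Drug_C, Placebo; cols = No Response, Responded
observed = np.array([[122, 134], [85, 157], [96, 119], [171, 116]])

chi2_stat = chi2_contingency(observed)[0]
n = observed.sum()
min_dim = min(observed.shape) - 1  # min(rows, cols) - 1

cramers_v = np.sqrt(chi2_stat / (n * min_dim))
print(f"Chi-Square: {chi2_stat:.4f}")
print(f"N: {n}")
print(f"Min(rows, cols) - 1: {min_dim}")
print(f"Cramer's V: {cramers_v:.4f}")
```

Recent scipy versions also offer `scipy.stats.contingency.association(observed, method="cramer")`, which computes the same quantity.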
Expected output:
Chi-Square: 32.3680
N: 1000
Min(rows, cols) - 1: 1
Cramer's V: 0.1799
Effect size interpretation: small to medium
A Cramer's V of 0.18 indicates a small-to-medium association. The relationship is real (the p-value confirms that), but the treatment group explains only a modest share of the variation in recovery. This is typical in medical research where many factors beyond treatment influence outcomes.
Pro Tip: Always report effect size alongside p-values. With a large enough sample, even trivially small differences become "statistically significant." A p-value of 0.001 with a Cramer's V of 0.02 means the effect exists but is practically meaningless.
Visualizing the results
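A sketch of the two-panel figure; the exact layout and styling are assumptions based on the description that follows (stacked proportions on the left, observed vs expected responder counts on the right):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import chi2_contingency

groups = ["Placebo", "Drug_A", "Drug_B", "Drug_C"]
observed = np.array([[171, 116], [122, 134], [85, 157], [96, 119]])

chi2_stat, p_value, dof, expected = chi2_contingency(observed)

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Left panel: stacked response proportions per group
props = observed / observed.sum(axis=1, keepdims=True)
axes[0].bar(groups, props[:, 1], label="Responded")
axes[0].bar(groups, props[:, 0], bottom=props[:, 1], label="No Response")
axes[0].set_ylabel("Proportion")
axes[0].set_title("Response proportions by treatment group")
axes[0].legend()

# Right panel: observed vs expected responder counts side by side
x = np.arange(len(groups))
axes[1].bar(x - 0.2, observed[:, 1], width=0.4, label="Observed responders")
axes[1].bar(x + 0.2, expected[:, 1], width=0.4, label="Expected responders")
axes[1].set_xticks(x)
axes[1].set_xticklabels(groups)
axes[1].set_title("Observed vs expected responders")
axes[1].legend()

fig.suptitle(f"Chi-Square = {chi2_stat:.2f}, p = {p_value:.2e}")
fig.savefig("chi_square_results.png", dpi=100)
print(f"Chi-Square = {chi2_stat:.2f}, p = {p_value:.2e}")
```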
Expected output:
Chi-Square = 32.37, p = 4.38e-07
The left panel shows the stacked proportions: Drug B's "Responded" bar is visibly taller than Placebo's. The right panel makes the gap between observed and expected counts explicit. Drug B exceeded its expected responder count by about 30, while Placebo fell short by about 35.
Post-hoc analysis: which groups differ?
The Chi-Square test tells you the overall relationship is significant, but with four treatment groups, you'll want to know which specific pairs differ. This is analogous to running post-hoc tests after ANOVA.
The standard approach is to examine standardized residuals. The values reported below are Pearson residuals, $(O - E)/\sqrt{E}$; fully adjusted residuals additionally divide by a factor based on the marginal proportions, but both serve the same diagnostic purpose. A residual with an absolute value above 2 flags that particular cell as contributing disproportionately to the overall Chi-Square value.
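A sketch that reproduces the residual table, computing each cell as $(O - E)/\sqrt{E}$:

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

observed = pd.DataFrame(
    {"No Response": [122, 85, 96, 171], "Responded": [134, 157, 119, 116]},
    index=pd.Index(["Drug_A", "Drug_B", "Drug_C", "Placebo"], name="treatment_group"),
)

# Expected counts under independence, from the omnibus test
expected = chi2_contingency(observed)[3]

# Signed, scaled cell-level contributions: (O - E) / sqrt(E)
residuals = (observed - expected) / np.sqrt(expected)
print("Standardized Residuals:")
print(residuals.round(3))

print("\nSignificant cells (|residual| > 2):")
for group in residuals.index:
    for outcome in residuals.columns:
        r = residuals.loc[group, outcome]
        if abs(r) > 2:
            direction = "above expected" if r > 0 else "below expected"
            print(f"{group} x {outcome}: {r:.3f} ({direction})")
```

The sign tells you the direction of the deviation, and squaring each residual recovers that cell's contribution to the overall $\chi^2$.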
Expected output:
Standardized Residuals:
No Response Responded
treatment_group
Drug_A 0.060 -0.057
Drug_B -2.774 2.633
Drug_C -0.585 0.556
Placebo 2.998 -2.846
Significant cells (|residual| > 2):
Drug_B x No Response: -2.774 (below expected)
Drug_B x Responded: 2.633 (above expected)
Placebo x No Response: 2.998 (above expected)
Placebo x Responded: -2.846 (below expected)
The residuals confirm what we suspected: the significant Chi-Square result is driven primarily by Drug B (outperforming) and Placebo (underperforming). Drug A and Drug C show residuals well within the threshold, meaning they don't differ much from the overall average.
Key Insight: Standardized residuals are the Chi-Square test's equivalent of pairwise comparisons. When you report results, don't just say "the test was significant." Identify which cells drive the significance, because that's what decision-makers actually need to know.
Goodness of Fit: beyond equal proportions
The Goodness of Fit test isn't limited to checking for uniform distributions. You can test whether observed data matches any hypothesized distribution. Suppose a previous study found that Drug B is preferred by patients, and you hypothesize that in any clinical trial, the enrollment would follow a 20% Placebo, 25% Drug A, 30% Drug B, 25% Drug C split.
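A sketch of the test against this hypothesized 20/25/30/25 split, again with scipy.stats.chisquare:

```python
import numpy as np
from scipy.stats import chisquare

groups = ["Placebo", "Drug_A", "Drug_B", "Drug_C"]
observed = np.array([287, 256, 242, 215])

# Hypothesized enrollment split from the prior study: 20% / 25% / 30% / 25%
proportions = np.array([0.20, 0.25, 0.30, 0.25])
expected = proportions * observed.sum()

statistic, p_value = chisquare(f_obs=observed, f_exp=expected)

# Per-cell contributions show where the mismatch is worst
contributions = (observed - expected) ** 2 / expected

print("Goodness of Fit: Prior Study Distribution")
print(f"{'Group':<10}{'Observed':>10}{'Expected':>10}{'(O-E)^2/E':>12}")
for g, o, e, c in zip(groups, observed, expected, contributions):
    print(f"{g:<10}{o:>10}{e:>10.1f}{c:>12.3f}")

print(f"\nChi-Square: {statistic:.4f}")
print(f"P-value: {p_value:.4e}")
print(f"df: {len(observed) - 1}")
```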
Expected output:
Goodness of Fit: Prior Study Distribution
Group Observed Expected (O-E)^2/E
Placebo 287 200.0 37.845
Drug_A 256 250.0 0.144
Drug_B 242 300.0 11.213
Drug_C 215 250.0 4.900
Chi-Square: 54.1023
P-value: 1.0671e-11
df: 3
The data strongly rejects this hypothesized distribution. Placebo enrollment was much higher than the 20% hypothesis, and Drug B enrollment was much lower than 30%. Each cell's contribution to $\chi^2$ shows where the mismatch is worst: Placebo alone contributes 37.8 of the total 54.1.
Production considerations
Chi-Square tests are computationally cheap. scipy.stats.chi2_contingency runs in $O(r \times c)$ time, where $r$ and $c$ are the table dimensions. Even a 100-by-100 contingency table with millions of observations computes in milliseconds. The bottleneck is almost always constructing the contingency table from raw data (pd.crosstab() is $O(n)$, where $n$ is the number of raw data rows).
Memory is rarely an issue either. The test only needs the aggregated counts, not the raw observations. If your dataset has 100 million rows but only 10 categories, the contingency table is still tiny.
A few practical notes for production pipelines:
- Automate assumption checks. Before running Chi-Square, verify that all expected counts exceed 5. If any cell falls below, automatically switch to Fisher's exact test or merge sparse categories.
- Correct for multiple comparisons. If you're testing dozens of variable pairs, apply Bonferroni correction or control the false discovery rate (FDR) with Benjamini-Hochberg. Otherwise, 1 in 20 tests will appear significant by chance.
- Log everything. In A/B testing pipelines, record the $\chi^2$ value, p-value, degrees of freedom, sample size, effect size, and whether any expected count fell below 5. This audit trail is essential when results are questioned months later.
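The first of these checks can be automated with a small guard. A sketch (note that scipy's `fisher_exact` handles only 2x2 tables, so larger sparse tables need category merging instead):

```python
import numpy as np
from scipy.stats import chi2_contingency, fisher_exact

def safe_categorical_test(observed, min_expected=5):
    """Run chi-square, falling back to Fisher's exact test for sparse 2x2 tables."""
    observed = np.asarray(observed)
    stat, p, _, expected = chi2_contingency(observed)

    if expected.min() >= min_expected:
        return "chi-square", stat, p
    if observed.shape == (2, 2):
        stat, p = fisher_exact(observed)
        return "fisher", stat, p
    raise ValueError(
        f"Expected count below {min_expected} in a {observed.shape} table; "
        "merge sparse categories before testing."
    )

# Healthy table from the trial: chi-square runs normally
name, stat, p = safe_categorical_test([[171, 116], [122, 134], [85, 157], [96, 119]])
print(name, round(stat, 4), f"{p:.2e}")

# Sparse 2x2 table (min expected count is 4): switches to Fisher's exact test
name, stat, p = safe_categorical_test([[2, 8], [6, 4]])
print(name, round(stat, 4), round(p, 4))
```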
Conclusion
The Chi-Square test is the standard tool for determining whether categorical variables are related. It works by measuring the gap between what you observe and what you'd expect under the null hypothesis, then converting that gap into a p-value through the Chi-Square distribution.
The key to using it well goes beyond running chi2_contingency. Always verify the assumptions first, especially the expected count threshold. Report effect size (Cramer's V) alongside p-values because statistical significance and practical significance are different things entirely. And when you get a significant result with more than two categories, dig into the standardized residuals to identify which cells are actually driving the effect.
For a deeper understanding of the p-value framework behind this test, read Mastering Hypothesis Testing. If you're working with continuous outcomes rather than categorical ones, ANOVA is the natural parallel. And if your Chi-Square analysis reveals that a categorical variable predicts an outcome, the next step is often Logistic Regression to model that relationship with full control over confounders.
Frequently Asked Interview Questions
Q: What is the difference between the Chi-Square Goodness of Fit test and the Test of Independence?
The Goodness of Fit test examines whether a single categorical variable follows a specified distribution (e.g., are dice rolls uniformly distributed?). The Test of Independence examines whether two categorical variables are related in a contingency table (e.g., does treatment type affect recovery?). Both use the same formula, but they differ in how expected frequencies are computed and in the degrees of freedom calculation.
Q: When should you use Fisher's Exact Test instead of Chi-Square?
Fisher's Exact Test is the right choice when any expected cell count in the contingency table falls below 5. The Chi-Square test relies on a large-sample approximation to the Chi-Square distribution, and that approximation breaks down with small expected counts. Fisher's test computes exact probabilities, so it works correctly regardless of sample size. For large samples, both tests give nearly identical results, but Fisher's is computationally slower.
Q: A Chi-Square test returns a p-value of 0.001. Does that mean the effect is large?
No. A very small p-value means the relationship is unlikely to be due to chance, but it says nothing about the magnitude of the effect. With a sufficiently large sample, even tiny, practically meaningless differences become statistically significant. Always report an effect size measure like Cramer's V alongside the p-value. A Cramer's V below 0.1 signals a negligible effect regardless of how small the p-value is.
Q: Can you use the Chi-Square test with ordinal data?
You can, but the standard Chi-Square test ignores the ordering of categories. If you have ordinal data like {Low, Medium, High} income brackets, the test treats "Low vs Medium" and "Low vs High" as equally different. The Cochran-Armitage trend test is a better option when you want to test for a monotonic trend across ordered categories.
Q: How do you handle a significant Chi-Square test with more than two groups?
A significant result tells you that at least one group differs from the others, but not which one. You can examine standardized residuals (cells with absolute values above 2 are the main contributors) or run pairwise Chi-Square tests with a Bonferroni correction. This is directly analogous to running post-hoc pairwise t-tests after a significant ANOVA result.
Q: Your A/B test has three variants and a control. Can you run Chi-Square instead of multiple proportion z-tests?
Yes, and it's actually the recommended approach. Running separate z-tests for each variant against the control inflates the Type I error rate because of multiple comparisons. A single Chi-Square test of independence across all four groups controls the overall error rate at your chosen alpha level. If the Chi-Square test is significant, then investigate individual comparisons with appropriate corrections.
Q: What assumptions does the Chi-Square test make about the sampling process?
It assumes observations are independently sampled (no repeated measures), each observation falls into exactly one cell (categories are mutually exclusive and exhaustive), and the sample was drawn randomly from the population. Violating independence is the most common and most damaging assumption violation. For example, if family members are in the same trial, their outcomes are correlated, and the test's p-value becomes unreliable.
Q: What is Cramer's V and how do you interpret it?
Cramer's V is an effect size metric derived from the Chi-Square statistic: $V = \sqrt{\chi^2 / (n \cdot \min(r-1,\ c-1))}$. It ranges from 0 (no association) to 1 (perfect association). For a 2x2 table, it equals the absolute value of the phi coefficient. Rough interpretation: below 0.1 is negligible, 0.1-0.3 is small, 0.3-0.5 is medium, and above 0.5 is large. Always interpret in context because thresholds vary by field.
Hands-On Practice
The following Python code demonstrates how to perform both types of Chi-Square tests using the scipy.stats library. First, we use the Goodness of Fit test to check if the patients were evenly sampled across the four treatment groups. Then, we use the Test of Independence to determine if the treatment received actually impacted patient recovery rates. We visualize the results using matplotlib to verify the statistical findings.
Dataset: Clinical Trial (Statistics & Probability) Clinical trial dataset with 1000 patients designed for statistics and probability tutorials. Contains treatment groups (4 levels) for ANOVA, binary outcomes for hypothesis testing and A/B testing, confounders for causal inference, time-to-event data for survival analysis, and pre-generated distribution samples.
In this analysis, the Goodness of Fit test revealed slight imbalances in group sizes (p < 0.05), likely due to the random assignment process in a sample of this size. More importantly, the Test of Independence returned a very small p-value, leading us to reject the null hypothesis. This statistically confirms that the choice of drug significantly impacts patient recovery rates, validating the apparent differences seen in the charts.