You can calculate the average height of a basketball team. You can find the standard deviation of stock prices. But what do you do when your data isn't numerical? How do you calculate the "average" of {Placebo, Drug A, Drug B}? You can't.
Categorical data requires a different set of tools. When you need to determine if there is a statistically significant relationship between two categories—like whether a new website design actually leads to more signups, or if a drug treatment is more effective than a placebo—you can't rely on t-tests or correlations.
Enter the Chi-Square (χ²) test. It is the bedrock of categorical analysis, allowing you to move beyond "it looks like there's a difference" to "there is less than a 0.1% chance we'd see a pattern this strong by random chance alone."
In this guide, we will break down the mathematics of the Chi-Square test, apply it to a real-world clinical trial dataset, and show you exactly how to implement it in Python.
What is a Chi-Square test?
A Chi-Square test is a statistical hypothesis test used to determine if there is a significant difference between the expected frequencies and the observed frequencies in one or more categories. It quantifies the discrepancy between what the data would look like if there were no relationship (the null hypothesis) and what the data actually looks like.
There are two main types of Chi-Square tests that data scientists use daily:
- The Goodness of Fit Test: Checks if a single categorical variable follows a hypothesized distribution (e.g., "Is this die fair?").
- The Test of Independence: Checks if two categorical variables are related (e.g., "Does the treatment group affect the recovery rate?").
Both tests rely on the same fundamental concept: measuring "surprise." If your observed data matches your expectations perfectly, the Chi-Square score is zero. The more the data deviates from expectation, the higher the score, and the more likely the difference is real.
How does the Chi-Square statistic work?
The Chi-Square statistic (χ²) accumulates the squared differences between observed and expected counts, normalized by the expected counts. This normalization ensures that a deviation of 5 observations matters more when the expected count is small (10) than when it is large (10,000).
The formula is elegantly simple:

χ² = Σ (O − E)² / E

Where:
- O = Observed frequency (the actual count in your data)
- E = Expected frequency (the count predicted by the null hypothesis)
- Σ = Summation over all categories (or cells in a table)
In Plain English: This formula asks, "For every category, how far off was our prediction?" We square the difference to make negatives positive (and to penalize big misses more heavily), and then we divide by the expected number to keep things in perspective. A miss of 10 people is a huge deal if you only expected 20, but a rounding error if you expected 10,000. The sum of these standardized "misses" is your total "Surprise Score."
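To make this concrete, the formula can be computed by hand in a few lines of Python. The die-roll counts below are made up purely for illustration:

```python
# Toy example: 60 rolls of a die; is it fair?
observed = [5, 8, 9, 8, 10, 20]   # hypothetical counts for faces 1-6
expected = [10] * 6               # a fair die predicts 60 / 6 = 10 per face

# Sum of squared standardized deviations: chi2 = sum((O - E)^2 / E)
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(f"Chi-Square statistic: {chi2:.2f}")  # → 13.40
```

The face showing 20 counts contributes 100/10 = 10 points of "surprise" on its own, dominating the total score.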
What is a contingency table?
A contingency table (or crosstab) is a matrix that displays the frequency distribution of variables. It is the prerequisite for the Chi-Square Test of Independence. The rows represent the categories of one variable, and the columns represent the categories of the other.
To understand this, let's look at our clinical trial dataset. We have patients assigned to different Treatment Groups (Placebo, Drug A, Drug B, Drug C) and a binary outcome: did they Respond (Yes/No)?
Here is how we structure the data to analyze it:
| Treatment Group | No Response (0) | Responded (1) | Total |
|---|---|---|---|
| Placebo | 171 | 116 | 287 |
| Drug A | 122 | 134 | 256 |
| Drug B | 85 | 157 | 242 |
| Drug C | 96 | 119 | 215 |
| Total | 474 | 526 | 1000 |
This table is the raw input for our calculation. The Chi-Square test will tell us if the variation in response rates (Placebo ~40% vs. Drug B ~65%) is statistically significant or just random noise.
💡 Pro Tip: In Python, the pandas.crosstab() function generates these tables instantly. Always inspect your crosstab visually before running the test to ensure your data assumes the shape you expect.
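As a quick sketch of that workflow, here is pandas.crosstab on a tiny hypothetical DataFrame (the column names mirror the article's dataset, and margins=True appends the row/column totals, like the table above):

```python
import pandas as pd

# Tiny illustrative dataset (hypothetical values, not the real trial data)
df = pd.DataFrame({
    'treatment_group': ['Placebo', 'Placebo', 'Drug_A', 'Drug_A', 'Drug_B'],
    'responded_to_treatment': [0, 1, 1, 1, 0],
})

# margins=True adds an 'All' row and column holding the totals
table = pd.crosstab(df['treatment_group'], df['responded_to_treatment'], margins=True)
print(table)
```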
How do we calculate Expected Frequencies?
To calculate the Chi-Square statistic, we first need to know what the world would look like if the variables were completely independent. If the treatment had zero effect, what numbers would we expect to see in the table above?
If the variables are independent, the probability of falling into a specific cell is the product of the row probability and the column probability.
In Plain English: To find the expected count for a specific cell (like "Placebo patients who Responded"), take the total number of Placebo patients and multiply it by the overall success rate of the entire study. If 28.7% of patients are in the Placebo group, and 52.6% of all patients responded, then, if the drug does nothing, we'd expect 28.7% × 52.6% ≈ 15.1% of the total population to be in that specific cell.
Let's calculate the expected value for Placebo + Responded:
- Row Total (Placebo) = 287
- Column Total (Responded) = 526
- Grand Total = 1000
- Expected = (Row Total × Column Total) / Grand Total = (287 × 526) / 1000 ≈ 150.96
Reality Check:
- Expected: ~151 people
- Observed: 116 people
We observed far fewer responders in the Placebo group than we would expect if the treatment didn't matter. This large gap contributes significantly to our Chi-Square score, suggesting the treatment does matter.
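The same arithmetic extends to every cell at once: the outer product of the row and column totals, divided by the grand total, yields the full expected-frequency matrix. A short sketch using the totals from the table above:

```python
import numpy as np

row_totals = np.array([287, 256, 242, 215])  # Placebo, Drug A, Drug B, Drug C
col_totals = np.array([474, 526])            # No Response, Responded
grand_total = 1000

# Outer product gives the full 4x2 expected-frequency matrix in one step
expected = np.outer(row_totals, col_totals) / grand_total
print(expected)
# The Placebo + Responded cell: 287 * 526 / 1000 = 150.962
```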
When does the Chi-Square test fail?
The Chi-Square test is robust, but it is not magic. It relies on specific assumptions about your data. Violating these assumptions renders the p-value meaningless.
The most critical assumptions are:
- Categorical Data: Both variables must be categorical (nominal or ordinal). You cannot use continuous variables like "Height" without binning them first.
- Independence of Observations: Each data point must be independent. You cannot measure the same patient three times and treat them as three separate rows.
- Large Expected Frequencies: A common rule of thumb is that every cell in the contingency table should have an expected count of at least 5.
⚠️ Common Pitfall: If you have small sample sizes (e.g., a cell with only 1 or 2 observations), the Chi-Square approximation breaks down. In these cases, you should use Fisher's Exact Test instead, which calculates the exact probability rather than an approximation.
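For instance, scipy.stats.fisher_exact handles a small 2x2 table exactly (the counts below are hypothetical):

```python
import scipy.stats as stats

# Hypothetical 2x2 table with small counts: rows = group, columns = outcome
small_table = [[2, 8],
               [7, 3]]

# Fisher's Exact Test computes the exact p-value, no large-sample approximation
odds_ratio, p_value = stats.fisher_exact(small_table)
print(f"Odds ratio: {odds_ratio:.3f}, p-value: {p_value:.4f}")
```

Note that fisher_exact only accepts 2x2 tables; for larger tables with sparse cells, consider collapsing categories or using a permutation-based approach.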
How to implement Chi-Square in Python
Let's apply this to our clinical trial dataset. We will answer two business questions:
- Goodness of Fit: Are patients evenly distributed across the four treatment groups?
- Test of Independence: Is there a significant relationship between the Treatment Group and the Response?
We will use scipy.stats for the calculations and pandas for data manipulation.
Loading the Data
First, let's load our prepared clinical trial dataset.
```python
import pandas as pd
import scipy.stats as stats
import numpy as np

# Load the dataset
df = pd.read_csv('lds_stats_probability.csv')

# Quick look at the relevant columns
print(df[['treatment_group', 'responded_to_treatment']].head())
```
Scenario 1: The Goodness of Fit Test
Before testing the drug's effectiveness, we need to verify our study design. Did we assign roughly equal numbers of patients to the Placebo, Drug A, Drug B, and Drug C groups?
Null Hypothesis (H₀): The groups are equal (250 patients in each).
Alternative Hypothesis (H₁): The groups are not equal.
```python
# Count observed patients in each group
observed_counts = df['treatment_group'].value_counts().sort_index()
print("Observed Counts:")
print(observed_counts)

# Total patients
n = len(df)
expected_count = n / 4  # We expect 250 per group if perfectly balanced

# Create expected array [250, 250, 250, 250]
expected_counts = [expected_count] * 4

# Run Chi-Square Goodness of Fit
chi2_stat, p_val = stats.chisquare(f_obs=observed_counts, f_exp=expected_counts)

print(f"\nChi-Square Statistic: {chi2_stat:.4f}")
print(f"P-value: {p_val:.4e}")
```
Expected Output:
```
Observed Counts:
treatment_group
Drug_A     256
Drug_B     242
Drug_C     215
Placebo    287
Name: count, dtype: int64

Chi-Square Statistic: 10.7760
P-value: 1.3001e-02
```
Interpretation: The p-value is approximately 0.013, which is less than the standard alpha of 0.05. This suggests our groups are not perfectly balanced. However, in a randomized clinical trial, some variation is natural. While statistically significant, the imbalance isn't massive (215 vs 287), but it's something a statistician would note.
Scenario 2: The Test of Independence (Crucial)
Now for the main event. Does the drug work? We test if treatment_group and responded_to_treatment are independent.
Null Hypothesis (H₀): Treatment and Response are independent (the drugs don't work better than Placebo).
Alternative Hypothesis (H₁): Treatment and Response are dependent (the drug affects the outcome).
```python
# 1. Create the Contingency Table
contingency_table = pd.crosstab(df['treatment_group'], df['responded_to_treatment'])
print("Contingency Table (Observed):")
print(contingency_table)

# 2. Run Chi-Square Test of Independence
chi2, p, dof, expected = stats.chi2_contingency(contingency_table)

print(f"\nChi-Square Statistic: {chi2:.4f}")
print(f"P-value: {p:.4e}")
print(f"Degrees of Freedom: {dof}")

print("\nExpected Frequencies (if H0 were true):")
print(pd.DataFrame(expected, index=contingency_table.index, columns=contingency_table.columns))
```
Expected Output:
```
Contingency Table (Observed):
responded_to_treatment    0    1
treatment_group
Drug_A                  122  134
Drug_B                   85  157
Drug_C                   96  119
Placebo                 171  116

Chi-Square Statistic: 32.3680
P-value: 4.3773e-07
Degrees of Freedom: 3

Expected Frequencies (if H0 were true):
responded_to_treatment        0        1
treatment_group
Drug_A                  121.344  134.656
Drug_B                  114.708  127.292
Drug_C                  101.910  113.090
Placebo                 136.038  150.962
```
Analysis:
The p-value is 4.38e-07 (0.000000438), which is far below 0.05.
Conclusion: We reject the null hypothesis. There is a strong statistical dependency between the treatment group and the response rate. Specifically, Drug B (Observed 157 vs Expected 127) performed much better than chance, while the Placebo (Observed 116 vs Expected 151) performed worse.
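To see which cells drive that score, you can inspect the standardized residuals, (O − E)/√E. A sketch using the observed and expected tables printed above (cells with |residual| > 2 contribute heavily):

```python
import numpy as np

# Observed and expected counts, rows ordered Drug_A, Drug_B, Drug_C, Placebo
observed = np.array([[122, 134], [85, 157], [96, 119], [171, 116]])
expected = np.array([[121.344, 134.656], [114.708, 127.292],
                     [101.910, 113.090], [136.038, 150.962]])

# Standardized residual per cell: (O - E) / sqrt(E)
residuals = (observed - expected) / np.sqrt(expected)
print(np.round(residuals, 2))
```

Drug B's "Responded" cell and Placebo's "Responded" cell both land beyond ±2, confirming they are the main sources of the Chi-Square signal.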
Scenario 3: Investigating Confounders (Gender vs. Response)
Sometimes relationships exist where you don't expect them. Let's check if gender affects the response rate.
```python
# Crosstab for Gender vs Response
gender_crosstab = pd.crosstab(df['gender'], df['responded_to_treatment'])

chi2_gender, p_gender, dof_gender, exp_gender = stats.chi2_contingency(gender_crosstab)
print(f"Gender vs Response - P-value: {p_gender:.4e}")
```
Result: You will likely see a significant p-value (e.g., ~0.001), indicating that gender also plays a role in recovery rates. This implies we might need to stratify our analysis or use multivariate techniques (like Logistic Regression) to isolate the drug's effect from the gender effect.
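A simple first step toward stratification is to rerun the independence test within each gender separately. The sketch below uses simulated stand-in data (the real analysis would reuse the clinical trial DataFrame and its column names):

```python
import numpy as np
import pandas as pd
import scipy.stats as stats

# Stand-in data for illustration only (values are randomly generated)
rng = np.random.default_rng(42)
df = pd.DataFrame({
    'gender': rng.choice(['F', 'M'], size=400),
    'treatment_group': rng.choice(['Placebo', 'Drug_A', 'Drug_B', 'Drug_C'], size=400),
    'responded_to_treatment': rng.integers(0, 2, size=400),
})

# Stratify: test treatment vs. response separately within each gender
for gender, subset in df.groupby('gender'):
    table = pd.crosstab(subset['treatment_group'], subset['responded_to_treatment'])
    chi2, p, dof, _ = stats.chi2_contingency(table)
    print(f"{gender}: chi2={chi2:.2f}, p={p:.4f}, dof={dof}")
```

If the treatment effect holds within each stratum, gender is unlikely to explain it away; logistic regression remains the cleaner tool for adjusting for several confounders at once.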
Understanding Degrees of Freedom
You might have noticed the dof (Degrees of Freedom) output in the code above.
The formula is df = (r − 1) × (c − 1), where r is the number of rows and c is the number of columns.
In Plain English: Degrees of freedom represents how much "wiggle room" the data has. In our 4x2 table (4 Treatments, 2 Outcomes), if you know the row totals, column totals, and the values of 3 cells, you can mathematically deduce the values of all other cells. The Chi-Square distribution changes shape based on these degrees of freedom. A calculated χ² of 32.4 is huge for 3 degrees of freedom (rare event), but might be normal for 20 degrees of freedom.
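You can see this directly by converting a χ² score into a p-value with the distribution's survival function:

```python
import scipy.stats as stats

# P(X >= 32.368) for a Chi-Square distribution with 3 degrees of freedom
p_value = stats.chi2.sf(32.368, df=3)
print(f"{p_value:.4e}")  # matches the p-value reported by chi2_contingency

# The same score is far less surprising with 20 degrees of freedom
print(f"{stats.chi2.sf(32.368, df=20):.4f}")
```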
Conclusion
The Chi-Square test is the definitive tool for uncovering relationships in categorical data. It allows you to transform qualitative observations into quantitative evidence.
Here is what we covered:
- The Intuition: Chi-Square measures the "surprise" factor—the distance between what you see and what you'd expect if nothing interesting was happening.
- The Calculation: It sums the squared standardized residuals: χ² = Σ (O − E)² / E.
- The Application: We used it to prove that treatment groups in our clinical trial had significantly different success rates.
While Chi-Square is powerful, it only tells you that a relationship exists, not how strong it is or in which direction. To understand the magnitude of the effect (e.g., "Drug B increases success chance by 20%"), you should follow up with effect size metrics like Cramer's V or Odds Ratios.
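For example, Cramér's V rescales the χ² statistic into a 0-to-1 effect size. A sketch using the numbers from our independence test:

```python
import numpy as np

chi2 = 32.368            # statistic from the independence test above
n = 1000                 # total observations
min_dim = min(4, 2) - 1  # min(rows, cols) - 1 for a 4x2 table

# Cramér's V ranges from 0 (no association) to 1 (perfect association)
cramers_v = np.sqrt(chi2 / (n * min_dim))
print(f"Cramér's V: {cramers_v:.3f}")  # ~0.180, a modest but real effect
```

Recent SciPy versions (1.7+) also expose this directly as scipy.stats.contingency.association(observed, method="cramer").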
To deepen your statistical toolkit, you should explore Mastering Hypothesis Testing to understand the p-value framework better, or read about Probability Distributions to see where these expected values come from. For comparing numerical means rather than categories, check out our guide on ANOVA.
Hands-On Practice
The following Python code demonstrates how to perform both types of Chi-Square tests using the scipy.stats library. First, we use the Goodness of Fit test to check if the patients were evenly sampled across the four treatment groups. Then, we use the Test of Independence to determine if the treatment received actually impacted patient recovery rates. We visualize the results using matplotlib to verify the statistical findings.
Dataset: Clinical Trial (Statistics & Probability). A clinical trial dataset with 1,000 patients designed for statistics and probability tutorials. It contains treatment groups (4 levels) for ANOVA, binary outcomes for hypothesis testing and A/B testing, confounders for causal inference, time-to-event data for survival analysis, and pre-generated distribution samples.
Try It Yourself
In this analysis, the Goodness of Fit test revealed slight imbalances in group sizes (p < 0.05), likely due to the random assignment process in a sample of this size. More importantly, the Test of Independence returned a very small p-value, leading us to reject the null hypothesis. This statistically confirms that the choice of drug significantly impacts patient recovery rates, validating the apparent differences seen in the charts.