<!-- slug: chi-square-tests-categorical-data-python --> <!-- excerpt: Master Chi-Square tests for categorical data. Covers Goodness of Fit, Independence tests, expected frequencies, effect size, and Python scipy implementation. -->
You can compute the mean height of a basketball team or the standard deviation of stock prices. But what happens when your data is categorical? There's no "average" of {Placebo, Drug A, Drug B}. You can't run a t-test on treatment labels. Categorical data demands its own toolkit, and the Chi-Square ($\chi^2$) test sits at the center of it.
Karl Pearson introduced the Chi-Square test in his 1900 paper, and it remains one of the most frequently used tests in applied statistics. It answers a simple question: does the pattern you see in your categorical data reflect a real relationship, or could random chance alone produce it? In a clinical trial comparing four treatment arms, for example, the Chi-Square test tells you whether the variation in recovery rates across groups is statistically meaningful or just noise.
Throughout this article, we'll work with a single running example: a 1,000-patient clinical trial comparing Placebo, Drug A, Drug B, and Drug C. Every formula, code block, and table will reference this dataset so the concepts stay grounded in one concrete scenario.
The Chi-Square test and its two main variants
The Chi-Square test is a hypothesis test that measures how far observed categorical counts deviate from expected counts under a null hypothesis. If the observed data matches expectations, the statistic is near zero. The bigger the gap between what you observe and what you'd expect if nothing interesting were happening, the larger the statistic and the stronger the evidence against the null.
Data scientists rely on two main variants:
| Variant | Question it answers | Input | Example |
|---|---|---|---|
| Goodness of Fit | Does a single categorical variable follow a specific distribution? | One variable, known expected proportions | "Are patients equally split across 4 treatment groups?" |
| Test of Independence | Are two categorical variables related? | Two variables in a contingency table | "Does the treatment group affect the recovery rate?" |
A third variant, the Homogeneity test, checks whether different populations have the same distribution of a categorical variable. It uses the same math as the Test of Independence but differs in study design. If you recruited patients from three hospitals and want to know whether recovery rates differ by hospital, that's a homogeneity test.
*Figure: Which Chi-Square test to use based on your research question*
Key Insight: Both the Independence and Homogeneity tests use identical calculations (same formula, same degrees of freedom). The distinction is conceptual: Independence tests sample from one population and ask whether two variables are related. Homogeneity tests sample from multiple populations and ask whether a single variable has the same distribution everywhere.
The Chi-Square formula
The Chi-Square statistic accumulates squared differences between observed and expected counts, normalized by the expected counts. This normalization is important: a discrepancy of 10 patients matters far more when you expected 20 than when you expected 10,000.
$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$$
Where:
- $\chi^2$ is the Chi-Square test statistic
- $O_i$ is the observed frequency (the actual count) in category $i$
- $E_i$ is the expected frequency (the count predicted by the null hypothesis) in category $i$
- $k$ is the total number of categories (or cells in a contingency table)
- $\sum$ denotes summation across all categories
In Plain English: For each cell in your data, ask: "How far off was my prediction?" Square the difference (so negative and positive gaps both count as misses), then divide by the expected count to put things in perspective. A shortfall of 35 patients is alarming when you expected 151 but trivial when you expected 10,000. Add up all those standardized misses, and you get a single "surprise score." A higher score means your data looks less and less like what you'd see if the null hypothesis were true.
The resulting $\chi^2$ value follows a Chi-Square probability distribution, and you compare it against a critical value (or compute a p-value) to decide whether to reject the null.
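Before reaching for scipy, it helps to compute the formula once by hand. The sketch below uses the treatment-group sizes from our running example against an equal-split expectation (the same comparison is run formally in the Goodness of Fit section later):

```python
import numpy as np

# Observed group sizes from the 1,000-patient trial (Placebo, Drug A, Drug B, Drug C)
observed = np.array([287, 256, 242, 215])
# Expected counts if patients were split equally across the 4 groups
expected = np.array([250, 250, 250, 250])

# Chi-Square: sum of squared gaps, each scaled by its expected count
chi_square = np.sum((observed - expected) ** 2 / expected)
print(f"Chi-Square statistic: {chi_square:.4f}")  # 10.7760
```

Each term is one cell's standardized "miss"; the Placebo gap of 37 patients contributes $37^2/250 \approx 5.5$ of the total on its own.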
Contingency tables and crosstabs
A contingency table (also called a crosstab) is a frequency matrix showing how observations distribute across the categories of two variables. It's the required input for the Test of Independence.
In our clinical trial, we have patients assigned to four treatment groups (Placebo, Drug A, Drug B, Drug C) and a binary outcome: responded (Yes or No). The contingency table looks like this:
| Treatment Group | No Response | Responded | Row Total |
|---|---|---|---|
| Placebo | 171 | 116 | 287 |
| Drug A | 122 | 134 | 256 |
| Drug B | 85 | 157 | 242 |
| Drug C | 96 | 119 | 215 |
| Column Total | 474 | 526 | 1,000 |
Notice the response rates: Placebo sits at roughly 40%, while Drug B reaches about 65%. The Chi-Square test will tell us whether this spread is statistically significant.
Pro Tip: In Python, pd.crosstab() generates these tables in one line. Always inspect the crosstab visually before running the test. Catch data issues early: a column with all zeros, a missing category, or a mislabeled value will silently poison your results.
Calculating expected frequencies
To compute $\chi^2$, you first need the expected counts: what the table would look like if the two variables were completely independent. If the treatment truly had zero effect, each group's response rate would match the overall study-wide rate.
The expected frequency for any cell is:
$$E_{ij} = \frac{R_i \times C_j}{N}$$
Where:
- $E_{ij}$ is the expected count for the cell in row $i$ and column $j$
- $R_i$ is the total number of observations in row $i$
- $C_j$ is the total number of observations in column $j$
- $N$ is the total number of observations in the entire table
In Plain English: Take the Placebo + Responded cell. There are 287 Placebo patients and 526 total responders out of 1,000. If the treatment does nothing, the Placebo group should respond at the same rate as everyone else (52.6%). So we expect $287 \times 526 / 1000 = 150.96$ responders in the Placebo group. We actually observed only 116. That gap of about 35 patients is one of the biggest contributors to the Chi-Square statistic.
*Figure: How observed and expected frequencies are computed from marginal totals in a contingency table*
Here's the full expected frequency table for our clinical trial:
| Treatment Group | Expected No Response | Expected Responded |
|---|---|---|
| Placebo | $287 \times 474 / 1000 = 136.04$ | $287 \times 526 / 1000 = 150.96$ |
| Drug A | $256 \times 474 / 1000 = 121.34$ | $256 \times 526 / 1000 = 134.66$ |
| Drug B | $242 \times 474 / 1000 = 114.71$ | $242 \times 526 / 1000 = 127.29$ |
| Drug C | $215 \times 474 / 1000 = 101.91$ | $215 \times 526 / 1000 = 113.09$ |
Every expected count is well above 5, so the Chi-Square approximation is valid. The Placebo group shows the biggest discrepancy between observed (116) and expected (150.96) respondents, while Drug B overperforms its expectation (157 observed vs 127.29 expected).
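The entire expected table can be produced in one vectorized step from the marginal totals, since $E_{ij} = R_i \times C_j / N$ for every cell at once. A sketch with NumPy, using the observed counts from the table above:

```python
import numpy as np

# Observed counts: rows = Placebo, Drug A, Drug B, Drug C; cols = No Response, Responded
observed = np.array([
    [171, 116],
    [122, 134],
    [85, 157],
    [96, 119],
])

row_totals = observed.sum(axis=1)  # [287, 256, 242, 215]
col_totals = observed.sum(axis=0)  # [474, 526]
n = observed.sum()                 # 1000

# Outer product computes R_i * C_j for every (i, j) pair; divide by N
expected = np.outer(row_totals, col_totals) / n
print(np.round(expected, 2))
```

The outer product is exactly the "row total times column total" rule applied to all eight cells simultaneously.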
Degrees of freedom and the Chi-Square distribution
Degrees of freedom (df) determine which Chi-Square distribution to compare your test statistic against. The formula differs by test type.
For the Goodness of Fit test:
$$df = k - 1$$
Where:
- $k$ is the number of categories
For the Test of Independence:
$$df = (r - 1) \times (c - 1)$$
Where:
- $r$ is the number of rows in the contingency table
- $c$ is the number of columns
In Plain English: Degrees of freedom measure how much "wiggle room" the data has once you fix the row and column totals. In our 4-row by 2-column table, once you know the totals and fill in any 3 cells, you can calculate every remaining cell by subtraction. So $df = (4 - 1) \times (2 - 1) = 3$. A $\chi^2$ value of 32.4 is extremely unlikely under 3 degrees of freedom (the distribution is concentrated near small values), but would be far less surprising with 20 degrees of freedom (the distribution spreads wider).
The Chi-Square distribution is right-skewed and only takes non-negative values. As degrees of freedom increase, the distribution shifts right and becomes more symmetric, approaching a normal distribution for large df.
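scipy.stats.chi2 exposes this distribution directly, which makes the degrees-of-freedom comparison above concrete. A sketch, using the 32.37 statistic our trial produces later in the article:

```python
from scipy.stats import chi2

statistic = 32.37  # Test of Independence result from the running example

# Critical value at alpha = 0.05 for df = 3: any statistic above this rejects H0
critical = chi2.ppf(0.95, df=3)
print(f"Critical value (df=3): {critical:.3f}")  # 7.815

# The same statistic is judged very differently under different df
print(f"p-value at df=3:  {chi2.sf(statistic, df=3):.2e}")
print(f"p-value at df=20: {chi2.sf(statistic, df=20):.4f}")
```

`chi2.sf` is the survival function (1 minus the CDF), which is exactly the right-tail p-value.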
When to use Chi-Square (and when not to)
The Chi-Square test is the right tool in specific situations. Using it where its assumptions don't hold produces meaningless p-values.
Use Chi-Square when:
- Both variables are categorical (nominal or ordinal)
- Each observation is independent (no repeated measures on the same subject)
- All expected cell counts are at least 5
- Your sample is reasonably large (typically $n \geq 50$)
Do NOT use Chi-Square when:
- Expected counts are below 5 in any cell. Use Fisher's Exact Test instead, which computes exact probabilities rather than relying on the Chi-Square approximation.
- Observations are paired or dependent. If you measured the same patient before and after treatment, use McNemar's test.
- Your data is continuous. Don't bin numerical data just to force it into a Chi-Square framework. Use a t-test, ANOVA, or a non-parametric alternative.
- You want to measure effect size. Chi-Square tells you that a relationship exists, not how strong it is. Follow up with Cramer's V or odds ratios.
- You have ordinal data and care about ordering. Chi-Square ignores the ordering of categories. Use the Cochran-Armitage trend test instead.
*Figure: Decision guide for checking Chi-Square assumptions before running the test*
Common Pitfall: Researchers sometimes split continuous variables into arbitrary bins (e.g., "low," "medium," "high" income) just so they can run a Chi-Square test. This throws away information. If your variables are numerical, stick with correlation, regression, or non-parametric tests. Only use Chi-Square when your data is genuinely categorical.
Python implementation with scipy
Let's put all the theory into practice. We'll generate a synthetic clinical trial dataset that matches the counts from our running example, then run both tests using scipy.stats.
Generating the clinical trial data
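A sketch of the data-generation step. Rather than simulating random assignment (which would produce slightly different counts each run), this version rebuilds an equivalent DataFrame deterministically from the cell counts of the running example; the column names `treatment_group` and `responded` follow the printed output:

```python
import pandas as pd

# Cell counts from the running example: (group, responded flag, count)
cells = [
    ("Placebo", 0, 171), ("Placebo", 1, 116),
    ("Drug_A", 0, 122), ("Drug_A", 1, 134),
    ("Drug_B", 0, 85),  ("Drug_B", 1, 157),
    ("Drug_C", 0, 96),  ("Drug_C", 1, 119),
]

# Expand counts into one row per patient
rows = [(group, resp) for group, resp, count in cells for _ in range(count)]
df = pd.DataFrame(rows, columns=["treatment_group", "responded"])

print(f"Total patients: {len(df)}")
print("Group sizes:")
print(df["treatment_group"].value_counts().sort_index())
print(f"Overall response rate: {df['responded'].mean():.1%}")
```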
Expected output:
Total patients: 1000
Group sizes:
treatment_group
Drug_A 256
Drug_B 242
Drug_C 215
Placebo 287
Name: count, dtype: int64
Overall response rate: 52.6%
Goodness of Fit test: balanced group assignment
Before testing drug effectiveness, we should verify the study design. Were patients assigned roughly equally across the four groups? With 1,000 patients, perfect balance would put 250 in each arm.
Null Hypothesis ($H_0$): Patients are equally distributed (250 per group). Alternative Hypothesis ($H_1$): The distribution is not equal.
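A sketch of this test with scipy.stats.chisquare; the observed counts follow the alphabetical group order printed above:

```python
import numpy as np
from scipy.stats import chisquare

# Observed group sizes (Drug_A, Drug_B, Drug_C, Placebo) vs. the equal-split expectation
observed = np.array([256, 242, 215, 287])
expected = np.array([250, 250, 250, 250])

statistic, p_value = chisquare(f_obs=observed, f_exp=expected)

print("Goodness of Fit Test: Equal Group Assignment")
print(f"Observed: {observed}")
print(f"Expected: {expected.tolist()}")
print(f"Chi-Square statistic: {statistic:.4f}")
print(f"Degrees of freedom: {len(observed) - 1}")  # k - 1 = 3
print(f"P-value: {p_value:.4f}")

alpha = 0.05
if p_value < alpha:
    print(f"P-value ({p_value:.4f}) < alpha ({alpha}): Reject H0.")
    print("The groups are NOT equally distributed.")
```

Note that `chisquare` requires the observed and expected totals to match (both sum to 1,000 here), otherwise it raises an error.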
Expected output:
Goodness of Fit Test: Equal Group Assignment
Observed: [256 242 215 287]
Expected: [250, 250, 250, 250]
Chi-Square statistic: 10.7760
Degrees of freedom: 3
P-value: 0.0130
P-value (0.0130) < alpha (0.05): Reject H0.
The groups are NOT equally distributed.
The p-value is about 0.013, which falls below our 0.05 threshold. Technically, the groups aren't perfectly balanced. But look at the actual numbers: the biggest gap is 287 vs 215, a difference of 72 patients out of 1,000. In randomized clinical trials, some imbalance is expected. A statistician would note the imbalance and might adjust for it in a regression model, but it wouldn't invalidate the trial.
Test of Independence: does the drug work?
Now for the critical question: is there a statistically significant relationship between the treatment group and recovery? This is where the Test of Independence comes in.
Null Hypothesis ($H_0$): Treatment group and response are independent (the drugs don't affect recovery). Alternative Hypothesis ($H_1$): Treatment group and response are dependent (the drugs do affect recovery).
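A sketch using scipy.stats.chi2_contingency. In the full pipeline the table would come from `pd.crosstab(df['treatment_group'], df['responded'])`; here it's built directly from the counts so the block runs standalone:

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Observed counts (in practice: pd.crosstab on the raw patient DataFrame)
contingency = pd.DataFrame(
    {"No Response": [122, 85, 96, 171], "Responded": [134, 157, 119, 116]},
    index=pd.Index(["Drug_A", "Drug_B", "Drug_C", "Placebo"], name="treatment_group"),
)

# Returns the statistic, p-value, degrees of freedom, and the expected-count table
statistic, p_value, dof, expected = chi2_contingency(contingency)

print("Contingency Table (Observed):")
print(contingency)
print(f"\nChi-Square statistic: {statistic:.4f}")
print(f"P-value: {p_value:.2e}")
print(f"Degrees of freedom: {dof}")
print("\nExpected Frequencies (if H0 were true):")
print(pd.DataFrame(expected, index=contingency.index,
                   columns=contingency.columns).round(2))
```

`chi2_contingency` computes the expected frequencies for you, which is a convenient cross-check against the hand-calculated table from earlier. (Yates' continuity correction is applied only to 2x2 tables, so it doesn't affect this 4x2 result.)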
Expected output:
Contingency Table (Observed):
No Response Responded
treatment_group
Drug_A 122 134
Drug_B 85 157
Drug_C 96 119
Placebo 171 116
Chi-Square statistic: 32.3680
P-value: 4.38e-07
Degrees of freedom: 3
Expected Frequencies (if H0 were true):
No Response Responded
treatment_group
Drug_A 121.34 134.66
Drug_B 114.71 127.29
Drug_C 101.91 113.09
Placebo 136.04 150.96
Response rates:
Drug_A: 134/256 = 52.3% (expected: 134.7)
Drug_B: 157/242 = 64.9% (expected: 127.3)
Drug_C: 119/215 = 55.3% (expected: 113.1)
Placebo: 116/287 = 40.4% (expected: 151.0)
The p-value is $4.38 \times 10^{-7}$, far below any reasonable significance level. We reject the null hypothesis with extreme confidence: treatment group and recovery rate are not independent. Drug B's 64.9% response rate versus Placebo's 40.4% is not a coincidence.
But here's the catch: the Chi-Square test tells us that a relationship exists. It doesn't tell us which drug is better or how much better. For that, we need post-hoc analysis and effect size measures.
Measuring effect size with Cramer's V
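Cramer's V isn't returned by `chi2_contingency`, but it's a one-liner on top of it: $V = \sqrt{\chi^2 / (n \cdot \min(r-1,\ c-1))}$. A sketch:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts: rows = Drug_A, Drug_B, Drug_C, Placebo; cols = No Response, Responded
observed = np.array([[122, 134], [85, 157], [96, 119], [171, 116]])

chi2_stat = chi2_contingency(observed)[0]
n = observed.sum()
min_dim = min(observed.shape) - 1  # min(rows, cols) - 1

cramers_v = np.sqrt(chi2_stat / (n * min_dim))
print(f"Chi-Square: {chi2_stat:.4f}")
print(f"N: {n}")
print(f"Min(rows, cols) - 1: {min_dim}")
print(f"Cramer's V: {cramers_v:.4f}")
```

Recent scipy versions also offer `scipy.stats.contingency.association(observed, method="cramer")`, which computes the same quantity.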
Expected output:
Chi-Square: 32.3680
N: 1000
Min(rows, cols) - 1: 1
Cramer's V: 0.1799
Effect size interpretation: small to medium
A Cramer's V of 0.18 indicates a small-to-medium association. The relationship is real (the p-value confirms that), but the treatment group explains only a modest share of the variation in recovery. This is typical in medical research where many factors beyond treatment influence outcomes.
Pro Tip: Always report effect size alongside p-values. With a large enough sample, even trivially small differences become "statistically significant." A p-value of 0.001 with a Cramer's V of 0.02 means the effect exists but is practically meaningless.
Visualizing the results
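A sketch of the two-panel figure; the exact layout and styling are assumptions based on the description that follows (stacked proportions on the left, observed vs expected responder counts on the right):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import chi2_contingency

groups = ["Placebo", "Drug_A", "Drug_B", "Drug_C"]
observed = np.array([[171, 116], [122, 134], [85, 157], [96, 119]])

chi2_stat, p_value, dof, expected = chi2_contingency(observed)

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Left panel: stacked response proportions per group
props = observed / observed.sum(axis=1, keepdims=True)
axes[0].bar(groups, props[:, 1], label="Responded")
axes[0].bar(groups, props[:, 0], bottom=props[:, 1], label="No Response")
axes[0].set_ylabel("Proportion")
axes[0].set_title("Response proportions by treatment group")
axes[0].legend()

# Right panel: observed vs expected responder counts side by side
x = np.arange(len(groups))
axes[1].bar(x - 0.2, observed[:, 1], width=0.4, label="Observed responders")
axes[1].bar(x + 0.2, expected[:, 1], width=0.4, label="Expected responders")
axes[1].set_xticks(x)
axes[1].set_xticklabels(groups)
axes[1].set_title("Observed vs expected responders")
axes[1].legend()

fig.suptitle(f"Chi-Square = {chi2_stat:.2f}, p = {p_value:.2e}")
fig.savefig("chi_square_results.png", dpi=100)
print(f"Chi-Square = {chi2_stat:.2f}, p = {p_value:.2e}")
```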
Expected output:
Chi-Square = 32.37, p = 4.38e-07
The left panel shows the stacked proportions: Drug B's "Responded" bar is visibly taller than Placebo's. The right panel makes the gap between observed and expected counts explicit. Drug B exceeded its expected responder count by about 30, while Placebo fell short by about 35.
Post-hoc analysis: which groups differ?
The Chi-Square test tells you the overall relationship is significant, but with four treatment groups, you'll want to know which specific pairs differ. This is analogous to running post-hoc tests after ANOVA.
The standard approach is to examine standardized residuals. The values reported below are Pearson residuals, $(O - E)/\sqrt{E}$; fully adjusted residuals additionally divide by a factor based on the marginal proportions, but both serve the same diagnostic purpose. A residual with an absolute value above 2 flags that particular cell as contributing disproportionately to the overall Chi-Square value.
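A sketch that reproduces the residual table, computing each cell as $(O - E)/\sqrt{E}$:

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

observed = pd.DataFrame(
    {"No Response": [122, 85, 96, 171], "Responded": [134, 157, 119, 116]},
    index=pd.Index(["Drug_A", "Drug_B", "Drug_C", "Placebo"], name="treatment_group"),
)

# Expected counts under independence, from the omnibus test
expected = chi2_contingency(observed)[3]

# Signed, scaled cell-level contributions: (O - E) / sqrt(E)
residuals = (observed - expected) / np.sqrt(expected)
print("Standardized Residuals:")
print(residuals.round(3))

print("\nSignificant cells (|residual| > 2):")
for group in residuals.index:
    for outcome in residuals.columns:
        r = residuals.loc[group, outcome]
        if abs(r) > 2:
            direction = "above expected" if r > 0 else "below expected"
            print(f"{group} x {outcome}: {r:.3f} ({direction})")
```

The sign tells you the direction of the deviation, and squaring each residual recovers that cell's contribution to the overall $\chi^2$.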
Expected output:
Standardized Residuals:
No Response Responded
treatment_group
Drug_A 0.060 -0.057
Drug_B -2.774 2.633
Drug_C -0.585 0.556
Placebo 2.998 -2.846
Significant cells (|residual| > 2):
Drug_B x No Response: -2.774 (below expected)
Drug_B x Responded: 2.633 (above expected)
Placebo x No Response: 2.998 (above expected)
Placebo x Responded: -2.846 (below expected)
The residuals confirm what we suspected: the significant Chi-Square result is driven primarily by Drug B (outperforming) and Placebo (underperforming). Drug A and Drug C show residuals well within the threshold, meaning they don't differ much from the overall average.
Key Insight: Standardized residuals are the Chi-Square test's equivalent of pairwise comparisons. When you report results, don't just say "the test was significant." Identify which cells drive the significance, because that's what decision-makers actually need to know.
Goodness of Fit: beyond equal proportions
The Goodness of Fit test isn't limited to checking for uniform distributions. You can test whether observed data matches any hypothesized distribution. Suppose a previous study found that Drug B is preferred by patients, and you hypothesize that in any clinical trial, the enrollment would follow a 20% Placebo, 25% Drug A, 30% Drug B, 25% Drug C split.
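A sketch of the test against this hypothesized 20/25/30/25 split, again with scipy.stats.chisquare:

```python
import numpy as np
from scipy.stats import chisquare

groups = ["Placebo", "Drug_A", "Drug_B", "Drug_C"]
observed = np.array([287, 256, 242, 215])

# Hypothesized enrollment split from the prior study: 20% / 25% / 30% / 25%
proportions = np.array([0.20, 0.25, 0.30, 0.25])
expected = proportions * observed.sum()

statistic, p_value = chisquare(f_obs=observed, f_exp=expected)

# Per-cell contributions show where the mismatch is worst
contributions = (observed - expected) ** 2 / expected

print("Goodness of Fit: Prior Study Distribution")
print(f"{'Group':<10}{'Observed':>10}{'Expected':>10}{'(O-E)^2/E':>12}")
for g, o, e, c in zip(groups, observed, expected, contributions):
    print(f"{g:<10}{o:>10}{e:>10.1f}{c:>12.3f}")

print(f"\nChi-Square: {statistic:.4f}")
print(f"P-value: {p_value:.4e}")
print(f"df: {len(observed) - 1}")
```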
Expected output:
Goodness of Fit: Prior Study Distribution
Group Observed Expected (O-E)^2/E
Placebo 287 200.0 37.845
Drug_A 256 250.0 0.144
Drug_B 242 300.0 11.213
Drug_C 215 250.0 4.900
Chi-Square: 54.1023
P-value: 1.0671e-11
df: 3
The data strongly rejects this hypothesized distribution. Placebo enrollment was much higher than the 20% hypothesis, and Drug B enrollment was much lower than 30%. Each cell's contribution to $\chi^2$ shows where the mismatch is worst: Placebo alone contributes 37.8 of the total 54.1.
Production considerations
Chi-Square tests are computationally cheap. scipy.stats.chi2_contingency runs in $O(r \times c)$ time, where $r$ and $c$ are the table dimensions. Even a 100-by-100 contingency table with millions of observations computes in milliseconds. The bottleneck is almost always constructing the contingency table from raw data (pd.crosstab() is $O(n)$, where $n$ is the number of raw data rows).
Memory is rarely an issue either. The test only needs the aggregated counts, not the raw observations. If your dataset has 100 million rows but only 10 categories, the contingency table is still tiny.
A few practical notes for production pipelines:
- Automate assumption checks. Before running Chi-Square, verify that all expected counts exceed 5. If any cell falls below, automatically switch to Fisher's exact test or merge sparse categories.
- Correct for multiple comparisons. If you're testing dozens of variable pairs, apply Bonferroni correction or control the false discovery rate (FDR) with Benjamini-Hochberg. Otherwise, 1 in 20 tests will appear significant by chance.
- Log everything. In A/B testing pipelines, record the $\chi^2$ value, p-value, degrees of freedom, sample size, effect size, and whether any expected count fell below 5. This audit trail is essential when results are questioned months later.
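The first of these checks can be automated with a small guard. A sketch (note that scipy's `fisher_exact` handles only 2x2 tables, so larger sparse tables need category merging instead):

```python
import numpy as np
from scipy.stats import chi2_contingency, fisher_exact

def safe_categorical_test(observed, min_expected=5):
    """Run chi-square, falling back to Fisher's exact test for sparse 2x2 tables."""
    observed = np.asarray(observed)
    stat, p, _, expected = chi2_contingency(observed)

    if expected.min() >= min_expected:
        return "chi-square", stat, p
    if observed.shape == (2, 2):
        stat, p = fisher_exact(observed)
        return "fisher", stat, p
    raise ValueError(
        f"Expected count below {min_expected} in a {observed.shape} table; "
        "merge sparse categories before testing."
    )

# Healthy table from the trial: chi-square runs normally
name, stat, p = safe_categorical_test([[171, 116], [122, 134], [85, 157], [96, 119]])
print(name, round(stat, 4), f"{p:.2e}")

# Sparse 2x2 table (min expected count is 4): switches to Fisher's exact test
name, stat, p = safe_categorical_test([[2, 8], [6, 4]])
print(name, round(stat, 4), round(p, 4))
```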
Conclusion
The Chi-Square test is the standard tool for determining whether categorical variables are related. It works by measuring the gap between what you observe and what you'd expect under the null hypothesis, then converting that gap into a p-value through the Chi-Square distribution.
The key to using it well goes beyond running chi2_contingency. Always verify the assumptions first, especially the expected count threshold. Report effect size (Cramer's V) alongside p-values because statistical significance and practical significance are different things entirely. And when you get a significant result with more than two categories, dig into the standardized residuals to identify which cells are actually driving the effect.
For a deeper understanding of the p-value framework behind this test, read Mastering Hypothesis Testing. If you're working with continuous outcomes rather than categorical ones, ANOVA is the natural parallel. And if your Chi-Square analysis reveals that a categorical variable predicts an outcome, the next step is often Logistic Regression to model that relationship with full control over confounders.
Frequently Asked Interview Questions
Q: What is the difference between the Chi-Square Goodness of Fit test and the Test of Independence?
The Goodness of Fit test examines whether a single categorical variable follows a specified distribution (e.g., are dice rolls uniformly distributed?). The Test of Independence examines whether two categorical variables are related in a contingency table (e.g., does treatment type affect recovery?). Both use the same formula, but they differ in how expected frequencies are computed and in the degrees of freedom calculation.
Q: When should you use Fisher's Exact Test instead of Chi-Square?
Fisher's Exact Test is the right choice when any expected cell count in the contingency table falls below 5. The Chi-Square test relies on a large-sample approximation to the Chi-Square distribution, and that approximation breaks down with small expected counts. Fisher's test computes exact probabilities, so it works correctly regardless of sample size. For large samples, both tests give nearly identical results, but Fisher's is computationally slower.
Q: A Chi-Square test returns a p-value of 0.001. Does that mean the effect is large?
No. A very small p-value means the relationship is unlikely to be due to chance, but it says nothing about the magnitude of the effect. With a sufficiently large sample, even tiny, practically meaningless differences become statistically significant. Always report an effect size measure like Cramer's V alongside the p-value. A Cramer's V below 0.1 signals a negligible effect regardless of how small the p-value is.
Q: Can you use the Chi-Square test with ordinal data?
You can, but the standard Chi-Square test ignores the ordering of categories. If you have ordinal data like {Low, Medium, High} income brackets, the test treats "Low vs Medium" and "Low vs High" as equally different. The Cochran-Armitage trend test is a better option when you want to test for a monotonic trend across ordered categories.
Q: How do you handle a significant Chi-Square test with more than two groups?
A significant result tells you that at least one group differs from the others, but not which one. You can examine standardized residuals (cells with absolute values above 2 are the main contributors) or run pairwise Chi-Square tests with a Bonferroni correction. This is directly analogous to running post-hoc pairwise t-tests after a significant ANOVA result.
Q: Your A/B test has three variants and a control. Can you run Chi-Square instead of multiple proportion z-tests?
Yes, and it's actually the recommended approach. Running separate z-tests for each variant against the control inflates the Type I error rate because of multiple comparisons. A single Chi-Square test of independence across all four groups controls the overall error rate at your chosen alpha level. If the Chi-Square test is significant, then investigate individual comparisons with appropriate corrections.
Q: What assumptions does the Chi-Square test make about the sampling process?
It assumes observations are independently sampled (no repeated measures), each observation falls into exactly one cell (categories are mutually exclusive and exhaustive), and the sample was drawn randomly from the population. Violating independence is the most common and most damaging assumption violation. For example, if family members are in the same trial, their outcomes are correlated, and the test's p-value becomes unreliable.
Q: What is Cramer's V and how do you interpret it?
Cramer's V is an effect size metric derived from the Chi-Square statistic: $V = \sqrt{\chi^2 / (n \cdot \min(r-1,\ c-1))}$. It ranges from 0 (no association) to 1 (perfect association). For a 2x2 table, it equals the absolute value of the phi coefficient. Rough interpretation: below 0.1 is negligible, 0.1-0.3 is small, 0.3-0.5 is medium, and above 0.5 is large. Always interpret in context because thresholds vary by field.
Hands-On Practice
The following Python code demonstrates how to perform both types of Chi-Square tests using the scipy.stats library. First, we use the Goodness of Fit test to check if the patients were evenly sampled across the four treatment groups. Then, we use the Test of Independence to determine if the treatment received actually impacted patient recovery rates. We visualize the results using matplotlib to verify the statistical findings.
Dataset: Clinical Trial (Statistics & Probability) Clinical trial dataset with 1000 patients designed for statistics and probability tutorials. Contains treatment groups (4 levels) for ANOVA, binary outcomes for hypothesis testing and A/B testing, confounders for causal inference, time-to-event data for survival analysis, and pre-generated distribution samples.
In this analysis, the Goodness of Fit test revealed slight imbalances in group sizes (p < 0.05), likely due to the random assignment process in a sample of this size. More importantly, the Test of Independence returned a very small p-value, leading us to reject the null hypothesis. This statistically confirms that the choice of drug significantly impacts patient recovery rates, validating the apparent differences seen in the charts.