
Statistics and Hypothesis Testing Interview Questions for Data Scientists

LDS Team, Let's Data Science

Statistics and experimentation screens have quietly become the hardest part of the data science interview loop. The share of data science job postings mentioning A/B testing rose 14 percentage points between 2024 and 2026, and mentions of causal inference skills jumped 17 percentage points over the same period, reflecting teams' shift from descriptive dashboards to rigorous, experiment-driven decision-making. If you can walk through a p-value calculation, design a valid A/B test from scratch, and explain power analysis to a product manager without losing them, you clear a filter that eliminates a significant share of applicants.

The questions in this article are drawn from our research of publicly available interview prep resources, academic materials, and community discussions including InterviewQuery, DataLemur, StrataScratch, and forums such as r/datascience and Blind. These represent patterns that data scientists have reported encountering in statistics and experimentation interviews at technology companies. We do not claim these are proprietary questions from any specific organization.

Probability Fundamentals That Actually Get Asked

Probability questions at the start of a statistics screen are not warm-up softballs. They reveal whether you can reason from first principles under pressure.

Bayes' theorem in a business scenario

A spam filter marks an email as spam. The base rate of spam is 30%. The filter has a 95% true positive rate and a 2% false positive rate. Given the filter flagged the email, what is the probability it is actually spam?

Using Bayes' theorem:

```python
# P(spam | flagged) using Bayes' theorem
p_spam = 0.30          # prior: 30% of emails are spam
p_flag_given_spam = 0.95   # true positive rate
p_flag_given_not_spam = 0.02  # false positive rate

p_not_spam = 1 - p_spam

# P(flagged) = P(flag|spam)*P(spam) + P(flag|not spam)*P(not spam)
p_flag = (p_flag_given_spam * p_spam) + (p_flag_given_not_spam * p_not_spam)

# Bayes' theorem
p_spam_given_flag = (p_flag_given_spam * p_spam) / p_flag

print(f"P(spam | flagged) = {p_spam_given_flag:.4f}")
# Output: P(spam | flagged) = 0.9532
```

With a 30% base rate, the posterior (95.3%) lands close to the true positive rate (95%) because the prior is relatively high. The real lesson becomes vivid with a rare event: if only 1% of emails were spam, the same filter with a 95% TPR and 2% FPR would flag an email that has only about a 32% chance of being actual spam. That is the base rate fallacy, and it trips up candidates every time.
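The rare-event version is worth verifying yourself. Rerunning the same formula with a 1% prior (all other numbers as above):

```python
# Same filter, but spam is now rare: 1% base rate
p_spam = 0.01
p_flag_given_spam = 0.95      # true positive rate
p_flag_given_not_spam = 0.02  # false positive rate

p_flag = p_flag_given_spam * p_spam + p_flag_given_not_spam * (1 - p_spam)
p_spam_given_flag = (p_flag_given_spam * p_spam) / p_flag

print(f"P(spam | flagged) = {p_spam_given_flag:.4f}")
# Output: P(spam | flagged) = 0.3242
```

The false positives from the 99% of legitimate email swamp the true positives from the 1% of spam, which is exactly why stating the prior first matters.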

Key Insight: Interviewers ask Bayesian questions to see if you reason about priors. Forgetting the base rate is the single most common mistake. Always state the prior before computing the posterior.

Expected value with a decision twist

You roll a fair six-sided die and earn the face value in dollars. After seeing the result, you may roll once more and take the new value instead, forfeiting the first. When should you roll again?

The expected value of a single roll is (1+2+3+4+5+6)/6 = 3.5. You should roll again only if your first result is below 3.5, meaning a result of 1, 2, or 3. If you roll a 4, 5, or 6, keep it. The expected value of the optimal strategy is (3/6) × 3.5 + (1/6) × 4 + (1/6) × 5 + (1/6) × 6 = 1.75 + 2.5 = $4.25.

From the Interviewer's Perspective: This question tests structured decision-making under uncertainty, the same reasoning used when deciding whether to continue an experiment or call it early.
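The "reroll only below the single-roll mean" strategy can be checked by full enumeration. A quick sketch, using exact fractions so there is no floating-point noise:

```python
from fractions import Fraction

# Expected value of one roll of a fair six-sided die: 7/2 = 3.5
single_roll_ev = Fraction(sum(range(1, 7)), 6)

# Optimal strategy: keep the first roll if it beats the EV of a reroll,
# otherwise forfeit it and take the expected value of the second roll
total = Fraction(0)
for first in range(1, 7):
    total += max(Fraction(first), single_roll_ev)
strategy_ev = total / 6

print(strategy_ev)         # 17/4
print(float(strategy_ev))  # 4.25
```

Enumerating all six first rolls confirms the $4.25 figure: rolls of 1-3 contribute the reroll EV of 3.5, while 4, 5, and 6 are kept.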

Hypothesis Testing From First Principles

The p-value: what it means and how to explain it

A standard interview question at a social platform with millions of daily active users runs like this: "Explain a p-value to a non-technical product manager."

The wrong answer: "It is the probability that the null hypothesis is true." This is one of the most persistent misconceptions in statistics.

The correct answer: A p-value is the probability of observing results at least as extreme as what you measured, assuming the null hypothesis is true. It quantifies how surprised you should be by the data if nothing were actually happening.

Here is a plain-English framing that works in interviews:

"If the new feature had zero effect, there is only a 4% chance we would see a lift at least this large just by chance. Since 4% is below our threshold of 5%, we have enough statistical evidence to conclude the feature probably has a real effect."

What a p-value does NOT mean:

  • It is not the probability the result is due to chance
  • It is not the probability the null hypothesis is true
  • It is not a measure of effect size or practical importance

Common Mistake: Conflating statistical significance with practical significance. A p-value of 0.001 with an effect size of 0.01% conversion lift may be statistically significant but commercially irrelevant. Always pair the p-value with an effect size and a confidence interval.
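The significance-versus-importance gap is easy to demonstrate numerically. The traffic figures below are hypothetical, but they show how an economically trivial lift becomes "significant" at scale with a two-proportion z-test:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical high-traffic experiment: a tiny lift at enormous n
n = 10_000_000                            # users per group
p_control, p_treatment = 0.1000, 0.1005   # +0.05 percentage point lift

# Two-proportion z-test with pooled variance
p_pool = (p_control + p_treatment) / 2
se = np.sqrt(p_pool * (1 - p_pool) * 2 / n)
z = (p_treatment - p_control) / se
p_value = 2 * (1 - norm.cdf(abs(z)))

print(f"z = {z:.2f}, p = {p_value:.6f}")              # highly significant
print(f"Absolute lift: {p_treatment - p_control:.4%}")  # yet only 0.05pp
```

The p-value clears any conventional threshold, but whether a 0.05 percentage point lift justifies shipping is a business question the p-value cannot answer.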

Type I and Type II errors with business consequences

| Error Type | Statistical Name | A/B Test Scenario | Business Consequence |
|---|---|---|---|
| Type I (α) | False positive | Concluding a bad feature works | Ship a feature that does not help users, wasting engineering resources |
| Type II (β) | False negative | Missing a feature that works | Fail to ship a valuable feature, losing competitive advantage |

At most technology companies, α is set to 0.05 and power (1 - β) to 0.80. These are conventions, not laws. A medical device company sets α at 0.01 because the cost of a false positive is severe. A rapid-iteration product team might accept α = 0.10 to move faster, accepting more false positives.

Key Insight: Interviewers ask you to define these errors in context, not in the abstract. If they say "we run 100 independent A/B tests per year with α=0.05," the expected number of false positives among tests where the null is true is 5. That is a real operational problem.

Null hypothesis formulation for business questions

A large e-commerce company asks: "We changed the checkout button from green to orange. How do you formulate the hypothesis test?"

Null hypothesis (H₀): The conversion rate for the orange button is equal to the conversion rate for the green button. (μ_orange = μ_green)

Alternative hypothesis (H₁): The conversion rates differ. (μ_orange ≠ μ_green), two-tailed test unless you have a directional prior.

The choice between one-tailed and two-tailed matters. A one-tailed test has more statistical power for detecting an effect in the specified direction, but using it requires genuine directional prior belief, not choosing it post-hoc because the data pointed one way.
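The power difference comes directly from the gap between critical values. A small illustration (the observed z-statistic of 1.80 is hypothetical):

```python
from scipy.stats import norm

alpha = 0.05
z_two_tailed = norm.ppf(1 - alpha / 2)   # ~1.960: alpha split across both tails
z_one_tailed = norm.ppf(1 - alpha)       # ~1.645: all of alpha in one tail

z_observed = 1.80   # hypothetical test statistic in the predicted direction

print(f"Two-tailed critical value: {z_two_tailed:.3f}")
print(f"One-tailed critical value: {z_one_tailed:.3f}")
print(f"Significant one-tailed? {z_observed > z_one_tailed}")   # True
print(f"Significant two-tailed? {z_observed > z_two_tailed}")   # False
```

The same data clears the one-tailed bar but not the two-tailed one, which is precisely why choosing one-tailed after seeing the data is invalid.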

A/B Test Design Questions

Sample size calculation

An ad platform asks: "How do you calculate the required sample size for an A/B test?" This question is asked in roughly half of all experimentation-focused data science screens.

The inputs are: baseline conversion rate, minimum detectable effect (MDE), significance level α, and desired power (1 - β).

```python
import numpy as np
from scipy.stats import norm

def ab_test_sample_size(baseline_rate, mde, alpha=0.05, power=0.80):
    """
    Calculate per-group sample size for a two-proportion z-test.

    Parameters
    ----------
    baseline_rate : float  - current conversion rate (e.g., 0.10 for 10%)
    mde          : float  - minimum detectable effect, absolute (e.g., 0.02 for +2pp)
    alpha        : float  - significance level (default 0.05)
    power        : float  - desired power (default 0.80)

    Returns
    -------
    int  - required sample size per group
    """
    treatment_rate = baseline_rate + mde

    # Pooled proportion under H0
    p_pool = (baseline_rate + treatment_rate) / 2

    # z-scores for alpha and beta
    z_alpha = norm.ppf(1 - alpha / 2)   # two-tailed
    z_beta  = norm.ppf(power)

    # Sample size formula
    numerator   = (z_alpha * np.sqrt(2 * p_pool * (1 - p_pool)) +
                   z_beta  * np.sqrt(baseline_rate * (1 - baseline_rate) +
                                     treatment_rate * (1 - treatment_rate))) ** 2
    denominator = (treatment_rate - baseline_rate) ** 2

    n = numerator / denominator
    return int(np.ceil(n))

# Example: 10% baseline, want to detect +2pp lift, standard settings
n = ab_test_sample_size(baseline_rate=0.10, mde=0.02)
print(f"Required sample size per group: {n:,}")
# Output: Required sample size per group: 3,841
```

The total experiment requires 7,682 users (3,841 per group). If your site gets 1,000 visitors per day, that is roughly 8 days of data. If your site gets 100 visitors per day, that is 77 days, long enough that seasonal effects become a concern.

Common Mistake: Calculating sample size after the experiment ends. If you compute the required sample size only after peeking at results and stopping early, you have invalidated the test. Determine sample size before you start.

The meaning of 80% power

Power = 0.80 means: if the treatment truly has the effect you assumed (your MDE), you have an 80% probability of detecting it as statistically significant. Equivalently, you have a 20% chance of a Type II error (missing a real effect).

Four factors control power: sample size (larger increases power), effect size (larger effects are easier to detect), significance level (a more permissive α increases power but also false positives), and variance (lower variance increases power).
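It also helps to run this logic in reverse: given a sample size, what power do you actually have? The sketch below inverts the two-proportion sample-size formula shown earlier in this section (same parametrization; the helper name is ours):

```python
import numpy as np
from scipy.stats import norm

def ab_test_power(baseline_rate, mde, n_per_group, alpha=0.05):
    """Achieved power of a two-tailed two-proportion z-test,
    obtained by inverting the standard sample-size formula."""
    treatment_rate = baseline_rate + mde
    p_pool = (baseline_rate + treatment_rate) / 2

    z_alpha = norm.ppf(1 - alpha / 2)
    se_null = np.sqrt(2 * p_pool * (1 - p_pool))              # SD term under H0
    se_alt  = np.sqrt(baseline_rate * (1 - baseline_rate) +
                      treatment_rate * (1 - treatment_rate))  # SD term under H1

    z_beta = (abs(mde) * np.sqrt(n_per_group) - z_alpha * se_null) / se_alt
    return norm.cdf(z_beta)

# The 3,841-per-group design from the sample-size example recovers ~80% power
print(f"Power at n=3,841: {ab_test_power(0.10, 0.02, 3841):.3f}")
# Doubling the sample size pushes power well above 0.95
print(f"Power at n=7,682: {ab_test_power(0.10, 0.02, 7682):.3f}")
```

Seeing that doubling n does not double power (it climbs from roughly 0.80 toward 1 with diminishing returns) is a point interviewers often probe.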

Multiple testing and the Bonferroni correction

If you run 10 independent A/B tests simultaneously each at α = 0.05, the probability of at least one false positive is 1 - (0.95)^10 ≈ 40%. This is the familywise error rate problem.

The Bonferroni correction divides α by the number of tests. For 10 tests, each individual test uses α = 0.05 / 10 = 0.005.

```python
import numpy as np

# Family-wise error rate without correction
n_tests = 10
alpha   = 0.05
fwer_uncorrected = 1 - (1 - alpha) ** n_tests
print(f"FWER without correction: {fwer_uncorrected:.4f}")
# Output: FWER without correction: 0.4013

# Bonferroni corrected alpha
alpha_bonferroni = alpha / n_tests
print(f"Bonferroni corrected alpha per test: {alpha_bonferroni:.4f}")
# Output: Bonferroni corrected alpha per test: 0.0050

# FWER after correction (bounded by alpha, approximately)
fwer_corrected = 1 - (1 - alpha_bonferroni) ** n_tests
print(f"FWER with Bonferroni correction: {fwer_corrected:.4f}")
# Output: FWER with Bonferroni correction: 0.0489
```

Key Insight: Bonferroni is conservative. When tests are correlated (testing related metrics), it over-corrects, reducing power. The Benjamini-Hochberg procedure (False Discovery Rate control) is often preferred when running many correlated tests.
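A minimal sketch of the Benjamini-Hochberg step-up procedure, applied to a hypothetical batch of p-values (for production work, `statsmodels.stats.multitest.multipletests` provides a tested implementation):

```python
import numpy as np

def benjamini_hochberg(p_values, q=0.05):
    """Boolean mask of rejected hypotheses under BH (FDR control).

    Sort p-values ascending, find the largest k such that
    p_(k) <= (k / m) * q, and reject hypotheses 1..k.
    """
    p = np.asarray(p_values)
    m = len(p)
    order = np.argsort(p)
    thresholds = (np.arange(1, m + 1) / m) * q

    passed = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.max(np.nonzero(passed)[0])   # largest rank meeting its threshold
        reject[order[: k + 1]] = True
    return reject

# Hypothetical p-values from 8 related metric comparisons
p_vals = [0.001, 0.008, 0.012, 0.039, 0.041, 0.22, 0.51, 0.74]

bh_reject   = benjamini_hochberg(p_vals, q=0.05)
bonf_reject = np.asarray(p_vals) < 0.05 / len(p_vals)

print(f"BH rejects:         {bh_reject.sum()} of {len(p_vals)}")   # 3
print(f"Bonferroni rejects: {bonf_reject.sum()} of {len(p_vals)}") # 1
```

On these eight p-values, BH rejects three hypotheses while Bonferroni rejects only one, which is the power gain the Key Insight above describes.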

The peeking problem

A social platform with millions of daily active users asks: "We check our A/B test results every day and stop when p < 0.05. What is wrong with this approach?"

Peeking means checking results before the predetermined sample size is reached and stopping early if significance is achieved. Every additional look at the data is an additional opportunity to observe a spurious significant result. Research by Evan Miller showed that peeking every day for 30 days inflates the effective false positive rate from 5% to approximately 22%.

The solutions are: pre-register the sample size and end date before starting, use sequential testing methods (which adjust significance thresholds for interim analyses), or use Bayesian approaches that handle continuous monitoring without inflating error rates.
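A simulation makes the inflation concrete. The parameters below are illustrative; every run is an A/A test (the null is true in all of them), yet stopping at the first significant daily peek rejects far more often than 5%:

```python
import numpy as np

rng = np.random.default_rng(42)

n_sims, n_days, n_per_day = 2000, 20, 100
sd = 1.0   # known unit variance, so a simple z-test applies

# A/A data: both groups draw from the same distribution (null is true)
control   = rng.normal(0, sd, size=(n_sims, n_days, n_per_day))
treatment = rng.normal(0, sd, size=(n_sims, n_days, n_per_day))

# Cumulative group means after each daily "peek"
n_cum = np.arange(1, n_days + 1) * n_per_day
cum_c = control.reshape(n_sims, -1).cumsum(axis=1)[:, n_cum - 1] / n_cum
cum_t = treatment.reshape(n_sims, -1).cumsum(axis=1)[:, n_cum - 1] / n_cum

z = (cum_t - cum_c) / np.sqrt(2 * sd**2 / n_cum)

peeking_fpr = np.mean(np.any(np.abs(z) > 1.96, axis=1))  # stop at first "win"
single_fpr  = np.mean(np.abs(z[:, -1]) > 1.96)           # one look at the end

print(f"False positive rate, single look: {single_fpr:.3f}")   # near 0.05
print(f"False positive rate, 20 peeks:    {peeking_fpr:.3f}")  # several times higher
```

The single-look rate stays near the nominal 5%, while the peeking rate climbs sharply with the number of interim looks, which is exactly the pathology sequential testing methods are built to correct.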

Choosing the Right Statistical Test

This is a practical decision tree question that appears frequently at companies with mature data teams.

t-test vs. Welch's t-test vs. Mann-Whitney U

| Test | When to Use | Assumption |
|---|---|---|
| Student's t-test | Two independent groups, equal variances, normal distribution | Assumes equal variance |
| Welch's t-test | Two independent groups, unequal variances, normal distribution | Does not assume equal variance |
| Mann-Whitney U | Non-normal data, ordinal data, heavy outliers | Assumes similar distribution shapes |

In practice, Welch's t-test is the safer default for two-group comparisons because it performs well even when variances are equal. Note that scipy.stats.ttest_ind defaults to equal_var=True (Student's t-test), so you must explicitly pass equal_var=False to get Welch's correction.

Use Mann-Whitney U when:

  • Conversion revenue data is heavily right-skewed (a small number of large purchases dominate)
  • Sample sizes are small (under 30 per group) and normality cannot be verified
  • The outcome is ordinal (satisfaction ratings on a 1-5 scale)

From the Interviewer's Perspective: Candidates who default to "always use a t-test" fail this question. The right answer names the test, states the assumption being checked, and explains the consequence of violating it.

Chi-squared for categorical outcomes

Use the chi-squared test when both the treatment variable and the outcome are categorical. A common scenario: did the button color change affect the distribution of users across plan tiers (free, basic, premium)?

```python
import numpy as np
import scipy.stats as stats

# Observed frequencies: rows = variant, columns = plan tier
# [Free, Basic, Premium]
control   = [120, 60, 20]
treatment = [100, 70, 30]

observed = np.array([control, treatment])

chi2, p_value, dof, expected = stats.chi2_contingency(observed)

print(f"Chi-squared statistic: {chi2:.4f}")
print(f"Degrees of freedom: {dof}")
print(f"P-value: {p_value:.4f}")
# Output:
# Chi-squared statistic: 4.5874
# Degrees of freedom: 2
# P-value: 0.1009
```

The chi-squared test compares observed cell frequencies to expected frequencies under independence. If p < α, you reject independence, meaning the variant affected the distribution across categories.

Common Mistake: Applying chi-squared when expected cell counts are below 5. In that case, use Fisher's exact test instead.
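A short example with illustrative counts small enough that expected frequencies fall below 5, making Fisher's exact test the appropriate choice:

```python
import numpy as np
import scipy.stats as stats

# Small-sample 2x2: converted vs. not, by variant (hypothetical pilot data)
table = np.array([
    [8, 2],   # control:   8 converted, 2 did not
    [1, 5],   # treatment: 1 converted, 5 did not
])

# Fisher's exact test computes the p-value from the hypergeometric
# distribution directly — no large-sample approximation needed
odds_ratio, p_value = stats.fisher_exact(table, alternative='two-sided')

print(f"Odds ratio: {odds_ratio:.2f}")
print(f"P-value:    {p_value:.4f}")
```

With only 16 observations total, a chi-squared approximation would be unreliable here; the exact test gives a defensible p-value at any sample size.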

Common Interview Scenarios and How to Answer Them

"An A/B test shows p = 0.04. Should we ship it?"

The wrong answer is "yes, it is significant." The complete answer walks through four additional checks:

  1. Effect size and practical significance. What is the actual conversion lift? A p-value of 0.04 on a 0.01% lift at your company's scale might not justify the implementation cost or the risk of unintended side effects.

  2. Was the test properly designed? Was sample size determined in advance? Was the experiment stopped at the pre-planned end date, not early? Was randomization verified?

  3. Did you check guardrail metrics? An experiment can lift the primary metric while degrading a secondary one. If the button change increased checkout conversion by 2% but increased cart abandonment rate by 5%, the net result is negative.

  4. Multiple testing. If this was one of 20 experiments running simultaneously, a p-value of 0.04 is within the range of false positives you expect just from running that many tests.

Key Insight: The answer interviewers want is not "yes, p < 0.05 means ship." It is a structured framework that treats statistical significance as one input in a broader decision, not the final word.

"We ran an experiment but the two groups had different sizes. Is this a problem?"

Unequal group sizes are not inherently invalid. The two-proportion z-test and Welch's t-test handle unequal n. But unequal group sizes raise a more important question: why are they unequal?

If unequal assignment was intentional (you chose 80/20 allocation to limit exposure to a risky change), that is fine. If assignment should have been 50/50 but ended up 60/40, that suggests a randomization bug, a systematic bias in which users ended up in which bucket. The fix is to run an A/A test (same experience in both groups) and check whether the two groups differ on pre-experiment covariates.
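The imbalance check itself is a one-line chi-squared goodness-of-fit test against the intended allocation, often called a sample ratio mismatch (SRM) check. The counts below are illustrative:

```python
import scipy.stats as stats

# Intended a 50/50 split of 10,000 users, observed roughly 60/40
observed = [6000, 4000]
expected = [5000, 5000]   # what the intended allocation should produce

chi2, p_value = stats.chisquare(observed, f_exp=expected)

print(f"Chi-squared: {chi2:.1f}")
print(f"P-value: {p_value:.2e}")
if p_value < 0.001:
    print("Likely randomization bug: investigate assignment before trusting results")
```

A p-value this small says the split almost certainly did not arise from fair 50/50 randomization, so the experiment's results should not be trusted until the assignment mechanism is debugged.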

From the Interviewer's Perspective: This question tests whether you distinguish between group size imbalance (a statistical detail) and randomization failure (a validity-threatening problem). Most candidates address only the statistics. The strong answer addresses the root cause.

"Our treatment group conversion rate jumped 40% but our overall metric moved only 2%. Explain."

This is a classic Simpson's Paradox and novelty effect scenario. Several explanations are worth naming:

  • Simpson's Paradox or dilution: The overall metric aggregates every user segment, while the 40% lift may be concentrated in a small, high-responding segment. An effect that is large within one segment can shrink dramatically, or even reverse, once segments are combined.
  • Novelty effect: Users interact with a new feature more frequently simply because it is new. Conversion rates decline as novelty fades. Run the experiment longer, or look at cohorts of users over time.
  • Metric definition mismatch: "Conversion rate" and "overall metric" may be measuring different things. A 40% lift in one-click purchases might not move revenue per user if users are trading down from larger purchases.

Python for Statistical Tests

Two-sample t-test with Welch's correction

```python
import numpy as np
import scipy.stats as stats

np.random.seed(42)

# Simulate session duration (seconds) for control and treatment
control   = np.random.normal(loc=180, scale=45, size=500)
treatment = np.random.normal(loc=192, scale=50, size=480)

# Welch's t-test — must set equal_var=False (default is True)
t_stat, p_value = stats.ttest_ind(control, treatment, equal_var=False)

print(f"Control mean:   {control.mean():.2f}s")
print(f"Treatment mean: {treatment.mean():.2f}s")
print(f"t-statistic:    {t_stat:.4f}")
print(f"p-value:        {p_value:.4f}")

if p_value < 0.05:
    print("Result: Statistically significant at alpha=0.05")
else:
    print("Result: Not statistically significant at alpha=0.05")
```

Mann-Whitney U for non-normal data

```python
import numpy as np
import scipy.stats as stats

np.random.seed(42)

# Revenue per user is typically right-skewed — not normally distributed
control_revenue   = np.random.exponential(scale=25, size=400)
treatment_revenue = np.random.exponential(scale=28, size=390)

u_stat, p_value = stats.mannwhitneyu(
    control_revenue, treatment_revenue, alternative='two-sided'
)

print(f"Mann-Whitney U statistic: {u_stat:.2f}")
print(f"P-value: {p_value:.4f}")
```

Chi-squared test for proportions across categories

```python
import numpy as np
import scipy.stats as stats

# Click-through on three ad formats: Text, Image, Video
# Rows: control vs treatment, Columns: ad format clicked
observed = np.array([
    [210, 85, 55],   # control
    [195, 110, 70],  # treatment
])

chi2, p_value, dof, expected = stats.chi2_contingency(observed)

print(f"Chi-squared: {chi2:.4f}")
print(f"DOF: {dof}")
print(f"P-value: {p_value:.4f}")
print("\nExpected frequencies:")
print(np.round(expected, 1))
```

Bootstrap confidence interval from scratch

Bootstrap resampling is useful when you cannot assume a parametric distribution and want a confidence interval for a statistic like the median or a custom metric.

```python
import numpy as np

np.random.seed(42)

# Observed conversion data: 1 = converted, 0 = did not
n = 1000
conversions = np.random.binomial(1, p=0.12, size=n)

def bootstrap_ci(data, stat_func, n_bootstrap=5000, ci=0.95):
    """
    Compute bootstrap confidence interval for a statistic.

    Parameters
    ----------
    data       : array-like  - observed data
    stat_func  : callable    - function to compute statistic (e.g., np.mean)
    n_bootstrap: int         - number of bootstrap samples
    ci         : float       - confidence level (default 0.95)

    Returns
    -------
    tuple (lower, upper) confidence interval bounds
    """
    bootstrap_stats = []
    for _ in range(n_bootstrap):
        sample = np.random.choice(data, size=len(data), replace=True)
        bootstrap_stats.append(stat_func(sample))

    lower = np.percentile(bootstrap_stats, (1 - ci) / 2 * 100)
    upper = np.percentile(bootstrap_stats, (1 + ci) / 2 * 100)
    return lower, upper

observed_rate = conversions.mean()
lower, upper  = bootstrap_ci(conversions, np.mean)

print(f"Observed conversion rate: {observed_rate:.4f}")
print(f"95% Bootstrap CI: ({lower:.4f}, {upper:.4f})")
```

Bootstrap CIs are particularly useful for revenue metrics, where heavy tails mean the central limit theorem converges slowly at modest sample sizes.

Key Insight: Interviewers at data-mature companies expect you to know when parametric tests break down. Mentioning bootstrap or non-parametric methods unprompted signals that you understand the assumptions behind standard tests, not just their mechanics.

Conclusion

The questions that trip up candidates are rarely the ones about definitions. They are the scenario-based questions that require you to connect statistics to business consequences: recognizing that a p = 0.04 result still requires scrutiny, that unequal group sizes may signal randomization failure rather than a math problem, and that running 20 simultaneous tests at α = 0.05 yields about one false positive in expectation, no matter how carefully each individual test is designed.

The practical preparation path is to work through the calculation questions with actual Python code until the patterns are automatic. Sample size formulas, Bonferroni corrections, and bootstrap intervals should be reproducible from memory under interview conditions. Conceptual fluency follows from being able to explain each result in one sentence to a non-technical stakeholder.

For broader context on how statistics and experimentation fit into the full data science interview process, see The Complete Data Science Interview Guide and How to Prepare for Data Science Take-Home Assignments on letsdatascience.com.

The candidates who clear statistics screens are not necessarily the ones who know the most formulas. They are the ones who can explain why a formula exists, when it breaks, and what to do instead.

Career Q&A

How much statistics do I actually need to know for a data science interview at a product company?

At most product-focused technology companies, the statistics screen is narrower than candidates expect. You need fluency in hypothesis testing, A/B test design, p-value interpretation, and Type I/II error tradeoffs. You need working knowledge of t-tests, chi-squared, and when to reach for a non-parametric alternative. Deep knowledge of time series models, ARIMA, or Bayesian networks is rarely tested unless the role specifically lists those in the job description. The ceiling for most screens is the topics covered in this article, applied to ambiguous business scenarios.

Should I use statsmodels or scipy for power analysis and sample size calculations?

Both work. scipy.stats.norm.ppf gives you the z-scores to compute sample sizes from scratch (which is what this article demonstrates) and shows interviewers you understand the underlying formula rather than just calling a library function. statsmodels.stats.power offers the TTestIndPower().solve_power() convenience function, which is appropriate for take-home assignments or code reviews. In an interview, demonstrating the formula from first principles with scipy is generally the stronger signal.

What is the most common statistics mistake candidates make in interviews?

Conflating statistical significance with practical significance. Candidates who say "p < 0.05, we should ship it" miss the second half of the answer: what is the effect size, does the lift justify the implementation cost, and did any guardrail metrics move in the wrong direction? Statistical significance is a necessary condition for shipping, not a sufficient one.

How do I answer a statistics question when I am not sure of the exact formula?

Structure your answer out loud. State what the test is trying to measure, name the assumptions you would check, and describe the decision logic (reject H₀ if p < α). Interviewers care more about your reasoning process than formula precision. If you cannot recall the exact Bonferroni correction formula, saying "divide α by the number of tests to control the familywise error rate" demonstrates the concept even without the algebraic form.

Are Bayesian methods tested in data science interviews?

Rarely in depth, but Bayes' theorem at the basic level (prior, likelihood, posterior) appears in almost every statistics screen in some form. The biased coin problem, the medical test false positive rate problem, and the spam filter scenario all require Bayes' theorem. Fully Bayesian inference (MCMC, conjugate priors, hierarchical models) is tested only in roles where the job description explicitly mentions Bayesian modeling.

How do I handle a multiple testing question if I have never used Bonferroni correction in a real project?

Acknowledge the principle clearly even if you have not applied it in production. Explain that running k tests at nominal α = 0.05 inflates the familywise error rate, that Bonferroni correction divides α by k as a conservative bound, and that Benjamini-Hochberg is preferred when tests are correlated and you want to control the false discovery rate rather than the familywise error rate. That answer passes most interview screens.

How important is Python code in statistics interviews compared to conceptual explanation?

It depends on the format. In a phone screen or behavioral round, conceptual explanation is everything; no one can run your code anyway. In a technical screen or whiteboard session, being able to sketch the logic of scipy.stats.ttest_ind or write a bootstrap loop signals practical skill. For take-home assignments, runnable, commented code with interpretable output is the expectation. Know both modes and be ready to switch between them.
