
Probability Distributions: The Hidden Framework Behind Your Data

LDS Team
Let's Data Science


A single customer walks into your online store and spends $147.32. Another spends $12.99. A third drops $2,041.00. Each purchase feels random — unpredictable. But aggregate 10,000 purchases, and a clear shape emerges: most cluster between $40 and $65, a long tail stretches toward high-value orders, and the contour of that shape tells you more about your business than any individual receipt ever could. That shape is a probability distribution, and understanding probability distributions is the difference between guessing and predicting.

A probability distribution is the mathematical function that maps every possible outcome of a random variable to its likelihood. Actuaries use distributions to price insurance policies. A/B tests depend on them to distinguish real effects from noise. Machine learning models — from logistic regression to Gaussian Mixture Models — rely on distributional assumptions at every layer.

Throughout this article, we'll work with one consistent scenario: an e-commerce store analyzing 10,000 customer purchases. Every formula, every code block, and every table will reference this dataset so the concepts stay grounded in one concrete example.

Discrete versus continuous random variables

The most fundamental distinction in probability is whether your random variable produces countable or uncountable outcomes. Getting this wrong means applying the wrong math from the start.

Discrete random variables take on countable values — typically integers. The number of orders your store receives per hour (0, 1, 2, 3...), the number of items in a shopping cart, or the count of customers who click "Buy Now" out of 500 visitors. You can list every possible outcome, even if the list is infinite (0, 1, 2, ...).

Continuous random variables take on any value within a range, including decimals that go on forever. The dollar amount of a single purchase ($47.82, $47.821, $47.8213...), the time a customer spends browsing before checkout, or the exact weight of a shipped package. You cannot list every possible outcome because there are infinitely many values in any interval, no matter how small.

| Property | Discrete | Continuous |
| --- | --- | --- |
| Values | Countable (integers, categories) | Uncountable (any real number in range) |
| Probability function | PMF: P(X = k) gives exact probability | PDF: f(x) gives density, not probability |
| Probability of exact value | Can be nonzero | Always zero |
| Probability from | Direct lookup or summation | Integration (area under curve) |
| Example (e-commerce) | Items in cart: 0, 1, 2, 3... | Purchase amount: $0.01 to $10,000+ |

Key Insight: The probability of a continuous variable taking any single exact value is zero. The probability that a purchase is exactly $47.820000... is zero because the "width" of a single point on the number line is zero. Probabilities for continuous variables only make sense over intervals — the probability that a purchase falls between $40 and $60.

PMF, PDF, and CDF — three lenses on probability

[Figure: Relationship between PMF, PDF, and CDF for discrete and continuous distributions]

Three mathematical functions describe how probability spreads across outcomes. Each answers a different question about your data.

Probability Mass Function (PMF) — for discrete variables

The PMF assigns a concrete probability to each specific outcome. If your store averages 4 orders per hour, the PMF tells you: "the probability of exactly 0 orders is 0.018, exactly 1 is 0.073, exactly 2 is 0.147," and so on. Every bar in a PMF plot represents a real probability, and all bars must sum to 1.

$$\sum_{k} P(X = k) = 1 \quad \text{and} \quad P(X = k) \geq 0 \quad \forall k$$

Where:

  • P(X = k) is the probability of the random variable X taking the exact value k
  • The sum runs over every possible value k the variable can take

In Plain English: If you listed the probability of every possible number of orders in an hour and added them all up, you'd get exactly 1. No outcome can have negative probability, and some outcome must happen.

Probability Density Function (PDF) — for continuous variables

The PDF describes relative likelihood at a point, not direct probability. For purchase amounts, the PDF might be tall around $50 (high density, many purchases cluster here) and nearly flat past $200 (low density, few purchases that large). Actual probabilities come from areas under the curve: the probability of a purchase between $40 and $60 is the area under the PDF from 40 to 60.

$$\int_{-\infty}^{\infty} f(x)\,dx = 1 \quad \text{and} \quad f(x) \geq 0 \quad \forall x$$

Where:

  • f(x) is the probability density at point x
  • The integral over the entire real line equals 1 (total probability)

In Plain English: The density curve can never go below zero, and the total area under it must equal 1. But a density value at a specific point can exceed 1 — that means the data is tightly concentrated there, not that probability exceeds 100%. Only area gives you probability.

Common Pitfall: Seeing f(x) = 2.5 at some point and thinking "that's a 250% probability." A density value is not a probability. It's how concentrated the data is at that point. Think of it like population density (people per square mile) versus actual population count.

Cumulative Distribution Function (CDF) — for both

The CDF answers: "What fraction of outcomes fall at or below value x?" It works for both discrete and continuous variables, making it the most versatile of the three.

$$F(x) = P(X \leq x)$$

Where:

  • F(x) is the CDF evaluated at x
  • P(X ≤ x) is the probability that the random variable is at most x

In Plain English: If the CDF at $50 equals 0.62, it means 62% of all purchases are $50 or less. The CDF always starts at 0 (far left) and ends at 1 (far right), rising monotonically. For discrete variables it looks like a staircase; for continuous variables it's a smooth S-curve.

The CDF is the integral of the PDF (or cumulative sum of the PMF). Conversely, the PDF is the derivative of the CDF. This relationship means you can always convert between them.
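
This relationship is easy to check numerically. The sketch below (an illustration added here, not part of the article's original code) integrates the normal PDF for the purchase-amount model used throughout and compares the result against scipy's built-in CDF:

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

# Purchase-amount model used throughout this article: Normal(mu=52, sigma=15)
dist = stats.norm(loc=52, scale=15)

# CDF at $50 directly, and by integrating the PDF from -infinity to 50
cdf_direct = dist.cdf(50)
cdf_integrated, _ = quad(dist.pdf, -np.inf, 50)

print(f"Direct CDF:     {cdf_direct:.6f}")
print(f"Integrated PDF: {cdf_integrated:.6f}")
```

The two values agree to numerical precision: the CDF really is the accumulated area under the PDF.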

The normal distribution

The normal (Gaussian) distribution is the single most important distribution in statistics. Its dominance comes not from being common in raw data — most real-world measurements are at least slightly skewed — but from the Central Limit Theorem (CLT): average enough independent observations from any distribution, and those averages converge to a normal shape. The CLT is why sample means, measurement errors, and aggregated metrics so reliably look bell-shaped.

According to Lyapunov's 1901 formulation of the CLT, this convergence holds under remarkably weak conditions — the individual variables don't even need to follow the same distribution, as long as no single variable dominates the sum.

The PDF formula

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2}$$

Where:

  • f(x) is the probability density at value x
  • μ (mu) is the mean — the center of the bell
  • σ (sigma) is the standard deviation — controls the width
  • 1/(σ√(2π)) is the normalizing constant that forces the total area to 1
  • e^(−½(·)²) is the exponential decay that creates the bell shape

In Plain English: The density drops off exponentially as a value moves further from the mean. If average purchase amount is $52, a purchase of $55 (close to the mean) has high density, while $120 (far from the mean) has very low density. The standard deviation controls how quickly the density falls — a small σ means a narrow, peaked bell; a large σ means a wide, flat one.

The 68-95-99.7 rule

For any normal distribution:

  • 68.27% of values fall within 1 standard deviation of the mean
  • 95.45% within 2 standard deviations
  • 99.73% within 3 standard deviations

If average purchase amount is $52 with σ = $15, then 68% of purchases fall between $37 and $67, 95% between $22 and $82, and a customer spending $97+ is in the top 0.13% — a statistically rare event worth flagging for fraud review or VIP treatment.

Python: fitting and visualizing the normal distribution

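A minimal sketch of the fitting code, assuming a simulated sample of 10,000 purchases drawn from Normal(52, 15) — the seed is arbitrary, so your fitted values will differ slightly from the output below:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)  # arbitrary seed -- results vary slightly with it
# Simulate 10,000 purchase amounts: Normal(mu=52, sigma=15)
purchases = rng.normal(loc=52, scale=15, size=10_000)

# Maximum-likelihood fit of a normal distribution
mu, sigma = stats.norm.fit(purchases)
print(f"Fitted mu: {mu:.2f}, sigma: {sigma:.2f}")

# 68% and 95% ranges from the fitted parameters
print(f"68% range: ${mu - sigma:.2f} to ${mu + sigma:.2f}")
print(f"95% range: ${mu - 2 * sigma:.2f} to ${mu + 2 * sigma:.2f}")
```
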
Expected output:

```
Fitted mu: 51.99, sigma: 15.01
68% range: $36.98 to $67.00
95% range: $21.97 to $82.01
```

Common Pitfall: Assuming normality when your data has fat tails causes severe underestimation of extreme events. Financial risk models that assumed normal distributions notoriously failed during the 2008 crisis because real market returns have heavier tails (excess kurtosis). Always verify with a QQ plot before committing to a normal assumption.

The binomial distribution

The binomial distribution counts the number of successes in a fixed number of independent yes/no trials, each with the same success probability. It answers questions like: "Out of 500 store visitors today, how many will convert if our conversion rate is 8%?"

Three requirements

The binomial model requires:

  1. Fixed number of trials (n) — you know in advance how many visitors you're tracking
  2. Two outcomes per trial — each visitor either converts (success) or doesn't (failure)
  3. Independence — one visitor's decision doesn't influence another's

If these conditions break — say, visitors share a promotional link and influence each other — the binomial model produces unreliable variance estimates. Consider a beta-binomial model instead.

The PMF formula

$$P(X = k) = \binom{n}{k} p^k (1 - p)^{n-k}$$

Where:

  • P(X = k) is the probability of exactly k successes out of n trials
  • (n choose k) = n! / (k!(n−k)!) is the binomial coefficient — the number of ways to arrange k successes in n trials
  • p^k is the probability that k specific trials all succeed
  • (1−p)^(n−k) is the probability that the remaining trials all fail

In Plain English: Your store has an 8% conversion rate. Out of 500 visitors, what's the chance exactly 40 convert? You multiply three things: the probability of 40 specific visitors converting (0.08^40), the probability of the other 460 not converting (0.92^460), and the number of ways to pick which 40 visitors are the converters (500 choose 40).

Key properties:

  • Mean: μ = np — with 500 visitors at 8%, expect 40 conversions on average
  • Variance: σ² = np(1−p) — the spread depends on both n and p
  • As n grows large, the binomial approximates a normal distribution (CLT again)

Python: visualizing binomial conversion counts

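A sketch of the computation using scipy.stats.binom for the 500-visitor, 8%-conversion scenario:

```python
from scipy import stats

n, p = 500, 0.08  # 500 visitors, 8% conversion rate
dist = stats.binom(n, p)

print(f"Expected conversions: {dist.mean():.0f}")  # np = 40
print(f"Std dev: {dist.std():.1f}")                # sqrt(np(1-p)) ~= 6.1
# P(X >= 50) via the survival function: sf(49) = 1 - P(X <= 49)
p_50_plus = dist.sf(49)
print(f"P(50+ conversions) = {p_50_plus:.4f} ({p_50_plus:.1%})")
```
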
Expected output:

```
Expected conversions: 40
Std dev: 6.1
P(50+ conversions) = 0.0622 (6.2%)
```

The Poisson distribution

The Poisson distribution models the number of events occurring in a fixed interval of time or space, when events happen independently at a constant average rate. It answers: "If our store averages 4 orders per hour, what's the probability of getting 10 in the next hour?"

The Poisson is the workhorse distribution for count data: website hits per minute, support tickets per day, server errors per deployment. According to Ladislaus Bortkiewicz's classic 1898 study, even deaths from horse kicks in the Prussian cavalry followed a Poisson process — the original proof that rare, independent events produce predictable aggregate patterns.

The PMF formula

$$P(X = k) = \frac{\lambda^k \, e^{-\lambda}}{k!}$$

Where:

  • P(X = k) is the probability of exactly k events in the interval
  • λ (lambda) is the average rate of events per interval
  • e^(−λ) is a decay factor ensuring the probabilities sum to 1
  • k! is the factorial — a brake that makes very high counts increasingly unlikely

In Plain English: Your store averages 4 orders per hour. The Poisson formula tells you the probability of exactly 10 orders in the next hour: (4^10 · e^−4) / 10! ≈ 0.005. That's about 0.5% — rare, but not impossible during a flash sale or viral social media moment.

The unique property: For a Poisson variable, mean and variance are both λ. If your data's variance significantly exceeds its mean, you have overdispersion and the Poisson model is a poor fit. Switch to a Negative Binomial distribution instead.

Python: Poisson order counts

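A sketch with scipy.stats.poisson for the 4-orders-per-hour store:

```python
from scipy import stats

lam = 4  # average orders per hour
dist = stats.poisson(lam)

print(f"Mean = Variance = {dist.mean():.0f}")  # both equal lambda
p_8_plus = dist.sf(7)  # P(X >= 8) = 1 - P(X <= 7)
print(f"P(8+ orders in 1 hour) = {p_8_plus:.4f} ({p_8_plus:.1%})")
print(f"P(0 orders in 1 hour) = {dist.pmf(0):.4f}")  # e^-4 ~= 0.0183
```
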
Expected output:

```
Mean = Variance = 4
P(8+ orders in 1 hour) = 0.0511 (5.1%)
P(0 orders in 1 hour) = 0.0183
```

The exponential distribution

While the Poisson counts how many events happen in an interval, the exponential distribution measures how long you wait between consecutive events. If orders arrive at a Poisson rate of λ per hour, the gaps between consecutive orders follow an exponential distribution with the same rate parameter. The two distributions are mathematical partners.

The PDF formula

$$f(x) = \lambda \, e^{-\lambda x}, \quad x \geq 0$$

Where:

  • f(x) is the probability density at waiting time x
  • λ is the rate parameter (events per unit time)
  • e^(−λx) is exponential decay — density drops as waiting time increases
  • Mean waiting time: 1/λ

In Plain English: If orders arrive at 4 per hour on average, the average wait between orders is 1/4 = 0.25 hours (15 minutes). Short waits are the most probable — the density is highest right at x = 0 and decays from there. A 45-minute gap between orders is possible but rare.

The memoryless property

The exponential distribution is the only continuous distribution that is memoryless:

$$P(X > s + t \mid X > s) = P(X > t)$$

If you've already waited 10 minutes without an order, the probability of waiting another 5 minutes is exactly the same as if you'd just started timing. The distribution has no memory of the past. This makes it appropriate for events with a constant hazard rate (random server failures, radioactive decay) but inappropriate when aging matters (human lifespan, battery degradation).

Python: exponential wait times

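A sketch using scipy.stats.expon, working in minutes (note scipy takes scale = 1/λ, so a rate of 4 per hour becomes scale = 15 minutes); it also verifies the memoryless property numerically:

```python
from scipy import stats

rate_per_hour = 4
scale = 60 / rate_per_hour  # scipy uses scale = 1/lambda; mean wait = 15 minutes
dist = stats.expon(scale=scale)

print(f"Mean wait time: {dist.mean():.0f} minutes")
p_30 = dist.sf(30)  # P(wait > 30 min) = e^(-30/15)
print(f"P(wait > 30 min) = {p_30:.4f} ({p_30:.1%})")

# Memoryless sanity check: P(X > 15 | X > 10) equals P(X > 5)
conditional = dist.sf(15) / dist.sf(10)
print(f"Memoryless check: {conditional:.4f} == {dist.sf(5):.4f}")
```
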
Expected output:

```
Mean wait time: 15 minutes
P(wait > 30 min) = 0.1353 (13.5%)
```

Pro Tip: In scipy.stats, the exponential distribution uses scale = 1/lambda, not lambda directly. If your data has a mean wait of 15 minutes, pass scale=15 (in minutes) — not scale=1/15. Confusing rate and scale is one of the most common bugs in distribution fitting code.

The uniform distribution

The uniform distribution assigns equal probability to every outcome within a fixed range. It represents "maximum ignorance" — the complete absence of a reason to favor any value over another.

The PDF formula

$$f(x) = \frac{1}{b - a}, \quad a \leq x \leq b$$

Where:

  • f(x) is the constant probability density between a and b
  • a is the minimum value
  • b is the maximum value
  • f(x) = 0 outside the interval [a, b]

In Plain English: Your store's random discount generator assigns each customer a discount between 5% and 25%, with every percentage equally likely. The density is 1/(25 − 5) = 0.05 everywhere between 5 and 25, and zero outside. The probability of getting a discount between 10% and 15% is simply (15 − 10)/(25 − 5) = 0.25 — no calculus required, just proportions.

Key properties:

  • Mean: (a + b)/2 (the midpoint)
  • Variance: (b − a)²/12

The uniform distribution appears in random number generators, Monte Carlo simulation (where base random numbers are drawn uniformly), hash functions, and as uninformative priors in Bayesian statistics.
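
The discount example is easy to verify in code (a sketch; note that scipy parameterizes the uniform with loc = a and scale = b − a):

```python
from scipy import stats

# Random discount between 5% and 25%: loc = a = 5, scale = b - a = 20
dist = stats.uniform(loc=5, scale=20)

print(f"Mean: {dist.mean():.1f}%")      # (a + b) / 2 = 15.0
print(f"Variance: {dist.var():.2f}")    # (b - a)^2 / 12 = 33.33
# P(10 <= discount <= 15) is just the interval width over the total width
p = dist.cdf(15) - dist.cdf(10)
print(f"P(10-15% discount) = {p:.2f}")  # 0.25
```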

The log-normal distribution

Many real-world quantities — purchase amounts, income, city populations, stock prices — are strictly positive, right-skewed, and span several orders of magnitude. The log-normal distribution models these naturally. A variable X is log-normal if ln(X) follows a normal distribution.

The PDF formula

$$f(x) = \frac{1}{x \, \sigma_{\ln} \sqrt{2\pi}} \, e^{-\frac{(\ln x - \mu_{\ln})^2}{2\sigma_{\ln}^2}}, \quad x > 0$$

Where:

  • μ_ln is the mean of ln(X) (not the mean of X itself)
  • σ_ln is the standard deviation of ln(X)
  • The 1/x factor in front creates the right skew

In Plain English: Take the logarithm of every purchase amount in your store. If those log-values form a bell curve, then the original dollar amounts follow a log-normal distribution. Most purchases cluster at modest amounts, but the right tail stretches to include those occasional $2,000+ orders — a pattern the normal distribution cannot produce because it would assign nonzero probability to negative purchases.

Python: log-normal purchase amounts

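A sketch of the fit, assuming underlying log-scale parameters μ_ln = 3.9 and σ_ln = 0.6 (chosen to put the median near $49; seed arbitrary, so your numbers will differ slightly from the output below):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)  # arbitrary seed -- outputs vary slightly with it
# ln(X) ~ Normal(3.9, 0.6) puts the median near $49 (parameters assumed)
purchases = rng.lognormal(mean=3.9, sigma=0.6, size=10_000)

# Fit, pinning scipy's extra location parameter at 0
shape, loc, scale = stats.lognorm.fit(purchases, floc=0)
dist = stats.lognorm(shape, loc=0, scale=scale)

print(f"Median purchase: ${dist.median():.2f}")
print(f"Mean purchase: ${dist.mean():.2f}")
print(f"95th percentile: ${dist.ppf(0.95):.2f}")
```
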
Expected output:

```
Median purchase: $49.33
Mean purchase: $59.15
95th percentile: $132.37
```

Key Insight: For log-normal data, the mean is always greater than the median. The "average purchase amount" is pulled upward by the long right tail. Reporting the median gives a better sense of the typical customer; the mean tells you about revenue concentration. Both matter, but for different decisions.

How distributions connect to each other

[Figure: Family tree of probability distributions showing how one leads to another]

Distributions don't exist in isolation — they form a connected family where one transforms into another under specific conditions. Understanding these connections means you can switch between models as your data changes or your question shifts.

| Starting distribution | Condition | Resulting distribution |
| --- | --- | --- |
| Bernoulli (single trial) | Repeat n times | Binomial(n, p) |
| Binomial(n, p) | n large, p small, λ = np | Poisson(λ) |
| Binomial(n, p) | n large, np > 5 | Normal(np, √(np(1−p))) |
| Poisson(λ) | λ > 20 | Normal(λ, √λ) |
| Poisson (count per interval) | Time between events | Exponential(λ) |
| Exponential | Sum of k exponentials | Gamma(k, λ) |
| Normal | e^X | Log-Normal |
| Normal | Square and sum n values | Chi-squared(n) |

Pro Tip: These connections are practical shortcuts. If your Poisson λ is 50 (high-traffic store), you can approximate it with a normal distribution N(50, √50) and use z-scores instead of Poisson tables. The approximation error is negligible.
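
A quick sketch of that shortcut, comparing the exact Poisson tail against the normal approximation (with the standard 0.5 continuity correction):

```python
import numpy as np
from scipy import stats

lam = 50  # high-traffic store: 50 orders per hour on average
exact = stats.poisson(lam)
approx = stats.norm(loc=lam, scale=np.sqrt(lam))

# P(at most 60 orders) under each model (+0.5 continuity correction)
p_exact = exact.cdf(60)
p_approx = approx.cdf(60.5)
print(f"Poisson: {p_exact:.4f}  Normal approximation: {p_approx:.4f}")
```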

Choosing the right distribution for your data

[Figure: Step-by-step pipeline for fitting a probability distribution to observed data]

Real data doesn't arrive with a label saying "I'm Poisson." Identifying which distribution generated your data is a critical step before building any statistical model, and the process combines visual inspection, domain knowledge, and formal testing.

Step 1: histogram and shape analysis

Plot a histogram and examine the shape:

  • Symmetric bell-shaped around a central peak: Normal candidate
  • Right-skewed with a long tail, values strictly positive: Exponential, Log-Normal, or Gamma
  • Discrete integer counts: Poisson or Binomial
  • Flat with hard boundaries: Uniform
  • Bounded between 0 and 1 (proportions, rates): Beta

Step 2: QQ plots for deeper comparison

A QQ (quantile-quantile) plot compares your data's quantiles against a theoretical distribution's quantiles. Points falling along the diagonal mean a match. Specific deviations reveal specific problems:

  • Points curving upward at the right end: heavier right tail than the reference
  • Points curving downward at the left end: heavier left tail
  • S-shaped pattern: lighter tails than the reference (platykurtic)

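A sketch that produces two such QQ plots with scipy.stats.probplot, using simulated samples (the seed and sample sizes are assumptions); the exponential sample is deliberately compared against a normal reference to show the mismatch:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; plots are saved, not shown
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(42)
normal_data = rng.normal(52, 15, 1000)  # simulated purchase amounts
expon_data = rng.exponential(15, 1000)  # simulated wait times

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
res_norm = stats.probplot(normal_data, dist="norm", plot=axes[0])
axes[0].set_title("Normal data vs. normal reference")
res_exp = stats.probplot(expon_data, dist="norm", plot=axes[1])
axes[1].set_title("Exponential data vs. normal reference")
fig.savefig("qq_plots.png")
```
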
The left plot shows points hugging the diagonal — a good fit. The right plot shows a severe upward curve at the right end, revealing that exponential data has a much heavier right tail than the normal distribution allows.

Step 3: formal goodness-of-fit tests

When visual inspection is ambiguous, statistical tests provide quantitative answers.

Shapiro-Wilk is the gold standard for testing normality (works best with n < 5000):

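A sketch of the test on simulated data (p-values depend on the sample, so the exact numbers — and potentially the verdict for the normal sample — will vary with the seed):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)  # arbitrary seed -- p-values vary with it
normal_data = rng.normal(52, 15, 1000)
expon_data = rng.exponential(15, 1000)

for label, data in [("Normal data", normal_data), ("Exponential data", expon_data)]:
    stat, p_value = stats.shapiro(data)
    verdict = "Cannot reject normality" if p_value > 0.05 else "Reject normality"
    print(f"{label} — Shapiro-Wilk p-value: {p_value:.4f} -> {verdict}")
```
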
Expected output:

```
Normal data — Shapiro-Wilk p-value: 0.6273 -> Cannot reject normality
Exponential data — Shapiro-Wilk p-value: 0.000000 -> Reject normality
```

Kolmogorov-Smirnov (KS) test is more general — it compares your data against any reference distribution by measuring the maximum gap between the empirical and theoretical CDFs:

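A sketch of the KS workflow: fit an exponential to simulated wait times, then test the fit (seed arbitrary, so the p-value will differ from the output below):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)  # arbitrary seed
wait_times = rng.exponential(scale=15, size=1000)

# Fit an exponential (location pinned at 0), then test the fitted model
loc, scale = stats.expon.fit(wait_times, floc=0)
stat, p_value = stats.kstest(wait_times, "expon", args=(loc, scale))
print(f"KS test (exponential fit) p-value: {p_value:.4f}")
print("-> Good fit" if p_value > 0.05 else "-> Poor fit")
```

One caveat on this design: estimating the parameters from the same data you then test makes the KS p-value optimistic; for a stricter check, fit on one sample and test on a holdout.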
Expected output:

```
KS test (exponential fit) p-value: 0.5402
-> Good fit
```

Pro Tip: With large samples (n > 5,000), Shapiro-Wilk and KS tests become overly sensitive, flagging trivially small deviations as "statistically significant." At n = 100,000, nearly every dataset fails a normality test. Combine the formal test (tells you if a deviation is detectable) with the QQ plot (tells you if it matters practically).

Quick reference: distribution selector

[Figure: Flowchart for selecting the right probability distribution based on data characteristics]

| Scenario | Distribution | Key parameters | E-commerce example |
| --- | --- | --- | --- |
| Measurements clustering around a mean | Normal | μ, σ | Average purchase amounts |
| Count of successes in fixed trials | Binomial | n, p | Conversions from 500 visitors |
| Count of events in a time interval | Poisson | λ | Orders per hour |
| Waiting time between events | Exponential | λ | Minutes between orders |
| All outcomes equally likely in a range | Uniform | a, b | Random discount percentage |
| Positive, right-skewed data | Log-Normal | μ_ln, σ_ln | Purchase amounts with high-value tail |
| Overdispersed counts | Negative Binomial | r, p | Support tickets (variance > mean) |

When to use distributions (and when not to)

Picking the right distribution is not just academic — it directly affects model accuracy, inference validity, and business decisions.

When distributions matter most

  1. Hypothesis testing — Every t-test, ANOVA, and chi-square test rests on distributional assumptions. Using a t-test on heavily skewed data produces unreliable p-values. See our guide on hypothesis testing.
  2. Generative models — GMMs, Naive Bayes, and VAEs explicitly model data as mixtures of distributions.
  3. Risk and reliability — Insurance pricing, credit scoring, and failure prediction all rest on fitting the right distribution to tail behavior.
  4. Simulation and planning — Monte Carlo methods need realistic distributions for each input variable (demand, lead time, failure rate) to produce meaningful results.

When NOT to assume a parametric distribution

  1. When your data is multimodal — Two peaks suggest a mixture of populations (e.g., casual shoppers vs. wholesale buyers), not a single distribution. Fit a mixture model or segment first.
  2. When you have very small samples (n < 30) — Goodness-of-fit tests have almost no power. Use non-parametric methods that make no distributional assumption.
  3. When the data-generating process changes over time — Non-stationary time series violate the assumption of a fixed distribution. Check for stationarity first.
  4. When tree-based models are your goal — Random forests, XGBoost, and other tree methods don't assume any distribution. Spending time fitting distributions to features is wasted effort for these models.

Production considerations

Distribution fitting looks clean in a notebook, but production pipelines add constraints.

Computational complexity

| Operation | Time complexity | Notes |
| --- | --- | --- |
| MLE parameter fitting (scipy) | O(n) per iteration | Newton-Raphson with ~5-20 iterations typical |
| Shapiro-Wilk test | O(n log n) | Sorting dominates; limited to n ≤ 5000 in scipy |
| KS test | O(n log n) | Empirical CDF sort; works at any n |
| QQ plot generation | O(n log n) | Sorting + quantile computation |
| PDF/CDF evaluation | O(1) per point | Vectorized: O(n) for n points |

Memory and scaling

  • Under 100K rows: All scipy distribution operations run in under a second on a standard machine (16 GB RAM). No special considerations needed.
  • 1M+ rows: MLE fitting still runs in seconds, but plotting histograms with too many bins can consume excessive memory. Use bins='auto' or cap at 200 bins.
  • 100M+ rows: Consider sampling. Fit distributions on a random 100K sample — the parameter estimates will be nearly identical by the law of large numbers. The deviation a KS test can detect shrinks as O(1/√n), so extremely large samples reject every distribution. Use visual diagnostics instead.

Common deployment pitfalls

  • Fitting once, deploying forever — Distributions shift as your user base changes. Re-fit monthly or set up drift detection (compare current data's KS statistic against the original fit).
  • Ignoring censored data — If your logging system caps values (e.g., session time maxes out at 30 minutes), use survival analysis techniques instead of naive MLE.
  • Confusing scipy's parameterization — scipy's expon uses scale = 1/lambda, gamma uses a = shape and scale = 1/beta. Read the docstrings carefully. As of scipy 1.15 (current stable), the scipy.stats documentation lists every distribution's parameter conventions.

Conclusion

Probability distributions transform raw numbers into a structured understanding of uncertainty. The normal distribution tells you how measurements cluster and why sample averages behave predictably. The binomial quantifies success rates in repeated trials. The Poisson captures the rhythm of events over time. The exponential models the gaps between those events. The log-normal handles the right-skewed, positive-only data that dominates real-world transactions.

The practical value lies in matching the right distribution to your data before any modeling begins. A histogram plus a QQ plot takes 30 seconds and prevents the kind of assumption violations that invalidate entire analyses. A normal assumption applied to exponential wait times produces nonsensical negative predictions. A Poisson model applied to overdispersed counts understates variability by a factor of two or more.

These distributions feed directly into the statistical methods that build on them. The normal distribution underpins the Central Limit Theorem, which justifies confidence intervals and hypothesis testing. Combining distributions with prior beliefs leads to Bayesian statistics, where your uncertainty updates as new evidence arrives. Every advanced technique in data science — from A/B testing to deep learning — rests on the distributional foundations covered here.

Frequently Asked Interview Questions

Q: What is the difference between a PMF and a PDF?

A PMF (Probability Mass Function) assigns actual probabilities to each specific outcome of a discrete random variable — P(X = 3) = 0.18 is a real probability. A PDF (Probability Density Function) gives density values for continuous variables, not probabilities. The probability of a continuous variable equaling any single exact value is zero; you get probabilities only by integrating the PDF over an interval. A PDF value can exceed 1 (it represents concentration, not probability).

Q: Your A/B test shows a 2% conversion rate difference. How do you decide which distribution to use for significance testing?

Conversion is a binary outcome (converted or not), so each visitor's result follows a Bernoulli trial. The total number of conversions follows a binomial distribution. With large sample sizes (which most A/B tests have), the normal approximation to the binomial applies, and you can use a z-test for proportions. The key assumptions to verify are independence between visitors and a fixed probability p within each group.

Q: How do you check whether your data follows a normal distribution?

Three approaches in order of usefulness: (1) Plot a histogram — is it roughly symmetric and bell-shaped? (2) Create a QQ plot against a normal reference — do points fall on the diagonal? (3) Run a formal test like Shapiro-Wilk (best under 5,000 samples) or Kolmogorov-Smirnov. In practice, combine visual and statistical methods because formal tests become overly sensitive with large samples, rejecting practically-normal data for tiny deviations.

Q: When would you choose a Poisson distribution over a binomial?

Use Poisson when you're counting events in a continuous interval (time or space) without a fixed upper limit — website clicks per hour, defects per manufactured unit. Use binomial when you have a fixed number of trials with a known success probability — conversions out of 500 visitors, defective items in a batch of 100. The Poisson is actually the limit of the binomial as n → ∞ and p → 0 with λ = np held constant.

Q: You're modeling customer purchase amounts and the data is right-skewed. Which distribution would you consider and why?

Log-normal is the first candidate because it naturally produces positive, right-skewed data and arises when the value is the product of many small multiplicative factors (e.g., discount stacking, basket size effects). To verify, take the log of each purchase amount — if the log-transformed data looks normal (check with a QQ plot or Shapiro-Wilk), log-normal is a good fit. Other candidates include the Gamma distribution (especially for waiting times or aggregated counts) and the Weibull (for lifetime/duration data).

Q: What does "overdispersion" mean, and how do you handle it?

Overdispersion means the variance in your data exceeds what the assumed distribution predicts. For Poisson data, mean should equal variance. If variance is, say, 3x the mean, the Poisson model underestimates the probability of extreme counts. Common causes include unobserved heterogeneity (different customer segments with different rates) or correlation between events. The fix is switching to a Negative Binomial distribution, which adds a dispersion parameter that decouples the mean-variance relationship.

Q: Explain the memoryless property of the exponential distribution. Give a practical example where it applies and one where it fails.

Memorylessness means P(X > s + t | X > s) = P(X > t) — if you've already waited 10 minutes for an event, the probability of waiting another 5 is identical to the unconditional probability of waiting 5 minutes from scratch. It applies when the hazard rate is constant: random server errors, customer arrivals at a store (if the rate doesn't change with time of day). It fails for any aging or wear process — a 10-year-old car is more likely to break down than a new one, so the failure rate isn't constant, and a Weibull or Gamma distribution fits better.

Q: You have a dataset of 10 million rows. How does sample size affect distribution fitting?

At 10M rows, MLE parameter estimates are extremely precise — you likely have 4-5 significant digits of accuracy. However, formal goodness-of-fit tests (Shapiro-Wilk, KS) will reject every candidate distribution because they detect deviations as small as O(1/√n), which at n = 10M is tiny. The practical approach is to fit on a random 100K subsample (parameters will match to 2 decimal places), use QQ plots for visual assessment, and focus on whether the deviation matters for your application rather than whether it's statistically detectable.

Hands-On Practice

The article highlights the 'paradox of statistics': individual events are random, but aggregates follow predictable laws. To truly understand your data, you must identify the underlying probability distributions (Normal, Binomial, Poisson, etc.) that generated it.

In the code below, we will use scipy.stats and matplotlib to verify the distributions mentioned in the article against the clinical trial dataset. We will visualize the difference between Continuous (Normal) and Discrete (Binomial) data and mathematically test whether the data follows the 'Bell Curve' assumptions required for many machine learning models.

Dataset: Clinical Trial (Statistics & Probability) Clinical trial dataset with 1000 patients designed for statistics and probability tutorials. Contains treatment groups (4 levels) for ANOVA, binary outcomes for hypothesis testing and A/B testing, confounders for causal inference, time-to-event data for survival analysis, and pre-generated distribution samples.

By plotting the histograms and comparing them to theoretical curves (PDF for continuous, PMF for discrete), we confirmed that sample_normal follows the Gaussian laws and sample_binomial follows the coin-flip logic. Furthermore, we used scipy.stats.normaltest to mathematically prove that sample_skewed violates the Normality assumption, a critical check before applying parametric statistical methods.

Explore all career paths