Fewer than 30% of data science candidates can write a correct Day-7 retention query from scratch, according to patterns reported by interview prep platforms. Cohort analysis sits at the intersection of product intuition and technical precision: you need to define a metric clearly, implement it without errors, and explain what the output actually means to a product team. That combination makes this one of the most differentiating question categories in any product-facing data role.
This article walks through 13 worked questions covering every major pattern: DAU/WAU/MAU aggregations, cohort table construction, N-day and rolling retention, churn detection, and advanced metrics like L28 and power user curves. Each question includes a schema, a full SQL solution, and an explanation of what the interviewer is actually evaluating.
The questions in this article are drawn from our research of publicly available interview prep platforms and community discussions, including DataLemur, StrataScratch, LeetCode, and forums such as r/datascience. These patterns have been reported by candidates across interviews at technology companies with consumer-facing products. We do not claim these are proprietary questions from any specific organization.
DAU, WAU, and MAU Calculations
Active user counts are the first thing any product team looks at. Interviewers ask these questions to see whether you can define "active" clearly before you write a single line of SQL, because the definition changes completely depending on what the product does.
The schema used throughout this section:
-- events table
user_id INTEGER
event_type VARCHAR -- 'app_open', 'like', 'comment', 'purchase', etc.
event_date DATE
Question 1: Compute daily, weekly, and monthly active users from a single events table
A consumer app tracks user activity in an events table. Write queries to compute DAU, WAU, and MAU for each period in the dataset. An "active" user is one who triggered at least one app_open event.
Before writing anything, define the denominator precisely. DAU counts distinct users per calendar day. WAU groups by ISO week. MAU groups by calendar month. All three use COUNT(DISTINCT user_id), never COUNT(*), which over-counts users who log multiple events in a period.
-- DAU
SELECT
event_date,
COUNT(DISTINCT user_id) AS dau
FROM events
WHERE event_type = 'app_open'
GROUP BY event_date
ORDER BY event_date;
-- WAU
SELECT
DATE_TRUNC('week', event_date) AS week_start,
COUNT(DISTINCT user_id) AS wau
FROM events
WHERE event_type = 'app_open'
GROUP BY DATE_TRUNC('week', event_date)
ORDER BY week_start;
-- MAU
SELECT
DATE_TRUNC('month', event_date) AS month_start,
COUNT(DISTINCT user_id) AS mau
FROM events
WHERE event_type = 'app_open'
GROUP BY DATE_TRUNC('month', event_date)
ORDER BY month_start;
Expected output (DAU sample):
| event_date | dau |
|---|---|
| 2024-01-01 | 4821 |
| 2024-01-02 | 5103 |
| 2024-01-03 | 4990 |
Key Insight: DATE_TRUNC('week', ...) anchors to Monday in most SQL dialects. If the interviewer specifies Sunday-anchored weeks, use DATE_TRUNC('week', event_date + INTERVAL '1 day') - INTERVAL '1 day' or clarify the expected behavior.
Common Mistake: Filtering on a single event type when the question says "any activity." Always re-read the active-user definition before writing the WHERE clause.
What the Interviewer Is Testing: Whether you distinguish COUNT(DISTINCT) from COUNT(*), and whether you proactively clarify the definition of "active."
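The Monday/Sunday anchoring from the Key Insight above is easy to sanity-check outside the database. A pure-Python sketch that mimics Postgres's DATE_TRUNC('week', ...) semantics (not the database itself):

```python
from datetime import date, timedelta

def trunc_week_monday(d: date) -> date:
    # Mimic Postgres DATE_TRUNC('week', d): snap back to the Monday of d's week.
    return d - timedelta(days=d.weekday())   # weekday(): Mon=0 .. Sun=6

def trunc_week_sunday(d: date) -> date:
    # The shift trick: truncate (d + 1 day) to Monday, then step back one day,
    # which yields the preceding (or same) Sunday.
    return trunc_week_monday(d + timedelta(days=1)) - timedelta(days=1)

# Wednesday 2024-01-03 falls in the Monday-anchored week starting Jan 1,
# but in the Sunday-anchored week starting Dec 31.
wed = date(2024, 1, 3)
monday_anchor = trunc_week_monday(wed)   # date(2024, 1, 1)
sunday_anchor = trunc_week_sunday(wed)   # date(2023, 12, 31)
```

Walking one date through both functions like this is also a quick way to confirm which anchoring the interviewer expects before you commit to a GROUP BY.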
Question 2: Rolling 7-day average DAU using a window function
DAU spikes on weekends and dips midweek for most consumer products. A rolling 7-day average smooths that noise. Write a query that computes, for every date, the rolling 7-day average of daily active user counts.
The naive approach (a self-join on dates) works but is slow on large tables. A window function with ROWS BETWEEN 6 PRECEDING AND CURRENT ROW is faster and more readable, but the frame operates on rows: applied directly to raw events it averages event rows, not user counts. The correct approach aggregates distinct users per day first, then applies the window over those daily rows.
WITH daily_active AS (
SELECT
event_date,
COUNT(DISTINCT user_id) AS dau
FROM events
WHERE event_type = 'app_open'
GROUP BY event_date
)
SELECT
event_date,
dau,
AVG(dau) OVER (
ORDER BY event_date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) AS rolling_7d_avg_dau
FROM daily_active
ORDER BY event_date;
Expected output (sample):
| event_date | dau | rolling_7d_avg_dau |
|---|---|---|
| 2024-01-01 | 4821 | 4821.00 |
| 2024-01-07 | 5012 | 4934.14 |
| 2024-01-14 | 5198 | 5047.71 |
Key Insight: The first 6 rows of the rolling average use fewer than 7 days. This is correct behavior (the window expands from 1 to 7), but mention it in your interview. Some interviewers want you to exclude the ramp-up period with a subquery filter like WHERE event_date >= (SELECT MIN(event_date) FROM daily_active) + INTERVAL '6 days'.
Common Mistake: Applying AVG() directly over raw events instead of over pre-aggregated daily counts. This gives a wrong result because each row is an event, not a user-day.
What the Interviewer Is Testing: Window function syntax, specifically the ROWS frame clause, and whether you understand the difference between smoothing and aggregation.
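The frame behavior, including the ramp-up noted in the Key Insight, can be reproduced in a few lines of plain Python. This is a sketch of the two-step pattern (aggregate daily, then average a trailing window), using made-up daily counts:

```python
# Made-up pre-aggregated daily distinct-user counts (one value per day).
daily_dau = [100, 120, 110, 130, 125, 140, 135, 150]

def rolling_avg(values, window=7):
    out = []
    for i in range(len(values)):
        # Equivalent of ROWS BETWEEN 6 PRECEDING AND CURRENT ROW:
        # the frame expands from 1 row up to `window` rows.
        frame = values[max(0, i - window + 1): i + 1]
        out.append(sum(frame) / len(frame))
    return out

avgs = rolling_avg(daily_dau)
# The first 6 entries average fewer than 7 days (the ramp-up period);
# from index 6 onward each entry averages exactly 7 trailing values.
```

Tracing the first few outputs by hand (day 0 averages 1 value, day 1 averages 2, and so on) is exactly the kind of verification worth narrating in an interview.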
Question 3: DAU/MAU ratio as an engagement stickiness metric
The DAU/MAU ratio measures how frequently the typical monthly active user opens the app. A ratio of 0.5 means the typical monthly active user is active 15 of 30 days. Write a query that computes this ratio for each month in the dataset.
WITH monthly AS (
SELECT
DATE_TRUNC('month', event_date) AS month_start,
COUNT(DISTINCT user_id) AS mau
FROM events
WHERE event_type = 'app_open'
GROUP BY 1
),
daily AS (
SELECT
DATE_TRUNC('month', event_date) AS month_start,
event_date,
COUNT(DISTINCT user_id) AS dau
FROM events
WHERE event_type = 'app_open'
GROUP BY 1, 2
),
daily_avg AS (
SELECT
month_start,
AVG(dau) AS avg_dau
FROM daily
GROUP BY 1
)
SELECT
m.month_start,
m.mau,
da.avg_dau,
ROUND(da.avg_dau / m.mau, 3) AS dau_mau_ratio
FROM monthly m
JOIN daily_avg da USING (month_start)
ORDER BY m.month_start;
Expected output:
| month_start | mau | avg_dau | dau_mau_ratio |
|---|---|---|---|
| 2024-01-01 | 48210 | 4986.2 | 0.103 |
| 2024-02-01 | 51430 | 5301.4 | 0.103 |
Key Insight: A messaging app might have a DAU/MAU of 0.50 or higher. A utility app might be 0.05. Neither number is inherently good; context matters. Say this in the interview.
Common Mistake: Dividing avg_dau by an MAU computed from different filters. Both numerator and denominator must use the same activity definition.
What the Interviewer Is Testing: Multi-step aggregation with CTEs, and whether you understand what the metric actually signals about product health.
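The ratio has a clean interpretation worth stating out loud: average DAU divided by MAU equals the average number of active days per monthly active user divided by the days in the period. A tiny pure-Python check with made-up activity data confirms the identity:

```python
# user_id -> set of day indices (0..9) on which the user was active,
# over a toy 10-day "month" (made-up data).
activity = {
    1: {0, 1, 2, 3, 4},   # active 5 of 10 days
    2: {0, 5},            # active 2 of 10 days
    3: {9},               # active 1 of 10 days
}
days = 10
mau = len(activity)

# avg DAU: mean over days of the count of users active that day
avg_dau = sum(
    sum(1 for ds in activity.values() if d in ds) for d in range(days)
) / days

ratio = avg_dau / mau
avg_active_days = sum(len(ds) for ds in activity.values()) / mau
# ratio == avg_active_days / days  (same user-day total, two groupings)
```

Both expressions sum the same user-day total, just grouped differently, which is why a DAU/MAU of 0.5 translates directly to "active 15 of 30 days."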
Cohort Table Construction
A cohort table is the foundation for all retention analysis. Building one from scratch is asked in almost every senior product analytics interview and requires you to anchor each user to their first activity date before joining back to the full event log.
The schema used throughout this section:
-- users table
user_id INTEGER
signup_date DATE
-- events table
user_id INTEGER
event_date DATE
event_type VARCHAR
Question 4: Assign each user to a monthly cohort based on first activity
Before building a retention table, you need a cohort assignment. Write a query that assigns each user to the month they were first active, using the events table (not the users table, because the interviewer wants to test whether you know to derive cohort from actual activity, not just signup date).
WITH first_activity AS (
SELECT
user_id,
MIN(event_date) AS first_active_date,
DATE_TRUNC('month', MIN(event_date)) AS cohort_month
FROM events
GROUP BY user_id
)
SELECT
user_id,
first_active_date,
cohort_month
FROM first_activity
ORDER BY cohort_month, user_id;
Expected output:
| user_id | first_active_date | cohort_month |
|---|---|---|
| 101 | 2024-01-03 | 2024-01-01 |
| 102 | 2024-01-17 | 2024-01-01 |
| 103 | 2024-02-05 | 2024-02-01 |
Key Insight: Using the users.signup_date rather than MIN(event_date) will produce a different, and usually less accurate, cohort assignment. A user who signs up but never logs in should not appear in an engagement cohort. Clarify which to use.
Common Mistake: Reaching for FIRST_VALUE() when MIN() is all you need. FIRST_VALUE() requires a window specification, returns a value on every event row rather than one row per user, and depends on a deterministic ORDER BY when there are ties. MIN() with GROUP BY is simpler and harder to get wrong.
What the Interviewer Is Testing: Whether you understand the difference between signup date and first-activity date as cohort anchors, and whether you proactively ask about the definition.
Question 5: Build a monthly retention cohort table
The canonical cohort analysis question. Write a query that returns, for each cohort month, the number of users still active in each of the 6 months following their first activity. The output should be a matrix: cohort_month × months_since_acquisition.
WITH first_activity AS (
SELECT
user_id,
DATE_TRUNC('month', MIN(event_date)) AS cohort_month
FROM events
GROUP BY user_id
),
user_activity AS (
SELECT
e.user_id,
fa.cohort_month,
DATE_TRUNC('month', e.event_date) AS activity_month
FROM events e
JOIN first_activity fa ON e.user_id = fa.user_id
),
month_offsets AS (
SELECT
cohort_month,
activity_month,
(
(EXTRACT(YEAR FROM activity_month) - EXTRACT(YEAR FROM cohort_month)) * 12
+ (EXTRACT(MONTH FROM activity_month) - EXTRACT(MONTH FROM cohort_month))
) AS months_since_acquisition,
COUNT(DISTINCT user_id) AS retained_users
FROM user_activity
GROUP BY 1, 2, 3
),
cohort_sizes AS (
SELECT cohort_month, COUNT(DISTINCT user_id) AS cohort_size
FROM first_activity
GROUP BY cohort_month
)
SELECT
mo.cohort_month,
cs.cohort_size,
mo.months_since_acquisition,
mo.retained_users,
ROUND(
100.0 * mo.retained_users / cs.cohort_size, 1
) AS retention_pct
FROM month_offsets mo
JOIN cohort_sizes cs ON mo.cohort_month = cs.cohort_month
WHERE mo.months_since_acquisition BETWEEN 0 AND 6
ORDER BY mo.cohort_month, mo.months_since_acquisition;
Expected output (sample):
| cohort_month | cohort_size | months_since_acquisition | retained_users | retention_pct |
|---|---|---|---|---|
| 2024-01-01 | 4,210 | 0 | 4,210 | 100.0 |
| 2024-01-01 | 4,210 | 1 | 2,947 | 70.0 |
| 2024-01-01 | 4,210 | 2 | 2,105 | 50.0 |
| 2024-01-01 | 4,210 | 3 | 1,642 | 39.0 |
| 2024-02-01 | 3,918 | 0 | 3,918 | 100.0 |
| 2024-02-01 | 3,918 | 1 | 2,624 | 67.0 |
Key Insight: Month 0 always shows 100% retention because we anchor the cohort on first activity, so every user is active in the month they first appeared. If you see month-0 retention below 100%, something is wrong with your join logic.
Common Mistake: Reaching for a dialect-specific DATEDIFF(month, ...) without checking its semantics. Some implementations count month-boundary crossings rather than full elapsed months, and behavior varies across dialects. The explicit formula (year_diff * 12) + month_diff is unambiguous and works correctly across year boundaries.
What the Interviewer Is Testing: Multi-CTE construction, the JOIN strategy for matching cohort month to activity month, and the offset calculation.
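The offset formula from the month_offsets CTE is worth verifying across a year boundary before trusting it. A minimal Python sketch of the same arithmetic:

```python
from datetime import date

def months_between(cohort: date, activity: date) -> int:
    # The (year_diff * 12) + month_diff offset used in the cohort query.
    # Both inputs are assumed to be month-truncated (first of the month).
    return (activity.year - cohort.year) * 12 + (activity.month - cohort.month)

# Crosses a year boundary correctly: Nov 2023 cohort, Feb 2024 activity -> 3.
offset = months_between(date(2023, 11, 1), date(2024, 2, 1))
```

Because both dates are month-truncated, day-of-month never enters the calculation, so the offset is a pure count of calendar months between cohort and activity.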
N-Day Retention
N-day retention asks a precise question: of the users who were first active on date D, what fraction were also active exactly N days later? The word "exactly" is critical, and it distinguishes this from rolling retention, which uses "on or after."
The schema used throughout this section:
-- logins table
user_id INTEGER
login_date DATE
Question 6: Day-1 and Day-7 retention rates
A social platform tracks daily logins. Write a single query that returns Day-1 and Day-7 retention rates, defined as the fraction of users who first logged in on a given date and also logged in exactly 1 and exactly 7 days later, respectively.
The self-join approach is the most readable in an interview setting:
WITH first_login AS (
SELECT
user_id,
MIN(login_date) AS first_date
FROM logins
GROUP BY user_id
)
SELECT
fl.first_date,
COUNT(DISTINCT fl.user_id) AS cohort_size,
COUNT(DISTINCT CASE WHEN l1.login_date = fl.first_date + INTERVAL '1 day'
THEN l1.user_id END) AS day1_retained,
COUNT(DISTINCT CASE WHEN l7.login_date = fl.first_date + INTERVAL '7 days'
THEN l7.user_id END) AS day7_retained,
ROUND(
100.0 * COUNT(DISTINCT CASE WHEN l1.login_date = fl.first_date + INTERVAL '1 day'
THEN l1.user_id END)
/ COUNT(DISTINCT fl.user_id), 1
) AS day1_retention_pct,
ROUND(
100.0 * COUNT(DISTINCT CASE WHEN l7.login_date = fl.first_date + INTERVAL '7 days'
THEN l7.user_id END)
/ COUNT(DISTINCT fl.user_id), 1
) AS day7_retention_pct
FROM first_login fl
LEFT JOIN logins l1
ON fl.user_id = l1.user_id
AND l1.login_date = fl.first_date + INTERVAL '1 day'
LEFT JOIN logins l7
ON fl.user_id = l7.user_id
AND l7.login_date = fl.first_date + INTERVAL '7 days'
GROUP BY fl.first_date
ORDER BY fl.first_date;
Expected output:
| first_date | cohort_size | day1_retained | day7_retained | day1_retention_pct | day7_retention_pct |
|---|---|---|---|---|---|
| 2024-01-01 | 312 | 187 | 134 | 59.9 | 42.9 |
| 2024-01-02 | 298 | 174 | 121 | 58.4 | 40.6 |
Key Insight: Use LEFT JOIN, not INNER JOIN. If you use INNER JOIN, users who were NOT retained on Day-7 disappear from the result entirely, destroying the denominator.
Common Mistake: Using login_date >= first_date + INTERVAL '7 days' instead of = first_date + INTERVAL '7 days'. The >= condition gives you rolling retention, which is a different metric.
What the Interviewer Is Testing: The distinction between Day-N (equality join) and rolling retention (inequality join), and correct LEFT JOIN usage.
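The equality-versus-inequality distinction is the whole question, so it helps to see both definitions side by side on the same toy data. A pure-Python sketch with made-up login sets:

```python
from datetime import date, timedelta

# user_id -> set of login dates (made-up sample data)
logins = {
    1: {date(2024, 1, 1), date(2024, 1, 8)},    # returned exactly on day 7
    2: {date(2024, 1, 1), date(2024, 1, 20)},   # returned later, not on day 7
    3: {date(2024, 1, 1)},                      # never returned
}
first = {u: min(ds) for u, ds in logins.items()}

# Day-7 (exact): logged in on first_date + 7 days, equality check.
day7_exact = {u for u in logins
              if first[u] + timedelta(days=7) in logins[u]}

# Rolling Day-7: logged in on OR after first_date + 7 days.
day7_rolling = {u for u in logins
                if any(d >= first[u] + timedelta(days=7) for d in logins[u])}
```

User 2 is the interesting case: rolling-retained but not Day-7 retained, which is exactly the row the `>=` bug silently reclassifies.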
Question 7: Rolling Day-30 retention using window functions
Rolling retention is more generous than exact Day-N retention. A user is "rolling retained at Day-30" if they returned at any point on or after their first active day plus 30 days. This better represents whether a user found lasting value in the product.
WITH first_login AS (
SELECT
user_id,
MIN(login_date) AS first_date
FROM logins
GROUP BY user_id
),
future_activity AS (
SELECT
fl.user_id,
fl.first_date,
MAX(l.login_date) AS last_seen_date
FROM first_login fl
LEFT JOIN logins l
ON fl.user_id = l.user_id
AND l.login_date > fl.first_date
GROUP BY fl.user_id, fl.first_date
)
SELECT
first_date,
COUNT(DISTINCT user_id) AS cohort_size,
COUNT(DISTINCT CASE WHEN last_seen_date >= first_date + INTERVAL '30 days'
THEN user_id END) AS rolling_day30_retained,
ROUND(
100.0 * COUNT(DISTINCT CASE WHEN last_seen_date >= first_date + INTERVAL '30 days'
THEN user_id END)
/ COUNT(DISTINCT user_id), 1
) AS rolling_day30_pct
FROM future_activity
WHERE first_date <= CURRENT_DATE - INTERVAL '30 days'
GROUP BY first_date
ORDER BY first_date;
Expected output:
| first_date | cohort_size | rolling_day30_retained | rolling_day30_pct |
|---|---|---|---|
| 2024-01-01 | 312 | 201 | 64.4 |
| 2024-01-02 | 298 | 189 | 63.4 |
Key Insight: Rolling Day-30 retention is always greater than or equal to exact Day-30 retention because it counts anyone who returned on or after Day 30, not only those who returned on Day 30 exactly.
Common Mistake: Forgetting to exclude cohorts too recent to have 30 days of observation. Users who first appeared within the last 30 days cannot possibly be rolling Day-30 retained yet, which artificially drags the rate down. The WHERE first_date <= CURRENT_DATE - INTERVAL '30 days' filter in the query above handles this; omitting it is the most common error.
What the Interviewer Is Testing: Understanding of the conceptual difference between Day-N and rolling retention, and the implications of incomplete cohorts.
Question 8: The date spine technique for filling retention gaps
When computing a retention curve, you expect a row for every cohort-period combination. But if no users from the January cohort returned in March, your query simply returns no row for (January, 2). These missing rows show up as gaps or misleading blanks in dashboards. Use a date spine to guarantee every combination appears, with an explicit zero where nobody returned.
-- Step 1: Build the date spine
WITH date_spine AS (
SELECT
generate_series(
DATE_TRUNC('month', (SELECT MIN(login_date) FROM logins)),
DATE_TRUNC('month', (SELECT MAX(login_date) FROM logins)),
INTERVAL '1 month'
)::DATE AS month_date
),
-- Step 2: Build cohorts
cohorts AS (
SELECT
user_id,
DATE_TRUNC('month', MIN(login_date))::DATE AS cohort_month
FROM logins
GROUP BY user_id
),
-- Step 3: Generate every cohort x period combination
cohort_periods AS (
SELECT DISTINCT
c.cohort_month,
ds.month_date AS activity_month,
(
(EXTRACT(YEAR FROM ds.month_date) - EXTRACT(YEAR FROM c.cohort_month)) * 12
+ (EXTRACT(MONTH FROM ds.month_date) - EXTRACT(MONTH FROM c.cohort_month))
) AS period_number
FROM cohorts c
CROSS JOIN date_spine ds
WHERE ds.month_date >= c.cohort_month
),
-- Step 4: Compute actual retained users
actual_retention AS (
SELECT
c.cohort_month,
DATE_TRUNC('month', l.login_date)::DATE AS activity_month,
COUNT(DISTINCT c.user_id) AS retained_users
FROM cohorts c
JOIN logins l ON c.user_id = l.user_id
GROUP BY 1, 2
),
cohort_sizes AS (
SELECT cohort_month, COUNT(DISTINCT user_id) AS cohort_size
FROM cohorts GROUP BY cohort_month
)
SELECT
cp.cohort_month,
cp.period_number,
cs.cohort_size,
COALESCE(ar.retained_users, 0) AS retained_users,
ROUND(
100.0 * COALESCE(ar.retained_users, 0) / cs.cohort_size, 1
) AS retention_pct
FROM cohort_periods cp
JOIN cohort_sizes cs ON cp.cohort_month = cs.cohort_month
LEFT JOIN actual_retention ar
ON cp.cohort_month = ar.cohort_month
AND cp.activity_month = ar.activity_month
ORDER BY cp.cohort_month, cp.period_number;
Key Insight: CROSS JOIN between cohorts and the date spine creates the full matrix of expected rows. The subsequent LEFT JOIN to actual data fills in real values and leaves 0 (via COALESCE) wherever nobody showed up.
Common Mistake: Forgetting the WHERE ds.month_date >= c.cohort_month filter on the cross join. Without it, you generate impossible rows where the activity month precedes the cohort month.
What the Interviewer Is Testing: Awareness of the missing-row problem in retention analysis and knowledge of the date spine pattern. This is a senior-level question; interviewers use it to separate candidates who have built real retention dashboards from those who have only read about them.
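The core of the pattern, cross join to build the full matrix, then fill misses with zero, fits in a few lines of Python. A sketch with made-up cohort and retention counts:

```python
from itertools import product

cohorts = {"2024-01": 3, "2024-02": 2}           # cohort_month -> cohort size
actual = {                                        # (cohort, period) -> retained
    ("2024-01", 0): 3, ("2024-01", 1): 2,         # nobody from Jan returned
    ("2024-02", 0): 2,                            # in period 2: no key exists
}
periods = range(3)

# CROSS JOIN equivalent: every cohort x period combination.
spine = list(product(cohorts, periods))

# LEFT JOIN + COALESCE equivalent: real value where it exists, else 0.
filled = {(c, p): actual.get((c, p), 0) for c, p in spine}
```

The dict `.get(..., 0)` plays the role of LEFT JOIN plus COALESCE: the (January, 2) cell now exists and reads 0 instead of being absent from the result.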
Churn Detection
Churn detection asks: which users stopped coming back? The mechanics vary: some products define churn as 30 days of inactivity, others use subscription cancellation dates, and still others use a probabilistic model. In SQL interviews, the most common framing is either "find users who were active in period X but not in period X+1" or "find users whose last activity was more than N days ago."
The schema used throughout this section:
-- user_activity table
user_id INTEGER
activity_date DATE
-- subscriptions table
user_id INTEGER
plan_id VARCHAR
start_date DATE
end_date DATE -- NULL if still active
Question 9: Identify churned users with LAG
A subscription app wants to detect users who were active (logged in) in one month but completely absent the next month. Write a query that flags each user's monthly activity status and identifies the first month they churned.
WITH monthly_activity AS (
SELECT
user_id,
DATE_TRUNC('month', activity_date)::DATE AS activity_month
FROM user_activity
GROUP BY 1, 2
),
with_lag AS (
SELECT
user_id,
activity_month,
LAG(activity_month) OVER (
PARTITION BY user_id
ORDER BY activity_month
) AS prev_activity_month
FROM monthly_activity
),
gap_flagged AS (
SELECT
user_id,
activity_month,
prev_activity_month,
-- Gap > 1 month means the user skipped at least one month
CASE WHEN prev_activity_month IS NULL THEN 'first_seen'
WHEN activity_month > prev_activity_month + INTERVAL '1 month' THEN 'returned_after_churn'
ELSE 'active'
END AS status
FROM with_lag
)
SELECT *
FROM gap_flagged
ORDER BY user_id, activity_month;
To find the first churn month for each user (the month after they last appeared before a gap):
WITH monthly_activity AS (
SELECT
user_id,
DATE_TRUNC('month', activity_date)::DATE AS activity_month
FROM user_activity
GROUP BY 1, 2
),
with_lead AS (
SELECT
user_id,
activity_month,
LEAD(activity_month) OVER (
PARTITION BY user_id
ORDER BY activity_month
) AS next_activity_month
FROM monthly_activity
)
SELECT
user_id,
activity_month AS last_active_month,
next_activity_month AS returned_month,
activity_month + INTERVAL '1 month' AS churn_month
FROM with_lead
WHERE
next_activity_month IS NULL
OR next_activity_month > activity_month + INTERVAL '1 month'
ORDER BY user_id;
Key Insight: A NULL next_activity_month means the user never came back after that month, so they are permanently churned (at least through the data window). Treat these differently from users who churned and then reactivated.
Common Mistake: Using activity_date directly instead of truncating to the month. Two logins in the same month will create two rows, and LAG will compare consecutive login dates rather than consecutive months.
What the Interviewer Is Testing: Fluency with LAG/LEAD window functions and the ability to translate "user stopped coming back" into a precise SQL condition.
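The LAG-based gap logic reduces to comparing each month against its predecessor in a per-user ordered sequence. A pure-Python sketch of the same CASE expression, using month ordinals for one made-up user:

```python
# Months (as ordinals, e.g. 1 = Jan) in which one user was active, sorted.
months = [1, 2, 5, 6]

statuses = []
for i, m in enumerate(months):
    prev = months[i - 1] if i > 0 else None   # LAG(activity_month) equivalent
    if prev is None:
        statuses.append("first_seen")
    elif m > prev + 1:
        # A gap of more than one month: the user skipped at least one month.
        statuses.append("returned_after_churn")
    else:
        statuses.append("active")
```

The jump from month 2 to month 5 is the churn-and-reactivation gap the query is designed to flag; consecutive months stay "active".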
Question 10: Time-since-last-activity churn scoring
A product team wants a simple churn risk score: for every user, how many days has it been since their last activity as of today? Users inactive for 30+ days are "at risk," and those inactive for 90+ days are "churned."
WITH last_seen AS (
SELECT
user_id,
MAX(activity_date) AS last_activity_date
FROM user_activity
GROUP BY user_id
)
SELECT
user_id,
last_activity_date,
CURRENT_DATE - last_activity_date AS days_since_active,
CASE
WHEN CURRENT_DATE - last_activity_date >= 90 THEN 'churned'
WHEN CURRENT_DATE - last_activity_date >= 30 THEN 'at_risk'
ELSE 'active'
END AS churn_status
FROM last_seen
ORDER BY days_since_active DESC;
Expected output:
| user_id | last_activity_date | days_since_active | churn_status |
|---|---|---|---|
| 5021 | 2023-10-14 | 167 | churned |
| 6102 | 2024-02-01 | 57 | at_risk |
| 7834 | 2024-03-10 | 19 | active |
Key Insight: The thresholds (30 and 90 days) should always be discussed with the interviewer. A ride-sharing app might define churn at 60 days; a productivity tool might use 14. Never hardcode these without asking.
Common Mistake: Not accounting for users who never had any activity after signup. If you join to the users table with a LEFT JOIN, users with no events will have last_activity_date = NULL. Add COALESCE(last_activity_date, signup_date) or handle NULLs explicitly.
What the Interviewer Is Testing: Simple aggregation combined with business logic in a CASE statement, and awareness that product-specific context should shape threshold definitions.
Question 11: Subscription cohort retention across three post-signup months
A SaaS tool bills subscriptions monthly. Write a query that returns, for each signup cohort and plan type, the retention rate in months 1, 2, and 3 after signup. A subscription is "retained" in a given month if its end_date is NULL (still active) or falls after that month's end.
WITH cohort_base AS (
SELECT
user_id,
plan_id,
DATE_TRUNC('month', start_date)::DATE AS cohort_month,
start_date,
end_date
FROM subscriptions
),
cohort_sizes AS (
SELECT
cohort_month,
plan_id,
COUNT(DISTINCT user_id) AS cohort_size
FROM cohort_base
GROUP BY 1, 2
),
retention AS (
SELECT
cb.cohort_month,
cb.plan_id,
cs.cohort_size,
-- Month 1: still active 1 month after start
COUNT(DISTINCT CASE
WHEN cb.end_date IS NULL
OR cb.end_date >= cb.cohort_month + INTERVAL '2 months'
THEN cb.user_id END) AS month_1_retained,
-- Month 2
COUNT(DISTINCT CASE
WHEN cb.end_date IS NULL
OR cb.end_date >= cb.cohort_month + INTERVAL '3 months'
THEN cb.user_id END) AS month_2_retained,
-- Month 3
COUNT(DISTINCT CASE
WHEN cb.end_date IS NULL
OR cb.end_date >= cb.cohort_month + INTERVAL '4 months'
THEN cb.user_id END) AS month_3_retained
FROM cohort_base cb
JOIN cohort_sizes cs
ON cb.cohort_month = cs.cohort_month
AND cb.plan_id = cs.plan_id
GROUP BY 1, 2, 3
)
SELECT
cohort_month,
plan_id,
cohort_size,
ROUND(100.0 * month_1_retained / cohort_size, 1) AS month_1_pct,
ROUND(100.0 * month_2_retained / cohort_size, 1) AS month_2_pct,
ROUND(100.0 * month_3_retained / cohort_size, 1) AS month_3_pct
FROM retention
ORDER BY cohort_month, plan_id;
Expected output:
| cohort_month | plan_id | cohort_size | month_1_pct | month_2_pct | month_3_pct |
|---|---|---|---|---|---|
| 2024-01-01 | basic | 1,240 | 72.3 | 58.1 | 46.9 |
| 2024-01-01 | pro | 418 | 85.4 | 73.2 | 64.1 |
| 2024-02-01 | basic | 1,108 | 71.0 | 57.3 | 45.8 |
Key Insight: The end_date >= cohort_month + INTERVAL '2 months' logic means "still active at the start of month 2," which is equivalent to "retained through month 1." Walk through the interval arithmetic explicitly in your interview answer.
Common Mistake: Counting a subscription whose end_date falls in the same calendar month as its start_date as retained. Under the standard SaaS definition, a subscription cancelled in its first month was never retained; make sure your interval check reflects that.
What the Interviewer Is Testing: Translating a business rule (monthly retention for a subscription product) into date arithmetic, and handling the NULL end_date case for active subscriptions.
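The interval arithmetic in the Key Insight ("active at the start of month n + 1 means retained through month n") is easy to get off by one, so walking it through helper functions is worthwhile. A Python sketch of the same rule (the function names are illustrative, not from the query):

```python
from datetime import date

def add_months(d: date, n: int) -> date:
    # First day of the month n months after d's month.
    y, m = divmod(d.year * 12 + (d.month - 1) + n, 12)
    return date(y, m + 1, 1)

def retained_through(cohort_month: date, end_date, month_n: int) -> bool:
    # Retained through month N iff still active at the start of month N+1:
    # end_date is NULL (None, still active) or >= that cutoff.
    cutoff = add_months(cohort_month, month_n + 1)
    return end_date is None or end_date >= cutoff
```

For a January 2024 cohort, month-1 retention requires end_date on or after March 1: the subscription survived all of February, its first full month after signup.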
Advanced Product Metrics
The final tier of cohort analysis questions goes beyond simple retention rates. These questions test whether you can build the metrics product teams actually use in weekly reviews: L28 engagement depth, power user identification, and week-over-week retention curves.
Question 12: L28, count of days active in the last 28 days per user
L28 (also called the "activity histogram" or "power user curve") counts, for each user, how many of the last 28 days they were active. It is more nuanced than binary DAU/MAU because it shows engagement depth: a user active 28 of 28 days is fundamentally different from one active 1 of 28 days, even though both count toward MAU.
-- As of a specific reference date (e.g., today)
WITH window_activity AS (
SELECT
user_id,
COUNT(DISTINCT activity_date) AS days_active_l28
FROM user_activity
WHERE activity_date > CURRENT_DATE - INTERVAL '28 days'
AND activity_date <= CURRENT_DATE
GROUP BY user_id
),
-- Join back to all users who were ever active (to include 0-day users)
all_users AS (
SELECT DISTINCT user_id FROM user_activity
)
SELECT
au.user_id,
COALESCE(wa.days_active_l28, 0) AS days_active_l28
FROM all_users au
LEFT JOIN window_activity wa ON au.user_id = wa.user_id
ORDER BY days_active_l28 DESC;
To build the distribution (the power user curve itself):
WITH window_activity AS (
SELECT
user_id,
COUNT(DISTINCT activity_date) AS days_active_l28
FROM user_activity
WHERE activity_date > CURRENT_DATE - INTERVAL '28 days'
AND activity_date <= CURRENT_DATE
GROUP BY user_id
)
SELECT
days_active_l28,
COUNT(user_id) AS user_count,
ROUND(100.0 * COUNT(user_id) / SUM(COUNT(user_id)) OVER (), 1) AS pct_of_mau
FROM window_activity
GROUP BY days_active_l28
ORDER BY days_active_l28;
Expected output:
| days_active_l28 | user_count | pct_of_mau |
|---|---|---|
| 1 | 12,400 | 24.1 |
| 2 | 8,200 | 15.9 |
| 7 | 4,100 | 8.0 |
| 28 | 1,800 | 3.5 |
Key Insight: The shape of this distribution reveals product health. A steeply declining curve where most users appear only 1-3 days is a warning sign. A curve with a meaningful second bump at 20-28 days, the "smile" in Andrew Chen's power user curve framing, means you have a healthy core of power users.
Common Mistake: Using a 30-day or 31-day window instead of exactly 28 days. L28 is a fixed 28-day window by convention, not "this calendar month," and it is comparable across months precisely because the denominator never changes.
What the Interviewer Is Testing: Knowledge of the L28 metric convention, window date arithmetic, and the ability to translate a histogram query into business insight.
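The metric and its distribution both reduce to set intersections and a counter. A pure-Python sketch with a fixed reference date and made-up activity sets:

```python
from collections import Counter
from datetime import date, timedelta

today = date(2024, 3, 28)
# Exactly 28 dates: today and the 27 days before it.
window = {today - timedelta(days=i) for i in range(28)}

activity = {   # user_id -> set of active dates (made-up data)
    1: {today - timedelta(days=i) for i in range(28)},   # 28 of 28 days
    2: {today, today - timedelta(days=3)},               # 2 of 28 days
    3: {today - timedelta(days=40)},                     # outside the window
}

# Per-user L28: days active within the fixed 28-day window.
l28 = {u: len(ds & window) for u, ds in activity.items()}

# The power user curve: distribution over users active at least once.
curve = Counter(v for v in l28.values() if v > 0)
```

User 3 illustrates why the all_users CTE in the first query matters: without an outer join back to the full user base, zero-day users vanish from the per-user view entirely.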
Question 13: Week-over-week retention curves with a single query
Product teams often track cohort retention as a weekly curve: of users who joined in week W, what fraction returned in W+1, W+2, W+3? Write a query that generates this table for all cohorts in the dataset.
WITH first_week AS (
SELECT
user_id,
DATE_TRUNC('week', MIN(activity_date))::DATE AS cohort_week
FROM user_activity
GROUP BY user_id
),
weekly_activity AS (
SELECT
ua.user_id,
fw.cohort_week,
DATE_TRUNC('week', ua.activity_date)::DATE AS activity_week,
-- Weeks since acquisition
(ua.activity_date - fw.cohort_week) / 7 AS week_number
FROM user_activity ua
JOIN first_week fw ON ua.user_id = fw.user_id
),
cohort_sizes AS (
SELECT cohort_week, COUNT(DISTINCT user_id) AS size
FROM first_week
GROUP BY cohort_week
)
SELECT
wa.cohort_week,
cs.size AS cohort_size,
wa.week_number,
COUNT(DISTINCT wa.user_id) AS retained_users,
ROUND(
100.0 * COUNT(DISTINCT wa.user_id) / cs.size, 1
) AS retention_pct
FROM weekly_activity wa
JOIN cohort_sizes cs ON wa.cohort_week = cs.cohort_week
WHERE wa.week_number BETWEEN 0 AND 12
GROUP BY wa.cohort_week, cs.size, wa.week_number
ORDER BY wa.cohort_week, wa.week_number;
Expected output:
| cohort_week | cohort_size | week_number | retained_users | retention_pct |
|---|---|---|---|---|
| 2024-01-01 | 1,840 | 0 | 1,840 | 100.0 |
| 2024-01-01 | 1,840 | 1 | 1,067 | 58.0 |
| 2024-01-01 | 1,840 | 2 | 790 | 42.9 |
| 2024-01-01 | 1,840 | 4 | 589 | 32.0 |
| 2024-01-08 | 1,712 | 0 | 1,712 | 100.0 |
Key Insight: Week-over-week cohort curves often reveal onboarding quality. If week-1 retention is below 30%, that is almost always an onboarding problem, not a product problem. Saying this in an interview demonstrates product instinct alongside SQL skill.
Common Mistake: Computing the week offset from two raw dates, e.g. (activity_date - first_login_date) / 7, which shifts week boundaries differently for every user. The query above avoids this because cohort_week is already truncated to Monday, so day offsets 0-6 land in week 0 for everyone. For extra safety, apply DATE_TRUNC('week', ...) to the activity date as well and divide the difference by 7.
What the Interviewer Is Testing: The complete cohort retention pattern applied to weekly granularity, integer arithmetic for week offsets, and product interpretation of the output.
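The week-offset arithmetic depends on the cohort week being Monday-anchored, and that is worth checking explicitly. A Python sketch of the same integer division:

```python
from datetime import date, timedelta

def trunc_week(d: date) -> date:
    # Monday of d's week, mimicking DATE_TRUNC('week', d) in Postgres.
    return d - timedelta(days=d.weekday())

def week_number(activity_date: date, cohort_week: date) -> int:
    # Because cohort_week is Monday-anchored, day offsets 0-6 fall in
    # week 0, 7-13 in week 1, and so on, for every user.
    return (activity_date - cohort_week).days // 7

# First active Wednesday 2024-01-03 -> cohort week starts Monday Jan 1.
cw = trunc_week(date(2024, 1, 3))
```

Activity on the Sunday of the cohort week (offset 6) still maps to week 0, while the following Monday (offset 7) starts week 1, which matches the cohort-curve semantics the question asks for.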
Conclusion
Cohort analysis questions test something that pure SQL syntax questions cannot: whether you understand how user behavior unfolds over time. The patterns above appear in almost every product data science interview because they mirror real work. Building a Day-7 retention query is not an academic exercise. It is the first thing a new data scientist does when joining a consumer app team.
Three things separate candidates who pass these questions from those who do not. First, they define the metric before writing SQL. The moment you hear "retention," you ask: Day-N or rolling? Exact date or any date after? Every active user or only those from a specific cohort? Second, they use CTEs to break the problem into named, testable pieces. A single 40-line query with no CTEs will almost always have an error that is hard to find. Four 10-line CTEs make each piece verifiable. Third, they always use LEFT JOIN when preserving a denominator. If a user was not retained, you still need them in your count. Dropping them with an INNER JOIN silently corrupts your retention rate.
The date spine technique (Question 8) and the L28 power user curve (Question 12) are genuinely senior-level patterns. If you can explain why a cohort matrix has missing rows and how a cross join with a generated series solves it, you are demonstrating the kind of thinking that separates people who have built real analytics infrastructure from those who have only practiced interview prep problems.
For a deeper foundation in the SQL patterns underlying these techniques, the SQL Window Functions Guide on letsdatascience.com and the Product Data Science Interview Prep series on letsdatascience.com cover the LAG/LEAD patterns and CTE construction techniques used throughout this article.
Every question above has a correct answer that you can rehearse. What you cannot rehearse is the product intuition that makes your answer memorable to an interviewer. Practice explaining what the output means, not just how to produce it.
Career Q&A
How much time should I spend on SQL prep for a product data science interview?
For a role where product analytics is central (consumer apps, SaaS, social platforms), SQL should get about 40% of your technical prep time. The other 60% goes to product sense, A/B testing, and statistics. Retention and cohort questions appear in nearly every product DS loop, so they deserve dedicated practice sessions, not just passive reading.
Should I memorize the exact date arithmetic formulas before the interview?
Know the general pattern, not the exact syntax. Interviewers care more that you construct the logic correctly than that you recall (EXTRACT(YEAR FROM d2) - EXTRACT(YEAR FROM d1)) * 12 + ... from memory. Most interview environments allow you to look up dialect-specific functions. What they want to see is that you know why the arithmetic is needed and what it calculates.
My retention query works in the interview but gives wrong numbers. How do I debug it?
Walk through a single user manually. Pick a user from your sample data, trace their first activity date, compute the expected month offset by hand, then compare that against what your query produces for that user. This process catches 90% of errors: wrong join type, incorrect interval arithmetic, or a GROUP BY that eliminates the user entirely. Say this debugging process out loud. It demonstrates structured thinking.
Is it better to use a self-join or window functions for retention queries?
Self-joins are easier to explain in an interview: "I join the events table to itself, matching on user and date." Window functions with complex frame clauses can confuse interviewers who are not deeply technical. For Day-N retention specifically, a self-join with a precise join condition is often cleaner. For rolling aggregations and month-over-month comparisons, LAG/LEAD and window frames are the right tool. Know both; default to whichever is more readable given the question.
What do interviewers mean when they say "your denominator is wrong"?
It almost always means you filtered out users who were NOT retained before counting. The denominator of any retention rate is the full cohort: every user who was first active in the anchor period, regardless of whether they came back. Using INNER JOIN instead of LEFT JOIN, or adding a WHERE clause on the retained event, shrinks the denominator and inflates the retention rate. Always verify your denominator independently with a COUNT(DISTINCT user_id) on just the cohort CTE.
How should I explain cohort analysis to a non-technical interviewer?
Use the class reunion analogy: "Imagine you track all students who graduated in the same year as one group. Month 1, you count how many showed up to the one-month reunion. Month 6, how many showed up to the six-month reunion. Each graduating class is a cohort. The fraction who show up each month is the retention rate." This explanation shows you can communicate across audiences, a skill that matters as much as the SQL.
Do these patterns translate across databases (Redshift, BigQuery, Snowflake)?
Yes, with minor syntax adjustments. DATE_TRUNC is standard in PostgreSQL, Redshift, and Snowflake. BigQuery uses DATE_TRUNC(date, MONTH) with a slightly different argument order. GENERATE_SERIES is PostgreSQL/Redshift; BigQuery uses GENERATE_DATE_ARRAY; Snowflake uses GENERATOR. Mention in the interview which dialect you are most comfortable with, then note the equivalent function in the system they use. This signals real-world experience.
Sources
- Retention in SQL: How to Calculate User and Cohort Retention, StrataScratch (2024)
- Active User Retention, DataLemur SQL Interview Question (2024)
- SQL Interview Questions, DataLemur Question Bank (2026)
- Subscription Retention, InterviewQuery SQL Problem (2024)
- SQL Interview Questions for Data Analysts, InterviewQuery (2025)
- 40 Most Asked Product Analyst Interview Questions, InterviewQuery (2025)
- 5 SQL Interview Patterns You Need To Know, DataLemur Blog (2023)
- The Power User Curve, Andrew Chen (2017)
- Rolling Growth Metrics in SQL, Chuong Do, Medium (2019)
- Building A Complete Time-Series with SQL Date Spines, Connor Phillips (2023)
- Cohort Analysis, SQL for Data Analysis, O'Reilly (2022)
- How to Measure Cohort Retention, Lenny's Newsletter (2021)
- SQL Window Functions Interview Questions, DataLemur (2024)
- Top 33 SQL Interview Questions, DataInterview (2026)