
SQL Cohort Analysis and Retention Interview Questions

LDS Team
Let's Data Science
19 min

By the estimates of interview prep platforms, fewer than 30% of data science candidates can write a correct Day-7 retention query from scratch. Cohort analysis sits at the intersection of product intuition and technical precision: you need to define a metric clearly, implement it without errors, and explain what the output actually means to a product team. That combination makes this category one of the most differentiating in any product-facing data role.

This article walks through 13 worked questions covering every major pattern: DAU/WAU/MAU aggregations, cohort table construction, N-day and rolling retention, churn detection, and advanced metrics like L28 and power user curves. Each question includes a schema, a full SQL solution, and an explanation of what the interviewer is actually evaluating.

The questions in this article are drawn from our research of publicly available interview prep platforms and community discussions, including DataLemur, StrataScratch, LeetCode, and forums such as r/datascience. These patterns have been reported by candidates across interviews at technology companies with consumer-facing products. We do not claim these are proprietary questions from any specific organization.

DAU, WAU, and MAU Calculations

Active user counts are the first thing any product team looks at. Interviewers ask these questions to see whether you can define "active" clearly before you write a single line of SQL, because the definition changes completely depending on what the product does.

The schema used throughout this section:

sql
-- events table
user_id    INTEGER
event_type VARCHAR   -- 'app_open', 'like', 'comment', 'purchase', etc.
event_date DATE

Question 1: Compute daily, weekly, and monthly active users from a single events table

A consumer app tracks user activity in an events table. Write queries to compute DAU, WAU, and MAU for each period in the dataset. An "active" user is one who triggered at least one app_open event.

Before writing anything, define the denominator precisely. DAU counts distinct users per calendar day. WAU groups by ISO week. MAU groups by calendar month. All three use COUNT(DISTINCT user_id), never COUNT(*), which double-counts users who log multiple events.

sql
-- DAU
SELECT
    event_date,
    COUNT(DISTINCT user_id) AS dau
FROM events
WHERE event_type = 'app_open'
GROUP BY event_date
ORDER BY event_date;

-- WAU
SELECT
    DATE_TRUNC('week', event_date) AS week_start,
    COUNT(DISTINCT user_id) AS wau
FROM events
WHERE event_type = 'app_open'
GROUP BY DATE_TRUNC('week', event_date)
ORDER BY week_start;

-- MAU
SELECT
    DATE_TRUNC('month', event_date) AS month_start,
    COUNT(DISTINCT user_id) AS mau
FROM events
WHERE event_type = 'app_open'
GROUP BY DATE_TRUNC('month', event_date)
ORDER BY month_start;

Expected output (DAU sample):

event_date    dau
2024-01-01    4821
2024-01-02    5103
2024-01-03    4990

Key Insight: DATE_TRUNC('week', ...) anchors to Monday in most SQL dialects. If the interviewer specifies Sunday-anchored weeks, use DATE_TRUNC('week', event_date + INTERVAL '1 day') - INTERVAL '1 day' or clarify the expected behavior.

Common Mistake: Filtering on a single event type when the question says "any activity." Always re-read the active-user definition before writing the WHERE clause.

What the Interviewer Is Testing: Whether you distinguish COUNT(DISTINCT) from COUNT(*), and whether you proactively clarify the definition of "active."
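The COUNT(DISTINCT) vs COUNT(*) point is easy to verify on toy data. A minimal sketch using Python's built-in sqlite3 (the dialect differs slightly from the Postgres-style queries above; all rows are invented):

```python
import sqlite3

# Toy events table: user 1 opens the app twice on the same day.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (user_id INTEGER, event_type TEXT, event_date TEXT)")
con.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [(1, "app_open", "2024-01-01"),
     (1, "app_open", "2024-01-01"),   # second open, same user, same day
     (2, "app_open", "2024-01-01")],
)

row = con.execute("""
    SELECT COUNT(*) AS events, COUNT(DISTINCT user_id) AS dau
    FROM events
    WHERE event_type = 'app_open' AND event_date = '2024-01-01'
""").fetchone()

print(row)  # (3, 2): COUNT(*) double-counts user 1; COUNT(DISTINCT) does not
```

Two users were active, yet three events were logged; only the DISTINCT count gives the right DAU.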

Question 2: Rolling 7-Day Average DAU Using a Window Function

DAU spikes on weekends and dips midweek for most consumer products. A rolling 7-day average smooths that noise. Write a query that computes, for every date, the rolling 7-day average of daily active user counts.

The naive approach (a self-join on dates) works but is slow on large tables. The window function approach using ROWS BETWEEN 6 PRECEDING AND CURRENT ROW is more readable, but it counts rows, not distinct users across the window. The correct production approach groups daily distinct users first, then applies the window.

sql
WITH daily_active AS (
    SELECT
        event_date,
        COUNT(DISTINCT user_id) AS dau
    FROM events
    WHERE event_type = 'app_open'
    GROUP BY event_date
)
SELECT
    event_date,
    dau,
    AVG(dau) OVER (
        ORDER BY event_date
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS rolling_7d_avg_dau
FROM daily_active
ORDER BY event_date;

Expected output (sample):

event_date    dau     rolling_7d_avg_dau
2024-01-01    4821    4821.00
2024-01-07    5012    4934.14
2024-01-14    5198    5047.71

Key Insight: The first 6 rows of the rolling average use fewer than 7 days. This is correct behavior (the window expands from 1 to 7), but mention it in your interview. Some interviewers want you to exclude the ramp-up period with a subquery filter like WHERE event_date >= (SELECT MIN(event_date) FROM daily_active) + INTERVAL '6 days'.

Common Mistake: Applying AVG() directly over raw events instead of over pre-aggregated daily counts. This gives a wrong result because each row is an event, not a user-day.

What the Interviewer Is Testing: Window function syntax, specifically the ROWS frame clause, and whether you understand the difference between smoothing and aggregation.
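The expanding ramp-up behavior of the ROWS frame can be mimicked in a few lines of Python on invented daily counts:

```python
# Rolling 7-day average over pre-aggregated daily counts, mirroring
# ROWS BETWEEN 6 PRECEDING AND CURRENT ROW. The daily values are made up.
daily = [100, 120, 110, 130, 90, 80, 150, 140]

rolling = [
    round(sum(daily[max(0, i - 6): i + 1]) / len(daily[max(0, i - 6): i + 1]), 2)
    for i in range(len(daily))
]

print(rolling[0])  # 100.0  -> first row averages only 1 day (ramp-up)
print(rolling[6])  # 111.43 -> first full 7-day window
print(rolling[7])  # 117.14 -> window slides: drops day 0, adds day 7
```

Note that the first six entries are averages over progressively shorter windows, exactly the ramp-up the Key Insight describes.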

Question 3: DAU/MAU ratio as an engagement stickiness metric

The DAU/MAU ratio measures how frequently the typical monthly active user opens the app. A ratio of 0.5 means the average monthly active user is active 15 of 30 days. Write a query that computes this ratio for each month in the dataset.

sql
WITH monthly AS (
    SELECT
        DATE_TRUNC('month', event_date) AS month_start,
        COUNT(DISTINCT user_id)         AS mau
    FROM events
    WHERE event_type = 'app_open'
    GROUP BY 1
),
daily AS (
    SELECT
        DATE_TRUNC('month', event_date) AS month_start,
        event_date,
        COUNT(DISTINCT user_id)         AS dau
    FROM events
    WHERE event_type = 'app_open'
    GROUP BY 1, 2
),
daily_avg AS (
    SELECT
        month_start,
        AVG(dau) AS avg_dau
    FROM daily
    GROUP BY 1
)
SELECT
    m.month_start,
    m.mau,
    da.avg_dau,
    ROUND(da.avg_dau / m.mau, 3) AS dau_mau_ratio
FROM monthly m
JOIN daily_avg da USING (month_start)
ORDER BY m.month_start;

Expected output:

month_start    mau      avg_dau    dau_mau_ratio
2024-01-01     48210    4986.2     0.103
2024-02-01     51430    5301.4     0.103

Key Insight: A messaging app might have a DAU/MAU of 0.50 or higher. A utility app might be 0.05. Neither number is inherently good; context matters. Say this in the interview.

Common Mistake: Dividing avg_dau by MAU computed from different filters. Both numerator and denominator must use the same activity definition.

What the Interviewer Is Testing: Multi-step aggregation with CTEs, and whether you understand what the metric actually signals about product health.
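The ratio itself is a one-line calculation; a quick sketch using the sample numbers from the output above:

```python
# Stickiness on the sample numbers: average daily actives over the month
# divided by distinct monthly actives.
avg_dau = 4986.2
mau = 48210

ratio = round(avg_dau / mau, 3)
print(ratio)  # 0.103, matching the dau_mau_ratio column above
```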

Cohort Table Construction

A cohort table is the foundation for all retention analysis. Almost every senior product analytics interview asks you to build one from scratch, which requires anchoring each user to their first activity date before joining back to the full event log.

The schema used throughout this section:

sql
-- users table
user_id     INTEGER
signup_date DATE

-- events table
user_id     INTEGER
event_date  DATE
event_type  VARCHAR

Question 4: Assign each user to a monthly cohort based on first activity

Before building a retention table, you need a cohort assignment. Write a query that assigns each user to the month they were first active, using the events table (not the users table, because the interviewer wants to test whether you know to derive cohort from actual activity, not just signup date).

sql
WITH first_activity AS (
    SELECT
        user_id,
        MIN(event_date)                              AS first_active_date,
        DATE_TRUNC('month', MIN(event_date))         AS cohort_month
    FROM events
    GROUP BY user_id
)
SELECT
    user_id,
    first_active_date,
    cohort_month
FROM first_activity
ORDER BY cohort_month, user_id;

Expected output:

user_id    first_active_date    cohort_month
101        2024-01-03           2024-01-01
102        2024-01-17           2024-01-01
103        2024-02-05           2024-02-01

Key Insight: Using the users.signup_date rather than MIN(event_date) will produce a different, and usually less accurate, cohort assignment. A user who signs up but never logs in should not appear in an engagement cohort. Clarify which to use.

Common Mistake: Using FIRST_VALUE() instead of MIN(). Both work, but MIN() is simpler and less error-prone when there are ties.

What the Interviewer Is Testing: Whether you understand the difference between signup date and first-activity date as cohort anchors, and whether you proactively ask about the definition.
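The cohort-assignment logic (MIN of activity dates, truncated to month) in miniature, on hypothetical Python data:

```python
from datetime import date

# Hypothetical user_id -> list of activity dates (order deliberately unsorted).
events = {
    101: [date(2024, 1, 17), date(2024, 1, 3)],
    103: [date(2024, 2, 5)],
}

# MIN(event_date) truncated to the first of the month, like DATE_TRUNC('month', ...).
cohort = {u: min(ds).replace(day=1) for u, ds in events.items()}

print(cohort[101])  # 2024-01-01: earliest activity wins, regardless of row order
```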

Question 5: Build a monthly retention cohort table

The canonical cohort analysis question. Write a query that returns, for each cohort month, the number of users still active in each of the 6 months following their first activity. The output should be a matrix: cohort_month × months_since_acquisition.

sql
WITH first_activity AS (
    SELECT
        user_id,
        DATE_TRUNC('month', MIN(event_date)) AS cohort_month
    FROM events
    GROUP BY user_id
),
user_activity AS (
    SELECT
        e.user_id,
        fa.cohort_month,
        DATE_TRUNC('month', e.event_date) AS activity_month
    FROM events e
    JOIN first_activity fa ON e.user_id = fa.user_id
),
month_offsets AS (
    SELECT
        cohort_month,
        activity_month,
        (
            (EXTRACT(YEAR  FROM activity_month) - EXTRACT(YEAR  FROM cohort_month)) * 12
          + (EXTRACT(MONTH FROM activity_month) - EXTRACT(MONTH FROM cohort_month))
        ) AS months_since_acquisition,
        COUNT(DISTINCT user_id) AS retained_users
    FROM user_activity
    GROUP BY 1, 2, 3
),
cohort_sizes AS (
    SELECT cohort_month, COUNT(DISTINCT user_id) AS cohort_size
    FROM first_activity
    GROUP BY cohort_month
)
SELECT
    mo.cohort_month,
    cs.cohort_size,
    mo.months_since_acquisition,
    mo.retained_users,
    ROUND(
        100.0 * mo.retained_users / cs.cohort_size, 1
    ) AS retention_pct
FROM month_offsets mo
JOIN cohort_sizes cs ON mo.cohort_month = cs.cohort_month
WHERE mo.months_since_acquisition BETWEEN 0 AND 6
ORDER BY mo.cohort_month, mo.months_since_acquisition;

Expected output (sample):

cohort_month    cohort_size    months_since_acquisition    retained_users    retention_pct
2024-01-01      4,210          0                           4,210             100.0
2024-01-01      4,210          1                           2,947             70.0
2024-01-01      4,210          2                           2,105             50.0
2024-01-01      4,210          3                           1,642             39.0
2024-02-01      3,918          0                           3,918             100.0
2024-02-01      3,918          1                           2,624             67.0

Key Insight: Month 0 always shows 100% retention because we anchor the cohort on first activity, so every user is active in the month they first appeared. If you see month-0 retention below 100%, something is wrong with your join logic.

Common Mistake: Using DATEDIFF(month, ...) without accounting for year boundaries. The arithmetic formula (year_diff * 12) + month_diff works correctly across year boundaries; many dialect-specific DATEDIFF implementations do not.

What the Interviewer Is Testing: Multi-CTE construction, the JOIN strategy for matching cohort month to activity month, and the offset calculation.
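The offset formula is worth sanity-checking across a year boundary. A small Python sketch of the same arithmetic:

```python
from datetime import date

def months_since(cohort: date, activity: date) -> int:
    # Same formula as the SQL: (year_diff * 12) + month_diff
    return (activity.year - cohort.year) * 12 + (activity.month - cohort.month)

# November 2023 cohort active in February 2024: 3 months later, not -9.
print(months_since(date(2023, 11, 1), date(2024, 2, 1)))  # 3
# Same month as acquisition -> the month-0 row.
print(months_since(date(2024, 1, 1), date(2024, 1, 1)))   # 0
```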

N-Day Retention

N-day retention asks a precise question: of the users who were first active on date D, what fraction were also active exactly N days later? The word "exactly" is critical, and it distinguishes this from rolling retention, which uses "on or after."

The schema used throughout this section:

sql
-- logins table
user_id    INTEGER
login_date DATE

Question 6: Day-1 and Day-7 retention rates

A social platform tracks daily logins. Write a single query that returns Day-1 and Day-7 retention rates, defined as the fraction of users who first logged in on a given date and also logged in exactly 1 and exactly 7 days later, respectively.

The self-join approach is the most readable in an interview setting:

sql
WITH first_login AS (
    SELECT
        user_id,
        MIN(login_date) AS first_date
    FROM logins
    GROUP BY user_id
)
SELECT
    fl.first_date,
    COUNT(DISTINCT fl.user_id) AS cohort_size,
    COUNT(DISTINCT l1.user_id) AS day1_retained,
    COUNT(DISTINCT l7.user_id) AS day7_retained,
    ROUND(
        100.0 * COUNT(DISTINCT l1.user_id) / COUNT(DISTINCT fl.user_id), 1
    ) AS day1_retention_pct,
    ROUND(
        100.0 * COUNT(DISTINCT l7.user_id) / COUNT(DISTINCT fl.user_id), 1
    ) AS day7_retention_pct
FROM first_login fl
LEFT JOIN logins l1
    ON fl.user_id = l1.user_id
    AND l1.login_date = fl.first_date + INTERVAL '1 day'   -- exact Day-1 match
LEFT JOIN logins l7
    ON fl.user_id = l7.user_id
    AND l7.login_date = fl.first_date + INTERVAL '7 days'  -- exact Day-7 match
GROUP BY fl.first_date
ORDER BY fl.first_date;

-- COUNT(DISTINCT l1.user_id) ignores the NULLs produced by unmatched
-- LEFT JOIN rows, so no CASE wrapper is needed: the join condition
-- already enforces the exact-date match.

Expected output:

first_date    cohort_size    day1_retained    day7_retained    day1_retention_pct    day7_retention_pct
2024-01-01    312            187              134              59.9                  42.9
2024-01-02    298            174              121              58.4                  40.6

Key Insight: Use LEFT JOIN, not INNER JOIN. If you use INNER JOIN, users who were NOT retained on Day-7 disappear from the result entirely, destroying the denominator.

Common Mistake: Using login_date >= first_date + INTERVAL '7 days' instead of = first_date + INTERVAL '7 days'. The >= condition gives you rolling retention, which is a different metric.

What the Interviewer Is Testing: The distinction between Day-N (equality join) and rolling retention (inequality join), and correct LEFT JOIN usage.
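The equality-vs-inequality distinction is the crux of this question. A toy Python illustration with one invented user:

```python
from datetime import date, timedelta

first = date(2024, 1, 1)
returns = [date(2024, 1, 9)]  # user came back on day 8, not day 7

# Day-7 retention: equality check against exactly first_date + 7 days.
day7_exact = any(d == first + timedelta(days=7) for d in returns)
# Rolling Day-7 retention: >= check, counts any return on or after day 7.
rolling_d7 = any(d >= first + timedelta(days=7) for d in returns)

print(day7_exact, rolling_d7)  # False True: same user, two different metrics
```

The same user is churned under one definition and retained under the other, which is why you must pin down the definition before writing the join.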

Question 7: Rolling Day-30 retention

Rolling retention is more generous than exact Day-N retention. A user is "rolling retained at Day-30" if they returned at any point on or after their first active day plus 30 days. This better represents whether a user found lasting value in the product.

sql
WITH first_login AS (
    SELECT
        user_id,
        MIN(login_date) AS first_date
    FROM logins
    GROUP BY user_id
),
future_activity AS (
    SELECT
        fl.user_id,
        fl.first_date,
        MAX(l.login_date) AS last_seen_date
    FROM first_login fl
    LEFT JOIN logins l
        ON fl.user_id = l.user_id
        AND l.login_date > fl.first_date
    GROUP BY fl.user_id, fl.first_date
)
SELECT
    first_date,
    COUNT(DISTINCT user_id)                                                 AS cohort_size,
    COUNT(DISTINCT CASE WHEN last_seen_date >= first_date + INTERVAL '30 days'
                        THEN user_id END)                                   AS rolling_day30_retained,
    ROUND(
        100.0 * COUNT(DISTINCT CASE WHEN last_seen_date >= first_date + INTERVAL '30 days'
                                    THEN user_id END)
              / COUNT(DISTINCT user_id), 1
    )                                                                        AS rolling_day30_pct
FROM future_activity
WHERE first_date <= CURRENT_DATE - INTERVAL '30 days'
GROUP BY first_date
ORDER BY first_date;

Expected output:

first_date    cohort_size    rolling_day30_retained    rolling_day30_pct
2024-01-01    312            201                       64.4
2024-01-02    298            189                       63.4

Key Insight: Rolling Day-30 retention is always equal to or higher than exact Day-30 retention because it counts anyone who returned at any point on or after Day 30, not only those who returned on Day 30 exactly.

Common Mistake: Forgetting to exclude cohorts that are too recent to have 30 days of follow-up. Users who first appeared very recently have no future data yet, which artificially lowers retention. Add a filter: WHERE first_date <= CURRENT_DATE - INTERVAL '30 days'.

What the Interviewer Is Testing: Understanding of the conceptual difference between Day-N and rolling retention, and the implications of incomplete cohorts.
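The incomplete-cohort filter can be illustrated in isolation (all dates are invented):

```python
from datetime import date, timedelta

# Cohorts younger than 30 days cannot possibly be rolling-Day-30 retained yet;
# including them biases the metric downward.
today = date(2024, 3, 29)
cohort_dates = [date(2024, 1, 1), date(2024, 3, 20)]

eligible = [d for d in cohort_dates if d <= today - timedelta(days=30)]
print(eligible)  # [datetime.date(2024, 1, 1)] -> the March 20 cohort is excluded
```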

Question 8: The date spine technique for filling retention gaps

When computing a retention curve, you expect a row for every cohort-period combination. But if no users from the January cohort returned in March, your query returns no row for (January, 2). This creates misleading NULL gaps in dashboards. Use a date spine to guarantee all combinations appear.

sql
-- Step 1: Build the date spine
WITH date_spine AS (
    SELECT
        generate_series(
            DATE_TRUNC('month', (SELECT MIN(login_date) FROM logins)),
            DATE_TRUNC('month', (SELECT MAX(login_date) FROM logins)),
            INTERVAL '1 month'
        )::DATE AS month_date
),
-- Step 2: Build cohorts
cohorts AS (
    SELECT
        user_id,
        DATE_TRUNC('month', MIN(login_date))::DATE AS cohort_month
    FROM logins
    GROUP BY user_id
),
-- Step 3: Generate every cohort x period combination
cohort_periods AS (
    SELECT DISTINCT
        c.cohort_month,
        ds.month_date AS activity_month,
        (
            (EXTRACT(YEAR  FROM ds.month_date) - EXTRACT(YEAR  FROM c.cohort_month)) * 12
          + (EXTRACT(MONTH FROM ds.month_date) - EXTRACT(MONTH FROM c.cohort_month))
        ) AS period_number
    FROM cohorts c
    CROSS JOIN date_spine ds
    WHERE ds.month_date >= c.cohort_month
),
-- Step 4: Compute actual retained users
actual_retention AS (
    SELECT
        c.cohort_month,
        DATE_TRUNC('month', l.login_date)::DATE AS activity_month,
        COUNT(DISTINCT c.user_id) AS retained_users
    FROM cohorts c
    JOIN logins l ON c.user_id = l.user_id
    GROUP BY 1, 2
),
cohort_sizes AS (
    SELECT cohort_month, COUNT(DISTINCT user_id) AS cohort_size
    FROM cohorts GROUP BY cohort_month
)
SELECT
    cp.cohort_month,
    cp.period_number,
    cs.cohort_size,
    COALESCE(ar.retained_users, 0)      AS retained_users,
    ROUND(
        100.0 * COALESCE(ar.retained_users, 0) / cs.cohort_size, 1
    )                                   AS retention_pct
FROM cohort_periods cp
JOIN cohort_sizes cs ON cp.cohort_month = cs.cohort_month
LEFT JOIN actual_retention ar
    ON cp.cohort_month   = ar.cohort_month
    AND cp.activity_month = ar.activity_month
ORDER BY cp.cohort_month, cp.period_number;

Key Insight: CROSS JOIN between cohorts and the date spine creates the full matrix of expected rows. The subsequent LEFT JOIN to actual data fills in real values and leaves 0 (via COALESCE) wherever nobody showed up.

Common Mistake: Forgetting the WHERE ds.month_date >= c.cohort_month filter on the cross join. Without it, you generate impossible rows where the activity month precedes the cohort month.

What the Interviewer Is Testing: Awareness of the missing-row problem in retention analysis and knowledge of the date spine pattern. This is a senior-level question; interviewers use it to separate candidates who have built real retention dashboards from those who have only read about them.
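The spine idea reduces to a cross product plus a COALESCE-style default. A Python miniature with invented counts:

```python
from itertools import product

cohorts = ["2024-01", "2024-02"]
spine   = ["2024-01", "2024-02", "2024-03"]
# Actual retained counts; note there is no row for (2024-01, 2024-03):
# nobody from the January cohort returned in March.
actual  = {("2024-01", "2024-01"): 100, ("2024-01", "2024-02"): 60}

matrix = {
    (c, m): actual.get((c, m), 0)   # COALESCE(retained_users, 0)
    for c, m in product(cohorts, spine)  # CROSS JOIN cohorts x spine
    if m >= c                            # drop impossible pre-cohort rows
}

print(matrix[("2024-01", "2024-03")])  # 0 -> the gap row now exists explicitly
```

The dashboard-breaking missing row becomes an explicit zero, which is the whole point of the pattern.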

Churn Detection

Churn detection asks: which users stopped coming back? The mechanics vary: some products define churn as 30 days of inactivity, others use subscription cancellation dates, and still others use a probabilistic model. In SQL interviews, the most common framing is either "find users who were active in period X but not in period X+1" or "find users whose last activity was more than N days ago."

The schema used throughout this section:

sql
-- user_activity table
user_id       INTEGER
activity_date DATE

-- subscriptions table
user_id    INTEGER
plan_id    VARCHAR
start_date DATE
end_date   DATE    -- NULL if still active

Question 9: Identify churned users with LAG

A subscription app wants to detect users who were active (logged in) in one month but completely absent the next month. Write a query that flags each user's monthly activity status and identifies the first month they churned.

sql
WITH monthly_activity AS (
    SELECT
        user_id,
        DATE_TRUNC('month', activity_date)::DATE AS activity_month
    FROM user_activity
    GROUP BY 1, 2
),
with_lag AS (
    SELECT
        user_id,
        activity_month,
        LAG(activity_month) OVER (
            PARTITION BY user_id
            ORDER BY activity_month
        ) AS prev_activity_month
    FROM monthly_activity
),
gap_flagged AS (
    SELECT
        user_id,
        activity_month,
        prev_activity_month,
        -- Gap > 1 month means the user skipped at least one month
        CASE WHEN prev_activity_month IS NULL THEN 'first_seen'
             WHEN activity_month > prev_activity_month + INTERVAL '1 month' THEN 'returned_after_churn'
             ELSE 'active'
        END AS status
    FROM with_lag
)
SELECT *
FROM gap_flagged
ORDER BY user_id, activity_month;

To find the first churn month for each user (the month after they last appeared before a gap):

sql
WITH monthly_activity AS (
    SELECT
        user_id,
        DATE_TRUNC('month', activity_date)::DATE AS activity_month
    FROM user_activity
    GROUP BY 1, 2
),
with_lead AS (
    SELECT
        user_id,
        activity_month,
        LEAD(activity_month) OVER (
            PARTITION BY user_id
            ORDER BY activity_month
        ) AS next_activity_month
    FROM monthly_activity
)
SELECT
    user_id,
    activity_month          AS last_active_month,
    next_activity_month     AS returned_month,
    activity_month + INTERVAL '1 month' AS churn_month
FROM with_lead
WHERE
    next_activity_month IS NULL
    OR next_activity_month > activity_month + INTERVAL '1 month'
ORDER BY user_id;

Key Insight: A NULL next_activity_month means the user never came back after that month, so they are permanently churned (at least through the data window). Treat these differently from users who churned and then reactivated.

Common Mistake: Using activity_date directly instead of truncating to the month. Two logins in the same month will create two rows, and LAG will compare consecutive login dates rather than consecutive months.

What the Interviewer Is Testing: Fluency with LAG/LEAD window functions and the ability to translate "user stopped coming back" into a precise SQL condition.
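The LAG comparison can be prototyped in Python with a shifted list (one invented user's active months):

```python
# A miniature of the LAG pattern: compare each activity month with the
# previous one for a single user.
months = ["2024-01", "2024-02", "2024-04"]  # sorted active months

def month_index(m: str) -> int:
    y, mo = map(int, m.split("-"))
    return y * 12 + mo            # absolute month number, for gap arithmetic

statuses = []
for prev, cur in zip([None] + months[:-1], months):  # (LAG value, current)
    if prev is None:
        statuses.append((cur, "first_seen"))
    elif month_index(cur) - month_index(prev) > 1:
        statuses.append((cur, "returned_after_churn"))  # skipped >= 1 month
    else:
        statuses.append((cur, "active"))

print(statuses)
# [('2024-01', 'first_seen'), ('2024-02', 'active'), ('2024-04', 'returned_after_churn')]
```

The March gap surfaces as a "returned_after_churn" flag on April, exactly as the gap_flagged CTE would report it.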

Question 10: Time-since-last-activity churn scoring

A product team wants a simple churn risk score: for every user, how many days has it been since their last activity as of today? Users inactive for 30+ days are "at risk," and those inactive for 90+ days are "churned."

sql
WITH last_seen AS (
    SELECT
        user_id,
        MAX(activity_date) AS last_activity_date
    FROM user_activity
    GROUP BY user_id
)
SELECT
    user_id,
    last_activity_date,
    CURRENT_DATE - last_activity_date         AS days_since_active,
    CASE
        WHEN CURRENT_DATE - last_activity_date >= 90 THEN 'churned'
        WHEN CURRENT_DATE - last_activity_date >= 30 THEN 'at_risk'
        ELSE 'active'
    END                                        AS churn_status
FROM last_seen
ORDER BY days_since_active DESC;

Expected output:

user_id    last_activity_date    days_since_active    churn_status
5021       2023-10-14            167                  churned
6102       2024-02-01            57                   at_risk
7834       2024-03-10            19                   active

Key Insight: The thresholds (30 and 90 days) should always be discussed with the interviewer. A ride-sharing app might define churn at 60 days; a productivity tool might use 14. Never hardcode these without asking.

Common Mistake: Not accounting for users who never had any activity after signup. If you join to the users table with a LEFT JOIN, users with no events will have last_activity_date = NULL. Add COALESCE(last_activity_date, signup_date) or handle NULLs explicitly.

What the Interviewer Is Testing: Simple aggregation combined with business logic in a CASE statement, and awareness that product-specific context should shape threshold definitions.
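The CASE bucketing translates directly to plain Python; the 30/90-day thresholds below are the question's examples, not universal standards:

```python
def churn_status(days_since_active: int) -> str:
    # Mirrors the SQL CASE: highest threshold checked first.
    if days_since_active >= 90:
        return "churned"
    if days_since_active >= 30:
        return "at_risk"
    return "active"

print(churn_status(167), churn_status(57), churn_status(19))
# churned at_risk active -> matches the sample output rows above
```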

Question 11: Subscription cohort retention across three post-signup months

A SaaS tool bills subscriptions monthly. Write a query that returns, for each signup cohort and plan type, the retention rate in months 1, 2, and 3 after signup. A subscription is "retained" in a given month if its end_date is NULL (still active) or falls after that month's end.

sql
WITH cohort_base AS (
    SELECT
        user_id,
        plan_id,
        DATE_TRUNC('month', start_date)::DATE AS cohort_month,
        start_date,
        end_date
    FROM subscriptions
),
cohort_sizes AS (
    SELECT
        cohort_month,
        plan_id,
        COUNT(DISTINCT user_id) AS cohort_size
    FROM cohort_base
    GROUP BY 1, 2
),
retention AS (
    SELECT
        cb.cohort_month,
        cb.plan_id,
        cs.cohort_size,
        -- Month 1: still active 1 month after start
        COUNT(DISTINCT CASE
            WHEN cb.end_date IS NULL
              OR cb.end_date >= cb.cohort_month + INTERVAL '2 months'
            THEN cb.user_id END)                                    AS month_1_retained,
        -- Month 2
        COUNT(DISTINCT CASE
            WHEN cb.end_date IS NULL
              OR cb.end_date >= cb.cohort_month + INTERVAL '3 months'
            THEN cb.user_id END)                                    AS month_2_retained,
        -- Month 3
        COUNT(DISTINCT CASE
            WHEN cb.end_date IS NULL
              OR cb.end_date >= cb.cohort_month + INTERVAL '4 months'
            THEN cb.user_id END)                                    AS month_3_retained
    FROM cohort_base cb
    JOIN cohort_sizes cs
        ON cb.cohort_month = cs.cohort_month
        AND cb.plan_id = cs.plan_id
    GROUP BY 1, 2, 3
)
SELECT
    cohort_month,
    plan_id,
    cohort_size,
    ROUND(100.0 * month_1_retained / cohort_size, 1) AS month_1_pct,
    ROUND(100.0 * month_2_retained / cohort_size, 1) AS month_2_pct,
    ROUND(100.0 * month_3_retained / cohort_size, 1) AS month_3_pct
FROM retention
ORDER BY cohort_month, plan_id;

Expected output:

cohort_month    plan_id    cohort_size    month_1_pct    month_2_pct    month_3_pct
2024-01-01      basic      1,240          72.3           58.1           46.9
2024-01-01      pro        418            85.4           73.2           64.1
2024-02-01      basic      1,108          71.0           57.3           45.8

Key Insight: The end_date >= cohort_month + INTERVAL '2 months' logic means "still active at the start of month 2," which is equivalent to "retained through month 1." Walk through the interval arithmetic explicitly in your interview answer.

Common Mistake: Treating end_date in the same calendar month as start_date as retained. Per the standard SaaS definition, a subscription cancelled in its first month was never retained. Adjust your interval check accordingly.

What the Interviewer Is Testing: Translating a business rule (monthly retention for a subscription product) into date arithmetic, and handling the NULL end_date case for active subscriptions.
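The interval arithmetic is easier to trust after walking through concrete dates. A Python sketch with three hypothetical subscriptions:

```python
from datetime import date

# "Retained through month 1" == "still active at the start of month offset 2".
cohort_start = date(2024, 1, 1)
month_2_start = date(2024, 3, 1)  # cohort month + INTERVAL '2 months'

end_dates = {
    "cancelled_jan": date(2024, 1, 20),  # cancelled in first month
    "cancelled_feb": date(2024, 2, 15),  # cancelled during month 1
    "still_active": None,                # NULL end_date
}

# end_date IS NULL OR end_date >= cohort_month + 2 months
retained_m1 = {k: (v is None or v >= month_2_start) for k, v in end_dates.items()}
print(retained_m1)
# {'cancelled_jan': False, 'cancelled_feb': False, 'still_active': True}
```

Only the subscription that survives into March counts as month-1 retained, which is the "retained through month 1" reading of the interval check.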

Advanced Product Metrics

The final tier of cohort analysis questions goes beyond simple retention rates. These questions test whether you can build the metrics product teams actually use in weekly reviews: L28 engagement depth, power user identification, and week-over-week retention curves.

Question 12: L28, Count Days Active in the Last 28 Days Per User

L28 (also called the "activity histogram" or "power user curve") counts, for each user, how many of the last 28 days they were active. It is more nuanced than binary DAU/MAU because it shows engagement depth: a user active 28 of 28 days is fundamentally different from one active 1 of 28 days, even though both count toward MAU.

sql
-- As of a specific reference date (e.g., today)
WITH window_activity AS (
    SELECT
        user_id,
        COUNT(DISTINCT activity_date) AS days_active_l28
    FROM user_activity
    WHERE activity_date > CURRENT_DATE - INTERVAL '28 days'
      AND activity_date <= CURRENT_DATE
    GROUP BY user_id
),
-- Join back to all users who were ever active (to include 0-day users)
all_users AS (
    SELECT DISTINCT user_id FROM user_activity
)
SELECT
    au.user_id,
    COALESCE(wa.days_active_l28, 0) AS days_active_l28
FROM all_users au
LEFT JOIN window_activity wa ON au.user_id = wa.user_id
ORDER BY days_active_l28 DESC;

To build the distribution (the power user curve itself):

sql
WITH window_activity AS (
    SELECT
        user_id,
        COUNT(DISTINCT activity_date) AS days_active_l28
    FROM user_activity
    WHERE activity_date > CURRENT_DATE - INTERVAL '28 days'
      AND activity_date <= CURRENT_DATE
    GROUP BY user_id
)
SELECT
    days_active_l28,
    COUNT(user_id)                                        AS user_count,
    ROUND(100.0 * COUNT(user_id) / SUM(COUNT(user_id)) OVER (), 1) AS pct_of_mau
FROM window_activity
GROUP BY days_active_l28
ORDER BY days_active_l28;

Expected output:

days_active_l28    user_count    pct_of_mau
1                  12,400        24.1
2                  8,200         15.9
7                  4,100         8.0
28                 1,800         3.5

Key Insight: The shape of this distribution reveals product health. A J-shaped curve where most users appear only 1-3 days is a warning sign. A flatter distribution with a meaningful tail at 20-28 days means you have a healthy core of power users.

Common Mistake: Using a 30-day or 31-day window instead of exactly 28 days. L28 is a fixed 28-day window by convention, not "this calendar month." It is comparable across months because the denominator never changes.

What the Interviewer Is Testing: Knowledge of the L28 metric convention, window date arithmetic, and the ability to translate a histogram query into business insight.
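The distribution step in miniature, using Python's Counter on invented per-user L28 counts:

```python
from collections import Counter

# Hypothetical per-user counts of distinct active days in a 28-day window.
days_active = {"u1": 28, "u2": 1, "u3": 1, "u4": 7}

curve = Counter(days_active.values())   # days_active_l28 -> user_count
total = sum(curve.values())             # SUM(COUNT(user_id)) OVER () analogue

for d in sorted(curve):
    print(d, curve[d], round(100 * curve[d] / total, 1))
# 1 2 50.0 / 7 1 25.0 / 28 1 25.0 -> a J-shaped (mostly 1-day) toy curve
```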

Question 13: Week-over-week retention curves with a single query

Product teams often track cohort retention as a weekly curve: of users who joined in week W, what fraction returned in W+1, W+2, W+3? Write a query that generates this table for all cohorts in the dataset.

sql
WITH first_week AS (
    SELECT
        user_id,
        DATE_TRUNC('week', MIN(activity_date))::DATE AS cohort_week
    FROM user_activity
    GROUP BY user_id
),
weekly_activity AS (
    SELECT
        ua.user_id,
        fw.cohort_week,
        DATE_TRUNC('week', ua.activity_date)::DATE AS activity_week,
        -- Weeks since acquisition: normalize both dates to week start first
        (DATE_TRUNC('week', ua.activity_date)::DATE - fw.cohort_week) / 7 AS week_number
    FROM user_activity ua
    JOIN first_week fw ON ua.user_id = fw.user_id
),
cohort_sizes AS (
    SELECT cohort_week, COUNT(DISTINCT user_id) AS size
    FROM first_week
    GROUP BY cohort_week
)
SELECT
    wa.cohort_week,
    cs.size                                                       AS cohort_size,
    wa.week_number,
    COUNT(DISTINCT wa.user_id)                                    AS retained_users,
    ROUND(
        100.0 * COUNT(DISTINCT wa.user_id) / cs.size, 1
    )                                                             AS retention_pct
FROM weekly_activity wa
JOIN cohort_sizes cs ON wa.cohort_week = cs.cohort_week
WHERE wa.week_number BETWEEN 0 AND 12
GROUP BY wa.cohort_week, cs.size, wa.week_number
ORDER BY wa.cohort_week, wa.week_number;

Expected output:

cohort_week    cohort_size    week_number    retained_users    retention_pct
2024-01-01     1,840          0              1,840             100.0
2024-01-01     1,840          1              1,067             58.0
2024-01-01     1,840          2              790               42.9
2024-01-01     1,840          4              589               32.0
2024-01-08     1,712          0              1,712             100.0

Key Insight: Week-over-week cohort curves often reveal onboarding quality. If week-1 retention is below 30%, that is almost always an onboarding problem, not a product problem. Saying this in an interview demonstrates product instinct alongside SQL skill.

Common Mistake: Integer division in (activity_date - cohort_week) / 7 giving unexpected results when the dates are not on the same day of the week. Use DATE_TRUNC('week', ...) on both sides to normalize to Monday before subtracting.

What the Interviewer Is Testing: The complete cohort retention pattern applied to weekly granularity, integer arithmetic for week offsets, and product interpretation of the output.
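The week-offset arithmetic can be checked in Python by normalizing both dates to their Monday first (a rough analogue of DATE_TRUNC('week', ...)):

```python
from datetime import date, timedelta

def week_start(d: date) -> date:
    # Monday of d's ISO week, like DATE_TRUNC('week', d) in Postgres.
    return d - timedelta(days=d.weekday())

cohort_monday = week_start(date(2024, 1, 3))   # Wed -> Mon 2024-01-01

# Sunday of the same week is still week 0 once both sides are normalized.
print((week_start(date(2024, 1, 7)) - cohort_monday).days // 7)  # 0
# The following Monday is week 1.
print((week_start(date(2024, 1, 8)) - cohort_monday).days // 7)  # 1
```

Without the normalization, a Sunday return six days after a Wednesday first-activity date would still land in week 0, but the boundaries would shift per user; truncating both sides keeps week numbers aligned across the cohort.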

Conclusion

Cohort analysis questions test something that pure SQL syntax questions cannot: whether you understand how user behavior unfolds over time. The patterns above appear in almost every product data science interview because they mirror real work. Building a Day-7 retention query is not an academic exercise. It is the first thing a new data scientist does when joining a consumer app team.

Three things separate candidates who pass these questions from those who do not. First, they define the metric before writing SQL. The moment you hear "retention," you ask: Day-N or rolling? Exact date or any date after? Every active user or only those from a specific cohort? Second, they use CTEs to break the problem into named, testable pieces. A single 40-line query with no CTEs will almost always have an error that is hard to find. Four 10-line CTEs make each piece verifiable. Third, they always use LEFT JOIN when preserving a denominator. If a user was not retained, you still need them in your count. Dropping them with an INNER JOIN silently corrupts your retention rate.
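The LEFT JOIN point is worth seeing concretely. A small sketch using Python's sqlite3 with a hypothetical four-user cohort, of whom only one returns, shows how an INNER JOIN silently inflates the rate:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE cohort(user_id INTEGER);        -- 4 users in the anchor week
    CREATE TABLE week1_active(user_id INTEGER);  -- only 1 of them came back
    INSERT INTO cohort VALUES (1),(2),(3),(4);
    INSERT INTO week1_active VALUES (2);
""")

# INNER JOIN drops the three non-returning users before counting,
# so the denominator shrinks to 1 and the rate reads 100%.
inner_rate, = conn.execute("""
    SELECT 100.0 * COUNT(w.user_id) / COUNT(c.user_id)
    FROM cohort c JOIN week1_active w ON c.user_id = w.user_id
""").fetchone()

# LEFT JOIN keeps the full cohort in the denominator: 1 of 4 = 25%.
left_rate, = conn.execute("""
    SELECT 100.0 * COUNT(w.user_id) / COUNT(c.user_id)
    FROM cohort c LEFT JOIN week1_active w ON c.user_id = w.user_id
""").fetchone()
```

Same query shape, one keyword of difference, and the reported retention moves from 25% to 100%.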

The date spine technique (Question 8) and the L28 power user curve (Question 12) are genuinely senior-level patterns. If you can explain why a cohort matrix has missing rows and how a cross join with a generated series solves it, you are demonstrating the kind of thinking that separates people who have built real analytics infrastructure from those who have only practiced interview prep problems.
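The date spine idea can be sketched compactly in Python's sqlite3, using a recursive CTE in place of PostgreSQL's GENERATE_SERIES (the single-user activity data is hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE activity(user_id INTEGER, week_number INTEGER)")
# User 1 skips week 1 entirely; a plain GROUP BY would omit that row.
conn.executemany("INSERT INTO activity VALUES (?, ?)", [(1, 0), (1, 2)])

rows = conn.execute("""
    WITH RECURSIVE spine(week_number) AS (   -- stand-in for GENERATE_SERIES
        SELECT 0
        UNION ALL SELECT week_number + 1 FROM spine WHERE week_number < 2
    ),
    grid AS (                                -- every (user, week) combination
        SELECT u.user_id, s.week_number
        FROM (SELECT DISTINCT user_id FROM activity) u
        CROSS JOIN spine s
    )
    SELECT g.user_id, g.week_number, COUNT(a.user_id) AS active
    FROM grid g
    LEFT JOIN activity a
        ON a.user_id = g.user_id AND a.week_number = g.week_number
    GROUP BY g.user_id, g.week_number
    ORDER BY g.week_number
""").fetchall()
```

The cross join guarantees a row for every week, and the LEFT JOIN back to the activity table fills in zeros for the weeks the user skipped, which is exactly why the cohort matrix stops having holes.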

For a deeper foundation in the SQL patterns underlying these techniques, the SQL Window Functions Guide and the Product Data Science Interview Prep series on letsdatascience.com cover the LAG/LEAD patterns and CTE construction techniques used throughout this article.

Every question above has a correct answer that you can rehearse. What you cannot rehearse is the product intuition that makes your answer memorable to an interviewer. Practice explaining what the output means, not just how to produce it.

Career Q&A

How much time should I spend on SQL prep for a product data science interview?

For a role where product analytics is central (consumer apps, SaaS, social platforms), SQL should get about 40% of your technical prep time. The other 60% goes to product sense, A/B testing, and statistics. Retention and cohort questions appear in nearly every product DS loop, so they deserve dedicated practice sessions, not just passive reading.

Should I memorize the exact date arithmetic formulas before the interview?

Know the general pattern, not the exact syntax. Interviewers care more that you construct the logic correctly than that you recall (EXTRACT(YEAR FROM d2) - EXTRACT(YEAR FROM d1)) * 12 + ... from memory. Most interview environments allow you to look up dialect-specific functions. What they want to see is that you know why the arithmetic is needed and what it calculates.
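That month-offset arithmetic is short enough to verify by hand. A Python sketch of the same formula (example dates are illustrative):

```python
from datetime import date

def month_offset(first: date, later: date) -> int:
    """Calendar-month offset, mirroring
    (EXTRACT(YEAR FROM d2) - EXTRACT(YEAR FROM d1)) * 12
    + (EXTRACT(MONTH FROM d2) - EXTRACT(MONTH FROM d1))."""
    return (later.year - first.year) * 12 + (later.month - first.month)

# Nov 2023 -> Feb 2024 is 3 calendar months, regardless of day-of-month.
nov_to_feb = month_offset(date(2023, 11, 20), date(2024, 2, 3))

# Jan 31 -> Feb 1 is offset 1: these are month buckets, not 30-day windows.
jan_to_feb = month_offset(date(2024, 1, 31), date(2024, 2, 1))
```

The second example is the part worth saying out loud in an interview: this formula buckets by calendar month, so adjacent dates can land in different buckets, which is a definitional choice, not a bug.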

My retention query works in the interview but gives wrong numbers. How do I debug it?

Walk through a single user manually. Pick a user from your sample data, trace their first activity date, compute the expected month offset by hand, then compare that against what your query produces for that user. This process catches 90% of errors: wrong join type, incorrect interval arithmetic, or a GROUP BY that eliminates the user entirely. Say this debugging process out loud. It demonstrates structured thinking.
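That single-user walkthrough can itself be scripted. A sketch using Python's sqlite3 with a hypothetical user 7, comparing hand-computed month offsets against what a query produces (strftime here stands in for EXTRACT):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events(user_id INTEGER, event_date TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [(7, "2024-01-03"), (7, "2024-01-09"), (7, "2024-02-14")])

# Step 1: the user's first activity date, which anchors their cohort.
first, = conn.execute(
    "SELECT MIN(event_date) FROM events WHERE user_id = 7").fetchone()

# Step 2: expected month offsets computed by hand: two January events -> 0,
# one February event -> 1.
expected = [0, 0, 1]

# Step 3: what the query logic produces for the same user.
produced = [r[0] for r in conn.execute("""
    SELECT (CAST(strftime('%Y', event_date) AS INT)
          - CAST(strftime('%Y', f.first_date) AS INT)) * 12
         + (CAST(strftime('%m', event_date) AS INT)
          - CAST(strftime('%m', f.first_date) AS INT))
    FROM events,
         (SELECT MIN(event_date) AS first_date FROM events WHERE user_id = 7) f
    WHERE user_id = 7
    ORDER BY event_date
""").fetchall()]
```

If `produced` and `expected` disagree for even one event, you have localized the bug to the offset arithmetic rather than the joins or the grouping.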

Is it better to use a self-join or window functions for retention queries?

Self-joins are easier to explain in an interview: "I join the events table to itself, matching on user and date." Window functions with complex frame clauses can confuse interviewers who are not deeply technical. For Day-N retention specifically, a self-join with a precise join condition is often cleaner. For rolling aggregations and month-over-month comparisons, LAG/LEAD and window frames are the right tool. Know both; default to whichever is more readable given the question.
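As a quick illustration of the window-function side, here is a LAG-based gap calculation in Python's sqlite3 (requires SQLite 3.25+ for window functions; the events are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events(user_id INTEGER, event_date TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [(1, "2024-01-01"), (1, "2024-01-08"), (1, "2024-02-01")])

# LAG gives each row the user's previous active date; julianday() converts
# the TEXT dates to day numbers so the subtraction yields a gap in days.
gaps = conn.execute("""
    SELECT CAST(julianday(event_date)
              - julianday(LAG(event_date) OVER (
                    PARTITION BY user_id ORDER BY event_date)) AS INT) AS gap
    FROM events
    ORDER BY event_date
""").fetchall()
# The first row has no previous date, so its gap is NULL (None in Python).
```

One LAG call answers "how long between visits" for every row at once, which is exactly the kind of rolling comparison where a self-join would get clumsy.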

What do interviewers mean when they say "your denominator is wrong"?

It almost always means you filtered out users who were NOT retained before counting. The denominator of any retention rate is the full cohort: every user who was first active in the anchor period, regardless of whether they came back. Using INNER JOIN instead of LEFT JOIN, or adding a WHERE clause on the retained event, shrinks the denominator and inflates the retention rate. Always verify your denominator independently with a COUNT(DISTINCT user_id) on just the cohort CTE.

How should I explain cohort analysis to a non-technical interviewer?

Use the class reunion analogy: "Imagine you track all students who graduated in the same year as one group. Month 1, you count how many showed up to the one-month reunion. Month 6, how many showed up to the six-month reunion. Each graduating class is a cohort. The fraction who show up each month is the retention rate." This explanation shows you can communicate across audiences, a skill that matters as much as the SQL.

Do these patterns translate across databases (Redshift, BigQuery, Snowflake)?

Yes, with minor syntax adjustments. DATE_TRUNC is standard in PostgreSQL, Redshift, and Snowflake. BigQuery uses DATE_TRUNC(date, MONTH) with a slightly different argument order. GENERATE_SERIES is PostgreSQL/Redshift; BigQuery uses GENERATE_DATE_ARRAY; Snowflake uses GENERATOR. Mention in the interview which dialect you are most comfortable with, then note the equivalent function in the system they use. This signals real-world experience.
