
SQL CTEs and Subqueries: The Complete Interview Guide

LDS Team
Let's Data Science

"The questions in this article are drawn from our research of publicly available interview prep platforms and community discussions, including DataLemur, StrataScratch, LeetCode, and forums such as r/datascience. These represent SQL patterns that data professionals have reported in technical interviews across the industry. We do not claim these are proprietary questions from any specific organization."

Every SQL technical screen has at least one question where you must decide: CTE or subquery? The wrong choice does not necessarily produce wrong output (both can return the same rows), but the choice signals whether you think about code as something a colleague will maintain, or just something that runs. Interviewers are watching for that distinction.

This guide covers the mechanics, the decision framework, and twelve worked questions drawn from the patterns that surface most often in data interviews.

What CTEs Actually Are (and What They Are Not)

A Common Table Expression is a named temporary result set defined with the WITH clause. It exists only for the duration of the query. It is not a view, not a temporary table, and not a persisted object.

sql
WITH monthly_revenue AS (
    SELECT
        DATE_TRUNC('month', order_date) AS month,
        SUM(amount)                     AS revenue
    FROM orders
    GROUP BY 1
)
SELECT
    month,
    revenue,
    LAG(revenue) OVER (ORDER BY month) AS prior_month_revenue
FROM monthly_revenue;

The CTE named monthly_revenue is evaluated, its result is available to the outer query, and then it disappears. Nothing is written to disk.

The misconception that trips up many candidates: CTEs are not always faster than subqueries. Before version 12, PostgreSQL materialized every CTE unconditionally, acting as an optimization fence. PostgreSQL 12 changed the default so that a single-use CTE is inlined (treated as if you had written the subquery directly), which lets the planner push index predicates into it and removes most of the performance difference for simple cases. A CTE referenced more than once is still materialized by default: the database computes it once and stores the result. A subquery in the same position would be re-evaluated each time, which is sometimes slower (the work repeats) and sometimes faster (the optimizer can push predicates into each evaluation).

The PostgreSQL documentation states it directly: a CTE referenced multiple times is automatically materialized, while a CTE referenced once is folded into the parent query as if written inline. You can override this with WITH w AS MATERIALIZED (...) or WITH w AS NOT MATERIALIZED (...).
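Since PostgreSQL 12 you can make either behavior explicit. A sketch of both overrides; the tables mirror the examples in this guide, and the exact plans you get will depend on your data:

sql
-- Force inlining so the outer WHERE can be pushed into the CTE
WITH recent_orders AS NOT MATERIALIZED (
    SELECT order_id, user_id, amount, order_date
    FROM orders
)
SELECT user_id, SUM(amount)
FROM recent_orders
WHERE order_date >= '2025-01-01'   -- predicate can reach the orders scan
GROUP BY user_id;

-- Force materialization so an expensive aggregate is computed only once
WITH user_counts AS MATERIALIZED (
    SELECT user_id, COUNT(*) AS event_count
    FROM events
    GROUP BY user_id
)
SELECT * FROM user_counts WHERE event_count > 100;

MATERIALIZED on a single-use CTE reinstates the pre-12 fence behavior, which is occasionally useful when you want to pin down a plan.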

Key Insight: The query optimizer in modern PostgreSQL (12+) treats a single-use CTE identically to an inline subquery. The gap between CTEs and subqueries is primarily about readability and maintainability, not execution speed.

Question 1: CTE syntax warm-up, top revenue artists per genre

Schema:

sql
CREATE TABLE concerts (
    artist_id        INTEGER,
    artist_name      VARCHAR(100),
    genre            VARCHAR(50),
    concert_revenue  INTEGER,
    number_of_members INTEGER
);

Problem: For each genre, find the artist with the highest revenue per band member.

Solution:

sql
WITH ranked_artists AS (
    SELECT
        artist_name,
        genre,
        concert_revenue,
        number_of_members,
        concert_revenue::NUMERIC / number_of_members   AS revenue_per_member,
        RANK() OVER (
            PARTITION BY genre
            ORDER BY concert_revenue::NUMERIC / number_of_members DESC
        )                                            AS rnk
    FROM concerts
)
SELECT
    artist_name,
    genre,
    concert_revenue,
    number_of_members,
    revenue_per_member
FROM ranked_artists
WHERE rnk = 1
ORDER BY revenue_per_member DESC;

Why CTE here: The RANK() window function produces the rank column inside the CTE. The outer query then filters on rnk = 1. You cannot reference a window function alias in the same query's WHERE clause, so you need either a CTE or a subquery to wrap it. The CTE version makes the two-step logic explicit: first rank, then filter.

Common Mistake: Writing WHERE RANK() OVER (...) = 1 directly. This is invalid SQL. Window functions run after WHERE, so you must wrap them in a CTE or subquery before filtering on their output.

When CTEs Win: Readability and Maintainability

CTEs excel when a query has multiple logical steps and each step deserves a name. Compare these two versions of the same query, finding users who made a purchase within 7 days of their first session.

Schema:

sql
CREATE TABLE sessions  (user_id INT, session_date DATE);
CREATE TABLE purchases (user_id INT, purchase_date DATE, amount NUMERIC);

Version A: nested subqueries:

sql
SELECT
    p.user_id,
    COUNT(*)        AS purchases_in_first_week,
    SUM(p.amount)   AS revenue
FROM purchases p
JOIN (
    SELECT user_id, MIN(session_date) AS first_session
    FROM sessions
    GROUP BY user_id
) fs ON p.user_id = fs.user_id
WHERE p.purchase_date <= fs.first_session + INTERVAL '7 days'
  AND p.purchase_date >= fs.first_session
GROUP BY p.user_id;

Version B: CTE:

sql
WITH first_sessions AS (
    SELECT user_id, MIN(session_date) AS first_session
    FROM sessions
    GROUP BY user_id
),
early_purchases AS (
    SELECT
        p.user_id,
        p.amount
    FROM purchases p
    JOIN first_sessions fs ON p.user_id = fs.user_id
    WHERE p.purchase_date BETWEEN fs.first_session
                              AND fs.first_session + INTERVAL '7 days'
)
SELECT
    user_id,
    COUNT(*)      AS purchases_in_first_week,
    SUM(amount)   AS revenue
FROM early_purchases
GROUP BY user_id;

Both return identical results. Version B reads like a short story: find first sessions, find early purchases, then aggregate. A new teammate can read each CTE in isolation.

Question 2: Two-step CTE, employees earning above their department average

Schema:

sql
CREATE TABLE employees (
    employee_id   INTEGER PRIMARY KEY,
    name          VARCHAR(100),
    department    VARCHAR(50),
    salary        NUMERIC
);

Problem: List every employee whose salary exceeds their department's average. Include the department average in the output.

Solution:

sql
WITH dept_averages AS (
    SELECT
        department,
        AVG(salary) AS avg_salary
    FROM employees
    GROUP BY department
)
SELECT
    e.name,
    e.department,
    e.salary,
    ROUND(da.avg_salary, 2) AS dept_avg_salary
FROM employees e
JOIN dept_averages da ON e.department = da.department
WHERE e.salary > da.avg_salary
ORDER BY e.department, e.salary DESC;

Key insight: The CTE computes department averages once. The outer query joins against it for every employee row. Compare to the correlated subquery version (shown in the Correlated vs Uncorrelated Subqueries section below), which would recompute the average for every single row in the employees table.

Subqueries Are Not Dead: When They Are the Right Choice

Subqueries remain the cleaner choice in three specific situations. Forcing a CTE in these cases adds boilerplate without improving clarity.

Scalar subqueries in the SELECT clause let you add a derived column without a join:

sql
SELECT
    order_id,
    amount,
    (SELECT AVG(amount) FROM orders) AS overall_avg,
    amount - (SELECT AVG(amount) FROM orders) AS diff_from_avg
FROM orders
WHERE order_date >= '2025-01-01';

The overall average is a single value. Writing a CTE for a single scalar is unnecessarily ceremonial.
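For contrast, here is a CTE version of the same query. It returns the same rows but spends several lines naming a single number:

sql
WITH overall AS (
    SELECT AVG(amount) AS overall_avg FROM orders
)
SELECT
    o.order_id,
    o.amount,
    ov.overall_avg,
    o.amount - ov.overall_avg AS diff_from_avg
FROM orders o
CROSS JOIN overall ov
WHERE o.order_date >= '2025-01-01';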

Simple WHERE IN filters where the subquery is a short, obvious lookup:

sql
SELECT *
FROM products
WHERE category_id IN (
    SELECT id FROM categories WHERE is_active = TRUE
);

A CTE here would only add ceremony: the subquery is already a short, self-describing lookup, and its intent is perfectly clear inside the parentheses.

EXISTS / NOT EXISTS for membership checks:

sql
-- Find managers who have at least one direct report
SELECT employee_id, name
FROM employees e
WHERE EXISTS (
    SELECT 1 FROM employees sub
    WHERE sub.manager_id = e.employee_id
);

Key Insight: The SELECT 1 inside EXISTS is a convention that signals intent to the reader. The database only checks whether any row is returned, never the column values. SELECT * would work identically.

Question 3: Scalar subquery, product price vs. category average

Schema:

sql
CREATE TABLE products (
    product_id   INTEGER,
    product_name VARCHAR(100),
    category_id  INTEGER,
    price        NUMERIC
);

Problem: For each product, show its price, the average price of its category, and whether it is above or below average.

Solution:

sql
SELECT
    product_name,
    price,
    ROUND(
        (SELECT AVG(p2.price)
         FROM products p2
         WHERE p2.category_id = p1.category_id),
        2
    )                                           AS category_avg_price,
    CASE
        WHEN price > (SELECT AVG(p2.price) FROM products p2
                      WHERE p2.category_id = p1.category_id)
        THEN 'Above Average'
        ELSE 'Below Average'
    END                                         AS vs_average
FROM products p1
ORDER BY category_id, price DESC;

When to mention the performance concern: This is a correlated subquery in the SELECT list, and it runs once per row. On a 10-row table the performance is negligible. On a 10-million-row table, it becomes a scan-per-row problem. In an interview, acknowledge this and offer the CTE + JOIN alternative if the interviewer asks about scaling.

Correlated vs Uncorrelated Subqueries

This distinction appears in almost every advanced SQL interview. The difference is execution behavior:

An uncorrelated subquery has no reference to the outer query. It executes once, returns a result set, and the outer query uses that result.

sql
-- Uncorrelated: the subquery runs once
SELECT name FROM employees
WHERE department_id IN (
    SELECT department_id FROM departments WHERE location = 'New York'
);

A correlated subquery references a column from the outer query. It runs once per row processed by the outer query.

sql
-- Correlated: the subquery runs once for every row in employees
SELECT name, salary, department_id
FROM employees e
WHERE salary > (
    SELECT AVG(salary)
    FROM employees
    WHERE department_id = e.department_id  -- references outer query's row
);

The reference e.department_id is what makes this correlated. For each employee row the outer query processes, the database re-evaluates the AVG(salary) computation for that employee's department.

Question 4: Rewrite a correlated subquery as a JOIN

Problem: The correlated subquery from above is slow on a 5-million-row employees table. Rewrite it as a JOIN-based query that achieves the same result.

Solution:

sql
WITH dept_averages AS (
    SELECT
        department_id,
        AVG(salary) AS avg_salary
    FROM employees
    GROUP BY department_id
)
SELECT
    e.name,
    e.salary,
    e.department_id
FROM employees e
JOIN dept_averages da ON e.department_id = da.department_id
WHERE e.salary > da.avg_salary;

Why this is faster: The CTE computes one average per department. If there are 20 departments, that is 20 aggregate computations. The correlated version computes one average per employee. If there are 5 million employees spread across 20 departments, that is 5 million computations (many of which produce the same number).

Performance Note: Modern PostgreSQL can sometimes optimize away the correlated subquery by transforming it into a hash join internally. But you cannot rely on this, and the explicit JOIN version is always at least as fast and more portable across databases.
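If you want to see what the planner actually chose, run the query under EXPLAIN ANALYZE. In PostgreSQL, a per-row re-evaluation typically shows up as a SubPlan node, while the rewritten version shows a join against a single aggregation:

sql
EXPLAIN ANALYZE
SELECT name, salary, department_id
FROM employees e
WHERE salary > (
    SELECT AVG(salary)
    FROM employees
    WHERE department_id = e.department_id
);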

Question 5: NOT EXISTS for anti-join pattern

Schema:

sql
CREATE TABLE customers (customer_id INT, name VARCHAR(100));
CREATE TABLE orders    (order_id INT, customer_id INT, order_date DATE);

Problem: Find all customers who have never placed an order.

Solution using NOT EXISTS:

sql
SELECT c.customer_id, c.name
FROM customers c
WHERE NOT EXISTS (
    SELECT 1 FROM orders o
    WHERE o.customer_id = c.customer_id
);

Solution using LEFT JOIN (equivalent):

sql
SELECT c.customer_id, c.name
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
WHERE o.order_id IS NULL;

Which to use in an interview: Both are correct. The NOT EXISTS version is often more readable when the intent is "has no matching row." The LEFT JOIN version is slightly more familiar to SQL beginners. Showing both and explaining the tradeoff earns points.

Common Mistake: Using NOT IN instead of NOT EXISTS when orders.customer_id might contain NULLs. NOT IN with any NULL in the subquery result returns zero rows, a silent, incorrect result. NOT EXISTS handles NULLs correctly. This is a trap question at senior-level interviews.
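The trap is easy to demonstrate. Suppose (hypothetically) customers holds ids 1, 2, and 3, and orders contains customer_ids 1 and NULL, say from a guest checkout:

sql
-- NOT IN: returns ZERO rows. The predicate expands to
-- customer_id <> 1 AND customer_id <> NULL, and any comparison
-- with NULL is UNKNOWN, so no row ever passes the filter.
SELECT c.customer_id
FROM customers c
WHERE c.customer_id NOT IN (SELECT o.customer_id FROM orders o);

-- NOT EXISTS: correctly returns customers 2 and 3
SELECT c.customer_id
FROM customers c
WHERE NOT EXISTS (
    SELECT 1 FROM orders o WHERE o.customer_id = c.customer_id
);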

Recursive CTEs for Hierarchical Data

Recursive CTEs are the only pure-SQL way to traverse tree structures without knowing the depth in advance. They appear regularly at senior levels (L5+), most often in the context of org chart traversal, product category trees, or generating calendar series.

The structure has two mandatory parts joined by UNION ALL:

sql
WITH RECURSIVE cte_name AS (
    -- Anchor member: the starting point (non-recursive)
    SELECT ...

    UNION ALL

    -- Recursive member: references cte_name to go one level deeper
    SELECT ... FROM source_table JOIN cte_name ON ...
)
SELECT * FROM cte_name;

The database executes the anchor once, then feeds the result into the recursive member, keeps going until the recursive member produces zero rows, and returns the accumulated result.

Question 6: Employee manager hierarchy traversal

Schema:

sql
CREATE TABLE employees (
    employee_id INTEGER PRIMARY KEY,
    name        VARCHAR(100),
    manager_id  INTEGER REFERENCES employees(employee_id)
    -- manager_id is NULL for the CEO
);

Problem: For a given manager (employee_id = 5), find all direct and indirect reports, along with their depth in the hierarchy.

Solution:

sql
WITH RECURSIVE org_tree AS (
    -- Anchor: start with the target manager
    SELECT
        employee_id,
        name,
        manager_id,
        0 AS depth
    FROM employees
    WHERE employee_id = 5

    UNION ALL

    -- Recursive: find employees who report to someone already in the tree
    SELECT
        e.employee_id,
        e.name,
        e.manager_id,
        ot.depth + 1
    FROM employees e
    JOIN org_tree ot ON e.manager_id = ot.employee_id
)
SELECT
    employee_id,
    name,
    manager_id,
    depth
FROM org_tree
WHERE employee_id <> 5  -- exclude the starting manager
ORDER BY depth, name;

Expected output (sample):

employee_id  name   manager_id  depth
8            Alice  5           1
12           Bob    5           1
19           Carol  8           2
23           David  12          2

Key Insight: For tree-structured data (org charts, category hierarchies), use UNION ALL in recursive CTEs, not UNION. UNION deduplicates at every iteration, which adds a sort or hash step and can silently drop rows you meant to keep (for example, when the same value is reachable along multiple paths and each path matters). For graph data with potential cycles, UNION or explicit cycle detection is needed to prevent infinite recursion.
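For graph-shaped data, one common PostgreSQL technique is to carry the visited path in an array and stop expanding once a node repeats. A sketch, assuming a hypothetical edges(src, dst) table and a walk starting at node 1:

sql
WITH RECURSIVE walk AS (
    -- Anchor: start at node 1 with a one-element path
    SELECT
        1        AS node_id,
        ARRAY[1] AS path,
        FALSE    AS is_cycle

    UNION ALL

    SELECT
        e.dst,
        w.path || e.dst,
        e.dst = ANY(w.path)    -- revisiting a node means a cycle
    FROM edges e
    JOIN walk w ON e.src = w.node_id
    WHERE NOT w.is_cycle       -- stop expanding once a cycle is found
)
SELECT * FROM walk;

PostgreSQL 14 added a built-in CYCLE clause (CYCLE node_id SET is_cycle USING path) that generates this bookkeeping for you.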

Question 7: Generating a date series with a recursive CTE

Problem: Generate every date in Q1 2026 (January 1 through March 31) for use as a calendar spine in a reporting query.

Solution:

sql
WITH RECURSIVE date_series AS (
    -- Anchor: first day of the range
    SELECT DATE '2026-01-01' AS dt

    UNION ALL

    -- Recursive: add one day until the end of the range
    SELECT (dt + INTERVAL '1 day')::DATE
    FROM date_series
    WHERE dt < DATE '2026-03-31'
)
SELECT dt FROM date_series;

Why this matters in interviews: The date spine pattern is foundational for time-series analysis. It lets you LEFT JOIN actual event data against a complete calendar, so that days with zero activity show up as zero rather than disappearing from the result. Interviewers from analytics-heavy teams (growth, finance, marketing data) ask this pattern frequently.

Performance Note: Most production databases offer a native function for this. PostgreSQL has generate_series(), BigQuery has GENERATE_DATE_ARRAY(). In an interview, mention the recursive CTE first to demonstrate you understand the mechanism, then note that native functions exist and are preferred in production for readability.
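In PostgreSQL, the same Q1 2026 spine collapses to a single call:

sql
SELECT d::DATE AS dt
FROM generate_series(
    DATE '2026-01-01',
    DATE '2026-03-31',
    INTERVAL '1 day'
) AS d;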

Multi-CTE Pipelines for Complex Analysis

Chaining multiple CTEs is where the format truly earns its keep. Each CTE can reference any previously defined CTE, creating a readable pipeline where each step has a name.

Question 8: Multi-step CTE, second purchase analysis

Schema:

sql
CREATE TABLE orders (
    order_id    INTEGER,
    user_id     INTEGER,
    order_date  DATE,
    amount      NUMERIC
);

Problem: For users who made at least two purchases, find the time between their first and second purchase. Report the average, median, and maximum gap in days.

Solution:

sql
WITH ordered_purchases AS (
    -- Step 1: Rank each user's purchases by date
    SELECT
        user_id,
        order_date,
        amount,
        ROW_NUMBER() OVER (
            PARTITION BY user_id
            ORDER BY order_date
        ) AS purchase_rank
    FROM orders
),
first_and_second AS (
    -- Step 2: Isolate first and second purchases only
    SELECT
        user_id,
        MAX(CASE WHEN purchase_rank = 1 THEN order_date END) AS first_purchase,
        MAX(CASE WHEN purchase_rank = 2 THEN order_date END) AS second_purchase
    FROM ordered_purchases
    WHERE purchase_rank <= 2
    GROUP BY user_id
    HAVING COUNT(*) = 2  -- must have both first and second
),
gaps AS (
    -- Step 3: Compute the gap in days
    SELECT
        user_id,
        second_purchase - first_purchase AS days_to_second_purchase
    FROM first_and_second
)
SELECT
    AVG(days_to_second_purchase)                        AS avg_gap_days,
    PERCENTILE_CONT(0.5) WITHIN GROUP
        (ORDER BY days_to_second_purchase)              AS median_gap_days,
    MAX(days_to_second_purchase)                        AS max_gap_days
FROM gaps;

Why the CTE pipeline wins here: The logic has three genuinely distinct steps: rank, pivot, compute. Trying to collapse all three into one query would create deeply nested subqueries that are nearly impossible to debug. The CTE version lets you test each step in isolation: run SELECT * FROM ordered_purchases LIMIT 20 to verify step 1 before proceeding.

Question 9: Full funnel analysis using CTEs

Schema:

sql
CREATE TABLE events (
    user_id    INTEGER,
    event_type VARCHAR(50),   -- 'page_view', 'signup', 'trial_start', 'purchase'
    event_date DATE
);

Problem: Build a conversion funnel showing how many unique users reached each of the four stages, and the conversion rate from one stage to the next.

Solution:

sql
WITH page_views AS (
    SELECT COUNT(DISTINCT user_id) AS users FROM events
    WHERE event_type = 'page_view'
),
signups AS (
    SELECT COUNT(DISTINCT user_id) AS users FROM events
    WHERE event_type = 'signup'
),
trial_starts AS (
    SELECT COUNT(DISTINCT user_id) AS users FROM events
    WHERE event_type = 'trial_start'
),
purchases AS (
    SELECT COUNT(DISTINCT user_id) AS users FROM events
    WHERE event_type = 'purchase'
),
funnel AS (
    SELECT 1 AS step, 'Page View'    AS stage, users FROM page_views
    UNION ALL
    SELECT 2,          'Signup',               users FROM signups
    UNION ALL
    SELECT 3,          'Trial Start',          users FROM trial_starts
    UNION ALL
    SELECT 4,          'Purchase',             users FROM purchases
)
SELECT
    step,
    stage,
    users,
    LAG(users) OVER (ORDER BY step)        AS prior_stage_users,
    ROUND(
        100.0 * users
        / NULLIF(LAG(users) OVER (ORDER BY step), 0),
        1
    )                                      AS conversion_pct
FROM funnel
ORDER BY step;

Expected output:

step  stage        users  prior_stage_users  conversion_pct
1     Page View    10000  NULL               NULL
2     Signup       2400   10000              24.0
3     Trial Start  960    2400               40.0
4     Purchase     192    960                20.0

Key Insight: NULLIF(LAG(users) OVER (...), 0) prevents division-by-zero when a stage has zero users. This is a small defensive pattern that interviewers notice. It shows you think about edge cases, not just the happy path.

Question 10: CTE reuse, comparing a metric to itself over two time windows

Problem: Using the same orders table, compare each user's total revenue in Q4 2025 vs Q1 2026. Show the change in absolute and percentage terms.

Solution:

sql
WITH q4_revenue AS (
    SELECT
        user_id,
        SUM(amount) AS revenue
    FROM orders
    WHERE order_date BETWEEN '2025-10-01' AND '2025-12-31'
    GROUP BY user_id
),
q1_revenue AS (
    SELECT
        user_id,
        SUM(amount) AS revenue
    FROM orders
    WHERE order_date BETWEEN '2026-01-01' AND '2026-03-31'
    GROUP BY user_id
)
SELECT
    COALESCE(q4.user_id, q1.user_id)    AS user_id,
    COALESCE(q4.revenue, 0)             AS q4_revenue,
    COALESCE(q1.revenue, 0)             AS q1_revenue,
    COALESCE(q1.revenue, 0)
        - COALESCE(q4.revenue, 0)       AS absolute_change,
    CASE
        WHEN COALESCE(q4.revenue, 0) = 0 THEN NULL
        ELSE ROUND(
            100.0 * (COALESCE(q1.revenue, 0) - COALESCE(q4.revenue, 0))
            / q4.revenue,
            1
        )
    END                                 AS pct_change
FROM q4_revenue q4
FULL OUTER JOIN q1_revenue q1 ON q4.user_id = q1.user_id
ORDER BY pct_change DESC NULLS LAST;

Why FULL OUTER JOIN: Some users may have ordered only in Q4 (churned), only in Q1 (new), or both. A FULL OUTER JOIN captures all three groups. The COALESCE(..., 0) handles the NULLs from users who have no activity in a given quarter.

The Interview Decision Framework

When you have five minutes to structure a SQL response in an interview, use this framework to decide which form to reach for:

Use a CTE when:

  • The query has two or more distinct logical steps that each deserve a name
  • You need to reference the same intermediate result more than once in the final query
  • The query involves window functions and you need to filter on the window result
  • You are writing a recursive query for hierarchical data
  • You are building a pipeline with 3+ stages (funnel, cohort, time-comparison)
  • Code review or maintainability is a stated requirement

Use a subquery when:

  • You need a single scalar value in the SELECT list or WHERE clause
  • You need a simple IN / NOT IN membership filter where the subquery is short and obvious
  • You need EXISTS / NOT EXISTS for a row-membership check
  • The intermediate result has no meaningful standalone name

Use a JOIN instead of either when:

  • You have a correlated subquery that will execute per row on a large table
  • The "subquery" is really a second table you need to bring in for filtering or aggregation

How to answer the follow-up question "Why did you structure it this way?"

Interviewers ask this to verify that your choice was intentional, not random. A strong answer has two parts: what the code does and what it protects against.

"I used a CTE here because the ranking step and the filtering step are logically separate. If I embedded the window function in a subquery, someone reading this later would have to trace two levels of nesting to understand the intent. I also wanted to make it easy to test the ranking logic independently before applying the filter."

"I kept this as a subquery because it is a single scalar, the average across all rows. There is no intermediate result worth naming, and a CTE would just add noise."

Question 11: Window function + CTE, top 3 products per category

Schema:

sql
CREATE TABLE sales (
    product_id   INTEGER,
    product_name VARCHAR(100),
    category     VARCHAR(50),
    revenue      NUMERIC
);

Problem: For each category, return the top 3 products by revenue. If there are ties at rank 3, include all tied products.

Solution:

sql
WITH ranked_products AS (
    SELECT
        product_name,
        category,
        revenue,
        DENSE_RANK() OVER (
            PARTITION BY category
            ORDER BY revenue DESC
        ) AS rnk
    FROM sales
)
SELECT
    product_name,
    category,
    revenue,
    rnk
FROM ranked_products
WHERE rnk <= 3
ORDER BY category, rnk;

DENSE_RANK vs RANK vs ROW_NUMBER, the interview question inside the question:

  • ROW_NUMBER() assigns unique integers with no ties (1, 2, 3, 4...), so use it when you want exactly N rows per group regardless of ties
  • RANK() skips after ties (1, 2, 2, 4...), so use it when you want strict competitive ranking
  • DENSE_RANK() never skips (1, 2, 2, 3...), so use it when you want top-N slots including all tied values

The problem says "if there are ties at rank 3, include all tied products," and that is DENSE_RANK().
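A quick side-by-side makes the difference concrete. On hypothetical revenues of 500, 400, 400, and 300 within one category:

sql
SELECT
    revenue,
    ROW_NUMBER() OVER (ORDER BY revenue DESC) AS row_num,    -- 1, 2, 3, 4
    RANK()       OVER (ORDER BY revenue DESC) AS rank_val,   -- 1, 2, 2, 4
    DENSE_RANK() OVER (ORDER BY revenue DESC) AS dense_val   -- 1, 2, 2, 3
FROM sales;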

Question 12: Multi-CTE pipeline with materialization awareness

Schema:

sql
CREATE TABLE user_events (
    user_id     INTEGER,
    event_type  VARCHAR(50),
    event_date  DATE
);
CREATE TABLE user_metadata (
    user_id     INTEGER,
    signup_date DATE,
    plan_type   VARCHAR(20)   -- 'free', 'pro', 'enterprise'
);

Problem: For each plan type, find users who were active (had at least one event) every week for 4 consecutive weeks in January 2026. Report the count per plan type and the retention rate (active users / total users on that plan).

Solution:

sql
WITH jan_weeks AS (
    -- Define the 4 weeks of January 2026
    SELECT 1 AS wk, DATE '2026-01-05' AS week_start, DATE '2026-01-11' AS week_end
    UNION ALL SELECT 2, '2026-01-12', '2026-01-18'
    UNION ALL SELECT 3, '2026-01-19', '2026-01-25'
    UNION ALL SELECT 4, '2026-01-26', '2026-02-01'
),
weekly_activity AS (
    -- Determine which week each event falls into
    SELECT DISTINCT
        ue.user_id,
        jw.wk
    FROM user_events ue
    JOIN jan_weeks jw
      ON ue.event_date BETWEEN jw.week_start AND jw.week_end
    WHERE ue.event_date >= '2026-01-05'
      AND ue.event_date <= '2026-02-01'
),
four_week_users AS (
    -- Keep only users active all 4 weeks
    SELECT user_id
    FROM weekly_activity
    GROUP BY user_id
    HAVING COUNT(DISTINCT wk) = 4
),
plan_totals AS (
    -- Count total users per plan
    SELECT plan_type, COUNT(*) AS total_users
    FROM user_metadata
    GROUP BY plan_type
)
SELECT
    um.plan_type,
    COUNT(DISTINCT fwu.user_id)     AS retained_users,
    pt.total_users,
    ROUND(
        100.0 * COUNT(DISTINCT fwu.user_id) / pt.total_users,
        1
    )                               AS retention_pct
FROM four_week_users fwu
JOIN user_metadata um   ON fwu.user_id = um.user_id
JOIN plan_totals   pt   ON um.plan_type = pt.plan_type
GROUP BY um.plan_type, pt.total_users
ORDER BY retention_pct DESC;

What this question tests: Multi-CTE pipeline design, the HAVING clause for filtering aggregates, DISTINCT inside COUNT, and the ability to build a calendar spine inline. Each CTE has a single clear responsibility. The final SELECT is a clean aggregation of already-prepared building blocks.

Conclusion

CTEs and subqueries solve the same class of problem. The choice between them is rarely about correctness and almost always about communication: who reads the code after you write it, how easy is it to verify each step, and how clearly does the structure reveal the logic.

The pattern that appears most often in senior interviews is the multi-CTE pipeline: four or five named intermediate results, each doing one thing, building toward a final aggregation. Candidates who can write that kind of query fluently, explain why each CTE exists, and name the performance implications when relevant consistently move forward.

Two rules carry most of the practical weight: never filter on a window function in the same query that defines it, and never use NOT IN when NULLs might exist in the subquery result. Those two catches fail more interviews than any syntax error.

For further reading on SQL patterns that appear alongside CTEs in technical screens, see the SQL Window Functions interview guide and the Advanced SQL JOIN patterns on letsdatascience.com.

Career Q&A

How deeply does a typical data scientist interview test SQL CTEs vs. data engineer interviews?

Data scientist interviews at most mid-size and large companies expect fluency with CTEs, window functions, and multi-step pipelines. The emphasis is on answering analytical questions (funnels, cohorts, retention) cleanly and quickly. Data engineer interviews go further into query optimization, execution plans, partitioning, and materialization behavior. For a data science role, you need to write correct, readable SQL; for a data engineering role, you also need to explain why your SQL is efficient.

If I am asked to solve a problem and I default to CTEs for everything, will that hurt me?

No, as long as you can explain the choice. Defaulting to CTEs signals that you prioritize readable, maintainable code, which is a positive signal. Where it can hurt is if your CTE is clearly overkill (a single-value scalar computation wrapped in a named CTE) and you cannot articulate why you structured it that way. The test is not which tool you reach for; it is whether you can justify the choice.

Should I verbalize my thought process while writing SQL in a live interview?

Yes, always. SQL interviews are as much about communication as correctness. Before writing anything, state the problem back in your own words, identify the tables and joins you will need, and sketch the steps. This gives the interviewer a chance to redirect you before you build the wrong thing, and it demonstrates structured thinking, which is exactly what the job requires.

How do I handle it when I know both a CTE version and a JOIN version that produce the same result?

Write the CTE version first because it is easier to read and verify. Then mention the JOIN alternative: "I could also write this as a single query with a JOIN. Would you like to see that version, or are you satisfied with this approach?" This shows you know multiple paths to the answer and are not rigidly attached to one tool.

What should I do if I blank on the recursive CTE syntax mid-interview?

Acknowledge the concept clearly: "I know recursive CTEs use an anchor member and a recursive member joined by UNION ALL, and the recursion terminates when the recursive step returns zero rows." Then ask if you can refer to syntax, or sketch the structure in pseudocode first. Demonstrating that you understand the mechanism (what recursive CTEs do and when to use them) matters more than memorizing the exact keywords.

How important is it to mention materialization in a SQL interview?

It depends on the role. For a data analyst role, mentioning that CTEs can be materialized and that PostgreSQL 12 changed the default behavior is impressive but not expected. For a senior data engineer or analytics engineer role, the interviewer may specifically ask about it. The safe move is to bring it up if your query reuses a CTE multiple times: "I'm aware this CTE will be materialized since it's referenced twice, and that's actually what we want here, since the aggregation is expensive and we don't want it computed twice."

What is the single most common SQL mistake in interviews?

Using NOT IN with a subquery that might return NULLs. If any row in the subquery result is NULL, NOT IN returns no rows at all. The entire WHERE clause evaluates to UNKNOWN for every outer row. Interviewers plant this trap deliberately. Always default to NOT EXISTS for anti-join patterns, or filter NULLs explicitly: WHERE col NOT IN (SELECT val FROM t WHERE val IS NOT NULL).

Explore all career paths