Window functions are the single most frequently tested SQL topic in data science and analytics interviews at large technology companies. According to research aggregated across DataLemur, StrataScratch, and community reports on Blind and r/datascience, roughly 70% of technical SQL rounds at major tech firms include at least one window function question, yet surveys consistently show they are the concept candidates feel least prepared for. This guide covers every major pattern with worked problems, real schemas, and the exact reasoning interviewers are looking for.
Note: Questions in this guide are sourced from our research of publicly available interview prep platforms and community discussions. These types of questions have been reported by candidates across multiple publicly available sources including DataLemur, StrataScratch, LeetCode, and community forums. No specific companies are identified as the source of any question.
Window Function Fundamentals
Before any worked question, it helps to have a one-sentence definition that holds up under follow-up: a window function computes a value for each row by looking at a related set of rows (the "window") without collapsing the result set into fewer rows the way GROUP BY does.
That last clause is the key distinction. GROUP BY forces a one-row-per-group collapse. Window functions let you keep every original row while adding a computed column that reflects the group-level or ordered context. This is why they appear in nearly every real analytics problem: you almost always need both the individual row and the aggregate in the same result.
The general syntax is:
function_name() OVER (
PARTITION BY column1, column2 -- optional: defines the "groups"
ORDER BY column3 -- required for ordered functions
ROWS BETWEEN n PRECEDING AND m FOLLOWING -- optional: frame bounds
)
Three concepts control every window function:
- PARTITION BY resets the function's calculation for each group, similar to GROUP BY but without collapsing rows.
- ORDER BY inside OVER() determines the sequence in which rows are processed, enabling running totals and LAG/LEAD access.
- The frame clause (ROWS BETWEEN or RANGE BETWEEN) defines exactly which rows within the ordered partition are included in the calculation.
An important execution-order fact interviewers love to ask: window functions are evaluated after WHERE, GROUP BY, and HAVING, but before DISTINCT and the final ORDER BY. This means you cannot use a window function alias in a WHERE clause; you must wrap it in a subquery or CTE.
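This rule can be checked directly. The sketch below uses Python's bundled sqlite3 (SQLite supports window functions since 3.25) with a made-up employees table; the article's solutions target PostgreSQL/MySQL, but the evaluation-order behavior is the same:

```python
import sqlite3

# Minimal sketch with an illustrative in-memory table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (name TEXT, dept TEXT, salary INT)")
conn.executemany(
    "INSERT INTO emp VALUES (?, ?, ?)",
    [("Alice", "Eng", 145000), ("Bob", "Eng", 120000), ("Carol", "Mkt", 98000)],
)

# A window function directly in WHERE is rejected by the engine:
rejected = False
try:
    conn.execute("SELECT name FROM emp WHERE RANK() OVER (ORDER BY salary DESC) = 1")
except sqlite3.OperationalError:
    rejected = True  # "misuse of window function"-style error

# The standard fix: compute the rank in a CTE, then filter one level up.
rows = conn.execute(
    """
    WITH ranked AS (
        SELECT name, RANK() OVER (ORDER BY salary DESC) AS rnk FROM emp
    )
    SELECT name FROM ranked WHERE rnk = 1
    """
).fetchall()
print(rejected, rows)  # True [('Alice',)]
```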
Ranking Functions: ROW_NUMBER, RANK, and DENSE_RANK
Ranking questions are the most common window function pattern across every prep platform. The three functions differ only in how they handle ties:
| Function | Tie Behavior | Gap After Tie |
|---|---|---|
| ROW_NUMBER() | Assigns a unique number, ties broken arbitrarily | No gap |
| RANK() | Tied rows share the same rank | Gap after tie group |
| DENSE_RANK() | Tied rows share the same rank | No gap |
The classic illustration: salaries of 100, 100, 90. ROW_NUMBER gives 1, 2, 3. RANK gives 1, 1, 3. DENSE_RANK gives 1, 1, 2.
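The 100/100/90 example is small enough to run and verify. Here it is against SQLite's window functions via Python's sqlite3 (an assumption for convenience; the ranking semantics are standard SQL):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pay (salary INT)")
conn.executemany("INSERT INTO pay VALUES (?)", [(100,), (100,), (90,)])

rows = conn.execute(
    """
    SELECT salary,
           ROW_NUMBER() OVER w AS row_num,   -- unique, ties broken arbitrarily
           RANK()       OVER w AS rnk,       -- ties share a rank, then a gap
           DENSE_RANK() OVER w AS dense_rnk  -- ties share a rank, no gap
    FROM pay
    WINDOW w AS (ORDER BY salary DESC)
    ORDER BY row_num
    """
).fetchall()
print(rows)  # [(100, 1, 1, 1), (100, 2, 1, 1), (90, 3, 3, 2)]
```

The named WINDOW clause lets all three functions share one specification, which is also a nice touch in an interview answer.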
Question 1: Top Earner Per Department
This exact problem appears on DataLemur and has been reported by candidates across multiple sources. The variant asking for the single highest earner per department is common; the more frequent interview version asks for the top N.
Schema:
employees (
employee_id INT,
employee_name VARCHAR,
department VARCHAR,
salary INT
)
Problem: Return the name, department, and salary of every employee who earns the highest salary in their department. Multiple employees can share the top spot if salaries are equal.
Solution:
WITH ranked AS (
SELECT
employee_name,
department,
salary,
DENSE_RANK() OVER (
PARTITION BY department
ORDER BY salary DESC
) AS dept_rank
FROM employees
)
SELECT employee_name, department, salary
FROM ranked
WHERE dept_rank = 1;
Expected output (sample data):
| employee_name | department | salary |
|---|---|---|
| Alice | Engineering | 145000 |
| Bob | Marketing | 98000 |
| Carol | Marketing | 98000 |
Key insight: DENSE_RANK is the correct choice here rather than RANK or ROW_NUMBER. If two employees in Marketing both earn 98,000, RANK would assign them both rank 1, which works for filtering, but the next rank would jump to 3. DENSE_RANK keeps the sequence contiguous, which matters when the question asks for the "top 3" rather than the "top 1." ROW_NUMBER would arbitrarily break the tie and exclude one of the co-leaders.
What the interviewer is testing: Whether you recognize that WHERE dept_rank = 1 on a subquery/CTE is the standard idiom for "top per group," and whether you choose DENSE_RANK over ROW_NUMBER when ties are possible.
Question 2: Nth Highest Salary (LeetCode Classic)
This is LeetCode problem 177 and appears in nearly every SQL interview guide. It specifically tests your ability to parameterize a ranking.
Schema:
employee (
id INT PRIMARY KEY,
salary INT
)
Problem: Find the Nth highest distinct salary. Return NULL if there are fewer than N distinct salaries.
Solution (MySQL 8+ function wrapper):
CREATE FUNCTION getNthHighestSalary(N INT) RETURNS INT
BEGIN
RETURN (
SELECT DISTINCT salary
FROM (
SELECT
salary,
DENSE_RANK() OVER (ORDER BY salary DESC) AS rnk
FROM employee
) ranked
WHERE rnk = N
LIMIT 1
);
END
PostgreSQL equivalent (SQL function):
CREATE OR REPLACE FUNCTION get_nth_highest_salary(n INT) RETURNS INT AS $$
SELECT DISTINCT salary
FROM (
SELECT salary, DENSE_RANK() OVER (ORDER BY salary DESC) AS rnk
FROM employee
) ranked
WHERE rnk = n
LIMIT 1;
$$ LANGUAGE sql;
For a plain query where N = 2 (returns NULL if fewer than 2 distinct salaries):
SELECT (
SELECT DISTINCT salary
FROM (
SELECT
salary,
DENSE_RANK() OVER (ORDER BY salary DESC) AS rnk
FROM employee
) ranked
WHERE rnk = 2
LIMIT 1
) AS SecondHighestSalary;
Key insight: DENSE_RANK on distinct salaries is cleaner than OFFSET N-1 with LIMIT 1 because it correctly handles the case where multiple employees share a salary. The DISTINCT in the outer query ensures you return one salary value, not one row per employee at that salary level.
What the interviewer is testing: Handling the NULL case gracefully, understanding DENSE_RANK vs RANK, and being able to explain why a subquery or CTE is necessary (window function results cannot be filtered directly in WHERE).
Question 3: Top Two Highest-Grossing Products Per Category
This is a DataLemur intermediate-level problem, reported as appearing at a major e-commerce platform.
Schema:
product_spend (
category VARCHAR,
product VARCHAR,
user_id INT,
spend DECIMAL,
transaction_date TIMESTAMP
)
Problem: Find the top two highest-grossing products within each category for the year 2022. Output the category, product name, and total spend, ordered by category then by spend descending.
Solution:
WITH ranked_products AS (
SELECT
category,
product,
SUM(spend) AS total_spend,
RANK() OVER (
PARTITION BY category
ORDER BY SUM(spend) DESC
) AS spending_rank
FROM product_spend
WHERE EXTRACT(YEAR FROM transaction_date) = 2022
GROUP BY category, product
)
SELECT category, product, total_spend
FROM ranked_products
WHERE spending_rank <= 2
ORDER BY category, total_spend DESC;
Key insight: The window function is applied after GROUP BY because you first need to aggregate total spend per product per category, then rank within each category. This is a common two-step pattern: aggregate first, rank second. RANK is acceptable here; DENSE_RANK also works because ties at rank 2 would still pass the <= 2 filter.
What the interviewer is testing: Combining GROUP BY with a window function in a CTE, which requires understanding that window functions operate on the grouped result set, not the raw rows.
LAG and LEAD Functions
LAG(column, offset, default) returns the value of column from offset rows before the current row. LEAD does the same looking forward. Both default the offset to 1 and the fallback value to NULL if not specified.
These are the go-to functions for any "compare this period to the previous period" problem: month-over-month growth, day-over-day change, year-over-year retention.
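The defaults are worth seeing concretely. This sketch (SQLite via Python's sqlite3, with made-up revenue rows) contrasts the one-argument form, which falls back to NULL, with the three-argument form that supplies an explicit default:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rev (month TEXT, revenue INT)")
conn.executemany(
    "INSERT INTO rev VALUES (?, ?)",
    [("2022-01", 120), ("2022-02", 135), ("2022-03", 128)],
)

rows = conn.execute(
    """
    SELECT month,
           LAG(revenue)       OVER (ORDER BY month) AS prev_null,  -- defaults: offset 1, NULL fallback
           LAG(revenue, 1, 0) OVER (ORDER BY month) AS prev_zero,  -- explicit default of 0
           LEAD(revenue)      OVER (ORDER BY month) AS next_val
    FROM rev
    ORDER BY month
    """
).fetchall()
print(rows)
# [('2022-01', None, 0, 135), ('2022-02', 120, 120, 128), ('2022-03', 135, 135, None)]
```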
Question 4: Month-Over-Month Revenue Growth
Reported across StrataScratch and community discussions as a standard analytics interview question at data-heavy technology companies.
Schema:
monthly_revenue (
month DATE, -- first day of each month
revenue DECIMAL
)
Problem: For each month, calculate the revenue and the percentage change from the previous month. Return month, current revenue, prior month revenue, and the percentage change rounded to two decimal places. Months with no prior period should show NULL for the change.
Solution:
WITH monthly_lag AS (
SELECT
month,
revenue AS current_revenue,
LAG(revenue, 1) OVER (ORDER BY month) AS prev_revenue
FROM monthly_revenue
)
SELECT
month,
current_revenue,
prev_revenue,
ROUND(
(current_revenue - prev_revenue) / prev_revenue * 100,
2
) AS pct_change
FROM monthly_lag;
Expected output (sample data):
| month | current_revenue | prev_revenue | pct_change |
|---|---|---|---|
| 2022-01-01 | 120000 | NULL | NULL |
| 2022-02-01 | 135000 | 120000 | 12.50 |
| 2022-03-01 | 128000 | 135000 | -5.19 |
Key insight: The CTE step separates "getting the prior value" from "computing the percentage," which keeps the query readable and avoids nesting a window function inside arithmetic. Division by prev_revenue will raise a divide-by-zero error (PostgreSQL) or return NULL (MySQL) if any month has zero revenue; wrapping the divisor in NULLIF(prev_revenue, 0) is a good edge case to raise proactively.
What the interviewer is testing: Basic LAG syntax, CTE organization, handling NULL in the first row, and whether you mention edge cases without being prompted.
Question 5: Detecting Duplicate Payments Within a Time Window
This is a DataLemur advanced-level problem, attributed to a major payments and financial infrastructure company.
Schema:
transactions (
transaction_id INT,
merchant_id INT,
credit_card_id INT,
amount INT,
transaction_timestamp TIMESTAMP
)
Problem: A payment is considered a duplicate if the same merchant receives two charges to the same credit card for the same amount within 10 minutes. Count how many such duplicate payments exist.
Solution:
WITH payment_gaps AS (
SELECT
merchant_id,
EXTRACT(EPOCH FROM
transaction_timestamp
- LAG(transaction_timestamp) OVER (
PARTITION BY merchant_id, credit_card_id, amount
ORDER BY transaction_timestamp
)
) / 60 AS minutes_since_last
FROM transactions
)
SELECT COUNT(*) AS duplicate_count
FROM payment_gaps
WHERE minutes_since_last <= 10;
Key insight: The PARTITION BY clause is doing the heavy lifting here: it partitions by three columns simultaneously (merchant_id, credit_card_id, and amount), so LAG only looks back at the previous transaction that shares all three attributes. When that gap is 10 minutes or fewer, the current transaction is a duplicate of the previous one.
What the interviewer is testing: Multi-column PARTITION BY, timestamp arithmetic, and the ability to translate a business rule ("same merchant, card, amount, within 10 minutes") directly into partition logic.
Question 6: Year-Over-Year Host Growth Rate
This is a StrataScratch hard-level problem modeled on growth analysis at a major home-sharing platform.
Schema:
host_registrations (
host_id INT,
host_since DATE
)
Problem: Calculate the annual growth rate in registered hosts. Show the year, current year count, prior year count, and growth rate percentage rounded to the nearest whole number. Sort ascending by year.
Solution:
WITH yearly_counts AS (
SELECT
EXTRACT(YEAR FROM host_since)::INT AS year,
COUNT(host_id) AS current_year_hosts
FROM host_registrations
WHERE host_since IS NOT NULL
GROUP BY EXTRACT(YEAR FROM host_since)
),
with_lag AS (
SELECT
year,
current_year_hosts,
LAG(current_year_hosts, 1) OVER (ORDER BY year) AS prev_year_hosts
FROM yearly_counts
)
SELECT
year,
current_year_hosts,
prev_year_hosts,
ROUND(
(current_year_hosts - prev_year_hosts)::FLOAT
/ prev_year_hosts * 100
) AS growth_rate_pct
FROM with_lag
ORDER BY year;
Key insight: Because window functions are evaluated after GROUP BY, you can in fact apply LAG over an aggregate in a single query step (Questions 3 and 9 do exactly that). The layered pattern shown here, aggregate in one CTE, apply the window function in the next, compute derived metrics in the outer query, is still the idiom to reach for: it keeps each step readable and becomes mandatory the moment you filter on the window function's result. This three-layer structure appears in a large share of real interview problems.
What the interviewer is testing: Multi-CTE construction, explicit type casting for float division, and whether you handle NULL for the first year's growth rate cleanly.
Running Totals and Moving Averages
Running totals and moving averages test two things: understanding the ORDER BY behavior inside OVER() and mastering the frame specification (ROWS BETWEEN).
A critical behavioral detail: when you write SUM(amount) OVER (ORDER BY date) without a frame clause, the default frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. This produces a running total. But if multiple rows share the same date, all those rows receive the same cumulative sum (the total through that date). Use ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW to force row-by-row accumulation regardless of ties.
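The peer-row behavior is easy to demonstrate. This sketch (SQLite via Python's sqlite3, illustrative data) compares the default RANGE frame against an explicit ROWS frame when two rows share a date; note the ROWS version adds a tiebreaker column to its ORDER BY so the row-by-row accumulation is deterministic:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE txn (d TEXT, amount INT)")
conn.executemany(
    "INSERT INTO txn VALUES (?, ?)",
    [("2022-01-01", 10), ("2022-01-02", 20), ("2022-01-02", 30), ("2022-01-03", 40)],
)

rows = conn.execute(
    """
    SELECT d, amount,
           SUM(amount) OVER (ORDER BY d) AS range_total,  -- default frame: tied dates share one total
           SUM(amount) OVER (
               ORDER BY d, amount   -- ROWS needs a deterministic tiebreaker
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
           ) AS rows_total
    FROM txn
    ORDER BY d, amount
    """
).fetchall()
print(rows)
# [('2022-01-01', 10, 10, 10), ('2022-01-02', 20, 60, 30),
#  ('2022-01-02', 30, 60, 60), ('2022-01-03', 40, 100, 100)]
```

Both rows dated 2022-01-02 get range_total 60 (the total through that date), while rows_total climbs 30 then 60.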
Question 7: Cumulative Revenue by Customer
A standard running total pattern, reported at multiple e-commerce and marketplace companies.
Schema:
orders (
order_id INT,
customer_id INT,
order_date DATE,
amount DECIMAL
)
Problem: For each order, show the customer ID, order date, order amount, and the customer's running total revenue up through and including that order. Sort by customer ID and order date.
Solution:
SELECT
customer_id,
order_date,
amount,
SUM(amount) OVER (
PARTITION BY customer_id
ORDER BY order_date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS running_total
FROM orders
ORDER BY customer_id, order_date;
Expected output (sample data):
| customer_id | order_date | amount | running_total |
|---|---|---|---|
| 101 | 2022-01-05 | 250 | 250 |
| 101 | 2022-02-12 | 180 | 430 |
| 101 | 2022-03-20 | 310 | 740 |
| 102 | 2022-01-15 | 500 | 500 |
| 102 | 2022-03-08 | 200 | 700 |
Key insight: PARTITION BY customer_id resets the running total for each customer. Without it, you get a global running total across all customers. The ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW frame clause is technically the default when ORDER BY is present, but writing it explicitly signals to the interviewer that you understand frame semantics.
What the interviewer is testing: Whether you understand that PARTITION BY in running totals resets per-group accumulation, and whether you can explain what happens to the output if the frame clause is omitted versus included.
Question 8: 7-Day Rolling Average of Daily Active Users
A moving average question, reported commonly at social media and consumer internet companies. This pattern is standard in product analytics roles.
Schema:
daily_activity (
activity_date DATE,
active_users INT
)
Problem: For each date, calculate the 7-day rolling average of daily active users (including the current day and the 6 preceding days). Round to two decimal places. Return dates in ascending order.
Solution:
SELECT
activity_date,
active_users,
ROUND(
AVG(active_users) OVER (
ORDER BY activity_date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
),
2
) AS rolling_7day_avg
FROM daily_activity
ORDER BY activity_date;
Expected output (sample data):
| activity_date | active_users | rolling_7day_avg |
|---|---|---|
| 2022-01-01 | 1200 | 1200.00 |
| 2022-01-02 | 1350 | 1275.00 |
| 2022-01-03 | 980 | 1176.67 |
| 2022-01-07 | 1100 | 1121.43 |
| 2022-01-08 | 1420 | 1178.57 |
Key insight: ROWS BETWEEN 6 PRECEDING AND CURRENT ROW includes exactly 7 rows: the current row plus the 6 before it. During the first 6 days of the dataset, the window holds fewer than 7 rows, so the average is computed over however many exist; this is called the "warm-up period." If the problem requires strict 7-day windows only, wrap the query in a CTE and filter out the first 6 dates (or rows where a COUNT over the same frame is less than 7).
The difference between ROWS and RANGE matters here: ROWS counts physical rows, while RANGE defines the frame by how far a row's ORDER BY value is from the current row's. With exactly one row per day and no missing dates, the two behave identically. If the series has gaps, ROWS BETWEEN 6 PRECEDING AND CURRENT ROW averages the last 7 rows, which may span more than 7 calendar days; pointing out that gap behavior, and a fix such as joining to a calendar table first, is a strong signal in an interview.
What the interviewer is testing: Frame specification syntax, the ROWS vs RANGE distinction, and awareness of the warm-up period issue.
Question 9: Best-Selling Product Per Month
A StrataScratch intermediate problem combining aggregation with ranking, reported at a major online retailer.
Schema:
online_retail (
invoice_no VARCHAR,
stock_code VARCHAR,
description VARCHAR,
quantity INT,
invoice_date TIMESTAMP,
unit_price DECIMAL,
customer_id INT,
country VARCHAR
)
Problem: For each calendar month, identify the single best-selling product by total revenue (quantity × unit price). Return the month number, product description, and total revenue.
Solution:
SELECT month, description, total_revenue
FROM (
SELECT
DATE_PART('month', invoice_date) AS month,
description,
SUM(quantity * unit_price) AS total_revenue,
RANK() OVER (
PARTITION BY DATE_PART('month', invoice_date)
ORDER BY SUM(quantity * unit_price) DESC
) AS revenue_rank
FROM online_retail
GROUP BY month, description
) monthly_ranked
WHERE revenue_rank = 1
ORDER BY month;
Key insight: The window function operates on the grouped rows: GROUP BY month, description runs first, producing one row per product per month. Then RANK() ranks those grouped rows within each month partition. This is a crucial distinction: RANK() here is ranking aggregated totals, not individual transactions.
What the interviewer is testing: Correct ordering of GROUP BY vs window function evaluation, and the subquery pattern for filtering on a window function result.
NTILE and Percentile Functions
NTILE(n) divides the ordered rows within a partition into n roughly equal buckets and assigns each row a bucket number from 1 to n. This is the standard approach for quartile analysis, decile segmentation, and identifying top or bottom percentiles.
PERCENT_RANK() returns a value between 0 and 1 representing the row's relative position. CUME_DIST() returns the cumulative distribution. Both appear less frequently in interviews than NTILE but come up in senior data analyst and data science rounds.
Question 10: Customer Spending Quartile Analysis
This pattern appears in multiple prep platforms and is common at consumer-facing companies doing customer segmentation work.
Schema:
customer_orders (
customer_id INT,
total_spending DECIMAL
)
Problem: Segment all customers into four quartiles based on their total spending. Label each customer with their quartile number (1 = highest spending, 4 = lowest). Return customer ID, total spending, and quartile. Order by spending descending.
Solution:
SELECT
customer_id,
total_spending,
NTILE(4) OVER (ORDER BY total_spending DESC) AS spending_quartile
FROM customer_orders
ORDER BY total_spending DESC;
Expected output (sample data):
| customer_id | total_spending | spending_quartile |
|---|---|---|
| 1041 | 9200 | 1 |
| 1089 | 8750 | 1 |
| 1003 | 4100 | 2 |
| 1052 | 850 | 4 |
Key insight: NTILE(4) with ORDER BY total_spending DESC assigns quartile 1 to the highest spenders. If you reverse the order to ASC, quartile 1 goes to the lowest spenders. Always clarify with the interviewer which convention they expect. This is a good habit that demonstrates practical experience.
When the number of rows is not evenly divisible by 4, NTILE distributes the remainder rows to the earlier buckets. So with 13 customers, quartile 1 gets 4 customers, and quartiles 2, 3, and 4 get 3 each. The general rule: with N rows and NTILE(k), the first (N mod k) buckets get ceil(N/k) rows, and the remaining buckets get floor(N/k).
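The remainder rule can be verified in a few lines. This sketch (SQLite via Python's sqlite3) puts 13 made-up customer rows through NTILE(4) and counts the bucket sizes:

```python
import sqlite3
from collections import Counter

# Expect bucket sizes 4, 3, 3, 3: earlier buckets absorb the remainder.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cust (id INT)")
conn.executemany("INSERT INTO cust VALUES (?)", [(i,) for i in range(1, 14)])  # 13 rows

buckets = [
    b for (b,) in conn.execute("SELECT NTILE(4) OVER (ORDER BY id) FROM cust")
]
print(sorted(Counter(buckets).items()))  # [(1, 4), (2, 3), (3, 3), (4, 3)]
```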
What the interviewer is testing: Basic NTILE syntax, ordering direction, and whether you understand the behavior with unequal bucket sizes.
Question 11: Identifying Top 5% of Fraud Claims by State
A StrataScratch problem combining NTILE with PARTITION BY for per-group percentile filtering.
Schema:
fraud_claims (
claim_id INT,
state VARCHAR,
fraud_score DECIMAL
)
Problem: For each state, identify the claims that fall in the top 5% of fraud scores within that state. Return claim ID, state, and fraud score.
Solution:
WITH percentile_ranked AS (
SELECT
claim_id,
state,
fraud_score,
NTILE(100) OVER (
PARTITION BY state
ORDER BY fraud_score DESC
) AS percentile_bucket
FROM fraud_claims
)
SELECT claim_id, state, fraud_score
FROM percentile_ranked
WHERE percentile_bucket <= 5
ORDER BY state, fraud_score DESC;
Key insight: NTILE(100) creates 100 buckets, so WHERE percentile_bucket <= 5 captures the top 5%. The PARTITION BY state ensures percentile ranking resets for each state, so a California claim at fraud score 0.7 is evaluated against other California claims, not nationally. Without PARTITION BY, you would get a global top 5%, which would be dominated by high-fraud states.
What the interviewer is testing: The relationship between NTILE(100) and percentiles, the purpose of PARTITION BY in preventing cross-group contamination, and whether you can explain the edge case where a state has very few claims (small N per partition degrades NTILE accuracy).
Complex PARTITION BY Patterns
The most challenging window function questions in senior interviews combine multiple PARTITION BY dimensions, use window functions inside CTEs that feed other CTEs, or require you to recognize that a window function is the right tool when a self-join or correlated subquery would also work.
Question 12: Comparing Individual Ratios to Group Averages
This pattern is from StrataScratch, modeled on a ride-sharing analytics use case.
Schema:
ride_requests (
request_id INT,
request_date DATE,
distance_km DECIMAL,
fare_amount DECIMAL
)
Problem: For each ride request, return the request ID, request date, individual fare-per-km ratio, and the monthly average fare-per-km ratio. Show whether the individual ride is above or below the monthly average. Sort by request date.
Solution:
WITH ride_metrics AS (
SELECT
request_id,
request_date,
fare_amount / NULLIF(distance_km, 0) AS fare_per_km,
AVG(fare_amount / NULLIF(distance_km, 0)) OVER (
PARTITION BY DATE_TRUNC('month', request_date)
) AS monthly_avg_fare_per_km
FROM ride_requests
)
SELECT
request_id,
request_date,
ROUND(fare_per_km, 3) AS fare_per_km,
ROUND(monthly_avg_fare_per_km, 3) AS monthly_avg,
CASE
WHEN fare_per_km > monthly_avg_fare_per_km THEN 'above average'
WHEN fare_per_km < monthly_avg_fare_per_km THEN 'below average'
ELSE 'at average'
END AS performance
FROM ride_metrics
ORDER BY request_date;
Key insight: The PARTITION BY DATE_TRUNC('month', request_date) groups all rides in the same calendar month into a single window, so the AVG returns the monthly average across all rides in that month, including the current row itself. This is a fundamental property of window functions: unlike LAG or running totals, aggregate window functions like AVG and SUM default to including all rows in the partition when no ORDER BY is present. The NULLIF(distance_km, 0) guard prevents division by zero.
What the interviewer is testing: Using DATE_TRUNC for temporal partitioning, NULLIF for safe division, CASE for conditional labeling, and the ability to explain the inclusion behavior of aggregate window functions without ordering.
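The inclusion behavior described above can be reduced to a tiny sketch (SQLite via Python's sqlite3, with a hypothetical month column standing in for DATE_TRUNC output): with PARTITION BY and no ORDER BY, the aggregate window covers every row in the partition, current row included.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rides (month TEXT, fare REAL)")
conn.executemany(
    "INSERT INTO rides VALUES (?, ?)",
    [("2022-01", 2.0), ("2022-01", 4.0), ("2022-02", 3.0)],
)

rows = conn.execute(
    """
    SELECT month, fare,
           AVG(fare) OVER (PARTITION BY month) AS month_avg  -- whole-partition average on every row
    FROM rides
    ORDER BY month, fare
    """
).fetchall()
print(rows)  # [('2022-01', 2.0, 3.0), ('2022-01', 4.0, 3.0), ('2022-02', 3.0, 3.0)]
```

Every January row carries the same month_avg of 3.0, computed over both January fares.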
Question 13: Ranking Top Salaries by Department (Filtered to Top 3)
A StrataScratch intermediate problem, commonly reported at technology and financial services companies. This tests whether candidates recognize the top-N-per-group idiom.
Schema:
employee_salaries (
employee_id INT,
department VARCHAR,
salary INT
)
Problem: Return all employees who earn one of the top 3 distinct salary amounts within their department. Multiple employees can share a salary. Order by department, then salary descending.
Solution:
WITH dept_ranked AS (
SELECT
employee_id,
department,
salary,
DENSE_RANK() OVER (
PARTITION BY department
ORDER BY salary DESC
) AS salary_rank
FROM employee_salaries
)
SELECT employee_id, department, salary
FROM dept_ranked
WHERE salary_rank <= 3
ORDER BY department, salary DESC;
Expected output (sample data):
| employee_id | department | salary |
|---|---|---|
| 201 | Engineering | 155000 |
| 205 | Engineering | 145000 |
| 209 | Engineering | 145000 |
| 212 | Engineering | 138000 |
| 301 | Marketing | 105000 |
Notice that Engineering has 4 rows in the output: two employees share the second-highest salary (145,000), so DENSE_RANK assigns them both rank 2, and the employee at rank 3 (138,000) is also included. RANK would NOT produce the same result; it would assign rank 4 to the 138,000 salary (skipping rank 3 due to the tie at rank 2), excluding that employee from the <= 3 filter. ROW_NUMBER would incorrectly exclude one of the tied employees at rank 2.
What the interviewer is testing: Precisely why DENSE_RANK is required over RANK and ROW_NUMBER for "top N distinct values per group" problems, and whether the candidate can trace through the tie-handling logic on sample data.
Question 14: Second Ride Average Delay (Multi-Step Window Analysis)
This is a DataLemur intermediate problem, reported at a major ride-sharing company, combining ROW_NUMBER and LAG in a two-CTE pattern.
Schema:
users (
user_id INT,
registration_date DATE
)
rides (
ride_id INT,
user_id INT,
ride_date DATE
)
Problem: Among users who took their first ride on the same day they registered ("in-the-moment" users), calculate the average number of days between registration and their second ride.
Solution:
WITH moment_users AS (
SELECT DISTINCT u.user_id
FROM users u
INNER JOIN rides r
ON u.user_id = r.user_id
AND u.registration_date = r.ride_date
),
numbered_rides AS (
SELECT
r.user_id,
r.ride_date,
ROW_NUMBER() OVER (
PARTITION BY r.user_id
ORDER BY r.ride_date
) AS ride_number,
LAG(r.ride_date) OVER (
PARTITION BY r.user_id
ORDER BY r.ride_date
) AS prev_ride_date
FROM moment_users mu
INNER JOIN rides r ON mu.user_id = r.user_id
)
SELECT ROUND(AVG(ride_date - prev_ride_date), 2) AS avg_days_to_second_ride
FROM numbered_rides
WHERE ride_number = 2;
Key insight: Two window functions run over the same partition in a single CTE: ROW_NUMBER identifies which ride is the second one, and LAG captures the date of the first ride. Filtering to ride_number = 2 in the outer query then gives you only the second-ride rows, where prev_ride_date holds the first ride date. In PostgreSQL, DATE - DATE returns an integer (number of days), so the subtraction and AVG work directly. In MySQL, use DATEDIFF(ride_date, prev_ride_date) instead.
What the interviewer is testing: Multi-window-function CTEs, the pattern of filtering on a ROW_NUMBER result to isolate a specific event in a sequence, and the awareness that both functions operate on the same ordered partition in a single scan.
A Framework for Solving Window Function Questions Under Pressure
Candidates who struggle with window function questions in interviews typically freeze at the blank canvas. They know the syntax but do not have a systematic approach for translating a business question into the right function and partition structure.
A four-step framework that works reliably:
Step 1: Identify what "group" the question implies. Every window function question has an implicit grouping. "Top product per category": the group is category. "Month-over-month growth": the group is the entity whose growth you are measuring (often implicit: the whole company, or per customer). This group becomes your PARTITION BY.
Step 2: Identify whether ordering matters. If the question involves sequences, time, rankings, or "previous/next" comparisons, you need ORDER BY inside OVER. If you just need a group aggregate on every row (like department average salary), you may not need ORDER BY at all.
Step 3: Choose the right function family. Ranking question → ROW_NUMBER, RANK, or DENSE_RANK. Prior or next period value → LAG or LEAD. Running total or rolling average → SUM or AVG with a frame. Segmentation or percentiles → NTILE, PERCENT_RANK, or CUME_DIST.
Step 4: Decide if you need a CTE wrapper. You cannot filter on a window function result in the same query level. If you need WHERE rank <= 3, you need a CTE or subquery. Build this wrapper instinctively.
Interviewers at data-intensive companies are watching for exactly this kind of structured reasoning, not just whether you arrive at the right SQL.
Conclusion
Window functions are difficult to learn but relatively systematic once you internalize three things: the partition defines the group, the order defines the sequence, and the frame defines the rows. Most interview questions reduce to one of the five patterns covered here: ranking, LAG/LEAD, running totals, moving averages, and NTILE segmentation.
The detail that separates strong candidates is precision on tie behavior. Knowing that DENSE_RANK produces no gaps while RANK does, and that ROW_NUMBER is appropriate only when you want arbitrary tie-breaking, is the kind of nuance interviewers probe with follow-up questions. Practice explaining these differences out loud, not just writing the SQL.
The second separator is frame specification. Most candidates know SUM() OVER (ORDER BY date) produces a running total, but candidates who can explain the ROWS BETWEEN vs RANGE BETWEEN distinction, and the warm-up period behavior in rolling averages, demonstrate real production experience, not just prep familiarity.
For deeper preparation, work through the interconnected topic of CTEs vs subqueries, which covers the patterns that frequently wrap window functions in multi-step interview problems. The combination of CTEs and window functions appears in roughly 40% of hard-level SQL interview questions on DataLemur and StrataScratch.
Career Q&A
How much time should I spend on window functions vs other SQL topics before an interview?
At most data science and senior analyst roles, window functions represent a disproportionate share of the medium and hard SQL questions. If you have two weeks, allocate three to four days exclusively to window functions after you have solid foundations in JOINs and GROUP BY. Practice on DataLemur and StrataScratch rather than just reading; writing the queries from a blank editor is the only way to build fluency under interview conditions.
Do I need to memorize the exact frame clause syntax?
Yes, at a functional level. Interviewers will not penalize you for a minor syntax slip that you catch yourself, but you should be able to write ROWS BETWEEN 6 PRECEDING AND CURRENT ROW without looking it up. More important is understanding what it does. Interviewers frequently ask you to explain the output of a given frame, which requires conceptual understanding, not just memorization.
Should I explain my reasoning while writing the SQL or wait until I am done?
Explain as you go. Most interviewers prefer a running commentary: "I am partitioning by department because we want rankings to reset per department, not globally." This demonstrates analytical thinking and gives the interviewer opportunities to course-correct you early if you are heading the wrong direction, rather than watching silently while you write a complete wrong answer.
What if I forget whether to use RANK or DENSE_RANK in the moment?
Ask yourself: "If there is a tie at position 2, do I want to see a position 3 or a second position 2 in the output?" If the business question is "give me every employee in the top 3 salary tiers," you need the positions to be contiguous, so DENSE_RANK. If the question is "give me employees ranked above 80% of their peers," the gap-skipping behavior of RANK or PERCENT_RANK is more appropriate. When in doubt, name the ambiguity to the interviewer and ask which behavior they want.
How do companies actually use window functions in production analytics work?
The patterns you practice in interviews map directly to production work. LAG/LEAD are standard in product analytics for cohort retention and engagement curves. Running totals appear in finance reporting and revenue dashboards. NTILE and DENSE_RANK are used in customer segmentation, AB test analysis, and fraud scoring. The interview questions are not contrived. They test genuine patterns that analysts write every week.
Is it worth learning BigQuery or Snowflake dialect differences for window functions?
Most window function syntax is standard SQL (SQL:2003+), so the core patterns transfer across engines. The main dialect differences are in timestamp arithmetic (BigQuery uses DATE_DIFF, PostgreSQL uses subtraction), date truncation (DATE_TRUNC vs TRUNC), and a few frame clause subtleties. Learn the standard SQL syntax deeply first, then spend a couple of hours with the documentation of whatever dialect the target company uses.
What is the most common mistake candidates make on window function questions?
Trying to filter on a window function result in the same query level. Writing WHERE RANK() OVER (...) <= 3 directly in the WHERE clause is rejected by every major SQL engine, because window functions are evaluated after WHERE. The fix is always a CTE or subquery. This error appears in interviews far more than any pure syntax mistake, and it is entirely avoidable with practice.
Sources
- 12 SQL Window Functions Interview Questions (With Answers) (DataLemur, 2025)
- SQL Ranking Window Functions With Examples (DataLemur, 2025)
- SQL Window Functions Interview Questions (StrataScratch, 2025)
- Top Five SQL Window Functions for Data Science Interviews (StrataScratch, 2025)
- Types of Window Functions in SQL and Questions Asked by Airbnb, Netflix, Twitter, and Uber (StrataScratch, 2025)
- SQL Window Functions Interview Questions by Airbnb (StrataScratch, 2025)
- Nth Highest Salary (LeetCode, Problem 177)
- Department Top Three Salaries (LeetCode, Problem 185)
- Window Functions in SQL: A Complete Guide in 2025 (InterviewQuery, 2025)
- Top 10 SQL Window Functions Interview Questions (LearnSQL, 2025)
- Top SQL Patterns from FAANG Data Science Interviews (KDnuggets, 2025)
- SQL Window Functions & Technical Interview Questions (PracticeWindowFunctions, 2025)