According to a 2025 StrataScratch analysis, over 70% of data science SQL interviews at large technology companies involve JOINs, CTEs, or window functions applied to transactional data. If you are interviewing at a company with an e-commerce, marketplace, or logistics business model, the SQL questions you face will test whether you can think in orders, sessions, funnels, and cohorts. This guide walks through 15 real-world SQL problems drawn from the patterns that show up most often.
The questions in this article are drawn from our research of publicly available interview prep platforms and community discussions, including DataLemur, StrataScratch, LeetCode, and forums such as r/datascience and Blind. These represent patterns that candidates have reported across interviews at large technology companies with e-commerce and marketplace operations. We do not claim these are proprietary questions from any specific company.
Customer and Order Analytics
These questions test your ability to aggregate transactional data, handle ties in rankings, and compute metrics that drive business decisions around customer retention.
Question 1: Top N Customers by Total Revenue
A major online retailer asks you to find the top five customers by total spending in 2024. If two customers are tied, both should appear.
Sample Schema
CREATE TABLE orders (
order_id INT PRIMARY KEY,
customer_id INT,
order_date DATE,
total_amount DECIMAL(10,2)
);
| order_id | customer_id | order_date | total_amount |
|---|---|---|---|
| 1001 | 201 | 2024-03-15 | 249.99 |
| 1002 | 202 | 2024-04-01 | 89.50 |
| 1003 | 201 | 2024-06-22 | 310.00 |
| 1004 | 203 | 2024-07-10 | 559.99 |
| 1005 | 202 | 2024-09-05 | 470.50 |
| 1006 | 204 | 2024-11-28 | 559.99 |
The approach is to aggregate first, rank second, then filter. DENSE_RANK() keeps tied customers on the same rank with no gaps afterward, so every customer tied at the cutoff still appears; confirm with the interviewer whether that inclusive behavior is what they mean by "top 5."
WITH customer_spend AS (
SELECT
customer_id,
SUM(total_amount) AS total_revenue
FROM orders
WHERE order_date >= '2024-01-01'
AND order_date < '2025-01-01'
GROUP BY customer_id
),
ranked AS (
SELECT
customer_id,
total_revenue,
DENSE_RANK() OVER (ORDER BY total_revenue DESC) AS rnk
FROM customer_spend
)
SELECT customer_id, total_revenue, rnk
FROM ranked
WHERE rnk <= 5
ORDER BY rnk;
Expected Output
| customer_id | total_revenue | rnk |
|---|---|---|
| 202 | 560.00 | 1 |
| 201 | 559.99 | 2 |
| 203 | 559.99 | 2 |
| 204 | 559.99 | 2 |
Key Insight: DENSE_RANK() returns consecutive ranks (1, 2, 3) while RANK() skips ranks after ties (1, 1, 3). When the question says "top 5," clarify which behavior is expected before writing code.
What the Interviewer Is Testing: Window function selection, date filtering with half-open intervals, and whether you ask about tie-breaking behavior.
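The ranking difference is easy to verify yourself. The sketch below uses Python's built-in sqlite3 module (window functions require SQLite 3.25+) on a table mirroring the aggregated customer_spend CTE; it is an illustration of the tie behavior, not the article's PostgreSQL solution.

```python
# Compare RANK vs DENSE_RANK on tied totals using Python's built-in sqlite3.
# The data mirrors the customer_spend CTE, plus one extra low spender so the
# post-tie gap is visible.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customer_spend (customer_id INT, total_revenue REAL)")
con.executemany(
    "INSERT INTO customer_spend VALUES (?, ?)",
    [(202, 560.00), (201, 559.99), (203, 559.99), (204, 559.99), (205, 100.00)],
)

rows = con.execute("""
    SELECT customer_id,
           RANK()       OVER (ORDER BY total_revenue DESC) AS rnk,
           DENSE_RANK() OVER (ORDER BY total_revenue DESC) AS dense_rnk
    FROM customer_spend
    ORDER BY total_revenue DESC, customer_id
""").fetchall()

for customer_id, rnk, dense_rnk in rows:
    print(customer_id, rnk, dense_rnk)
# After the three-way tie at 559.99, RANK jumps to 5 while DENSE_RANK
# continues at 3.
```

Customer 205 would survive a `WHERE dense_rnk <= 3` filter but be dropped by `WHERE rnk <= 3`, which is exactly the ambiguity worth raising in the interview.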
Question 2: Repeat Purchase Rate
A large marketplace company wants to know the percentage of customers who made more than one purchase in the last 12 months.
Sample Schema
CREATE TABLE transactions (
transaction_id INT PRIMARY KEY,
customer_id INT,
purchase_date DATE,
amount DECIMAL(10,2)
);
The key is separating the count step from the ratio step: count distinct purchase dates per customer, classify each customer as a one-time or repeat buyer, then compute the ratio.
WITH customer_counts AS (
SELECT
customer_id,
COUNT(DISTINCT purchase_date) AS purchase_days
FROM transactions
WHERE purchase_date >= CURRENT_DATE - INTERVAL '12 months'
GROUP BY customer_id
)
SELECT
ROUND(
100.0 * COUNT(CASE WHEN purchase_days > 1 THEN 1 END)
/ COUNT(*),
2
) AS repeat_purchase_rate_pct
FROM customer_counts;
Expected Output
| repeat_purchase_rate_pct |
|---|
| 38.24 |
Key Insight: Using COUNT(DISTINCT purchase_date) instead of COUNT(*) avoids inflating the repeat count when a customer places multiple orders on the same day. Interviewers notice this nuance.
What the Interviewer Is Testing: Conditional aggregation with CASE WHEN inside an aggregate, and whether you de-duplicate same-day orders.
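The same-day de-duplication point is worth seeing in data. Below is a small sketch using Python's built-in sqlite3 module (an assumption for illustration; the article's queries target PostgreSQL) showing how COUNT(*) and COUNT(DISTINCT purchase_date) diverge for a customer who orders twice on one day.

```python
# A customer with two same-day orders: COUNT(*) calls them a repeat buyer,
# COUNT(DISTINCT purchase_date) does not. Uses Python's built-in sqlite3.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE transactions (customer_id INT, purchase_date TEXT)")
con.executemany(
    "INSERT INTO transactions VALUES (?, ?)",
    [
        (1, "2024-05-01"), (1, "2024-05-01"),  # two orders, same day
        (2, "2024-05-01"), (2, "2024-06-01"),  # two orders, different days
    ],
)

rows = con.execute("""
    SELECT customer_id,
           COUNT(*) AS order_rows,
           COUNT(DISTINCT purchase_date) AS purchase_days
    FROM transactions
    GROUP BY customer_id
    ORDER BY customer_id
""").fetchall()
# Customer 1: 2 order rows but only 1 purchase day; customer 2: 2 and 2.
```

Under the day-based definition, only customer 2 counts toward the repeat rate.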
Question 3: Customer Order History with Running Totals
A cloud services provider with e-commerce operations needs a report showing each customer's orders alongside a running cumulative spend, ordered chronologically.
Sample Schema
CREATE TABLE orders (
order_id INT PRIMARY KEY,
customer_id INT,
order_date DATE,
order_total DECIMAL(10,2)
);
SELECT
customer_id,
order_id,
order_date,
order_total,
SUM(order_total) OVER (
PARTITION BY customer_id
ORDER BY order_date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS cumulative_spend
FROM orders
ORDER BY customer_id, order_date;
Expected Output
| customer_id | order_id | order_date | order_total | cumulative_spend |
|---|---|---|---|---|
| 201 | 1001 | 2024-03-15 | 249.99 | 249.99 |
| 201 | 1003 | 2024-06-22 | 310.00 | 559.99 |
| 202 | 1002 | 2024-04-01 | 89.50 | 89.50 |
| 202 | 1005 | 2024-09-05 | 470.50 | 560.00 |
Key Insight: Always specify the window frame explicitly (ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW). When ORDER BY is present, the default frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, which lumps together rows that tie on order_date: two same-day orders would each show the combined cumulative_spend. Relying on that implicit behavior is a common source of bugs.
What the Interviewer Is Testing: Window frame syntax, PARTITION BY usage, and awareness of frame defaults.
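The frame-default pitfall can be reproduced directly. This sketch uses Python's built-in sqlite3 module (SQLite 3.25+; an illustrative stand-in for PostgreSQL, which shares the same standard frame defaults) with two orders on the same date.

```python
# With tied ORDER BY values, the default frame (RANGE ... CURRENT ROW)
# includes all peer rows in the running sum, while an explicit
# ROWS ... CURRENT ROW frame advances one row at a time.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (order_id INT, order_date TEXT, order_total REAL)")
con.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "2024-01-01", 10.0), (2, "2024-01-01", 20.0), (3, "2024-01-02", 5.0)],
)

rows = con.execute("""
    SELECT order_id,
           SUM(order_total) OVER (ORDER BY order_date) AS default_frame,
           SUM(order_total) OVER (
               ORDER BY order_date
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
           ) AS rows_frame
    FROM orders
    ORDER BY order_date, order_id
""").fetchall()
# Both 2024-01-01 rows get default_frame = 30.0 (peers are lumped together),
# while rows_frame climbs row by row. Note that the peer order within a ROWS
# frame is not deterministic when the ORDER BY key ties.
```

If the business wants a true per-order running total, either use the explicit ROWS frame or add a tiebreaker column (such as order_id) to the window's ORDER BY.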
Product and Inventory Analysis
Product-level queries test your ability to work with normalized schemas (orders, order items, products) and to handle category-level aggregations.
Question 4: Highest-Grossing Products by Category
A major e-commerce platform asks you to find the top two products by total revenue within each product category for the year 2024.
Sample Schema
CREATE TABLE product_spend (
category VARCHAR(50),
product VARCHAR(100),
user_id INT,
spend DECIMAL(10,2),
transaction_date DATE
);
| category | product | user_id | spend | transaction_date |
|---|---|---|---|---|
| Electronics | Headphones | 101 | 79.99 | 2024-02-14 |
| Electronics | Tablet | 102 | 349.00 | 2024-03-20 |
| Electronics | Headphones | 103 | 79.99 | 2024-05-11 |
| Appliances | Blender | 101 | 59.99 | 2024-01-08 |
| Appliances | Air Fryer | 104 | 129.99 | 2024-06-30 |
| Appliances | Blender | 105 | 59.99 | 2024-08-15 |
Aggregate spending by category and product, rank within each category, then filter.
WITH category_product_totals AS (
SELECT
category,
product,
SUM(spend) AS total_spend
FROM product_spend
WHERE transaction_date >= '2024-01-01'
AND transaction_date < '2025-01-01'
GROUP BY category, product
),
ranked AS (
SELECT
category,
product,
total_spend,
RANK() OVER (
PARTITION BY category
ORDER BY total_spend DESC
) AS product_rank
FROM category_product_totals
)
SELECT category, product, total_spend
FROM ranked
WHERE product_rank <= 2
ORDER BY category, product_rank;
Expected Output
| category | product | total_spend |
|---|---|---|
| Appliances | Air Fryer | 129.99 |
| Appliances | Blender | 119.98 |
| Electronics | Tablet | 349.00 |
| Electronics | Headphones | 159.98 |
Key Insight: This is one of the most reported SQL questions from e-commerce interviews. The pattern of aggregate, rank with a window function, then filter in an outer query appears repeatedly. Memorize this three-step CTE structure.
What the Interviewer Is Testing: PARTITION BY within RANK(), date filtering, and whether you handle the outer filter correctly (you cannot put WHERE product_rank <= 2 inside the CTE that defines the rank).
Question 5: Inventory Turnover Rate
A large online retailer wants to calculate inventory turnover for each product over the last 90 days. Inventory turnover measures how many times the available stock was sold through.
Sample Schema
CREATE TABLE daily_inventory (
product_id INT,
snapshot_date DATE,
units_on_hand INT
);
CREATE TABLE order_items (
order_id INT,
product_id INT,
quantity INT,
order_date DATE
);
WITH trailing_sales AS (
SELECT
product_id,
SUM(quantity) AS units_sold_90d
FROM order_items
WHERE order_date >= CURRENT_DATE - INTERVAL '90 days'
GROUP BY product_id
),
avg_inventory AS (
SELECT
product_id,
AVG(units_on_hand) AS avg_units_on_hand
FROM daily_inventory
WHERE snapshot_date >= CURRENT_DATE - INTERVAL '90 days'
GROUP BY product_id
)
SELECT
s.product_id,
s.units_sold_90d,
i.avg_units_on_hand,
ROUND(s.units_sold_90d::DECIMAL / NULLIF(i.avg_units_on_hand, 0), 2)
AS turnover_ratio
FROM trailing_sales s
JOIN avg_inventory i ON s.product_id = i.product_id
ORDER BY turnover_ratio DESC;
Expected Output
| product_id | units_sold_90d | avg_units_on_hand | turnover_ratio |
|---|---|---|---|
| 5012 | 840 | 120 | 7.00 |
| 5034 | 620 | 200 | 3.10 |
| 5001 | 150 | 300 | 0.50 |
Key Insight: NULLIF(i.avg_units_on_hand, 0) prevents a division-by-zero error for products whose average on-hand inventory is zero, yielding NULL instead. This defensive pattern shows production-level thinking.
What the Interviewer Is Testing: Multi-table joins across different time grains, NULLIF for safe division, and your understanding of what inventory turnover means as a business metric.
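The NULLIF guard is easy to demonstrate. The sketch below runs in Python's built-in sqlite3 module; note that SQLite happens to return NULL for division by zero on its own, whereas PostgreSQL raises a division_by_zero error, which is exactly why the explicit NULLIF matters there.

```python
# NULLIF(x, 0) returns NULL when x = 0, so the division yields NULL rather
# than failing. Demonstrated with Python's built-in sqlite3 (in PostgreSQL,
# omitting NULLIF would raise division_by_zero instead).
import sqlite3

con = sqlite3.connect(":memory:")

# Zero average inventory: NULLIF(0, 0) IS NULL, and anything / NULL is NULL.
safe = con.execute("SELECT 840.0 / NULLIF(0, 0)").fetchone()[0]

# Normal case: NULLIF(120, 0) is just 120, so the ratio computes as usual.
normal = con.execute("SELECT 840.0 / NULLIF(120, 0)").fetchone()[0]

print(safe, normal)  # None 7.0
```

Downstream, a NULL turnover ratio reads as "undefined for this product," which is more honest than an arbitrary sentinel like 0 or -1.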
Question 6: Frequently Bought Together (Product Pair Analysis)
A marketplace company asks you to find the top 10 product pairs that are most often purchased in the same order.
Sample Schema
CREATE TABLE order_items (
order_id INT,
product_id INT,
quantity INT
);
The trick is a self-join on order_id with a condition that prevents duplicate pairs and self-pairs.
SELECT
a.product_id AS product_a,
b.product_id AS product_b,
COUNT(*) AS times_bought_together
FROM order_items a
JOIN order_items b
ON a.order_id = b.order_id
AND a.product_id < b.product_id
GROUP BY a.product_id, b.product_id
ORDER BY times_bought_together DESC
LIMIT 10;
Expected Output
| product_a | product_b | times_bought_together |
|---|---|---|
| 1001 | 1045 | 312 |
| 1001 | 1023 | 287 |
| 1023 | 1045 | 241 |
Key Insight: The a.product_id < b.product_id condition is critical. Without it, the pair (1001, 1045) and (1045, 1001) would both appear, doubling every count. Interviewers specifically check for this.
Common Mistake: Using != instead of < in the join condition. This eliminates self-pairs but still produces both mirrored orderings of every pair, doubling every count.
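The difference between the two join conditions can be checked on a tiny dataset. This sketch uses Python's built-in sqlite3 module with a hypothetical pair_counts helper (an illustration, not part of the article's solution) parameterized on the comparison operator.

```python
# Compare pair counts from a self-join with '<' vs '!=' using Python's
# built-in sqlite3. The pair_counts helper is a hypothetical convenience
# for this demo, parameterized on the comparison operator.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE order_items (order_id INT, product_id INT)")
con.executemany(
    "INSERT INTO order_items VALUES (?, ?)",
    [(1, 1001), (1, 1045), (2, 1001), (2, 1045), (2, 1023)],
)

def pair_counts(op):
    # op is one of '<' or '!=' -- interpolated from a fixed set, not user input.
    return con.execute(f"""
        SELECT a.product_id, b.product_id, COUNT(*) AS times_together
        FROM order_items a
        JOIN order_items b
          ON a.order_id = b.order_id
         AND a.product_id {op} b.product_id
        GROUP BY a.product_id, b.product_id
    """).fetchall()

lt = pair_counts("<")    # 3 distinct pairs, counted once each occurrence
ne = pair_counts("!=")   # 6 rows: every pair appears in both orderings
print(len(lt), len(ne))  # 3 6
```

With `<`, the pair (1001, 1045) correctly shows 2 co-purchases; with `!=`, the same information is split across (1001, 1045) and (1045, 1001), and naively summing would double every count.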
Funnel and Conversion Queries
Funnel analysis is where e-commerce SQL interviews get distinctive. These questions test conditional aggregation and your ability to reason about user sessions.
Question 7: Checkout Funnel Conversion Rates
A major e-commerce platform tracks user events through a four-step funnel: page view, add to cart, begin checkout, purchase. Calculate the conversion rate at each step.
Sample Schema
CREATE TABLE user_events (
event_id INT PRIMARY KEY,
user_id INT,
event_type VARCHAR(20),
event_date DATE
);
-- event_type values: 'page_view', 'add_to_cart', 'begin_checkout', 'purchase'
WITH funnel_counts AS (
SELECT
COUNT(DISTINCT CASE WHEN event_type = 'page_view'
THEN user_id END) AS page_views,
COUNT(DISTINCT CASE WHEN event_type = 'add_to_cart'
THEN user_id END) AS add_to_carts,
COUNT(DISTINCT CASE WHEN event_type = 'begin_checkout'
THEN user_id END) AS checkouts,
COUNT(DISTINCT CASE WHEN event_type = 'purchase'
THEN user_id END) AS purchases
FROM user_events
)
SELECT
page_views,
add_to_carts,
ROUND(100.0 * add_to_carts / NULLIF(page_views, 0), 2)
AS view_to_cart_pct,
checkouts,
ROUND(100.0 * checkouts / NULLIF(add_to_carts, 0), 2)
AS cart_to_checkout_pct,
purchases,
ROUND(100.0 * purchases / NULLIF(checkouts, 0), 2)
AS checkout_to_purchase_pct
FROM funnel_counts;
Expected Output
| page_views | add_to_carts | view_to_cart_pct | checkouts | cart_to_checkout_pct | purchases | checkout_to_purchase_pct |
|---|---|---|---|---|---|---|
| 10000 | 3500 | 35.00 | 2100 | 60.00 | 1680 | 80.00 |
Key Insight: COUNT(DISTINCT CASE WHEN ... THEN user_id END) is the standard conditional-aggregation pattern for funnel queries: the CASE passes a user_id through only for the matching event type, and the outer COUNT(DISTINCT ...) counts unique users at each stage rather than total events.
What the Interviewer Is Testing: Conditional aggregation, CASE WHEN inside aggregate functions, and whether you compute step-to-step rates rather than only overall conversion.
Question 8: Cart Abandonment by Day of Week
A large marketplace company needs to know which days of the week have the highest cart abandonment rate. A cart is abandoned when a user has an add_to_cart event but no subsequent purchase event on the same day.
Sample Schema
CREATE TABLE user_events (
event_id INT PRIMARY KEY,
user_id INT,
event_type VARCHAR(20),
event_date DATE
);
WITH daily_carts AS (
SELECT DISTINCT user_id, event_date
FROM user_events
WHERE event_type = 'add_to_cart'
),
daily_purchases AS (
SELECT DISTINCT user_id, event_date
FROM user_events
WHERE event_type = 'purchase'
),
cart_status AS (
SELECT
c.event_date,
c.user_id,
CASE WHEN p.user_id IS NULL THEN 1 ELSE 0 END AS abandoned
FROM daily_carts c
LEFT JOIN daily_purchases p
ON c.user_id = p.user_id
AND c.event_date = p.event_date
)
SELECT
TO_CHAR(event_date, 'Day') AS day_of_week,
EXTRACT(DOW FROM event_date) AS day_num,
COUNT(*) AS total_carts,
SUM(abandoned) AS abandoned_carts,
ROUND(100.0 * SUM(abandoned) / COUNT(*), 2) AS abandonment_rate_pct
FROM cart_status
GROUP BY TO_CHAR(event_date, 'Day'), EXTRACT(DOW FROM event_date)
ORDER BY day_num;
Expected Output
| day_of_week | day_num | total_carts | abandoned_carts | abandonment_rate_pct |
|---|---|---|---|---|
| Sunday | 0 | 1420 | 987 | 69.51 |
| Monday | 1 | 1850 | 1147 | 62.00 |
| Tuesday | 2 | 1780 | 1103 | 61.97 |
Key Insight: The LEFT JOIN between cart events and purchase events is the core of abandonment logic. Users who added to cart but never purchased will have NULL in the purchase side of the join.
What the Interviewer Is Testing: LEFT JOIN for set difference logic, date-level granularity decisions, and whether you can translate a business definition ("abandoned cart") into precise SQL.
Question 9: Session-Based Funnel with Time Constraint
A major online retailer defines a valid conversion as a purchase that occurs within 30 minutes of the initial page view in the same session. Calculate the time-constrained conversion rate.
Sample Schema
CREATE TABLE session_events (
event_id INT PRIMARY KEY,
session_id VARCHAR(36),
user_id INT,
event_type VARCHAR(20),
event_ts TIMESTAMP
);
WITH session_starts AS (
SELECT
session_id,
user_id,
MIN(event_ts) AS session_start
FROM session_events
WHERE event_type = 'page_view'
GROUP BY session_id, user_id
),
session_purchases AS (
SELECT
session_id,
MIN(event_ts) AS purchase_ts
FROM session_events
WHERE event_type = 'purchase'
GROUP BY session_id
)
SELECT
COUNT(DISTINCT s.session_id) AS total_sessions,
COUNT(DISTINCT CASE
WHEN p.purchase_ts IS NOT NULL
AND p.purchase_ts <= s.session_start + INTERVAL '30 minutes'
THEN s.session_id
END) AS converted_sessions,
ROUND(100.0 *
COUNT(DISTINCT CASE
WHEN p.purchase_ts IS NOT NULL
AND p.purchase_ts <= s.session_start + INTERVAL '30 minutes'
THEN s.session_id
END)
/ COUNT(DISTINCT s.session_id),
2) AS conversion_rate_pct
FROM session_starts s
LEFT JOIN session_purchases p
ON s.session_id = p.session_id;
Expected Output
| total_sessions | converted_sessions | conversion_rate_pct |
|---|---|---|
| 45000 | 2925 | 6.50 |
Key Insight: Time-constrained funnels are more realistic than simple event-presence funnels. Real purchase attribution almost always involves a time window, whether it is 30 minutes for sessions or 7 days for marketing attribution.
What the Interviewer Is Testing: Timestamp arithmetic, interval logic, and whether you apply the time constraint before computing the ratio rather than after.
Seller and Marketplace Metrics
Two-sided marketplace interviews add a second dimension: seller performance. These questions test your ability to work with data from both sides of the platform.
Question 10: Top Sellers by Fulfillment Speed
A large marketplace company asks you to rank sellers by their average order fulfillment time (days between order placement and shipment) and identify the top 10 fastest.
Sample Schema
CREATE TABLE marketplace_orders (
order_id INT PRIMARY KEY,
seller_id INT,
buyer_id INT,
order_date DATE,
ship_date DATE,
order_total DECIMAL(10,2)
);
SELECT
seller_id,
COUNT(*) AS total_orders,
ROUND(AVG(ship_date - order_date), 2) AS avg_fulfillment_days,
ROUND(
PERCENTILE_CONT(0.5) WITHIN GROUP (
ORDER BY ship_date - order_date
)::NUMERIC, 2
) AS median_fulfillment_days
FROM marketplace_orders
WHERE ship_date IS NOT NULL
GROUP BY seller_id
HAVING COUNT(*) >= 10
ORDER BY avg_fulfillment_days ASC
LIMIT 10;
Expected Output
| seller_id | total_orders | avg_fulfillment_days | median_fulfillment_days |
|---|---|---|---|
| 3021 | 487 | 1.24 | 1.00 |
| 3089 | 312 | 1.41 | 1.00 |
| 3045 | 156 | 1.58 | 2.00 |
Key Insight: The HAVING COUNT(*) >= 10 filter removes sellers with too few orders to produce a meaningful average. In real marketplace analytics, minimum order thresholds prevent low-volume sellers from appearing in top-performer lists due to statistical noise.
What the Interviewer Is Testing: Date arithmetic for fulfillment time, HAVING clause for minimum sample filtering, and the difference between mean and median for skewed distributions.
Question 11: Seller Rating Distribution
A marketplace platform wants to categorize sellers into performance tiers based on their average customer rating: "Excellent" (4.5 or above), "Good" (3.5 to 4.49), "Needs Improvement" (below 3.5).
Sample Schema
CREATE TABLE seller_reviews (
review_id INT PRIMARY KEY,
seller_id INT,
buyer_id INT,
rating DECIMAL(2,1),
review_date DATE
);
WITH seller_avg AS (
SELECT
seller_id,
COUNT(*) AS review_count,
ROUND(AVG(rating), 2) AS avg_rating
FROM seller_reviews
GROUP BY seller_id
HAVING COUNT(*) >= 5
)
SELECT
CASE
WHEN avg_rating >= 4.5 THEN 'Excellent'
WHEN avg_rating >= 3.5 THEN 'Good'
ELSE 'Needs Improvement'
END AS tier,
COUNT(*) AS seller_count,
ROUND(100.0 * COUNT(*) / SUM(COUNT(*)) OVER (), 2) AS pct_of_sellers
FROM seller_avg
GROUP BY
CASE
WHEN avg_rating >= 4.5 THEN 'Excellent'
WHEN avg_rating >= 3.5 THEN 'Good'
ELSE 'Needs Improvement'
END
ORDER BY
CASE tier
WHEN 'Excellent' THEN 1
WHEN 'Good' THEN 2
ELSE 3
END;
Expected Output
| tier | seller_count | pct_of_sellers |
|---|---|---|
| Excellent | 1240 | 41.33 |
| Good | 1380 | 46.00 |
| Needs Improvement | 380 | 12.67 |
Key Insight: SUM(COUNT(*)) OVER () computes the grand total across all groups in a single pass, which is cleaner than using a subquery or CTE to get the denominator for percentage calculations.
What the Interviewer Is Testing: CASE WHEN for tiering, window functions mixed with GROUP BY, and custom ORDER BY logic for non-alphabetical sorting.
Question 12: 7-Day Rolling Revenue per Seller
A marketplace platform tracks daily seller revenue and wants a 7-day rolling average to smooth out day-to-day fluctuations.
Sample Schema
CREATE TABLE daily_seller_revenue (
seller_id INT,
rev_date DATE,
revenue DECIMAL(10,2)
);
SELECT
seller_id,
rev_date,
revenue,
ROUND(
AVG(revenue) OVER (
PARTITION BY seller_id
ORDER BY rev_date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
), 2
) AS rolling_7d_avg
FROM daily_seller_revenue
ORDER BY seller_id, rev_date;
Expected Output
| seller_id | rev_date | revenue | rolling_7d_avg |
|---|---|---|---|
| 3021 | 2024-10-01 | 1250.00 | 1250.00 |
| 3021 | 2024-10-02 | 980.00 | 1115.00 |
| 3021 | 2024-10-03 | 1340.00 | 1190.00 |
| 3021 | 2024-10-07 | 1120.00 | 1158.57 |
Key Insight: ROWS BETWEEN 6 PRECEDING AND CURRENT ROW gives exactly 7 rows (current plus 6 prior). If your data has gaps (missing dates), consider using RANGE instead of ROWS, or fill gaps with a date spine before applying the window.
Common Mistake: Writing ROWS BETWEEN 7 PRECEDING AND CURRENT ROW gives an 8-row window. This off-by-one error is one of the most frequent mistakes in rolling window questions.
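The off-by-one is simple to confirm empirically. The sketch below uses Python's built-in sqlite3 module (SQLite 3.25+ for window functions) and counts the rows each frame actually covers rather than averaging, so the window width is visible directly.

```python
# Count the rows covered by a "6 PRECEDING" vs "7 PRECEDING" frame using
# Python's built-in sqlite3. Ten consecutive days of dummy revenue.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE daily_rev (d INT, revenue REAL)")
con.executemany("INSERT INTO daily_rev VALUES (?, ?)",
                [(i, 100.0) for i in range(1, 11)])

rows = con.execute("""
    SELECT d,
           COUNT(*) OVER (
               ORDER BY d ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
           ) AS win_6p,
           COUNT(*) OVER (
               ORDER BY d ROWS BETWEEN 7 PRECEDING AND CURRENT ROW
           ) AS win_7p
    FROM daily_rev
    ORDER BY d
""").fetchall()
# Once the series is long enough, "6 PRECEDING" covers 7 rows per frame
# and "7 PRECEDING" covers 8 -- the off-by-one in action.
```

Early rows show both frames truncated (day 1 covers only itself), which is also why the first few rolling averages in the article's output equal a shorter-window average.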
Time-Based Business Intelligence
Time-series questions are where e-commerce interviews differentiate strong candidates from average ones. These patterns appear in almost every data science take-home and live coding round.
Question 13: Month-over-Month Revenue Growth
A major e-commerce platform needs a report showing each month's total revenue alongside the percentage change from the prior month.
Sample Schema
CREATE TABLE orders (
order_id INT PRIMARY KEY,
order_date DATE,
total_amount DECIMAL(10,2)
);
WITH monthly_revenue AS (
SELECT
DATE_TRUNC('month', order_date) AS revenue_month,
SUM(total_amount) AS monthly_total
FROM orders
GROUP BY DATE_TRUNC('month', order_date)
)
SELECT
revenue_month,
monthly_total,
LAG(monthly_total) OVER (ORDER BY revenue_month) AS prev_month_total,
ROUND(
100.0 * (monthly_total - LAG(monthly_total) OVER (ORDER BY revenue_month))
/ NULLIF(LAG(monthly_total) OVER (ORDER BY revenue_month), 0),
2
) AS mom_growth_pct
FROM monthly_revenue
ORDER BY revenue_month;
Expected Output
| revenue_month | monthly_total | prev_month_total | mom_growth_pct |
|---|---|---|---|
| 2024-01-01 | 1240000.00 | NULL | NULL |
| 2024-02-01 | 1185000.00 | 1240000.00 | -4.44 |
| 2024-03-01 | 1367000.00 | 1185000.00 | 15.36 |
Key Insight: LAG() with no offset parameter defaults to 1, which gives you the immediately preceding row. For year-over-year comparisons, you would use LAG(monthly_total, 12) to reach back 12 months.
What the Interviewer Is Testing: LAG() window function, DATE_TRUNC for period aggregation, and handling the NULL first row gracefully rather than showing 0% or an error.
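The LAG pattern, including the NULL first row, can be reproduced end to end. This sketch uses Python's built-in sqlite3 module on pre-aggregated monthly totals matching the article's expected output (DATE_TRUNC is PostgreSQL-specific, so the monthly aggregation step is assumed done).

```python
# Month-over-month growth with LAG using Python's built-in sqlite3.
# Input is already at monthly grain, matching the monthly_revenue CTE.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE monthly_revenue (revenue_month TEXT, monthly_total REAL)")
con.executemany(
    "INSERT INTO monthly_revenue VALUES (?, ?)",
    [("2024-01-01", 1240000.0), ("2024-02-01", 1185000.0), ("2024-03-01", 1367000.0)],
)

rows = con.execute("""
    SELECT revenue_month,
           LAG(monthly_total) OVER (ORDER BY revenue_month) AS prev_month_total,
           ROUND(
               100.0 * (monthly_total
                        - LAG(monthly_total) OVER (ORDER BY revenue_month))
               / LAG(monthly_total) OVER (ORDER BY revenue_month),
               2
           ) AS mom_growth_pct
    FROM monthly_revenue
    ORDER BY revenue_month
""").fetchall()
# The first month has no predecessor, so both prev_month_total and
# mom_growth_pct are NULL (None in Python) -- no special-casing needed.
```

Because LAG over an empty predecessor yields NULL and NULL propagates through arithmetic, the January row comes out NULL/NULL automatically, matching the article's expected output.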
Question 14: Year-over-Year Product Spend Growth
A large e-commerce platform asks you to calculate the year-over-year growth rate for total spending on each product.
Sample Schema
CREATE TABLE user_transactions (
transaction_id INT PRIMARY KEY,
product_id INT,
spend DECIMAL(10,2),
transaction_date DATE
);
WITH yearly_spend AS (
SELECT
EXTRACT(YEAR FROM transaction_date) AS txn_year,
product_id,
SUM(spend) AS annual_spend
FROM user_transactions
GROUP BY EXTRACT(YEAR FROM transaction_date), product_id
)
SELECT
txn_year,
product_id,
annual_spend AS curr_year_spend,
LAG(annual_spend) OVER (
PARTITION BY product_id ORDER BY txn_year
) AS prev_year_spend,
ROUND(
100.0 * (annual_spend - LAG(annual_spend) OVER (
PARTITION BY product_id ORDER BY txn_year
))
/ NULLIF(LAG(annual_spend) OVER (
PARTITION BY product_id ORDER BY txn_year
), 0),
2
) AS yoy_growth_pct
FROM yearly_spend
ORDER BY product_id, txn_year;
Expected Output
| txn_year | product_id | curr_year_spend | prev_year_spend | yoy_growth_pct |
|---|---|---|---|---|
| 2022 | 1001 | 15200.00 | NULL | NULL |
| 2023 | 1001 | 18400.00 | 15200.00 | 21.05 |
| 2024 | 1001 | 16100.00 | 18400.00 | -12.50 |
Key Insight: This question is one of the most commonly reported from interviews at e-commerce companies, per DataLemur. The pattern is identical to month-over-month except that LAG() is partitioned by product and ordered by year. Once you understand the MoM pattern, YoY is a direct adaptation.
What the Interviewer Is Testing: LAG() with PARTITION BY, EXTRACT(YEAR FROM ...) for year-level aggregation, and safe division with NULLIF.
Question 15: First-Purchase Cohort Retention
A major online retailer asks you to build a cohort table showing what percentage of customers who made their first purchase in a given month returned to purchase in each of the following three months.
Sample Schema
CREATE TABLE orders (
order_id INT PRIMARY KEY,
customer_id INT,
order_date DATE
);
WITH first_purchase AS (
SELECT
customer_id,
DATE_TRUNC('month', MIN(order_date)) AS cohort_month
FROM orders
GROUP BY customer_id
),
subsequent_orders AS (
SELECT
fp.cohort_month,
fp.customer_id,
DATE_TRUNC('month', o.order_date) AS order_month,
(EXTRACT(YEAR FROM DATE_TRUNC('month', o.order_date))
- EXTRACT(YEAR FROM fp.cohort_month)) * 12
+ (EXTRACT(MONTH FROM DATE_TRUNC('month', o.order_date))
- EXTRACT(MONTH FROM fp.cohort_month)) AS months_since_first
FROM first_purchase fp
JOIN orders o ON fp.customer_id = o.customer_id
),
cohort_sizes AS (
SELECT cohort_month, COUNT(DISTINCT customer_id) AS cohort_size
FROM first_purchase
GROUP BY cohort_month
)
SELECT
cs.cohort_month,
cs.cohort_size,
ROUND(100.0 * COUNT(DISTINCT CASE WHEN so.months_since_first = 1
THEN so.customer_id END) / cs.cohort_size, 2) AS month_1_retention,
ROUND(100.0 * COUNT(DISTINCT CASE WHEN so.months_since_first = 2
THEN so.customer_id END) / cs.cohort_size, 2) AS month_2_retention,
ROUND(100.0 * COUNT(DISTINCT CASE WHEN so.months_since_first = 3
THEN so.customer_id END) / cs.cohort_size, 2) AS month_3_retention
FROM cohort_sizes cs
LEFT JOIN subsequent_orders so
ON cs.cohort_month = so.cohort_month
AND so.months_since_first BETWEEN 1 AND 3
GROUP BY cs.cohort_month, cs.cohort_size
ORDER BY cs.cohort_month;
Expected Output
| cohort_month | cohort_size | month_1_retention | month_2_retention | month_3_retention |
|---|---|---|---|---|
| 2024-01-01 | 2340 | 28.42 | 19.15 | 15.60 |
| 2024-02-01 | 2180 | 31.10 | 21.33 | 16.88 |
| 2024-03-01 | 2510 | 26.77 | 18.45 | 0.00 |
Key Insight: Cohort retention is the single most asked advanced SQL question at e-commerce companies. The structure is always the same: identify each customer's first event, calculate the time delta to subsequent events, then pivot into a retention grid. Practice this pattern until you can write it from memory.
What the Interviewer Is Testing: CTEs for multi-step logic, date arithmetic for month deltas, conditional aggregation with COUNT(DISTINCT CASE WHEN ...), and whether you notice that incomplete cohorts mislead: for the most recent cohort, month 3 has not happened yet, so COUNT returns 0 rather than NULL, making "no data yet" look like 0% retention. Masking future months (for example, returning NULL when cohort_month plus the offset is beyond the current month) is the finishing touch strong candidates mention.
Conclusion
The 15 questions in this guide cover the patterns that appear most frequently in SQL interviews at companies with e-commerce and marketplace business models. Three themes dominate: window functions (RANK, LAG, SUM OVER), conditional aggregation (CASE WHEN inside COUNT and SUM), and multi-step CTE pipelines that break complex business logic into readable stages.
The questions that separate candidates are not the hardest technically. They are the ones that require translating a business concept into precise SQL. "Cart abandonment" becomes a LEFT JOIN where the purchase side is NULL. "Cohort retention" becomes a self-join with month-delta arithmetic. If you can make that translation quickly, you will handle whatever variation the interviewer creates.
Practice each question by writing the SQL from scratch before looking at the solution. Then modify it: change "top 2 per category" to "top 3 per region," swap month-over-month for quarter-over-quarter, add a filter for minimum order count. These small modifications build the fluency that lets you adapt in a live interview.
For more SQL and data science interview preparation, explore the practice problems and career guides at letsdatascience.com.
Career Q&A
How much of a typical data science interview at an e-commerce company focuses on SQL? At most large technology companies with e-commerce operations, SQL accounts for at least one full interview round and often appears in the take-home assignment as well. StrataScratch reports that 70% of questions at these companies involve JOINs, CTEs, or window functions. Expect SQL to be a gating criterion rather than a bonus skill.
Should I practice on a specific SQL dialect for marketplace interviews? Most interview platforms and live coding environments use PostgreSQL syntax. If the company's job posting mentions Redshift, BigQuery, or Spark SQL, practice with that dialect's date functions and string handling, which differ across platforms. For standard window functions and aggregations, the syntax is nearly identical across all major dialects.
What is the most common SQL mistake that costs candidates an offer? Writing queries that produce incorrect results due to grain confusion. When an orders table has one row per line item rather than one row per order, a simple COUNT(*) overstates the order count. Always check the grain of each table before writing your GROUP BY. Ask the interviewer explicitly if the schema is ambiguous.
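The grain mistake takes one query to demonstrate. This sketch uses Python's built-in sqlite3 module with a hypothetical order_lines table at line-item grain (one row per product per order).

```python
# A line-item-grain table: COUNT(*) counts rows (line items), while
# COUNT(DISTINCT order_id) counts actual orders. Uses Python's sqlite3.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE order_lines (order_id INT, product_id INT, qty INT)")
con.executemany(
    "INSERT INTO order_lines VALUES (?, ?, ?)",
    [
        (1, 10, 1), (1, 11, 2),   # order 1: two line items
        (2, 10, 1),               # order 2: one line item
        (3, 12, 4), (3, 13, 1),   # order 3: two line items
    ],
)

row_count = con.execute("SELECT COUNT(*) FROM order_lines").fetchone()[0]
order_count = con.execute(
    "SELECT COUNT(DISTINCT order_id) FROM order_lines"
).fetchone()[0]
print(row_count, order_count)  # 5 3
```

Three orders, but a naive COUNT(*) reports five: any downstream metric built on "order count" (average order value, orders per customer) would be silently wrong.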
How should I structure my approach when given a SQL question I have not seen before? Start by restating the question in your own words and confirming the expected output with the interviewer. Then identify which tables you need and at what grain. Write the query in stages, using CTEs to build intermediate results you can verify. Name your CTEs descriptively. This structured approach shows clear thinking even if your final query needs debugging.
Do e-commerce SQL interviews ever include database design questions? Yes, particularly at companies building new data platforms. You may be asked to design a schema for tracking orders, returns, and inventory, or to normalize a denormalized reporting table. Understanding star schema design and the tradeoffs between normalization and query performance is worth reviewing before your interview.
Is it worth memorizing specific query patterns, or should I focus on understanding fundamentals? Both. The cohort retention pattern, the aggregate-rank-filter pattern, and the LAG-based period-over-period pattern appear so frequently that writing them from memory saves you time and mental energy during the interview. But the fundamentals (joins, group by, window frames) are what let you adapt those patterns to new variations.
How do marketplace SQL interviews differ from standard e-commerce SQL interviews? Marketplace interviews add a second entity: the seller. You will see questions about seller performance tiers, fulfillment speed rankings, and cross-side metrics like "what percentage of buyers purchase from more than one seller." The SQL is similar, but you need to keep track of which side of the marketplace each metric belongs to.
What should I do if I get stuck during a live SQL coding round? Talk through your thought process. Explain what you are trying to compute, what intermediate result you need, and where you are stuck. Interviewers routinely give hints to candidates who communicate well but may not help candidates who go silent. Partial credit is real: a correct CTE with a flawed final SELECT is better than a blank screen.
Sources
- Amazon SQL Interview Questions (DataLemur, 2025)
- Highest-Grossing Items SQL Question (DataLemur, 2025)
- Y-on-Y Growth Rate SQL Question (DataLemur, 2025)
- Advanced SQL Interview Questions (DataLemur, 2025)
- Amazon SQL Interview Questions (StrataScratch, 2025)
- Top 30 SQL Query Interview Questions (StrataScratch, 2025)
- SQL Interview Questions for Data Analysts (InterviewQuery, 2025)
- SQL Scenario-Based Interview Questions (InterviewQuery, 2025)
- Customer Placing the Largest Number of Orders (LeetCode)
- Customers Who Never Order (LeetCode)
- SQL 50 Study Plan (LeetCode)
- Paired Products SQL Problem (InterviewQuery)
- Market Basket Analysis with SQL (GeeksforGeeks, 2025)
- How to Compute Year-Over-Year Differences in SQL (LearnSQL, 2025)