
SQL Interview Questions for Search and Ad Platform Roles

LDS Team · Let's Data Science · 16 min read

SQL rounds at companies running search engines and advertising platforms are structurally different from general data science SQL interviews. The schemas are heavier on event tables, the metrics are domain-specific (CTR, RPM, auction win rates), and the expected query patterns lean hard on window functions, sessionization logic, and multi-step CTEs. If you walk in expecting a warm-up aggregation question, you may find yourself staring at a three-table attribution schema with a running revenue window.

"The questions in this article are drawn from our research of publicly available interview prep platforms and community discussions, including DataLemur, StrataScratch, LeetCode, and forums such as r/datascience. These represent patterns that candidates have reported at technology companies operating search, advertising, or recommendation systems. We do not claim these are proprietary questions from any specific organization."

CTR and Impression Analysis

Click-through rate is the foundational metric in any ad platform. Interviewers use CTR questions to test whether you can handle pivoted event tables, integer division pitfalls, and correct date filtering, all simultaneously.

Schema

sql
-- events table (used across Questions 1–3)
CREATE TABLE events (
  app_id      INTEGER,
  event_type  VARCHAR(20),  -- 'impression' or 'click'
  event_time  TIMESTAMP
);

Question 1: Basic CTR per App (Easy)

A large digital advertising platform logs all ad impressions and clicks in a single events table. Calculate the click-through rate for each app for the full calendar year 2023, rounded to two decimal places.

sql
SELECT
  app_id,
  ROUND(
    100.0 *
    SUM(CASE WHEN event_type = 'click'      THEN 1 ELSE 0 END) /
    NULLIF(SUM(CASE WHEN event_type = 'impression' THEN 1 ELSE 0 END), 0),
  2) AS ctr
FROM events
WHERE event_time >= '2023-01-01'
  AND event_time <  '2024-01-01'
GROUP BY app_id;

Sample output

| app_id | ctr |
|--------|------|
| 101 | 3.42 |
| 202 | 7.89 |
| 303 | 1.15 |

Key Insight: Multiply by 100.0 (not 100) before dividing to force floating-point arithmetic. Integer division truncates toward zero, so any per-app CTR below 100% silently comes back as 0 in most SQL dialects.

Common Mistake: Filtering with YEAR(event_time) = 2023 is valid but non-sargable on large tables. A range predicate on an indexed timestamp column is far faster.

What the Interviewer Is Testing: Basic conditional aggregation, date filtering, and awareness of integer division behavior.
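The integer-division pitfall is easy to reproduce in any engine. Here is a minimal sketch using Python's sqlite3 with a hypothetical mini-dataset (3 impressions, 1 click, so CTR should be 33.33%); the table values are illustrative, not from the article's sample data:

```python
import sqlite3

# Hypothetical mini-dataset: 3 impressions and 1 click for one app.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (app_id INTEGER, event_type TEXT, event_time TEXT);
INSERT INTO events VALUES
  (101, 'impression', '2023-06-01'), (101, 'impression', '2023-06-02'),
  (101, 'impression', '2023-06-03'), (101, 'click',      '2023-06-03');
""")

# Integer numerator and denominator: 1 / 3 truncates to 0.
bad = conn.execute(
    "SELECT SUM(event_type = 'click') / SUM(event_type = 'impression') FROM events"
).fetchone()[0]

# Multiplying by 100.0 first promotes the whole expression to floating point.
good = conn.execute("""
    SELECT ROUND(100.0 * SUM(event_type = 'click')
                 / SUM(event_type = 'impression'), 2)
    FROM events
""").fetchone()[0]

print(bad, good)  # 0 33.33
```

The same truncation happens in PostgreSQL and SQL Server; MySQL is the notable exception, where `/` always produces a decimal result.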

Question 2: CTR by Ad Placement Position (Medium)

Ad position matters: position 1 typically earns 5 to 10 times the CTR of position 5. A content recommendation platform wants a breakdown of CTR by ad slot position to decide how aggressively to price each slot.

sql
-- Updated schema adds position column
CREATE TABLE ad_events (
  ad_id       INTEGER,
  position    INTEGER,   -- 1 = top slot, higher = lower on page
  event_type  VARCHAR(20),
  event_time  TIMESTAMP
);
sql
SELECT
  position,
  SUM(CASE WHEN event_type = 'click'      THEN 1 ELSE 0 END) AS clicks,
  SUM(CASE WHEN event_type = 'impression' THEN 1 ELSE 0 END) AS impressions,
  ROUND(
    100.0 *
    SUM(CASE WHEN event_type = 'click'      THEN 1 ELSE 0 END) /
    NULLIF(SUM(CASE WHEN event_type = 'impression' THEN 1 ELSE 0 END), 0),
  2) AS ctr
FROM ad_events
GROUP BY position
ORDER BY position;

Sample output

| position | clicks | impressions | ctr |
|----------|--------|-------------|------|
| 1 | 4,820 | 52,100 | 9.25 |
| 2 | 2,340 | 51,800 | 4.52 |
| 3 | 980 | 50,900 | 1.93 |

Key Insight: Wrap the denominator in NULLIF(..., 0) to avoid a division-by-zero error on positions with only click events (possible if data arrives out of order).

Common Mistake: Omitting NULLIF and trusting the data is always clean. Interviewers at platforms dealing with real-time pipelines will probe exactly this edge case.

What the Interviewer Is Testing: Defensive coding for NULL/zero edge cases and understanding of position bias in ad systems.
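The NULLIF guard can be checked the same way. A sketch in Python's sqlite3 with made-up rows where position 2 has a click but no impressions; with the guard, the bad position reports NULL rather than crashing (SQLite happens to return NULL for division by zero anyway, but stricter engines raise an error, which is exactly what NULLIF prevents):

```python
import sqlite3

# Hypothetical rows: position 2 has a click but no impression (out-of-order data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ad_events (position INTEGER, event_type TEXT)")
conn.executemany("INSERT INTO ad_events VALUES (?, ?)",
                 [(1, 'impression'), (1, 'click'), (2, 'click')])

rows = conn.execute("""
    SELECT position,
           100.0 * SUM(event_type = 'click')
                 / NULLIF(SUM(event_type = 'impression'), 0) AS ctr
    FROM ad_events
    GROUP BY position
    ORDER BY position
""").fetchall()
print(rows)  # [(1, 100.0), (2, None)] -- NULL for the zero-impression position
```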

Question 3: Top 3 Ads by CTR per Campaign (Medium)

A major search engine company runs hundreds of active campaigns simultaneously. Product wants a daily report showing the top 3 performing ads within each campaign by CTR.

sql
CREATE TABLE campaign_events (
  campaign_id INTEGER,
  ad_id       INTEGER,
  event_type  VARCHAR(20),
  event_date  DATE
);
sql
WITH ad_ctr AS (
  SELECT
    campaign_id,
    ad_id,
    SUM(CASE WHEN event_type = 'click'      THEN 1 ELSE 0 END) AS clicks,
    SUM(CASE WHEN event_type = 'impression' THEN 1 ELSE 0 END) AS impressions,
    ROUND(
      100.0 *
      SUM(CASE WHEN event_type = 'click'      THEN 1 ELSE 0 END) /
      NULLIF(SUM(CASE WHEN event_type = 'impression' THEN 1 ELSE 0 END), 0),
    2) AS ctr
  FROM campaign_events
  GROUP BY campaign_id, ad_id
),
ranked AS (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY campaign_id ORDER BY ctr DESC) AS rn
  FROM ad_ctr
  WHERE impressions >= 100   -- minimum impression threshold
)
SELECT campaign_id, ad_id, clicks, impressions, ctr
FROM ranked
WHERE rn <= 3
ORDER BY campaign_id, rn;

Sample output

| campaign_id | ad_id | clicks | impressions | ctr |
|-------------|-------|--------|-------------|------|
| 10 | 501 | 320 | 2,100 | 15.24 |
| 10 | 507 | 290 | 2,400 | 12.08 |
| 10 | 512 | 210 | 1,950 | 10.77 |

Key Insight: The impression threshold (impressions >= 100) prevents low-volume flukes from dominating. An ad with 1 impression and 1 click reports 100% CTR, which is statistically meaningless.

Common Mistake: Using RANK() instead of ROW_NUMBER(). If two ads tie on CTR, RANK() returns both as rank 1 and skips rank 2, potentially returning 4 or more rows per campaign.

What the Interviewer Is Testing: ROW_NUMBER() vs RANK() vs DENSE_RANK() distinctions, and real-world data quality filtering.
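The tie behavior is worth verifying yourself. A sketch with a hypothetical four-ad table where two ads tie at CTR 7.0: filtering `RANK() <= 3` returns all four rows, while `ROW_NUMBER() <= 3` returns exactly three (requires SQLite 3.25+ for window functions, which ships with any recent Python):

```python
import sqlite3

# Hypothetical CTRs with a tie at third place.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ad_ctr (ad_id INTEGER, ctr REAL)")
conn.executemany("INSERT INTO ad_ctr VALUES (?, ?)",
                 [(1, 9.0), (2, 8.0), (3, 7.0), (4, 7.0)])

# RANK() gives the tied ads 3 and 3, so "<= 3" keeps all four rows.
n_rank = conn.execute("""
    SELECT COUNT(*) FROM
      (SELECT RANK() OVER (ORDER BY ctr DESC) AS r FROM ad_ctr)
    WHERE r <= 3
""").fetchone()[0]

# ROW_NUMBER() breaks the tie arbitrarily, so "<= 3" keeps exactly three.
n_rownum = conn.execute("""
    SELECT COUNT(*) FROM
      (SELECT ROW_NUMBER() OVER (ORDER BY ctr DESC) AS r FROM ad_ctr)
    WHERE r <= 3
""").fetchone()[0]

print(n_rank, n_rownum)  # 4 3
```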

User Session and Search Query Analysis

Session-level analysis is where search and ad platform interviews separate strong candidates from average ones. The sessionization problem (assigning session IDs to raw event streams using time gaps) appears regularly in these rounds.

Schema

sql
CREATE TABLE user_events (
  user_id    INTEGER,
  query      VARCHAR(255),
  event_time TIMESTAMP
);

Question 4: Session Identification with 30-Minute Gap Logic (Hard)

A search engine product team defines a session as a sequence of queries from the same user with no gap longer than 30 minutes. Write a query that assigns a session ID to each event.

sql
WITH gap_flags AS (
  SELECT
    user_id,
    query,
    event_time,
    CASE
      WHEN event_time - LAG(event_time) OVER (PARTITION BY user_id ORDER BY event_time)
           > INTERVAL '30 minutes'
        THEN 1
      WHEN LAG(event_time) OVER (PARTITION BY user_id ORDER BY event_time) IS NULL
        THEN 1
      ELSE 0
    END AS is_new_session
  FROM user_events
),
session_numbers AS (
  SELECT
    user_id,
    query,
    event_time,
    SUM(is_new_session) OVER (PARTITION BY user_id ORDER BY event_time) AS session_id
  FROM gap_flags
)
SELECT
  user_id,
  session_id,
  MIN(event_time) AS session_start,
  MAX(event_time) AS session_end,
  COUNT(*)        AS query_count
FROM session_numbers
GROUP BY user_id, session_id
ORDER BY user_id, session_id;

Sample output

| user_id | session_id | session_start | session_end | query_count |
|---------|------------|---------------|-------------|-------------|
| 42 | 1 | 2024-03-01 09:00:00 | 2024-03-01 09:22:00 | 4 |
| 42 | 2 | 2024-03-01 10:05:00 | 2024-03-01 10:18:00 | 3 |
| 87 | 1 | 2024-03-01 14:10:00 | 2024-03-01 14:40:00 | 6 |

Key Insight: The two-step CTE pattern (flag new sessions first, then cumulative sum) is the standard solution. The SUM(...) OVER (PARTITION BY user_id ORDER BY event_time) running total increments each time a new session flag fires.

Common Mistake: Using DATEDIFF without accounting for the first event per user. The first event always starts a new session. Handle this with the IS NULL condition on LAG().

What the Interviewer Is Testing: Deep window function fluency, specifically combining LAG() and a running SUM() for stateful computations.
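The flag-then-cumulative-sum pattern ports to SQLite for practice, with one dialect change: SQLite has no INTERVAL type, so the sketch below compares `julianday()` differences (30 minutes = 30/1440 of a day). The three sample timestamps are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_events (user_id INTEGER, event_time TEXT)")
conn.executemany("INSERT INTO user_events VALUES (?, ?)", [
    (42, '2024-03-01 09:00:00'),  # first event  -> new session
    (42, '2024-03-01 09:22:00'),  # 22-min gap   -> same session
    (42, '2024-03-01 10:05:00'),  # 43-min gap   -> new session
])

rows = conn.execute("""
WITH gap_flags AS (
  SELECT user_id, event_time,
         CASE
           WHEN julianday(event_time)
                - julianday(LAG(event_time) OVER (PARTITION BY user_id ORDER BY event_time))
                > 30.0 / 1440 THEN 1   -- gap longer than 30 minutes, in days
           WHEN LAG(event_time) OVER (PARTITION BY user_id ORDER BY event_time) IS NULL THEN 1
           ELSE 0
         END AS is_new_session
  FROM user_events
)
SELECT event_time,
       SUM(is_new_session) OVER (PARTITION BY user_id ORDER BY event_time) AS session_id
FROM gap_flags
ORDER BY event_time
""").fetchall()
print(rows)  # session_id: 1, 1, 2
```

Note the NULL handling: comparing NULL with `>` is not true, so the first event falls through to the `IS NULL` branch and correctly starts session 1.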

Question 5: Top Search Terms by Query Volume (Easy)

A large digital advertising platform wants to identify the 10 most-searched queries in the past 7 days to allocate more ad inventory against high-demand terms.

sql
SELECT
  query,
  COUNT(*) AS search_count
FROM user_events
WHERE event_time >= NOW() - INTERVAL '7 days'
GROUP BY query
ORDER BY search_count DESC
LIMIT 10;

Sample output

| query | search_count |
|-------|--------------|
| data science jobs | 14,820 |
| python tutorial | 12,340 |
| machine learning course | 9,870 |

Key Insight: In a real interview, a follow-up will often ask: "How would this query perform on a 500M-row table?" The answer is a covering index on (event_time, query): the range predicate scans only the last 7 days of index entries, and because the query column is also in the index, the engine never touches the base table.

What the Interviewer Is Testing: Basic aggregation fluency and, at senior levels, indexing intuition for high-cardinality event tables.

Ad Performance and Attribution

Attribution and revenue analysis questions test whether you can write multi-step CTEs and reason about join patterns across three or more tables.

Schema

sql
CREATE TABLE ad_clicks (
  click_id     INTEGER,
  user_id      INTEGER,
  ad_id        INTEGER,
  campaign_id  INTEGER,
  click_time   TIMESTAMP,
  revenue_usd  DECIMAL(10,2)
);

CREATE TABLE ad_impressions (
  impression_id INTEGER,
  user_id       INTEGER,
  ad_id         INTEGER,
  campaign_id   INTEGER,
  impression_time TIMESTAMP
);

Question 6: Revenue Per Click and Revenue Per Impression (Medium)

A social media company with ad-based revenue needs a campaign-level report showing both revenue per click (RPC) and revenue per thousand impressions (RPM) to compare campaign efficiency.

sql
WITH click_summary AS (
  SELECT
    campaign_id,
    COUNT(*)        AS total_clicks,
    SUM(revenue_usd) AS total_revenue
  FROM ad_clicks
  GROUP BY campaign_id
),
impression_summary AS (
  SELECT
    campaign_id,
    COUNT(*) AS total_impressions
  FROM ad_impressions
  GROUP BY campaign_id
)
SELECT
  c.campaign_id,
  c.total_clicks,
  i.total_impressions,
  ROUND(c.total_revenue / NULLIF(c.total_clicks, 0), 4)            AS rpc,
  ROUND(1000.0 * c.total_revenue / NULLIF(i.total_impressions, 0), 4) AS rpm
FROM click_summary c
JOIN impression_summary i USING (campaign_id)
ORDER BY rpm DESC;

Sample output

| campaign_id | total_clicks | total_impressions | rpc | rpm |
|-------------|--------------|-------------------|--------|--------|
| 201 | 5,420 | 890,000 | 1.2300 | 7.4900 |
| 202 | 3,100 | 720,000 | 0.9800 | 4.2194 |

Key Insight: RPM (revenue per mille) is the standard ad yield metric. Campaigns can have high RPC but low RPM if their CTR is low. Reporting both gives a complete picture of efficiency.

Common Mistake: Joining clicks and impressions before aggregating. This creates a many-to-many join explosion. Aggregate each table first in separate CTEs, then join the summaries.

What the Interviewer Is Testing: CTE decomposition skills and understanding of ad yield metrics.
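The join-explosion mistake is concrete enough to demonstrate with toy numbers. In the sketch below (hypothetical rows: 2 clicks totaling $3.00 and 3 impressions in one campaign), joining the raw tables first multiplies every click by every impression, tripling reported revenue; aggregating each side first gives the correct total:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ad_clicks (campaign_id INTEGER, revenue_usd REAL);
CREATE TABLE ad_impressions (campaign_id INTEGER);
INSERT INTO ad_clicks VALUES (201, 1.00), (201, 2.00);
INSERT INTO ad_impressions VALUES (201), (201), (201);
""")

# Wrong: 2 click rows x 3 impression rows = 6 joined rows; revenue triples.
wrong = conn.execute("""
    SELECT SUM(c.revenue_usd)
    FROM ad_clicks c
    JOIN ad_impressions i ON c.campaign_id = i.campaign_id
""").fetchone()[0]

# Right: collapse each table to one row per campaign, then join the summaries.
right = conn.execute("""
    SELECT cs.rev FROM
      (SELECT campaign_id, SUM(revenue_usd) AS rev
       FROM ad_clicks GROUP BY campaign_id) cs
    JOIN
      (SELECT campaign_id, COUNT(*) AS imps
       FROM ad_impressions GROUP BY campaign_id) isum
    USING (campaign_id)
""").fetchone()[0]

print(wrong, right)  # 9.0 3.0
```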

Question 7: Running Revenue Total by Campaign (Medium)

Finance needs a daily running revenue total per campaign to project month-end revenue. Write a query that shows cumulative revenue through each day.

sql
WITH daily_revenue AS (
  SELECT
    campaign_id,
    DATE_TRUNC('day', click_time)::DATE AS click_date,
    SUM(revenue_usd) AS daily_rev
  FROM ad_clicks
  GROUP BY campaign_id, DATE_TRUNC('day', click_time)
)
SELECT
  campaign_id,
  click_date,
  daily_rev,
  SUM(daily_rev) OVER (
    PARTITION BY campaign_id
    ORDER BY click_date
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  ) AS running_total
FROM daily_revenue
ORDER BY campaign_id, click_date;

Sample output

| campaign_id | click_date | daily_rev | running_total |
|-------------|------------|-----------|---------------|
| 201 | 2024-03-01 | 1,240.00 | 1,240.00 |
| 201 | 2024-03-02 | 980.50 | 2,220.50 |
| 201 | 2024-03-03 | 1,105.00 | 3,325.50 |

Key Insight: ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW is the explicit frame clause for a running total. Some databases default to this when you specify only ORDER BY, but writing it explicitly makes intent clear and prevents surprises with RANGE vs ROWS semantics.

Common Mistake: Forgetting PARTITION BY campaign_id causes the running total to span across campaigns.

What the Interviewer Is Testing: Window frame clause precision and the difference between ROWS and RANGE framing.
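The RANGE-vs-ROWS surprise only appears when the ORDER BY key has duplicates, which is why it slips past testing on daily-aggregated data. This sketch uses hypothetical rows with two entries on the same date: the default RANGE frame treats the peers as one group (both see the full peer-group sum), while the explicit ROWS frame advances one row at a time:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (d TEXT, v REAL)")
conn.executemany("INSERT INTO t VALUES (?, ?)",
                 [('2024-03-01', 10.0), ('2024-03-01', 20.0), ('2024-03-02', 5.0)])

rows = conn.execute("""
    SELECT d, v,
           SUM(v) OVER (ORDER BY d) AS range_default,
           SUM(v) OVER (ORDER BY d
                        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS rows_frame
    FROM t
    ORDER BY d
""").fetchall()
# range_default: both 2024-03-01 rows report 30.0 (peers share a frame);
# rows_frame: strictly cumulative, reaching 35.0 only on the last row.
print(rows)
```

If the daily_revenue CTE guarantees one row per (campaign_id, click_date), the two frames agree; the explicit ROWS clause documents that assumption instead of relying on it.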

Question 8: First-Touch Attribution (Hard)

A major search engine company wants to understand which marketing channel deserves credit for each conversion. First-touch attribution assigns the credit to the channel where the user first arrived.

sql
CREATE TABLE sessions (
  session_id   INTEGER,
  user_id      INTEGER,
  channel      VARCHAR(50),   -- 'organic_search', 'paid_search', 'email', 'direct'
  created_at   TIMESTAMP,
  converted    BOOLEAN
);
sql
WITH converted_users AS (
  SELECT DISTINCT user_id
  FROM sessions
  WHERE converted = TRUE
),
first_touch AS (
  SELECT
    s.user_id,
    s.channel,
    s.created_at,
    ROW_NUMBER() OVER (PARTITION BY s.user_id ORDER BY s.created_at) AS rn
  FROM sessions s
  INNER JOIN converted_users cu ON s.user_id = cu.user_id
)
SELECT
  user_id,
  channel AS first_touch_channel,
  created_at AS first_session_time
FROM first_touch
WHERE rn = 1
ORDER BY user_id;

Sample output

| user_id | first_touch_channel | first_session_time |
|---------|---------------------|--------------------|
| 123 | paid_search | 2024-03-01 08:14:00 |
| 145 | organic_search | 2024-03-01 10:22:00 |
| 172 | email | 2024-03-02 14:05:00 |

Key Insight: Filter to converted users in a CTE before the ROW_NUMBER() window. Filtering after the window computation wastes compute ranking sessions for users who never converted.

Common Mistake: Using MIN(created_at) with a GROUP BY instead of ROW_NUMBER(). The MIN approach works for finding the earliest timestamp but doesn't preserve the channel associated with that row unless you add a second join.

What the Interviewer Is Testing: Attribution logic, correct use of ROW_NUMBER() for row-level selection, and join order reasoning.
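The ROW_NUMBER() approach keeps the whole earliest row together, which is the point of the Common Mistake above. A minimal sketch with two hypothetical sessions for one user: selecting the rn = 1 row returns the channel that actually belongs to the earliest timestamp, with no second join needed:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sessions (user_id INTEGER, channel TEXT, created_at TEXT)")
conn.executemany("INSERT INTO sessions VALUES (?, ?, ?)", [
    (123, 'paid_search',    '2024-03-01 08:14:00'),  # earliest -> first touch
    (123, 'organic_search', '2024-03-02 09:00:00'),
])

# ROW_NUMBER() selects the entire first row, preserving the channel column.
row = conn.execute("""
    SELECT user_id, channel
    FROM (SELECT user_id, channel,
                 ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY created_at) AS rn
          FROM sessions)
    WHERE rn = 1
""").fetchone()
print(row)  # (123, 'paid_search')
```

The MIN(created_at) + GROUP BY alternative yields the right timestamp but needs a join back to sessions to recover which channel that timestamp came from.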

Ranking and Relevance Queries

Search and recommendation systems constantly re-rank results. SQL interview questions in this area test your ability to write ranking queries across partitions and compare ranking quality across time periods.

Schema

sql
CREATE TABLE search_results (
  query_id        INTEGER,
  result_id       INTEGER,
  position        INTEGER,   -- 1 = top result
  relevance_score DECIMAL(5,3),
  click_flag      BOOLEAN,
  recorded_date   DATE
);

Question 9: Rank Results within Each Query by Relevance Score (Medium)

A content recommendation platform surfaces 10 results per query. Product wants to know how the current ranked order (by position) compares to a relevance-first order for a set of audited queries.

sql
SELECT
  query_id,
  result_id,
  position                  AS current_rank,
  relevance_score,
  RANK() OVER (
    PARTITION BY query_id
    ORDER BY relevance_score DESC
  )                         AS relevance_rank,
  position - RANK() OVER (
    PARTITION BY query_id
    ORDER BY relevance_score DESC
  )                         AS rank_displacement
FROM search_results
WHERE recorded_date = '2024-03-01'
ORDER BY query_id, current_rank;

Sample output

| query_id | result_id | current_rank | relevance_score | relevance_rank | rank_displacement |
|----------|-----------|--------------|-----------------|----------------|-------------------|
| 1001 | 55 | 1 | 0.920 | 2 | -1 |
| 1001 | 62 | 2 | 0.950 | 1 | 1 |
| 1001 | 78 | 3 | 0.810 | 3 | 0 |

Key Insight: rank_displacement = current_rank - relevance_rank quantifies how far off the ranked list is from the ideal order. Positive values mean a result is ranked lower than it should be; negative values mean it's over-ranked. This is the building block of NDCG (Normalized Discounted Cumulative Gain) analysis.

Common Mistake: Using ORDER BY relevance_score DESC at the query level and assuming it equals the ranked order per query. Without PARTITION BY query_id, RANK() treats the whole table as a single partition.

What the Interviewer Is Testing: RANK() vs DENSE_RANK() distinction, window partition logic, and search quality intuition.

Question 10: Comparing Search Result Positions Across Two Time Periods (Hard)

An A/B test ran a new ranking model against the control. Write a query that compares the average position of each result_id in the treatment vs. the control period, and flags results whose position changed by more than 2 slots.

sql
WITH control_period AS (
  SELECT
    result_id,
    AVG(position) AS avg_pos_control
  FROM search_results
  WHERE recorded_date BETWEEN '2024-02-01' AND '2024-02-14'
  GROUP BY result_id
),
treatment_period AS (
  SELECT
    result_id,
    AVG(position) AS avg_pos_treatment
  FROM search_results
  WHERE recorded_date BETWEEN '2024-02-15' AND '2024-02-28'
  GROUP BY result_id
)
SELECT
  c.result_id,
  ROUND(c.avg_pos_control,   2) AS control_avg_pos,
  ROUND(t.avg_pos_treatment, 2) AS treatment_avg_pos,
  ROUND(t.avg_pos_treatment - c.avg_pos_control, 2) AS position_delta,
  CASE
    WHEN ABS(t.avg_pos_treatment - c.avg_pos_control) > 2
      THEN 'significant_shift'
    ELSE 'stable'
  END AS shift_flag
FROM control_period c
JOIN treatment_period t ON c.result_id = t.result_id
ORDER BY ABS(t.avg_pos_treatment - c.avg_pos_control) DESC;

Sample output

| result_id | control_avg_pos | treatment_avg_pos | position_delta | shift_flag |
|-----------|-----------------|-------------------|----------------|------------|
| 62 | 4.10 | 1.30 | -2.80 | significant_shift |
| 78 | 2.50 | 4.90 | 2.40 | significant_shift |
| 55 | 1.20 | 1.60 | 0.40 | stable |

Key Insight: A self-join on aggregated CTEs is the cleanest way to compare the same entity across two time windows. The alternative (pivoting in a single query with CASE WHEN) works but is harder to read and extend.

Common Mistake: Using an inner join without accounting for result IDs that appear in only one period — they are silently dropped. For search ranking audits, results absent from one period are themselves anomalous and worth surfacing separately, which a FULL OUTER JOIN with NULL checks makes possible.

What the Interviewer Is Testing: CTE-based temporal comparison, self-join patterns, and A/B test result analysis in SQL.

Anomaly and Quality Detection

Ad fraud and search quality teams rely heavily on SQL for anomaly detection. These questions test whether you can use window functions to spot statistical outliers and bot-like click patterns.

Schema

sql
CREATE TABLE click_log (
  click_id   INTEGER,
  user_id    INTEGER,
  ad_id      INTEGER,
  ip_address VARCHAR(45),
  click_time TIMESTAMP
);

Question 11: Detecting Suspiciously Fast Click Patterns (Medium)

A large digital advertising platform's trust team wants to flag user-ad pairs with at least 3 rapid clicks, where a rapid click lands less than 5 minutes after the preceding click on the same ad — a signal of invalid traffic.

sql
WITH click_gaps AS (
  SELECT
    user_id,
    ad_id,
    click_time,
    LAG(click_time) OVER (
      PARTITION BY user_id, ad_id
      ORDER BY click_time
    ) AS prev_click_time
  FROM click_log
),
flagged_gaps AS (
  SELECT
    user_id,
    ad_id,
    click_time,
    CASE
      WHEN click_time - prev_click_time < INTERVAL '5 minutes'
        THEN 1
      ELSE 0
    END AS rapid_click
  FROM click_gaps
)
SELECT
  user_id,
  ad_id,
  COUNT(*)          AS total_clicks,
  SUM(rapid_click)  AS rapid_click_count
FROM flagged_gaps
GROUP BY user_id, ad_id
HAVING SUM(rapid_click) >= 3
ORDER BY rapid_click_count DESC;

Sample output

| user_id | ad_id | total_clicks | rapid_click_count |
|---------|-------|--------------|-------------------|
| 990 | 177 | 18 | 9 |
| 440 | 233 | 12 | 6 |

Key Insight: Partitioning LAG() by both user_id and ad_id is critical. You only care about repeated rapid clicks on the same ad, not rapid clicks across different ads.

Common Mistake: Reaching for a frame like COUNT(*) OVER (PARTITION BY user_id, ad_id ORDER BY click_time ROWS BETWEEN INTERVAL '5 minutes' PRECEDING AND CURRENT ROW). ROWS frames count rows, not time, so interval bounds in a ROWS frame are invalid syntax in most SQL dialects. The LAG-based gap approach is the standard.

What the Interviewer Is Testing: LAG() for time-series gap analysis, fraud detection intuition, and partition specification precision.

Question 12: Date-Based Anomaly Detection with LAG (Hard)

A content recommendation platform monitors daily click volume per campaign. Write a query that flags any day where click volume dropped more than 30% compared to the prior day, which may indicate a pipeline failure or targeting issue. This query joins the click_log table (defined above) with the ad_clicks table (from the Ad Performance section) to obtain campaign_id.

sql
WITH daily_clicks AS (
  SELECT
    campaign_id,
    DATE_TRUNC('day', click_time)::DATE AS click_date,
    COUNT(*) AS daily_click_count
  FROM click_log
  JOIN ad_clicks USING (click_id)
  GROUP BY campaign_id, DATE_TRUNC('day', click_time)::DATE
),
with_prior_day AS (
  SELECT
    campaign_id,
    click_date,
    daily_click_count,
    LAG(daily_click_count) OVER (
      PARTITION BY campaign_id
      ORDER BY click_date
    ) AS prior_day_clicks
  FROM daily_clicks
)
SELECT
  campaign_id,
  click_date,
  daily_click_count,
  prior_day_clicks,
  ROUND(
    100.0 * (daily_click_count - prior_day_clicks) / NULLIF(prior_day_clicks, 0),
  1) AS pct_change
FROM with_prior_day
WHERE prior_day_clicks IS NOT NULL
  AND daily_click_count < prior_day_clicks * 0.70
ORDER BY pct_change;

Sample output

| campaign_id | click_date | daily_click_count | prior_day_clicks | pct_change |
|-------------|------------|-------------------|------------------|------------|
| 201 | 2024-03-15 | 820 | 3,400 | -75.9 |
| 305 | 2024-03-18 | 1,100 | 2,200 | -50.0 |

Key Insight: Filtering prior_day_clicks IS NOT NULL removes the first day per campaign, which has no prior day to compare against. Without this filter, day 1 always shows a NULL percentage change that can confuse downstream reports.

Common Mistake: Comparing absolute drops instead of percentage drops. A drop from 100 to 60 clicks is alarming (40% drop). A drop from 100,000 to 99,940 is noise. Percentage-based thresholds scale correctly across campaign sizes.

What the Interviewer Is Testing: LAG() for trend analysis, percentage change computation, and data quality monitoring thinking.

Complex Multi-Table Pipeline Queries

The hardest questions in search and ad platform interviews involve chaining 3 to 5 tables through CTEs. These questions are designed to see whether you can decompose a business problem into clean, testable SQL steps.

Schema

sql
CREATE TABLE users (
  user_id     INTEGER,
  signup_date DATE,
  country     VARCHAR(50)
);

CREATE TABLE user_sessions (
  session_id  INTEGER,
  user_id     INTEGER,
  channel     VARCHAR(50),
  session_start TIMESTAMP
);

CREATE TABLE search_events (
  event_id    INTEGER,
  session_id  INTEGER,
  query       VARCHAR(255),
  result_clicked BOOLEAN,
  click_revenue  DECIMAL(10,2)
);

Question 13: End-to-End Engagement Funnel Across Three Tables (Hard)

The product team wants a country-level funnel report: how many users have at least one session, and of those, how many generated at least one click with revenue. The output should show conversion rate from session to click-with-revenue.

sql
WITH session_users AS (
  SELECT DISTINCT
    u.user_id,
    u.country,
    us.session_id
  FROM users u
  JOIN user_sessions us ON u.user_id = us.user_id
),
revenue_sessions AS (
  SELECT DISTINCT
    su.user_id,
    su.country
  FROM session_users su
  JOIN search_events se ON su.session_id = se.session_id
  WHERE se.result_clicked = TRUE
    AND se.click_revenue > 0
),
country_funnel AS (
  SELECT
    su.country,
    COUNT(DISTINCT su.user_id)   AS users_with_sessions,
    COUNT(DISTINCT rs.user_id)   AS users_with_revenue
  FROM session_users su
  LEFT JOIN revenue_sessions rs ON su.user_id = rs.user_id
  GROUP BY su.country
)
SELECT
  country,
  users_with_sessions,
  users_with_revenue,
  ROUND(
    100.0 * users_with_revenue / NULLIF(users_with_sessions, 0),
  2) AS session_to_revenue_rate
FROM country_funnel
ORDER BY session_to_revenue_rate DESC;

Sample output

| country | users_with_sessions | users_with_revenue | session_to_revenue_rate |
|---------|---------------------|--------------------|-------------------------|
| United States | 48,200 | 12,100 | 25.10 |
| United Kingdom | 9,800 | 2,300 | 23.47 |
| Germany | 7,100 | 1,540 | 21.69 |

Key Insight: Using COUNT(DISTINCT user_id) at each funnel stage prevents double-counting users who appear in multiple sessions or generate multiple revenue events. This is non-negotiable for funnel accuracy.

Common Mistake: Adding a WHERE rs.user_id IS NOT NULL filter (which turns the LEFT JOIN into an inner join) drops users without revenue sessions from the denominator entirely, making the rate calculation impossible.

What the Interviewer Is Testing: Multi-stage CTE pipeline design, LEFT JOIN vs inner join reasoning, and funnel metric construction.

Question 14: Average CTR Campaign Performance with Multi-Dimensional Join (Hard)

A major search engine company wants to calculate the average CTR by campaign, advertiser, and region by joining three tables: campaigns, ads, and an impressions table.

sql
CREATE TABLE campaigns (
  campaign_id   INTEGER,
  advertiser_id INTEGER,
  ad_id         INTEGER
);

CREATE TABLE ads (
  ad_id     INTEGER,
  region_id INTEGER
);

CREATE TABLE impressions (
  ad_id      INTEGER,
  event_time TIMESTAMP,
  is_clicked INTEGER  -- 1 for click, 0 for impression without click
);
sql
SELECT
  c.campaign_id,
  c.advertiser_id,
  a.region_id,
  COUNT(*)                    AS total_impressions,
  SUM(i.is_clicked)           AS total_clicks,
  ROUND(AVG(i.is_clicked::FLOAT) * 100, 2) AS avg_ctr
FROM campaigns c
JOIN ads a        ON c.ad_id = a.ad_id
JOIN impressions i ON a.ad_id = i.ad_id
GROUP BY c.campaign_id, c.advertiser_id, a.region_id
ORDER BY avg_ctr DESC;

Sample output

| campaign_id | advertiser_id | region_id | total_impressions | total_clicks | avg_ctr |
|-------------|---------------|-----------|-------------------|--------------|---------|
| 10 | 3 | 2 | 48,200 | 3,850 | 7.99 |
| 12 | 5 | 1 | 31,100 | 2,100 | 6.75 |

Key Insight: AVG(is_clicked::FLOAT) is mathematically equivalent to SUM(is_clicked) / COUNT(*) when the column contains 0/1 values. Casting to FLOAT before AVG avoids integer truncation.

Common Mistake: Joining impressions to campaigns directly on ad_id without going through the ads table. This bypasses the region data, making the multi-dimensional breakdown impossible.

What the Interviewer Is Testing: Three-table join chain logic, multi-level GROUP BY, and equivalent formulations of CTR.

Question 15: Bid Price Trend Analysis with Running Average (Medium)

An advertising auction platform wants to monitor whether average bid prices are trending up or down over time per campaign. Write a query showing the 7-day trailing average bid price per campaign.

sql
CREATE TABLE campaign_bids (
  bid_id       INTEGER,
  campaign_id  INTEGER,
  bid_time     TIMESTAMP,
  bid_price    DECIMAL(10,4)
);
sql
WITH daily_avg_bids AS (
  SELECT
    campaign_id,
    DATE_TRUNC('day', bid_time)::DATE AS bid_date,
    AVG(bid_price) AS daily_avg_bid
  FROM campaign_bids
  GROUP BY campaign_id, DATE_TRUNC('day', bid_time)
)
SELECT
  campaign_id,
  bid_date,
  ROUND(daily_avg_bid, 4) AS daily_avg_bid,
  ROUND(
    AVG(daily_avg_bid) OVER (
      PARTITION BY campaign_id
      ORDER BY bid_date
      ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ),
  4) AS trailing_7d_avg
FROM daily_avg_bids
ORDER BY campaign_id, bid_date;

Sample output

| campaign_id | bid_date | daily_avg_bid | trailing_7d_avg |
|-------------|------------|---------------|-----------------|
| 567 | 2023-12-01 | 2.5000 | 2.5000 |
| 567 | 2023-12-02 | 2.6000 | 2.5500 |
| 567 | 2023-12-03 | 2.4000 | 2.5000 |
| 567 | 2023-12-04 | 2.8000 | 2.5750 |

Key Insight: ROWS BETWEEN 6 PRECEDING AND CURRENT ROW includes the current row plus 6 prior rows, giving a 7-row trailing window. For dates with gaps (weekends, holidays), this counts 7 actual data points, not 7 calendar days, which may or may not match your business requirement.

Common Mistake: Using RANGE BETWEEN INTERVAL '6 days' PRECEDING AND CURRENT ROW for a true calendar-day window. RANGE requires an ordered numeric or date key and computes based on value distance, not row count. Choose the right frame type based on whether you want "7 data points" or "7 calendar days."

What the Interviewer Is Testing: Window frame mechanics (ROWS vs RANGE), aggregation before window computation, and time-series analysis judgment.
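The "7 data points vs 7 calendar days" distinction shows up as soon as the dates have a gap. This sketch shrinks the window to 2 rows for brevity and uses a hypothetical series with a weekend-style gap: the ROWS frame happily averages the last two data points even though they are three calendar days apart:

```python
import sqlite3

# Hypothetical daily averages with a gap between Dec 2 and Dec 5.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily (bid_date TEXT, avg_bid REAL)")
conn.executemany("INSERT INTO daily VALUES (?, ?)",
                 [('2023-12-01', 2.0), ('2023-12-02', 4.0), ('2023-12-05', 6.0)])

rows = conn.execute("""
    SELECT bid_date,
           AVG(avg_bid) OVER (ORDER BY bid_date
                              ROWS BETWEEN 1 PRECEDING AND CURRENT ROW) AS trailing_2pt
    FROM daily
    ORDER BY bid_date
""").fetchall()
print(rows)
# [('2023-12-01', 2.0), ('2023-12-02', 3.0), ('2023-12-05', 5.0)]
# Dec 5 averages Dec 2 and Dec 5 -- the last two rows, despite the 3-day gap.
```

A true calendar-day window needs either a RANGE frame with an interval bound (where the dialect supports it) or a calendar spine joined in so every date has a row.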

Conclusion

SQL at search and ad platform companies is not a test of syntax recall. It's a test of whether you understand the data structures that power ad auctions, relevance systems, and user funnel measurement. The 15 questions here cover the core patterns that appear most consistently in these rounds: conditional aggregation for CTR, LAG-based sessionization and anomaly detection, ROW_NUMBER() for top-N-per-partition selection, CTE pipelines for multi-table attribution, and window frame precision for time-series analysis.

The single most common failure mode in these interviews is writing queries that produce correct results on clean data but break on real-world edge cases: integer division returning zero, NULLIF missing from a denominator, RANK() returning extra rows where ROW_NUMBER() was needed, or a LEFT JOIN replaced by an inner join that silently drops users from a funnel. Study the "Common Mistake" callouts as seriously as the solutions themselves.

If you want more practice beyond these 15, DataLemur's question bank and StrataScratch both have regularly updated problems from companies with large-scale advertising and search systems. Practice writing queries cold, without looking at hints, and then read the editorial to learn the faster approach.

For deeper preparation, the LDS articles on data science interview strategy and SQL window functions explained cover the underlying concepts in more depth.

Career Q&A

How much SQL is expected in a data science interview at a search or ad platform company versus a general tech company?

Significantly more. Companies running ad auctions and relevance systems have SQL embedded in their measurement stack at every level, from daily CTR dashboards to A/B test analysis pipelines. Expect at least one SQL round dedicated to business metric calculation, and often a second round focused on product analytics that requires SQL for exploratory work.

Should I learn PostgreSQL, BigQuery, or Spark SQL for these interviews?

PostgreSQL syntax is the most common for whiteboard and online assessments because it's clean, well-known, and interviewers can evaluate logic without dialect quirks getting in the way. BigQuery SQL is worth knowing if the role description mentions GCP or large-scale analytics. The core concepts (window functions, CTEs, joins) transfer across dialects.

What is the most important SQL concept to master for ad platform roles specifically?

Window functions, and particularly the combination of LAG() with a running SUM() for sessionization. This pattern appears in user journey analysis, fraud detection, and cohort construction. Candidates who can write it confidently, partition correctly, and explain what each clause does are noticeably better prepared than those who know only GROUP BY aggregations.

How do interviewers evaluate SQL solutions beyond correctness?

They look for efficiency (avoiding full scans when an indexed range predicate would work), defensive coding (NULLIF on denominators, handling NULL from LAG on the first row), and readability (CTEs named for what they compute, not generic names like cte1). At senior levels, they often ask you to explain how the query would perform on 10 billion rows.

Is it worth memorizing specific SQL question answers from DataLemur or StrataScratch?

Memorizing solutions is less useful than internalizing patterns. If you understand why ROW_NUMBER() OVER (PARTITION BY campaign ORDER BY ctr DESC) followed by WHERE rn <= 3 gives top-N-per-group, you can apply that pattern to any top-N variant regardless of the business domain. Grinding problems to build pattern recognition is the goal, not rote answer recall.

How should I structure my answer in a SQL interview when I'm given a complex multi-table schema?

Start by talking through the tables out loud before writing any SQL. Identify the grain (one row per what?), the join keys, and the metric you're computing. Write the query in CTE stages from the innermost aggregation outward. Interviewers consistently rate structured thinkers higher than candidates who jump immediately to code and backtrack several times.

What's a realistic timeline to prepare SQL for a data science interview at a top search or ad platform?

Four to six weeks of consistent practice: two weeks on core aggregation and joins, two weeks on window functions and CTEs, and one to two weeks on full problem sets from DataLemur and StrataScratch filtered to hard difficulty. Do at least 10 problems completely cold (no hints) in the final week to simulate actual interview pressure.

Explore all career paths