You cleared the resume screen. The recruiter call went well. Now you're staring at a shared code editor with 45 minutes on the clock and a prompt that reads: "Given this DataFrame of user events, find the top 3 customers by revenue per month." Most candidates freeze, not because they don't know Pandas, but because they never practiced the specific patterns that come up in technical screens.
This guide covers the 14 most commonly tested Pandas patterns at large technology companies, drawn from real interview prep platforms and community reports.
The questions in this article are drawn from our research of publicly available interview prep platforms and community discussions, including DataLemur, StrataScratch, LeetCode, and forums such as r/datascience and Blind. These represent patterns that data scientists have reported encountering at large technology companies. We do not claim these are proprietary questions from any specific organization.
DataFrame Fundamentals Under Interview Pressure
These patterns trip up more candidates than any other topic. They look easy in documentation but produce subtle bugs under timed conditions.
Problem 1: loc vs iloc When Index Is Not the Default Range
A large e-commerce platform commonly asks candidates to filter rows using .loc[] after operations that reset or scramble the index. The trap: after a sort or merge, numeric positions and label-based positions diverge.
import pandas as pd
import numpy as np
# Sample DataFrame — product sales log
df = pd.DataFrame({
"product_id": [101, 205, 309, 412, 501],
"category": ["electronics", "clothing", "electronics", "food", "clothing"],
"sales": [8400, 1200, 3300, 950, 4700],
"returns": [120, 45, 0, 80, 210]
})
# Sort by sales descending — index is now 0, 4, 2, 1, 3
df_sorted = df.sort_values("sales", ascending=False)
# iloc: position-based — gets rows at positions 0, 1, 2 regardless of index
top3_iloc = df_sorted.iloc[:3]
# loc: label-based — a slice on a scrambled integer index selects by label, not rank, giving surprising results
# Safe pattern: reset_index first, then use loc
df_sorted_reset = df_sorted.reset_index(drop=True)
top3_loc = df_sorted_reset.loc[:2]
print(top3_iloc)
# product_id category sales returns
# 0 101 electronics 8400 120
# 4 501 clothing 4700 210
# 2 309 electronics 3300 0
# Boolean indexing — always safe with loc
high_value = df.loc[(df["sales"] > 3000) & (df["returns"] < 150)]
print(high_value)
# product_id category sales returns
# 0 101 electronics 8400 120
# 2 309 electronics 3300 0
The key insight: .iloc cares about physical row position; .loc cares about index label. After any sort, merge, or filter, these diverge. Reset the index with reset_index(drop=True) before any position-critical operation, or use boolean masks instead of positional slices.
Common Mistake: Writing df[0:3] after a sort. An integer slice in plain brackets is positional, equivalent to df.iloc[0:3], so it happens to return the right rows here. But it blurs the label/position distinction: switch to df.loc[0:3] on the sorted frame and you get whatever rows sit between the labels 0 and 3, not the top three. Be explicit: .iloc for positions, .loc for labels.
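To make the positional-vs-label distinction concrete, here is a minimal sketch (toy data, not the sales frame above) showing what happens when you filter first and then index by label:

```python
import pandas as pd

df = pd.DataFrame({"sales": [8400, 1200, 3300, 950, 4700]})

# Filter first — the surviving rows keep their original labels (0, 2, 4)
filtered = df[df["sales"] > 2000]

# Positional slicing always returns the first rows physically present
print(filtered.iloc[0:2].index.tolist())  # [0, 2]

# Label-based access with an old label surprises you:
# label 1 was filtered out, so filtered.loc[1] raises KeyError
try:
    filtered.loc[1]
except KeyError:
    print("label 1 was filtered out")
```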
Problem 2: Selecting Multiple Columns and the Efficiency Trap
A cloud data platform asks candidates to select a subset of columns from a 50-column DataFrame. Interviewers watch whether you use the list syntax or chain unnecessary operations.
import pandas as pd
df = pd.DataFrame({
"user_id": [1, 2, 3, 4],
"age": [25, 34, 28, 41],
"city": ["NYC", "SF", "LA", "Chicago"],
"revenue": [450.0, 890.0, 120.0, 670.0],
"sessions": [12, 8, 3, 19],
"churn_flag": [0, 0, 1, 0]
})
# Select multiple columns — pass a list inside brackets
cols_needed = ["user_id", "revenue", "churn_flag"]
subset = df[cols_needed]
# One-liner version
subset = df[["user_id", "revenue", "churn_flag"]]
print(subset)
# user_id revenue churn_flag
# 0 1 450.0 0
# 1 2 890.0 0
# 2 3 120.0 1
# 3 4 670.0 0
# Dynamic column selection — useful when column names come from config
metric_cols = [c for c in df.columns if c not in ["user_id", "city"]]
print(df[metric_cols])
Key Insight: df["col"] returns a Series. df[["col"]] returns a single-column DataFrame. This distinction matters when passing output to functions that expect a DataFrame, which is a source of type errors that trip candidates up in live sessions.
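A two-line sketch (hypothetical column) makes the type distinction visible:

```python
import pandas as pd

df = pd.DataFrame({"revenue": [450.0, 890.0]})

single = df["revenue"]    # Series — 1-D, has .name but no .columns
framed = df[["revenue"]]  # DataFrame — 2-D, has .columns

print(type(single).__name__)  # Series
print(type(framed).__name__)  # DataFrame
```

scikit-learn estimators, for example, expect 2-D input for features, so passing a Series where a DataFrame is expected is a common live-coding type error.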
GroupBy and Aggregation Patterns
GroupBy questions are the bread and butter of data science interviews. Every major platform uses them. The differentiating factor is whether you can handle multi-level aggregations and group-level transformations without collapsing the DataFrame.
Problem 3: Multi-Column GroupBy with Named Aggregations
A major social media company asks candidates to compute, per region and product category, the total revenue, average order value, and count of unique customers, all in a single operation.
import pandas as pd
import numpy as np
orders = pd.DataFrame({
"order_id": range(1, 9),
"customer_id": [101, 102, 101, 103, 104, 102, 103, 105],
"region": ["East", "West", "East", "West", "East", "West", "East", "West"],
"category": ["Tech", "Tech", "Home", "Home", "Tech", "Home", "Tech", "Tech"],
"order_value": [250, 180, 95, 430, 310, 75, 520, 200]
})
summary = (
orders
.groupby(["region", "category"])
.agg(
total_revenue=("order_value", "sum"),
avg_order_value=("order_value", "mean"),
unique_customers=("customer_id", "nunique"),
order_count=("order_id", "count")
)
.reset_index()
)
print(summary)
# region category total_revenue avg_order_value unique_customers order_count
# 0 East Home 95 95.00 1 1
# 1 East Tech 1080 360.00 3 3
# 2 West Home 505 252.50 2 2
# 3 West Tech 380 190.00 2 2
The named aggregation syntax, column_name=(source_column, function), was added in Pandas 0.25 and is now the expected style in interviews. Using the older {"col": ["sum", "mean"]} syntax produces a MultiIndex column header that requires an extra flattening step.
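For contrast, here is a small sketch (toy data) of the older dictionary syntax and the MultiIndex flattening step it forces on you:

```python
import pandas as pd

orders = pd.DataFrame({
    "region": ["East", "West", "East", "West"],
    "order_value": [250, 180, 95, 430]
})

# Older dict syntax — produces a two-level column header
old_style = orders.groupby("region").agg({"order_value": ["sum", "mean"]})
print(old_style.columns.tolist())
# [('order_value', 'sum'), ('order_value', 'mean')]

# Extra flattening step that named aggregation avoids
old_style.columns = ["_".join(col) for col in old_style.columns]
old_style = old_style.reset_index()
print(old_style.columns.tolist())
# ['region', 'order_value_sum', 'order_value_mean']
```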
Key Insight: Always .reset_index() after a GroupBy aggregation when the result feeds into a merge or further transformation. Group keys become index columns by default, which causes silent NaN cascades in subsequent joins.
Problem 4: GroupBy Transform for Adding Group Stats Without Collapsing Rows
This is the most commonly missed pattern in Pandas interviews. A search engine company asks: "For each order, add a column showing what percentage that order represents of its region's total revenue."
The trap: regular .agg() collapses the DataFrame to one row per group. You need .transform() to broadcast the group-level result back to every original row.
import pandas as pd
orders = pd.DataFrame({
"order_id": range(1, 9),
"customer_id": [101, 102, 101, 103, 104, 102, 103, 105],
"region": ["East", "West", "East", "West", "East", "West", "East", "West"],
"order_value": [250, 180, 95, 430, 310, 75, 520, 200]
})
# transform broadcasts the group sum back to every row
orders["region_total"] = orders.groupby("region")["order_value"].transform("sum")
orders["pct_of_region"] = (orders["order_value"] / orders["region_total"] * 100).round(1)
print(orders[["order_id", "region", "order_value", "region_total", "pct_of_region"]])
# order_id region order_value region_total pct_of_region
# 0 1 East 250 1175 21.3
# 1 2 West 180 885 20.3
# 2 3 East 95 1175 8.1
# 3 4 West 430 885 48.6
# 4 5 East 310 1175 26.4
# 5 6 West 75 885 8.5
# 6 7 East 520 1175 44.3
# 7 8 West 200 885 22.6
Notice that the output has the same number of rows as the input (8 rows, not 2). That is the entire point of transform. It is the Pandas equivalent of a SQL window function.
Common Mistake: Using .agg() and then merging the result back to the original DataFrame. This works but adds unnecessary complexity. Any time you hear "for each row, add a column showing the group-level statistic," reach for transform.
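For comparison, a sketch (toy data) of the agg-then-merge route that the mistake describes — it produces the same column in three steps instead of one:

```python
import pandas as pd

orders = pd.DataFrame({
    "region": ["East", "West", "East", "West"],
    "order_value": [250, 180, 95, 430]
})

# Step 1: collapse to one row per region
totals = orders.groupby("region", as_index=False)["order_value"].sum()
totals = totals.rename(columns={"order_value": "region_total"})

# Step 2: merge the group stat back onto every original row
merged = orders.merge(totals, on="region", how="left")
print(merged["region_total"].tolist())  # [345, 610, 345, 610]

# One line with transform replaces all of the above
orders["region_total"] = orders.groupby("region")["order_value"].transform("sum")
```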
Problem 5: Conditional Aggregation (SQL CASE WHEN in Pandas)
A financial technology platform asks: "Count orders above $300 and orders at 300 or below separately, grouped by region."
import pandas as pd
import numpy as np
orders = pd.DataFrame({
"region": ["East", "West", "East", "West", "East", "West", "East", "West"],
"order_value": [250, 180, 95, 430, 310, 75, 520, 200]
})
# Method 1: apply a lambda that does conditional counting
result = orders.groupby("region")["order_value"].agg(
high_value_orders=lambda x: (x > 300).sum(),
low_value_orders=lambda x: (x <= 300).sum(),
total_orders="count"
).reset_index()
print(result)
# region high_value_orders low_value_orders total_orders
# 0 East 2 2 4
# 1 West 1 3 4
# Method 2: vectorized — create binary flag columns first, then aggregate
orders["is_high_value"] = (orders["order_value"] > 300).astype(int)
orders["is_low_value"] = (orders["order_value"] <= 300).astype(int)
result2 = orders.groupby("region").agg(
high_value_orders=("is_high_value", "sum"),
low_value_orders=("is_low_value", "sum")
).reset_index()
Method 2 is preferred in interviews because it separates the feature engineering step from the aggregation step, which is easier to test and debug.
Merging and Joining DataFrames
Merge questions test whether you understand join semantics, not just syntax. The most common failure mode is silently creating a many-to-many join with more rows than expected.
Problem 6: Inner vs Outer Merge Behavior
A ride-sharing analytics team asks you to combine a riders table with a trips table. Some riders have no trips yet, and some trips have missing rider data due to a data pipeline issue.
import pandas as pd
riders = pd.DataFrame({
"rider_id": [1, 2, 3, 4],
"city": ["NYC", "LA", "SF", "Chicago"]
})
trips = pd.DataFrame({
"trip_id": [101, 102, 103, 104, 105],
"rider_id": [1, 2, 2, 5, 3], # rider 4 has no trips; rider 5 not in riders
"fare": [12.50, 8.00, 15.00, 9.50, 22.00]
})
# Inner join — only matching rider_ids (loses rider 4 and rider_id 5 from trips)
inner = pd.merge(riders, trips, on="rider_id", how="inner")
print(f"Inner merge rows: {len(inner)}") # Inner merge rows: 4
# Left join — keeps all riders, NaN for riders with no trips
left = pd.merge(riders, trips, on="rider_id", how="left")
print(f"Left merge rows: {len(left)}") # Left merge rows: 5
# Outer join — keeps everything, NaN on both sides for unmatched rows
outer = pd.merge(riders, trips, on="rider_id", how="outer")
print(f"Outer merge rows: {len(outer)}") # Outer merge rows: 6
# Check for unmatched rows using indicator
merged = pd.merge(riders, trips, on="rider_id", how="outer", indicator=True)
unmatched = merged[merged["_merge"] != "both"]
print(unmatched[["rider_id", "_merge"]])
# rider_id _merge
# 4 4 left_only
# 5 5 right_only
Key Insight: Always check len(result) against your expectations after a merge. If the result has more rows than the larger input, you have an unintentional many-to-many join caused by duplicate keys on one or both sides.
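A minimal sketch (hypothetical fact/dimension tables) of the row-count check in action:

```python
import pandas as pd

facts = pd.DataFrame({"key": [1, 2, 3], "value": [10, 20, 30]})
dims = pd.DataFrame({"key": [1, 1, 2, 3], "label": ["a", "b", "c", "d"]})  # key 1 duplicated

merged = facts.merge(dims, on="key", how="left")

# 3 input rows became 4 — the duplicate key silently multiplied rows
print(len(facts), "->", len(merged))  # 3 -> 4

# The defensive check interviewers like to see
if len(merged) != len(facts):
    print("merge fanned out: duplicate keys on the right side")
```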
Problem 7: Merging on Multiple Keys and Detecting Duplicates
A large e-commerce platform asks candidates to join a sales table with a pricing table on both product_id and region. This is a multi-key merge, and the many-to-many problem becomes more likely when combining keys.
import pandas as pd
sales = pd.DataFrame({
"product_id": [1, 1, 2, 2, 3],
"region": ["East", "West", "East", "East", "West"], # duplicate (2, East)
"units_sold": [100, 80, 200, 150, 90]
})
pricing = pd.DataFrame({
"product_id": [1, 1, 2, 3],
"region": ["East", "West", "East", "West"],
"unit_price": [25.0, 27.0, 15.0, 40.0]
})
# Check for duplicates on merge keys BEFORE merging
sales_dupes = sales.duplicated(subset=["product_id", "region"]).sum()
pricing_dupes = pricing.duplicated(subset=["product_id", "region"]).sum()
print(f"Sales duplicates: {sales_dupes}") # Sales duplicates: 1
print(f"Pricing duplicates: {pricing_dupes}") # Pricing duplicates: 0
# The duplicate in sales is expected (two transactions for product 2 in East)
# Validate() parameter catches unexpected many-to-many joins
result = pd.merge(
sales, pricing,
on=["product_id", "region"],
how="left",
validate="many_to_one" # raises if pricing has duplicates on merge keys
)
result["revenue"] = result["units_sold"] * result["unit_price"]
print(result)
# product_id region units_sold unit_price revenue
# 0 1 East 100 25.0 2500.0
# 1 1 West 80 27.0 2160.0
# 2 2 East 200 15.0 3000.0
# 3 2 East 150 15.0 2250.0
# 4 3 West 90 40.0 3600.0
The validate parameter is little known but highly impressive in an interview. It enforces join cardinality at runtime, turning a silent data quality bug into a loud, catchable error.
Common Mistake: Not checking for duplicates before a merge. In a real-world pipeline, duplicate keys in a dimension table can silently multiply your fact table rows by 2x or 10x, and you won't notice until an aggregation produces unexpectedly large numbers.
Apply vs Vectorization: What Interviewers Actually Care About
This is where mid-level candidates differentiate themselves from senior ones. Every interviewer knows that .apply() is slow. What they want to see is whether you reach for it reflexively or whether you know when vectorized alternatives exist.
Problem 8: Replace Apply with Vectorized Operations
A major social media company's data engineering screen includes this prompt: "You have a DataFrame with a text column. Add a column indicating whether the text contains the word 'error'. The DataFrame has 10 million rows."
import pandas as pd
import numpy as np
import time
# Simulate a moderately large DataFrame
np.random.seed(42)
words = ["error in pipeline", "success", "warning issued", "error code 404",
"completed normally", "error detected", "ok", "failed with error"]
df = pd.DataFrame({
"log_message": np.random.choice(words, size=100_000)
})
# --- Anti-pattern: apply with a lambda ---
start = time.time()
df["has_error_slow"] = df["log_message"].apply(lambda x: "error" in x.lower())
slow_time = time.time() - start
# --- Vectorized approach: str.contains ---
start = time.time()
df["has_error_fast"] = df["log_message"].str.contains("error", case=False, na=False)
fast_time = time.time() - start
print(f"apply time: {slow_time:.4f}s") # apply time: 0.1823s
print(f"vectorized time: {fast_time:.4f}s") # vectorized time: 0.0041s
print(f"Speedup: {slow_time / fast_time:.1f}x") # Speedup: 44.5x
# Verify both produce identical results
assert df["has_error_slow"].equals(df["has_error_fast"])
The speedup scales with DataFrame size. On 10 million rows, the gap between apply and str.contains exceeds 100x in typical benchmarks.
Performance Note: The str accessor, np.where(), arithmetic operators, and boolean comparisons are all vectorized. Before writing .apply(), ask yourself: "Can I express this operation as a column-level transformation using built-in operators or np.where?"
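As a quick illustration of that question, a two-line np.where sketch (toy revenue column) that replaces what many candidates would write as a row-wise apply:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"revenue": [450.0, 890.0, 120.0, 670.0]})

# Vectorized conditional — no .apply needed
df["tier"] = np.where(df["revenue"] >= 500, "high", "low")
print(df["tier"].tolist())  # ['low', 'high', 'low', 'high']
```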
Problem 9: When Apply Is Acceptable
Apply is appropriate when the logic cannot be expressed as a vectorized operation, typically when you need to reference multiple columns in a row-wise computation with conditional branches.
import pandas as pd
import numpy as np
employees = pd.DataFrame({
"name": ["Alice", "Bob", "Carla", "Dave"],
"base_salary": [90000, 75000, 110000, 68000],
"years_exp": [6, 2, 9, 1],
"dept": ["Engineering", "Sales", "Engineering", "Sales"]
})
# Complex multi-column logic with branching — apply is reasonable here
def compute_bonus(row):
if row["dept"] == "Engineering" and row["years_exp"] >= 5:
return row["base_salary"] * 0.20
elif row["dept"] == "Sales":
return row["base_salary"] * 0.15
else:
return row["base_salary"] * 0.05
employees["bonus"] = employees.apply(compute_bonus, axis=1)
# However — np.select is still faster for this pattern at scale
conditions = [
(employees["dept"] == "Engineering") & (employees["years_exp"] >= 5),
employees["dept"] == "Sales"
]
choices = [
employees["base_salary"] * 0.20,
employees["base_salary"] * 0.15
]
employees["bonus_vectorized"] = np.select(conditions, choices,
default=employees["base_salary"] * 0.05)
print(employees[["name", "bonus", "bonus_vectorized"]])
# name bonus bonus_vectorized
# 0 Alice 18000.0 18000.0
# 1 Bob 11250.0 11250.0
# 2 Carla 22000.0 22000.0
# 3 Dave 10200.0 10200.0
Key Insight: Even when apply feels natural, np.select() or np.where() handles multi-condition row-wise logic faster. Showing both approaches in an interview, and explaining the tradeoff, signals senior-level thinking.
Handling Missing Data
Null handling questions test whether you understand the difference between filling blindly and filling intelligently based on business context.
Problem 10: Strategic Imputation with Group-Level Fill
A healthcare analytics company asks: "Fill missing temperature readings with the last valid reading per patient. If no prior reading exists for that patient, fill with the patient's mean temperature."
import pandas as pd
import numpy as np
readings = pd.DataFrame({
"patient_id": [1, 1, 1, 2, 2, 2, 3, 3],
"timestamp": pd.date_range("2024-01-01", periods=8, freq="6h"),
"temperature": [98.6, np.nan, 99.1, np.nan, 101.2, np.nan, 98.9, np.nan]
})
# Step 1: forward fill within each patient group
readings["temp_ffill"] = readings.groupby("patient_id")["temperature"].ffill()
# Step 2: for any remaining NaN (no prior reading), fill with group mean
readings["temp_filled"] = (
readings["temp_ffill"]
.fillna(
readings.groupby("patient_id")["temperature"].transform("mean")
)
)
print(readings[["patient_id", "temperature", "temp_filled"]])
# patient_id temperature temp_filled
# 0 1 98.6 98.600
# 1 1 NaN 98.600 <- ffill from row 0
# 2 1 99.1 99.100
# 3 2 NaN 101.200 <- group mean (only one non-null)
# 4 2 101.2 101.200
# 5 2 NaN 101.200 <- ffill from row 4
# 6 3 98.9 98.900
# 7 3 NaN 98.900 <- ffill from row 6
Common Mistake: Using df.fillna(df["temperature"].mean()), which fills with the global mean, not the per-patient mean. The question specifically said "the patient's mean temperature," and missing that constraint fails the test case.
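A tiny sketch (toy data, two patients) contrasting the global-mean mistake with the per-group fill the question asks for:

```python
import pandas as pd
import numpy as np

readings = pd.DataFrame({
    "patient_id": [1, 1, 2, 2],
    "temperature": [98.0, np.nan, 102.0, np.nan]
})

# Global mean — one number (100.0) for everyone, regardless of patient
global_fill = readings["temperature"].fillna(readings["temperature"].mean())
print(global_fill.tolist())  # [98.0, 100.0, 102.0, 100.0]

# Per-patient mean — respects the group structure
group_fill = readings["temperature"].fillna(
    readings.groupby("patient_id")["temperature"].transform("mean")
)
print(group_fill.tolist())  # [98.0, 98.0, 102.0, 102.0]
```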
Problem 11: Counting and Profiling Nulls Across a DataFrame
This is a diagnostic question asked in data quality interviews. The goal is to produce a null audit table in three lines.
import pandas as pd
import numpy as np
df = pd.DataFrame({
"user_id": [1, 2, 3, 4, 5],
"age": [25, np.nan, 34, np.nan, 29],
"email": ["a@x.com", None, "b@x.com", "c@x.com", None],
"revenue": [100.0, 200.0, np.nan, 150.0, np.nan],
"country": ["US", "UK", "US", None, "CA"]
})
# Null audit — shows count and percentage per column
null_audit = pd.DataFrame({
"null_count": df.isnull().sum(),
"null_pct": (df.isnull().mean() * 100).round(1),
"dtype": df.dtypes
}).query("null_count > 0").sort_values("null_pct", ascending=False)
print(null_audit)
# null_count null_pct dtype
# revenue 2 40.0 float64
# age 2 40.0 float64
# email 2 40.0 object
# country 1 20.0 object
# Check for rows with any nulls vs rows that are completely clean
total_rows = len(df)
rows_with_null = df.isnull().any(axis=1).sum()
print(f"Rows with at least one null: {rows_with_null}/{total_rows}")
# Rows with at least one null: 4/5
This two-step pattern (audit first, then decide imputation strategy) is what interviewers want to see. Never jump straight to fillna without first understanding the null distribution.
Pivot Tables and Reshaping
Reshaping questions appear frequently in analytics engineer screens and are common in SQL-to-Python translation tasks.
Problem 12: Pivot Table with Multiple Aggregations
A major social media company asks you to produce a report showing average and total ad spend per advertiser category, broken out by quarter.
import pandas as pd
import numpy as np
np.random.seed(42)
ad_spend = pd.DataFrame({
"advertiser_id": np.random.randint(1, 20, 40),
"category": np.random.choice(["Tech", "Retail", "Finance"], 40),
"quarter": np.random.choice(["Q1", "Q2", "Q3", "Q4"], 40),
"spend": np.random.randint(5000, 50000, 40)
})
# pivot_table with multiple aggregation functions
pivot = pd.pivot_table(
    ad_spend,
    values="spend",
    index="category",
    columns="quarter",
    aggfunc=["mean", "sum"],
    fill_value=0
).round(0)
print(pivot)
# Output varies by numpy version due to random seed behavior.
# You will see a multi-level column pivot with mean and sum
# for each quarter, grouped by category.
# Flatten MultiIndex columns for downstream use
pivot.columns = ["_".join(col).strip() for col in pivot.columns.values]
pivot = pivot.reset_index()
Key Insight: fill_value=0 fills missing cells (combinations with no data) with zero, not NaN. Use this when zero is semantically correct (no spend happened). Use fill_value=np.nan when absence means unknown rather than zero.
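A small sketch (hypothetical spend data) showing fill_value writing a zero into a category/quarter combination that has no rows:

```python
import pandas as pd

df = pd.DataFrame({
    "category": ["Tech", "Tech", "Retail"],
    "quarter": ["Q1", "Q2", "Q1"],
    "spend": [100, 200, 300]
})

# Retail has no Q2 rows — fill_value=0 writes 0 into that empty cell
p = pd.pivot_table(df, values="spend", index="category",
                   columns="quarter", aggfunc="sum", fill_value=0)
print(p)
# quarter    Q1   Q2
# category
# Retail    300    0
# Tech      100  200
```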
Problem 13: Melt for Wide-to-Long Transformation
A cloud data platform gives candidates a wide-format DataFrame with one column per metric and asks them to convert it to long format for Seaborn plotting or SQL upload.
import pandas as pd
# Wide format — common in spreadsheet exports
wide = pd.DataFrame({
"store_id": [1, 2, 3],
"revenue_Q1": [45000, 62000, 38000],
"revenue_Q2": [51000, 58000, 42000],
"revenue_Q3": [48000, 70000, 35000],
"revenue_Q4": [55000, 65000, 40000]
})
# melt: convert revenue_Q1..Q4 columns into rows
long = wide.melt(
id_vars=["store_id"],
value_vars=["revenue_Q1", "revenue_Q2", "revenue_Q3", "revenue_Q4"],
var_name="quarter",
value_name="revenue"
)
# Clean up the quarter column — strip "revenue_" prefix
long["quarter"] = long["quarter"].str.replace("revenue_", "", regex=False)
long = long.sort_values(["store_id", "quarter"]).reset_index(drop=True)
print(long)
# store_id quarter revenue
# 0 1 Q1 45000
# 1 1 Q2 51000
# 2 1 Q3 48000
# 3 1 Q4 55000
# 4 2 Q1 62000
# ...
The inverse of melt is pivot (not pivot_table). If the interviewer follows up with "now convert it back to wide format," use long.pivot(index="store_id", columns="quarter", values="revenue").reset_index().
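A self-contained sketch of that follow-up (reusing a small long-format frame like the one above):

```python
import pandas as pd

long = pd.DataFrame({
    "store_id": [1, 1, 2, 2],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "revenue": [45000, 51000, 62000, 58000]
})

# pivot works here because each (store_id, quarter) pair is unique;
# with duplicate pairs you would need pivot_table plus an aggfunc
wide_again = long.pivot(index="store_id", columns="quarter", values="revenue").reset_index()
wide_again.columns.name = None  # drop the leftover "quarter" axis name
print(wide_again)
#    store_id     Q1     Q2
# 0         1  45000  51000
# 1         2  62000  58000
```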
Time Series Operations
Time series problems appear in data science interviews at any company that deals with user behavior, financial data, or sensor readings, which is most of them.
Problem 14: Resample and Rolling Windows
A fintech company asks: "Given a DataFrame of daily stock prices, compute the monthly average closing price and a 7-day rolling average."
import pandas as pd
import numpy as np
np.random.seed(42)
dates = pd.date_range("2024-01-01", "2024-06-30", freq="D")
prices = pd.DataFrame({
"date": dates,
"close": 100 + np.cumsum(np.random.randn(len(dates)) * 2)
}).set_index("date")
# resample to monthly mean
monthly_avg = prices["close"].resample("ME").mean().round(2)
print(monthly_avg)
# Output varies by numpy version due to random seed behavior.
# You will see monthly mean closing prices for Jan-Jun 2024,
# showing a general upward trend driven by the positive drift parameter.
# 7-day rolling average — min_periods avoids NaN for the first 6 rows
prices["rolling_7d"] = prices["close"].rolling(window=7, min_periods=1).mean()
# Date-based filtering — select only Q1 data (partial string slicing via .loc)
q1 = prices.loc["2024-01-01":"2024-03-31"]
print(f"Q1 rows: {len(q1)}") # Q1 rows: 91
Key Insight: Set the DataFrame index to the datetime column before calling resample, or pass the column name directly with resample("ME", on="date"). Resample does not work on a regular integer index, and forgetting this is the most common runtime error in time series interviews: TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex.
Common Mistake: Using rolling(window=7).mean() without min_periods=1. The first six rows will be NaN in a report that the interviewer expects to have values. Always set min_periods explicitly unless you specifically want to exclude early windows.
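A tiny sketch (toy series, not the price data above) showing exactly what min_periods changes:

```python
import pandas as pd

s = pd.Series([10.0, 20.0, 30.0, 40.0])

# Default: the first window-1 values are NaN
print(s.rolling(window=3).mean().tolist())
# [nan, nan, 20.0, 30.0]

# min_periods=1 computes a partial average from the very first row
print(s.rolling(window=3, min_periods=1).mean().tolist())
# [10.0, 15.0, 20.0, 30.0]
```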
Conclusion
Pandas interviews test a consistent set of patterns: loc vs iloc after index disruption, transform for broadcasting group stats, multi-key merge validation, vectorized operations over apply, strategic null imputation, and resample for time series. Candidates who know these patterns cold (meaning they can write the code from memory under time pressure) are the ones who pass.
The second differentiator is debugging instinct. Add assert len(result) == expected after merges. Print intermediate shapes with df.shape. Check null counts before imputation. Interviewers notice when a candidate's code includes these sanity checks because it signals production awareness, not just textbook knowledge.
The third differentiator is knowing the tradeoffs. Named aggregation syntax over dictionary aggregation, np.select over nested apply, validate parameter in merges, min_periods in rolling windows. These are the details that separate a candidate who has used Pandas casually from one who has debugged a broken data pipeline at 2am.
For more on the ML concepts that complement Pandas fluency, see our Python Machine Learning Interview Questions guide. For how Pandas fits into the broader feature engineering pipeline, see our Feature Engineering guide.
The candidates who pass Pandas technical screens are not the ones who memorized every API. They are the ones who built a mental model of how data flows through each operation.
Career Q&A
How much Pandas do I actually need to know for a data science interview?
For most data science roles, you need to be fluent in groupby/agg, merge, loc/iloc, handling nulls, and basic reshaping. You do not need to memorize the entire Pandas API. Interviewers care more about problem-solving approach (can you decompose the problem into steps, and do you check your work?) than whether you can recall the exact parameter name for pivot_table.
Should I practice Pandas problems in Jupyter notebooks or in a plain text editor?
Both. Use Jupyter for learning and exploration, but also practice in a plain code editor (or directly in a browser-based screen) because many technical interviews use tools like CoderPad or a raw Python environment. The cognitive gap between "writing code with autocomplete" and "writing code from memory" is significant and only closed with deliberate practice.
How important is it to know the performance differences between apply and vectorized operations?
For senior data scientist roles, very important. Interviewers at large technology companies specifically probe this because their pipelines process billions of rows. Even if you are applying for an entry-level role, being able to say "I would use str.contains here instead of apply because it's vectorized" distinguishes you from candidates who only know one approach.
Is Pandas being replaced by Polars or Spark in interviews?
At many large technology companies in 2026, Pandas remains the standard for take-home projects and initial screens. Polars is growing in popularity for performance-sensitive pipelines, and PySpark comes up in data engineering screens. For data scientist roles specifically, you are unlikely to be tested on PySpark syntax, but understanding the concept of lazy evaluation (which PySpark uses) can come up in system design discussions.
How do I handle a Pandas problem I've never seen before in a live interview?
Think out loud through the data structure first. Say what shape you expect the input to have, what shape you need the output to have, and which Pandas operation bridges the two. The transformation classes are: reduce (groupby/agg), broadcast (transform), reshape (pivot/melt), combine (merge/concat), and filter (boolean indexing). Naming the correct class immediately narrows your search to the right API surface.
What Python libraries besides Pandas should I know for a data science technical screen?
NumPy (especially np.where, np.select, broadcasting semantics), and basic familiarity with scikit-learn's fit/transform pattern. Some interviews include matplotlib for a quick visualization. SQL fluency is equally important; many companies give a mixed interview with both Pandas and SQL questions to test whether you can choose the right tool.
How long should I spend on Pandas interview prep?
If you are starting from intermediate Pandas knowledge, four to six weeks of daily 30-minute practice sessions on StrataScratch or DataLemur covers the core patterns. Focus on the seven categories in this article. Do not chase obscure edge cases. The same patterns appear across nearly every company's screen.
Sources
- InterviewQuery: Pandas Interview Questions (2024)
- Real Python: Pandas GroupBy: Your Guide to Grouping Data in Python (2024)
- Pandas Documentation: Group By: Split-Apply-Combine (2024)
- Pandas Documentation: Reshaping and Pivot Tables (2024)
- Pandas Documentation: Time Series / Date Functionality (2024)
- StrataScratch: Python Interview Questions and Answers (2024)
- DataLemur: Python Coding Interview Questions (2024)
- Pandas Documentation: Merging, Joining, Concatenating (2024)
- Pandas Documentation: Missing Data (2024)