
Python Machine Learning Interview Questions for Data Scientists

LDS Team · Let's Data Science · 20 min read

Technical screens for data science roles in 2026 have a reliable shape: thirty minutes of SQL, then thirty minutes where the interviewer says "I'd like you to implement something." That second half is where candidates separate themselves. Knowing the sklearn API is table stakes. Being able to derive gradient descent in NumPy (and explain it in plain English while doing it) is what earns an offer.

"The questions in this article are drawn from our research of publicly available interview prep platforms and community discussions, including DataLemur, InterviewQuery, LeetCode, and forums such as r/datascience and Blind. These represent patterns that machine learning practitioners have reported at technical interviews across the industry. We do not claim these are proprietary questions from any specific organization."

This guide covers what actually gets asked: from-scratch algorithm implementations, model evaluation traps, cross-validation mechanics, feature engineering decisions, gradient boosting intuition, and the LLM literacy questions that have become standard at senior levels as of 2026.

Implementing ML Algorithms From Scratch

Why do interviewers ask you to implement algorithms without sklearn? It tests whether you understand the math, not just the API. Anyone can call LogisticRegression().fit(X, y). Far fewer candidates can explain what that call is actually doing or reproduce its core logic in fifty lines of NumPy.

These are the three implementations that appear most often in technical screens, ranked by frequency.

Logistic Regression in NumPy

The question: "Implement binary logistic regression using only NumPy. Include a fit method with gradient descent and a predict method."

This question tests whether you understand the sigmoid function, binary cross-entropy loss, and weight updates simultaneously.

python
import numpy as np

class LogisticRegressionNumPy:
    def __init__(self, learning_rate=0.01, n_iterations=1000):
        self.lr = learning_rate
        self.n_iter = n_iterations
        self.weights = None
        self.bias = None

    def _sigmoid(self, z):
        # Clip to prevent overflow in exp
        z = np.clip(z, -500, 500)
        return 1.0 / (1.0 + np.exp(-z))

    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.weights = np.zeros(n_features)
        self.bias = 0.0

        for _ in range(self.n_iter):
            # Forward pass
            z = np.dot(X, self.weights) + self.bias
            y_hat = self._sigmoid(z)

            # Gradients (derived from cross-entropy loss)
            error = y_hat - y
            dw = (1 / n_samples) * np.dot(X.T, error)
            db = (1 / n_samples) * np.sum(error)

            # Update weights
            self.weights -= self.lr * dw
            self.bias -= self.lr * db

    def predict_proba(self, X):
        z = np.dot(X, self.weights) + self.bias
        return self._sigmoid(z)

    def predict(self, X, threshold=0.5):
        return (self.predict_proba(X) >= threshold).astype(int)


# Example usage
np.random.seed(42)
X = np.random.randn(200, 3)
y = (X[:, 0] + X[:, 1] > 0).astype(int)

model = LogisticRegressionNumPy(learning_rate=0.1, n_iterations=500)
model.fit(X, y)
preds = model.predict(X)
accuracy = np.mean(preds == y)
print(f"Training accuracy: {accuracy:.3f}")  # Training accuracy: 0.970

Key Insight: The gradient of binary cross-entropy with respect to weights is simply (1/n) * X^T * (y_hat - y). If you can derive this, the implementation follows naturally. If you memorize it without deriving it, an interviewer can unravel you in one follow-up question.
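A quick way to prove you derived this gradient rather than memorized it is a finite-difference check: perturb each weight slightly, recompute the loss, and compare against the analytic formula. A minimal sketch on synthetic data (invented here for illustration):

```python
import numpy as np

def bce_loss(w, b, X, y, eps=1e-12):
    """Binary cross-entropy for a given weight vector and bias."""
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    p = np.clip(p, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

np.random.seed(0)
X = np.random.randn(50, 3)
y = (X[:, 0] > 0).astype(float)
w, b = np.random.randn(3) * 0.1, 0.0

# Analytic gradient: (1/n) * X^T (y_hat - y)
y_hat = 1.0 / (1.0 + np.exp(-(X @ w + b)))
dw_analytic = X.T @ (y_hat - y) / len(y)

# Central finite differences, one weight at a time
h = 1e-6
dw_numeric = np.zeros_like(w)
for j in range(len(w)):
    w_hi, w_lo = w.copy(), w.copy()
    w_hi[j] += h
    w_lo[j] -= h
    dw_numeric[j] = (bce_loss(w_hi, b, X, y) - bce_loss(w_lo, b, X, y)) / (2 * h)

print(np.max(np.abs(dw_analytic - dw_numeric)))  # prints a very small number
```

If the two gradients disagree by more than floating-point noise, the derivation is wrong somewhere, and the check tells you which coordinate to look at.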

Common Mistake: Forgetting to clip the sigmoid input before computing exp(-z). For large negative values, this causes numerical overflow. Show the np.clip line and explain why it is there.

From the Interviewer's Perspective: "We are not looking for perfection. We want to see that you understand the update rule and can reason about what happens when the learning rate is too large or too small."

K-Means Clustering in NumPy

The question: "Implement K-Means from scratch. Your implementation should handle convergence detection."

python
import numpy as np

class KMeansNumPy:
    def __init__(self, k=3, max_iters=100, tol=1e-4):
        self.k = k
        self.max_iters = max_iters
        self.tol = tol
        self.centroids = None

    def fit(self, X):
        np.random.seed(42)
        # Initialize centroids by sampling k points from X
        idx = np.random.choice(len(X), self.k, replace=False)
        self.centroids = X[idx].copy()

        for _ in range(self.max_iters):
            # Assignment step: compute distances to each centroid
            distances = np.linalg.norm(
                X[:, np.newaxis] - self.centroids, axis=2
            )  # shape: (n_samples, k)
            labels = np.argmin(distances, axis=1)

            # Update step: recompute centroids as cluster means
            new_centroids = np.array([
                X[labels == j].mean(axis=0) if np.any(labels == j)
                else self.centroids[j]
                for j in range(self.k)
            ])

            # Convergence check
            shift = np.linalg.norm(new_centroids - self.centroids)
            self.centroids = new_centroids
            if shift < self.tol:
                break

        self.labels_ = labels
        return self

    def predict(self, X):
        distances = np.linalg.norm(
            X[:, np.newaxis] - self.centroids, axis=2
        )
        return np.argmin(distances, axis=1)


# Example usage
np.random.seed(42)
cluster1 = np.random.randn(50, 2) + np.array([0, 0])
cluster2 = np.random.randn(50, 2) + np.array([5, 5])
cluster3 = np.random.randn(50, 2) + np.array([0, 5])
X = np.vstack([cluster1, cluster2, cluster3])

km = KMeansNumPy(k=3)
km.fit(X)
print(f"Centroids:\n{km.centroids.round(2)}")
# Centroids:
# [[ 4.9   5.14]
#  [-0.14 -0.07]
#  [ 0.13  5.  ]]

Key Insight: The edge case that trips up most candidates is an empty cluster: an iteration where no points get assigned to some centroid. Handle it explicitly (the implementation above keeps the old centroid). If you let it fail silently, the mean of an empty slice is NaN, the NaN propagates into every distance, and convergence never happens.

Common Mistake: Computing distances with nested Python loops instead of broadcasting. The vectorized computation X[:, np.newaxis] - self.centroids materializes an (n_samples, k, n_features) array, so it costs O(n*k*d) memory, but it is dramatically faster than nested loops for realistic dataset sizes.

Linear Regression: Normal Equation and Gradient Descent

The question: "What are the two ways to solve linear regression? Implement both."

python
import numpy as np

def linear_regression_normal_equation(X, y):
    """Closed-form solution: beta = (X^T X)^{-1} X^T y"""
    # Add bias column
    X_b = np.column_stack([np.ones(len(X)), X])
    # Solve using least squares (numerically stable vs direct inverse)
    theta = np.linalg.lstsq(X_b, y, rcond=None)[0]
    return theta

def linear_regression_gradient_descent(X, y, lr=0.01, n_iter=1000):
    """Iterative solution via gradient descent"""
    X_b = np.column_stack([np.ones(len(X)), X])
    n = len(y)
    theta = np.zeros(X_b.shape[1])

    for _ in range(n_iter):
        predictions = X_b.dot(theta)
        errors = predictions - y
        gradient = (2 / n) * X_b.T.dot(errors)
        theta -= lr * gradient

    return theta


# Verify both methods agree
np.random.seed(42)
X = np.random.randn(100, 2)
true_weights = np.array([3.0, 1.5, -2.0])  # [bias, w1, w2]
y = np.column_stack([np.ones(100), X]).dot(true_weights) + np.random.randn(100) * 0.5

theta_ne = linear_regression_normal_equation(X, y)
theta_gd = linear_regression_gradient_descent(X, y, lr=0.1, n_iter=2000)

print(f"Normal equation: {theta_ne.round(3)}")   # [ 3.046  1.595 -2.086]
print(f"Gradient descent: {theta_gd.round(3)}")  # [ 3.046  1.595 -2.086]

Key Insight: The normal equation is O(n * p^2 + p^3), and the matrix inversion scales with the cube of features. For p > 10,000 features, gradient descent becomes the only practical option. Knowing where each method breaks down is the answer that separates senior candidates from junior ones.

Model Evaluation: The Questions Most Candidates Answer Poorly

Interviewers report that model evaluation is the section where prepared candidates most often underperform. The concepts seem basic, so candidates do not review them carefully. Then they get asked a second-level question and freeze.

Precision, Recall, and F1 From Scratch

The question: "Implement precision, recall, and F1 score using only NumPy. Then explain when you would optimize for precision vs. recall."

python
import numpy as np

def precision_recall_f1(y_true, y_pred):
    y_true = np.array(y_true)
    y_pred = np.array(y_pred)

    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))

    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall    = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    f1        = (2 * precision * recall / (precision + recall)
                 if (precision + recall) > 0 else 0.0)

    return {
        "precision": round(precision, 4),
        "recall": round(recall, 4),
        "f1": round(f1, 4),
        "tp": int(tp), "fp": int(fp), "fn": int(fn)
    }


y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

metrics = precision_recall_f1(y_true, y_pred)
print(metrics)
# {'precision': 0.8, 'recall': 0.8, 'f1': 0.8, 'tp': 4, 'fp': 1, 'fn': 1}

When precision matters more: Email spam detection. Sending a legitimate email to the spam folder (false positive) is worse than letting one piece of spam through. You pay for false positives with user trust.

When recall matters more: Cancer screening or fraud detection. Missing a true positive (a tumor, a fraudulent transaction) has a much higher cost than an incorrect flag. You pay for false negatives with outcomes that cannot be undone.

Key Insight: The threshold that maximizes F1 is not always 0.5. In highly imbalanced datasets, the optimal threshold is often much lower. Bring this up unprompted and it signals senior-level thinking.
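One way to make that point concrete is to sweep thresholds on a synthetic imbalanced problem and report the F1-maximizing one. A sketch, with invented score distributions chosen so the classes overlap below 0.5:

```python
import numpy as np

np.random.seed(42)
n = 2000
y_true = (np.random.rand(n) < 0.05).astype(int)  # ~5% positives
# Invented scores: positives land in [0.2, 0.9], negatives in [0.0, 0.3]
y_scores = np.where(y_true == 1,
                    np.random.uniform(0.2, 0.9, n),
                    np.random.uniform(0.0, 0.3, n))

def f1_at(threshold):
    """F1 score when classifying at the given probability threshold."""
    y_pred = (y_scores >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

thresholds = np.linspace(0.05, 0.95, 19)
f1s = [f1_at(t) for t in thresholds]
best = thresholds[int(np.argmax(f1s))]
print(f"F1-maximizing threshold: {best:.2f}")  # lands well below the default 0.5
```

In an interview, walking through this sweep out loud (and noting that the threshold should be tuned on a validation set, not the test set) is usually enough to make the point.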

Common Mistake: Not recognizing that precision and positive predictive value are the same metric, as are recall and sensitivity. Interviewers who come from a clinical background will use the clinical terms and expect you to follow.

ROC-AUC: What the Number Actually Means

The question: "What does an AUC of 0.5 mean? What about 1.0? What does AUC actually measure, and how would you explain it to a product manager?"

The AUC (Area Under the ROC Curve) answers one question: if you randomly sample one positive and one negative example, what is the probability that your model assigns a higher score to the positive example? An AUC of 0.5 means the model performs no better than random guessing. An AUC of 1.0 means perfect rank ordering.

python
import numpy as np

def roc_auc_from_scratch(y_true, y_scores):
    """
    Computes AUC using the trapezoidal rule on the ROC curve.
    Equivalent to the probability that a random positive scores
    higher than a random negative.
    """
    y_true = np.array(y_true)
    y_scores = np.array(y_scores)

    # Sort by descending score
    desc_idx = np.argsort(-y_scores)
    y_true_sorted = y_true[desc_idx]

    n_pos = np.sum(y_true == 1)
    n_neg = np.sum(y_true == 0)

    tpr_list, fpr_list = [0.0], [0.0]
    tp, fp = 0, 0

    for label in y_true_sorted:
        if label == 1:
            tp += 1
        else:
            fp += 1
        tpr_list.append(tp / n_pos)
        fpr_list.append(fp / n_neg)

    # Trapezoidal rule (np.trapezoid requires NumPy >= 2.0; older versions use np.trapz)
    auc = np.trapezoid(tpr_list, fpr_list)
    return round(abs(auc), 4)


y_true  = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_scores = [0.9, 0.3, 0.8, 0.7, 0.4, 0.85, 0.2, 0.35, 0.75, 0.45]

print(f"AUC: {roc_auc_from_scratch(y_true, y_scores)}")  # AUC: 1.0

# Verify against sklearn
from sklearn.metrics import roc_auc_score
print(f"sklearn AUC: {roc_auc_score(y_true, y_scores):.4f}")  # sklearn AUC: 1.0000

Key Insight: AUC measures rank discrimination, not calibration. A model with AUC 0.95 can still assign wildly wrong probabilities (e.g., the positive examples score 0.6 and negatives score 0.4, which is correct ordering but miscalibrated). If your downstream system cares about actual probability values (pricing, expected value calculations), also check calibration with a reliability diagram.

From the Interviewer's Perspective: "We want to hear 'AUC is a threshold-free metric.' If a candidate jumps straight to 'AUC is the area under the curve' without explaining what that area represents probabilistically, they probably memorized the definition."

Validation, Overfitting, and the Subtle Mistakes Candidates Make

K-Fold Cross-Validation Without Sklearn

The question: "Implement k-fold cross-validation from scratch. Walk me through why we use it instead of a single train-test split."

python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def kfold_cross_validate(X, y, model, k=5, shuffle=True, random_state=42):
    """
    Manual k-fold cross-validation.
    Returns per-fold scores and their mean/std.
    """
    n = len(X)
    indices = np.arange(n)

    if shuffle:
        rng = np.random.default_rng(random_state)
        rng.shuffle(indices)

    fold_size = n // k
    scores = []

    for fold in range(k):
        # Build validation indices for this fold
        val_start = fold * fold_size
        val_end   = val_start + fold_size if fold < k - 1 else n
        val_idx   = indices[val_start:val_end]
        train_idx = np.concatenate([indices[:val_start], indices[val_end:]])

        X_train, X_val = X[train_idx], X[val_idx]
        y_train, y_val = y[train_idx], y[val_idx]

        model.fit(X_train, y_train)
        preds = model.predict(X_val)
        scores.append(accuracy_score(y_val, preds))

    return {
        "scores": [round(s, 4) for s in scores],
        "mean":   round(np.mean(scores), 4),
        "std":    round(np.std(scores), 4)
    }


# Usage
np.random.seed(42)
X = np.random.randn(300, 5)
y = (X[:, 0] + X[:, 2] > 0).astype(int)

clf = LogisticRegression(max_iter=200)
results = kfold_cross_validate(X, y, clf, k=5)
print(results)
# {'scores': [0.9667, 0.9667, 0.95, 1.0, 0.9833], 'mean': 0.9733, 'std': 0.017}

Why k-fold over a single split? A single split gives one accuracy number whose value depends heavily on which points ended up in the test set. With five folds, every data point appears in the validation set exactly once: the mean score has lower variance than any single split, and the standard deviation across folds reveals whether the model is consistent or fragile.

Stratified k-fold: When your classes are imbalanced (say 90% negative), random folding can produce a fold with zero positive examples. Stratified k-fold preserves the class ratio in each fold. Use it by default for classification whenever the split is more skewed than roughly 80:20.

Common Mistake: Fitting preprocessing (e.g., StandardScaler) on the full dataset before splitting into folds. This is data leakage. The scaler has seen the validation fold's statistics. Fit the scaler inside the loop, on the training fold only, and transform the validation fold with those same parameters.
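The easiest way to get this right in practice is sklearn's Pipeline, which refits the scaler on each training fold automatically. A minimal sketch on synthetic data:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

np.random.seed(42)
X = np.random.randn(300, 5) * [1, 10, 100, 1000, 1]  # wildly different feature scales
y = (X[:, 0] + X[:, 4] > 0).astype(int)

# Inside cross_val_score, the pipeline fits StandardScaler on each
# training fold only, then transforms the held-out fold with those
# parameters — no validation statistics ever leak into training.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=500))
scores = cross_val_score(pipe, X, y, cv=5)
print(f"Mean CV accuracy: {scores.mean():.3f}")
```

Mentioning the Pipeline pattern unprompted, after implementing the manual loop, shows you know both the mechanics and the production-safe shortcut.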

Reading Learning Curves to Detect Overfitting

The question: "How would you diagnose overfitting vs. underfitting using learning curves? Show me in code."

python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error

def plot_learning_curves(estimator, X, y, title="Learning Curves"):
    train_sizes = np.linspace(0.1, 1.0, 10)
    train_errors, val_errors = [], []

    np.random.seed(42)
    n = len(X)
    perm = np.random.permutation(n)
    X, y = X[perm], y[perm]
    split = int(0.8 * n)
    X_train_full, X_val = X[:split], X[split:]
    y_train_full, y_val = y[:split], y[split:]

    for frac in train_sizes:
        size = max(int(frac * len(X_train_full)), 2)
        X_tr, y_tr = X_train_full[:size], y_train_full[:size]

        estimator.fit(X_tr, y_tr)
        train_errors.append(
            mean_squared_error(y_tr, estimator.predict(X_tr))
        )
        val_errors.append(
            mean_squared_error(y_val, estimator.predict(X_val))
        )

    plt.figure(figsize=(8, 4))
    plt.plot(train_sizes, train_errors, label="Train MSE")
    plt.plot(train_sizes, val_errors,  label="Val MSE")
    plt.xlabel("Training set fraction")
    plt.ylabel("MSE")
    plt.title(title)
    plt.legend()
    plt.tight_layout()
    plt.savefig("learning_curve.png", dpi=80)

# Underfitting: linear model on nonlinear data
np.random.seed(42)
X = np.sort(np.random.uniform(0, 3, 200))
y = np.sin(2 * X) + np.random.randn(200) * 0.2

linear_model = LinearRegression()
poly_model   = make_pipeline(PolynomialFeatures(degree=6), LinearRegression())

plot_learning_curves(linear_model, X.reshape(-1, 1), y, "Linear (Underfitting)")
plot_learning_curves(poly_model,   X.reshape(-1, 1), y, "Poly Degree 6 (Good Fit)")

What the curves tell you:

  • Overfitting: Training error is low. Validation error is high. A large gap between the two lines that does not close as training size increases.
  • Underfitting: Both training and validation error are high and converge quickly. Adding more data will not help; the model needs more capacity.
  • Good fit: Training error slightly below validation error. The gap is small and both errors decrease as training size grows.

Feature Engineering Under Interview Pressure

These questions test whether you can make real decisions under constraints, not just recite textbook definitions.

Encoding Categorical Variables

The question: "When would you use one-hot encoding vs. target encoding vs. label encoding?"

Label encoding assigns integer values (0, 1, 2...) to categories. Use it only for ordinal features where the order is meaningful (e.g., "low / medium / high") or for tree-based models where the ordering does not affect splits. Never use it for nominal categories with linear models, because the integer ordering introduces a false relationship.

One-hot encoding creates a binary column per category. Use it for nominal features with low cardinality (under ~20 unique values). The main failure mode is high-cardinality columns (e.g., zip codes, user IDs) where one-hot produces thousands of sparse columns that destroy model performance and inflate memory usage.
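The false-ordering problem with label encoding is easy to demonstrate in pandas. A small sketch (the color column is invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({'color': ['red', 'green', 'blue', 'green']})

# Label encoding: integer codes. A linear model now "believes" blue < green < red.
label_encoded = df['color'].astype('category').cat.codes
print(label_encoded.tolist())  # [2, 1, 0, 1] (categories are ordered alphabetically)

# One-hot encoding: one binary column per category, no implied ordering
one_hot = pd.get_dummies(df['color'], prefix='color')
print(one_hot.columns.tolist())  # ['color_blue', 'color_green', 'color_red']
```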

Target encoding replaces each category with the mean of the target variable for that category. Use it for high-cardinality nominal features where one-hot is impractical. The catch is data leakage: if you compute target means on the full training set and then use them in cross-validation, you have leaked the target. Always compute target encodings inside each fold.

python
import numpy as np
import pandas as pd

def target_encode_safe(train_df, val_df, col, target, smoothing=10):
    """
    Smoothed target encoding to prevent overfitting on rare categories.
    Computed on train only, applied to val without leakage.
    """
    global_mean = train_df[target].mean()
    stats = train_df.groupby(col)[target].agg(['mean', 'count'])
    stats['smoothed'] = (
        (stats['mean'] * stats['count'] + global_mean * smoothing)
        / (stats['count'] + smoothing)
    )
    train_encoded = train_df[col].map(stats['smoothed']).fillna(global_mean)
    val_encoded   = val_df[col].map(stats['smoothed']).fillna(global_mean)
    return train_encoded, val_encoded


# Example
df = pd.DataFrame({
    'city': ['NYC', 'LA', 'NYC', 'Chicago', 'LA', 'NYC', 'Chicago'],
    'churn': [1, 0, 1, 0, 1, 0, 0]
})
train, val = df.iloc[:5], df.iloc[5:]
enc_train, enc_val = target_encode_safe(train, val, 'city', 'churn')
print(enc_train.round(3).tolist())  # [0.667, 0.583, 0.667, 0.545, 0.583]

Feature Scaling: Which Models Need It and Why

The question: "When does feature scaling matter? Give examples of algorithms that break without it and algorithms that do not care."

Scaling matters for:

  • Linear and logistic regression: gradient descent converges much faster when features are on the same scale. A feature ranging 0-1,000,000 will dominate the gradient update relative to a feature ranging 0-1.
  • Support vector machines: the kernel function (especially RBF) computes distances in feature space. Unscaled features make distance meaningless.
  • K-nearest neighbors and k-means: both rely on Euclidean distance directly.
  • Neural networks: unscaled inputs cause saturated activations early in training.

Scaling does not matter for:

  • Decision trees, random forests, gradient boosting: splits are based on rank order, not magnitude. A feature with values 1-1,000,000 and a feature with values 0-1 produce the same splits.

StandardScaler vs. MinMaxScaler:

python
from sklearn.preprocessing import StandardScaler, MinMaxScaler
import numpy as np

X = np.array([[100, 0.01], [200, 0.02], [150, 0.015]])

ss = StandardScaler()
mm = MinMaxScaler()

print("StandardScaler:\n", ss.fit_transform(X).round(3))
# [[-1.225 -1.225]
#  [ 1.225  1.225]
#  [ 0.     0.   ]]

print("MinMaxScaler:\n", mm.fit_transform(X).round(3))
# [[0.  0. ]
#  [1.  1. ]
#  [0.5 0.5]]

Use StandardScaler (zero mean, unit variance) as the default. Use MinMaxScaler when the algorithm requires inputs bounded to [0, 1] (e.g., some neural network architectures, certain kernel SVMs) or when the feature distribution is not Gaussian. MinMaxScaler is sensitive to outliers, and one extreme value compresses all other values.
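The outlier problem is easy to demonstrate with a single extreme value. A sketch, also showing RobustScaler (which centers on the median and scales by the IQR) as the usual fix:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])  # one extreme outlier

# MinMaxScaler: the outlier defines the range, squashing the inliers toward 0
mm = MinMaxScaler().fit_transform(X)
print(mm.ravel().round(3))  # inliers end up between 0.000 and 0.003, outlier at 1.0

# RobustScaler: median/IQR based, so the inliers keep a usable spread
rs = RobustScaler().fit_transform(X)
print(rs.ravel().round(2))  # [-1.0, -0.5, 0.0, 0.5, 498.5]
```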

Handling Class Imbalance

The question: "Your dataset has a 97:3 class ratio. Walk me through your strategy."

This is a decision tree problem, not a one-answer problem. The right answer depends on the dataset size, the cost structure, and the model type.

Step 1: Adjust the threshold first. Before touching the data, lower the classification threshold (from 0.5 to 0.2 or 0.1) and see how precision and recall change. This costs nothing and often closes most of the gap in detection rate.

Step 2: Use class weights if your model supports it. Setting class_weight='balanced' in sklearn models (logistic regression, SVM, random forest, gradient boosting) tells the model to penalize misclassifying minority examples more heavily. Zero extra preprocessing. Works well when the imbalance is moderate (up to ~30:1).

Step 3: SMOTE for severe imbalance. SMOTE generates synthetic minority examples by interpolating between existing minority points in feature space. Use it when class_weight is insufficient (imbalance above 50:1) and when features are continuous. SMOTE does not work well with categorical features.

python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from imblearn.over_sampling import SMOTE
import numpy as np

np.random.seed(42)
X_maj = np.random.randn(970, 2)
X_min = np.random.randn(30, 2) + [2, 2]
X = np.vstack([X_maj, X_min])
y = np.array([0]*970 + [1]*30)

# Without any balancing
clf_base = LogisticRegression(max_iter=500)
clf_base.fit(X, y)
print("No balancing:")
print(classification_report(y, clf_base.predict(X), digits=3))

# With class_weight='balanced'
clf_cw = LogisticRegression(class_weight='balanced', max_iter=500)
clf_cw.fit(X, y)
print("class_weight='balanced':")
print(classification_report(y, clf_cw.predict(X), digits=3))

# With SMOTE
sm = SMOTE(random_state=42)
X_res, y_res = sm.fit_resample(X, y)
clf_smote = LogisticRegression(max_iter=500)
clf_smote.fit(X_res, y_res)

Key Insight: Accuracy is a misleading metric on imbalanced data. A model that predicts the majority class for every example achieves 97% accuracy on a 97:3 split. Always report precision, recall, and F1 on the minority class, or use Matthews Correlation Coefficient (MCC) for a single-number summary that accounts for all four cells of the confusion matrix.
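MCC takes only a few lines to compute from the confusion-matrix cells. A minimal sketch using the 97:3 majority-class baseline from above:

```python
import numpy as np

def matthews_corrcoef_np(y_true, y_pred):
    """MCC from the four confusion-matrix cells; returns 0 when undefined."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    denom = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

# "Always predict the majority class" gets 97% accuracy but an MCC of 0
y_true = [0] * 97 + [1] * 3
y_pred = [0] * 100
print(matthews_corrcoef_np(y_true, y_pred))  # 0.0
```

MCC is 1 for perfect prediction, 0 for a model no better than chance, and -1 for perfect disagreement, which is why it exposes the majority-class trick that accuracy hides.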

Gradient Boosting Intuition Without Memorizing Formulas

The 60-Second Plain English Explanation

The question: "Explain gradient boosting to a non-technical interviewer in one minute."

A strong answer sounds like this:

"Imagine you're trying to predict house prices. You start with the simplest possible model, just the average sale price. That model will be wrong for almost every house. For each house, you calculate how wrong it was: that error is called a residual. Now you train a small decision tree to predict those residuals (not the original prices, but the errors). You add that tree's corrections to your first model, and now you're a little more accurate. You calculate residuals again, train another small tree to predict those new residuals, and add it on. You repeat this process a few hundred times. Each tree is individually weak, but the ensemble of hundreds of trees, each correcting the previous one's mistakes, becomes quite strong. That is gradient boosting."

The word "gradient" comes from the fact that residuals are technically the negative gradient of the squared error loss function, and minimizing residuals is mathematically equivalent to gradient descent in function space. You do not need to say this in the 60-second version, but if the interviewer asks "why gradient?", that is your answer.
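The residual-fitting loop described above can be sketched in a few lines with small sklearn trees. A toy illustration on synthetic data, not a production booster:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

np.random.seed(42)
X = np.random.uniform(0, 3, (200, 1))
y = np.sin(2 * X[:, 0]) + np.random.randn(200) * 0.1

# Start from the simplest model (the mean), then repeatedly
# fit a small tree to the residuals and add its correction.
prediction = np.full(len(y), y.mean())
trees, lr = [], 0.1
for _ in range(100):
    residuals = y - prediction           # negative gradient of squared error
    tree = DecisionTreeRegressor(max_depth=2)
    tree.fit(X, residuals)
    prediction += lr * tree.predict(X)   # add the correction, damped by lr
    trees.append(tree)

mse_start = np.mean((y - y.mean()) ** 2)
mse_end = np.mean((y - prediction) ** 2)
print(f"Training MSE: {mse_start:.3f} -> {mse_end:.3f}")  # drops substantially
```

The learning rate lr is the "shrinkage" knob: smaller values mean each tree corrects less, which slows training but regularizes the ensemble.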

XGBoost vs. LightGBM vs. CatBoost

The question: "What is the actual difference between XGBoost, LightGBM, and CatBoost? When would you choose each?"

These are not marketing differences. They differ in tree construction strategy, categorical handling, and default behavior.

Dimension | XGBoost | LightGBM | CatBoost
--- | --- | --- | ---
Tree growth | Level-wise (breadth-first) | Leaf-wise (best-first) | Symmetric (oblivious trees)
Speed on large data | Moderate | Fastest | Moderate
Categorical features | Manual encoding required | Native support (limited) | Best native support
Out-of-box performance | Good with tuning | Good with tuning | Excellent with defaults
Overfitting tendency | Moderate | Higher (leaf-wise can overfit) | Lower
Key hyperparameter | max_depth | num_leaves | depth

Choosing in practice:

Use LightGBM when your dataset has more than 100,000 rows and training speed matters. Its leaf-wise growth produces deeper, more complex trees faster than level-wise approaches.

Use CatBoost when you have many categorical features and do not want to spend time on preprocessing. Its ordered target statistics encode categoricals natively, with less target leakage than naive target encoding.

Use XGBoost when you are working in an environment where the other two are not available, or when you need the most battle-tested implementation with the most Stack Overflow answers behind it.

The one hyperparameter that matters most for each:

  • XGBoost: max_depth (default 6). Reduces overfitting. Drop to 3-4 before tuning learning rate.
  • LightGBM: num_leaves (default 31). This controls tree complexity more directly than max_depth for leaf-wise trees. Keep it below 2^max_depth.
  • CatBoost: iterations (number of trees) paired with learning_rate. Lower the learning rate and raise iterations. CatBoost tolerates this well due to its built-in regularization.
python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=5000, n_features=20,
                            n_informative=10, random_state=42)

# XGBoost
from xgboost import XGBClassifier
xgb = XGBClassifier(max_depth=4, n_estimators=200,
                    learning_rate=0.05, eval_metric='logloss',
                    random_state=42, verbosity=0)
xgb_scores = cross_val_score(xgb, X, y, cv=3, scoring='roc_auc')
print(f"XGBoost AUC: {xgb_scores.mean():.4f}")  # XGBoost AUC: 0.9862

# LightGBM
from lightgbm import LGBMClassifier
lgb = LGBMClassifier(num_leaves=31, n_estimators=200,
                     learning_rate=0.05, random_state=42, verbose=-1)
lgb_scores = cross_val_score(lgb, X, y, cv=3, scoring='roc_auc')
print(f"LightGBM AUC: {lgb_scores.mean():.4f}")  # LightGBM AUC: 0.9897

AI and LLM Questions Now Standard at Senior Levels

As of early 2026, interviews for senior data scientist roles at large technology companies routinely include questions about LLM architecture, evaluation, and system design. These are not optional extras; they appear in over 60% of senior-level technical screens, according to patterns reported across InterviewQuery, Blind, and r/MachineLearning communities.

Fine-Tuning vs. Prompting vs. RAG

The question: "Your team wants to improve an LLM's performance on your company's internal support tickets. Walk me through the decision between fine-tuning, prompt engineering, and RAG."

This is a system design question disguised as an ML question. There is no single right answer; the right answer depends on the problem constraints.

Use prompt engineering when:

  • The task is well-defined and can be explained in a few examples (few-shot prompting)
  • Your knowledge base is relatively small and fits in a context window
  • Speed and cost matter (no training infrastructure required)
  • You need to iterate quickly

Use RAG when:

  • Your knowledge base is large, frequently updated, or proprietary
  • The model needs accurate, specific facts (product documentation, support ticket history)
  • You cannot fine-tune due to cost, timeline, or model access restrictions
  • You want to cite sources for your answers (each retrieved chunk is traceable)

Use fine-tuning when:

  • You need the model to follow a specific style, format, or tone consistently
  • The task domain is highly specialized and prompt engineering has hit a ceiling
  • You have hundreds or thousands of labeled examples
  • Latency matters and you cannot afford long RAG retrieval pipelines in production
Interview tip: When asked this question, explicitly mention the cost of catastrophic forgetting in fine-tuning: the model can lose general capability when trained on a narrow domain. Mentioning LoRA (Low-Rank Adaptation) as a parameter-efficient alternative shows awareness of production realities.

What Causes LLM Hallucinations

The question: "What causes hallucinations in LLMs and what strategies reduce them?"

Hallucinations occur because LLMs are trained to produce fluent, probable text, not to produce verified, factual text. The model learns statistical associations between tokens, not ground truth. When asked about something outside its training distribution, it generates a plausible-sounding completion rather than returning uncertainty.

Three main causes:

  1. Knowledge gaps: The training corpus does not contain the answer, so the model interpolates.
  2. Instruction following over factuality: RLHF (Reinforcement Learning from Human Feedback) rewards confident, helpful-sounding responses. This can train models to sound certain even when they are not.
  3. Context window limitations: For long documents, the model may lose track of earlier context and generate inconsistent or invented claims.

Mitigation strategies:

  • RAG: Ground responses in retrieved documents. The model still needs to faithfully quote them, but errors become retrievable and correctable.
  • Chain-of-thought prompting: Asking the model to reason step by step reduces confident errors on multi-step reasoning tasks.
  • Temperature calibration: Lowering temperature reduces creative variation. Not a complete fix, but reduces the variance of hallucinated outputs.
  • Output verification: For high-stakes applications, route model outputs through a separate verification step (another model, a rule-based checker, or human review).

Embeddings and Cosine Similarity

The question: "Explain embeddings and cosine similarity to a technical interviewer who has not worked with NLP."

An embedding is a fixed-length numerical vector that encodes the semantic meaning of a piece of text (a word, a sentence, a document). Texts with similar meanings have similar vectors, and they cluster together in the high-dimensional embedding space.

Cosine similarity measures the cosine of the angle between two vectors, ignoring their magnitudes. Two documents can be short or long (different vector magnitudes) but still talk about the same thing (similar direction). Cosine similarity ranges from -1 (opposite directions) to 1 (identical direction), with 0 meaning the vectors are orthogonal.

python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two vectors."""
    a, b = np.array(a), np.array(b)
    dot = np.dot(a, b)
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Mock embeddings (in production these come from a model like text-embedding-3-small)
doc_a = np.array([0.9, 0.1, 0.8, 0.2])  # "machine learning tutorial"
doc_b = np.array([0.85, 0.15, 0.75, 0.25])  # "ML guide for beginners"
doc_c = np.array([0.1, 0.9, 0.2, 0.8])   # "recipes for chocolate cake"

print(f"A vs B (similar): {cosine_similarity(doc_a, doc_b):.4f}")  # 0.9975
print(f"A vs C (different): {cosine_similarity(doc_a, doc_c):.4f}") # 0.3333

RAG systems use cosine similarity (or dot product for normalized vectors) to find which chunks in a knowledge base are most relevant to a user's query, then pass the top-k chunks to the LLM as context.
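That retrieval step can be sketched in plain NumPy. A minimal illustration (the embedding values are invented; a production system would use an approximate nearest-neighbor index rather than brute-force dot products):

```python
import numpy as np

def top_k_chunks(query_vec, chunk_vecs, k=2):
    """Return indices of the k chunks most similar to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    c = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    sims = c @ q  # cosine similarity = dot product of unit vectors
    return np.argsort(sims)[::-1][:k]  # indices sorted by descending similarity

chunks = np.array([
    [0.9, 0.1, 0.8, 0.2],      # "machine learning tutorial"
    [0.1, 0.9, 0.2, 0.8],      # "recipes for chocolate cake"
    [0.85, 0.15, 0.75, 0.25],  # "ML guide for beginners"
])
query = np.array([0.8, 0.2, 0.7, 0.3])  # "how do I learn ML?"
print(top_k_chunks(query, chunks))  # the two ML chunks rank above the recipe
```

Normalizing the vectors first is why dot product and cosine similarity coincide for unit vectors, which is also why many vector databases store embeddings pre-normalized.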

Evaluating LLM Output Quality

The question: "How would you evaluate whether a production LLM is performing well?"

A strong answer covers four layers:

1. Automated metrics: BLEU and ROUGE score n-gram overlap against reference answers. Useful for translation and summarization, where reference outputs exist. Not useful for open-ended generation.

2. LLM-as-judge: Use a separate, stronger model (often GPT-4 class) to score outputs on dimensions like accuracy, helpfulness, and coherence. This scales better than human evaluation and correlates well with it. Known risks include positional bias (the judge may prefer whichever output appears first) and self-preference (the judge may favor outputs written in its own style).

3. Human evaluation: A/B tests and preference studies. Ground truth for subjective quality dimensions. Slow and expensive, but necessary for establishing baselines.

4. Task-specific metrics: For RAG systems, evaluate retrieval precision (did we retrieve the right chunks?) separately from generation quality (did the model use them correctly?). For code generation, run the generated code in a sandbox and measure pass rate on unit tests.
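Evaluating the retrieval half separately comes down to standard precision@k and recall@k. A quick sketch with a hypothetical example (the chunk IDs and relevance labels are invented):

```python
def retrieval_metrics(retrieved_ids, relevant_ids, k):
    """Precision@k and recall@k for a single query's retrieval step."""
    top_k = list(retrieved_ids)[:k]
    relevant = set(relevant_ids)
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    precision = hits / k if k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Retriever returned chunks [7, 2, 9, 4]; chunks 2 and 5 are actually relevant
p, r = retrieval_metrics([7, 2, 9, 4], [2, 5], k=4)
print(f"precision@4 = {p:.2f}, recall@4 = {r:.2f}")  # precision@4 = 0.25, recall@4 = 0.50
```

Averaging these over a labeled query set tells you whether a bad RAG answer came from retrieval (the right chunks never reached the model) or from generation (the model ignored or misused them).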

From the Interviewer's Perspective: "We want candidates who understand that there is no single number. The interview answer that ends with 'and ultimately you need human evaluation to calibrate your automated metrics' demonstrates production experience."

Conclusion

The theme running through every section of this guide is the same: interviewers test whether you understand mechanisms, not just interfaces. You can use sklearn for production work, but you must be able to explain and implement the mathematics underneath it. This separates the candidate who learned ML by running notebooks from the candidate who can debug a diverging gradient descent at 2am.

The 2026 addition to this pattern is LLM literacy. At the senior level, questions about when to fine-tune vs. use RAG, what drives hallucinations, and how to evaluate generative models are now as standard as bias-variance tradeoff was five years ago. Preparing for these is not a bonus. It is a baseline requirement.

Three principles will carry you through any technical screen you have not seen before. First, always state assumptions explicitly before writing code. Second, handle edge cases before the interviewer asks (empty clusters in k-means, zero-division in precision when no positives are predicted). Third, when in doubt, write the simple version first and then describe how you would make it production-grade. Interviewers value clarity over cleverness.

For further reading on the underlying algorithms, the LDS guides on gradient boosting and logistic regression provide the mathematical derivations in full. The XGBoost vs. LightGBM vs. CatBoost comparison covers hyperparameter tuning with benchmarks.

The interview is not the endpoint. The implementations you practice for the screen are the same implementations that will make you useful once you are hired.

Career Q&A

How many interview rounds should I expect for a data science role at a large technology company in 2026?

Most large technology company processes run four to six rounds: a recruiter screen, a technical phone screen (SQL or Python), a take-home or live coding session, a machine learning concepts round, a case study or product sense round, and a behavioral round. Some organizations have consolidated these into three-hour virtual loops. Expect at least four total touchpoints before an offer.

Should I prioritize SQL or Python for interview prep?

Both, but SQL first if your time is limited. According to patterns reported on InterviewQuery and Blind, SQL appears in over 90% of data science technical screens; Python appears in roughly 80%. More importantly, SQL screens often act as hard filters: a poor SQL performance can end the process before the ML round begins. Prepare SQL to competence, then prepare Python ML implementations.

Is it worth implementing algorithms from scratch if the interviewer said I can use sklearn?

Yes, for the knowledge, not the code. Even if the interviewer permits sklearn, they will ask follow-up questions that require implementation-level understanding. "What would happen if the learning rate were 10x larger?" or "Why did your logistic regression fail to converge?" cannot be answered from the sklearn docs alone.

How much time should I spend on LLM interview prep vs. classical ML?

For senior roles (L5 and above at large technology companies), spend roughly 30% of your preparation time on LLM concepts: fine-tuning, RAG, evaluation, hallucination mitigation, and embedding systems. For mid-level roles, this drops to 15-20%. Classical ML fundamentals remain the majority of questions at all levels.

What is the most common reason strong candidates fail ML technical screens?

Model evaluation questions. Most candidates prepare implementations but underestimate the depth of follow-up on metrics. "What is the AUC?" is easy. "Your model has AUC 0.92 but your product team is seeing too many false positives in production, so walk me through your debugging process" is what actually gets asked.

Should I memorize gradient boosting math formulas?

Not the formulas themselves, but the intuition behind them, yes. You should be able to explain what a residual is, why we train each tree on the previous tree's errors, what the learning rate controls, and why regularization via shallow trees prevents overfitting. An interviewer who asks you to write out the XGBoost objective function from memory is asking a memorization question, not an ML question, and that is unusual.

How do I handle a question I genuinely do not know the answer to?

State what you do know and reason toward the answer. "I am not certain of the exact formula, but I know that AUC is related to the probability that a positive example scores higher than a negative example, so let me try to derive the computation from that definition." Interviewers at well-run companies score partial credit and value structured reasoning over memorized answers.
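That reasoning path can be made concrete. AUC really is the fraction of (positive, negative) pairs where the positive example scores higher, counting ties as half, so the definition translates directly into a few lines (a sketch with made-up scores, not an efficient production implementation):

```python
import numpy as np

def auc_from_pairs(y_true, scores):
    """AUC as P(score of a random positive > score of a random negative)."""
    y_true, scores = np.asarray(y_true), np.asarray(scores)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    wins = (pos[:, None] > neg[None, :]).sum()   # positive outranks negative
    ties = (pos[:, None] == neg[None, :]).sum()  # ties count as half
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

y = [1, 1, 0, 0, 1]
s = [0.9, 0.4, 0.35, 0.8, 0.6]
print(auc_from_pairs(y, s))  # 4 of 6 pairs correctly ordered -> 0.667
```

Deriving a metric live from its probabilistic definition like this is exactly the kind of structured partial-credit answer the paragraph above describes.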
