
Production MLOps: Deploying and Monitoring ML Models at Scale

LDS Team
Let's Data Science

87% of ML models never make it to production. Of those that do, most fail silently within six months — not from bad code, but from the gap between how data scientists build models and how engineers run software. MLOps closes that gap. The global MLOps market reached $3.4 billion in 2026 and is growing at 37% annually, which tells you everything about how much money is wasted when that gap stays open.

This article walks through the complete production lifecycle for an ML model at a fintech company deploying a loan default prediction system handling 50,000+ daily predictions. Every concept maps to a real decision you'll face.

MLOps Maturity Levels

MLOps maturity measures how much of your ML lifecycle is automated and reproducible. Google's MLOps whitepaper defines three levels, and knowing which level your team is at determines what to build next.

Figure: MLOps maturity levels from manual to full CI/CD automation

Level 0: Manual process. Data scientists train models in notebooks, export weights manually, and hand them off to engineering. There's no pipeline, no versioning, no monitoring. The loan model gets retrained when someone notices accuracy dropping, which usually means a customer complaint first. Most teams start here.

Level 1: ML pipeline automation. Training is automated and reproducible. When new data arrives, the pipeline runs: data validation, feature engineering, training, evaluation. Models are registered in a versioned registry. The loan model retrains on a weekly schedule. Engineers can trigger retraining with a single command. Monitoring exists. Most production teams should be here.

Level 2: CI/CD for ML. Full automation. Code changes trigger pipeline tests. Model quality gates prevent bad models from deploying. Shadow deployments and champion/challenger testing are standard. Retraining can be triggered automatically by drift alerts. This is where financial institutions and large-scale consumer products operate.

Key Insight: Most teams waste time jumping straight to Level 2 infrastructure before they've solved Level 1 basics. Get your training pipeline automated and monitored before worrying about A/B testing frameworks.

The Production ML Pipeline

A production ML pipeline is not a single script — it's a sequence of validated stages where each step checks its own outputs before passing data downstream.
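That contract can be sketched in a few lines: each stage transforms its input and checks its own output before handing off. The stage names and checks below are illustrative, not tied to any orchestration framework.

```python
def run_pipeline(data, stages):
    """Run named stages in order. Each stage's output is validated
    before it reaches the next stage, so failures stop the pipeline early."""
    for name, transform, output_check in stages:
        data = transform(data)
        if not output_check(data):
            raise RuntimeError(f"Stage '{name}' produced invalid output")
    return data

# Toy usage: two stages, each with its own output check
stages = [
    ("clean", lambda rows: [r for r in rows if r is not None],
     lambda rows: len(rows) > 0),
    ("scale", lambda rows: [r / 100 for r in rows],
     lambda rows: all(0 <= r <= 1 for r in rows)),
]
print(run_pipeline([50, None, 80], stages))
```

The same shape holds whether the stages are pandas transformations in a script or tasks in Airflow: validation lives inside the pipeline, not around it.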

Figure: Full production ML pipeline from data ingestion to monitoring loop

Data Validation and Schema Enforcement

Data validation is the unglamorous work that prevents 70% of production failures. Before any training begins, you need to verify that the incoming data matches what the model expects.

Great Expectations and Pandera are the two main tools here. Great Expectations defines an "expectation suite" — a declarative set of rules for what your data should look like. For the loan model, this includes checks like: credit_score must be between 300 and 850, loan_amount cannot be null, debt_to_income_ratio must be a positive float.

python
import great_expectations as gx

context = gx.get_context()
# Create a suite of validation rules (Great Expectations 1.x fluent API)
suite = context.suites.add(gx.ExpectationSuite(name="loan_features"))

# Define expectations on the loan dataset
suite.add_expectation(gx.expectations.ExpectColumnValuesToBeBetween(
    column="credit_score", min_value=300, max_value=850
))
suite.add_expectation(gx.expectations.ExpectColumnValuesToNotBeNull(
    column="loan_amount"
))

# Run a pre-configured checkpoint that validates the incoming batch
results = context.checkpoints.get("loan_checkpoint").run()
if not results.success:
    raise ValueError("Data validation failed — aborting pipeline")

Pandera is better for DataFrame-level schemas in Python-native workflows:

python
import pandera as pa

loan_schema = pa.DataFrameSchema({
    "credit_score": pa.Column(int, pa.Check.between(300, 850)),
    "loan_amount": pa.Column(float, pa.Check.greater_than(0)),
    "debt_to_income_ratio": pa.Column(float, pa.Check.between(0, 1)),
    "annual_income": pa.Column(float, pa.Check.greater_than(0)),
})

validated_df = loan_schema.validate(raw_df)  # Raises SchemaError on failure

Common Pitfall: Skipping data validation because "the data is always clean." Production data sources break. Upstream schema changes, null values appear, distributions shift. Validation catches these before a corrupt model gets deployed.

Feature Engineering and Feature Stores

Feature engineering in production has a hidden trap called training-serving skew. This happens when the features used to train the model differ — even slightly — from the features computed at serving time. The loan model calculates payment_to_income_ratio during training using a batch SQL query. At serving time, the same ratio is computed inline with slightly different rounding. Six months later, the model is systematically miscalibrated and no one knows why.

Feast, the open-source feature store, solves this by becoming the single source of truth for feature computation. You define features once:

python
from datetime import timedelta

from feast import FeatureView, Entity, Field, FileSource
from feast.types import Float32, Int64

loan_applicant = Entity(name="applicant_id", join_keys=["applicant_id"])

applicant_features = FeatureView(
    name="applicant_credit_features",
    entities=[loan_applicant],
    ttl=timedelta(days=30),
    schema=[
        Field(name="credit_score", dtype=Int64),
        Field(name="payment_to_income_ratio", dtype=Float32),
        Field(name="num_delinquencies_12m", dtype=Int64),
    ],
    source=FileSource(path="data/applicant_features.parquet"),
)

Training retrieves features from the same store as serving. The transformation logic runs once, in one place. No skew.

The three main feature store options in 2026 have distinct trade-offs:

| Feature Store | Best For | Streaming Support | Pricing |
|---|---|---|---|
| Feast | Teams wanting open-source control | Community plugins | Free (infra costs only) |
| Tecton | Enterprise real-time ML (built by Uber's Michelangelo team) | First-class | Enterprise (paid) |
| Hopsworks | Regulated industries needing on-prem or sovereign cloud | Yes | Open-source + managed |

For teams not ready for a full feature store, the minimum viable solution is to extract all feature transformations into a shared Python library that both training and serving import. Same code, same results.
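A minimal sketch of that shared-library approach; the module and function names here are illustrative:

```python
# features.py: a single module imported by BOTH the training job
# and the serving API, so the logic cannot silently diverge.

def payment_to_income_ratio(monthly_payment: float, annual_income: float) -> float:
    """One definition of the feature, one rounding rule, one place to fix bugs."""
    if annual_income <= 0:
        raise ValueError("annual_income must be positive")
    monthly_income = annual_income / 12
    return round(monthly_payment / monthly_income, 4)
```

Training computes the column by applying this function over the batch dataset; the serving handler calls the exact same function per request, eliminating the rounding mismatch described above.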

Pro Tip: The "hidden cost" of Feast is the engineering time to maintain it. If your team is small and time-to-production matters more than software costs, Tecton's managed service often pays for itself.

Model Training and Experiment Tracking

Experiment tracking records every training run: hyperparameters, metrics, artifacts, environment. Without it, you can't answer "what changed between the model we deployed in January and the one we deployed in March?"

MLflow is the standard open-source choice. For the loan model:

python
import mlflow
import mlflow.sklearn
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

mlflow.set_experiment("loan-default-prediction")

with mlflow.start_run(run_name="gbm-v3-tuned"):
    model = GradientBoostingClassifier(
        n_estimators=300,
        max_depth=5,
        learning_rate=0.05,
        subsample=0.8
    )
    model.fit(X_train, y_train)

    auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])

    mlflow.log_params(model.get_params())
    mlflow.log_metric("val_auc", auc)
    mlflow.sklearn.log_model(model, "model")

Weights & Biases is the stronger choice for teams doing deep learning or needing richer visualization, but MLflow handles tree-based models with zero overhead. For teams who want end-to-end pipeline orchestration built in — not just experiment tracking — ZenML integrates with both MLflow and W&B and handles the full CI/CD lifecycle for ML.

Model Evaluation and Validation Gates

A validation gate is a hard check that a model must pass before it can proceed to the next pipeline stage. Gates prevent bad models from reaching production automatically.

For the loan default model, reasonable gates include:

| Gate | Threshold | Rationale |
|---|---|---|
| Validation AUC | > 0.82 | Minimum acceptable discrimination |
| KS Statistic | > 0.30 | Credit risk regulatory requirement |
| Max false negative rate | < 15% | Approved bad loans cost more than rejected good ones |
| Performance vs. champion | > -0.5% AUC | New model must not regress vs. current production |
| Fairness: demographic parity | < 10% gap | Regulatory compliance |

Hardcode these gates in your CI pipeline. If a model doesn't pass, the run fails and no artifact is registered.

Model Registry and Versioning

The model registry is where trained models are stored, versioned, and staged for promotion. MLflow Registry provides three stages: Staging, Production, and Archived.

The promotion workflow matters more than the tool. A model moves through stages via code review and approval, not manual clicks in a UI. In the loan model pipeline, every Friday's retrain produces a Staging candidate. A human reviews the validation metrics. If they look good, a single mlflow.MlflowClient().transition_model_version_stage() call promotes it to Production. The previous version moves to Archived but stays retrievable for rollback.

Model Serving

Model serving is how your trained model becomes a real-time API. For the loan model handling 50,000 daily decisions, the serving layer needs to be fast, fault-tolerant, and observable.

REST API with FastAPI and Docker

FastAPI is the right default for ML model serving. It's fast (ASGI-based), automatically generates API docs, and validates request/response schemas with Pydantic. The pattern below is production-ready:

python
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel, Field
import mlflow.sklearn
import numpy as np

app = FastAPI(title="Loan Default Prediction API", version="1.0.0")

# Load model once at startup (not per-request)
model = mlflow.sklearn.load_model("models:/loan-default/Production")

class LoanApplication(BaseModel):
    credit_score: int = Field(ge=300, le=850)
    loan_amount: float = Field(gt=0)
    debt_to_income_ratio: float = Field(ge=0, le=1)
    annual_income: float = Field(gt=0)
    num_delinquencies_12m: int = Field(ge=0)

class PredictionResponse(BaseModel):
    default_probability: float
    decision: str
    model_version: str

@app.post("/predict", response_model=PredictionResponse)
async def predict(application: LoanApplication):
    features = np.array([[
        application.credit_score,
        application.loan_amount,
        application.debt_to_income_ratio,
        application.annual_income,
        application.num_delinquencies_12m
    ]])

    prob = model.predict_proba(features)[0][1]
    decision = "reject" if prob > 0.35 else "approve"

    return PredictionResponse(
        default_probability=round(float(prob), 4),
        decision=decision,
        model_version="1.3.2"
    )

@app.get("/health")
async def health():
    return {"status": "healthy"}

Batch vs Real-Time Inference

The choice between batch and real-time inference is a product requirement, not a technical one.

| Criterion | Real-Time | Batch |
|---|---|---|
| Latency requirement | < 200ms | Hours to days |
| Use case | Interactive decisions | Nightly scoring runs |
| Infrastructure | Always-on API | Job scheduler (Airflow, Prefect) |
| Cost | Higher (always running) | Lower (pay per run) |
| Example | Loan approval at application | Monthly customer risk scoring |

For the loan model: new applications need real-time scoring (applicant waiting on the screen), but the bank's existing portfolio gets batch-scored nightly for risk monitoring.

Production Serving Frameworks

Beyond FastAPI, four frameworks handle the harder cases:

| Framework | Best For | Key Strength | When to Reach For It |
|---|---|---|---|
| BentoML | Packaging any ML model | Adaptive batching (v1.3+) | Early-stage teams wanting fast deployment |
| Ray Serve | Complex multi-model pipelines | Actor-based horizontal scaling | CPU preprocessing feeding GPU model |
| NVIDIA Triton | GPU-accelerated inference | Multi-framework, dynamic batching | Any DL model in production |
| KServe | Kubernetes-native serving | Standardized inference protocol, autoscaling | Teams already on K8s who need multi-model serving |

BentoML's adaptive batching, added in version 1.3 (late 2025), automatically groups concurrent requests for GPU inference — cutting per-request latency by 40 to 60% under load. KServe, formerly KFServing, has become the Kubernetes standard for multi-model serving with its V2 inference protocol and canary rollout support built in.

Pro Tip: For most teams with scikit-learn or XGBoost models, FastAPI + Docker + a load balancer handles 99% of cases. Reach for Ray Serve or KServe only when you have specific requirements they uniquely solve.

Monitoring and Observability

A model deployed without monitoring is a model you've abandoned. Monitoring answers two questions: is the model still seeing the same kind of data it was trained on, and is it still making good decisions?

Figure: ML monitoring taxonomy: data drift, concept drift, model drift, and infrastructure drift

Data Drift Detection

Data drift occurs when the distribution of production inputs diverges from the training distribution. For the loan model, this could mean a policy change that shifted the applicant pool, or an economic event that changed debt-to-income ratios across the board.

Two statistical tests are standard:

KS test (Kolmogorov-Smirnov): A nonparametric test that measures the maximum absolute difference between two empirical CDFs. Use it for continuous features where you care about distribution shape.

PSI (Population Stability Index): A credit industry standard that quantifies how much a distribution has shifted. Originally developed for credit scoring, it's now widely used across financial ML.

$$\text{PSI} = \sum_{i=1}^{n} \left(\text{Actual}_i - \text{Expected}_i\right) \times \ln\left(\frac{\text{Actual}_i}{\text{Expected}_i}\right)$$

Where:

  • Actual_i is the proportion of production observations in bucket i
  • Expected_i is the proportion of training observations in bucket i
  • n is the number of buckets (typically 10 for continuous variables)
  • ln is the natural logarithm

In Plain English: PSI is like comparing two bar charts of loan scores, bucket by bucket. If the bars have shifted significantly between training and production, the PSI will be high. A PSI of 0 means the distributions are identical. A PSI above 0.25 means the population has changed enough that the model's learned patterns may no longer apply.

PSI threshold interpretation — what practitioners actually use:

| PSI Value | Status | Recommended Action |
|---|---|---|
| < 0.10 | Stable | No action needed |
| 0.10 to 0.20 | Minor shift | Monitor more frequently |
| 0.20 to 0.25 | Moderate shift | Investigate features, plan retrain |
| > 0.25 | Major shift | Trigger retraining immediately |

Running both checks on the loan model's credit_score feature (training data vs. six months of production data) reports:

code
KS Statistic: 0.1770
P-value: 0.000000
Drift detected: True

PSI Score: 0.2324
Status: Moderate shift (investigate, plan retrain)

The KS test flags distributional shift at high confidence (p-value effectively zero). The PSI of 0.23 sits at the top of the "moderate shift" band, just below the 0.25 hard threshold. In a properly configured monitoring pipeline, this simulated six-month drift scenario would prompt an investigation and a planned retrain, with an automatic retraining job firing as soon as PSI crosses 0.25.
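The checks behind those numbers can be sketched with numpy and scipy. The simulated score distributions and the quantile bucketing choice below are illustrative; production code would compare stored training data against a daily production sample.

```python
import numpy as np
from scipy import stats

def population_stability_index(expected, actual, buckets=10):
    """PSI between a training (expected) and production (actual) sample."""
    # Bucket edges come from the training distribution's quantiles
    edges = np.quantile(expected, np.linspace(0, 1, buckets + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # catch out-of-range production values

    expected_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    actual_pct = np.histogram(actual, bins=edges)[0] / len(actual)

    # Floor tiny proportions to avoid division by zero / log(0)
    expected_pct = np.clip(expected_pct, 1e-6, None)
    actual_pct = np.clip(actual_pct, 1e-6, None)

    return float(np.sum((actual_pct - expected_pct)
                        * np.log(actual_pct / expected_pct)))

# Simulated drift: production credit scores shifted down vs. training
rng = np.random.default_rng(42)
train_scores = rng.normal(690, 50, 10_000).clip(300, 850)
prod_scores = rng.normal(670, 55, 10_000).clip(300, 850)

res = stats.ks_2samp(train_scores, prod_scores)
psi = population_stability_index(train_scores, prod_scores)

print(f"KS Statistic: {res.statistic:.4f}  (p-value: {res.pvalue:.6f})")
print(f"PSI Score: {psi:.4f}")
```

Identical samples yield a PSI of zero; the larger the shift between the two histograms, bucket by bucket, the larger the score.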

Prediction Drift vs Data Drift

These are not the same thing, and conflating them leads to wrong responses.

Data drift is an input-space problem: the features the model receives have changed. You can detect it without any ground truth labels. Run it daily.

Concept drift is a relationship problem: the statistical relationship between features and the target has changed. Low debt-to-income was a good default predictor in 2024; a macroeconomic shift in 2025 changes that relationship. Concept drift can happen even when input distributions look stable.

Model performance drift is an output-quality problem: you've collected delayed ground truth labels and the model's AUC has fallen. This is the most direct signal but requires waiting for outcomes (for a loan model, you may wait 12 months for default outcomes).

The practical approach: use data drift as an early warning system. Use performance metrics when labels arrive. Design your monitoring to trigger retraining proactively via data drift rather than reactively via performance decay.

Monitoring Tool Landscape in 2026

Three tools dominate the open-source monitoring space:

| Tool | Strengths | Best For | Pricing |
|---|---|---|---|
| Evidently AI | Data drift, target drift, easy dashboards | Startups, text-based models | Free up to 10k rows/month |
| Arize Phoenix | OpenTelemetry-native, 7,800+ GitHub stars, LLM traces | Teams running both classical ML and LLMs | Free (open-source self-hosted) |
| WhyLabs | Privacy-first, real-time guardrails, SOC 2 Type 2 | Regulated industries, GenAI safety | Free tier: 10M predictions/month |

Arize Phoenix (formerly just Arize) moved to the OpenTelemetry standard in 2025, making it the strongest choice for teams whose monitoring needs span both classical models and LLM applications. Evidently AI remains the fastest to set up for pure drift detection.

Alerting and Automated Retraining

For the loan model, a tiered alerting strategy works well:

| Signal | Threshold | Action |
|---|---|---|
| PSI on credit_score | > 0.10 | Slack alert to ML team |
| PSI on credit_score | > 0.25 | Auto-trigger retraining job |
| KS p-value | < 0.01 on 3+ features | Page on-call engineer |
| Model AUC (rolling 30d) | Drops > 2% from baseline | Immediate review |
| Prediction rate to "reject" | Shifts > 15% | Business alert |

Automated retraining on drift signals reduces mean time to recovery from days (waiting for someone to notice) to hours.
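The tiered PSI rows above reduce to a small dispatcher. The thresholds are the illustrative ones from this article, and the action names are placeholders for real integrations (Slack webhook, pipeline trigger):

```python
def psi_alert_action(psi: float) -> str:
    """Map a daily PSI reading on credit_score to a tiered response."""
    if psi > 0.25:
        return "auto_trigger_retraining"  # major shift: retrain without waiting
    if psi > 0.10:
        return "slack_alert_ml_team"      # minor shift: humans take a look
    return "no_action"

print(psi_alert_action(0.08), psi_alert_action(0.18), psi_alert_action(0.31))
```

The point of encoding thresholds in code rather than in a runbook: the response to drift becomes testable, versioned, and identical every time.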

CI/CD for ML

CI/CD for machine learning extends traditional software CI/CD with data validation, model evaluation, and staged deployment logic. The key difference: ML pipelines can "pass all tests" and still produce a worse model.

GitHub Actions Workflow

A minimal but production-grade GitHub Actions workflow for the loan model:

yaml
name: ML Pipeline

on:
  push:
    branches: [main]
  schedule:
    - cron: '0 2 * * 0'  # Weekly Sunday 2am retrain

jobs:
  validate-data:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run data validation
        run: python scripts/validate_data.py

  train-and-evaluate:
    needs: validate-data
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Train model
        run: python scripts/train.py --experiment-name ${{ github.sha }}
      - name: Run evaluation gates
        run: python scripts/evaluate.py --min-auc 0.82 --max-fnr 0.15
      - name: Register model if gates pass
        run: python scripts/register_model.py --stage Staging

  deploy-shadow:
    needs: train-and-evaluate
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Deploy to shadow environment
        run: kubectl apply -f k8s/shadow-deployment.yaml
      - name: Run 24h shadow validation
        run: python scripts/shadow_validate.py --hours 24

Model Testing

ML models need three kinds of tests that traditional software doesn't:

Unit tests — test individual transformation functions, not the model itself. Does calculate_debt_to_income() handle zero income correctly? Does the feature pipeline produce the right shape?

Integration tests — run the full pipeline on a small known dataset and check outputs. Train a model on 1000 rows, verify it produces predictions with the right schema and value ranges.

Model behavior tests — invariance tests and directional tests. A higher credit score should never increase the default probability. A loan amount 10x larger should increase default probability. These test that the model learned sensible relationships.

python
# Model behavior test for the loan model
def test_credit_score_monotonicity(model, feature_template):
    """Higher credit score should always lower default probability.

    feature_template is a baseline feature row (e.g. a pandas Series in the
    model's expected column order) that each case copies and modifies.
    """
    low_score = feature_template.copy()
    low_score["credit_score"] = 550

    high_score = feature_template.copy()
    high_score["credit_score"] = 780

    prob_low = model.predict_proba([low_score])[0][1]
    prob_high = model.predict_proba([high_score])[0][1]

    assert prob_high < prob_low, (
        f"Credit score monotonicity violated: "
        f"score=780 gave higher default prob ({prob_high:.3f}) "
        f"than score=550 ({prob_low:.3f})"
    )

Shadow Mode and Champion/Challenger

Shadow deployment sends every production request to both the champion (live) model and the challenger (new) model, but only returns the champion's prediction to users. The challenger's outputs are logged silently. After 48 hours, you compare the distributions and, once delayed labels arrive, the accuracy metrics.

Champion/challenger testing is the formal A/B version: the challenger receives a small percentage of live traffic (typically 5 to 10%) and its predictions count. This is a production experiment, not just logging. For the loan model, a small fraction of applicants are actually scored by the new model. The business impact must be acceptable for the duration of the test.

Use shadow mode first. Promote to champion/challenger only after shadow data confirms the challenger behaves as expected.
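The core of the shadow-serving pattern fits in one function. The two models are passed as plain callables here for illustration; in the loan API they would be loaded from the registry:

```python
import logging

logger = logging.getLogger("shadow")

def predict_with_shadow(champion, challenger, features):
    """Return the champion's prediction; record the challenger's silently.

    A challenger failure must never affect the user-facing response."""
    champion_prob = champion(features)
    try:
        challenger_prob = challenger(features)  # logged, never returned
        logger.info("shadow comparison: champion=%.4f challenger=%.4f",
                    champion_prob, challenger_prob)
    except Exception:
        logger.exception("challenger failed; champion response unaffected")
    return champion_prob
```

After the shadow window closes, the logged pairs are compared offline; only the champion's output ever reached a user.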

LLMOps: When Your Model Is a Language Model

Classical MLOps covers scikit-learn models, gradient boosted trees, and neural networks with fixed output schemas. LLMOps is the additional layer you need when the model is a large language model — and the differences are significant enough to warrant dedicated tooling.

Figure: MLOps vs LLMOps: key differences in training, evaluation, monitoring, and versioning

The fundamental difference comes down to output behavior. A gradient boosted tree is deterministic: the same input always produces the same prediction. An LLM is probabilistic: the same prompt can produce different outputs on repeated calls, even at temperature 0. This single fact cascades into every part of the operations stack.

What Changes in LLMOps

Prompt versioning replaces model versioning. In classical MLOps, you version model artifacts and hyperparameters. In LLMOps, a small prompt edit can break outputs without any code change, so prompts, system messages, RAG configurations, and guardrail settings all need version control and rollback capability. Your model registry now tracks entire system configurations, not just weights.
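One lightweight way to make that concrete is to treat the whole generation configuration as a single versioned, immutable object. The field names below are illustrative:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PromptConfig:
    """Everything that changes the LLM's behavior, versioned as one unit."""
    version: str
    system_message: str
    model: str
    temperature: float
    rag_top_k: int

v12 = PromptConfig(
    version="12",
    system_message="You answer questions about loan terms. Cite the policy document.",
    model="gpt-4o",
    temperature=0.0,
    rag_top_k=5,
)
```

Because the object is frozen, a "quick prompt tweak" forces a new version rather than a silent in-place edit, which is exactly the audit trail that rollback requires.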

Evaluation requires LLM-as-judge. You can't evaluate LLM outputs with AUC or F1. For the loan model, correct is binary — approved or rejected. For an LLM answering customer questions about loan terms, "correct" is subjective. Evaluation requires LLM-as-judge pipelines, human review samples, and metrics like faithfulness, relevance, and hallucination rate. Tools like Arize Phoenix and Braintrust handle this natively.

Monitoring watches for hallucination, not drift. Instead of PSI and KS tests, LLMOps monitoring tracks hallucination rate (measured by a judge model), toxicity scores, PII leakage, and cost per token. Latency is also a first-class metric — a GPT-4o call at 500ms is very different from a local Llama-3 call at 120ms.

Safety is a deployment requirement. With the EU AI Act in force as of 2026, high-risk AI systems require documented guardrails. Toxicity filters, PII redaction, and policy checks aren't optional for production LLMs — they're legal requirements in regulated jurisdictions. WhyLabs and Arize both provide guardrail monitoring as first-class features.

Key Insight: Gartner projects that over 50% of enterprise generative AI deployments will fail by 2026, with hallucinated outputs and poor grounding as the primary causes. The teams that succeed treat LLM outputs as infrastructure that needs the same observability as any other production system.

LLMOps Tooling in 2026

| Concern | Classical MLOps Tool | LLMOps Tool |
|---|---|---|
| Experiment tracking | MLflow, W&B | LangSmith, Braintrust, W&B |
| Evaluation | Custom pytest | LLM-as-judge (Arize, Braintrust) |
| Monitoring | Evidently AI, WhyLabs | Arize Phoenix, WhyLabs |
| Serving | FastAPI, Triton | vLLM, TGI, LiteLLM |
| Guardrails | Not applicable | NeMo Guardrails, Guardrails AI |

The convergence trend worth watching: platforms like ZenML and Arize Phoenix now handle both classical ML and LLM workflows in the same stack, which matters for teams running hybrid systems (a tree-based fraud model alongside an LLM for customer communication).

When Managed MLOps Beats Self-Built

Vertex AI (Google Cloud) and SageMaker (AWS) offer fully managed MLOps pipelines. The trade-off is clear:

| Criterion | Managed (Vertex/SageMaker) | Self-Built |
|---|---|---|
| Time to first deployment | Days | Weeks to months |
| Infrastructure maintenance | None | Full responsibility |
| Cost at scale | High (vendor markup) | Lower |
| Flexibility | Limited to platform APIs | Complete control |
| Compliance/audit | Built-in | Manual |
| Best for | Startups, regulated industries | Platform teams, 10+ models |

For the fintech loan model at a startup: start with Vertex AI or SageMaker. Get to production, prove the business value, collect real monitoring data. Then, when you understand your actual requirements, decide whether to migrate to self-built infrastructure. Building Kubernetes pipelines before you have a validated model in production is engineering theater.

Common MLOps Mistakes

No monitoring post-deployment. A model goes live and the ML team moves on to the next project. Six months later, the model is silently predicting on a shifted population. The business notices via complaints. This is the most common MLOps failure mode; industry analyses have linked roughly 73% of AI production failures to unforeseen shifts in input data relevance.

Manual model promotion. Someone SSHes into the production server and copies a model file. No version tracking, no rollback path, no audit trail. Any team doing this will eventually have an incident where they can't identify what model is running or how to revert.

Training on production data without isolation. The model retrains on data that includes its own predictions as ground truth. For the loan model, applicants who were rejected (because the model predicted high default risk) never appear in the training data, creating a feedback loop that amplifies the model's existing biases. This is survivorship bias baked into the training set.

Ignoring the tail. Aggregate metrics look fine; the model's average AUC is 0.84. But the 2% of applicants with unusual income patterns get systematically misclassified. Monitoring should include subgroup analysis, not just aggregate statistics.

Treating LLMOps as identical to MLOps. Teams that bolt an LLM onto their classical MLOps stack and skip prompt versioning, LLM-specific evaluation, and guardrail monitoring are flying blind. The failure modes are completely different.

Conclusion

MLOps is ultimately about one thing: making ML systems reliable enough that you trust them with real decisions. The loan model that drifts undetected is worse than no model at all — it creates an illusion of data-driven decisions while actually producing arbitrary ones.

Start with the basics: automated training, validation gates, a model registry, and drift monitoring. These four things resolve 80% of production ML failures before they become customer-facing problems. Level 2 automation — full CI/CD, shadow deployments, champion/challenger testing — is built on this foundation, not a replacement for it.

If you're deploying LLMs alongside classical models, treat LLMOps as an extension, not an afterthought. Prompt versioning, LLM-specific evaluation, and guardrails need to be first-class citizens in your stack from day one.

For experiment tracking in your pipeline, MLflow is the most direct path from notebook to registered artifact. For teams doing deep learning, Weights & Biases adds richer visualization and collaboration features that MLflow doesn't match. If your system uses RAG or other LLM-based retrieval, the monitoring requirements shift significantly from what this article covers for classical models.

The 87% of models that never reach production fail for organizational and process reasons, not technical ones. MLOps is the process answer.

Interview Questions

What is training-serving skew and why is it hard to debug?

Training-serving skew occurs when features computed during training are calculated differently than features computed at serving time, causing the model to predict on data that doesn't match its training distribution. It's hard to debug because both systems may appear to work correctly in isolation — the skew only surfaces as subtle model miscalibration over time. A feature store solves this by centralizing transformation logic so both environments run the same code.

Explain the difference between data drift, concept drift, and model drift.

Data drift is a change in the statistical distribution of input features — the model receives different kinds of data than it was trained on. Concept drift is a change in the underlying relationship between inputs and the target variable, which can happen even when inputs look stable. Model drift (or performance drift) is the observed degradation in prediction quality, typically measured when delayed ground truth labels become available. Data drift is the earliest detectable signal; concept drift is often invisible until performance metrics fall.

What is PSI and when would you trigger a model retrain based on it?

Population Stability Index (PSI) measures how much a feature's distribution has shifted between two time periods, typically training and production. PSI below 0.10 indicates stable distributions; between 0.10 and 0.20 indicates a minor shift worth monitoring; between 0.20 and 0.25 signals a moderate shift warranting investigation; above 0.25 indicates significant shift that typically warrants immediate model retraining. In credit risk applications, PSI > 0.25 is the conventional hard threshold for triggering retraining.

How does shadow deployment differ from A/B testing for ML models?

In shadow deployment, the new model receives all production requests and logs predictions, but its outputs are never returned to users — only the current production model's predictions count. There is zero business risk. In A/B testing (champion/challenger), the new model's predictions are returned to a fraction of real users, so the test has measurable business impact. Shadow deployment is used for initial validation; champion/challenger is used once shadow data confirms the challenger is safe to expose to live traffic.

Your loan model has 0.85 AUC on the validation set but performs poorly in production. What do you investigate?

First, check for training-serving skew — validate that the features at serving time match the training distribution exactly. Second, examine data drift: compare the production input distribution to the training distribution using PSI or KS tests. Third, look for target leakage in the training data — features that inadvertently encode information about the label. Fourth, check whether the validation set was representative of production traffic, or whether there was temporal leakage (future data in the training split). Finally, verify that the decision threshold (0.35 in our example) is calibrated to the actual production class distribution.

What are model behavior tests and why are they important?

Model behavior tests check that a model produces predictions consistent with known real-world logic, regardless of the specific numbers. For a loan default model: higher credit score should lower default probability (directional test), identical applicants should get identical scores (invariance test), and extreme feature values should produce reasonable outputs. These tests catch cases where a model has technically high accuracy but has learned spurious correlations that will fail on edge cases. Standard metrics like AUC don't detect these problems.

How does LLMOps differ from classical MLOps?

Classical MLOps tracks model artifacts, hyperparameters, and distribution drift using statistical tests like PSI and KS. LLMOps adds prompt versioning, RAG configuration management, LLM-as-judge evaluation, and guardrail monitoring for hallucinations and toxicity. The core difference is output behavior: classical models are deterministic (same input, same output), while LLMs are probabilistic — which means the entire quality measurement and monitoring stack must be different. Teams can't simply reuse their MLflow setup and call it LLMOps.

How would you design an automated retraining trigger system?

The system monitors multiple signals: daily PSI scores on key features, rolling 30-day model performance metrics as labels become available, and prediction distribution changes. Each signal has a tiered response: soft alerts at early thresholds, automatic retraining job triggers at moderate thresholds, and paging on-call engineers at critical thresholds. The retraining job itself is automated but model promotion still requires passing validation gates — the system can train automatically, but should not deploy automatically without quality checks. In practice, I'd connect this to the same CI/CD pipeline used for code changes.
