87% of ML models never make it to production. Of those that do, most fail silently within six months — not from bad code, but from the gap between how data scientists build models and how engineers run software. MLOps closes that gap. The global MLOps market reached $3.4 billion in 2026 and is growing at 37% annually, which tells you everything about how much money is wasted when that gap stays open.
This article walks through the complete production lifecycle for an ML model at a fintech company deploying a loan default prediction system handling 50,000+ daily predictions. Every concept maps to a real decision you'll face.
MLOps Maturity Levels
MLOps maturity measures how much of your ML lifecycle is automated and reproducible. Google's MLOps whitepaper defines three levels, and knowing which level your team is at determines what to build next.
Figure: MLOps maturity levels from manual to full CI/CD automation
Level 0: Manual process. Data scientists train models in notebooks, export weights manually, and hand them off to engineering. There's no pipeline, no versioning, no monitoring. The loan model gets retrained when someone notices accuracy dropping, which usually means a customer complaint first. Most teams start here.
Level 1: ML pipeline automation. Training is automated and reproducible. When new data arrives, the pipeline runs: data validation, feature engineering, training, evaluation. Models are registered in a versioned registry. The loan model retrains on a weekly schedule. Engineers can trigger retraining with a single command. Monitoring exists. Most production teams should be here.
Level 2: CI/CD for ML. Full automation. Code changes trigger pipeline tests. Model quality gates prevent bad models from deploying. Shadow deployments and champion/challenger testing are standard. Retraining can be triggered automatically by drift alerts. This is where financial institutions and large-scale consumer products operate.
Key Insight: Most teams waste time jumping straight to Level 2 infrastructure before they've solved Level 1 basics. Get your training pipeline automated and monitored before worrying about A/B testing frameworks.
The Production ML Pipeline
A production ML pipeline is not a single script — it's a sequence of validated stages where each step checks its own outputs before passing data downstream.
Figure: Full production ML pipeline from data ingestion to monitoring loop
Data Validation and Schema Enforcement
Data validation is the unglamorous work that prevents 70% of production failures. Before any training begins, you need to verify that the incoming data matches what the model expects.
Great Expectations and Pandera are the two main tools here. Great Expectations defines an "expectation suite" — a declarative set of rules for what your data should look like. For the loan model, this includes checks like: credit_score must be between 300 and 850, loan_amount cannot be null, debt_to_income_ratio must be a positive float.
import great_expectations as gx

context = gx.get_context()

# Define expectations on the loan dataset (Great Expectations 1.x API)
suite = context.suites.add(gx.ExpectationSuite(name="loan_features"))
suite.add_expectation(gx.expectations.ExpectColumnValuesToBeBetween(
    column="credit_score", min_value=300, max_value=850
))
suite.add_expectation(gx.expectations.ExpectColumnValuesToNotBeNull(
    column="loan_amount"
))

# Run validation via a checkpoint configured in the GX project
checkpoint = context.checkpoints.get("loan_checkpoint")
results = checkpoint.run()
if not results.success:
    raise ValueError("Data validation failed — aborting pipeline")
Pandera is better for DataFrame-level schemas in Python-native workflows:
import pandera as pa

loan_schema = pa.DataFrameSchema({
    "credit_score": pa.Column(int, pa.Check.between(300, 850)),
    "loan_amount": pa.Column(float, pa.Check.greater_than(0)),
    "debt_to_income_ratio": pa.Column(float, pa.Check.between(0, 1)),
    "annual_income": pa.Column(float, pa.Check.greater_than(0)),
})

validated_df = loan_schema.validate(raw_df)  # Raises SchemaError on failure
Common Pitfall: Skipping data validation because "the data is always clean." Production data sources break. Upstream schema changes, null values appear, distributions shift. Validation catches these before a corrupt model gets deployed.
Feature Engineering and Feature Stores
Feature engineering in production has a hidden trap called training-serving skew. This happens when the features used to train the model differ — even slightly — from the features computed at serving time. The loan model calculates payment_to_income_ratio during training using a batch SQL query. At serving time, the same ratio is computed inline with slightly different rounding. Six months later, the model is systematically miscalibrated and no one knows why.
Feast, the open-source feature store, solves this by becoming the single source of truth for feature computation. You define features once:
from datetime import timedelta

from feast import Entity, FeatureView, Field, FileSource
from feast.types import Float32, Int64

loan_applicant = Entity(name="applicant_id", join_keys=["applicant_id"])

applicant_features = FeatureView(
    name="applicant_credit_features",
    entities=[loan_applicant],
    ttl=timedelta(days=30),
    schema=[
        Field(name="credit_score", dtype=Int64),
        Field(name="payment_to_income_ratio", dtype=Float32),
        Field(name="num_delinquencies_12m", dtype=Int64),
    ],
    source=FileSource(path="data/applicant_features.parquet"),
)
Training retrieves features from the same store as serving. The transformation logic runs once, in one place. No skew.
The three main feature store options in 2026 have distinct trade-offs:
| Feature Store | Best For | Streaming Support | Pricing |
|---|---|---|---|
| Feast | Teams wanting open-source control | Community plugins | Free (infra costs only) |
| Tecton | Enterprise real-time ML (built by Uber's Michelangelo team) | First-class | Enterprise (paid) |
| Hopsworks | Regulated industries needing on-prem or sovereign cloud | Yes | Open-source + managed |
For teams not ready for a full feature store, the minimum viable solution is to extract all feature transformations into a shared Python library that both training and serving import. Same code, same results.
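The shared-library pattern can be as small as a single module; the sketch below is hypothetical (the function name and rounding convention are illustrative, not from a real codebase), but it shows the key property: the transformation, including its rounding, is defined exactly once.

```python
# features.py: imported by BOTH the training pipeline and the serving API,
# so batch and online paths cannot silently diverge.

def payment_to_income_ratio(monthly_payment: float, annual_income: float) -> float:
    """Monthly payment as a fraction of monthly income."""
    if annual_income <= 0:
        raise ValueError("annual_income must be positive")
    monthly_income = annual_income / 12
    # Rounding is fixed in ONE place; this is exactly the detail that
    # causes training-serving skew when duplicated.
    return round(monthly_payment / monthly_income, 4)
```

Training code calls this on batch DataFrames; the serving API calls it per request. Same code, same results.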
Pro Tip: The "hidden cost" of Feast is the engineering time to maintain it. If your team is small and time-to-production matters more than software costs, Tecton's managed service often pays for itself.
Model Training and Experiment Tracking
Experiment tracking records every training run: hyperparameters, metrics, artifacts, environment. Without it, you can't answer "what changed between the model we deployed in January and the one we deployed in March?"
MLflow is the standard open-source choice. For the loan model:
import mlflow
import mlflow.sklearn
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

mlflow.set_experiment("loan-default-prediction")

with mlflow.start_run(run_name="gbm-v3-tuned"):
    model = GradientBoostingClassifier(
        n_estimators=300,
        max_depth=5,
        learning_rate=0.05,
        subsample=0.8,
    )
    model.fit(X_train, y_train)
    auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])

    mlflow.log_params(model.get_params())
    mlflow.log_metric("val_auc", auc)
    mlflow.sklearn.log_model(model, "model")
Weights & Biases is the stronger choice for teams doing deep learning or needing richer visualization, but MLflow handles tree-based models with zero overhead. For teams who want end-to-end pipeline orchestration built in — not just experiment tracking — ZenML integrates with both MLflow and W&B and handles the full CI/CD lifecycle for ML.
Model Evaluation and Validation Gates
A validation gate is a hard check that a model must pass before it can proceed to the next pipeline stage. Gates prevent bad models from reaching production automatically.
For the loan default model, reasonable gates include:
| Gate | Threshold | Rationale |
|---|---|---|
| Validation AUC | > 0.82 | Minimum acceptable discrimination |
| KS Statistic | > 0.30 | Credit-risk industry convention for acceptable class separation |
| Max false negative rate | < 15% | Approved bad loans cost more than rejected good ones |
| Performance vs. champion | > -0.5% AUC | New model must not regress vs. current production |
| Fairness: demographic parity | < 10% gap | Regulatory compliance |
Hardcode these gates in your CI pipeline. If a model doesn't pass, the run fails and no artifact is registered.
Model Registry and Versioning
The model registry is where trained models are stored, versioned, and staged for promotion. MLflow's Model Registry traditionally provides the stages Staging, Production, and Archived; recent MLflow releases deprecate stages in favor of model version aliases, but the promotion workflow is the same.
The promotion workflow matters more than the tool. A model moves through stages via code review and approval, not manual clicks in a UI. In the loan model pipeline, every Friday's retrain produces a Staging candidate. A human reviews the validation metrics. If they look good, a single mlflow.MlflowClient().transition_model_version_stage() call promotes it to Production. The previous version moves to Archived but stays retrievable for rollback.
Model Serving
Model serving is how your trained model becomes a real-time API. For the loan model handling 50,000 daily decisions, the serving layer needs to be fast, fault-tolerant, and observable.
REST API with FastAPI and Docker
FastAPI is the right default for ML model serving. It's fast (ASGI-based), automatically generates API docs, and validates request/response schemas with Pydantic. The pattern below is production-ready:
from fastapi import FastAPI
from pydantic import BaseModel, Field
import mlflow.sklearn
import numpy as np

app = FastAPI(title="Loan Default Prediction API", version="1.0.0")

# Load model once at startup (not per-request)
model = mlflow.sklearn.load_model("models:/loan-default/Production")

class LoanApplication(BaseModel):
    credit_score: int = Field(ge=300, le=850)
    loan_amount: float = Field(gt=0)
    debt_to_income_ratio: float = Field(ge=0, le=1)
    annual_income: float = Field(gt=0)
    num_delinquencies_12m: int = Field(ge=0)

class PredictionResponse(BaseModel):
    default_probability: float
    decision: str
    model_version: str

@app.post("/predict", response_model=PredictionResponse)
async def predict(application: LoanApplication):
    features = np.array([[
        application.credit_score,
        application.loan_amount,
        application.debt_to_income_ratio,
        application.annual_income,
        application.num_delinquencies_12m,
    ]])
    prob = model.predict_proba(features)[0][1]
    decision = "reject" if prob > 0.35 else "approve"
    return PredictionResponse(
        default_probability=round(float(prob), 4),
        decision=decision,
        model_version="1.3.2",
    )

@app.get("/health")
async def health():
    return {"status": "healthy"}
Batch vs Real-Time Inference
The choice between batch and real-time inference is a product requirement, not a technical one.
| Criterion | Real-Time | Batch |
|---|---|---|
| Latency requirement | < 200ms | Hours to days |
| Use case | Interactive decisions | Nightly scoring runs |
| Infrastructure | Always-on API | Job scheduler (Airflow, Prefect) |
| Cost | Higher (always running) | Lower (pay per run) |
| Example | Loan approval at application | Monthly customer risk scoring |
For the loan model: new applications need real-time scoring (applicant waiting on the screen), but the bank's existing portfolio gets batch-scored nightly for risk monitoring.
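The nightly batch path needs no API at all; a chunked scoring job keeps memory bounded regardless of portfolio size. The sketch below assumes a scikit-learn-style model and a feature DataFrame indexed by applicant ID.

```python
import pandas as pd

def batch_score(model, portfolio: pd.DataFrame, chunk_size: int = 10_000) -> pd.DataFrame:
    """Score the existing loan portfolio in chunks for nightly risk monitoring."""
    results = []
    for start in range(0, len(portfolio), chunk_size):
        chunk = portfolio.iloc[start:start + chunk_size]
        probs = model.predict_proba(chunk)[:, 1]  # probability of default
        results.append(pd.DataFrame({
            "applicant_id": chunk.index,
            "default_probability": probs,
        }))
    return pd.concat(results, ignore_index=True)
```

A scheduler like Airflow or Prefect runs this once per night and writes the output table for the risk team.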
Production Serving Frameworks
Beyond FastAPI, four frameworks handle the harder cases:
| Framework | Best For | Key Strength | When to Reach For It |
|---|---|---|---|
| BentoML | Packaging any ML model | Adaptive batching (v1.3+) | Early-stage teams wanting fast deployment |
| Ray Serve | Complex multi-model pipelines | Actor-based horizontal scaling | CPU preprocessing feeding GPU model |
| NVIDIA Triton | GPU-accelerated inference | Multi-framework, dynamic batching | Any DL model in production |
| KServe | Kubernetes-native serving | Standardized inference protocol, autoscaling | Teams already on K8s who need multi-model serving |
BentoML's adaptive batching, added in version 1.3 (late 2025), automatically groups concurrent requests for GPU inference — cutting per-request latency by 40 to 60% under load. KServe, formerly KFServing, has become the Kubernetes standard for multi-model serving with its V2 inference protocol and canary rollout support built in.
Pro Tip: For most teams with scikit-learn or XGBoost models, FastAPI + Docker + a load balancer handles 99% of cases. Reach for Ray Serve or KServe only when you have specific requirements they uniquely solve.
Monitoring and Observability
A model deployed without monitoring is a model you've abandoned. Monitoring answers two questions: is the model still seeing the same kind of data it was trained on, and is it still making good decisions?
Figure: ML monitoring taxonomy: data drift, concept drift, model drift, and infrastructure drift
Data Drift Detection
Data drift occurs when the distribution of production inputs diverges from the training distribution. For the loan model, this could mean a policy change that shifted the applicant pool, or an economic event that changed debt-to-income ratios across the board.
Two statistical tests are standard:
KS test (Kolmogorov-Smirnov): A nonparametric test that measures the maximum absolute difference between two empirical CDFs. Use it for continuous features where you care about distribution shape.
PSI (Population Stability Index): A credit industry standard that quantifies how much a distribution has shifted. Originally developed for credit scoring, it's now widely used across financial ML.
$$\mathrm{PSI} = \sum_{i=1}^{B} (p_i - q_i)\,\ln\frac{p_i}{q_i}$$

Where:
- $p_i$ is the proportion of production observations in bucket $i$
- $q_i$ is the proportion of training observations in bucket $i$
- $B$ is the number of buckets (typically 10 for continuous variables)
- $\ln$ is the natural logarithm
In Plain English: PSI is like comparing two bar charts of loan scores, bucket by bucket. If the bars have shifted significantly between training and production, the PSI will be high. A PSI of 0 means the distributions are identical. A PSI above 0.25 means the population has changed enough that the model's learned patterns may no longer apply.
PSI threshold interpretation — what practitioners actually use:
| PSI Value | Status | Recommended Action |
|---|---|---|
| < 0.10 | Stable | No action needed |
| 0.10 to 0.20 | Minor shift | Monitor more frequently |
| 0.20 to 0.25 | Moderate shift | Investigate features, plan retrain |
| > 0.25 | Major shift | Trigger retraining immediately |
Running KS and PSI checks on a simulated six-month production window for the loan model produces output like this:
KS Statistic: 0.1770
P-value: 0.000000
Drift detected: True
PSI Score: 0.2324
Status: Major shift (model retraining needed)
The KS test flags distributional shift at high confidence (p-value effectively zero). The PSI of 0.23 falls in the "moderate to major" range — this simulated six-month drift scenario would trigger an automatic retraining job in a properly configured monitoring pipeline.
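A drift check that produces output in this shape can be sketched with numpy and scipy. The quantile-based bucketing scheme and the simulated distributions below are assumptions for illustration and will not reproduce the exact figures above.

```python
import numpy as np
from scipy import stats

def population_stability_index(expected: np.ndarray, actual: np.ndarray,
                               buckets: int = 10) -> float:
    """PSI between a training (expected) and production (actual) sample."""
    # Bucket edges come from training-distribution quantiles
    edges = np.percentile(expected, np.linspace(0, 100, buckets + 1))
    # Widen the outer edges so out-of-range production values are still counted
    edges[0] = min(expected.min(), actual.min()) - 1e-9
    edges[-1] = max(expected.max(), actual.max()) + 1e-9
    expected_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    actual_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    expected_pct = np.clip(expected_pct, 1e-6, None)  # avoid log(0)
    actual_pct = np.clip(actual_pct, 1e-6, None)
    return float(np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct)))

rng = np.random.default_rng(42)
train_scores = rng.normal(690, 50, 10_000)   # credit scores at training time
prod_scores = rng.normal(672, 58, 10_000)    # drifted production scores

ks_stat, p_value = stats.ks_2samp(train_scores, prod_scores)
psi = population_stability_index(train_scores, prod_scores)
print(f"KS Statistic: {ks_stat:.4f}  p-value: {p_value:.6f}")
print(f"PSI Score: {psi:.4f}  retrain: {psi > 0.25}")
```

In a real pipeline this runs daily per feature, with the thresholds from the PSI table deciding the response.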
Prediction Drift vs Data Drift
These are not the same thing, and conflating them leads to wrong responses.
Data drift is an input-space problem: the features the model receives have changed. You can detect it without any ground truth labels. Run it daily.
Concept drift is a relationship problem: the statistical relationship between features and the target has changed. Low debt-to-income was a good default predictor in 2024; a macroeconomic shift in 2025 changes that relationship. Concept drift can happen even when input distributions look stable.
Model performance drift is an output-quality problem: you've collected delayed ground truth labels and the model's AUC has fallen. This is the most direct signal but requires waiting for outcomes (for a loan model, you may wait 12 months for default outcomes).
The practical approach: use data drift as an early warning system. Use performance metrics when labels arrive. Design your monitoring to trigger retraining proactively via data drift rather than reactively via performance decay.
Monitoring Tool Landscape in 2026
Three tools dominate the open-source monitoring space:
| Tool | Strengths | Best For | Pricing |
|---|---|---|---|
| Evidently AI | Data drift, target drift, easy dashboards | Startups, text-based models | Free up to 10k rows/month |
| Arize Phoenix | OpenTelemetry-native, 7,800+ GitHub stars, LLM traces | Teams running both classical ML and LLMs | Free (open-source self-hosted) |
| WhyLabs | Privacy-first, real-time guardrails, SOC 2 Type 2 | Regulated industries, GenAI safety | Free tier: 10M predictions/month |
Arize Phoenix, the open-source observability project from Arize AI, moved to the OpenTelemetry standard in 2025, making it the strongest choice for teams whose monitoring needs span both classical models and LLM applications. Evidently AI remains the fastest to set up for pure drift detection.
Alerting and Automated Retraining
For the loan model, a tiered alerting strategy works well:
| Signal | Threshold | Action |
|---|---|---|
| PSI on credit_score | > 0.10 | Slack alert to ML team |
| PSI on credit_score | > 0.25 | Auto-trigger retraining job |
| KS p-value | < 0.01 on 3+ features | Page on-call engineer |
| Model AUC (rolling 30d) | Drops > 2% from baseline | Immediate review |
| Prediction rate to "reject" | Shifts > 15% | Business alert |
Automated retraining on drift signals reduces mean time to recovery from days (waiting for someone to notice) to hours.
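The tier logic itself is trivial to encode; a hypothetical dispatcher keyed to the PSI thresholds above might look like this (the action strings stand in for real Slack/webhook/job-trigger calls):

```python
def drift_response(feature: str, psi: float) -> str:
    """Map a daily PSI reading to the tiered action from the alerting table."""
    if psi > 0.25:
        return f"retrain: auto-trigger retraining job ({feature})"
    if psi > 0.10:
        return f"alert: notify ML team in Slack ({feature})"
    return "ok: no action"
```

The value of keeping this as plain code is that thresholds live in version control next to the model, not in a dashboard someone configured once and forgot.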
CI/CD for ML
CI/CD for machine learning extends traditional software CI/CD with data validation, model evaluation, and staged deployment logic. The key difference: ML pipelines can "pass all tests" and still produce a worse model.
GitHub Actions Workflow
A minimal but production-grade GitHub Actions workflow for the loan model:
name: ML Pipeline

on:
  push:
    branches: [main]
  schedule:
    - cron: '0 2 * * 0'  # Weekly Sunday 2am retrain

jobs:
  validate-data:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run data validation
        run: python scripts/validate_data.py

  train-and-evaluate:
    needs: validate-data
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Train model
        run: python scripts/train.py --experiment-name ${{ github.sha }}
      - name: Run evaluation gates
        run: python scripts/evaluate.py --min-auc 0.82 --max-fnr 0.15
      - name: Register model if gates pass
        run: python scripts/register_model.py --stage Staging

  deploy-shadow:
    needs: train-and-evaluate
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Deploy to shadow environment
        run: kubectl apply -f k8s/shadow-deployment.yaml
      - name: Run 24h shadow validation
        run: python scripts/shadow_validate.py --hours 24
Model Testing
ML models need three kinds of tests that traditional software doesn't:
Unit tests — test individual transformation functions, not the model itself. Does calculate_debt_to_income() handle zero income correctly? Does the feature pipeline produce the right shape?
Integration tests — run the full pipeline on a small known dataset and check outputs. Train a model on 1000 rows, verify it produces predictions with the right schema and value ranges.
Model behavior tests — invariance tests and directional tests. A higher credit score should never increase the default probability. A loan amount 10x larger should increase default probability. These test that the model learned sensible relationships.
# Model behavior test for the loan model
import pandas as pd

def test_credit_score_monotonicity(model, feature_template):
    """Higher credit score should always lower default probability."""
    low_score = feature_template.copy()
    low_score["credit_score"] = 550
    high_score = feature_template.copy()
    high_score["credit_score"] = 780
    # Wrap each case in a single-row DataFrame so sklearn sees named features
    prob_low = model.predict_proba(pd.DataFrame([low_score]))[0][1]
    prob_high = model.predict_proba(pd.DataFrame([high_score]))[0][1]
    assert prob_high < prob_low, (
        f"Credit score monotonicity violated: "
        f"score=780 gave higher default prob ({prob_high:.3f}) "
        f"than score=550 ({prob_low:.3f})"
    )
Shadow Mode and Champion/Challenger
Shadow deployment sends every production request to both the champion (live) model and the challenger (new) model, but only returns the champion's prediction to users. The challenger's outputs are logged silently. After 48 hours, you compare the distributions and, once delayed labels arrive, the accuracy metrics.
Champion/challenger testing is the formal A/B version: the challenger receives a small percentage of live traffic (typically 5 to 10%) and its predictions count. This is a production experiment, not just logging. For the loan model, a small fraction of applicants are actually scored by the new model. The business impact must be acceptable for the duration of the test.
Use shadow mode first. Promote to champion/challenger only after shadow data confirms the challenger behaves as expected.
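In code, shadow mode is a wrapper that never lets the challenger affect the response. The sketch below assumes scikit-learn-style model objects and a standard logger; the key property is that a challenger failure cannot break the live decision path.

```python
import logging

logger = logging.getLogger("shadow")

def shadow_predict(champion, challenger, features):
    """Serve the champion's prediction; log the challenger's silently."""
    champion_prob = champion.predict_proba(features)[0][1]
    try:
        challenger_prob = challenger.predict_proba(features)[0][1]
        logger.info("champion=%.4f challenger=%.4f", champion_prob, challenger_prob)
    except Exception:
        # A challenger failure must never affect the user-facing decision
        logger.exception("challenger scoring failed")
    return champion_prob
```

The logged pairs feed the 48-hour distribution comparison; once delayed labels arrive, the same logs support an accuracy comparison.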
LLMOps: When Your Model Is a Language Model
Classical MLOps covers scikit-learn models, gradient boosted trees, and neural networks with fixed output schemas. LLMOps is the additional layer you need when the model is a large language model — and the differences are significant enough to warrant dedicated tooling.
Figure: MLOps vs LLMOps: key differences in training, evaluation, monitoring, and versioning
The fundamental difference comes down to output behavior. A gradient boosted tree is deterministic: the same input always produces the same prediction. An LLM is probabilistic: the same prompt can produce different outputs on repeated calls, even at temperature 0. This single fact cascades into every part of the operations stack.
What Changes in LLMOps
Prompt versioning replaces model versioning. In classical MLOps, you version model artifacts and hyperparameters. In LLMOps, a small prompt edit can break outputs without any code change, so prompts, system messages, RAG configurations, and guardrail settings all need version control and rollback capability. Your model registry now tracks entire system configurations, not just weights.
Evaluation requires LLM-as-judge. You can't evaluate LLM outputs with AUC or F1. For the loan model, correct is binary — approved or rejected. For an LLM answering customer questions about loan terms, "correct" is subjective. Evaluation requires LLM-as-judge pipelines, human review samples, and metrics like faithfulness, relevance, and hallucination rate. Tools like Arize Phoenix and Braintrust handle this natively.
Monitoring watches for hallucination, not drift. Instead of PSI and KS tests, LLMOps monitoring tracks hallucination rate (measured by a judge model), toxicity scores, PII leakage, and cost per token. Latency is also a first-class metric — a GPT-4o call at 500ms is very different from a local Llama-3 call at 120ms.
Safety is a deployment requirement. With the EU AI Act in force as of 2026, high-risk AI systems require documented guardrails. Toxicity filters, PII redaction, and policy checks aren't optional for production LLMs — they're legal requirements in regulated jurisdictions. WhyLabs and Arize both provide guardrail monitoring as first-class features.
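As one concrete guardrail, a pre-response PII redaction pass can be sketched with regular expressions. The patterns below are illustrative and far from production-complete; real systems use dedicated PII detectors (e.g. NER-based) rather than regexes alone.

```python
import re

# Illustrative PII patterns only; production systems use trained detectors
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def redact_pii(text: str) -> str:
    """Replace detected PII spans with bracketed labels before logging or returning."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text
```

The same pass runs on both the LLM's output (before it reaches the user) and on traces (before they reach the observability store), so PII never lands in logs.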
Key Insight: Gartner projects that over 50% of enterprise generative AI deployments will fail by 2026, with hallucinated outputs and poor grounding as the primary causes. The teams that succeed treat LLM outputs as infrastructure that needs the same observability as any other production system.
LLMOps Tooling in 2026
| Concern | Classical MLOps Tool | LLMOps Tool |
|---|---|---|
| Experiment tracking | MLflow, W&B | LangSmith, Braintrust, W&B |
| Evaluation | Custom pytest | LLM-as-judge (Arize, Braintrust) |
| Monitoring | Evidently AI, WhyLabs | Arize Phoenix, WhyLabs |
| Serving | FastAPI, Triton | vLLM, TGI, LiteLLM |
| Guardrails | Not applicable | NeMo Guardrails, Guardrails AI |
The convergence trend worth watching: platforms like ZenML and Arize Phoenix now handle both classical ML and LLM workflows in the same stack, which matters for teams running hybrid systems (a tree-based fraud model alongside an LLM for customer communication).
When Managed MLOps Beats Self-Built
Vertex AI (Google Cloud) and SageMaker (AWS) offer fully managed MLOps pipelines. The trade-off is clear:
| Criterion | Managed (Vertex/SageMaker) | Self-Built |
|---|---|---|
| Time to first deployment | Days | Weeks to months |
| Infrastructure maintenance | None | Full responsibility |
| Cost at scale | High (vendor markup) | Lower |
| Flexibility | Limited to platform APIs | Complete control |
| Compliance/audit | Built-in | Manual |
| Best for | Startups, regulated industries | Platform teams, 10+ models |
For the fintech loan model at a startup: start with Vertex AI or SageMaker. Get to production, prove the business value, collect real monitoring data. Then, when you understand your actual requirements, decide whether to migrate to self-built infrastructure. Building Kubernetes pipelines before you have a validated model in production is engineering theater.
Common MLOps Mistakes
No monitoring post-deployment. A model goes live and the ML team moves on to the next project. Six months later, the model is silently predicting on a shifted population. The business notices via complaints. This is the most common MLOps failure mode; one widely cited analysis links 73% of AI production failures to unforeseen shifts in input data.
Manual model promotion. Someone SSHes into the production server and copies a model file. No version tracking, no rollback path, no audit trail. Any team doing this will eventually have an incident where they can't identify what model is running or how to revert.
Training on production data without isolation. The model retrains on data that includes its own predictions as ground truth. For the loan model, applicants who were rejected (because the model predicted high default risk) never appear in the training data, creating a feedback loop that amplifies the model's existing biases. This is survivorship bias baked into the training set.
Ignoring the tail. Aggregate metrics look fine; the model's average AUC is 0.84. But the 2% of applicants with unusual income patterns get systematically misclassified. Monitoring should include subgroup analysis, not just aggregate statistics.
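Subgroup monitoring can start as simple as a per-segment AUC report. The sketch below assumes a scored DataFrame with a segment column and observed labels; segments missing one of the two classes are skipped since AUC is undefined there.

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

def subgroup_auc(df: pd.DataFrame, segment_col: str,
                 label_col: str = "defaulted", score_col: str = "score") -> dict:
    """Per-segment AUC: surfaces tail segments that the aggregate metric hides."""
    results = {}
    for segment, group in df.groupby(segment_col):
        if group[label_col].nunique() == 2:  # AUC needs both classes present
            results[segment] = roc_auc_score(group[label_col], group[score_col])
    return results
```

An aggregate AUC of 0.84 can coexist with a 0.60 AUC on a small segment; this report makes that visible.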
Treating LLMOps as identical to MLOps. Teams that bolt an LLM onto their classical MLOps stack and skip prompt versioning, LLM-specific evaluation, and guardrail monitoring are flying blind. The failure modes are completely different.
Conclusion
MLOps is ultimately about one thing: making ML systems reliable enough that you trust them with real decisions. The loan model that drifts undetected is worse than no model at all — it creates an illusion of data-driven decisions while actually producing arbitrary ones.
Start with the basics: automated training, validation gates, a model registry, and drift monitoring. These four things resolve 80% of production ML failures before they become customer-facing problems. Level 2 automation — full CI/CD, shadow deployments, champion/challenger testing — is built on this foundation, not a replacement for it.
If you're deploying LLMs alongside classical models, treat LLMOps as an extension, not an afterthought. Prompt versioning, LLM-specific evaluation, and guardrails need to be first-class citizens in your stack from day one.
For experiment tracking in your pipeline, MLflow is the most direct path from notebook to registered artifact. For teams doing deep learning, Weights & Biases adds richer visualization and collaboration features that MLflow doesn't match. If your system uses RAG or other LLM-based retrieval, the monitoring requirements shift significantly from what this article covers for classical models.
The 87% of models that never reach production fail for organizational and process reasons, not technical ones. MLOps is the process answer.
Interview Questions
What is training-serving skew and why is it hard to debug?
Training-serving skew occurs when features computed during training are calculated differently than features computed at serving time, causing the model to predict on data that doesn't match its training distribution. It's hard to debug because both systems may appear to work correctly in isolation — the skew only surfaces as subtle model miscalibration over time. A feature store solves this by centralizing transformation logic so both environments run the same code.
Explain the difference between data drift, concept drift, and model drift.
Data drift is a change in the statistical distribution of input features — the model receives different kinds of data than it was trained on. Concept drift is a change in the underlying relationship between inputs and the target variable, which can happen even when inputs look stable. Model drift (or performance drift) is the observed degradation in prediction quality, typically measured when delayed ground truth labels become available. Data drift is the earliest detectable signal; concept drift is often invisible until performance metrics fall.
What is PSI and when would you trigger a model retrain based on it?
Population Stability Index (PSI) measures how much a feature's distribution has shifted between two time periods, typically training and production. PSI below 0.10 indicates stable distributions; between 0.10 and 0.20 indicates a minor shift worth monitoring; between 0.20 and 0.25 signals a moderate shift warranting investigation; above 0.25 indicates significant shift that typically warrants immediate model retraining. In credit risk applications, PSI > 0.25 is the conventional hard threshold for triggering retraining.
How does shadow deployment differ from A/B testing for ML models?
In shadow deployment, the new model receives all production requests and logs predictions, but its outputs are never returned to users — only the current production model's predictions count. There is zero business risk. In A/B testing (champion/challenger), the new model's predictions are returned to a fraction of real users, so the test has measurable business impact. Shadow deployment is used for initial validation; champion/challenger is used once shadow data confirms the challenger is safe to expose to live traffic.
Your loan model has 0.85 AUC on the validation set but performs poorly in production. What do you investigate?
First, check for training-serving skew — validate that the features at serving time match the training distribution exactly. Second, examine data drift: compare the production input distribution to the training distribution using PSI or KS tests. Third, look for target leakage in the training data — features that inadvertently encode information about the label. Fourth, check whether the validation set was representative of production traffic, or whether there was temporal leakage (future data in the training split). Finally, verify that the decision threshold (0.35 in our example) is calibrated to the actual production class distribution.
What are model behavior tests and why are they important?
Model behavior tests check that a model produces predictions consistent with known real-world logic, regardless of the specific numbers. For a loan default model: higher credit score should lower default probability (directional test), identical applicants should get identical scores (invariance test), and extreme feature values should produce reasonable outputs. These tests catch cases where a model has technically high accuracy but has learned spurious correlations that will fail on edge cases. Standard metrics like AUC don't detect these problems.
How does LLMOps differ from classical MLOps?
Classical MLOps tracks model artifacts, hyperparameters, and distribution drift using statistical tests like PSI and KS. LLMOps adds prompt versioning, RAG configuration management, LLM-as-judge evaluation, and guardrail monitoring for hallucinations and toxicity. The core difference is output behavior: classical models are deterministic (same input, same output), while LLMs are probabilistic — which means the entire quality measurement and monitoring stack must be different. Teams can't simply reuse their MLflow setup and call it LLMOps.
How would you design an automated retraining trigger system?
The system monitors multiple signals: daily PSI scores on key features, rolling 30-day model performance metrics as labels become available, and prediction distribution changes. Each signal has a tiered response: soft alerts at early thresholds, automatic retraining job triggers at moderate thresholds, and paging on-call engineers at critical thresholds. The retraining job itself is automated but model promotion still requires passing validation gates — the system can train automatically, but should not deploy automatically without quality checks. In practice, I'd connect this to the same CI/CD pipeline used for code changes.