Model Drift Lab
Watch a prediction model degrade in real time as customer data shifts beneath it. Diagnose which features drifted, retrain to recover, and learn when monitoring alerts should fire — the complete model lifecycle in one interactive lab.
OVERVIEW & LEARNING OBJECTIVES
A prediction model is only as good as the data it was trained on. When the real world changes — new customer segments arrive, competitors launch, or external shocks hit — the model's assumptions break and its performance degrades. This is model drift, and it's one of the most common (and costly) problems in production analytics.
This lab lets you see drift happen, diagnose its causes, and practise the intervention cycle that every analyst must master.
🎯 What You'll Learn
- Drift is invisible until it's costly: Performance degrades gradually — by the time someone notices, the damage is already done.
- PSI and KS statistics: Quantitative tools to measure how much each feature has shifted from its training distribution.
- The sawtooth pattern: Real production models follow a cycle of drift → retrain → drift → retrain. You'll build this pattern yourself.
- Alert thresholds: Setting monitoring triggers — too sensitive wastes resources, too lenient misses degradation.
- Data drift vs. performance drift: Feature distributions can shift before accuracy drops. Monitoring data drift gives you early warning.
💡 Why This Matters for Marketing: Churn models, lead scores, recommendation engines, pricing models — every ML system in marketing is vulnerable to drift. Companies like Netflix, Spotify, and Amazon retrain their models on schedules ranging from daily to quarterly. This lab teaches you why they do that and how to decide when.
📐 Key Metrics Explained
Population Stability Index (PSI)
🗣️ In plain English: PSI answers a simple question: "Does today's data still look like the data we trained on?"
Imagine you trained a churn model when average order value ranged from $50–$80 for most customers. PSI checks whether that's still true. If a wave of budget shoppers shifts the range to $30–$55, PSI goes up — even if the model's accuracy hasn't dropped yet.
It works by dividing a feature into bins (like a histogram) and comparing how many customers fall in each bin now versus during training. The more the bins mismatch, the higher the PSI.
Formally, PSI is computed as:

\[
\text{PSI} = \sum_{i} \left( p_i^{\text{mon}} - p_i^{\text{ref}} \right) \ln \frac{p_i^{\text{mon}}}{p_i^{\text{ref}}}
\]

where \(p_i^{\text{ref}}\) and \(p_i^{\text{mon}}\) are the proportions in bin \(i\) for the reference and monitoring distributions.
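The formula can be checked with a minimal Python sketch (a hypothetical `psi` helper, not the lab's internal implementation; bin edges come from the reference sample, and a small epsilon keeps the log finite when a bin is empty):

```python
import numpy as np

def psi(reference, monitoring, bins=10, eps=1e-6):
    """Population Stability Index between two samples of one feature.

    Bin edges are derived from the reference sample; monitoring values
    outside that range are clipped into the outermost bins.
    """
    edges = np.histogram_bin_edges(reference, bins=bins)
    clipped = np.clip(monitoring, edges[0], edges[-1])
    p_ref = np.histogram(reference, bins=edges)[0] / len(reference) + eps
    p_mon = np.histogram(clipped, bins=edges)[0] / len(clipped) + eps
    return float(np.sum((p_mon - p_ref) * np.log(p_mon / p_ref)))

# Example: an order-value feature drifts toward budget shoppers
rng = np.random.default_rng(0)
training = rng.normal(65, 8, 5000)   # training-period order values
stable = rng.normal(65, 8, 5000)     # same population, later month
shifted = rng.normal(45, 8, 5000)    # budget-shopper wave arrives
```

With these synthetic samples, `psi(training, stable)` falls in the "no drift" band (< 0.10), while `psi(training, shifted)` lands well above 0.25, matching the interpretation table below.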
| PSI Range | Interpretation | What It Feels Like | Action |
|---|---|---|---|
| < 0.10 | No significant drift | Data still looks like training — histograms mostly overlap | Continue monitoring |
| 0.10 – 0.25 | Moderate drift | Something changed — one or more features have a noticeably different shape | Investigate which features shifted and why |
| > 0.25 | Significant drift | Today's customers look like a different population than what the model trained on | Retrain — the model is making decisions based on outdated patterns |
Kolmogorov-Smirnov (KS) Statistic
🗣️ In plain English: KS finds the single point where two distributions disagree the most. Think of it as: "What's the biggest gap between the cumulative 'training' curve and the cumulative 'today' curve?"
If KS is high, there's at least one region of the feature's range where today's data is strikingly different. For example, maybe the top quartile of spenders disappeared — KS would capture that even if the average barely moved.
Formally:

\[
D = \sup_x \left| F_{\text{ref}}(x) - F_{\text{mon}}(x) \right|
\]

where \(F_{\text{ref}}\) and \(F_{\text{mon}}\) are the empirical cumulative distribution functions of the reference and monitoring samples. Values above 0.1 suggest meaningful distributional shift. Unlike PSI (which summarises overall divergence across all bins), KS is sensitive to the location of maximum divergence — useful for detecting concentrated shifts that PSI might average out.
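A sketch under the same assumptions as the PSI example (a hypothetical helper on synthetic data): the two-sample KS statistic is simply the largest vertical gap between the two empirical CDFs.

```python
import numpy as np

def ks_stat(reference, monitoring):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum vertical
    gap between the empirical CDFs of the two samples."""
    ref = np.sort(np.asarray(reference, dtype=float))
    mon = np.sort(np.asarray(monitoring, dtype=float))
    grid = np.concatenate([ref, mon])   # evaluate at every observed value
    cdf_ref = np.searchsorted(ref, grid, side="right") / len(ref)
    cdf_mon = np.searchsorted(mon, grid, side="right") / len(mon)
    return float(np.max(np.abs(cdf_ref - cdf_mon)))

rng = np.random.default_rng(1)
baseline = rng.normal(0, 1, 4000)
same = rng.normal(0, 1, 4000)
# The "top quartile of spenders disappeared" case from above:
top_quartile_gone = baseline[baseline < np.quantile(baseline, 0.75)]
```

Dropping the top quartile pushes `ks_stat(baseline, top_quartile_gone)` well above 0.1 even though most of the distribution is untouched, which is exactly the concentrated-shift case described above.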
F1 Score — Why Accuracy Alone Can Lie
🚨 The Accuracy Trap: Imagine your click-through-rate (CTR) model predicts "no click" for every single ad impression. If only 12% of impressions actually get clicked, that lazy model scores 88% accuracy — it looks great! But it found zero actual clicks. It's completely useless.
This is the accuracy paradox: with imbalanced outcomes (which is almost every marketing problem — few people churn, few ads get clicked, few leads convert), accuracy rewards a model for predicting the majority class and ignoring the thing you actually care about.
F1 Score solves this by combining two questions:
- Precision — Of the people the model flagged, how many actually did the thing? ("When it says 'click', is it right?")
- Recall — Of the people who actually did the thing, how many did the model catch? ("Did it find all the real clicks?")
F1 is the harmonic mean of precision and recall — it only scores high when both are high. A model that cheats by predicting all-negative gets F1 = 0 (recall is zero), even though accuracy is 88%.
🎯 Why F1 is the hero metric in this lab: As drift pushes a model's predictions away from reality, the first thing to collapse is its ability to find the minority class (clicks, churners, converters). F1 captures that collapse immediately. Accuracy might stay flat — or even improve — while F1 craters. When you see those two lines diverge on the chart below, that's drift doing its damage.
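The accuracy trap is easy to reproduce in a few lines of plain Python, using the 12%-click example from above with a lazy all-negative model:

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 for binary labels (1 = positive class)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1

# 12 of 100 impressions are clicks; the lazy model predicts "no click" always
y_true = [1] * 12 + [0] * 88
y_pred = [0] * 100
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```

Accuracy comes out at 0.88 while recall and F1 are both 0.0 — the divergence the chart below makes visible.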
UNDERSTANDING DRIFT — WHY MODELS BREAK
Before exploring the scenarios, let's build intuition for what model drift is, why it happens, and — critically — how to see it coming before accuracy drops.
Data Drift (Covariate Shift)
The input features change distribution — customer ages skew older, ad spend distributions shift, seasonal patterns break. The model was trained on a world that no longer matches reality.
Concept Drift (Posterior Shift)
The relationship between inputs and outcomes changes — the same customer profile that used to churn now stays (or vice versa). Features look stable, but the mapping from features → target is broken.
📝 This lab simulates data drift (covariate shift). Concept drift is shown here for contrast — understanding the difference is part of the learning.
🔑 The Key Insight: Leading vs. Lagging Indicators
As you explore the scenarios below, watch for this gap. You'll see the feature drift chart (PSI bars) light up months before the F1 line starts falling. That gap is precisely why we monitor data drift, not just model performance. And watch what happens to accuracy vs. F1 — sometimes accuracy stays flat while F1 craters. That's the accuracy paradox.
MARKETING SCENARIOS
Select a scenario above to load a marketing case study and explore model drift interactively.
👆 Select a marketing scenario above to begin exploring model drift.
YOUR MODEL WAS GOOD ONCE
Meet your model in its prime. These are the performance metrics from the first few months after deployment — when training data and real-world data still matched. Take note of where you're starting. Everything that follows is downhill.
Model Health
Baseline Performance
✅ All systems go. The model was trained on data that accurately represents the current customer population. Feature distributions match training data. No drift detected.
WATCH IT DRIFT
Use the timeline scrubber to advance through time. Watch the model's health degrade as real-world data diverges from what the model was trained on. Hit Play for an animated view, or drag the slider manually.
👀 What to watch for: Start by dragging slowly. The amber F1 line is revealed month by month — you can't see the future, just like in production. The PSI bars (feature drift) will light up before the F1 line drops — that's the leading vs. lagging gap. The Timeline Events log below will narrate key moments. Check the gauge and drift report for accuracy comparison.
- Events will appear here as you advance the timeline…
Performance Over Time ⏱ LAGGING INDICATOR
The bold amber line is your model's F1 Score — the metric that catches real degradation (read "Why Not Just Accuracy?" above). Reference lines: red dotted = never retrain, green dotted = retrain monthly. ⭐ stars mark your retrains.
Model Health
Drift Report
Feature Drift (PSI by Feature) 🚨 LEADING INDICATOR
Each bar shows how much that feature's distribution has shifted from the reference period (first 3 months). Longer bars = more drift. This chart will show movement before accuracy drops. Legend: Healthy / Warning / Degraded.
Distribution Shift — Top Feature 🚨 LEADING INDICATOR
Grey = reference distribution (training period). Blue = current month's distribution. Watch them separate as drift progresses — this visual separation happens before accuracy drops. The PSI & KS values in the corner quantify the shift.
DIAGNOSE THE SHIFT
Switch to comparison mode. Pick a reference window and a monitoring window to see exactly how each feature's distribution has changed — and by how much. This is the deep-dive diagnosis that tells you why the model broke.
💡 When to use this: Once you've seen the gauge turn yellow in Step 2, come here to investigate. Set the reference window to an early month (before drift) and the monitoring window to the month where things went wrong.
👆 Click Compare Windows above to see per-feature distribution overlays.
Drift Breakdown
| Feature | PSI | KS Stat | Status |
|---|---|---|---|
| Compare two windows above to see per-feature drift statistics. | — | — | — |
💡 How to Read This Comparison
- Overlapping histograms: Grey (reference) and blue (monitoring) should largely overlap if no drift occurred. Separation = drift.
- PSI interpretation: < 0.10 = stable, 0.10–0.25 = moderate shift (investigate), > 0.25 = significant shift (retrain).
- KS statistic: Captures the maximum point of divergence. Useful for detecting shifts that PSI's bin-based approach might average out.
- Which features drift first? Often a leading indicator — one feature drifts before performance drops visibly. That's your early warning signal.
Feature Means Over Time 📈 COVARIATE SHIFT TIMELINE
Each line tracks one feature's monthly average as a % change from its training-period mean. The dashed line at 0% is where training data sat. As lines pull away from zero, the model is seeing inputs it wasn't trained on — that's covariate shift in action.
💡 How to Read This Chart
- 0% line = training normal. Each feature's mean during months 1–3 is the baseline (0%). Positive means the feature is higher than training; negative means lower.
- Lines separating from 0% = covariate shift. The inputs are "walking away" from what the model learned. This is data drift — the model's parameters haven't changed, but the world has.
- Different features drift at different rates. Look for which feature diverges first — that's often the leading indicator for the scenario.
- After retraining, a new model would recalibrate to the current means — but the lines here show the original model's perspective.
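The chart's percent-change calculation can be sketched with pandas on hypothetical monthly means (feature names and values are invented for illustration; months 1–3 serve as the 0% baseline, as described above):

```python
import pandas as pd

# Hypothetical monthly feature means (invented numbers)
monthly = pd.DataFrame(
    {
        "avg_order_value": [65.0, 64.0, 66.0, 60.0, 52.0, 45.0],
        "sessions_per_week": [3.0, 3.1, 2.9, 3.0, 3.1, 3.0],
    },
    index=pd.RangeIndex(1, 7, name="month"),
)

# Months 1-3 are the training period: their mean defines the 0% baseline
baseline = monthly.loc[1:3].mean()
pct_from_baseline = (monthly - baseline) / baseline * 100
```

In this toy data, `avg_order_value` walks away from its baseline (roughly −31% by month 6) while `sessions_per_week` stays near 0%: the pattern of one feature drifting first.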
RETRAIN AND RECOVER
You've seen the drift and diagnosed which features shifted. Now it's time to act. Click Retrain Model (above in Step 2) to simulate retraining on current data. Performance will jump back — but drift will resume. Keep scrubbing forward and retrain again. Watch the sawtooth pattern emerge.
The Sawtooth Pattern
In production, every prediction model follows this lifecycle:
Deploy → Drift → Retrain → Deploy → Drift → Retrain … This is normal. The question isn't whether to retrain, but when the cost of drift exceeds the cost of retraining. That's what monitoring thresholds decide.
⚠️ Key insight: Retraining doesn't always restore original accuracy. If the world has fundamentally changed (regime shift), the model might reach a new, lower ceiling. That's a signal that you may need new features, not just fresh training data.
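A toy simulation makes the sawtooth, and the "new, lower ceiling" after a regime shift, concrete. All numbers are invented, and linear decay is a stand-in for the lab's actual drift dynamics:

```python
def simulate(months=24, retrain_at=(), base_f1=0.78,
             drift_per_month=0.015, regime_shift_month=None,
             ceiling_after_shift=0.70):
    """Toy sawtooth: F1 decays linearly as drift accumulates; retraining
    resets it to the current ceiling. A regime shift lowers the ceiling,
    so fresh data alone cannot recover the original performance."""
    ceiling = base_f1
    f1 = base_f1
    history = []
    for m in range(1, months + 1):
        if regime_shift_month is not None and m == regime_shift_month:
            ceiling = ceiling_after_shift   # the world fundamentally changed
        if m in retrain_at:
            f1 = ceiling                    # retrain: reset to current ceiling
        else:
            f1 -= drift_per_month           # drift erodes performance
        history.append(round(f1, 3))
    return history

no_shift = simulate(12, retrain_at={7})
with_shift = simulate(12, retrain_at={7}, regime_shift_month=6)
```

In `no_shift`, the month-7 retrain jumps F1 back to 0.78; in `with_shift`, the same retrain only reaches the new 0.70 ceiling — the signal that new features, not just fresh data, may be needed.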
🛑 Wait — Why Not Just Retrain Every Month?
The green "retrained monthly" line above always looks best. So why not just automate retraining every cycle? In the real world, retraining is not free:
- Direct cost: Compute, engineering hours, QA testing, model validation. For a large recommender system, a single retrain can cost $10,000–$50,000+ in cloud compute alone.
- Contamination risk: If you retrain on bad data (a logging bug, bot traffic, a biased sample), you bake the problem into the model. The "always retrain" strategy is only safe if your data pipeline is perfect — and it never is.
- Regulatory and governance: In finance, healthcare, and credit scoring, every model change requires documentation, audit trails, bias testing, and sometimes committee approval. You can't just silently swap models.
- Stakeholder trust: Business teams need to explain model decisions. A model that changes monthly is harder to interpret and debug. "The model says differently now" erodes confidence.
- Diminishing returns: If drift is minor (PSI < 0.1), the performance gain from retraining may be negligible — you spend $20K to recover 0.3% of F1. Not worth it.
The real skill is knowing when drift is bad enough to justify the cost and risk of retraining. That's why the leading indicators in Step 3 matter — they give you the evidence to make that call at the right moment, not too early and not too late.
📊 Retrain Results
Retrain the model (Step 2) to see before/after performance comparisons here.
SET YOUR ALERT THRESHOLD
How sensitive should your drift monitoring be? A low PSI threshold catches problems early but triggers frequent (potentially unnecessary) retrains. A high threshold saves resources but risks longer periods of degraded performance. Find the sweet spot.
Adjust the alert threshold above and scrub through the full timeline to see how different sensitivity levels perform. The goal: catch drift before it costs you — without crying wolf.
💡 The Monitoring Tradeoff
- Too sensitive (PSI < 0.1): Alerts fire on normal variation. You waste time and compute resources retraining models that were still performing fine.
- Too lenient (PSI > 0.25): You miss gradual degradation until accuracy is already in the red zone. Months of suboptimal predictions go undetected.
- The sweet spot: Most production teams start with PSI = 0.10 and adjust based on the cost of false alarms vs. missed degradation for their specific use case.
- Real-world practice: Companies often combine PSI thresholds with performance monitoring — alert on either significant PSI or a direct accuracy drop.
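The combined rule in the last bullet can be sketched as a small decision function. The name `should_alert` and the default thresholds are illustrative (drawn from the PSI bands discussed above), not a standard API:

```python
def should_alert(psi_by_feature, f1_now, f1_baseline,
                 psi_threshold=0.10, f1_drop_threshold=0.05):
    """Fire an alert on EITHER significant feature drift (leading
    indicator) OR a direct performance drop (lagging indicator).

    Returns (alert_fired, dict of features whose PSI crossed the threshold).
    """
    drifted = {f: v for f, v in psi_by_feature.items() if v >= psi_threshold}
    perf_drop = (f1_baseline - f1_now) >= f1_drop_threshold
    return bool(drifted) or perf_drop, drifted

# Drifted feature but F1 still healthy: the leading indicator fires first
alert, drifted = should_alert(
    {"avg_order_value": 0.18, "sessions_per_week": 0.03},
    f1_now=0.71, f1_baseline=0.72)
```

Here the alert fires on `avg_order_value` alone, months before the F1 drop would trigger the lagging check — the early warning Step 3 is designed to surface.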
EXPLORE FURTHER
🤔 Work Through These
- Speed of Drift: Compare the BrightCart (gradual) and MediaMax (regime change) scenarios. How many months does it take each model to cross from green to red? What does the speed difference tell you about the type of drift?
- Leading Indicators: In the FreshBrew scenario, which feature shows PSI drift before accuracy visibly drops? Could monitoring that single feature give you earlier warning than tracking accuracy alone?
- Retrain Ceiling: In the AdVantage scenario, retrain the model at month 12. Does accuracy return to 82%? If not, what does the new ceiling tell you about the post-privacy-change data environment?
- Alert Threshold Tuning: Set the alert to PSI = 0.25 (lenient) and run through the BrightCart timeline. How many months of degraded performance occur before the alert fires? Now try PSI = 0.08. How many false alarms do you get?
- Cost of Drift: If a 1% drop in accuracy costs your company $50K/month in misallocated marketing spend, estimate the cumulative cost of not retraining in the MediaMax scenario between months 4 and 12.
- Beyond Retraining: When retraining can't restore original accuracy, what other interventions might help? Think about new features, different model architectures, or changes to the business process itself.
📚 Connecting to Broader Concepts
📊 Data Drift vs. Concept Drift
Data drift = input features shift (what this lab primarily shows). Concept drift = the relationship between inputs and outcomes changes. Both can degrade models, but concept drift is harder to detect because features may look stable while the target relationship shifts.
🔄 MLOps & Model Monitoring
In production, drift monitoring is part of MLOps — the practice of deploying, monitoring, and maintaining ML models. Tools like Evidently AI, WhyLabs, and AWS SageMaker Monitor automate what you've done manually in this lab.
🧪 A/B Testing Retrained Models
Never deploy a retrained model blindly. Use A/B testing to compare the old model vs. the retrained model on live traffic. If the retrained model doesn't improve key business metrics, the drift might need a different intervention.
🛡️ Feature Engineering for Stability
Some features are inherently more stable than others. Ratios, ranks, and normalised scores tend to drift less than raw values. Designing drift-resistant features upfront reduces future monitoring burden.