Model Drift Lab
Watch a prediction model degrade in real time as customer data shifts beneath it. Diagnose which features drifted, retrain to recover, and learn when monitoring alerts should fire — the complete model lifecycle in one interactive lab.
OVERVIEW & LEARNING OBJECTIVES
A prediction model is only as good as the data it was trained on. When the real world changes — new customer segments arrive, competitors launch, or external shocks hit — the model's assumptions break and its performance degrades. This is model drift, and it's one of the most common (and costly) problems in production analytics.
This lab lets you see drift happen, diagnose its causes, and practise the intervention cycle that every analyst must master.
🎯 What You'll Learn
- Drift is invisible until it's costly: Performance degrades gradually — by the time someone notices, the damage is already done.
- PSI and KS statistics: Quantitative tools to measure how much each feature has shifted from its training distribution.
- The sawtooth pattern: Real production models follow a cycle of drift → retrain → drift → retrain. You'll build this pattern yourself.
- Alert thresholds: Setting monitoring triggers — too sensitive wastes resources, too lenient misses degradation.
- Data drift vs. performance drift: Feature distributions can shift before accuracy drops. Monitoring data drift gives you early warning.
💡 Why This Matters for Marketing: Churn models, lead scores, recommendation engines, pricing models — every ML system in marketing is vulnerable to drift. Companies like Netflix, Spotify, and Amazon retrain their models on schedules ranging from daily to quarterly. This lab teaches you why they do that and how to decide when.
📐 Key Metrics Explained
Population Stability Index (PSI)
🗣️ In plain English: PSI answers a simple question: "Does today's data still look like the data we trained on?"
Imagine you trained a churn model when average order value ranged from $50–$80 for most customers. PSI checks whether that's still true. If a wave of budget shoppers shifts the range to $30–$55, PSI goes up — even if the model's accuracy hasn't dropped yet.
It works by dividing a feature into bins (like a histogram) and comparing how many customers fall in each bin now versus during training. The more the bins mismatch, the higher the PSI.
Formally, PSI is computed as:

\[
\text{PSI} = \sum_{i} \left( p_i^{\text{mon}} - p_i^{\text{ref}} \right) \ln \frac{p_i^{\text{mon}}}{p_i^{\text{ref}}}
\]

where \(p_i^{\text{ref}}\) and \(p_i^{\text{mon}}\) are the proportions in bin \(i\) for the reference and monitoring distributions.
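The formula can be checked with a minimal Python sketch (a hypothetical `psi` helper, not the lab's internal implementation; bin edges come from the reference sample, and a small epsilon keeps the log finite when a bin is empty):

```python
import numpy as np

def psi(reference, monitoring, bins=10, eps=1e-6):
    """Population Stability Index between two samples of one feature.

    Bin edges are derived from the reference sample; monitoring values
    outside that range are clipped into the outermost bins.
    """
    edges = np.histogram_bin_edges(reference, bins=bins)
    clipped = np.clip(monitoring, edges[0], edges[-1])
    p_ref = np.histogram(reference, bins=edges)[0] / len(reference) + eps
    p_mon = np.histogram(clipped, bins=edges)[0] / len(clipped) + eps
    return float(np.sum((p_mon - p_ref) * np.log(p_mon / p_ref)))

# Example: an order-value feature drifts toward budget shoppers
rng = np.random.default_rng(0)
training = rng.normal(65, 8, 5000)   # training-period order values
stable = rng.normal(65, 8, 5000)     # same population, later month
shifted = rng.normal(45, 8, 5000)    # budget-shopper wave arrives
```

With these synthetic samples, `psi(training, stable)` falls in the "no drift" band (< 0.10), while `psi(training, shifted)` lands well above 0.25, matching the interpretation table below.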
| PSI Range | Interpretation | What It Feels Like | Action |
|---|---|---|---|
| < 0.10 | No significant drift | Data still looks like training — histograms mostly overlap | Continue monitoring |
| 0.10 – 0.25 | Moderate drift | Something changed — one or more features have a noticeably different shape | Investigate which features shifted and why |
| > 0.25 | Significant drift | Today's customers look like a different population than what the model trained on | Retrain — the model is making decisions based on outdated patterns |
Kolmogorov-Smirnov (KS) Statistic
🗣️ In plain English: KS finds the single point where two distributions disagree the most. Think of it as: "What's the biggest gap between the cumulative 'training' curve and the cumulative 'today' curve?"
If KS is high, there's at least one region of the feature's range where today's data is strikingly different. For example, maybe the top quartile of spenders disappeared — KS would capture that even if the average barely moved.
Formally:

\[
D = \sup_x \left| F_{\text{ref}}(x) - F_{\text{mon}}(x) \right|
\]

where \(F_{\text{ref}}\) and \(F_{\text{mon}}\) are the empirical cumulative distribution functions of the reference and monitoring samples. Values above 0.1 suggest meaningful distributional shift. Unlike PSI (which summarises overall divergence across all bins), KS is sensitive to the location of maximum divergence — useful for detecting concentrated shifts that PSI might average out.
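A sketch under the same assumptions as the PSI example (a hypothetical helper on synthetic data): the two-sample KS statistic is simply the largest vertical gap between the two empirical CDFs.

```python
import numpy as np

def ks_stat(reference, monitoring):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum vertical
    gap between the empirical CDFs of the two samples."""
    ref = np.sort(np.asarray(reference, dtype=float))
    mon = np.sort(np.asarray(monitoring, dtype=float))
    grid = np.concatenate([ref, mon])   # evaluate at every observed value
    cdf_ref = np.searchsorted(ref, grid, side="right") / len(ref)
    cdf_mon = np.searchsorted(mon, grid, side="right") / len(mon)
    return float(np.max(np.abs(cdf_ref - cdf_mon)))

rng = np.random.default_rng(1)
baseline = rng.normal(0, 1, 4000)
same = rng.normal(0, 1, 4000)
# The "top quartile of spenders disappeared" case from above:
top_quartile_gone = baseline[baseline < np.quantile(baseline, 0.75)]
```

Dropping the top quartile pushes `ks_stat(baseline, top_quartile_gone)` well above 0.1 even though most of the distribution is untouched, which is exactly the concentrated-shift case described above.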
F1 Score — Why Accuracy Alone Can Lie
🚨 The Accuracy Trap: Imagine your click-through-rate (CTR) model predicts "no click" for every single ad impression. If only 12% of impressions actually get clicked, that lazy model scores 88% accuracy — it looks great! But it found zero actual clicks. It's completely useless.
This is the accuracy paradox: with imbalanced outcomes (which is almost every marketing problem — few people churn, few ads get clicked, few leads convert), accuracy rewards a model for predicting the majority class and ignoring the thing you actually care about.
F1 Score solves this by combining two questions:
- Precision — Of the people the model flagged, how many actually did the thing? ("When it says 'click', is it right?")
- Recall — Of the people who actually did the thing, how many did the model catch? ("Did it find all the real clicks?")
F1 is the harmonic mean of precision and recall — it only scores high when both are high. A model that cheats by predicting all-negative gets F1 = 0 (recall is zero), even though accuracy is 88%.
🎯 Why F1 is the hero metric in this lab: As drift pushes a model's predictions away from reality, the first thing to collapse is its ability to find the minority class (clicks, churners, converters). F1 captures that collapse immediately. Accuracy might stay flat — or even improve — while F1 craters. When you see those two lines diverge on the chart below, that's drift doing its damage.
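The accuracy trap is easy to reproduce in a few lines of plain Python, using the 12%-click example from above with a lazy all-negative model:

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 for binary labels (1 = positive class)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1

# 12 of 100 impressions are clicks; the lazy model predicts "no click" always
y_true = [1] * 12 + [0] * 88
y_pred = [0] * 100
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```

Accuracy comes out at 0.88 while recall and F1 are both 0.0 — the divergence the chart below makes visible.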
UNDERSTANDING DRIFT — WHY MODELS BREAK
Before exploring the scenarios, let's build intuition for what model drift is, why it happens, and — critically — how to see it coming before accuracy drops.
Data Drift (Covariate Shift)
The input features change distribution — customer ages skew older, ad spend distributions shift, seasonal patterns break. The model was trained on a world that no longer matches reality.
Concept Drift (Posterior Shift)
The relationship between inputs and outcomes changes — the same customer profile that used to churn now stays (or vice versa). Features look stable, but the mapping from features → target is broken.
📝 This lab simulates data drift (covariate shift). Concept drift is shown here for contrast — understanding the difference is part of the learning.
🔑 The Key Insight: Leading vs. Lagging Indicators
As you explore the scenarios below, watch for this gap. You'll see the feature drift chart (PSI bars) light up months before the F1 line starts falling. That gap is precisely why we monitor data drift, not just model performance. And watch what happens to accuracy vs. F1 — sometimes accuracy stays flat while F1 craters. That's the accuracy paradox.
MARKETING SCENARIOS
Select a scenario above to load a marketing case study and explore model drift interactively.
👆 Select a marketing scenario above to begin exploring model drift.
YOUR MODEL WAS GOOD ONCE
Meet your model in its prime. These are the performance metrics from the first few months after deployment — when training data and real-world data still matched. Take note of where you're starting. Everything that follows is downhill.
Model Health
Baseline Performance
✅ All systems go. The model was trained on data that accurately represents the current customer population. Feature distributions match training data. No drift detected.
WATCH IT DRIFT
Use the timeline scrubber to advance through time. Watch the model's health degrade as real-world data diverges from what the model was trained on. Hit Play for an animated view, or drag the slider manually.
👀 What to watch for: Start by dragging slowly. The amber F1 line is revealed month by month — you can't see the future, just like in production. The PSI bars (feature drift) will light up before the F1 line drops — that's the leading vs. lagging gap. The Timeline Events log below will narrate key moments. Check the gauge and drift report for accuracy comparison.
- Events will appear here as you advance the timeline…
Performance Over Time ⏱ LAGGING INDICATOR
The bold amber line is your model's F1 Score — the metric that catches real degradation (read "Why Not Just Accuracy?" above). Reference lines: red dotted = never retrain, green dotted = retrain monthly. ⭐ stars mark your retrains.
Model Health
Drift Report
Feature Drift (PSI by Feature) 🚨 LEADING INDICATOR
Each bar shows how much that feature's distribution has shifted from the reference period (first 3 months). Longer bars = more drift. This chart will show movement before accuracy drops. Legend: Healthy / Warning / Degraded.
Distribution Shift — Top Feature 🚨 LEADING INDICATOR
Grey = reference distribution (training period). Blue = current month's distribution. Watch them separate as drift progresses — this visual separation happens before accuracy drops. The PSI & KS values in the corner quantify the shift.
DIAGNOSE THE SHIFT
Switch to comparison mode. Pick a reference window and a monitoring window to see exactly how each feature's distribution has changed — and by how much. This is the deep-dive diagnosis that tells you why the model broke.
💡 When to use this: Once you've seen the gauge turn yellow in Step 2, come here to investigate. Set the reference window to an early month (before drift) and the monitoring window to the month where things went wrong.
👆 Click Compare Windows above to see per-feature distribution overlays.
Drift Breakdown
| Feature | PSI | KS Stat | Status |
|---|---|---|---|
| Compare two windows above to see per-feature drift statistics. | — | — | — |
💡 How to Read This Comparison
- Overlapping histograms: Grey (reference) and blue (monitoring) should largely overlap if no drift occurred. Separation = drift.
- PSI interpretation: < 0.10 = stable, 0.10–0.25 = moderate shift (investigate), > 0.25 = significant shift (retrain).
- KS statistic: Captures the maximum point of divergence. Useful for detecting shifts that PSI's bin-based approach might average out.
- Which features drift first? Often a leading indicator — one feature drifts before performance drops visibly. That's your early warning signal.
Feature Means Over Time 📈 COVARIATE SHIFT TIMELINE
Each line tracks one feature's monthly average as a % change from its training-period mean. The dashed line at 0% is where training data sat. As lines pull away from zero, the model is seeing inputs it wasn't trained on — that's covariate shift in action.
💡 How to Read This Chart
- 0% line = training normal. Each feature's mean during months 1–3 is the baseline (0%). Positive means the feature is higher than training; negative means lower.
- Lines separating from 0% = covariate shift. The inputs are "walking away" from what the model learned. This is data drift — the model's parameters haven't changed, but the world has.
- Different features drift at different rates. Look for which feature diverges first — that's often the leading indicator for the scenario.
- After retraining, a new model would recalibrate to the current means — but the lines here show the original model's perspective.
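The chart's percent-change calculation can be sketched with pandas on hypothetical monthly means (feature names and values are invented for illustration; months 1–3 serve as the 0% baseline, as described above):

```python
import pandas as pd

# Hypothetical monthly feature means (invented numbers)
monthly = pd.DataFrame(
    {
        "avg_order_value": [65.0, 64.0, 66.0, 60.0, 52.0, 45.0],
        "sessions_per_week": [3.0, 3.1, 2.9, 3.0, 3.1, 3.0],
    },
    index=pd.RangeIndex(1, 7, name="month"),
)

# Months 1-3 are the training period: their mean defines the 0% baseline
baseline = monthly.loc[1:3].mean()
pct_from_baseline = (monthly - baseline) / baseline * 100
```

In this toy data, `avg_order_value` walks away from its baseline (roughly −31% by month 6) while `sessions_per_week` stays near 0%: the pattern of one feature drifting first.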
RETRAIN AND RECOVER
You've seen the drift and diagnosed which features shifted. Now it's time to act. Click Retrain Model (above in Step 2) to simulate retraining on current data. Performance will jump back — but drift will resume. Keep scrubbing forward and retrain again. Watch the sawtooth pattern emerge.
The Sawtooth Pattern
In production, every prediction model follows this lifecycle:
Deploy → Drift → Retrain → Deploy → Drift → Retrain … This is normal. The question isn't whether to retrain, but when the cost of drift exceeds the cost of retraining. That's what monitoring thresholds decide.
⚠️ Key insight: Retraining doesn't always restore original accuracy. If the world has fundamentally changed (regime shift), the model might reach a new, lower ceiling. That's a signal that you may need new features, not just fresh training data.
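A toy simulation makes the sawtooth, and the "new, lower ceiling" after a regime shift, concrete. All numbers are invented, and linear decay is a stand-in for the lab's actual drift dynamics:

```python
def simulate(months=24, retrain_at=(), base_f1=0.78,
             drift_per_month=0.015, regime_shift_month=None,
             ceiling_after_shift=0.70):
    """Toy sawtooth: F1 decays linearly as drift accumulates; retraining
    resets it to the current ceiling. A regime shift lowers the ceiling,
    so fresh data alone cannot recover the original performance."""
    ceiling = base_f1
    f1 = base_f1
    history = []
    for m in range(1, months + 1):
        if regime_shift_month is not None and m == regime_shift_month:
            ceiling = ceiling_after_shift   # the world fundamentally changed
        if m in retrain_at:
            f1 = ceiling                    # retrain: reset to current ceiling
        else:
            f1 -= drift_per_month           # drift erodes performance
        history.append(round(f1, 3))
    return history

no_shift = simulate(12, retrain_at={7})
with_shift = simulate(12, retrain_at={7}, regime_shift_month=6)
```

In `no_shift`, the month-7 retrain jumps F1 back to 0.78; in `with_shift`, the same retrain only reaches the new 0.70 ceiling — the signal that new features, not just fresh data, may be needed.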
🛑 Wait — Why Not Just Retrain Every Month?
The green "retrained monthly" line above always looks best. So why not just automate retraining every cycle? In the real world, retraining is not free:
- Direct cost: Compute, engineering hours, QA testing, model validation. For a large recommender system, a single retrain can cost $10,000–$50,000+ in cloud compute alone.
- Contamination risk: If you retrain on bad data (a logging bug, bot traffic, a biased sample), you bake the problem into the model. The "always retrain" strategy is only safe if your data pipeline is perfect — and it never is.
- Regulatory and governance: In finance, healthcare, and credit scoring, every model change requires documentation, audit trails, bias testing, and sometimes committee approval. You can't just silently swap models.
- Stakeholder trust: Business teams need to explain model decisions. A model that changes monthly is harder to interpret and debug. "The model says differently now" erodes confidence.
- Diminishing returns: If drift is minor (PSI < 0.1), the performance gain from retraining may be negligible — you spend $20K to recover 0.3% of F1. Not worth it.
The real skill is knowing when drift is bad enough to justify the cost and risk of retraining. That's why the leading indicators in Step 3 matter — they give you the evidence to make that call at the right moment, not too early and not too late.
📊 Retrain Results
Retrain the model (Step 2) to see before/after performance comparisons here.
SET YOUR ALERT THRESHOLD
How sensitive should your drift monitoring be? A low PSI threshold catches problems early but triggers frequent (potentially unnecessary) retrains. A high threshold saves resources but risks longer periods of degraded performance. Find the sweet spot.
Adjust the alert threshold above and scrub through the full timeline to see how different sensitivity levels perform. The goal: catch drift before it costs you — without crying wolf.
💡 The Monitoring Tradeoff
- Too sensitive (PSI < 0.1): Alerts fire on normal variation. You waste time and compute resources retraining models that were still performing fine.
- Too lenient (PSI > 0.25): You miss gradual degradation until accuracy is already in the red zone. Months of suboptimal predictions go undetected.
- The sweet spot: Most production teams start with PSI = 0.10 and adjust based on the cost of false alarms vs. missed degradation for their specific use case.
- Real-world practice: Companies often combine PSI thresholds with performance monitoring — alert on either significant PSI or a direct accuracy drop.
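The combined rule in the last bullet can be sketched as a small decision function. The name `should_alert` and the default thresholds are illustrative (drawn from the PSI bands discussed above), not a standard API:

```python
def should_alert(psi_by_feature, f1_now, f1_baseline,
                 psi_threshold=0.10, f1_drop_threshold=0.05):
    """Fire an alert on EITHER significant feature drift (leading
    indicator) OR a direct performance drop (lagging indicator).

    Returns (alert_fired, dict of features whose PSI crossed the threshold).
    """
    drifted = {f: v for f, v in psi_by_feature.items() if v >= psi_threshold}
    perf_drop = (f1_baseline - f1_now) >= f1_drop_threshold
    return bool(drifted) or perf_drop, drifted

# Drifted feature but F1 still healthy: the leading indicator fires first
alert, drifted = should_alert(
    {"avg_order_value": 0.18, "sessions_per_week": 0.03},
    f1_now=0.71, f1_baseline=0.72)
```

Here the alert fires on `avg_order_value` alone, months before the F1 drop would trigger the lagging check — the early warning Step 3 is designed to surface.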
EXPLORE FURTHER
🤔 Work Through These
- Speed of Drift: Compare the BrightCart (gradual) and MediaMax (regime change) scenarios. How many months does it take each model to cross from green to red? What does the speed difference tell you about the type of drift?
- Leading Indicators: In the FreshBrew scenario, which feature shows PSI drift before accuracy visibly drops? Could monitoring that single feature give you earlier warning than tracking accuracy alone?
- Retrain Ceiling: In the AdVantage scenario, retrain the model at month 12. Does accuracy return to 82%? If not, what does the new ceiling tell you about the post-privacy-change data environment?
- Alert Threshold Tuning: Set the alert to PSI = 0.25 (lenient) and run through the BrightCart timeline. How many months of degraded performance occur before the alert fires? Now try PSI = 0.08. How many false alarms do you get?
- Cost of Drift: If a 1% drop in accuracy costs your company $50K/month in misallocated marketing spend, estimate the cumulative cost of not retraining in the MediaMax scenario between months 4 and 12.
- Beyond Retraining: When retraining can't restore original accuracy, what other interventions might help? Think about new features, different model architectures, or changes to the business process itself.
📚 Connecting to Broader Concepts
📊 Data Drift vs. Concept Drift
Data drift = input features shift (what this lab primarily shows). Concept drift = the relationship between inputs and outcomes changes. Both can degrade models, but concept drift is harder to detect because features may look stable while the target relationship shifts.
🔄 MLOps & Model Monitoring
In production, drift monitoring is part of MLOps — the practice of deploying, monitoring, and maintaining ML models. Tools like Evidently AI, WhyLabs, and AWS SageMaker Monitor automate what you've done manually in this lab.
🧪 A/B Testing Retrained Models
Never deploy a retrained model blindly. Use A/B testing to compare the old model vs. the retrained model on live traffic. If the retrained model doesn't improve key business metrics, the drift might need a different intervention.
🛡️ Feature Engineering for Stability
Some features are inherently more stable than others. Ratios, ranks, and normalised scores tend to drift less than raw values. Designing drift-resistant features upfront reduces future monitoring burden.