Model Drift Lab


Watch a prediction model degrade in real time as customer data shifts beneath it. Diagnose which features drifted, retrain to recover, and learn when monitoring alerts should fire — the complete model lifecycle in one interactive lab.

👨‍🏫 Professor Mode: Guided Learning Experience

New to model drift? Enable Professor Mode for step-by-step guidance as you watch data shifts degrade model performance!

OVERVIEW & LEARNING OBJECTIVES

A prediction model is only as good as the data it was trained on. When the real world changes — new customer segments arrive, competitors launch, or external shocks hit — the model's assumptions break and its performance degrades. This is model drift, and it's one of the most common (and costly) problems in production analytics.

This lab lets you see drift happen, diagnose its causes, and practise the intervention cycle that every analyst must master.

🎯 What You'll Learn
  • Drift is invisible until it's costly: Performance degrades gradually — by the time someone notices, the damage is already done.
  • PSI and KS statistics: Quantitative tools to measure how much each feature has shifted from its training distribution.
  • The sawtooth pattern: Real production models follow a cycle of drift → retrain → drift → retrain. You'll build this pattern yourself.
  • Alert thresholds: Setting monitoring triggers — too sensitive wastes resources, too lenient misses degradation.
  • Data drift vs. performance drift: Feature distributions can shift before accuracy drops. Monitoring data drift gives you early warning.

💡 Why This Matters for Marketing: Churn models, lead scores, recommendation engines, pricing models — every ML system in marketing is vulnerable to drift. Companies like Netflix, Spotify, and Amazon retrain their models on schedules ranging from daily to quarterly. This lab teaches you why they do that and how to decide when.

📐 Key Metrics Explained

Population Stability Index (PSI)

🗣️ In plain English: PSI answers a simple question: "Does today's data still look like the data we trained on?"

Imagine you trained a churn model when average order value ranged from $50–$80 for most customers. PSI checks whether that's still true. If a wave of budget shoppers shifts the range to $30–$55, PSI goes up — even if the model's accuracy hasn't dropped yet.

It works by dividing a feature into bins (like a histogram) and comparing how many customers fall in each bin now versus during training. The more the bins mismatch, the higher the PSI.

Formally, PSI is computed as:

$$\text{PSI} = \sum_{i=1}^{B} (p_i^{\text{mon}} - p_i^{\text{ref}}) \cdot \ln\!\left(\frac{p_i^{\text{mon}}}{p_i^{\text{ref}}}\right)$$

Where \(p_i^{\text{ref}}\) and \(p_i^{\text{mon}}\) are the proportions in bin \(i\) for the reference and monitoring distributions.
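The formula above can be sketched in a few lines of NumPy. This is a minimal illustration, not the lab's internal implementation: the bin count, the epsilon used to avoid `log(0)`, and the order-value numbers (borrowed from the $50–$80 example) are all assumptions for demonstration.

```python
import numpy as np

def psi(ref, mon, bins=10):
    """Population Stability Index between a reference sample
    (training data) and a monitoring sample (today's data)."""
    # Bin edges come from the reference distribution; quantile edges
    # keep every reference bin roughly equally populated.
    edges = np.quantile(ref, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf   # catch out-of-range values
    p_ref = np.histogram(ref, edges)[0] / len(ref)
    p_mon = np.histogram(mon, edges)[0] / len(mon)
    # Small epsilon avoids log(0) when a bin is empty.
    eps = 1e-6
    p_ref = np.clip(p_ref, eps, None)
    p_mon = np.clip(p_mon, eps, None)
    return np.sum((p_mon - p_ref) * np.log(p_mon / p_ref))

rng = np.random.default_rng(0)
train = rng.normal(65, 8, 10_000)   # order values roughly $50–$80
same  = rng.normal(65, 8, 10_000)   # no drift: same population
shift = rng.normal(42, 7, 10_000)   # budget shoppers arrive

print(round(psi(train, same), 3))   # near 0: no drift
print(round(psi(train, shift), 3))  # well above 0.25: significant drift
```

Note the design choice: bin edges always come from the *reference* distribution, so the monitoring sample is judged against the training world rather than its own shape.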

| PSI Range | Interpretation | What It Feels Like | Action |
|---|---|---|---|
| < 0.10 | No significant drift | Data still looks like training — histograms mostly overlap | Continue monitoring |
| 0.10 – 0.25 | Moderate drift | Something changed — one or more features have a noticeably different shape | Investigate which features shifted and why |
| > 0.25 | Significant drift | Today's customers look like a different population than what the model trained on | Retrain — the model is making decisions based on outdated patterns |
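The threshold table translates directly into a triage helper. A tiny sketch (the function name and return strings are invented for illustration; the cutoffs are the conventional 0.10/0.25 values from the table):

```python
def psi_action(psi_value: float) -> str:
    """Map a PSI value to the monitoring action from the table above."""
    if psi_value < 0.10:
        return "continue monitoring"
    if psi_value <= 0.25:
        return "investigate which features shifted and why"
    return "retrain"

print(psi_action(0.04))  # continue monitoring
print(psi_action(0.18))  # investigate which features shifted and why
print(psi_action(0.40))  # retrain
```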

Kolmogorov-Smirnov (KS) Statistic

🗣️ In plain English: KS finds the single point where two distributions disagree the most. Think of it as: "What's the biggest gap between the cumulative 'training' curve and the cumulative 'today' curve?"

If KS is high, there's at least one region of the feature's range where today's data is strikingly different. For example, maybe the top quartile of spenders disappeared — KS would capture that even if the average barely moved.

Formally:

$$\text{KS} = \max_x |F_{\text{ref}}(x) - F_{\text{mon}}(x)|$$

Values above 0.1 suggest meaningful distributional shift. Unlike PSI (which summarises overall divergence across all bins), KS is sensitive to the location of maximum divergence — useful for detecting concentrated shifts that PSI might average out.
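The "concentrated shift" point is easy to demonstrate: two samples with the *same mean* but different spread have a large KS gap. A minimal sketch of the empirical-CDF comparison (the sample parameters are invented; in practice `scipy.stats.ks_2samp` does the same job):

```python
import numpy as np

def ks_statistic(ref, mon):
    """Max gap between the two empirical CDFs: the KS statistic."""
    xs = np.sort(np.concatenate([ref, mon]))
    # Empirical CDF of each sample evaluated at every observed point.
    F_ref = np.searchsorted(np.sort(ref), xs, side="right") / len(ref)
    F_mon = np.searchsorted(np.sort(mon), xs, side="right") / len(mon)
    return np.max(np.abs(F_ref - F_mon))

rng = np.random.default_rng(1)
train = rng.normal(65, 8, 10_000)
wider = rng.normal(65, 16, 10_000)  # same mean, twice the spread

print(round(ks_statistic(train, wider), 3))  # > 0.1 despite equal means
```

Here the average barely moves, yet KS flags the shift because the tails of the monitoring sample sit far from the training curve.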

F1 Score — Why Accuracy Alone Can Lie

🚨 The Accuracy Trap: Imagine your CTR model predicts "no click" for every single ad impression. If only 12% of impressions actually get clicked, that lazy model scores 88% accuracy — it looks great! But it found zero actual clicks. It's completely useless.

This is the accuracy paradox: with imbalanced outcomes (which is almost every marketing problem — few people churn, few ads get clicked, few leads convert), accuracy rewards a model for predicting the majority class and ignoring the thing you actually care about.

F1 Score solves this by combining two questions:

  • Precision — Of the people the model flagged, how many actually did the thing? ("When it says 'click', is it right?")
  • Recall — Of the people who actually did the thing, how many did the model catch? ("Did it find all the real clicks?")
$$F_1 = 2 \cdot \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

F1 is the harmonic mean of precision and recall — it only scores high when both are high. A model that cheats by predicting all-negative gets F1 = 0 (recall is zero), even though accuracy is 88%.
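The accuracy trap is easy to verify numerically. A minimal sketch using the 12% click rate from the example above (sample size and seed are arbitrary):

```python
import numpy as np

# 10,000 impressions, ~12% clicked — the class balance from the text.
rng = np.random.default_rng(2)
y_true = (rng.random(10_000) < 0.12).astype(int)

# The "lazy" model: predict no-click for everyone.
y_pred = np.zeros_like(y_true)

accuracy = np.mean(y_true == y_pred)

tp = np.sum((y_pred == 1) & (y_true == 1))
fp = np.sum((y_pred == 1) & (y_true == 0))
fn = np.sum((y_pred == 0) & (y_true == 1))
precision = tp / (tp + fp) if tp + fp else 0.0
recall    = tp / (tp + fn) if tp + fn else 0.0
f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0

print(f"accuracy = {accuracy:.2f}")  # about 0.88: looks great
print(f"F1       = {f1:.2f}")        # 0.00: the model found zero clicks
```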

🎯 Why F1 is the hero metric in this lab: As drift pushes a model's predictions away from reality, the first thing to collapse is its ability to find the minority class (clicks, churners, converters). F1 captures that collapse immediately. Accuracy might stay flat — or even improve — while F1 craters. When you see those two lines diverge on the chart below, that's drift doing its damage.

UNDERSTANDING DRIFT — WHY MODELS BREAK

Before exploring the scenarios, let's build intuition for what model drift is, why it happens, and — critically — how to see it coming before accuracy drops.

📊 Data Drift (Covariate Shift)

The input features change distribution — customer ages skew older, ad spend distributions shift, seasonal patterns break. The model was trained on a world that no longer matches reality.

Example: Your churn model was trained when average customer tenure was 18 months. A viral campaign brings a flood of new customers — average tenure drops to 6 months. The feature distributions shift, even though churn behaviour hasn't fundamentally changed.
🟢 Detectable early: PSI & KS statistics catch this before accuracy drops.
🔄 Concept Drift (Posterior Shift)

The relationship between inputs and outcomes changes — the same customer profile that used to churn now stays (or vice versa). Features look stable, but the mapping from features → target is broken.

Example: A competitor launches a loyalty program. Customers with the same demographics and spending patterns that used to churn now stay — the definition of a "likely churner" changed, not the customer data itself.
🔴 Harder to detect early: PSI won't catch it — it only shows up when accuracy drops.

📝 This lab simulates data drift (covariate shift). Concept drift is shown here for contrast — understanding the difference is part of the learning.

🔑 The Key Insight: Leading vs. Lagging Indicators

[Timeline chart: PSI is already rising in the early months, while F1 only drops much later — the gap between the two is the early warning window.]

  • PSI (Feature Drift): LEADING — catches distribution shifts before they hurt accuracy. This is your early alarm.
  • F1 Score: LAGGING — only drops after drift has been happening for a while. But unlike accuracy, F1 always drops when drift hurts the model's real job.

As you explore the scenarios below, watch for this gap. You'll see the feature drift chart (PSI bars) light up months before the F1 line starts falling. That gap is precisely why we monitor data drift, not just model performance. And watch what happens to accuracy vs. F1 — sometimes accuracy stays flat while F1 craters. That's the accuracy paradox.
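The leading/lagging gap can be reproduced in miniature. Below is a sketch (not the lab's actual simulator) assuming a single feature whose monthly mean drifts downward while the feature-to-click relationship stays fixed — pure covariate shift. The means, decision boundary, and sigmoid slope are all invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)

def sample_month(mean, n=20_000):
    """One month of data: feature x and click labels.
    The x -> click relationship is FIXED (pure covariate shift)."""
    x = rng.normal(mean, 8, n)
    p = 1 / (1 + np.exp(-(x - 70) / 3))     # true click probability
    y = (rng.random(n) < p).astype(int)
    return x, y

def psi(ref, mon, bins=10):
    edges = np.quantile(ref, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    p_ref = np.clip(np.histogram(ref, edges)[0] / len(ref), 1e-6, None)
    p_mon = np.clip(np.histogram(mon, edges)[0] / len(mon), 1e-6, None)
    return np.sum((p_mon - p_ref) * np.log(p_mon / p_ref))

def f1(y_true, y_pred):
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def predict(x):
    """Frozen model: the boundary it learned at training time."""
    return (x > 70).astype(int)

x_ref, y_ref = sample_month(mean=65)        # training month

for month, mean in enumerate([65, 62, 59, 56, 53], start=1):
    x, y = sample_month(mean)
    print(f"month {month}: PSI={psi(x_ref, x):.2f}  F1={f1(y, predict(x)):.2f}")
```

Run it and you'll see PSI climb past the 0.25 retrain threshold within a couple of simulated months, while F1 erodes more gradually — a small-scale version of the early warning window described above.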

MARKETING SCENARIOS

Select a scenario above to load a marketing case study and explore model drift interactively.
