Rare Event Calibration Lab

Model Calibration · Rare Event Modeling · Classification Foundations

You have 10,000 customers. Only 5% will churn. A naive model learns that saying "won't churn" is correct 95% of the time — so it stops identifying churners at all. Oversampling fixes the learning problem. But it quietly breaks the probability scale. This lab shows you both effects — and the one-line fix.

OVERVIEW & LEARNING OBJECTIVES

The Setup: Imagine you're building a churn model. 10,000 customers. 500 will actually churn (5%). The other 9,500 won't. You train a logistic regression on this data and it learns a dark secret: if it just predicts "won't churn" for every single customer, it's correct 95% of the time. Great accuracy. Useless model. It will never alert you to a churner because it learned that ignoring them is the safe bet.
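
The accuracy trap is easy to make concrete; a five-line sketch of a hypothetical "always predict won't-churn" model on the 5% base rate above:

```python
import numpy as np

y = np.zeros(10_000, dtype=int)
y[:500] = 1                      # 500 churners out of 10,000 (5%)

pred = np.zeros_like(y)          # the "always predict won't-churn" model

accuracy = (pred == y).mean()    # 0.95: looks great
recall = pred[y == 1].mean()     # 0.0: catches zero churners
print(accuracy, recall)
```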

The Fix — and its hidden cost: Oversampling artificially boosts the churn cases in training data — say, from 5% to 50% — so the model sees enough examples to actually learn what a churner looks like. This genuinely works: the model's ability to rank high-risk customers improves dramatically. But now the model thinks it lives in a world where 50% of people churn. Its raw probability scores are inflated 5–15× compared to reality. If you use those scores directly for budget decisions, you'll massively overspend.

The Correction: A one-line algebraic formula (King & Zeng, 2001) scales every raw score back to the true population rate. No retraining. No new data. You get the discrimination improvement and trustworthy probabilities.

📋 Step-by-Step: How to Use This Tool
1. Run the simulation (default settings are fine to start)

Click Run Simulation below. Four models will train on the same 10,000-observation dataset — one with no oversampling, and three with progressively aggressive oversampling (20%, 35%, 50% positive rate in training).

2. Look at the AUC chip on each condition card

AUC (0 to 1) measures how well the model rank-orders customers by risk. AUC = 0.50 is a coin flip — useless. AUC = 0.90 means a truly risky customer outranks a safe one 90% of the time. Watch how much AUC rises as oversampling increases. This is the win.

3. Look at the Mean Pred % chip — this is the problem

The true churn rate is 5%. Mean Pred % is what the model thinks the average customer's probability is. For oversampled models this number will be 3–10× higher than 5%. The "Inflation" chip shows the exact multiple. A 4× inflation means every probability is four times too high.

4. Toggle the King-Zeng correction — watch what changes and what doesn't

Enable Show King-Zeng Corrected Probabilities above the condition cards. The Mean Pred % will snap back to ~5%. The mini-charts will shift back toward the diagonal. But AUC will not change at all — the correction rescales probabilities but preserves rank order perfectly.

🎯 What You'll Learn
  • Why "natural" models fail on rare events: The model sees 1 churner per 19 non-churners. It learns that predicting "no" is almost always correct — earning high accuracy but near-random ability to find actual churners. AUC near 0.5 means barely better than a coin flip. Check the Natural Sampling card after running — this is why oversampling exists.
  • What oversampling fixes (and what it doesn't): Oversampling dramatically improves AUC — how well the model rank-orders customers by risk. It does NOT fix probability estimates — in fact, it makes them dangerously wrong.
  • What "calibration" means in plain English: If the model says a customer has a 10% churn probability, are roughly 10% of those customers actually churning? Calibration measures truthfulness of the probability number itself — not just whether it ranks customers in the right order.
  • When the probability number matters: Rank-ordering only? (Top 20% of customers for a calling campaign) → AUC is all you need. Dollar-value calculations? (Bid = p × value, CLV = p × margin × tenure) → You need calibration too. A 5× inflated probability → 5× miscalculated bids.
  • The King-Zeng correction: One formula. No retraining. Rescales every raw score back to the true population rate by accounting for the ratio between training positive rate and true positive rate.
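
The AUC interpretation used above ("a truly risky customer outranks a safe one X% of the time") can be computed literally as a pairwise comparison; a minimal sketch (function name is my own):

```python
import numpy as np

def pairwise_auc(scores, labels):
    """AUC = P(random positive outranks random negative); ties count half."""
    scores, labels = np.asarray(scores, float), np.asarray(labels, int)
    pos, neg = scores[labels == 1], scores[labels == 0]
    # Compare every positive score against every negative score.
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

scores = [0.9, 0.8, 0.3, 0.2, 0.1]
labels = [1,   0,   1,   0,   0]
print(pairwise_auc(scores, labels))  # 5 of 6 pairs ranked correctly, ~0.833
```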

💡 Concrete dollar example: Programmatic ad bidding: Bid = p(click) × revenue_per_click. If your model outputs p = 0.40 but truth is p = 0.05, you bid 8× too much per impression. On a $100K monthly budget, that's roughly $87,500 wasted on overpriced inventory. Calibration is a financial requirement, not just a statistical nicety.
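
The arithmetic behind this example, as a sketch (the per-click revenue value is illustrative; it cancels out of the overbid factor):

```python
# Numbers from the example: raw p = 0.40, true p = 0.05.
revenue_per_click = 1.00          # illustrative value per click
p_raw, p_true = 0.40, 0.05

fair_bid = p_true * revenue_per_click       # what an impression is worth
actual_bid = p_raw * revenue_per_click      # what the miscalibrated model bids
overbid_factor = actual_bid / fair_bid      # ~8x

budget = 100_000
wasted = budget * (1 - 1 / overbid_factor)  # spend above fair value, ~$87,500
print(overbid_factor, wasted)
```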

📐 Mathematical Foundations
✅ Act 1 — Why Oversample (The Fix)

With τ = 5% natural data, gradient descent sees 1 positive per 19 negatives each iteration. The intercept b₀ gets pushed deeply negative, p̂ ≈ 0 for nearly everything, and the gradient for the slope b₁ vanishes. The model never learns who is risky — just that almost no one is.

Keeping all positives and subsampling negatives to 50/50 creates a balanced gradient in every training step. Now b₁ gets a strong, consistent learning signal — AUC rises substantially.
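
A minimal sketch of this negative-subsampling scheme (function name and synthetic data are my own, not the lab's internals):

```python
import numpy as np

rng = np.random.default_rng(0)

def subsample_negatives(X, y, target_pos_rate=0.5):
    """Keep all positives; subsample negatives so positives hit target_pos_rate."""
    pos_idx = np.flatnonzero(y == 1)
    neg_idx = np.flatnonzero(y == 0)
    # Choose n_neg so that n_pos / (n_pos + n_neg) == target_pos_rate.
    n_neg = int(round(len(pos_idx) * (1 - target_pos_rate) / target_pos_rate))
    keep = np.concatenate([pos_idx, rng.choice(neg_idx, size=n_neg, replace=False)])
    rng.shuffle(keep)
    return X[keep], y[keep]

X = rng.normal(size=(10_000, 1))
y = (rng.random(10_000) < 0.05).astype(int)       # ~5% positives
Xb, yb = subsample_negatives(X, y, target_pos_rate=0.5)
print(yb.mean())                                  # ~0.5
```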

⚠️ Act 2 — The Calibration Cost (The Problem)

The model's intercept b₀ now calibrates to a 50% base rate — because that's the world it trained in. On the real holdout (5% positives), raw predictions are 5–15× too high.

Better discrimination. Broken probability scale.

King-Zeng Prior Correction (2001):

$$\hat{p}_{\text{corrected}} = \frac{\hat{p}_{\text{raw}}}{\hat{p}_{\text{raw}} + (1 - \hat{p}_{\text{raw}}) \cdot \dfrac{s \,(1-\tau)}{(1-s)\,\tau}}$$

s = positive rate in the training set  |  τ = true population base rate  |  When s = τ, the fraction equals 1 and the formula reduces to p̂_corrected = p̂_raw
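
The formula translates directly to code; a sketch (function name is my own) that also confirms the s = τ identity:

```python
import numpy as np

def king_zeng_correct(p_raw, s, tau):
    """Rescale raw scores from training positive rate s back to base rate tau."""
    p_raw = np.asarray(p_raw, dtype=float)
    ratio = (s * (1 - tau)) / ((1 - s) * tau)   # odds adjustment factor
    return p_raw / (p_raw + (1 - p_raw) * ratio)

p = np.array([0.10, 0.40, 0.70])
print(king_zeng_correct(p, s=0.50, tau=0.05))   # deflated toward the 5% world
print(king_zeng_correct(p, s=0.05, tau=0.05))   # s == tau -> unchanged
```

Because the mapping is strictly monotone, rank order (and therefore AUC) is untouched.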

Reliability Diagram (Calibration Curve):

Bin all holdout predictions into deciles by predicted probability. For each bin, plot the mean predicted probability (x-axis) against the observed fraction of positives (y-axis). Points on the diagonal = perfect calibration. Points below the diagonal = probabilities are inflated (model is overconfident).
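
A sketch of the decile binning (quantile edges; names are my own), checked on synthetic scores that are calibrated by construction:

```python
import numpy as np

def reliability_curve(p_pred, y_true, n_bins=10):
    """Decile-bin predictions; return (mean predicted, observed rate) per bin."""
    p_pred, y_true = np.asarray(p_pred), np.asarray(y_true)
    edges = np.quantile(p_pred, np.linspace(0, 1, n_bins + 1))
    bins = np.clip(np.searchsorted(edges, p_pred, side="right") - 1, 0, n_bins - 1)
    xs, ys = [], []
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            xs.append(p_pred[mask].mean())   # x: mean predicted probability
            ys.append(y_true[mask].mean())   # y: observed fraction of positives
    return np.array(xs), np.array(ys)

rng = np.random.default_rng(1)
p_true = rng.beta(1, 19, size=50_000)            # ~5% average event rate
y = (rng.random(50_000) < p_true).astype(int)
xs, ys = reliability_curve(p_true, y)            # perfectly calibrated scores
print(np.abs(xs - ys).max())                     # small: points hug the diagonal
```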

| Metric | What It Measures | Effect of Oversampling |
|---|---|---|
| Sensitivity @50% | Fraction of churners caught at a 50% decision threshold | ↑ Improves dramatically — 0% → 60%+ (the primary operational win) |
| AUC-ROC | Rank-ordering quality (discrimination) | ≈ Unchanged — logistic regression rank order is class-balance invariant |
| Brier Score | Mean squared probability error (lower = better) | ↑ Worsens — probabilities inflated |
| ECE | Expected Calibration Error (lower = better) | ↑ Worsens significantly |
| Mean Predicted Prob. | Average output score on holdout | ↑ Should equal τ; with oversampling it exceeds τ |

⚠️ Verification Check: After applying the correction, the mean predicted probability on any representative holdout should approximately equal τ. If it doesn't, check whether your training positive rate s and true base rate τ are specified correctly.
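
One way to sanity-check this property without the lab: start from calibrated probabilities with mean τ, push them through the inverse mapping to mimic raw oversampled scores, and confirm the correction recovers them (all values synthetic):

```python
import numpy as np

rng = np.random.default_rng(2)
s, tau = 0.50, 0.05
k = (s * (1 - tau)) / ((1 - s) * tau)            # odds inflation factor (19 here)

# Well-calibrated holdout probabilities with mean ~tau ...
p_cal = rng.beta(1, 19, size=100_000)
# ... pushed through the inverse of the correction to mimic raw oversampled scores.
p_raw = (p_cal * k) / (1 - p_cal * (1 - k))

corrected = p_raw / (p_raw + (1 - p_raw) * k)    # King-Zeng correction

print(round(p_raw.mean(), 3))        # far above 0.05
print(round(corrected.mean(), 3))    # back near tau = 0.05
```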

SIMULATION SETTINGS

KEY INSIGHTS

🧠 The Analyst's Playbook for Rare-Event Models
🎯 Discrimination vs. Calibration — Two Different Jobs

Discrimination (AUC): Can the model rank customers correctly? Does the high-risk customer score higher than the low-risk one? You need this for prioritization, targeting, and triage.

Calibration: Do predicted probabilities mean what they say? If the model says 10%, do roughly 10% of those customers actually convert? You need this for any dollar-value calculation.

The workflow: Oversample to get discrimination. Apply the correction to restore calibration. You don't have to choose between them.

📊 When Does Calibration Actually Matter?
| Use Case | Need AUC? | Need Calibration? | Reason |
|---|---|---|---|
| Direct mail to top 10% of model | ✅ Yes | ❌ No | Only rank matters — who's in the top decile |
| Programmatic bid pricing | ✅ Yes | ✅ Yes | Bid = p × value; bad p → bad bid |
| CLV estimation | ✅ Yes | ✅ Yes | CLV formulas directly multiply probability |
| Churn score thresholding | ✅ Yes | ⚠️ Sometimes | Depends on whether the threshold is absolute or relative |
| A/B test lift measurement | ✅ Yes | ✅ Yes | Comparing predicted lift requires calibrated probability differences |
🔧 Calibration Correction Methods Compared
| Method | Complexity | Best For |
|---|---|---|
| King-Zeng formula | ⭐ One formula | Logistic regression with known oversampling rate |
| Platt scaling | ⭐⭐ Fit a 2nd logistic model | SVM, neural networks, any model with a labeled holdout |
| Isotonic regression | ⭐⭐⭐ Non-parametric | Large holdout set, any model, no shape assumption |
| Temperature scaling | ⭐⭐ Single parameter | Neural networks, quick recalibration post-training |
⚠️ What This Simulation Doesn't Show

This tool uses negative subsampling: all positive (rare) examples are kept; the majority class is subsampled to hit the target ratio. In practice, SMOTE generates synthetic interpolated positives instead of subsampling, which avoids discarding negatives. The King-Zeng calibration cost applies to both approaches equally — anything that shifts the training class balance shifts the intercept.
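
For reference, the interpolation idea behind SMOTE is only a few lines; a simplified sketch (not the reference SMOTE, which adds details around neighbor selection and sampling ratios):

```python
import numpy as np

rng = np.random.default_rng(3)

def smote_like(X_pos, n_new, k=5):
    """Interpolate between a positive example and one of its k nearest
    positive neighbors (illustrative sketch of the SMOTE idea)."""
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_pos))
        d = np.linalg.norm(X_pos - X_pos[i], axis=1)
        neighbors = np.argsort(d)[1:k + 1]          # skip the point itself
        j = rng.choice(neighbors)
        u = rng.random()                            # interpolation weight in [0, 1]
        synthetic.append(X_pos[i] + u * (X_pos[j] - X_pos[i]))
    return np.array(synthetic)

X_pos = rng.normal(size=(50, 2))                    # the rare-class examples
X_new = smote_like(X_pos, n_new=100)
print(X_new.shape)                                  # (100, 2)
```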

The simulation also uses a single continuous predictor with logistic regression. With tree-based models (XGBoost, Random Forest), raw probability outputs are inherently miscalibrated even without oversampling — Platt scaling or isotonic regression is the standard fix there.
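
A minimal stand-in for Platt scaling, fitting sigmoid(a·score + b) to a labeled holdout by gradient descent (in practice you would use a library implementation such as scikit-learn's CalibratedClassifierCV; all data here is synthetic):

```python
import numpy as np

def fit_platt(scores, labels, lr=0.5, n_iter=5000):
    """Fit sigmoid(a*score + b) to holdout labels by gradient descent on log loss."""
    a, b = 1.0, 0.0
    s, y = np.asarray(scores, float), np.asarray(labels, float)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(a * s + b)))
        grad = p - y                       # per-point d(log loss)/d(logit)
        a -= lr * np.mean(grad * s)
        b -= lr * np.mean(grad)
    return a, b

rng = np.random.default_rng(4)
z = rng.normal(size=20_000)                                  # raw model scores
y = (rng.random(20_000) < 1 / (1 + np.exp(-(2 * z - 3)))).astype(int)
a, b = fit_platt(z, y)
print(round(a, 1), round(b, 1))   # should recover a ≈ 2, b ≈ -3
```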