Rare Event Calibration Lab

Model Calibration · Rare Event Modeling · Classification Foundations

You have 10,000 customers. Only 5% will churn. A naive model learns that saying "won't churn" is correct 95% of the time — so it stops identifying churners at all. Oversampling fixes the learning problem. But it quietly breaks the probability scale. This lab shows you both effects — and the one-line fix.

OVERVIEW & LEARNING OBJECTIVES

The Setup: Imagine you're building a churn model. 10,000 customers. 500 will actually churn (5%). The other 9,500 won't. You train a logistic regression on this data and it learns a dark secret: if it just predicts "won't churn" for every single customer, it's correct 95% of the time. Great accuracy. Useless model. It will never alert you to a churner because it learned that ignoring them is the safe bet.
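
The accuracy trap is easy to make concrete; a five-line sketch of a hypothetical "always predict won't-churn" model on the 5% base rate above:

```python
import numpy as np

y = np.zeros(10_000, dtype=int)
y[:500] = 1                      # 500 churners out of 10,000 (5%)

pred = np.zeros_like(y)          # the "always predict won't-churn" model

accuracy = (pred == y).mean()    # 0.95: looks great
recall = pred[y == 1].mean()     # 0.0: catches zero churners
print(accuracy, recall)
```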

The Fix — and its hidden cost: Oversampling artificially boosts the churn cases in training data — say, from 5% to 50% — so the model sees enough examples to actually learn what a churner looks like. This genuinely works: the model's ability to rank high-risk customers improves dramatically. But now the model thinks it lives in a world where 50% of people churn. Its raw probability scores are inflated 5–15× compared to reality. If you use those scores directly for budget decisions, you'll massively overspend.

The Correction: A one-line algebraic formula (King & Zeng, 2001) scales every raw score back to the true population rate. No retraining. No new data. You get the discrimination improvement and trustworthy probabilities.

📋 Step-by-Step: How to Use This Tool
1. Run the simulation (default settings are fine to start)

Click Run Simulation below. Four models will train on the same 10,000-observation dataset — one with no oversampling, and three with progressively aggressive oversampling (20%, 35%, 50% positive rate in training).

2. Look at the AUC chip on each condition card

AUC (0 to 1) measures how well the model rank-orders customers by risk. AUC = 0.50 is a coin flip — useless. AUC = 0.90 means a truly risky customer outranks a safe one 90% of the time. Watch how much AUC rises as oversampling increases. This is the win.

3. Look at the Mean Pred % chip — this is the problem

The true churn rate is 5%. Mean Pred % is what the model thinks the average customer's probability is. For oversampled models this number will be 3–10× higher than 5%. The "Inflation" chip shows the exact multiple. A 4× inflation means every probability is four times too high.

4. Toggle the King-Zeng correction — watch what changes and what doesn't

Enable Show King-Zeng Corrected Probabilities above the condition cards. The Mean Pred % will snap back to ~5%. The mini-charts will shift back toward the diagonal. But AUC will not change at all — the correction rescales probabilities but preserves rank order perfectly.

🎯 What You'll Learn
  • Why "natural" models fail on rare events: The model sees 1 churner per 19 non-churners. It learns that predicting "no" is almost always correct — earning high accuracy but near-random ability to find actual churners. AUC near 0.5 means barely better than a coin flip. Check the Natural Sampling card after running — this is why oversampling exists.
  • What oversampling fixes (and what it doesn't): Oversampling dramatically improves AUC — how well the model rank-orders customers by risk. It does NOT fix probability estimates — in fact, it makes them dangerously wrong.
  • What "calibration" means in plain English: If the model says a customer has a 10% churn probability, are roughly 10% of those customers actually churning? Calibration measures truthfulness of the probability number itself — not just whether it ranks customers in the right order.
  • When the probability number matters: Rank-ordering only? (Top 20% of customers for a calling campaign) → AUC is all you need. Dollar-value calculations? (Bid = p × value, CLV = p × margin × tenure) → You need calibration too. A 5× inflated probability → 5× miscalculated bids.
  • The King-Zeng correction: One formula. No retraining. Rescales every raw score back to the true population rate by accounting for the ratio between training positive rate and true positive rate.
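
The AUC interpretation used above ("a truly risky customer outranks a safe one X% of the time") can be computed literally as a pairwise comparison; a minimal sketch (function name is my own):

```python
import numpy as np

def pairwise_auc(scores, labels):
    """AUC = P(random positive outranks random negative); ties count half."""
    scores, labels = np.asarray(scores, float), np.asarray(labels, int)
    pos, neg = scores[labels == 1], scores[labels == 0]
    # Compare every positive score against every negative score.
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

scores = [0.9, 0.8, 0.3, 0.2, 0.1]
labels = [1,   0,   1,   0,   0]
print(pairwise_auc(scores, labels))  # 5 of 6 pairs ranked correctly, ~0.833
```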

💡 Concrete dollar example: Programmatic ad bidding: Bid = p(click) × revenue_per_click. If your model outputs p = 0.40 but truth is p = 0.05, you bid 8× too much per impression. On a $100K monthly budget, that's roughly $87,500 wasted on overpriced inventory. Calibration is a financial requirement, not just a statistical nicety.
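
The arithmetic behind this example, as a sketch (the per-click revenue value is illustrative; it cancels out of the overbid factor):

```python
# Numbers from the example: raw p = 0.40, true p = 0.05.
revenue_per_click = 1.00          # illustrative value per click
p_raw, p_true = 0.40, 0.05

fair_bid = p_true * revenue_per_click       # what an impression is worth
actual_bid = p_raw * revenue_per_click      # what the miscalibrated model bids
overbid_factor = actual_bid / fair_bid      # ~8x

budget = 100_000
wasted = budget * (1 - 1 / overbid_factor)  # spend above fair value, ~$87,500
print(overbid_factor, wasted)
```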

📐 Mathematical Foundations
✅ Act 1 — Why Oversample (The Fix)

With τ = 5% natural data, gradient descent sees 1 positive per 19 negatives each iteration. The intercept b₀ gets pushed deeply negative, p̂ ≈ 0 for nearly everything, and the gradient for the slope b₁ vanishes. The model never learns who is risky — just that almost no one is.

Keeping all positives and subsampling negatives to 50/50 creates a balanced gradient in every training step. Now b₁ gets a strong, consistent learning signal — AUC rises substantially.
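
A minimal sketch of this negative-subsampling scheme (function name and synthetic data are my own, not the lab's internals):

```python
import numpy as np

rng = np.random.default_rng(0)

def subsample_negatives(X, y, target_pos_rate=0.5):
    """Keep all positives; subsample negatives so positives hit target_pos_rate."""
    pos_idx = np.flatnonzero(y == 1)
    neg_idx = np.flatnonzero(y == 0)
    # Choose n_neg so that n_pos / (n_pos + n_neg) == target_pos_rate.
    n_neg = int(round(len(pos_idx) * (1 - target_pos_rate) / target_pos_rate))
    keep = np.concatenate([pos_idx, rng.choice(neg_idx, size=n_neg, replace=False)])
    rng.shuffle(keep)
    return X[keep], y[keep]

X = rng.normal(size=(10_000, 1))
y = (rng.random(10_000) < 0.05).astype(int)       # ~5% positives
Xb, yb = subsample_negatives(X, y, target_pos_rate=0.5)
print(yb.mean())                                  # ~0.5
```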

⚠️ Act 2 — The Calibration Cost (The Problem)

The model's intercept b₀ now calibrates to a 50% base rate — because that's the world it trained in. On the real holdout (5% positives), raw predictions are 5–15× too high.

Better discrimination. Broken probability scale.

King-Zeng Prior Correction (2001):

$$\hat{p}_{\text{corrected}} = \frac{\hat{p}_{\text{raw}}}{\hat{p}_{\text{raw}} + (1 - \hat{p}_{\text{raw}}) \cdot \dfrac{s \,(1-\tau)}{(1-s)\,\tau}}$$

s = positive rate in the training set  |  τ = true population base rate  |  When s = τ, the fraction equals 1 and the formula reduces to p̂_corrected = p̂_raw
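
The formula translates directly to code; a sketch (function name is my own) that also confirms the s = τ identity:

```python
import numpy as np

def king_zeng_correct(p_raw, s, tau):
    """Rescale raw scores from training positive rate s back to base rate tau."""
    p_raw = np.asarray(p_raw, dtype=float)
    ratio = (s * (1 - tau)) / ((1 - s) * tau)   # odds adjustment factor
    return p_raw / (p_raw + (1 - p_raw) * ratio)

p = np.array([0.10, 0.40, 0.70])
print(king_zeng_correct(p, s=0.50, tau=0.05))   # deflated toward the 5% world
print(king_zeng_correct(p, s=0.05, tau=0.05))   # s == tau -> unchanged
```

Because the mapping is strictly monotone, rank order (and therefore AUC) is untouched.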

Reliability Diagram (Calibration Curve):

Bin all holdout predictions into deciles by predicted probability. For each bin, plot the mean predicted probability (x-axis) against the observed fraction of positives (y-axis). Points on the diagonal = perfect calibration. Points below the diagonal = probabilities are inflated (model is overconfident).
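
A sketch of the decile binning (quantile edges; names are my own), checked on synthetic scores that are calibrated by construction:

```python
import numpy as np

def reliability_curve(p_pred, y_true, n_bins=10):
    """Decile-bin predictions; return (mean predicted, observed rate) per bin."""
    p_pred, y_true = np.asarray(p_pred), np.asarray(y_true)
    edges = np.quantile(p_pred, np.linspace(0, 1, n_bins + 1))
    bins = np.clip(np.searchsorted(edges, p_pred, side="right") - 1, 0, n_bins - 1)
    xs, ys = [], []
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            xs.append(p_pred[mask].mean())   # x: mean predicted probability
            ys.append(y_true[mask].mean())   # y: observed fraction of positives
    return np.array(xs), np.array(ys)

rng = np.random.default_rng(1)
p_true = rng.beta(1, 19, size=50_000)            # ~5% average event rate
y = (rng.random(50_000) < p_true).astype(int)
xs, ys = reliability_curve(p_true, y)            # perfectly calibrated scores
print(np.abs(xs - ys).max())                     # small: points hug the diagonal
```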

| Metric | What It Measures | Effect of Oversampling |
|---|---|---|
| Sensitivity @50% | Fraction of churners caught at a 50% decision threshold | ↑ Improves dramatically — 0% → 60%+ (the primary operational win) |
| AUC-ROC | Rank-ordering quality (discrimination) | ≈ Unchanged — logistic regression rank order is class-balance invariant |
| Brier Score | Mean squared probability error (lower = better) | ↑ Worsens — probabilities inflated |
| ECE | Expected Calibration Error (lower = better) | ↑ Worsens significantly |
| Mean Predicted Prob. | Average output score on holdout | ↑ Should equal τ; with oversampling it exceeds τ |

⚠️ Verification Check: After applying the correction, the mean predicted probability on any representative holdout should approximately equal τ. If it doesn't, check whether your training positive rate s and true base rate τ are specified correctly.
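
One way to sanity-check this property without the lab: start from calibrated probabilities with mean τ, push them through the inverse mapping to mimic raw oversampled scores, and confirm the correction recovers them (all values synthetic):

```python
import numpy as np

rng = np.random.default_rng(2)
s, tau = 0.50, 0.05
k = (s * (1 - tau)) / ((1 - s) * tau)            # odds inflation factor (19 here)

# Well-calibrated holdout probabilities with mean ~tau ...
p_cal = rng.beta(1, 19, size=100_000)
# ... pushed through the inverse of the correction to mimic raw oversampled scores.
p_raw = (p_cal * k) / (1 - p_cal * (1 - k))

corrected = p_raw / (p_raw + (1 - p_raw) * k)    # King-Zeng correction

print(round(p_raw.mean(), 3))        # far above 0.05
print(round(corrected.mean(), 3))    # back near tau = 0.05
```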

SIMULATION SETTINGS

KEY INSIGHTS

🧠 The Analyst's Playbook for Rare-Event Models
🎯 Discrimination vs. Calibration — Two Different Jobs

Discrimination (AUC): Can the model rank customers correctly? Does the high-risk customer score higher than the low-risk one? You need this for prioritization, targeting, and triage.

Calibration: Do predicted probabilities mean what they say? If the model says 10%, do roughly 10% of those customers actually convert? You need this for any dollar-value calculation.

The workflow: Oversample to get discrimination. Apply the correction to restore calibration. You don't have to choose between them.

📊 When Does Calibration Actually Matter?
| Use Case | Need AUC? | Need Calibration? | Reason |
|---|---|---|---|
| Direct mail to top 10% of model | ✅ Yes | ❌ No | Only rank matters — who's in the top decile |
| Programmatic bid pricing | ✅ Yes | ✅ Yes | Bid = p × value; bad p → bad bid |
| CLV estimation | ✅ Yes | ✅ Yes | CLV formulas directly multiply probability |
| Churn score thresholding | ✅ Yes | ⚠️ Sometimes | Depends on whether the threshold is absolute or relative |
| A/B test lift measurement | ✅ Yes | ✅ Yes | Comparing predicted lift requires calibrated probability differences |
🔧 Calibration Correction Methods Compared
| Method | Complexity | Best For |
|---|---|---|
| King-Zeng formula | ⭐ One formula | Logistic regression with known oversampling rate |
| Platt scaling | ⭐⭐ Fit a 2nd logistic model | SVM, neural networks, any model with a labeled holdout |
| Isotonic regression | ⭐⭐⭐ Non-parametric | Large holdout set, any model, no shape assumption |
| Temperature scaling | ⭐⭐ Single parameter | Neural networks, quick recalibration post-training |
⚠️ What This Simulation Doesn't Show

This tool uses negative subsampling: all positive (rare) examples are kept; the majority class is subsampled to hit the target ratio. In practice, SMOTE generates synthetic interpolated positives instead of subsampling, which avoids discarding negatives. The King-Zeng calibration cost applies to both approaches equally — anything that shifts the training class balance shifts the intercept.
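
For reference, the interpolation idea behind SMOTE is only a few lines; a simplified sketch (not the reference SMOTE, which adds details around neighbor selection and sampling ratios):

```python
import numpy as np

rng = np.random.default_rng(3)

def smote_like(X_pos, n_new, k=5):
    """Interpolate between a positive example and one of its k nearest
    positive neighbors (illustrative sketch of the SMOTE idea)."""
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_pos))
        d = np.linalg.norm(X_pos - X_pos[i], axis=1)
        neighbors = np.argsort(d)[1:k + 1]          # skip the point itself
        j = rng.choice(neighbors)
        u = rng.random()                            # interpolation weight in [0, 1]
        synthetic.append(X_pos[i] + u * (X_pos[j] - X_pos[i]))
    return np.array(synthetic)

X_pos = rng.normal(size=(50, 2))                    # the rare-class examples
X_new = smote_like(X_pos, n_new=100)
print(X_new.shape)                                  # (100, 2)
```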

The simulation also uses a single continuous predictor with logistic regression. With tree-based models (XGBoost, Random Forest), raw probability outputs are inherently miscalibrated even without oversampling — Platt scaling or isotonic regression is the standard fix there.
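
A minimal stand-in for Platt scaling, fitting sigmoid(a·score + b) to a labeled holdout by gradient descent (in practice you would use a library implementation such as scikit-learn's CalibratedClassifierCV; all data here is synthetic):

```python
import numpy as np

def fit_platt(scores, labels, lr=0.5, n_iter=5000):
    """Fit sigmoid(a*score + b) to holdout labels by gradient descent on log loss."""
    a, b = 1.0, 0.0
    s, y = np.asarray(scores, float), np.asarray(labels, float)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(a * s + b)))
        grad = p - y                       # per-point d(log loss)/d(logit)
        a -= lr * np.mean(grad * s)
        b -= lr * np.mean(grad)
    return a, b

rng = np.random.default_rng(4)
z = rng.normal(size=20_000)                                  # raw model scores
y = (rng.random(20_000) < 1 / (1 + np.exp(-(2 * z - 3)))).astype(int)
a, b = fit_platt(z, y)
print(round(a, 1), round(b, 1))   # should recover a ≈ 2, b ≈ -3
```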