Decision Tree Classifier
Build and interpret classification trees for marketing decisions. Understand how decision trees segment customers by learning splitting rules from data. Toggle between automated building and manual exploration to develop intuition for how trees partition feature space.
OVERVIEW & OBJECTIVE
A decision tree classifier is a supervised machine learning algorithm that learns to predict categorical outcomes by finding the best sequence of yes/no questions to ask about your data. Each question creates a "split" that separates customers into increasingly homogeneous groups.
The CART Algorithm: This tool uses Classification and Regression Trees (CART), which builds binary trees by finding the split that maximizes impurity reduction at each node. The algorithm evaluates every possible split point for every variable and chooses the one that best separates the classes.
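For readers who want to see the same idea in code, here is a minimal scikit-learn sketch of fitting a CART classifier. The file name, column names, and class labels are placeholders for whatever dataset you load in Step 1; they are not part of this tool.

```python
# Minimal CART sketch with scikit-learn; file, column, and label names are illustrative.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("customers.csv")                     # your own data from Step 1
X = df[["days_since_last_purchase", "total_orders", "avg_order_value"]]
y = df["churned"]                                     # e.g. "Churned" / "Stayed" labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=12345, stratify=y
)

# CART: repeatedly pick the binary split that most reduces impurity (Gini by default).
tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=20, random_state=12345)
tree.fit(X_train, y_train)
print("Test accuracy:", tree.score(X_test, y_test))
```

The later sketches in this guide reuse the `tree`, `X_train`/`X_test`, and `y_train`/`y_test` names defined here.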
Key Concepts & When to Use
📊 Best For
- Customer segmentation with clear decision rules
- Churn prediction with actionable thresholds
- Lead scoring and qualification
- Campaign targeting with explainable criteria
⚠️ Limitations
- Can overfit with deep trees or small data
- Axis-parallel splits only (can't capture diagonal boundaries)
- Sensitive to small data changes (high variance)
- May miss complex interactions without sufficient depth
🎯 Output
- IF-THEN rules directly usable in campaigns
- Feature importance rankings
- Probability estimates per segment
- Visual tree structure for stakeholder communication
📈 Marketing Advantage
Unlike "black box" models, decision trees produce interpretable rules you can explain to stakeholders and directly translate into marketing automation logic.
📊 Step 1: Choose Your Business Problem
Select a scenario above to see the business context and variables, or upload your own dataset below.
Or Upload Your Own Data
Drag & Drop CSV file (.csv, .tsv, .txt)
Include headers. One categorical outcome column + predictor columns (numeric or categorical).
⚙️ Step 2: Configure Tree Settings & Model
Algorithm finds optimal splits automatically.
Maximum levels from root to deepest leaf.
Prevents tiny, overfit leaf nodes.
Which outcome class should be treated as "success" for metrics like precision, recall, and ROC curves.
⚙️ Advanced Settings
How to measure quality of splits.
Percentage of data for training vs testing.
Set a number (e.g. "12345") to get the same train/test split each time. Leave blank to get a different random split each run. Useful for classroom demos.
📚 Understanding Tree Settings
🤖 Auto vs 🛠️ Manual Mode
Auto mode uses the CART algorithm to find the statistically "best" split at each node—the one that most reduces impurity (Gini or Entropy). This is efficient and finds patterns humans might miss.
Manual mode is primarily an educational tool to help you develop intuition for how decision trees partition data. You choose splits based on your own logic, which helps you understand the tradeoffs involved.
📚 Real-World Practice: In industry, decision trees are virtually always built automatically using impurity-based criteria (Gini or Information Gain) combined with hyperparameter tuning—systematically testing different values of max depth, min samples per leaf, etc., to find the best-performing configuration. Manual tree construction is useful for learning but impractical at scale.
That said, manual mode can occasionally be valuable for creating rule-based segments where business logic matters more than pure predictive accuracy—e.g., when stakeholders need round-number thresholds that are easy to communicate and implement.
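As a rough sketch of what that automated workflow looks like in scikit-learn, the grid search below tries several impurity criteria, depths, and leaf sizes with cross-validation. The parameter grid is illustrative, and `X_train`/`y_train` are assumed to come from your own train/test split (see the sketch in the Overview above).

```python
# Sketch: impurity-based tree building plus hyperparameter tuning via grid search.
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

param_grid = {
    "criterion": ["gini", "entropy"],       # impurity measure
    "max_depth": [2, 3, 4, 5],
    "min_samples_leaf": [10, 20, 50],
}
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    scoring="accuracy",   # prefer F1 or ROC AUC when classes are imbalanced
    cv=5,
)
search.fit(X_train, y_train)                 # from your own train/test split
print(search.best_params_)
print(search.best_score_)
```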
🌲 Max Depth
Depth controls complexity. A depth-1 tree (a "stump") makes one split. Depth-3 creates up to 8 segments. Deeper trees capture more patterns but risk overfitting—memorizing training data quirks that don't generalize.
Marketing rule of thumb: Start with depth 3-4. If test accuracy drops significantly from training accuracy, the tree is overfitting—reduce depth.
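One way to find that sweet spot in code, assuming the `X_train`/`X_test` split from the earlier sketch, is to fit the tree at several depths and compare training and test accuracy:

```python
# Sketch: watch the train/test accuracy gap as depth grows.
from sklearn.tree import DecisionTreeClassifier

for depth in range(1, 9):
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(X_train, y_train)
    print(f"depth={depth}: "
          f"train={model.score(X_train, y_train):.2f}, "
          f"test={model.score(X_test, y_test):.2f}")
# A training score that keeps rising while the test score stalls or drops
# is the classic overfitting signal described above.
```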
🎯 Target Class
For binary classification, metrics like precision and recall are calculated relative to one "positive" class. In marketing:
- Churn prediction: "Churned" is typically the target (we want to catch churners)
- Conversion: "Converted" is the target
- Lead scoring: "Qualified" or "Won" is usually the target
⚖️ Gini vs Entropy (Split Criteria)
Both are impurity measures that quantify how mixed the classes are at a node. The algorithm evaluates every possible split and chooses the one that maximizes impurity reduction.
- Gini Impurity: Measures the probability of misclassifying a randomly chosen element. Formula: 1 − Σ pᵢ². Computationally faster, tends to favor larger partitions.
- Entropy (Information Gain): From information theory—measures the expected "surprise" or uncertainty. Formula: −Σ pᵢ × log₂(pᵢ). Can find slightly more balanced splits.
In practice, they usually produce nearly identical trees. Gini is the default in most implementations (including scikit-learn) due to speed.
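To see the two measures side by side, here is a small sketch that computes both from a node's class proportions (the 70/30 example is illustrative):

```python
# Sketch: Gini impurity and entropy for a single node's class proportions.
import numpy as np

def gini(p):
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # skip empty classes to avoid log2(0)
    return -np.sum(p * np.log2(p))

# A node that is 70% "Stayed" and 30% "Churned":
print(gini([0.7, 0.3]))      # 0.42
print(entropy([0.7, 0.3]))   # ~0.88
```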
🌳 Step 3: Build the Tree
🌱 Select a scenario and click "Build Tree" to grow your decision tree.
💡 Tip: Use scroll wheel to zoom, click & drag to pan. Click any node to see detailed statistics.
📈 Step 4: Model Evaluation
📚 Understanding Classification Metrics
🎯 Accuracy
= (Correct Predictions) / (Total Predictions)
What it measures: Overall correctness—how often the model gets it right across all classes.
Marketing context: "Out of all customers we scored, what percentage did we classify correctly?"
⚠️ Caution: Accuracy can be misleading with imbalanced classes. If only 5% of customers churn, predicting "no churn" for everyone gives 95% accuracy but catches zero churners!
🔍 Precision
= True Positives / (True Positives + False Positives)
What it measures: When you predict the target class, how often are you right?
Marketing context: "Of the customers we flagged as likely churners, what percentage actually churned?"
When to prioritize: When false positives are costly. Example: Sending expensive retention offers to customers who weren't going to churn anyway wastes budget.
📡 Recall (Sensitivity)
= True Positives / (True Positives + False Negatives)
What it measures: Of all actual positive cases, how many did you catch?
Marketing context: "Of all customers who actually churned, what percentage did we identify in advance?"
When to prioritize: When missing positives is costly. Example: Missing a high-value customer about to churn means losing that revenue forever.
⚖️ F1 Score
= 2 × (Precision × Recall) / (Precision + Recall)
What it measures: Harmonic mean of precision and recall—a single metric that balances both.
Marketing context: "Overall, how well are we balancing catching churners vs. not wasting resources on false alarms?"
When to use: When you need a single number to compare models and care about both precision and recall. The harmonic mean penalizes extreme imbalances.
🤔 The Precision-Recall Tradeoff
You rarely maximize both. Being more aggressive (predicting "churn" more often) catches more actual churners (↑ recall) but includes more false alarms (↓ precision). Being conservative does the opposite.
Business decision: What's the relative cost of missing a churner vs. wasting a retention offer on a loyal customer? That guides which metric to optimize.
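As a concrete sketch, the four metrics above can be computed from the fitted `tree` and the test split defined in the earlier sketches; the `pos_label="Churned"` value is an illustrative target class, not a fixed requirement.

```python
# Sketch: accuracy, precision, recall, and F1 relative to an explicit target class.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_pred = tree.predict(X_test)     # 'tree' and the split come from earlier sketches

print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred, pos_label="Churned"))
print("Recall   :", recall_score(y_test, y_pred, pos_label="Churned"))
print("F1       :", f1_score(y_test, y_pred, pos_label="Churned"))
```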
Confusion Matrix
Build a tree to see confusion matrix
How to read this
Rows = Actual class, Columns = Predicted class.
- Diagonal (green): Correct predictions
- Off-diagonal (red): Errors
- False Positives: Predicted target, but wasn't (column sum minus diagonal)
- False Negatives: Missed actual targets (row sum minus diagonal)
Marketing tip: Look at where errors concentrate. Confusing "Low Value" with "Medium Value" may be acceptable; confusing "Loyal" with "Churning" is not!
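A minimal sketch, assuming the predictions from the metrics sketch above and illustrative class names:

```python
# Sketch: confusion matrix with rows = actual class, columns = predicted class.
from sklearn.metrics import confusion_matrix

labels = ["Stayed", "Churned"]                      # illustrative class names
cm = confusion_matrix(y_test, y_pred, labels=labels)
print(cm)
# cm[i, j] = number of customers whose actual class is labels[i]
# and whose predicted class is labels[j]; the diagonal holds correct predictions.
```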
ROC Curve
Build a tree to see ROC curve
How to read this
ROC Curve (Receiver Operating Characteristic) plots True Positive Rate vs False Positive Rate at various thresholds.
- Diagonal line: Random guessing (AUC = 0.5)
- Upper-left corner: Perfect classifier (AUC = 1.0)
- AUC (Area Under Curve): Probability that a randomly chosen positive case ranks higher than a randomly chosen negative case
Interpretation guide:
- AUC 0.9-1.0: Excellent discrimination
- AUC 0.8-0.9: Good discrimination
- AUC 0.7-0.8: Fair discrimination
- AUC 0.6-0.7: Poor discrimination
- AUC 0.5-0.6: Barely better than random
Marketing context: High AUC means the model reliably ranks likely churners above non-churners, even if the exact threshold varies.
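Here is a sketch of computing the curve and AUC from the tree's predicted probabilities, again assuming the fitted `tree`, the test split, and an illustrative "Churned" target class:

```python
# Sketch: ROC curve points and AUC for the target class.
from sklearn.metrics import roc_curve, roc_auc_score

pos_index = list(tree.classes_).index("Churned")    # probability column for the target class
scores = tree.predict_proba(X_test)[:, pos_index]

fpr, tpr, thresholds = roc_curve(y_test, scores, pos_label="Churned")
print("AUC:", roc_auc_score(y_test == "Churned", scores))
```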
Feature Importance
Build a tree to see feature importance
📊 How to Read Feature Importance (Important!)
What Feature Importance Tells You
Feature importance quantifies how much each predictor variable contributed to the tree's ability to separate classes. Think of it as answering: "Which variables did the tree rely on most heavily when making decisions?"
Variables with high importance were used in splits that affected many observations and/or created big improvements in class purity. Variables with zero importance were never selected for any split—they didn't provide useful discrimination power given the other available features.
How It's Calculated
For decision trees, importance is computed using Mean Decrease in Impurity (MDI):
- At each split on feature X, measure how much Gini impurity (or entropy) decreased
- Weight that decrease by the number of samples reaching that node
- Sum up all the weighted decreases for feature X across the entire tree
- Normalize so all importances sum to 100%
Formula: Importance(X) = Σ [n_node × ΔImpurity] over all nodes splitting on X
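In scikit-learn these MDI values are exposed directly on the fitted model; a sketch, reusing the `tree` and `X_train` names from the earlier sketches:

```python
# Sketch: read MDI feature importances off a fitted tree.
import pandas as pd

importances = pd.Series(tree.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False))
# Values are normalized to sum to 1.0; multiply by 100 to read them as the
# percentages described above.
```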
⚠️ Critical Caveats for Interpretation
- Importance ≠ Causation: A variable can be highly important because it's correlated with the true causal driver, not because it causes the outcome. "Ice cream sales" might predict drownings (both correlate with summer), but ice cream doesn't cause drownings.
- Correlated features compete: If two variables carry similar information (e.g., "income" and "home value"), the tree may only use one. The unused variable gets low importance even though it's predictive.
- Scale doesn't matter: Unlike some methods, tree-based importance isn't affected by whether a variable is in dollars vs. thousands of dollars.
- Categorical variables with many levels can appear artificially important because they offer more potential split points.
🎯 Marketing Applications
- Campaign prioritization: Focus retention efforts on the drivers that matter most. If "Days Since Last Purchase" dominates, recency-triggered campaigns may be most effective.
- Data collection guidance: High-importance variables are worth investing in better data quality and coverage.
- Stakeholder communication: Use importance rankings to explain which factors the model "pays attention to"—but always pair with business logic validation.
- Feature engineering ideas: If behavioral variables dominate demographics, consider creating more engagement-based features.
💡 Model Summary & Business Rules
Build a tree to see model interpretation...
📊 How to Use Your Decision Tree Model
🎯 Turning Rules into Action
Decision trees produce IF-THEN rules that map directly to marketing segments (see the export sketch after this list):
- Email targeting: Create segments matching each leaf's conditions
- Offer personalization: Different offers for high-risk vs low-risk segments
- Budget allocation: Prioritize retention spend on high-churn-probability segments
- Customer journey triggers: Set up automations that fire when a customer matches a rule's conditions
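One way to pull those IF-THEN rules out of a fitted scikit-learn tree is `export_text`; a minimal sketch, reusing the `tree` and `X_train` names from the earlier sketches:

```python
# Sketch: export the fitted tree's splitting rules as plain text.
from sklearn.tree import export_text

rules = export_text(tree, feature_names=list(X_train.columns))
print(rules)
# Each root-to-leaf path is one IF-THEN rule (e.g. "days_since_last_purchase > 45
# AND total_orders <= 2 -> mostly churners") that can be translated into
# segment definitions or automation triggers.
```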
⚠️ Common Pitfalls
- Overfitting: If training accuracy >> test accuracy, simplify the tree (reduce depth)
- Leaky features: Beware variables that "know" the outcome (e.g., "cancellation_date" predicting churn)
- Class imbalance: With rare events (2% churn), accuracy is misleading—focus on precision/recall
- Small leaves: Rules based on 5 customers aren't reliable—enforce minimum samples
📈 Improving Your Model
- Add features: Behavioral data (purchase recency, engagement) often beats demographics
- Feature engineering: Ratios, trends, and time-since variables can be powerful
- Try different depths: Plot train vs test accuracy at various depths to find the sweet spot
- Consider ensemble methods: Random Forests (many trees averaged) usually beat single trees
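As a quick sketch of that last point, a Random Forest can be dropped in on the same split for comparison (names reused from the earlier sketches):

```python
# Sketch: compare a single tree to a Random Forest on the same test split.
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X_train, y_train)
print("Single tree test accuracy  :", tree.score(X_test, y_test))
print("Random forest test accuracy:", forest.score(X_test, y_test))
# Averaging many de-correlated trees reduces the variance that makes a single
# tree sensitive to small changes in the data.
```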