Decision Tree Classifier
Build and interpret classification trees for marketing decisions. Understand how decision trees segment customers by learning splitting rules from data. Toggle between automated building and manual exploration to develop intuition for how trees partition feature space.
OVERVIEW & OBJECTIVE
A decision tree classifier is a supervised machine learning algorithm that learns to predict categorical outcomes by finding the best sequence of yes/no questions to ask about your data. Each question creates a "split" that separates customers into increasingly homogeneous groups.
The CART Algorithm: This tool uses Classification and Regression Trees (CART), which builds binary trees by finding the split that maximizes impurity reduction at each node. The algorithm evaluates every possible split point for every variable and chooses the one that best separates the classes.
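For readers who want to see the same idea in code, here is a minimal scikit-learn sketch of fitting a CART classifier. The file name, column names, and class labels are placeholders for whatever dataset you load in Step 1; they are not part of this tool.

```python
# Minimal CART sketch with scikit-learn; file, column, and label names are illustrative.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("customers.csv")                     # your own data from Step 1
X = df[["days_since_last_purchase", "total_orders", "avg_order_value"]]
y = df["churned"]                                     # e.g. "Churned" / "Stayed" labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=12345, stratify=y
)

# CART: repeatedly pick the binary split that most reduces impurity (Gini by default).
tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=20, random_state=12345)
tree.fit(X_train, y_train)
print("Test accuracy:", tree.score(X_test, y_test))
```

The later sketches in this guide reuse the `tree`, `X_train`/`X_test`, and `y_train`/`y_test` names defined here.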
Key Concepts & When to Use
📊 Best For
- Customer segmentation with clear decision rules
- Churn prediction with actionable thresholds
- Lead scoring and qualification
- Campaign targeting with explainable criteria
⚠️ Limitations
- Can overfit with deep trees or small data
- Axis-parallel splits only (can't capture diagonal boundaries)
- Sensitive to small data changes (high variance)
- May miss complex interactions without sufficient depth
🎯 Output
- IF-THEN rules directly usable in campaigns
- Feature importance rankings
- Probability estimates per segment
- Visual tree structure for stakeholder communication
📈 Marketing Advantage
Unlike "black box" models, decision trees produce interpretable rules you can explain to stakeholders and directly translate into marketing automation logic.
📊 Step 1: Choose Your Business Problem
Select a scenario above to see the business context and variables, or upload your own dataset below.
Or Upload Your Own Data
Drag & Drop CSV file (.csv, .tsv, .txt)
Include headers. One categorical outcome column + predictor columns (numeric or categorical).
⚙️ Step 2: Configure Tree Settings & Model
Algorithm finds optimal splits automatically.
Maximum levels from root to deepest leaf.
Prevents tiny, overfit leaf nodes.
Which outcome class should be treated as "success" for metrics like precision, recall, and ROC curves.
⚙️ Advanced Settings
How to measure quality of splits.
Percentage of data for training vs testing.
Set a number (e.g. "12345") to get the same train/test split each time. Leave blank to get a different random split each run. Useful for classroom demos.
📚 Understanding Tree Settings
🤖 Auto vs 🛠️ Manual Mode
Auto mode uses the CART algorithm to find the statistically "best" split at each node—the one that most reduces impurity (Gini or Entropy). This is efficient and finds patterns humans might miss.
Manual mode is primarily an educational tool to help you develop intuition for how decision trees partition data. You choose splits based on your own logic, which helps you understand the tradeoffs involved.
📚 Real-World Practice: In industry, decision trees are virtually always built automatically using impurity-based criteria (Gini or Information Gain) combined with hyperparameter tuning—systematically testing different values of max depth, min samples per leaf, etc., to find the best-performing configuration. Manual tree construction is useful for learning but impractical at scale.
That said, manual mode can occasionally be valuable for creating rule-based segments where business logic matters more than pure predictive accuracy—e.g., when stakeholders need round-number thresholds that are easy to communicate and implement.
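As a rough sketch of what that automated workflow looks like in scikit-learn, the grid search below tries several impurity criteria, depths, and leaf sizes with cross-validation. The parameter grid is illustrative, and `X_train`/`y_train` are assumed to come from your own train/test split (see the sketch in the Overview above).

```python
# Sketch: impurity-based tree building plus hyperparameter tuning via grid search.
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

param_grid = {
    "criterion": ["gini", "entropy"],       # impurity measure
    "max_depth": [2, 3, 4, 5],
    "min_samples_leaf": [10, 20, 50],
}
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    scoring="accuracy",   # prefer F1 or ROC AUC when classes are imbalanced
    cv=5,
)
search.fit(X_train, y_train)                 # from your own train/test split
print(search.best_params_)
print(search.best_score_)
```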
🌲 Max Depth
Depth controls complexity. A depth-1 tree (a "stump") makes one split. Depth-3 creates up to 8 segments. Deeper trees capture more patterns but risk overfitting—memorizing training data quirks that don't generalize.
Marketing rule of thumb: Start with depth 3-4. If test accuracy drops significantly from training accuracy, the tree is overfitting—reduce depth.
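One way to find that sweet spot in code, assuming the `X_train`/`X_test` split from the earlier sketch, is to fit the tree at several depths and compare training and test accuracy:

```python
# Sketch: watch the train/test accuracy gap as depth grows.
from sklearn.tree import DecisionTreeClassifier

for depth in range(1, 9):
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(X_train, y_train)
    print(f"depth={depth}: "
          f"train={model.score(X_train, y_train):.2f}, "
          f"test={model.score(X_test, y_test):.2f}")
# A training score that keeps rising while the test score stalls or drops
# is the classic overfitting signal described above.
```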
🎯 Target Class
For binary classification, metrics like precision and recall are calculated relative to one "positive" class. In marketing:
- Churn prediction: "Churned" is typically the target (we want to catch churners)
- Conversion: "Converted" is the target
- Lead scoring: "Qualified" or "Won" is usually the target
⚖️ Gini vs Entropy (Split Criteria)
Both are impurity measures that quantify how mixed the classes are at a node. The algorithm evaluates every possible split and chooses the one that maximizes impurity reduction.
- Gini Impurity: Measures the probability of misclassifying a randomly chosen element. Formula: 1 − Σ pᵢ². Computationally faster, tends to favor larger partitions.
- Entropy (Information Gain): From information theory—measures the expected "surprise" or uncertainty. Formula: −Σ pᵢ × log₂(pᵢ). Can find slightly more balanced splits.
In practice, they usually produce nearly identical trees. Gini is the default in most implementations (including scikit-learn) due to speed.
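To see the two measures side by side, here is a small sketch that computes both from a node's class proportions (the 70/30 example is illustrative):

```python
# Sketch: Gini impurity and entropy for a single node's class proportions.
import numpy as np

def gini(p):
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # skip empty classes to avoid log2(0)
    return -np.sum(p * np.log2(p))

# A node that is 70% "Stayed" and 30% "Churned":
print(gini([0.7, 0.3]))      # 0.42
print(entropy([0.7, 0.3]))   # ~0.88
```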
🌳 Step 3: Build the Tree
🌱 Select a scenario and click "Build Tree" to grow your decision tree.
💡 Tip: Use scroll wheel to zoom, click & drag to pan. Click any node to see detailed statistics.
📈 Step 4: Model Evaluation
📚 Understanding Classification Metrics
🎯 Accuracy
= (Correct Predictions) / (Total Predictions)
What it measures: Overall correctness—how often the model gets it right across all classes.
Marketing context: "Out of all customers we scored, what percentage did we classify correctly?"
⚠️ Caution: Accuracy can be misleading with imbalanced classes. If only 5% of customers churn, predicting "no churn" for everyone gives 95% accuracy but catches zero churners!
🔍 Precision
= True Positives / (True Positives + False Positives)
What it measures: When you predict the target class, how often are you right?
Marketing context: "Of the customers we flagged as likely churners, what percentage actually churned?"
When to prioritize: When false positives are costly. Example: Sending expensive retention offers to customers who weren't going to churn anyway wastes budget.
📡 Recall (Sensitivity)
= True Positives / (True Positives + False Negatives)
What it measures: Of all actual positive cases, how many did you catch?
Marketing context: "Of all customers who actually churned, what percentage did we identify in advance?"
When to prioritize: When missing positives is costly. Example: Missing a high-value customer about to churn means losing that revenue forever.
⚖️ F1 Score
= 2 × (Precision × Recall) / (Precision + Recall)
What it measures: Harmonic mean of precision and recall—a single metric that balances both.
Marketing context: "Overall, how well are we balancing catching churners vs. not wasting resources on false alarms?"
When to use: When you need a single number to compare models and care about both precision and recall. The harmonic mean penalizes extreme imbalances.
🤔 The Precision-Recall Tradeoff
You rarely maximize both. Being more aggressive (predicting "churn" more often) catches more actual churners (↑ recall) but includes more false alarms (↓ precision). Being conservative does the opposite.
Business decision: What's the relative cost of missing a churner vs. wasting a retention offer on a loyal customer? That guides which metric to optimize.
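As a concrete sketch, the four metrics above can be computed from the fitted `tree` and the test split defined in the earlier sketches; the `pos_label="Churned"` value is an illustrative target class, not a fixed requirement.

```python
# Sketch: accuracy, precision, recall, and F1 relative to an explicit target class.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_pred = tree.predict(X_test)     # 'tree' and the split come from earlier sketches

print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred, pos_label="Churned"))
print("Recall   :", recall_score(y_test, y_pred, pos_label="Churned"))
print("F1       :", f1_score(y_test, y_pred, pos_label="Churned"))
```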
Confusion Matrix
Build a tree to see confusion matrix
How to read this
Rows = Actual class, Columns = Predicted class.
- Diagonal (green): Correct predictions
- Off-diagonal (red): Errors
- False Positives: Predicted target, but wasn't (column sum minus diagonal)
- False Negatives: Missed actual targets (row sum minus diagonal)
Marketing tip: Look at where errors concentrate. Confusing "Low Value" with "Medium Value" may be acceptable; confusing "Loyal" with "Churning" is not!
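A minimal sketch, assuming the predictions from the metrics sketch above and illustrative class names:

```python
# Sketch: confusion matrix with rows = actual class, columns = predicted class.
from sklearn.metrics import confusion_matrix

labels = ["Stayed", "Churned"]                      # illustrative class names
cm = confusion_matrix(y_test, y_pred, labels=labels)
print(cm)
# cm[i, j] = number of customers whose actual class is labels[i]
# and whose predicted class is labels[j]; the diagonal holds correct predictions.
```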
ROC Curve
Build a tree to see ROC curve
How to read this
ROC Curve (Receiver Operating Characteristic) plots True Positive Rate vs False Positive Rate at various thresholds.
- Diagonal line: Random guessing (AUC = 0.5)
- Upper-left corner: Perfect classifier (AUC = 1.0)
- AUC (Area Under Curve): Probability that a randomly chosen positive case ranks higher than a randomly chosen negative case
Interpretation guide:
- AUC 0.9-1.0: Excellent discrimination
- AUC 0.8-0.9: Good discrimination
- AUC 0.7-0.8: Fair discrimination
- AUC 0.6-0.7: Poor discrimination
- AUC 0.5-0.6: Barely better than random
Marketing context: High AUC means the model reliably ranks likely churners above non-churners, even if the exact threshold varies.
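Here is a sketch of computing the curve and AUC from the tree's predicted probabilities, again assuming the fitted `tree`, the test split, and an illustrative "Churned" target class:

```python
# Sketch: ROC curve points and AUC for the target class.
from sklearn.metrics import roc_curve, roc_auc_score

pos_index = list(tree.classes_).index("Churned")    # probability column for the target class
scores = tree.predict_proba(X_test)[:, pos_index]

fpr, tpr, thresholds = roc_curve(y_test, scores, pos_label="Churned")
print("AUC:", roc_auc_score(y_test == "Churned", scores))
```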
Feature Importance
Build a tree to see feature importance
📊 How to Read Feature Importance (Important!)
What Feature Importance Tells You
Feature importance quantifies how much each predictor variable contributed to the tree's ability to separate classes. Think of it as answering: "Which variables did the tree rely on most heavily when making decisions?"
Variables with high importance were used in splits that affected many observations and/or created big improvements in class purity. Variables with zero importance were never selected for any split—they didn't provide useful discrimination power given the other available features.
How It's Calculated
For decision trees, importance is computed using Mean Decrease in Impurity (MDI):
- At each split on feature X, measure how much Gini impurity (or entropy) decreased
- Weight that decrease by the number of samples reaching that node
- Sum up all the weighted decreases for feature X across the entire tree
- Normalize so all importances sum to 100%
Formula: Importance(X) = Σ [n_node × ΔImpurity] over all nodes splitting on X
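In scikit-learn these MDI values are exposed directly on the fitted model; a sketch, reusing the `tree` and `X_train` names from the earlier sketches:

```python
# Sketch: read MDI feature importances off a fitted tree.
import pandas as pd

importances = pd.Series(tree.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False))
# Values are normalized to sum to 1.0; multiply by 100 to read them as the
# percentages described above.
```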
⚠️ Critical Caveats for Interpretation
- Importance ≠ Causation: A variable can be highly important because it's correlated with the true causal driver, not because it causes the outcome. "Ice cream sales" might predict drownings (both correlate with summer), but ice cream doesn't cause drownings.
- Correlated features compete: If two variables carry similar information (e.g., "income" and "home value"), the tree may only use one. The unused variable gets low importance even though it's predictive.
- Scale doesn't matter: Unlike some methods, tree-based importance isn't affected by whether a variable is in dollars vs. thousands of dollars.
- Categorical variables with many levels can appear artificially important because they offer more potential split points.
🎯 Marketing Applications
- Campaign prioritization: Focus retention efforts on the drivers that matter most. If "Days Since Last Purchase" dominates, recency-triggered campaigns may be most effective.
- Data collection guidance: High-importance variables are worth investing in better data quality and coverage.
- Stakeholder communication: Use importance rankings to explain which factors the model "pays attention to"—but always pair with business logic validation.
- Feature engineering ideas: If behavioral variables dominate demographics, consider creating more engagement-based features.
💡 Model Summary & Business Rules
Build a tree to see model interpretation...
📊 How to Use Your Decision Tree Model
🎯 Turning Rules into Action
Decision trees produce IF-THEN rules that map directly to marketing segments (see the export sketch after this list):
- Email targeting: Create segments matching each leaf's conditions
- Offer personalization: Different offers for high-risk vs low-risk segments
- Budget allocation: Prioritize retention spend on high-churn-probability segments
- Customer journey triggers: Set up automations that fire when a customer matches a rule's conditions
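One way to pull those IF-THEN rules out of a fitted scikit-learn tree is `export_text`; a minimal sketch, reusing the `tree` and `X_train` names from the earlier sketches:

```python
# Sketch: export the fitted tree's splitting rules as plain text.
from sklearn.tree import export_text

rules = export_text(tree, feature_names=list(X_train.columns))
print(rules)
# Each root-to-leaf path is one IF-THEN rule (e.g. "days_since_last_purchase > 45
# AND total_orders <= 2 -> mostly churners") that can be translated into
# segment definitions or automation triggers.
```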
⚠️ Common Pitfalls
- Overfitting: If training accuracy >> test accuracy, simplify the tree (reduce depth)
- Leaky features: Beware variables that "know" the outcome (e.g., "cancellation_date" predicting churn)
- Class imbalance: With rare events (2% churn), accuracy is misleading—focus on precision/recall
- Small leaves: Rules based on 5 customers aren't reliable—enforce minimum samples
📈 Improving Your Model
- Add features: Behavioral data (purchase recency, engagement) often beats demographics
- Feature engineering: Ratios, trends, and time-since variables can be powerful
- Try different depths: Plot train vs test accuracy at various depths to find the sweet spot
- Consider ensemble methods: Random Forests (many trees averaged) usually beat single trees
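As a quick sketch of that last point, a Random Forest can be dropped in on the same split for comparison (names reused from the earlier sketches):

```python
# Sketch: compare a single tree to a Random Forest on the same test split.
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X_train, y_train)
print("Single tree test accuracy  :", tree.score(X_test, y_test))
print("Random forest test accuracy:", forest.score(X_test, y_test))
# Averaging many de-correlated trees reduces the variance that makes a single
# tree sensitive to small changes in the data.
```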