Categorization Studio

Data Wrangling

Learn how different binning strategies transform the same numeric variable into categories with different meanings. Compare methods side-by-side to see how design choices shape interpretation.

👨‍🏫 Professor Mode: Guided Learning Experience

New to binning? Enable Professor Mode for step-by-step guidance through transforming numeric data into meaningful categories!

CORE LEARNING OUTCOME

Categories are constructed through design choices, and different choices create different meanings from the same data.

This tool teaches you five approaches to converting numeric variables into categorical groups. Each method makes different trade-offs between interpretability, balance, and fidelity to the underlying distribution.

About binning methods
  • Equal-width: Divides the range into bins of equal size. Simple but can create empty or imbalanced bins with skewed data.
  • Quantile (equal-frequency): Creates bins with equal numbers of observations. Balanced groups, but bin widths vary.
  • Manual rules: You define the cutpoints. Maximum interpretability when business thresholds exist.
  • Jenks natural breaks: Chooses cutpoints that minimize within-group variance, which is equivalent to maximizing between-group variance. Finds "natural" clusters in the data.
  • K-means (1D): Clustering-based approach that assigns each value to the nearest of k group means. Good for discovering natural groupings, though results can vary with the starting centers.
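To make the first two methods concrete, here is a minimal sketch in plain Python. The function names, the nearest-rank quantile rule, and the `spend` values are illustrative assumptions, not the tool's own implementation.

```python
from bisect import bisect_right

def equal_width_edges(values, k):
    """Return k+1 edges that split [min, max] into k intervals of equal width."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / k
    return [lo + i * step for i in range(k)] + [hi]

def quantile_edges(values, k):
    """Return k+1 edges so that each bin holds roughly len(values)/k observations."""
    s = sorted(values)
    n = len(s)
    # Assumed rule: inner edge i is the value at sorted position i*n/k.
    inner = [s[(i * n) // k] for i in range(1, k)]
    return [s[0]] + inner + [s[-1]]

def assign(values, edges):
    """Map each value to a bin index 0..k-1 using right-open intervals.
    Only inner edges are searched, so the maximum lands in the last bin."""
    inner = edges[1:-1]
    return [bisect_right(inner, v) for v in values]

def counts(labels, k):
    """Number of observations in each of the k bins."""
    return [labels.count(b) for b in range(k)]

# Hypothetical right-skewed order values, for illustration only.
spend = [5, 8, 10, 12, 15, 18, 20, 25, 30, 40, 60, 200]

ew = equal_width_edges(spend, 4)
qt = quantile_edges(spend, 4)
print("equal-width:", ew, counts(assign(spend, ew), 4))
print("quantile:   ", qt, counts(assign(spend, qt), 4))
```

On this skewed sample the equal-width bins hold 10, 1, 0, and 1 observations, while the quantile bins hold 3 each but the last one stretches from 40 to 200.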

When to use each method

Method      | Best for                                  | Avoid when
Equal-width | Uniform distributions, simple reporting   | Highly skewed data
Quantile    | Ranking, percentile-based targets         | Many tied values at boundaries
Manual      | Known thresholds (e.g., NPS categories)   | No domain knowledge available
Jenks       | Geographic/spatial data, natural clusters | Need for equal-sized groups
K-means     | Discovering patterns, exploratory analysis | Need reproducible business rules
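The NPS example in the table is a convenient illustration of manual rules. The sketch below uses the standard NPS grouping (0-6 detractor, 7-8 passive, 9-10 promoter); the `scores` list and helper names are hypothetical.

```python
from bisect import bisect_right

# Lower bounds of the second and third categories on the 0-10 scale.
CUTPOINTS = [7, 9]
LABELS = ["Detractor", "Passive", "Promoter"]

def nps_category(score):
    """Return the NPS category for a 0-10 'likelihood to recommend' score."""
    if not 0 <= score <= 10:
        raise ValueError(f"score out of range: {score}")
    # bisect_right counts how many cutpoints the score has reached.
    return LABELS[bisect_right(CUTPOINTS, score)]

# Hypothetical survey responses, for illustration only.
scores = [10, 9, 8, 6, 3, 7, 10, 9]
categories = [nps_category(s) for s in scores]

promoters = categories.count("Promoter")
detractors = categories.count("Detractor")
nps = 100 * (promoters - detractors) / len(scores)
print(categories)
print("NPS:", nps)
```

Because the cutpoints are written down rather than fitted, the same score always receives the same label, whatever the rest of the sample looks like.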

MARKETING SCENARIOS

Each case study demonstrates a different distributional pattern commonly found in marketing data. Select one to explore how binning choices affect interpretation, or upload your own dataset.

DATA INPUT

Load data

Select a case study from the dropdown above to load data.

Upload your CSV or Excel file

Provide a file with a header row. After upload, select which numeric variable to analyze.

Drag & Drop data file (.csv, .tsv, .txt, .xlsx)

BINNING SCHEMA

Load data and apply a schema to see the visualization.

How to read this chart

Each dot represents one observation. The x-position shows the raw numeric value (this never changes). The y-position is random jitter: it spreads dots vertically so you can see individual points instead of a stack. The y-axis has no meaning; ignore it.

The color and vertical band show the bin assignment under the current schema. When you change the binning method or number of bins, watch how dots get reassigned to different categories. The data stay the same; only the interpretation changes.
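The chart logic described above can be sketched as follows. This is an assumption about how such a plot could be prepared, not the tool's actual code; `jitter_points`, `bin_index`, and both cutpoint lists are hypothetical.

```python
import random

def jitter_points(values, seed=0):
    """Pair each raw value (x) with a random vertical offset (y) in [0, 1)."""
    rng = random.Random(seed)  # seeded so the layout is repeatable
    return [(v, rng.random()) for v in values]

def bin_index(value, inner_cutpoints):
    """Count how many cutpoints the value has reached; that count is its bin."""
    return sum(value >= c for c in inner_cutpoints)

# Hypothetical data, for illustration only.
values = [5, 8, 10, 12, 15, 18, 20, 25, 30, 40, 60, 200]
points = jitter_points(values)

# Two hypothetical schemas applied to the same points.
schema_a = [53.75, 102.5, 151.25]  # equal-width style cutpoints
schema_b = [12, 20, 40]            # quantile style cutpoints

colors_a = [bin_index(x, schema_a) for x, _ in points]
colors_b = [bin_index(x, schema_b) for x, _ in points]

print("x positions:", [x for x, _ in points])
print("schema A bins:", colors_a)
print("schema B bins:", colors_b)
```

Switching schemas changes only the color index attached to each point; the x and y coordinates are computed once and never move.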

Design tensions to consider
  • Interpretability vs. behavioral fidelity: Simple round-number cutoffs are easy to explain but may not reflect natural patterns in the data.
  • Even group sizes vs. similar behavior within groups: Quantile bins ensure balanced samples but may group dissimilar observations together.
  • Dashboard stability vs. sensitivity: Fixed cutoffs are stable over time, but may miss emerging patterns.
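The stability tension can be illustrated with a small sketch that compares a fixed cutoff against one recomputed from each period's data. The threshold, the monthly values, and the median-split rule are all assumptions chosen for illustration.

```python
from statistics import median

FIXED_CUT = 50  # hypothetical business threshold

# Hypothetical spend per customer; February is January shifted up by 20.
jan = [20, 30, 40, 50, 60, 70, 80]
feb = [40, 50, 60, 70, 80, 90, 100]

def label(value, cut):
    """Two-bin schema: values at or above the cut are 'high'."""
    return "high" if value >= cut else "low"

def share_high(values, cut):
    """Fraction of observations labeled 'high' under the given cut."""
    return sum(label(v, cut) == "high" for v in values) / len(values)

# Data-driven alternative: recompute the cut as each month's median.
jan_cut, feb_cut = median(jan), median(feb)

print("fixed cut:      ", FIXED_CUT, share_high(jan, FIXED_CUT), share_high(feb, FIXED_CUT))
print("recomputed cuts:", jan_cut, feb_cut, share_high(jan, jan_cut), share_high(feb, feb_cut))
print("a spend of 60 in Feb:", label(60, FIXED_CUT), "vs", label(60, feb_cut))
```

With the fixed cutoff, a spend of 60 means "high" in both months. With the recomputed cutoff the group sizes stay balanced, but the same spend changes label from one month to the next.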

COMPARE BINNING STRATEGIES

Compare how different binning methods partition the same data. Each panel updates automatically as you change settings. Click "Use This" to apply that schema to the Bin Summary below.

Why compare schemas?

Different binning methods optimize for different goals. Equal-width creates uniform ranges (good for interpretability). Quantile creates equal-sized groups (good for comparisons). Jenks and K-means minimize within-group variance (good for clustering). Seeing them side-by-side reveals how the same data tells different stories depending on your binning philosophy.
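To see the variance claim in numbers, the sketch below runs a simple 1D k-means (Lloyd's algorithm with a deterministic start, which is an assumption and not necessarily how the tool initializes) and compares its within-group sum of squares against equal-width bins. The data are hypothetical, and the equal-width helper assumes the values are not all identical.

```python
def kmeans_1d(values, k, iters=100):
    """Lloyd's algorithm on one variable. Centers start at evenly spaced
    positions in the sorted data, so the result is repeatable."""
    s = sorted(values)
    n = len(s)
    centers = [s[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in s:
            nearest = min(range(k), key=lambda j: abs(v - centers[j]))
            groups[nearest].append(v)
        # An empty group keeps its previous center.
        new = [sum(g) / len(g) if g else centers[j] for j, g in enumerate(groups)]
        if new == centers:
            break
        centers = new
    return groups

def equal_width_groups(values, k):
    """Group values into k intervals of equal width over [min, max]."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / k
    groups = [[] for _ in range(k)]
    for v in sorted(values):
        groups[min(int((v - lo) / step), k - 1)].append(v)
    return groups

def within_ss(groups):
    """Sum of squared deviations of each value from its own group mean."""
    total = 0.0
    for g in groups:
        if g:
            m = sum(g) / len(g)
            total += sum((v - m) ** 2 for v in g)
    return total

# Hypothetical data with three visible clumps.
values = [1, 2, 3, 10, 11, 12, 50, 52]

km = kmeans_1d(values, 3)
ew = equal_width_groups(values, 3)
print("k-means:    ", km, within_ss(km))
print("equal-width:", ew, within_ss(ew))
```

On this sample k-means recovers the three clumps, while equal-width merges the two lower clumps and leaves the middle bin empty, which raises the within-group sum of squares.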

BIN SUMMARY

Apply a schema to see bin statistics.
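The exact statistics shown in the summary are not listed here. As a rough sketch, a per-bin table could include the bin bounds, count, share of observations, and mean; those columns, the `bin_summary` helper, and the sample data are assumptions.

```python
from bisect import bisect_right
from statistics import mean

def bin_summary(values, edges):
    """Return one dict per bin with bounds, count, share, and mean.
    Intervals are right-open; the maximum value falls in the last bin."""
    k = len(edges) - 1
    inner = edges[1:-1]
    groups = [[] for _ in range(k)]
    for v in values:
        groups[bisect_right(inner, v)].append(v)
    rows = []
    for i, g in enumerate(groups):
        rows.append({
            "bin": i,
            "lower": edges[i],
            "upper": edges[i + 1],
            "count": len(g),
            "share": len(g) / len(values),
            "mean": mean(g) if g else None,  # empty bins have no mean
        })
    return rows

# Hypothetical data and quantile-style edges, for illustration only.
spend = [5, 8, 10, 12, 15, 18, 20, 25, 30, 40, 60, 200]
rows = bin_summary(spend, [5, 12, 20, 40, 200])
for row in rows:
    print(row)
```

Reading the mean next to the bounds is a quick check on how representative a bin label is: here the last bin spans 40 to 200 and its mean is pulled up by a single large value.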

EXPORT

Code templates

Generate deterministic code to reproduce this binning in your analysis environment.
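What such a template looks like is not shown here. One plausible shape, sketched below under that assumption, is a generator that writes the fitted cutpoints into the exported code as constants, so rerunning it never refits anything. `render_python_template`, the cutpoints, and the labels are hypothetical.

```python
def render_python_template(inner_edges, labels):
    """Return Python source that reproduces a binning from fixed cutpoints."""
    if len(labels) != len(inner_edges) + 1:
        raise ValueError("need exactly one more label than inner cutpoints")
    if list(inner_edges) != sorted(inner_edges):
        raise ValueError("cutpoints must be in ascending order")
    return (
        "from bisect import bisect_right\n"
        "\n"
        f"EDGES = {list(inner_edges)!r}\n"
        f"LABELS = {list(labels)!r}\n"
        "\n"
        "def assign_bin(value):\n"
        "    # Right-open intervals: a value equal to a cutpoint goes to the upper bin.\n"
        "    return LABELS[bisect_right(EDGES, value)]\n"
    )

# Hypothetical schema: cutpoints fitted once, then frozen in the export.
source = render_python_template([12, 20, 40], ["Low", "Medium", "High", "Very high"])
print(source)

# Run the generated code to confirm it behaves as intended.
namespace = {}
exec(source, namespace)
print(namespace["assign_bin"](12), namespace["assign_bin"](200))
```

Freezing the cutpoints matters most for data-driven methods such as quantile or k-means, where refitting on new data would otherwise move the boundaries.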