k-Prototypes Clustering Tool

Mixed Data Segmentation

Segment customers, products, or campaigns using k-prototypes clustering—designed specifically for datasets with both continuous (spend, frequency) and categorical (region, tier) variables. Upload mixed-type data or explore prebuilt marketing scenarios.

👨‍🏫 Professor Mode: Guided Learning Experience

New to mixed-data clustering? Enable Professor Mode for step-by-step guidance through segmenting customers using both numeric and categorical variables!

OVERVIEW & OBJECTIVE

k-prototypes extends k-means to handle mixed data types by combining squared Euclidean distance for continuous variables with simple matching dissimilarity for categorical variables. Unlike k-means (continuous only) or k-modes (categorical only), k-prototypes handles realistic marketing datasets where customers or products have both numerical metrics (spend, frequency) and descriptive attributes (region, tier, channel).

$$d(X, Q) = \sum_{j \in \text{continuous}} (x_j - q_j)^2 + \gamma \sum_{j \in \text{categorical}} \delta(x_j, q_j)$$

where \(\delta(a, b) = 0\) if \(a = b\), else \(1\). The parameter \(\gamma\) balances the influence of continuous vs. categorical variables. Each cluster center (prototype) contains means for continuous features and modes (most frequent values) for categorical features.
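As a minimal sketch of the distance above (not the tool's actual code), the function below adds the squared differences on continuous features to γ times the number of categorical mismatches. The function name, the example customer, and the prototype values are hypothetical and chosen only for illustration.

```python
def mixed_distance(x_num, x_cat, q_num, q_cat, gamma):
    """Squared Euclidean part plus gamma times the count of categorical mismatches."""
    numeric_part = sum((x - q) ** 2 for x, q in zip(x_num, q_num))
    mismatches = sum(1 for a, b in zip(x_cat, q_cat) if a != b)
    return numeric_part + gamma * mismatches

# Hypothetical observation and prototype: (spend, frequency) and (region, tier).
customer_num, customer_cat = [120.0, 4.0], ["West", "Enterprise"]
prototype_num, prototype_cat = [100.0, 5.0], ["West", "SMB"]

# Continuous part: 20^2 + 1^2 = 401; one mismatch (tier) adds gamma = 2.
d = mixed_distance(customer_num, customer_cat, prototype_num, prototype_cat, gamma=2.0)
print(d)  # 403.0
```

Note how the unscaled spend difference dominates the result, which is why feature scaling and the choice of γ matter.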

Additional notes & assumptions

Like k-means, k-prototypes is sensitive to initialization and feature scales. The tool auto-calculates \(\gamma\) as the average standard deviation of continuous features, but you can override this in advanced settings. Results are exploratory—validate segments with business knowledge and holdout data.

MARKETING SCENARIOS

Use presets to explore realistic segmentation scenarios with mixed data types, such as customer profiles (spend + demographics), product portfolios (performance + attributes), or lead databases (engagement + firmographics). Download and customize scenario data in Excel.

INPUTS & SETTINGS

Load data & assign variable types

Upload CSV with mixed data types

Include a header row. Columns can be continuous (numeric values) or categorical (text labels). The tool auto-detects variable types, which you can adjust after upload. Limit: 5,000 rows. Alternatively, select a scenario above to load sample data.
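The tool's detection rule is not documented here. One plausible approach, sketched below under that assumption, is to treat a column as continuous when every non-empty value parses as a number and as categorical otherwise. The helper name `detect_types` and the sample data are hypothetical.

```python
import csv
import io

def detect_types(rows, header):
    """Label a column 'continuous' if all non-empty values parse as floats, else 'categorical'."""
    types = {}
    for name in header:
        values = [row[name].strip() for row in rows if row[name].strip() != ""]
        try:
            for v in values:
                float(v)
            # An entirely empty column carries no numeric evidence; treat it as categorical.
            types[name] = "continuous" if values else "categorical"
        except ValueError:
            types[name] = "categorical"
    return types

# Hypothetical mixed-type CSV with a header row.
sample_csv = """spend,frequency,region,tier
120.5,4,West,Enterprise
80,2,East,SMB
300,9,West,Enterprise
"""
reader = csv.DictReader(io.StringIO(sample_csv))
rows = list(reader)
column_types = detect_types(rows, reader.fieldnames)
print(column_types)
```

A rule like this would mislabel numerically coded categories (for example, store IDs or ZIP codes) as continuous, which is one reason to review the detected types after upload.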

Drag & Drop CSV file (.csv, .tsv, .txt)

Include headers with mixed continuous and categorical columns.

No file uploaded.


Preprocessing & clustering

Advanced settings

Distance weight parameter (γ)

Gamma (γ) controls the relative weight of categorical vs. continuous variables in distance calculations. Auto-mode uses the average standard deviation of continuous features (typically works well). Increase γ to give categorical variables more influence; decrease to prioritize continuous variables.

Auto: γ = (will be calculated after data load)
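The auto rule described above can be sketched as the mean of the per-feature standard deviations. Whether the tool uses the population or the sample standard deviation is not stated, so the use of `statistics.pstdev` here is an assumption, as are the function name and the example columns.

```python
import statistics

def auto_gamma(continuous_columns):
    """Average of per-column standard deviations (population SD is an assumption)."""
    sds = [statistics.pstdev(col) for col in continuous_columns]
    return sum(sds) / len(sds)

# Hypothetical unscaled features: spend varies far more than frequency.
spend = [100.0, 200.0, 300.0]
frequency = [2.0, 4.0, 6.0]

gamma = auto_gamma([spend, frequency])
print(round(gamma, 4))
```

On unscaled data the result is dominated by the widest-ranging feature (spend in this example), so the value of γ obtained this way depends on whether the continuous features were standardized first.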

Additional info & guidance

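Start with k=3–4 and run diagnostics for k=2–8. Look for an elbow in the cost plot and a comparatively high average silhouette when choosing k. Treat an average silhouette of about 0.3 as a minimum rather than a sign of strong separation: the summary metrics classify 0.25–0.5 as weak structure and values above 0.5 as reasonable. Because the tool runs k-prototypes from multiple random initializations, results are generally stable but may vary slightly between runs.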
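The diagnostic loop over k can be illustrated with a compact, pure-Python k-prototypes sketch. This is not the tool's implementation: the initialization scheme (random observations as starting prototypes), the number of restarts, the rule of keeping the lowest-cost run, and the toy data are all assumptions made for the example.

```python
import random
from collections import Counter

def distance(num, cat, q_num, q_cat, gamma):
    return (sum((a - b) ** 2 for a, b in zip(num, q_num))
            + gamma * sum(a != b for a, b in zip(cat, q_cat)))

def k_prototypes(nums, cats, k, gamma, n_init=5, max_iter=50, seed=0):
    """Return (best_cost, best_labels) over n_init random initializations."""
    rng = random.Random(seed)
    n = len(nums)
    best_cost, best_labels = float("inf"), None
    for _ in range(n_init):
        # Start from k distinct, randomly chosen observations.
        protos = [(list(nums[i]), list(cats[i])) for i in rng.sample(range(n), k)]
        labels = None
        for _ in range(max_iter):
            # Assignment step: each observation goes to its nearest prototype.
            new_labels = [
                min(range(k), key=lambda c: distance(nums[i], cats[i], *protos[c], gamma))
                for i in range(n)
            ]
            if new_labels == labels:
                break  # assignments stopped changing
            labels = new_labels
            # Update step: means for continuous features, modes for categorical ones.
            for c in range(k):
                members = [i for i in range(n) if labels[i] == c]
                if not members:
                    continue  # keep the old prototype if a cluster empties
                q_num = [sum(nums[i][j] for i in members) / len(members)
                         for j in range(len(nums[0]))]
                q_cat = [Counter(cats[i][j] for i in members).most_common(1)[0][0]
                         for j in range(len(cats[0]))]
                protos[c] = (q_num, q_cat)
        cost = sum(distance(nums[i], cats[i], *protos[labels[i]], gamma) for i in range(n))
        if cost < best_cost:
            best_cost, best_labels = cost, labels
    return best_cost, best_labels

# Toy data with two obvious groups (hypothetical values).
nums = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]]
cats = [["West"], ["West"], ["West"], ["East"], ["East"], ["East"]]

costs = {k: k_prototypes(nums, cats, k, gamma=1.0)[0] for k in range(1, 5)}
_, labels_k2 = k_prototypes(nums, cats, 2, gamma=1.0)
for k, cost in costs.items():
    print(k, round(cost, 3))
```

On this toy data the cost drops sharply from k=1 to k=2 and changes little afterwards, which is the elbow pattern to look for in the cost plot.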

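Standardization is recommended when continuous variables have very different scales (e.g., age 0–100 vs. spend $0–$10,000). Note: standardization also changes the auto-calculated gamma, because gamma is derived from the standard deviations of the scaled continuous features. With z-score standardization each of those standard deviations is 1, so the auto-calculated gamma will be close to 1. If clusters seem overly driven by one variable type, adjust gamma manually.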
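The interaction between standardization and gamma can be checked with a small sketch. It assumes that standardization means z-scoring with the population standard deviation and that gamma is computed on the scaled features; the column values are hypothetical.

```python
import statistics

def standardize(column):
    """Z-score a column: subtract the mean, divide by the population standard deviation.

    A constant column (SD of zero) would need special handling and is not covered here.
    """
    mean = statistics.fmean(column)
    sd = statistics.pstdev(column)
    return [(v - mean) / sd for v in column]

# Hypothetical features on very different scales.
age = [20.0, 40.0, 60.0]
spend = [100.0, 5000.0, 9900.0]

scaled = [standardize(age), standardize(spend)]
gamma_after = sum(statistics.pstdev(col) for col in scaled) / len(scaled)
print(round(gamma_after, 6))
```

After scaling, every continuous feature contributes on the same footing, and a single categorical mismatch costs about as much as a one-standard-deviation gap on one continuous feature.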

VISUAL OUTPUT

Parallel Coordinates Plot

How to read this chart

Each line represents one observation, colored by cluster. Continuous axes show numeric scales; categorical axes show discrete levels. Look for line bundles within the same color—these indicate observations with similar profiles across multiple variables. Diverging patterns suggest different cluster characteristics.

Elbow Plot (Total Cost vs. k)

Interpretation Aid

Total cost combines continuous squared distances and γ-weighted categorical mismatches. Look for an "elbow" where cost stops dropping significantly—this suggests a natural cluster count. Beyond the elbow, additional clusters provide diminishing value. Your selected k is highlighted.

Silhouette Plot (Average Silhouette vs. k)

Interpretation Aid

Silhouette values range from -1 to 1. Higher values indicate better-defined clusters (observations are closer to their own cluster than to others). Values near 0 suggest overlapping clusters; negative values indicate possible misassignments.
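For readers who want to see how an average silhouette is computed, here is a minimal sketch. It assumes the same mixed dissimilarity is used between pairs of observations, which this page does not confirm for the tool; the function names and toy data are hypothetical. It also assumes at least two clusters and scores singleton clusters as 0 by convention.

```python
def pair_distance(a, b, gamma):
    (a_num, a_cat), (b_num, b_cat) = a, b
    return (sum((x - y) ** 2 for x, y in zip(a_num, b_num))
            + gamma * sum(x != y for x, y in zip(a_cat, b_cat)))

def average_silhouette(points, labels, gamma):
    """Mean of (b - a) / max(a, b), where a is the mean distance to the own cluster
    and b is the smallest mean distance to any other cluster."""
    clusters = sorted(set(labels))
    scores = []
    for i, p in enumerate(points):
        def mean_dist(cluster):
            others = [q for j, q in enumerate(points) if labels[j] == cluster and j != i]
            return sum(pair_distance(p, q, gamma) for q in others) / len(others) if others else None
        a = mean_dist(labels[i])
        if a is None:
            scores.append(0.0)  # singleton cluster
            continue
        b = min(mean_dist(c) for c in clusters if c != labels[i])
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)

# Toy data: two well-separated groups (hypothetical values).
points = [
    ([1.0, 1.0], ["West"]), ([1.2, 0.8], ["West"]), ([0.9, 1.1], ["West"]),
    ([5.0, 5.0], ["East"]), ([5.2, 4.9], ["East"]), ([4.8, 5.1], ["East"]),
]
good = average_silhouette(points, [0, 0, 0, 1, 1, 1], gamma=1.0)
poor = average_silhouette(points, [0, 1, 0, 1, 0, 1], gamma=1.0)
print(round(good, 3), round(poor, 3))
```

The labeling that matches the two groups scores close to 1, while the deliberately scrambled labeling scores much lower, mirroring the interpretation given above.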

CLUSTER SUMMARY & RESULTS

Summary metrics

k (clusters) --
n (observations) --
Total cost --
Avg. silhouette --
Gamma (γ) --
What do these metrics mean?
  • Total cost: The sum of mixed distances (squared differences on continuous variables plus γ-weighted mismatches on categorical variables) from each observation to its assigned cluster prototype. Lower values indicate tighter, more cohesive clusters. Use this to compare different k values and look for an "elbow" where cost reduction slows.
  • Avg. silhouette: Measures how well observations fit their clusters. Ranges from -1 to 1. Interpretation: >0.7 = strong structure, 0.5–0.7 = reasonable, 0.25–0.5 = weak, <0.25 = little to no structure.
  • Gamma (γ): Balances the influence of continuous vs. categorical variables in the distance calculation. A mismatch on a categorical variable adds γ to the distance. When auto-calculated, it equals the average standard deviation of scaled continuous features.
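
Written out with the distance \(d(X, Q)\) defined in the overview, the total cost can be expressed as follows, where \(c(i)\) denotes the cluster assigned to observation \(i\) (notation introduced here for illustration):

$$\text{Total cost} = \sum_{i=1}^{n} d\left(X_i, Q_{c(i)}\right)$$

For instance, with \(\gamma = 2\), an observation that differs from its prototype by 3 units on one continuous feature and mismatches on one categorical feature contributes \(3^2 + 2 \cdot 1 = 11\) to the total.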

Cluster profiles

For continuous variables, values show cluster means with within-cluster standard deviation (±sd). For categorical variables, values show the mode with percentage of cluster members having that value.

Cluster    Size
Run clustering to see cluster profiles.
Interpreting cluster profiles

Each row shows a cluster's prototype. For continuous variables, the table displays the mean value across all cluster members. For categorical variables, it shows the mode (most common category). Use these profiles to create business-friendly labels—e.g., a cluster with high spend, "West" region, and "Enterprise" tier might be "High-Value West Coast Enterprise Customers."
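A profile row of this kind can be computed as in the sketch below: mean and standard deviation for each continuous variable, mode and its share for each categorical variable. The use of the sample standard deviation, the output layout, and the example data are assumptions, not the tool's documented behavior.

```python
import statistics
from collections import Counter

def cluster_profile(num_rows, cat_rows, labels, cluster):
    """Summarize one cluster: size, (mean, sd) per continuous column,
    (mode, percent of members) per categorical column."""
    members = [i for i, lab in enumerate(labels) if lab == cluster]
    profile = {"size": len(members)}
    for j in range(len(num_rows[0])):
        col = [num_rows[i][j] for i in members]
        sd = statistics.stdev(col) if len(col) > 1 else 0.0  # sample SD (assumption)
        profile[f"num_{j}"] = (statistics.fmean(col), sd)
    for j in range(len(cat_rows[0])):
        mode, count = Counter(cat_rows[i][j] for i in members).most_common(1)[0]
        profile[f"cat_{j}"] = (mode, 100.0 * count / len(members))
    return profile

# Hypothetical data: one continuous column (spend) and one categorical column (region).
num_rows = [[100.0], [200.0], [300.0], [950.0]]
cat_rows = [["West"], ["West"], ["East"], ["East"]]
labels = [0, 0, 0, 1]

profile = cluster_profile(num_rows, cat_rows, labels, cluster=0)
print(profile)
```

Reporting the share alongside the mode helps when labeling segments: a mode held by two thirds of members describes the cluster less reliably than one held by nearly all of them.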