Evaluation and Scaling for Clustering

In supervised learning, you check accuracy against held-out labels. In clustering, there are no labels to check against — and that makes evaluation genuinely hard. The metrics in this section measure geometric properties of your clusters (compactness, separation), not whether the clusters reflect any ground truth. Understanding what these metrics can and cannot tell you is as important as knowing how to compute them.

Learning Objectives

Explain why clustering evaluation is fundamentally different from supervised model evaluation
Compute and interpret the silhouette score, Davies-Bouldin index, and Calinski-Harabasz index
Always scale features before clustering and verify that scaling is appropriate
Use PCA to visualise high-dimensional clusters in 2D
Profile clusters to extract actionable business meaning

Why Scaling is Mandatory

All three major clustering algorithms — K-Means, hierarchical (with most linkages), and DBSCAN — measure distance between points. Distance is sensitive to scale.

If your features are annual_income (range: 20,000–200,000) and age (range: 18–80), the income range (180,000) is roughly 3,000 times the age range (62). Raw Euclidean distances are then dominated by income, so the algorithm will essentially ignore age and cluster almost entirely by income. That is not what you intended.

import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score

# Simulate customer data with dramatically different feature scales
np.random.seed(42)
df = pd.DataFrame({
    "annual_income": np.concatenate([
        np.random.normal(30000, 5000, 50),   # low-income group
        np.random.normal(120000, 10000, 50)  # high-income group
    ]),
    "age": np.concatenate([
        np.random.normal(25, 3, 50),         # younger group
        np.random.normal(55, 5, 50)          # older group
    ])
})

X_raw = df.values
X_scaled = StandardScaler().fit_transform(df)

# Without scaling
labels_raw = KMeans(n_clusters=2, random_state=42, n_init="auto").fit_predict(X_raw)
score_raw = silhouette_score(X_raw, labels_raw)

# With scaling
labels_scaled = KMeans(n_clusters=2, random_state=42, n_init="auto").fit_predict(X_scaled)
score_scaled = silhouette_score(X_scaled, labels_scaled)

print(f"Silhouette without scaling: {score_raw:.3f}")
# Output: Silhouette without scaling: 0.941  (high but misleading — driven by income scale)
print(f"Silhouette with scaling:    {score_scaled:.3f}")
# Output: Silhouette with scaling:    0.887  (reflects true cluster structure)

# Check how many from each group ended up in each cluster
df["cluster_raw"] = labels_raw
df["cluster_scaled"] = labels_scaled

# Unscaled: check that clusters are purely income-driven
print(df.groupby("cluster_raw")["annual_income"].mean().round(0))
# Output:
# cluster_raw
# 0     30027.0
# 1    120112.0
# (age played no role — clustering was identical to just using income alone)

Warning

A high silhouette score does not mean good clustering. A perfectly separated clustering driven entirely by one dominant feature can score near 1.0 while completely ignoring your other features. Always inspect cluster profiles after scaling to verify that clustering reflects the full feature set, not just the highest-variance column.
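To see a case where scaling changes the answer, here is a minimal sketch on hypothetical data in which the two groups differ only by age while income is unrelated noise. The `group` array, the chosen distributions, and the use of `adjusted_rand_score` against `group` are assumptions made for illustration; a real clustering problem has no such labels to compare against.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical data: two age groups; income is drawn independently of the group
group = np.repeat([0, 1], 100)
age = np.where(group == 0, rng.normal(25, 3, 200), rng.normal(55, 5, 200))
income = rng.normal(60000, 15000, 200)
X = np.column_stack([income, age])

# n_init=10 restarts K-Means several times so the result is less seed-dependent
labels_raw = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(X)
labels_scaled = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(
    StandardScaler().fit_transform(X)
)

# Adjusted Rand index: near 0 = no better than chance, 1 = groups recovered exactly
ari_raw = adjusted_rand_score(group, labels_raw)
ari_scaled = adjusted_rand_score(group, labels_scaled)
print(f"ARI without scaling: {ari_raw:.2f}")
print(f"ARI with scaling:    {ari_scaled:.2f}")
```

Without scaling, K-Means splits on income, which carries no group information here, so agreement with the age groups should be close to chance. After scaling, the age structure becomes visible to the distance computation.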

StandardScaler vs RobustScaler

StandardScaler transforms each feature to mean=0, std=1. It works well when features are roughly normally distributed. RobustScaler uses the median and IQR instead — it is less affected by outliers.

from sklearn.preprocessing import StandardScaler, RobustScaler
import pandas as pd
import numpy as np

# Data with an outlier
df = pd.DataFrame({
    "spend": [100, 110, 120, 105, 115, 9999],  # 9999 is an outlier
    "visits": [3, 4, 3, 4, 3, 2]
})

std_scaled = StandardScaler().fit_transform(df)
robust_scaled = RobustScaler().fit_transform(df)

print("StandardScaler (spend column):", std_scaled[:, 0].round(2))
# Output: StandardScaler (spend column): [-0.45 -0.45 -0.44 -0.45 -0.45  2.24]
# Outlier pulls everything else close to -0.45; little separation among normal points

print("RobustScaler (spend column):  ", robust_scaled[:, 0].round(2))
# Output: RobustScaler (spend column):   [ -1.    -0.2    0.6   -0.6    0.2  790.92]
# Normal points have better relative spread; outlier is isolated

Tip

If your data has outliers you want to keep (e.g., for DBSCAN anomaly detection), use RobustScaler, or fit the scaler on the non-outlier rows only and then use it to transform every row, outliers included. If your data is clean, StandardScaler is fine.
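The second option in the tip can be sketched as follows. The outlier rule used here (anything above Q3 + 1.5 × IQR on `spend`) is an assumption chosen for illustration; substitute whatever rule fits your data.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "spend": [100, 110, 120, 105, 115, 9999],
    "visits": [3, 4, 3, 4, 3, 2],
})

# Hypothetical outlier rule: more than 1.5 * IQR above the third quartile
q1, q3 = df["spend"].quantile([0.25, 0.75])
is_outlier = df["spend"] > q3 + 1.5 * (q3 - q1)

# Fit the scaler on the non-outlier rows only, then transform every row
scaler = StandardScaler().fit(df[~is_outlier])
scaled = scaler.transform(df)

print(scaled[:, 0].round(2))
```

The mean and standard deviation now come from the five ordinary customers, so they keep a usable spread, while the outlier is still present in the scaled data and lands very far from everything else.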

Silhouette Score

The silhouette score is the most widely used internal clustering metric. For each point i:

a(i) = mean distance to all other points in the same cluster (cohesion — lower is better)
b(i) = mean distance to all points in the nearest other cluster (separation — higher is better)
s(i) = (b(i) - a(i)) / max(a(i), b(i))

The overall score is the mean across all points. Range: -1 to 1.
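The definitions of a(i), b(i) and s(i) above can be checked by computing them by hand and comparing against `silhouette_samples`. This is a sketch for illustration: the helper name `silhouette_point` is hypothetical, and it assumes every cluster has at least two members (scikit-learn assigns 0 to points in single-member clusters, which this helper does not handle).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import pairwise_distances, silhouette_samples

X, _ = make_blobs(n_samples=60, centers=3, random_state=0)
labels = KMeans(n_clusters=3, random_state=0, n_init=10).fit_predict(X)
D = pairwise_distances(X)  # Euclidean distance between every pair of points


def silhouette_point(i):
    """Silhouette coefficient of point i, straight from the definition."""
    own = labels[i]
    same = labels == own
    same[i] = False  # a(i) excludes the point itself
    a = D[i, same].mean()
    # b(i): smallest mean distance to the members of any other cluster
    b = min(D[i, labels == c].mean() for c in np.unique(labels) if c != own)
    return (b - a) / max(a, b)


manual = np.array([silhouette_point(i) for i in range(len(X))])
reference = silhouette_samples(X, labels)
print(np.allclose(manual, reference))
```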

Score	Interpretation
Close to +1	Point is well inside its cluster and far from others
Near 0	Point is on the boundary between two clusters
Negative	Point is closer to another cluster than its own

from sklearn.metrics import silhouette_score, silhouette_samples
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
import numpy as np

X, _ = make_blobs(n_samples=300, centers=4, random_state=42, cluster_std=0.8)
X_scaled = StandardScaler().fit_transform(X)

silhouette_scores = {}

for k in range(2, 8):
    labels = KMeans(n_clusters=k, random_state=42, n_init="auto").fit_predict(X_scaled)
    score = silhouette_score(X_scaled, labels)
    silhouette_scores[k] = score
    print(f"k={k}  silhouette={score:.4f}")

# Output:
# k=2  silhouette=0.5562
# k=3  silhouette=0.6498
# k=4  silhouette=0.7168  <- best
# k=5  silhouette=0.6412
# k=6  silhouette=0.5983
# k=7  silhouette=0.5721

best_k = max(silhouette_scores, key=silhouette_scores.get)
print(f"\nBest k: {best_k}")  # Output: Best k: 4

Per-Point Silhouette Analysis

The overall score hides a lot of information. Looking at per-point scores reveals which clusters are problematic.

from sklearn.metrics import silhouette_samples, silhouette_score
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
import numpy as np

X, _ = make_blobs(n_samples=200, centers=3, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

k = 3
labels = KMeans(n_clusters=k, random_state=42, n_init="auto").fit_predict(X_scaled)
sample_scores = silhouette_samples(X_scaled, labels)

fig, ax = plt.subplots(figsize=(8, 5))
y_lower = 0
colors = ["steelblue", "tomato", "seagreen"]

for i in range(k):
    cluster_scores = np.sort(sample_scores[labels == i])
    size = len(cluster_scores)
    y_upper = y_lower + size
    ax.barh(range(y_lower, y_upper), cluster_scores, color=colors[i], alpha=0.7,
            label=f"Cluster {i} (n={size})")
    y_lower = y_upper + 5

avg = silhouette_score(X_scaled, labels)
ax.axvline(x=avg, color="black", linestyle="--", label=f"Mean = {avg:.3f}")
ax.set_xlabel("Silhouette coefficient")
ax.set_title("Per-Point Silhouette: short or negative bars signal poor assignments")
ax.legend()
plt.tight_layout()
plt.savefig("silhouette_plot.png", dpi=150)
plt.show()

Info

Per-point silhouette plots are often called "silhouette diagrams." In practice, if many points in a cluster have scores below zero, that cluster is probably an artefact — it contains points that genuinely belong elsewhere.
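The same check can be done numerically rather than by eye. The sketch below summarises per-point scores by cluster; the column names `mean_score` and `share_negative` are made up for this example.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=200, centers=3, random_state=42)
X_scaled = StandardScaler().fit_transform(X)
labels = KMeans(n_clusters=3, random_state=42, n_init=10).fit_predict(X_scaled)

scores = pd.DataFrame({
    "cluster": labels,
    "silhouette": silhouette_samples(X_scaled, labels),
})

# Per cluster: size, mean score, and share of points with a negative score
summary = scores.groupby("cluster")["silhouette"].agg(
    size="count",
    mean_score="mean",
    share_negative=lambda s: (s < 0).mean(),
)
print(summary.round(3))
```

A cluster with a large `share_negative` is a candidate for the kind of artefact described above.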

Davies-Bouldin Index

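The Davies-Bouldin index compares within-cluster scatter to between-cluster separation. For each cluster it takes the worst (largest) ratio of combined scatter to centroid distance against any other cluster, then averages those worst-case ratios over all clusters. Lower is better (the opposite of silhouette), and 0 is the minimum.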
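As a sketch of what the index computes, the snippet below rebuilds it by hand and compares the result with `davies_bouldin_score`. Scatter is taken as the mean Euclidean distance from a cluster's points to its centroid, and each cluster contributes its worst (largest) ratio against any other cluster. The variable names are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
k = 4
labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X)

# Centroid and scatter (mean distance to centroid) for each cluster
centroids = np.array([X[labels == c].mean(axis=0) for c in range(k)])
scatter = np.array([
    np.linalg.norm(X[labels == c] - centroids[c], axis=1).mean()
    for c in range(k)
])

# For each cluster, the worst ratio (s_i + s_j) / d(c_i, c_j) over all j != i
worst_ratios = []
for i in range(k):
    ratios = [
        (scatter[i] + scatter[j]) / np.linalg.norm(centroids[i] - centroids[j])
        for j in range(k) if j != i
    ]
    worst_ratios.append(max(ratios))

manual = float(np.mean(worst_ratios))
print(np.isclose(manual, davies_bouldin_score(X, labels)))
```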

from sklearn.metrics import davies_bouldin_score
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

for k in range(2, 8):
    labels = KMeans(n_clusters=k, random_state=42, n_init="auto").fit_predict(X_scaled)
    db_score = davies_bouldin_score(X_scaled, labels)
    print(f"k={k}  Davies-Bouldin={db_score:.4f}")

# Output:
# k=2  Davies-Bouldin=0.6871
# k=3  Davies-Bouldin=0.5120
# k=4  Davies-Bouldin=0.3841  <- lowest (best)
# k=5  Davies-Bouldin=0.4903
# k=6  Davies-Bouldin=0.5612
# k=7  Davies-Bouldin=0.6004

Calinski-Harabasz Index

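The Calinski-Harabasz index (also called the variance ratio criterion) is the ratio of between-cluster dispersion to within-cluster dispersion, each divided by its degrees of freedom. Higher is better. It tends to favour compact, well-separated clusters and is faster to compute than silhouette.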
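A hand computation makes the ratio concrete. In this sketch, `between` is the size-weighted squared distance of each centroid from the overall mean, `within` is the total squared distance of points from their own centroid, and the two are normalised by k − 1 and n − k respectively. The variable names are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
k = 4
labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X)

n = len(X)
overall_mean = X.mean(axis=0)

between = 0.0  # between-cluster dispersion
within = 0.0   # within-cluster dispersion
for c in range(k):
    members = X[labels == c]
    centroid = members.mean(axis=0)
    between += len(members) * np.sum((centroid - overall_mean) ** 2)
    within += np.sum((members - centroid) ** 2)

manual = (between / (k - 1)) / (within / (n - k))
print(np.isclose(manual, calinski_harabasz_score(X, labels)))
```

Only centroids and sums of squares are needed, with no pairwise distances, which is why this metric scales better than silhouette.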

from sklearn.metrics import calinski_harabasz_score
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

for k in range(2, 8):
    labels = KMeans(n_clusters=k, random_state=42, n_init="auto").fit_predict(X_scaled)
    ch_score = calinski_harabasz_score(X_scaled, labels)
    print(f"k={k}  Calinski-Harabasz={ch_score:.1f}")

# Output (the k=7 line is not shown):
# k=2  Calinski-Harabasz=753.2
# k=3  Calinski-Harabasz=1089.4
# k=4  Calinski-Harabasz=1843.7  <- highest (best)
# k=5  Calinski-Harabasz=1521.3
# k=6  Calinski-Harabasz=1308.8

Metric Comparison

Metric	Direction	Strength	Weakness
Silhouette	Higher is better	Interpretable (-1 to 1), per-point analysis possible	Slow on large datasets (O(n²))
Davies-Bouldin	Lower is better	Fast, sensitive to cluster quality	Less intuitive
Calinski-Harabasz	Higher is better	Very fast, good for large datasets	Biased toward convex clusters

Tip

Use silhouette for careful k selection on moderate-sized data. Use Calinski-Harabasz for a quick check when you have millions of rows. Use Davies-Bouldin as a secondary confirmation. If two metrics agree on the best k, you can be more confident.
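One way to check whether the metrics agree is to compute all three for each candidate k and put them in one table. This is a minimal sketch on synthetic data; the `results` and `best` names are illustrative.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

rows = []
for k in range(2, 8):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X_scaled)
    rows.append({
        "k": k,
        "silhouette": silhouette_score(X_scaled, labels),                # higher is better
        "davies_bouldin": davies_bouldin_score(X_scaled, labels),        # lower is better
        "calinski_harabasz": calinski_harabasz_score(X_scaled, labels),  # higher is better
    })

results = pd.DataFrame(rows).set_index("k")
print(results.round(3))

# The k each metric prefers; note the opposite direction for Davies-Bouldin
best = {
    "silhouette": results["silhouette"].idxmax(),
    "davies_bouldin": results["davies_bouldin"].idxmin(),
    "calinski_harabasz": results["calinski_harabasz"].idxmax(),
}
print(best)
```

If the three entries in `best` disagree, treat that as a prompt to inspect the cluster profiles for each candidate k rather than picking one metric's answer.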

PCA for Visualising High-Dimensional Clusters

If you have more than 2 or 3 features, you cannot plot clusters directly. PCA lets you compress to 2D while preserving as much variance as possible.

import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

# Simulate high-dimensional data (10 features)
X, true_labels = make_blobs(
    n_samples=300, centers=4, n_features=10, random_state=42
)
X_scaled = StandardScaler().fit_transform(X)

# Cluster in the full high-dimensional space
labels = KMeans(n_clusters=4, random_state=42, n_init="auto").fit_predict(X_scaled)

# Project to 2D for visualisation ONLY
pca = PCA(n_components=2, random_state=42)
X_2d = pca.fit_transform(X_scaled)

print(f"Variance explained by 2 PCs: {pca.explained_variance_ratio_.sum():.1%}")
# Output: Variance explained by 2 PCs: 71.3%  (varies with data)

plt.figure(figsize=(8, 6))
scatter = plt.scatter(X_2d[:, 0], X_2d[:, 1], c=labels, cmap="tab10", alpha=0.7)
plt.colorbar(scatter, label="Cluster")
plt.xlabel(f"PC1 ({pca.explained_variance_ratio_[0]:.1%} variance)")
plt.ylabel(f"PC2 ({pca.explained_variance_ratio_[1]:.1%} variance)")
plt.title("Clusters visualised in PCA space (clustering was done in 10D)")
plt.tight_layout()
plt.savefig("pca_cluster_plot.png", dpi=150)
plt.show()

Warning

PCA projection is for visualisation only. Always run clustering on the full scaled feature space, not the PCA-reduced one (unless you have a specific reason, like very high dimensionality). The 2D projection loses information — two points that look merged in PCA space might be well-separated in the original 10D space.
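To gauge how much the projection changes the result on a given dataset, you can cluster in both spaces and measure how well the two label sets agree. The sketch below uses the adjusted Rand index for that comparison, which is an illustrative choice and not something the metrics above require.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=300, centers=4, n_features=10, random_state=42)
X_scaled = StandardScaler().fit_transform(X)
X_2d = PCA(n_components=2, random_state=42).fit_transform(X_scaled)

# Same algorithm and settings, once in the full 10D space and once in 2D
labels_full = KMeans(n_clusters=4, random_state=42, n_init=10).fit_predict(X_scaled)
labels_2d = KMeans(n_clusters=4, random_state=42, n_init=10).fit_predict(X_2d)

# 1.0 means the two clusterings are identical up to relabelling
agreement = adjusted_rand_score(labels_full, labels_2d)
print(f"ARI between full-space and 2D clusterings: {agreement:.3f}")
```

On cleanly separated synthetic blobs the agreement may well be high; on real data with many informative dimensions, a low value indicates that the 2D picture is hiding structure the full-space clustering relies on.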

Cluster Profiling: Making Clusters Actionable

Metrics tell you if clusters are well-shaped. Profiling tells you what the clusters mean.

import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

np.random.seed(42)
n = 300

df = pd.DataFrame({
    "monthly_spend": np.concatenate([
        np.random.normal(80, 20, 100),     # casual shoppers
        np.random.normal(500, 80, 100),    # regular buyers
        np.random.normal(2000, 300, 100)   # high-value customers
    ]),
    "purchase_frequency": np.concatenate([
        np.random.normal(1.5, 0.5, 100),
        np.random.normal(6, 1.5, 100),
        np.random.normal(15, 3, 100)
    ]),
    "tenure_months": np.concatenate([
        np.random.normal(3, 1, 100),
        np.random.normal(18, 4, 100),
        np.random.normal(48, 8, 100)
    ])
}).clip(lower=0)

X = df.copy()
X_scaled = StandardScaler().fit_transform(X)

df["cluster"] = KMeans(n_clusters=3, random_state=42, n_init="auto").fit_predict(X_scaled)

profile = df.groupby("cluster").agg(
    avg_spend=("monthly_spend", "mean"),
    avg_frequency=("purchase_frequency", "mean"),
    avg_tenure=("tenure_months", "mean"),
    size=("monthly_spend", "count")
).round(1)

print(profile)
# Output:
#          avg_spend  avg_frequency  avg_tenure  size
# cluster
# 0             80.1            1.5         3.1   100
# 1            499.2            5.9        17.9   100
# 2           2003.4           14.8        48.2   100

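# Give clusters business names. K-Means label numbers are arbitrary, so order
# the profile by spend instead of assuming cluster 0 is the lowest-spend group.
profile = profile.sort_values("avg_spend")
profile["segment_name"] = ["Casual", "Regular", "VIP"]
print(profile[["segment_name", "avg_spend", "avg_frequency", "avg_tenure", "size"]])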

The Limits of Internal Evaluation

Internal metrics (silhouette, Davies-Bouldin, Calinski-Harabasz) measure geometric properties. They cannot answer these questions:

Are these clusters stable over time?
Do they reflect real behaviour differences, or just scale artefacts?
Can the business act on them?
Are they the same groups that appear if you use different features?

# Business validation checklist (not code — this is a process)
validation_questions = [
    "Stability: Run on a different sample of the data. Do you get similar groups?",
    "Sensitivity: Re-run with different random_state. Are cluster profiles consistent?",
    "Interpretability: Can you give each cluster a name a non-analyst would understand?",
    "Actionability: Can you design different strategies for each group?",
    "Domain validity: Do these groups match what domain experts expect to see?",
    "Outlier check: Are there any clusters with very few members? That might be noise, not a real group."
]

for q in validation_questions:
    print(f"[ ] {q}")
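The stability and sensitivity items can be partly automated. The sketch below re-runs K-Means with different seeds and on random subsamples, then compares each result with a reference clustering using the adjusted Rand index. The subsample size, the number of repeats, and the use of ARI as the agreement measure are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

reference = KMeans(n_clusters=3, random_state=0, n_init=10).fit(X_scaled)

# Sensitivity: same data, different random seeds
seed_ari = [
    adjusted_rand_score(
        reference.labels_,
        KMeans(n_clusters=3, random_state=seed, n_init=10).fit_predict(X_scaled),
    )
    for seed in range(1, 6)
]

# Stability: fit on a random subsample, assign all points, compare to reference
rng = np.random.default_rng(0)
sample_ari = []
for _ in range(5):
    idx = rng.choice(len(X_scaled), size=200, replace=False)
    model = KMeans(n_clusters=3, random_state=0, n_init=10).fit(X_scaled[idx])
    sample_ari.append(adjusted_rand_score(reference.labels_, model.predict(X_scaled)))

print("ARI across seeds:     ", np.round(seed_ari, 3))
print("ARI across subsamples:", np.round(sample_ari, 3))
```

Values close to 1 suggest the grouping is reproducible. The remaining checklist items (interpretability, actionability, domain validity) still need people, not code.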

Success

The true measure of a good clustering is not the silhouette score — it is whether your organisation can act on the groups you found. A clustering with silhouette 0.55 that gives the marketing team three actionable customer segments is more valuable than one with silhouette 0.82 that no one can interpret.
