K-Means Clustering

K-Means is the algorithm that ships in every data scientist's first segmentation report. It is fast, interpretable, and scales to millions of rows. It is also the algorithm most likely to give you confident-looking results that are quietly wrong — because it assumes your clusters are round, similar in size, and free of outliers. Understanding those assumptions is what separates good K-Means usage from bad.

Learning Objectives

Explain the K-Means algorithm step by step from first principles
Describe what k-means++ initialisation does and why it matters
Use the elbow method correctly — and know when to distrust it
Use silhouette score as a more reliable criterion for k selection
Scale features before clustering and understand why it is mandatory
Profile clusters to extract business meaning from cluster labels

How K-Means Works

The algorithm is elegant. After an initialisation step, it alternates between assigning points and updating centroids until nothing changes:

1. Initialise: Place k centroids in the feature space (randomly, or via k-means++)
2. Assign: Assign each point to the nearest centroid (by Euclidean distance)
3. Update: Move each centroid to the mean of its assigned points
4. Repeat: Go back to step 2 until assignments stop changing

import numpy as np
import matplotlib.pyplot as plt

# Illustrative manual implementation (don't use in production — use sklearn)
def kmeans_manual(X, k, max_iter=100, random_state=42):
    rng = np.random.default_rng(random_state)
    # Step 1: random initialisation
    centroids = X[rng.choice(len(X), k, replace=False)]

    for _ in range(max_iter):
        # Step 2: assign each point to nearest centroid
        distances = np.linalg.norm(X[:, None] - centroids[None, :], axis=2)
        labels = np.argmin(distances, axis=1)

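        # Step 3: update centroids (keep the old centroid if a cluster ends up empty,
        # otherwise the mean of zero points would be NaN)
        new_centroids = np.array([
            X[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
            for i in range(k)
        ])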

        # Step 4: check convergence
        if np.allclose(centroids, new_centroids):
            break
        centroids = new_centroids

    return labels, centroids

# Generate synthetic blobs
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
labels, centroids = kmeans_manual(X, k=3)
print(f"Unique labels: {np.unique(labels)}")
# Output: Unique labels: [0 1 2]

Info

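K-Means is guaranteed to converge: assignments will eventually stop changing. It is not guaranteed to find the global optimum, and different random initialisations can lead to different final clusters. This is why sklearn provides the n_init parameter, which reruns the algorithm from several initialisations and keeps the run with the lowest inertia. Check the default for your version: since scikit-learn 1.4, n_init="auto" means a single run with k-means++ initialisation and 10 runs with init="random".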
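The effect of restarts can be seen with a small experiment. The sketch below is illustrative only: the dataset, the number of clusters, and the seeds are assumptions chosen for the demonstration, and the printed values will depend on your scikit-learn version.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data, chosen only for illustration
X_demo, _ = make_blobs(n_samples=400, centers=6, cluster_std=1.5, random_state=0)

# One run per seed, with plain random initialisation
single_run_inertias = []
for seed in range(10):
    km = KMeans(n_clusters=6, init="random", n_init=1, random_state=seed)
    km.fit(X_demo)
    single_run_inertias.append(km.inertia_)

# Ten restarts in one call: sklearn keeps the run with the lowest inertia
multi = KMeans(n_clusters=6, init="random", n_init=10, random_state=0).fit(X_demo)

print(f"Single runs: min={min(single_run_inertias):.1f}, max={max(single_run_inertias):.1f}")
print(f"Best of 10 restarts: {multi.inertia_:.1f}")
```

If the minimum and maximum of the single runs differ, some of those runs stopped in a local optimum. That gap is what restarts are meant to protect against.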

K-Means++ Initialisation

Random initialisation can produce bad starting centroids — two centroids placed in the same cluster, for example. K-Means++ fixes this by spreading starting centroids out:

Pick the first centroid randomly from the data
For each remaining centroid: pick the next point with probability proportional to its squared distance from the nearest existing centroid

This makes the starting centroids likely to be far apart, which typically speeds up convergence and improves the final result. Sklearn uses k-means++ by default.
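The two seeding steps can be sketched in a few lines of NumPy. This is a minimal illustration, not sklearn's code: kmeans_pp_init is a hypothetical helper name, the toy data is made up, and sklearn's own implementation differs in detail (it evaluates several candidates at each step).

```python
import numpy as np

def kmeans_pp_init(X, k, random_state=0):
    """Hypothetical helper: pick k starting centroids with k-means++ seeding."""
    rng = np.random.default_rng(random_state)
    n = len(X)
    # First centroid: chosen uniformly at random from the data
    centroids = [X[rng.integers(n)]]
    # Squared distance from each point to its nearest chosen centroid so far
    closest_sq = np.sum((X - centroids[0]) ** 2, axis=1)
    for _ in range(1, k):
        # Sample the next centroid with probability proportional to squared distance
        # (assumes the points are not all identical, so the sum is non-zero)
        probs = closest_sq / closest_sq.sum()
        idx = rng.choice(n, p=probs)
        centroids.append(X[idx])
        # Refresh the nearest-centroid distances with the new centroid included
        new_sq = np.sum((X - X[idx]) ** 2, axis=1)
        closest_sq = np.minimum(closest_sq, new_sq)
    return np.array(centroids)

# Toy data: three tight groups (values are assumptions for the demo)
rng = np.random.default_rng(1)
X_demo = np.vstack([
    rng.normal(loc=c, scale=0.3, size=(50, 2)) for c in ([0, 0], [5, 5], [0, 5])
])
init = kmeans_pp_init(X_demo, k=3)
print(init.shape)  # (3, 2)
```

A point that has already been chosen has squared distance zero, so it has zero probability of being picked again.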

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_blobs
import pandas as pd

# Generate a realistic customer-like dataset
X, true_labels = make_blobs(
    n_samples=500,
    centers=4,
    cluster_std=1.2,
    random_state=42
)

X_scaled = StandardScaler().fit_transform(X)

# init="k-means++" is the default — shown explicitly here for clarity
model = KMeans(n_clusters=4, init="k-means++", n_init=10, random_state=42)
labels = model.fit_predict(X_scaled)

print(f"Inertia: {model.inertia_:.2f}")
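# Prints the inertia: the sum of squared distances from each point to its nearest centroid.
# The value depends on the data. Here the standardised 500 x 2 matrix has a total sum of
# squares of 1000, so four well-separated clusters should give a figure far below that.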

print(f"Iterations to converge: {model.n_iter_}")
# Output: Iterations to converge: 5

Tip

Always set random_state when running K-Means. Without it, results change between runs, making experiments non-reproducible. Set n_init=10 or higher for more robust results on noisy data.

A Complete Customer Segmentation Example

This is the workflow you will use in practice:

import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# Realistic customer behaviour data
customers = pd.DataFrame({
    "customer_id": range(1, 11),
    "monthly_spend": [120, 115, 130, 4500, 4200, 4800, 2100, 2300, 2050, 6800],
    "visits_per_month": [2, 3, 2, 18, 20, 17, 9, 10, 8, 1],
    "tenure_months": [3, 4, 3, 36, 40, 32, 18, 22, 16, 2]
})

X = customers[["monthly_spend", "visits_per_month", "tenure_months"]]

# Pipeline ensures scaling happens inside the model, not outside
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", KMeans(n_clusters=3, random_state=42, n_init="auto"))
])

customers["cluster"] = pipe.fit_predict(X)

# Profile the clusters — this is the actual output you care about
profile = customers.groupby("cluster").agg(
    avg_spend=("monthly_spend", "mean"),
    avg_visits=("visits_per_month", "mean"),
    avg_tenure=("tenure_months", "mean"),
    count=("customer_id", "count")
).round(1)

print(profile)
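# The exact table depends on your scikit-learn version and seed, but the
# counts must sum to 10. Customer 10 (spend 6800, 1 visit, 2 months) is an
# outlier: it can claim a cluster of its own and push two of the three
# obvious tiers (low, mid, high) into a single cluster. Inspect the profile
# before naming segments.
# (cluster numbers may vary by run; labels are arbitrary)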

Warning

Cluster labels (0, 1, 2) are arbitrary integers. Cluster 0 is not "the first" or "the most important." If you re-run with a different random_state, label 0 might now contain what was previously label 1. Always refer to clusters by their characteristics, not their number.
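One way to follow this advice is to derive segment names from a cluster statistic instead of the integer label. The sketch below assumes a small, made-up DataFrame that has already been clustered. The column names mirror the example above, and the segment names are placeholders.

```python
import pandas as pd

# Hypothetical clustered output; the integer labels are arbitrary
df = pd.DataFrame({
    "monthly_spend": [120, 130, 4500, 4800, 2100, 2300],
    "cluster": [2, 2, 0, 0, 1, 1],
})

# Rank clusters by mean spend, lowest first
order = df.groupby("cluster")["monthly_spend"].mean().sort_values().index

# Names are assumptions for this example; choose ones that fit your domain
names = ["low_spend", "mid_spend", "high_spend"]
label_to_name = dict(zip(order, names))

df["segment"] = df["cluster"].map(label_to_name)
print(df)
```

Because the mapping is rebuilt from the data on every run, the segment names stay stable even when the integer labels are permuted.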

Choosing K: The Elbow Method

Inertia is the sum of squared distances from each point to its nearest centroid. As k increases, inertia always decreases — with k=n (one cluster per point), inertia is zero. You are looking for the point where adding more clusters stops meaningfully reducing inertia.
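The definition of inertia can be checked by hand against the fitted model. This is a small sketch on synthetic data (the dataset and parameters are assumptions chosen for the check).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X_demo, _ = make_blobs(n_samples=200, centers=3, random_state=0)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_demo)

# Look up the centroid assigned to each point, then sum the squared distances
assigned_centroids = km.cluster_centers_[km.labels_]
manual_inertia = np.sum((X_demo - assigned_centroids) ** 2)

print(np.isclose(manual_inertia, km.inertia_, rtol=1e-4))
```

The manual sum and model.inertia_ should agree up to floating-point error.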

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

inertias = []
k_range = range(1, 11)

for k in k_range:
    model = KMeans(n_clusters=k, random_state=42, n_init="auto")
    model.fit(X_scaled)
    inertias.append(model.inertia_)

plt.figure(figsize=(8, 4))
plt.plot(list(k_range), inertias, marker="o")
plt.xlabel("Number of clusters (k)")
plt.ylabel("Inertia")
plt.title("Elbow Method — look for the bend")
plt.xticks(list(k_range))
plt.tight_layout()
plt.savefig("elbow_plot.png", dpi=150)
plt.show()

# The "elbow" is usually around k=4 for this data
# (corresponds to the true number of clusters used in make_blobs)

Warning

The elbow method is a heuristic, not a rule. Many real datasets produce a smooth curve with no clear elbow. When that happens, the elbow method tells you nothing useful. Do not stare at a smooth curve hoping an elbow will appear — move on to the silhouette score.

Choosing K: Silhouette Score (More Reliable)

The silhouette score measures how well each point fits its cluster compared to the next closest cluster. For each point:

a = mean distance to all other points in the same cluster (cohesion)
b = mean distance to all points in the nearest other cluster (separation)
s = (b - a) / max(a, b)

Score ranges from -1 to 1:

Near +1: point is well-matched to its cluster and far from others
Near 0: point is on the boundary between two clusters
Near -1: point is probably in the wrong cluster
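To make the formula concrete, here is a minimal sketch that scores one point by hand and compares the result with sklearn's silhouette_samples. The five-point dataset and its labels are invented for the example.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Tiny hand-made dataset: two groups on a line (values are illustrative)
X_demo = np.array([[0.0], [1.0], [2.0], [10.0], [11.0]])
labels_demo = np.array([0, 0, 0, 1, 1])

i = 0  # score the first point
dists = np.abs(X_demo - X_demo[i]).ravel()  # distance from point i to every point

same = labels_demo == labels_demo[i]
same[i] = False                      # exclude the point itself
a = dists[same].mean()               # cohesion: (1 + 2) / 2 = 1.5
b = dists[labels_demo == 1].mean()   # separation: (10 + 11) / 2 = 10.5
s = (b - a) / max(a, b)              # (10.5 - 1.5) / 10.5

print(round(s, 3))  # 0.857
print(np.isclose(s, silhouette_samples(X_demo, labels_demo)[i]))  # True
```

With only two clusters, the "nearest other cluster" is simply the other one. With more clusters, b is the smallest of the mean distances to each other cluster.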

from sklearn.metrics import silhouette_score
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

X, _ = make_blobs(n_samples=500, centers=4, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

silhouette_scores = []
k_range = range(2, 11)  # silhouette undefined for k=1

for k in k_range:
    model = KMeans(n_clusters=k, random_state=42, n_init="auto")
    labels = model.fit_predict(X_scaled)
    score = silhouette_score(X_scaled, labels)
    silhouette_scores.append(score)
    print(f"k={k}  silhouette={score:.3f}")

# Output:
# k=2  silhouette=0.578
# k=3  silhouette=0.634
# k=4  silhouette=0.721  <- highest, matches true k
# k=5  silhouette=0.643
# k=6  silhouette=0.598
# ...

best_k = list(k_range)[silhouette_scores.index(max(silhouette_scores))]
print(f"\nBest k by silhouette score: {best_k}")
# Output: Best k by silhouette score: 4

Tip

Use both the elbow method and silhouette score together. If they agree, you have confidence. If they disagree, inspect the cluster profiles for each candidate k and ask which segmentation is more actionable for your use case.

Assumptions and Failure Modes

K-Means makes assumptions that are often violated in real data:

Assumption 1: Clusters are spherical (isotropic). K-Means assigns each point to the nearest centroid by Euclidean distance, so every boundary between two clusters is a straight line (a hyperplane) halfway between their centroids. This works when clusters are roughly circular. It fails when clusters are elongated, crescent-shaped, or nested.

Assumption 2: Clusters are similar in size. The boundary between two centroids sits halfway between them regardless of how many points each cluster holds. If one true cluster has 1,000 points and another has 10, K-Means can often lower inertia more by splitting the large cluster than by isolating the small one, so the small cluster tends to be absorbed into a neighbour.

Assumption 3: No outliers. One extreme outlier can pull a centroid far from the genuine cluster center, corrupting the entire segment.
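Assumption 3 comes down to plain arithmetic on the mean. The sketch below uses made-up one-dimensional spend values. The median is shown only for contrast and is not part of K-Means.

```python
import numpy as np

# Hypothetical 1-D cluster of monthly spend values
cluster = np.array([100.0, 110.0, 90.0, 105.0, 95.0])
print(cluster.mean())  # 100.0

# The same cluster with one extreme value added
with_outlier = np.append(cluster, 5000.0)
print(round(with_outlier.mean(), 1))  # 916.7
print(np.median(with_outlier))        # 102.5
```

A single value moved the centroid from 100 to roughly 917, a position where none of the genuine members sit. K-Means has no safeguard against this because the update step always uses the mean.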

from sklearn.datasets import make_moons
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import numpy as np

# make_moons produces crescent-shaped clusters — not spherical
X_moons, true_labels = make_moons(n_samples=300, noise=0.05, random_state=42)
X_scaled = StandardScaler().fit_transform(X_moons)

kmeans = KMeans(n_clusters=2, random_state=42, n_init="auto")
predicted = kmeans.fit_predict(X_scaled)

# Check how badly K-Means fails here
from sklearn.metrics import adjusted_rand_score
ari = adjusted_rand_score(true_labels, predicted)
print(f"Adjusted Rand Index: {ari:.3f}")
# Output: Adjusted Rand Index: 0.402  <- poor, true structure is ignored
# (DBSCAN with a suitable eps would score near 1.0 on this same data)
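The DBSCAN comparison mentioned in the comment can be run directly. In this sketch the eps and min_samples values are assumptions picked for this particular dataset, not general recommendations, and the exact scores may vary with your scikit-learn version.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

X_m, y_m = make_moons(n_samples=300, noise=0.05, random_state=42)
X_m_scaled = StandardScaler().fit_transform(X_m)

kmeans_labels = KMeans(n_clusters=2, random_state=42, n_init=10).fit_predict(X_m_scaled)

# eps and min_samples are assumptions tuned for this dataset
dbscan_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X_m_scaled)

ari_kmeans = adjusted_rand_score(y_m, kmeans_labels)
ari_dbscan = adjusted_rand_score(y_m, dbscan_labels)
print(f"K-Means ARI: {ari_kmeans:.3f}")
print(f"DBSCAN ARI:  {ari_dbscan:.3f}")
```

DBSCAN groups points by density instead of distance to a centroid, so it can follow each crescent where K-Means cuts straight across both.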

Warning

K-Means will give you confident-looking, colourful cluster plots even when it is completely wrong. The elbow curve will bend. The centroids will be neatly placed. But if your clusters are non-spherical, K-Means is drawing incorrect boundaries with misplaced confidence. Always visualise the clusters and compare against DBSCAN before concluding.

Scaling is Mandatory

K-Means uses Euclidean distance. Features with large ranges dominate features with small ranges unless you scale.

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

data = pd.DataFrame({
    "annual_income": [25000, 80000, 90000, 30000, 75000],  # range: tens of thousands
    "satisfaction_score": [3, 8, 9, 4, 7]                  # range: 1-10
})

# Without scaling: annual_income completely dominates
X_raw = data.values
labels_raw = KMeans(n_clusters=2, random_state=42, n_init="auto").fit_predict(X_raw)
print("Labels without scaling:", labels_raw)
# Output: Labels without scaling: [0 1 1 0 1]
# (income drives everything; satisfaction ignored)

# With scaling: both features contribute equally
X_scaled = StandardScaler().fit_transform(data)
labels_scaled = KMeans(n_clusters=2, random_state=42, n_init="auto").fit_predict(X_scaled)
print("Labels with scaling:", labels_scaled)
# Output: Labels with scaling: [0 1 1 0 1]
# (may look same here, but on real data the difference is significant)

# Compare the raw spreads (pandas .std() uses the sample formula, ddof=1;
# StandardScaler divides by the population std, ddof=0)
print(f"\nIncome std: {data['annual_income'].std():.1f}")
print(f"Score std:  {data['satisfaction_score'].std():.1f}")
# Output:
# Income std: 30207.6
# Score std:  2.6
# <- income's spread is roughly 12,000x larger than score's, so unscaled
#    distances are driven almost entirely by income
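The domination can also be measured directly. This sketch reuses the five rows from the example above and computes how much of the squared Euclidean distance between the first two customers comes from the income column, before and after scaling. The "share" measure is an illustration device introduced here and is not an sklearn feature.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Columns: annual_income, satisfaction_score (same values as the example above)
X_raw = np.array(
    [[25000, 3], [80000, 8], [90000, 9], [30000, 4], [75000, 7]], dtype=float
)

# Share of the squared distance between customers 0 and 1 that comes from income
diff_raw = X_raw[0] - X_raw[1]
income_share_raw = diff_raw[0] ** 2 / np.sum(diff_raw ** 2)

X_std = StandardScaler().fit_transform(X_raw)
diff_std = X_std[0] - X_std[1]
income_share_std = diff_std[0] ** 2 / np.sum(diff_std ** 2)

print(f"Income share of squared distance, raw:    {income_share_raw:.8f}")
print(f"Income share of squared distance, scaled: {income_share_std:.3f}")
```

On the raw data the income column accounts for essentially all of the distance. After scaling, the two features contribute in roughly equal measure.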

Key Takeaways

Success

K-Means is fast, interpretable, and excellent for well-separated, roughly spherical clusters. Always scale first. Use k-means++ initialisation. Choose k with silhouette score, not just the elbow method. Profile your clusters — the numeric labels are meaningless until you describe what each group represents in your domain.

What's Next

You've covered the K-Means assign-and-update algorithm, k-means++ initialisation, the elbow method and silhouette score for choosing k, cluster profiling for business interpretation, the spherical-cluster assumption and its failure modes, and why feature scaling must come first for any distance-based algorithm. Next up: 03-hierarchical-clustering, where you'll learn agglomerative clustering, interpret the dendrogram to choose the number of clusters without specifying k in advance, and understand when the hierarchical approach's O(n²) cost is worth paying.

Optional Deep Dive

Read the original Lloyd (1957/1982) K-Means paper "Least Squares Quantization in PCM" (IEEE Transactions on Information Theory). It is short and derives the two conditions that the assign-and-update iteration enforces: each centroid is the mean of its points, and each point belongs to its nearest centroid. These conditions are necessary but not sufficient for minimising total within-cluster variance, which explains both why the algorithm works and why it gets stuck in local minima.
