Clustering Overview

A bank's fraud team labels thousands of suspicious transactions by hand every year. Before they can do that, someone has to decide what "suspicious" even looks like — and that someone is often a clustering algorithm running on raw, unlabeled transaction data. Clustering is how you find structure before you have answers.

Learning Objectives

Explain what clustering is and how it differs from classification
Name five practical use cases and describe what the clusters represent in each
Choose the right algorithm given data shape, size, and goal
Explain why evaluating clustering is harder than evaluating supervised models

What Clustering Is

Clustering is the task of grouping observations so that points within a group are more similar to each other than to points in other groups. No labels required. The algorithm looks at the structure of the feature space and returns an assignment.

This makes it unsupervised learning — you are not teaching the model what categories exist. You are asking it to discover whatever categories the data implies.

import pandas as pd

# Supervised: you have a target
df_supervised = pd.DataFrame({
    "spend": [200, 5000, 300, 8000],
    "churn": [0, 0, 1, 1]   # <-- label exists
})

# Unsupervised clustering: no target, just features
df_clustering = pd.DataFrame({
    "spend": [200, 5000, 300, 8000],
    "visits": [2, 15, 3, 18]
    # no churn column — you are discovering groups, not predicting one
})
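To make "returns an assignment" concrete, here is a minimal sketch that clusters the same four rows. It assumes scikit-learn is installed; the choice of two clusters and the `n_init` and `random_state` values are illustrative assumptions, not recommendations.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

features = pd.DataFrame({
    "spend": [200, 5000, 300, 8000],
    "visits": [2, 15, 3, 18],
})

# Scale first so spend does not dominate the distance calculation
scaled = StandardScaler().fit_transform(features)

# fit_predict returns one integer label per row; no target column is involved
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(scaled)

features["cluster"] = labels
print(features)
```

The two low-spend, low-visit rows should land in one group and the two high-spend rows in the other. Which group is called 0 and which is called 1 is arbitrary.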

Info

The word "cluster" comes from statistics, not machine learning. Statisticians were grouping observations long before neural networks existed. K-Means and hierarchical clustering date to the 1950s and 1960s, and DBSCAN to 1996, and they still work.

Why Clustering Matters in Practice

Customer Segmentation

Group customers by behaviour (spend, frequency, recency) to tailor marketing, pricing, or retention strategy. The RFM model (Recency, Frequency, Monetary) used across e-commerce supplies exactly this kind of feature set, and RFM segments are often built by clustering it.
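As a sketch of how RFM features might be derived before clustering, the snippet below aggregates a small order log with pandas. The table, its column names, and the snapshot date are all hypothetical and exist only for illustration.

```python
import pandas as pd

# Hypothetical order log; column names are assumptions for illustration
orders = pd.DataFrame({
    "customer_id": ["a", "a", "b", "c", "c", "c"],
    "order_date": pd.to_datetime([
        "2024-01-05", "2024-03-01", "2023-11-20",
        "2024-02-10", "2024-02-25", "2024-03-10",
    ]),
    "amount": [40.0, 60.0, 500.0, 20.0, 25.0, 30.0],
})

snapshot = pd.Timestamp("2024-03-31")  # assumed "today" for measuring recency

rfm = orders.groupby("customer_id").agg(
    recency_days=("order_date", lambda d: (snapshot - d.max()).days),
    frequency=("order_date", "count"),
    monetary=("amount", "sum"),
)
print(rfm)
```

The resulting three columns, once scaled, are the kind of input a clustering algorithm would group into segments.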

Anomaly Detection

Clusters describe "normal." Points that do not belong to any cluster, or that form tiny isolated clusters, are anomalies. DBSCAN does this natively. K-Means can do it by measuring distance from the nearest centroid.
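Both approaches can be sketched on synthetic data. The blobs, the planted outlier, and the `eps` and `min_samples` values below are assumptions chosen to make the effect visible, not tuned settings.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans

rng = np.random.default_rng(0)
# Two tight blobs plus one far-away point (synthetic, for illustration)
blob_a = rng.normal(loc=[0, 0], scale=0.3, size=(50, 2))
blob_b = rng.normal(loc=[5, 5], scale=0.3, size=(50, 2))
outlier = np.array([[10.0, -5.0]])
X = np.vstack([blob_a, blob_b, outlier])

# DBSCAN: points that belong to no dense region get the label -1
db_labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)
print("DBSCAN noise indices:", np.where(db_labels == -1)[0])

# K-Means: score each point by its distance to the nearest centroid
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
dist_to_nearest = km.transform(X).min(axis=1)
print("Largest K-Means distance at index:", dist_to_nearest.argmax())
```

With K-Means you still have to choose a distance threshold yourself, whereas DBSCAN's noise label follows directly from its density parameters.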

Data Compression and Quantisation

Image compression with K-Means: replace each pixel colour with the nearest centroid colour. 256 centroids can represent millions of colours with acceptable visual fidelity. This is called vector quantisation.
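A rough sketch of the idea, using a small random array as a stand-in for a real photo and a 16-colour palette instead of 256 to keep it fast. Both choices are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic 32x32 RGB "image" standing in for a real photo (assumption)
image = rng.integers(0, 256, size=(32, 32, 3)).astype(np.uint8)
pixels = image.reshape(-1, 3).astype(float)

n_colours = 16  # palette size; the text's example uses 256
km = KMeans(n_clusters=n_colours, n_init=10, random_state=0).fit(pixels)

# Replace every pixel with the colour of its nearest centroid
palette = km.cluster_centers_.round().astype(np.uint8)
quantised = palette[km.labels_].reshape(image.shape)

n_before = len(np.unique(pixels, axis=0))
n_after = len(np.unique(quantised.reshape(-1, 3), axis=0))
print("distinct colours before:", n_before)
print("distinct colours after: ", n_after)
```

The compressed image only needs to store the palette plus one small index per pixel.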

Pre-Labelling and Exploratory Analysis

Before training a supervised model, cluster your data to find natural groupings, then hand-label a representative sample from each cluster. This usually covers the variety in your data with fewer labels than sampling at random.

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Imagine you have 100,000 unlabeled rows to label
# Clustering lets you label 50 representative examples
# instead of all 100,000

X = pd.DataFrame({
    "spend": [120, 130, 5000, 5100, 4900, 200, 180],
    "visits": [2, 3, 20, 22, 19, 2, 3]
})

X_scaled = StandardScaler().fit_transform(X)
labels = KMeans(n_clusters=3, random_state=42, n_init="auto").fit_predict(X_scaled)

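# Pick one row per cluster for manual labelling.
# .first() takes the first row seen in each cluster, which is simple
# but not necessarily the most central point of that cluster.
X["cluster"] = labels
representative_sample = X.groupby("cluster").first()
print(representative_sample)
# Prints one row per cluster, indexed by cluster label, with the
# columns "spend" and "visits". Which label (0, 1, 2) goes with which
# group is arbitrary and can differ between scikit-learn versions.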

Document and Topic Grouping

Cluster TF-IDF vectors of news articles to find topic groups without needing category labels. It is a common first pass at discovering topics before investing in labels or a dedicated topic model.

The Three Algorithms You Need to Know

Algorithm	Core idea	Requires k?	Handles noise?	Handles non-spherical shapes?
K-Means	Assign to nearest centroid, update centroid	Yes	No	No
Hierarchical	Merge nearest clusters, build tree	No (cut later)	No	Somewhat
DBSCAN	Grow clusters from dense cores	No	Yes (labels -1)	Yes
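The "No (cut later)" entry for hierarchical clustering can be sketched with SciPy: the merge tree is built once and then cut at whatever number of clusters you want. The synthetic blobs and Ward linkage below are illustrative assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)
# Three well-separated synthetic blobs of 20 points each
X = np.vstack([
    rng.normal([0, 0], 0.3, size=(20, 2)),
    rng.normal([4, 0], 0.3, size=(20, 2)),
    rng.normal([2, 4], 0.3, size=(20, 2)),
])

Z = linkage(X, method="ward")  # full merge tree, built once

# Cut the same tree at different levels without refitting
for k in (2, 3, 4):
    labels = fcluster(Z, t=k, criterion="maxclust")
    print(k, np.bincount(labels)[1:])  # fcluster labels start at 1
```

K-Means, by contrast, has to be refitted for every candidate k.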

Tip

Use this as a decision tree: Does your data have noise or outliers you care about? -> DBSCAN. Do you need a hierarchy? -> Hierarchical. Is your data large and roughly blob-shaped? -> K-Means.
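To see the shape column of the table in action, this sketch runs all three algorithms on the two-moons toy dataset. The parameter values (`eps=0.2`, `min_samples=5`, Ward linkage) are assumptions that happen to suit this dataset. The adjusted Rand index is available only because the synthetic data comes with known groups.

```python
from sklearn.cluster import DBSCAN, AgglomerativeClustering, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaved half-moons: a classic non-spherical shape
X, true_labels = make_moons(n_samples=400, noise=0.05, random_state=0)

models = {
    "K-Means": KMeans(n_clusters=2, n_init=10, random_state=0),
    "Hierarchical (Ward)": AgglomerativeClustering(n_clusters=2, linkage="ward"),
    "DBSCAN": DBSCAN(eps=0.2, min_samples=5),
}

scores = {}
for name, model in models.items():
    labels = model.fit_predict(X)
    # Adjusted Rand index: 1.0 means perfect agreement with the known moons
    scores[name] = adjusted_rand_score(true_labels, labels)
    print(f"{name:20s} ARI = {scores[name]:.2f}")
```

On this shape DBSCAN should follow the curved moons, while K-Means tends to slice the plane with a straight boundary.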

The Evaluation Problem

In classification, you compare predictions to labels and get accuracy. In clustering, there are no labels. You cannot know if your clusters are "correct" — only whether they are:

Compact: points within a cluster are close together
Separated: clusters are far apart from each other
Stable: the same clusters appear if you re-run on different samples
Interpretable: the clusters make domain sense

This is why clustering evaluation metrics (silhouette score, Davies-Bouldin index) measure geometry, not correctness. They tell you whether your clusters are well-shaped, not whether they reflect any real-world truth.
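Both metrics are available in `sklearn.metrics`. The sketch below computes them for several values of k on synthetic blobs. The blob centres and spread are assumptions chosen so that three groups are clearly present.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score

# Three well-separated synthetic blobs
X, _ = make_blobs(
    n_samples=300, centers=[[0, 0], [5, 5], [0, 5]],
    cluster_std=0.6, random_state=0,
)

sil, dbi = {}, {}
for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    sil[k] = silhouette_score(X, labels)      # in [-1, 1], higher is better
    dbi[k] = davies_bouldin_score(X, labels)  # >= 0, lower is better
    print(f"k={k}  silhouette={sil[k]:.3f}  davies_bouldin={dbi[k]:.3f}")
```

Neither function takes true labels as an argument, which is exactly the point: they score geometry only.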

Warning

Clusters are not automatically meaningful. K-Means will always return k groups, even if your data has no natural structure whatsoever. Always profile your clusters after forming them and ask whether they tell you something actionable.
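A quick way to convince yourself: run K-Means on uniform random points, where no groups exist by construction. The data and k=4 are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(500, 2))  # uniform noise: no real groups

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
n_found = len(np.unique(km.labels_))
score = silhouette_score(X, km.labels_)

print("clusters returned:", n_found)
print("cluster sizes:", np.bincount(km.labels_))
print("silhouette:", round(score, 3))
```

K-Means still reports four tidy groups. The silhouette score will not be close to 1, but it may not look alarming either, so a metric alone is not enough to tell you the groups are meaningless.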

Algorithm Selection Guide

What does your data look like?
│
├── Large dataset (>10k rows), blob-shaped clusters, k is guessable
│   └── K-Means — fast, scalable, interpretable centroids
│
├── Small dataset (<5k rows), want to see the full merge hierarchy
│   └── Hierarchical (Agglomerative, Ward linkage)
│
├── Unknown number of clusters, non-spherical shapes, noise present
│   └── DBSCAN — finds clusters of arbitrary shape, labels outliers
│
└── High-dimensional data (images, text embeddings, >50 features)
    └── Reduce with PCA first, then K-Means or DBSCAN
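The last branch can be sketched as a scikit-learn pipeline. The digits dataset, the 90% variance threshold, and k=10 are assumptions for illustration; the right values depend on your data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, _ = load_digits(return_X_y=True)  # 1,797 images, 64 pixel features each

pipeline = make_pipeline(
    StandardScaler(),
    # Keep enough components to explain 90% of the variance
    PCA(n_components=0.90, svd_solver="full"),
    KMeans(n_clusters=10, n_init=10, random_state=0),
)
labels = pipeline.fit_predict(X)

n_kept = pipeline.named_steps["pca"].n_components_
print(f"features: {X.shape[1]} -> {n_kept} principal components")
print("cluster sizes:", np.bincount(labels))
```

Wrapping the steps in one pipeline keeps scaling, reduction, and clustering in a fixed order, so new data passes through the same transformations.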

Success

The three algorithms in this session cover most real-world clustering use cases. K-Means for speed and interpretability. Hierarchical for exploration and hierarchy. DBSCAN for irregular shapes and outlier detection. Learn all three; use the right one for the job.

Before You Cluster: A Checklist

Scale your features: without scaling, distance-based algorithms let income (range: 20,000–200,000) dominate age (range: 18–80); the two count as equally important only if you scale first
Check for outliers: one extreme value can pull a centroid far from the actual cluster centre
Decide what you are grouping by: do not blindly throw all columns in; choose features that reflect the thing you are trying to group
Have a business question ready: "find me clusters" is not a question; "find me groups of customers with distinct spending behaviour" is
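The scaling item is easy to check by hand. This sketch uses three hypothetical customers and min-max scaling based on the income and age ranges quoted in the checklist. `StandardScaler` is a common alternative and would serve the same purpose.

```python
import numpy as np

# Three hypothetical customers: [income, age]
X = np.array([
    [50_000, 25],  # A
    [50_500, 70],  # B: similar income to A, very different age
    [60_000, 26],  # C: similar age to A, different income
], dtype=float)

def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(a - b))

raw_ab, raw_ac = dist(X[0], X[1]), dist(X[0], X[2])
print(f"raw    A-B: {raw_ab:.1f}   A-C: {raw_ac:.1f}")

# Min-max scale each feature to [0, 1] using the ranges from the checklist
lo = np.array([20_000, 18])
hi = np.array([200_000, 80])
X_scaled = (X - lo) / (hi - lo)

scaled_ab, scaled_ac = dist(X_scaled[0], X_scaled[1]), dist(X_scaled[0], X_scaled[2])
print(f"scaled A-B: {scaled_ab:.3f}   A-C: {scaled_ac:.3f}")
```

On raw values A looks closest to B because the 45-year age gap barely registers next to income. After scaling, the age gap counts and A is closer to C.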

What's Next

You've covered the definition of clustering as unsupervised learning, the distinction between centroid-based, density-based, and hierarchical approaches, real-world use cases for each paradigm, and the pre-clustering checklist for scaling, outlier handling, and feature selection. Next up: 02-k-means, where you'll trace K-Means step by step by hand, use the elbow method and silhouette score to choose k, profile clusters into interpretable segments, and learn the specific failure modes where K-Means gives wrong answers with false confidence.

Optional Deep Dive

Read "Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow" by Aurélien Géron, Chapter 9 (Unsupervised Learning Techniques) — it covers K-Means, DBSCAN, and Gaussian Mixture Models with the same sklearn API used in this bootcamp, including the cluster evaluation metrics and visualisation techniques for interpreting results.
