Supervised vs Unsupervised Learning

Every ML problem you will encounter in industry starts with the same question: do I have a target? That single question determines your entire modelling approach — what algorithms are available, how you evaluate success, and what "done" means. Getting this wrong at the start costs weeks of work.

This note walks through each paradigm with enough depth that you can map any new problem to the right approach without guessing.

Learning Objectives

Distinguish supervised and unsupervised learning at a conceptual and practical level
Separate classification from regression and choose the right one for a given problem
Identify when clustering or dimensionality reduction is the right tool
Recognise semi-supervised and self-supervised learning by name
Apply a decision framework to any new problem

Supervised Learning

The core idea

In supervised learning, you have labeled training data. For every sample in your training set, you know the correct answer. The model learns to map inputs to that answer.

Think of it as training with a teacher. The teacher (your labels) corrects the student (your model) after every example until the student can answer correctly on its own.

Training time:  (X, y) → model learns a mapping
Prediction time: X_new → model predicts y_new

The word "supervised" refers to the supervision provided by the labels — not human supervision of the model training process.
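
The two phases can be sketched in a few lines. The toy data and the choice of `KNeighborsClassifier` are assumptions made purely for illustration; any scikit-learn estimator follows the same `fit` then `predict` pattern.

```python
from sklearn.neighbors import KNeighborsClassifier

# Toy data (hypothetical): one feature, two classes
X_train = [[1.0], [2.0], [8.0], [9.0]]
y_train = [0, 0, 1, 1]

# Training time: the model sees inputs together with their labels
model = KNeighborsClassifier(n_neighbors=1)
model.fit(X_train, y_train)

# Prediction time: only inputs are available
X_new = [[1.5], [8.5]]
y_new = model.predict(X_new)
print(y_new)
```

The labels appear only in `fit`; at prediction time the model has to produce them on its own.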

Classification

The target variable is a discrete category. The model learns a decision boundary that separates the classes.

Problem	Features	Target classes
Email spam detection	word frequencies, sender reputation	spam / not spam
Medical diagnosis	symptoms, lab values, age	disease A / B / healthy
Customer churn	tenure, spend, complaints	will churn / will not
Digit recognition	pixel values	0 through 9
Sentiment analysis	word embeddings	positive / negative / neutral

import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Load classification dataset
data = load_breast_cancer(as_frame=True)
X = data.data
y = data.target  # 0 = malignant, 1 = benign

print("Target distribution:")
print(y.value_counts())
# Output:
# 1    357
# 0    212

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = LogisticRegression(max_iter=10000, random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred, target_names=["malignant", "benign"]))
# Output:
#               precision    recall  f1-score   support
#    malignant       0.97      0.93      0.95        42
#       benign       0.96      0.99      0.97        72
#     accuracy                           0.96       114

Tip

When you have more than two classes, it is called multiclass classification. Sklearn handles this automatically for most algorithms. When each sample can belong to multiple classes simultaneously (e.g. an article tagged as both "sports" and "finance"), that is multilabel classification — a different problem requiring different metrics.
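
To make the multilabel case concrete, here is a small sketch of how such targets are commonly encoded with scikit-learn's `MultiLabelBinarizer`. The article tags are made up for illustration.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical article tags: each sample can carry several labels
tags = [["sports", "finance"], ["sports"], ["politics"]]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(tags)

print(mlb.classes_)  # column order (sorted alphabetically)
print(Y)             # one row per article, one 0/1 column per tag
```

Each row can contain more than one `1`, which is exactly what separates multilabel from multiclass targets.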

Regression

The target variable is a continuous number. The model learns to predict a quantity.

Problem	Features	Target
House price prediction	bedrooms, area, postcode	price in £
Demand forecasting	date, promotions, weather	units sold
Insurance premium	age, claims history, coverage	annual premium
Energy consumption	time, temperature, occupancy	kWh

from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score

data = load_diabetes(as_frame=True)
X = data.data
y = data.target  # Disease progression (continuous)

print(f"Target range: {y.min():.1f} to {y.max():.1f}")
# Output: Target range: 25.0 to 346.0

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print(f"MAE:  {mean_absolute_error(y_test, y_pred):.1f}")  # Output: MAE:  42.8
print(f"R²:   {r2_score(y_test, y_pred):.3f}")             # Output: R²:   0.453

Warning

Do not use accuracy as a metric for regression. Accuracy is a classification metric: it counts the fraction of predictions that exactly match the label. A continuous prediction almost never equals the true value exactly, so the number it produces says nothing about how close the predictions are. For regression, use MAE, RMSE, or R².
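
The three regression metrics are short enough to compute by hand. The sketch below uses four made-up values and also shows what an exact-match "accuracy" would report on the same predictions.

```python
import numpy as np

# Hypothetical true values and predictions for four samples
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.5, 4.0, 8.0])

errors = y_true - y_pred
mae = np.mean(np.abs(errors))         # average absolute error
rmse = np.sqrt(np.mean(errors ** 2))  # penalises large errors more heavily
r2 = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

# "Accuracy" as an exact-match rate: no prediction hits its target exactly
exact_match = np.mean(y_true == y_pred)

print(f"MAE: {mae:.3f}  RMSE: {rmse:.3f}  R²: {r2:.3f}  exact match: {exact_match:.2f}")
```

The predictions are within about one unit of the truth on average, yet the exact-match rate is zero.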

Classification vs Regression: the decision

Is the target a category?
    └─ Yes → Classification
       └─ Two categories → Binary classification
       └─ Three or more → Multiclass classification

Is the target a number?
    └─ Yes → Regression
       └─ Bounded (0–100%) → Regression or classification depending on use
       └─ Count data (0, 1, 2, ...) → Poisson regression or count models
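
As a quick sanity check on a target column, scikit-learn ships a utility, `type_of_target`, that reports how it would interpret a set of target values. The inputs below are made up. The function only inspects the values, so integer count data will also be reported as multiclass; the modelling decision still rests on what the numbers mean.

```python
from sklearn.utils.multiclass import type_of_target

binary_kind = type_of_target(["spam", "ham", "spam"])  # two distinct labels
multi_kind = type_of_target([0, 1, 2, 1])              # three distinct labels
cont_kind = type_of_target([12.5, 48.1, 7.3])          # real-valued target

print(binary_kind, multi_kind, cont_kind)
```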

The boundary blurs sometimes. "Predict the probability of churn" outputs a number (0 to 1) but the underlying problem is classification. You can threshold the probability to get a binary prediction.
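
A minimal sketch of that thresholding step, assuming a synthetic dataset in place of real churn data and two arbitrary thresholds:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary problem standing in for churn (1 = churn)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

model = LogisticRegression().fit(X, y)

# Column 1 of predict_proba is the estimated probability of class 1
churn_proba = model.predict_proba(X)[:, 1]

# The threshold is a business choice: lowering it flags more customers
flag_default = churn_proba >= 0.5
flag_cautious = churn_proba >= 0.3

print("Flagged at 0.5:", flag_default.sum())
print("Flagged at 0.3:", flag_cautious.sum())
```

The model is still a classifier; the probability is an intermediate output, and the threshold turns it back into a category.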

Unsupervised Learning

The core idea

No target column. You have X only. The goal is to discover structure that exists in the data itself — patterns, clusters, compressed representations.

Unsupervised learning is harder to evaluate because there is no ground truth to compare against. You need domain knowledge or a downstream task to judge whether the discovered structure is meaningful.

Clustering

Group similar samples together. Samples within a cluster should be more similar to each other than to samples in other clusters.

When to use it:
- You want to discover natural segments in your customer base
- You want to group documents by topic without predefined topics
- You want to detect anomalies (points that do not fit any cluster)
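
The anomaly use case can be sketched with distances to the nearest cluster centre. This is one simple approach among several, not a recommended production method; the synthetic data, the 99th-percentile cut-off, and the two test points are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Fit on "normal" data only (synthetic, three blobs)
X, _ = make_blobs(n_samples=300, centers=3, random_state=42, cluster_std=1.2)
km = KMeans(n_clusters=3, random_state=42, n_init=10).fit(X)

# transform() returns the distance from each point to every centroid
train_dist = km.transform(X).min(axis=1)
threshold = np.percentile(train_dist, 99)  # assumed cut-off, tune per use case

# Score new points: one sitting on a centroid, one far from all clusters
X_new = np.vstack([km.cluster_centers_[0], [100.0, 100.0]])
new_dist = km.transform(X_new).min(axis=1)
is_anomaly = new_dist > threshold

print(is_anomaly)
```

Fitting on clean data first matters here: if an extreme point is present during fitting, k-means may give it a cluster of its own, and its distance to the nearest centre drops to zero.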

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic data with 3 natural clusters
X, true_labels = make_blobs(n_samples=300, centers=3, random_state=42, cluster_std=1.2)

# Scale first — KMeans is distance-based, sensitive to scale
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

kmeans = KMeans(n_clusters=3, random_state=42, n_init="auto")
kmeans.fit(X_scaled)

labels = kmeans.labels_
print("Cluster sizes:", np.bincount(labels))
# Output: Cluster sizes: [100 100 100]

# Inertia: sum of squared distances from each point to its nearest cluster centre
# Lower means tighter clusters, but inertia keeps falling as k grows,
# so compare it across values of k rather than reading it in isolation
print(f"Inertia: {kmeans.inertia_:.2f}")
# Output: Inertia: 286.87

Warning

KMeans requires you to specify the number of clusters k in advance. Choosing k wrong is the most common clustering mistake. Use the elbow method or silhouette score to inform your choice — do not just guess.

from sklearn.metrics import silhouette_score

inertias = []
silhouette_scores = []
k_range = range(2, 8)

for k in k_range:
    km = KMeans(n_clusters=k, random_state=42, n_init="auto")
    km.fit(X_scaled)
    inertias.append(km.inertia_)
    silhouette_scores.append(silhouette_score(X_scaled, km.labels_))

# The k with the highest silhouette score is often a good choice
best_k = k_range.start + silhouette_scores.index(max(silhouette_scores))
print(f"Best k by silhouette: {best_k}")  # Output: Best k by silhouette: 3

Dimensionality Reduction

Convert high-dimensional data into fewer dimensions while preserving as much meaningful structure as possible.

Why it matters:
- Visualisation: compress 50 features to 2D to see structure
- Noise reduction: remove dimensions that add noise, not signal
- Computational cost: train faster downstream models on fewer features
- Collinearity: many algorithms struggle with correlated features

from sklearn.decomposition import PCA
from sklearn.datasets import load_wine

data = load_wine(as_frame=True)
X = data.data
y = data.target

print("Original dimensions:", X.shape[1])  # Output: Original dimensions: 13

# Scale before PCA — PCA is variance-based, sensitive to scale
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Reduce to 2 dimensions for visualisation
pca = PCA(n_components=2, random_state=42)
X_pca = pca.fit_transform(X_scaled)

print("Reduced dimensions:", X_pca.shape[1])  # Output: Reduced dimensions: 2
print("Variance explained:", pca.explained_variance_ratio_.round(3))
# Output: Variance explained: [0.362 0.192]
# The first 2 components explain ~55% of total variance

# Check if the 2D projection separates the wine classes
for class_id in [0, 1, 2]:
    mask = y == class_id
    print(f"Class {class_id} — PC1 mean: {X_pca[mask, 0].mean():.2f}")
# Output:
# Class 0 — PC1 mean: 3.25
# Class 1 — PC1 mean: -1.01
# Class 2 — PC1 mean: -2.90
# Classes are well separated along PC1
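
Two components suit plotting, but for the noise-reduction and speed use cases you usually want to keep more. One common approach, sketched here on the same wine data, is to pass a float to `n_components` so that PCA keeps the fewest components reaching that share of variance. The 0.95 target is an assumption, not a rule.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_wine().data
X_scaled = StandardScaler().fit_transform(X)

# A float in (0, 1) asks PCA for the fewest components reaching that variance share
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)

retained = pca.explained_variance_ratio_.sum()
print("Components kept:", pca.n_components_)
print(f"Variance retained: {retained:.3f}")
```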

Info

PCA finds linear combinations of features. UMAP and t-SNE find non-linear low-dimensional representations — often more useful for visualising complex clusters, but harder to interpret and not suitable for preprocessing before a linear model.

Beyond the Two Main Paradigms

Semi-supervised learning

You have a small amount of labeled data and a large amount of unlabeled data. The unlabeled data helps the model learn better representations.

This is common in practice: labeling is expensive, collecting unlabeled data is cheap. A hospital might have 200 annotated scans and 50,000 unannotated ones.
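
A rough sketch of the idea using scikit-learn's `LabelSpreading`, which is one semi-supervised method among several. The synthetic blobs, the choice of five labeled samples per class, and the kNN kernel settings are all assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.semi_supervised import LabelSpreading

# Synthetic stand-in: 300 samples, but we pretend only 15 are labeled
X, y_true = make_blobs(n_samples=300, centers=3, random_state=42, cluster_std=1.2)

# Keep the labels of the first five samples of each class
labeled_idx = np.concatenate([np.flatnonzero(y_true == c)[:5] for c in range(3)])

# scikit-learn's convention: -1 marks an unlabeled sample
y_partial = np.full(len(X), -1)
y_partial[labeled_idx] = y_true[labeled_idx]

model = LabelSpreading(kernel="knn", n_neighbors=7)
model.fit(X, y_partial)

# transduction_ holds the label inferred for every training sample
unlabeled = y_partial == -1
agreement = (model.transduction_[unlabeled] == y_true[unlabeled]).mean()
print(f"Agreement with held-back labels: {agreement:.2f}")
```

The unlabeled points contribute the shape of the data, and the few labels are spread along that shape. On real data the clusters are rarely this clean, so expect much lower agreement.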

Self-supervised learning

The label is generated from the data itself — no human labeling required. Examples:

Predict the next word given the previous words (language models)
Predict a masked image patch given its surroundings (image models)
Predict whether two augmented views of the same image match (contrastive learning)

This is how GPT, BERT, and most modern foundation models are pre-trained. The resulting representations transfer powerfully to downstream supervised tasks.
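
To see how a label can come from the data itself, the sketch below turns one raw sentence into (context, next word) training pairs. The sentence and the window size are made up; real language models work on subword tokens and far longer contexts.

```python
# Raw, unlabeled text (hypothetical sentence)
text = "the model learns to predict the next word"
tokens = text.split()

context_size = 3  # assumed window length, purely for illustration

# Each training pair is (previous words, next word): the label comes from the text itself
pairs = [
    (tokens[i : i + context_size], tokens[i + context_size])
    for i in range(len(tokens) - context_size)
]

for context, target in pairs:
    print(context, "->", target)
```

Nobody annotated anything here, yet every pair has an input and a target, so ordinary supervised training machinery applies.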

Info

You will not build self-supervised models in this bootcamp, but you will use their outputs. When you use a pre-trained sentence transformer to embed text, you are using a self-supervised model's learned representations as features.

Choosing the Right Paradigm

Use this decision framework on any new problem:

Step 1: Do you have a clear target to predict?
    └─ Yes → Supervised Learning
       └─ Step 2: What type is the target?
          └─ Discrete category → Classification
          └─ Continuous number → Regression
    └─ No → Unsupervised Learning
       └─ Step 2: What structure are you looking for?
          └─ Groups / segments → Clustering
          └─ Lower-dimensional representation → Dimensionality Reduction
          └─ Unusual samples → Anomaly Detection
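
The same tree can be written as a small function. `choose_paradigm` and its argument names are hypothetical, introduced here only to mirror the framework above.

```python
def choose_paradigm(has_target, target_is_category=None, goal=None):
    """Hypothetical helper mirroring the decision framework above."""
    if has_target:
        # Step 2 (supervised): the target type decides the task
        return "Classification" if target_is_category else "Regression"
    # Step 2 (unsupervised): the structure you are looking for decides the task
    unsupervised = {
        "groups": "Clustering",
        "compression": "Dimensionality Reduction",
        "outliers": "Anomaly Detection",
    }
    return unsupervised.get(goal, "Clarify the goal first")


print(choose_paradigm(True, target_is_category=True))
print(choose_paradigm(True, target_is_category=False))
print(choose_paradigm(False, goal="groups"))
```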

Applied examples

# Walk through five problems and classify them

problems = [
    ("Predict monthly sales revenue", "continuous target", "Regression"),
    ("Detect whether a transaction is fraudulent", "binary target", "Classification"),
    ("Group customers into segments for marketing", "no target", "Clustering"),
    ("Reduce 200 gene expression features to 10", "no target", "Dimensionality Reduction"),
    ("Rank products by likelihood of purchase", "binary target (bought or not)", "Classification, ranked by predicted probability"),
]

for problem, note, answer in problems:
    print(f"Problem: {problem}")
    print(f"  Note:   {note}")
    print(f"  Answer: {answer}\n")

Tip

If someone hands you a dataset and says "build a model", the first question to ask is: "What is the target column?" If they cannot answer that, the project is not ready for ML. Spend more time on problem definition.

Common Mistakes

Warning

Using clustering when you have labels and the goal is prediction. Clustering is exploratory — it discovers structure. If you already know the structure (you have a target), supervised learning will almost always give better results for prediction tasks.

Warning

Treating a regression problem as classification by binning the target. Binning "high / medium / low" revenue throws away information. A model that knows a value is $48,000 learns more than one told "medium". Only bin if the business decision is genuinely categorical.
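
A small sketch of what binning discards, using made-up revenue figures and assumed bin edges:

```python
import pandas as pd

# Hypothetical revenue figures
revenue = pd.Series([41_000, 48_000, 59_000, 95_000])

# Assumed bin edges; real cut-offs would come from the business
bins = [0, 40_000, 60_000, float("inf")]
binned = pd.cut(revenue, bins=bins, labels=["low", "medium", "high"])

print(pd.DataFrame({"revenue": revenue, "bin": binned}))
```

The values 41,000 and 59,000 end up with the same label despite an 18,000 gap, so a model trained on the bins can no longer tell them apart.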

Warning

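Including IDs, raw timestamps, or leaky columns as features in supervised models. An ID carries no generalisable signal, so the model can only memorise it. A leaky column encodes information that will not be available at prediction time (often a by-product of the target itself), which inflates validation scores and then fails in production. Always audit your feature list before training.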
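
One cheap first pass for such an audit is to flag columns that take a different value on every row, which is typical of identifiers. `flag_id_like_columns` is a hypothetical helper and the table is made up. A continuous numeric feature can also be unique per row, so treat the result as a prompt for manual review, and note that it does nothing to detect leakage.

```python
import pandas as pd


def flag_id_like_columns(df):
    """Hypothetical audit helper: flag columns with one distinct value per row."""
    return [col for col in df.columns if df[col].nunique() == len(df)]


# Made-up customer table
df = pd.DataFrame({
    "customer_id": [101, 102, 103, 104],
    "tenure_months": [3, 24, 24, 60],
    "complaints": [2, 0, 1, 0],
})

flagged = flag_id_like_columns(df)
print(flagged)
```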

Interview Questions

Q1: What is the difference between classification and regression?

Both are supervised learning tasks. Classification predicts a discrete category — the output is one of a fixed set of classes (e.g. spam/not spam, cat/dog/bird). Regression predicts a continuous numerical value (e.g. house price, temperature). The choice depends entirely on the type of the target variable, not on the algorithm.

Q2: When would you choose unsupervised learning over supervised?

When you do not have a target variable. This happens when: you are exploring data before you know what you want to predict; you want to discover natural groupings (customer segments, document topics); you want to compress or visualise high-dimensional data; or labeling data is too expensive or time-consuming and you need to learn representations from unlabeled data first.

Q3: A colleague says "let's cluster our customer data to predict churn". What would you say?

Clustering is an unsupervised technique for discovering structure without a target. If we have labeled churn data (past customers who did or did not churn), we should use supervised classification — it directly optimises for predicting churn and will almost always outperform clustering for a prediction task. Clustering might be useful upstream (to find customer segments, then train a separate churn model per segment), but clustering alone does not predict churn.

Q4: What is dimensionality reduction and why would you use it?

Dimensionality reduction transforms a dataset with many features into one with fewer features while preserving as much meaningful variance or structure as possible. You would use it to visualise high-dimensional data in 2D or 3D, to remove correlated or noisy features before training a model, to speed up training on large feature spaces, or as a preprocessing step to mitigate the curse of dimensionality (where data becomes increasingly sparse as dimensions grow).

What's Next

You've covered classification vs regression, clustering with k-means and silhouette scoring, dimensionality reduction with PCA, semi-supervised and self-supervised learning paradigms, and the step-by-step decision framework for choosing the right approach. Next up: 03-train-test-split-and-leakage — where you'll learn why proper train/test splitting is non-negotiable, how data leakage silently destroys model reliability, and how sklearn Pipelines make correct behaviour the default rather than an afterthought.

Optional Deep Dive

Read scikit-learn's "Choosing the right estimator" flowchart at https://scikit-learn.org/stable/tutorial/machine_learning_map/ — it maps the decision framework from this note onto specific sklearn algorithms, giving you a concrete algorithm recommendation for each problem type and dataset size.
