Train/Test Split and Data Leakage¶

This is not a technicality. It is the rule that separates honest ML from fraudulent ML, and many production ML disasters trace back to a violation of it.

The rule has two parts: a model must never be evaluated on data it was trained on, and training must never be contaminated with information that would not exist at prediction time. Break the first part and your accuracy score is fiction. Break the second and you ship a model that works in development and collapses in production.

Learning Objectives¶

Explain why evaluating on training data produces misleadingly good results
Implement train/test split correctly, including stratified splits for imbalanced data
Describe the three-way train/validation/test split and explain when each set is used
Define data leakage precisely and distinguish its two main forms
Identify leakage in realistic scenarios
Use sklearn Pipelines to prevent leakage mechanically

Why Splitting Is a Fundamental Requirement¶

The memorisation problem¶

When a model trains, it adjusts its parameters to minimise error on the training data. Some of that adjustment captures real patterns, such as the relationship between features and target. Some of it captures noise: quirks specific to the particular rows it was trained on that do not exist in the broader world.

If you evaluate the model on the same data it trained on, you measure both. The model looks better than it is, because it has partially memorised the training set.

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

data = load_breast_cancer()
X, y = data.data, data.target

model = DecisionTreeClassifier(random_state=42)
model.fit(X, y)  # Train on ALL data — no split

# Evaluate on the same data used for training
train_accuracy = accuracy_score(y, model.predict(X))
print(f"Accuracy on training data: {train_accuracy:.4f}")
# Output: Accuracy on training data: 1.0000

# A perfect score — the model has memorised the training set
# This tells us nothing about real-world performance

Now do it correctly:

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

train_acc = accuracy_score(y_train, model.predict(X_train))
test_acc  = accuracy_score(y_test,  model.predict(X_test))

print(f"Accuracy on train: {train_acc:.4f}")  # Output: Accuracy on train: 1.0000
print(f"Accuracy on test:  {test_acc:.4f}")   # Output: Accuracy on test:  0.9035

# The gap between train and test accuracy is the overfitting signal.
# 100% train, 90% test — the model has memorised, not fully generalised.

Success

The gap between training accuracy and test accuracy is one of the most informative signals in ML. A large gap means the model is overfitting — it has learned noise specific to the training set. Your goal is to close that gap while keeping test accuracy high.

The epistemological argument¶

Here is why splitting is not just a convention but a logical requirement.

You want to know: "how will this model perform on data from the future?" The only honest way to answer that is to simulate the future — hold out data the model has never seen and measure performance on it. The test set is a simulation of deployment.

If you evaluate on training data, you are asking: "how well did the model learn its lessons?" That is a different question, and a much less useful one. You do not care how well the model learned — you care how well it will perform when you give it a new customer, a new transaction, a new email.

Basic Train/Test Split¶

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,     # 20% held out for testing
    random_state=42,   # reproducibility
)

print(f"Train: {X_train.shape[0]} samples")  # Output: Train: 455 samples
print(f"Test:  {X_test.shape[0]} samples")   # Output: Test:  114 samples

Stratified split for classification¶

When your classes are imbalanced, a random split might put all rare-class samples in train — leaving none in test, or vice versa. Stratified split preserves the class proportion in both sets.

from sklearn.model_selection import train_test_split
import numpy as np

# Imbalanced dataset: 90% class 0, 10% class 1
np.random.seed(42)
y_imbalanced = np.array([0] * 900 + [1] * 100)
X_imbalanced = np.random.randn(1000, 5)

# Without stratify — class proportions may shift
X_tr, X_te, y_tr, y_te = train_test_split(
    X_imbalanced, y_imbalanced, test_size=0.2, random_state=42
)
print("Without stratify — test class 1 %:", y_te.mean().round(3))
# Output: Without stratify — test class 1 %: 0.095

# With stratify — class proportions preserved
X_tr, X_te, y_tr, y_te = train_test_split(
    X_imbalanced, y_imbalanced, test_size=0.2, random_state=42, stratify=y_imbalanced
)
print("With stratify    — test class 1 %:", y_te.mean().round(3))
# Output: With stratify    — test class 1 %: 0.1

# Both sets now reflect the true 10% class 1 proportion

Tip

Always use stratify=y for classification problems. For regression, there is no direct equivalent — but if your target is heavily skewed, consider binning it, stratifying on the bins, then discarding the bins. This gives you a more representative split.
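The binning idea in the tip can be sketched as follows. This is a minimal illustration on synthetic data, not a recommended recipe: the names X_reg, y_reg and y_bins are made up for the example, and the choice of four quantile bins is an assumption you would adjust to your own target.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic regression data with a right-skewed target (illustrative only)
rng = np.random.default_rng(42)
X_reg = rng.normal(size=(1000, 3))
y_reg = rng.lognormal(mean=0.0, sigma=1.0, size=1000)

# Bin the continuous target into quartiles; labels=False returns integer codes 0..3
y_bins = np.asarray(pd.qcut(y_reg, q=4, labels=False))

# Stratify on the bins, but keep the original continuous target as y.
# Passing y_bins as a third array lets us inspect the bins on each side afterwards.
X_tr, X_te, y_tr, y_te, bins_tr, bins_te = train_test_split(
    X_reg, y_reg, y_bins, test_size=0.2, random_state=42, stratify=y_bins
)

# Each quartile should be equally represented in the test set
print("Test rows per quartile:", np.bincount(bins_te))

# The bins were only a splitting aid: train on y_tr, evaluate on y_te
```

After the split, the bin labels are discarded and the model is trained on the continuous target as usual.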

The Three-Way Split: Train / Validation / Test¶

A two-way split is often not enough. Here is why.

You train a model, evaluate on test, tune hyperparameters, evaluate on test again. Now you have inadvertently used the test set to make decisions. The test set has influenced the model — indirectly, through your choices. It is no longer a clean simulation of future data.

The solution: add a validation set.

All data
│
├── Training set (60–70%)     ← model learns from this
│
├── Validation set (15–20%)   ← you tune hyperparameters and compare models
│
└── Test set (15–20%)         ← touched once, at the very end, for final reporting

from sklearn.model_selection import train_test_split

# Step 1: carve off the test set first — lock it away
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Step 2: split the remainder into train and validation
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42, stratify=y_temp
    # 0.25 × 0.80 = 0.20 of total → 60/20/20 split
)

print(f"Train:      {X_train.shape[0]}")  # Output: Train:      341
print(f"Validation: {X_val.shape[0]}")    # Output: Validation: 114
print(f"Test:       {X_test.shape[0]}")   # Output: Test:       114

Warning

The test set must be touched exactly once: after all modelling decisions have been finalised. If you look at test set results during development and then change your model, you have used the test set as a validation set, and your final reported score is now optimistic. This is the ML counterpart of what researchers call p-hacking: repeating the analysis until the number looks good. In production it means your model underperforms expectations at launch.
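The workflow described above can be sketched end to end. This is one possible arrangement, assuming a decision tree and a small hypothetical grid of max_depth values (candidate_depths); refitting on train plus validation before the final test evaluation is a common choice, not a requirement.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Same 60/20/20 split as above: test set carved off first
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42, stratify=y_temp
)

# Hypothetical hyperparameter grid, chosen only for illustration
candidate_depths = [1, 2, 3, 5, 8, None]

# Tune on the validation set only; the test set is not touched in this loop
val_scores = {}
for depth in candidate_depths:
    model = DecisionTreeClassifier(max_depth=depth, random_state=42)
    model.fit(X_train, y_train)
    val_scores[depth] = accuracy_score(y_val, model.predict(X_val))

best_depth = max(val_scores, key=val_scores.get)
print(f"Best max_depth by validation accuracy: {best_depth}")

# All decisions are final: refit on train + validation, then evaluate on test once
final_model = DecisionTreeClassifier(max_depth=best_depth, random_state=42)
final_model.fit(X_temp, y_temp)
test_acc = accuracy_score(y_test, final_model.predict(X_test))
print(f"Final test accuracy: {test_acc:.4f}")
```

If the test score disappoints, the honest options are to report it as is or to collect fresh test data, not to go back and tune against it.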

Cross-validation as an alternative to a fixed validation set¶

When your dataset is small, a fixed validation set wastes data and gives high-variance estimates. Cross-validation rotates the validation role across k folds: every row is used for validation in exactly one fold and for training in the others.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

data = load_breast_cancer()
X, y = data.data, data.target

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=10000, random_state=42)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X, y, cv=cv, scoring="accuracy")

print(f"CV scores:  {scores.round(4)}")
# Output: CV scores:  [0.9825 0.9561 0.9649 0.9649 0.9561]

print(f"Mean ± std: {scores.mean():.4f} ± {scores.std():.4f}")
# Output: Mean ± std: 0.9649 ± 0.0094

Success

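Cross-validation gives you both a performance estimate and a sense of its variability. A mean of 96.5% with a standard deviation of about 1% across folds is much more informative than a single validation score of 97%, because you can see how much the score moves from one split to another. Note that the fold standard deviation is a measure of spread, not a formal confidence interval. Report both numbers.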

Time-Based Splits¶

For time series data, random splits are wrong. A shuffled split puts rows from the future into the training set, so the model effectively sees the future during training. That is leakage.

import pandas as pd
import numpy as np

# Simulate a time series dataset
np.random.seed(42)
dates = pd.date_range("2023-01-01", "2024-12-31", freq="D")
n = len(dates)

ts_df = pd.DataFrame({
    "date":   dates,
    "sales":  np.cumsum(np.random.randn(n)) + 100,
    "promo":  np.random.choice([0, 1], n, p=[0.8, 0.2]),
})

# Correct: split by time, never shuffle
cutoff = "2024-09-01"
train = ts_df[ts_df["date"] < cutoff]
test  = ts_df[ts_df["date"] >= cutoff]

print(f"Train: {train.shape[0]} days — {train.date.min().date()} to {train.date.max().date()}")
print(f"Test:  {test.shape[0]} days  — {test.date.min().date()} to {test.date.max().date()}")
# Output:
# Train: 609 days — 2023-01-01 to 2024-08-31
# Test:  122 days  — 2024-09-01 to 2024-12-31

Warning

Never use train_test_split with default settings on time series data: it shuffles rows by default (shuffle=True). Always split on a time boundary. Better still, use sklearn's TimeSeriesSplit for cross-validation on time-ordered data.
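A minimal sketch of TimeSeriesSplit on a toy series, to show the shape of the folds. The 20-row array X_ts is a stand-in for real time-ordered data, and n_splits=4 is an arbitrary choice for the illustration.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical toy series: 20 time-ordered rows, oldest first
n_days = 20
X_ts = np.arange(n_days).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=4)
folds = list(tscv.split(X_ts))

# In every fold, all training rows come strictly before all test rows
for fold, (train_idx, test_idx) in enumerate(folds, start=1):
    print(
        f"Fold {fold}: train rows {train_idx.min()}-{train_idx.max()}, "
        f"test rows {test_idx.min()}-{test_idx.max()}"
    )
```

The training window grows with each fold while the test window moves forward in time, which mirrors how a deployed model is retrained on more history and then used on newer data. The rows must already be sorted by time, since the splitter works on row positions.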

Data Leakage¶

Data leakage is when information that would not be available at real prediction time — when the model is deployed and making live decisions — is used during training or evaluation.

Leakage makes your model look better than it actually is. The model learns a shortcut that does not exist in production. When you deploy, the shortcut disappears, and performance collapses.

Warning

Leakage is insidious because it produces the best results during development. If a feature makes your accuracy jump from 80% to 99%, that is a red flag, not a cause for celebration. Question it.

Type 1: Target leakage¶

A feature in your training data was created using information about the target, or the feature is only observed after the target event occurs.

Scenario — predicting hospital readmission:

Feature	Leaky?	Why
`age`, `diagnosis`	No	Known at admission time
`length_of_stay`	No, if predicting at discharge	Only known once the stay ends, so unavailable for a prediction made at admission
`discharge_medication_count`	Maybe	Known at discharge. Fine if the prediction is made at discharge; leaky if it must be made earlier
`readmission_notes`	Yes	Only exists if the patient was readmitted
`follow_up_appointment_booked`	Depends	If booked at discharge, fine. If booked after readmission, leaky.
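The reasoning in the table comes down to one question per feature: is it known at the moment the prediction is made? One way to make that check explicit is to record when each feature becomes available and filter on it. The sketch below is hypothetical throughout: the stage names, the feature_available_at mapping and the usable_features helper are invented for illustration.

```python
# Hypothetical metadata: the stage at which each feature becomes known
feature_available_at = {
    "age": "admission",
    "diagnosis": "admission",
    "discharge_medication_count": "discharge",
    "readmission_notes": "after_outcome",
}

# Stages in chronological order
STAGE_ORDER = {"admission": 0, "discharge": 1, "after_outcome": 2}


def usable_features(available_at, prediction_stage):
    """Return the features already known when the prediction is made."""
    cutoff = STAGE_ORDER[prediction_stage]
    return [
        name
        for name, stage in available_at.items()
        if STAGE_ORDER[stage] <= cutoff
    ]


print(usable_features(feature_available_at, "admission"))
print(usable_features(feature_available_at, "discharge"))
```

Writing the availability down forces the "Maybe" and "Depends" cases to be resolved before modelling starts.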

import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

np.random.seed(42)
n = 1000

# Simulate a loan default dataset with a leaky feature
df = pd.DataFrame({
    "income":          np.random.normal(50000, 15000, n),
    "credit_score":    np.random.randint(300, 850, n),
    "loan_amount":     np.random.uniform(5000, 50000, n),
    "default":         np.random.choice([0, 1], n, p=[0.85, 0.15]),
})

# LEAKY FEATURE: collections_contacted is only true when someone has already defaulted
df["collections_contacted"] = (df["default"] == 1) & (np.random.rand(n) > 0.1)

# Model WITH leakage
X_leaky = df[["income", "credit_score", "loan_amount", "collections_contacted"]]
y = df["default"]

model = RandomForestClassifier(n_estimators=50, random_state=42)
score_leaky = cross_val_score(model, X_leaky, y, cv=5, scoring="accuracy").mean()
print(f"Accuracy WITH leakage: {score_leaky:.4f}")   # Output: ~0.97 — suspiciously high

# Model WITHOUT leakage
X_clean = df[["income", "credit_score", "loan_amount"]]
score_clean = cross_val_score(model, X_clean, y, cv=5, scoring="accuracy").mean()
print(f"Accuracy WITHOUT leakage: {score_clean:.4f}")  # Output: ~0.87 — realistic
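One way to question a suspiciously strong feature is to score each feature on its own: a single column that predicts the target almost perfectly deserves a closer look. The sketch below regenerates a similar synthetic loan table (the names loans and single_feature_scores are hypothetical) and ranks features by cross-validated balanced accuracy using a shallow tree. It is a screening heuristic only; a strong legitimate feature would also rank high, so the result is a prompt for investigation, not proof of leakage.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic loan data, similar in spirit to the example above (illustrative only)
rng = np.random.default_rng(0)
n = 1000
default = rng.choice([0, 1], size=n, p=[0.85, 0.15])
loans = pd.DataFrame({
    "income": rng.normal(50000, 15000, n),
    "credit_score": rng.integers(300, 851, n),
    "loan_amount": rng.uniform(5000, 50000, n),
    # Leaky by construction: only ever set for rows that defaulted
    "collections_contacted": ((default == 1) & (rng.random(n) > 0.1)).astype(int),
})


def single_feature_scores(features, target, cv=5):
    """Cross-validated balanced accuracy of a shallow tree on each feature alone."""
    scores = {}
    for col in features.columns:
        stump = DecisionTreeClassifier(max_depth=2, random_state=0)
        scores[col] = cross_val_score(
            stump, features[[col]], target, cv=cv, scoring="balanced_accuracy"
        ).mean()
    return pd.Series(scores).sort_values(ascending=False)


ranking = single_feature_scores(loans, default)
print(ranking.round(3))
```

Balanced accuracy is used here instead of plain accuracy so that always predicting the majority class scores 0.5, which makes the noise features easy to tell apart from the leaky one.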

Type 2: Train-test contamination¶

Information from the test set leaks into the training process — typically through preprocessing steps that are fitted on the full dataset before splitting.

The most common form: scaling (StandardScaler, MinMaxScaler) applied before train_test_split.

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X, y = data.data, data.target

# WRONG: scale before splitting
scaler = StandardScaler()
X_scaled_wrong = scaler.fit_transform(X)   # uses statistics from test rows
X_tr_bad, X_te_bad, y_tr, y_te = train_test_split(X_scaled_wrong, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_tr_bad, y_tr)
print(f"Accuracy (scaled before split): {accuracy_score(y_te, model.predict(X_te_bad)):.4f}")
# Output: ~0.9737

# RIGHT: split first, then scale — fit scaler on train only
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)   # fit on train only
X_te_scaled = scaler.transform(X_te)       # transform test with train statistics

model = LogisticRegression(max_iter=1000)
model.fit(X_tr_scaled, y_tr)
print(f"Accuracy (split before scale):  {accuracy_score(y_te, model.predict(X_te_scaled)):.4f}")
# Output: ~0.9561

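# Here the leaky score comes out slightly higher than the honest one.
# With a simple scaler the difference is usually small and can go either way;
# only the second number is an uncontaminated estimate.
# With target encoding or feature selection the inflation can be much larger.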

Warning

The rule: fit all preprocessing on training data only. Transform both train and test using the parameters learned from train. This applies to: scalers, imputers, encoders, PCA, feature selectors — any transformer with a fit() step.
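Feature selection shows the effect much more dramatically than scaling. The sketch below uses pure noise, so no model can honestly do better than chance, yet selecting features on the full dataset before cross-validation produces a score well above it. The data is synthetic, and the sizes (100 rows, 2000 features, k=20) are arbitrary choices made to exaggerate the effect.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X_noise = rng.normal(size=(100, 2000))   # features are pure noise
y_noise = rng.integers(0, 2, size=100)   # labels are independent of the features

# WRONG: pick the 20 "best" features using every row, then cross-validate
X_selected = SelectKBest(f_classif, k=20).fit_transform(X_noise, y_noise)
leaky_score = cross_val_score(
    LogisticRegression(max_iter=1000), X_selected, y_noise, cv=5
).mean()

# RIGHT: selection happens inside the pipeline, refitted on each training fold
pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=20)),
    ("model", LogisticRegression(max_iter=1000)),
])
clean_score = cross_val_score(pipe, X_noise, y_noise, cv=5).mean()

print(f"Selection before CV (leaky): {leaky_score:.2f}")
print(f"Selection inside CV (clean): {clean_score:.2f}")
```

The leaky version looks skilful because the selector has already seen the validation labels; the clean version should land near chance, which is the truth for random labels.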

Pipelines: Mechanical Leakage Prevention¶

The correct approach (split first, then fit the scaler on train) is easy to get wrong when there are many preprocessing steps. Pipelines make the correct order the default.

A sklearn Pipeline chains preprocessing steps and a model into a single object. When you call pipeline.fit(X_train, y_train), every transformer in the chain is fitted on X_train only. When you call pipeline.predict(X_test), the fitted transformers are applied to the test data with no refitting. (pipeline.transform(X_test) behaves the same way, but it is only available when the final step is itself a transformer rather than a model.)

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import classification_report

data = load_breast_cancer(as_frame=True)
X = data.data
y = data.target

# Split BEFORE building the pipeline — always
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Build the pipeline: each step is a (name, transformer) tuple
pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),  # fit on train, transform test
    ("scaler",  StandardScaler()),                   # fit on train, transform test
    ("model",   LogisticRegression(max_iter=10000, random_state=42)),
])

# fit() runs all transformers on X_train, then fits the model
pipe.fit(X_train, y_train)

# predict() runs all transformers on X_test (using train statistics), then predicts
y_pred = pipe.predict(X_test)

print(classification_report(y_test, y_pred, target_names=["malignant", "benign"]))
# Output:
#               precision    recall  f1-score   support
#    malignant       0.97      0.93      0.95        42
#       benign       0.96      0.99      0.97        72
#     accuracy                           0.96       114

# Cross-validate the pipeline on the training portion only, so the test set stays untouched.
# Each fold fits its own imputer and scaler on that fold's training rows.
cv_scores = cross_val_score(pipe, X_train, y_train, cv=5, scoring="accuracy")
print(f"CV mean: {cv_scores.mean():.4f} ± {cv_scores.std():.4f}")

Success

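If you build your preprocessing inside a Pipeline and pass the pipeline to cross_val_score, preprocessing leakage between folds is ruled out by construction: each cross-validation fold fits the transformers on its own training fold and applies them to its own validation fold. This is the correct, professional way to run ML experiments. It does not protect against target leakage, or against any transformation applied to the data before it enters the pipeline, so those still need a manual check.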
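The claim that a fitted pipeline only carries training statistics can be checked directly. A small sketch, assuming a two-step pipeline with the step names "scaler" and "model": it compares the scaler's learned means against the training-set means and against the full-dataset means.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=10000)),
])
pipe.fit(X_train, y_train)

# The fitted scaler is reachable by its step name
fitted_mean = pipe.named_steps["scaler"].mean_

# It should match the training means, not the means of the full dataset
matches_train = np.allclose(fitted_mean, X_train.mean(axis=0))
matches_full = np.allclose(fitted_mean, X.mean(axis=0))

print(f"Scaler means equal train means:        {matches_train}")
print(f"Scaler means equal full-dataset means: {matches_full}")
```

Calling pipe.predict(X_test) afterwards reuses these stored means, so the test rows never influence the scaling parameters.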

Leakage Spotting Checklist¶

Run this checklist on any dataset before training:

1. Does any feature contain information about the outcome?
   └─ e.g. "loan_written_off" in a default prediction model

2. Is any feature only observable after the event you are predicting?
   └─ e.g. "cancellation_date" in a churn model

3. Was any preprocessing fitted on the full dataset before splitting?
   └─ Scalers, imputers, encoders, PCA

4. Are future data points appearing in the training set?
   └─ Shuffling time series data

5. Is a high-cardinality ID column included as a feature?
   └─ customer_id, transaction_id, product_id

6. Are target-encoded features computed on the full dataset?
   └─ Mean-encode category AFTER splitting, not before

7. Did the model accuracy jump suspiciously high?
   └─ 95%+ on a genuinely hard problem → investigate for leakage
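Item 6 is easy to get wrong in practice, so here is a minimal sketch of mean (target) encoding fitted on the training rows only. The toy DataFrame, the column names city and churned, and the fallback to the global training mean for unseen categories are all assumptions made for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: one categorical feature and a binary target
df = pd.DataFrame({
    "city":    ["A", "A", "A", "B", "B", "B", "C", "C", "C", "C", "D", "A"],
    "churned": [1,   0,   1,   0,   0,   1,   1,   1,   0,   1,   0,   0],
})

# Split FIRST, so the encoding never sees test-set targets
train_df, test_df = train_test_split(df, test_size=0.25, random_state=42)

# Learn the encoding from training rows only
global_mean = train_df["churned"].mean()
city_means = train_df.groupby("city")["churned"].mean()

# Apply the same mapping to both sets; categories unseen in train
# fall back to the global training mean
train_df = train_df.assign(
    city_encoded=train_df["city"].map(city_means)
)
test_df = test_df.assign(
    city_encoded=test_df["city"].map(city_means).fillna(global_mean)
)

print(test_df[["city", "city_encoded"]])
```

This removes leakage from the test set. The training rows are still encoded with their own targets, which can overfit on rare categories; out-of-fold encoding or smoothing are common ways to reduce that.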

Interview Questions¶

Q1: Why is it a problem to evaluate a model on its training data?

Because the model has already seen that data and may have partially memorised it. Evaluation on training data measures how well the model learned its training set, not how well it will generalise to new data — which is the actual goal. Training data evaluation inflates performance metrics and gives a false sense of model quality. The honest way to estimate real-world performance is to evaluate on held-out test data the model has never seen.

Q2: What is data leakage? Give two concrete examples.

Data leakage is when information that would not be available at real prediction time is used during training or evaluation, causing the model to look better than it actually is.

Example 1 — Target leakage: In a credit default model, including "collections_agency_contacted" as a feature. This field is only populated after a customer has defaulted — it is not available at the time you would actually make the prediction (when deciding whether to approve the loan). The model learns to use a leaky shortcut that disappears in production.

Example 2 — Preprocessing contamination: Fitting a StandardScaler on the full dataset before the train/test split. The scaler's mean and standard deviation are computed using test-set rows, so the test set has influenced the training process. This is a subtle but real form of leakage.

Q3: Why does scaling before the train/test split count as leakage?

When you call scaler.fit(X) on the full dataset, the scaler computes the mean and standard deviation of each feature using all rows — including rows that end up in the test set. When you later scale the training data with those statistics, the training process has been influenced by test-set values. The correct approach: split first, then fit the scaler on training data only and apply it to both train and test.

Q4: How do sklearn Pipelines prevent leakage?

A Pipeline chains transformers and a model so that when you call pipeline.fit(X_train, y_train), each transformer is fitted only on X_train. When you call pipeline.predict(X_test), the pre-fitted transformers (using train statistics) are applied to the test data; they are not refitted. When you pass a pipeline to cross_val_score, each fold refits all transformers on its own training fold, preventing any cross-fold contamination. This makes correct behaviour the default.

What's Next¶

You've covered train/test splitting with stratification, cross-validation for reliable evaluation, time-based splits for sequential data, both types of data leakage (target leakage and preprocessing contamination), and how sklearn Pipelines prevent leakage structurally. Next up: 04-scikit-learn-workflow — where you'll put these concepts into a complete end-to-end sklearn workflow: loading data, building ColumnTransformer pipelines for mixed feature types, training and evaluating multiple models, and following the checklist that catches the most common production mistakes.

Optional Deep Dive

Read the sklearn User Guide section on cross-validation at https://scikit-learn.org/stable/modules/cross_validation.html — it covers the full range of cross-validation strategies (stratified, grouped, time series) and explains the mathematical reasoning behind why each is appropriate for different data structures.
