Pipelines and Data Leakage

Data leakage is one of the most expensive mistakes in applied machine learning. A leaky model looks exceptional during development (95% accuracy, strong cross-validation scores) and then fails in production because it was trained on information it will never have when making real predictions. The model did not learn to generalise; it learned to cheat. sklearn Pipelines are the primary engineering tool for preventing preprocessing leakage and for building preprocessing that can be safely deployed.

Learning Objectives

By the end of this note you will be able to:

Define data leakage precisely and identify the three most common ways it enters a pipeline
Explain why fitting a scaler or encoder on the full dataset before the train/test split is leakage
Build a Pipeline that encapsulates preprocessing and modelling steps
Build a ColumnTransformer that applies different transforms to different column types
Fit a pipeline on training data only and use it to transform validation and test data
Describe how to persist and deploy a fitted sklearn pipeline

What Data Leakage Is

Leakage occurs when your model has access to information during training that it would not have access to at prediction time.

The two most common forms are described here; a third, temporal leakage, is covered in the taxonomy later in this note:

Target leakage — a feature is derived from the target or from events that happen after the target is determined. Example: predicting whether a patient will be hospitalised this month, and including "number of prescriptions written this month" as a feature. The prescription count reflects the hospitalisation — it does not cause it.

Train/test contamination — a preprocessing step (scaling, imputation, encoding) is fitted on the full dataset before the train/test split. The test set statistics leak into the training process, making your evaluation optimistically biased.
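To make the optimistic bias concrete, here is a minimal sketch on synthetic data. It swaps the scaler for a feature selector (SelectKBest), because selection on pure noise makes the effect much easier to see than scaling does; the mechanism is the same, namely a preprocessing step fitted on rows that are later used for evaluation. The data, the variable names, and the choice of 5,000 noise features are assumptions made for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X_noise = rng.normal(size=(100, 5000))      # pure noise: no feature carries signal
y_noise = rng.integers(0, 2, size=100)      # labels are coin flips

# LEAKY: pick the 20 "best" features using all rows, then cross-validate
X_selected = SelectKBest(f_classif, k=20).fit_transform(X_noise, y_noise)
leaky_acc = cross_val_score(
    LogisticRegression(max_iter=1000), X_selected, y_noise, cv=5
).mean()

# HONEST: selection is refitted inside each training fold
honest_pipeline = make_pipeline(
    SelectKBest(f_classif, k=20), LogisticRegression(max_iter=1000)
)
honest_acc = cross_val_score(honest_pipeline, X_noise, y_noise, cv=5).mean()

print(f"Leaky CV accuracy:  {leaky_acc:.2f}")
print(f"Honest CV accuracy: {honest_acc:.2f}")
```

Because the labels are random, an honest estimate should hover around chance. The leaky estimate should come out well above it even though there is nothing to learn, which is exactly the kind of score that collapses in production.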

The Most Common Leakage Mistake

This pattern is everywhere in beginner tutorials and it is wrong:

# WRONG — scaler sees test data statistics during fit
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)          # fits on all data including test rows
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y)

The correct order: split first, then fit the scaler on training data only, then transform both sets using the training-fitted scaler.

The Correct Way — Without a Pipeline

This pattern is correct but error-prone. It requires discipline to consistently apply transform (not fit_transform) on the test set.

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

np.random.seed(42)
n = 500
loan_data = pd.DataFrame({
    "annual_income": np.random.normal(60000, 20000, n),
    "loan_amount": np.random.uniform(5000, 50000, n),
    "credit_score": np.random.randint(300, 850, n),
    "default": np.random.binomial(1, 0.2, n)
})

X = loan_data.drop("default", axis=1)
y = loan_data["default"]

# Step 1: split FIRST
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 2: fit scaler on TRAINING DATA ONLY
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # fit + transform on train

# Step 3: transform test using TRAINING statistics
X_test_scaled = scaler.transform(X_test)          # only transform, no fit

# Step 4: train and evaluate
model = LogisticRegression(max_iter=500)
model.fit(X_train_scaled, y_train)
print(f"Test accuracy: {model.score(X_test_scaled, y_test):.3f}")

This works, but it is fragile. Every new preprocessing step you add requires you to manually remember to fit on train and transform on test. sklearn Pipelines automate this correctly.

sklearn Pipeline — Encapsulate the Entire Workflow

A Pipeline chains preprocessing steps and a final estimator. When you call pipeline.fit(X_train, y_train), each transformer calls fit_transform on the training data and passes its output to the next step, and the final estimator calls fit on the result. When you call pipeline.predict(X_test), each transformer calls only transform and the final estimator calls predict, so nothing is refitted on the test set.

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.metrics import classification_report

np.random.seed(42)
n = 800
credit_data = pd.DataFrame({
    "annual_income": np.random.exponential(50000, n),
    "debt_to_income": np.random.uniform(0.1, 0.9, n),
    "num_credit_inquiries": np.random.poisson(3, n),
    "credit_score": np.random.randint(300, 850, n),
    "default": np.random.binomial(1, 0.25, n)
})

X = credit_data.drop("default", axis=1)
y = credit_data["default"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Build the pipeline — steps are (name, estimator) tuples
credit_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=500, random_state=42))
])

# Fit ONLY on training data
credit_pipeline.fit(X_train, y_train)

# Evaluate — pipeline applies scaler.transform (not fit_transform) internally
y_pred = credit_pipeline.predict(X_test)
print(classification_report(y_test, y_pred))
# Output: precision/recall/f1 per class

# make_pipeline is a shortcut that auto-names steps from class names
quick_pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=500))
quick_pipeline.fit(X_train, y_train)

Access Individual Steps from a Fitted Pipeline

After fitting, you can access the fitted scaler via credit_pipeline.named_steps['scaler'] or credit_pipeline['scaler']. This is useful for inspecting learned parameters like scaler.mean_ and scaler.scale_.
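As a small illustration of that inspection, the sketch below fits a two-step pipeline on a made-up dataset and checks that the scaler's learned means equal the training-set means rather than the full-dataset means. The data and the names demo_X, demo_y, and demo_pipeline are assumptions for this example only.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
demo_X = pd.DataFrame({
    "annual_income": rng.normal(60000, 20000, 200),
    "credit_score": rng.integers(300, 850, 200),
})
demo_y = rng.integers(0, 2, 200)
X_tr, X_te, y_tr, y_te = train_test_split(
    demo_X, demo_y, test_size=0.2, random_state=42
)

demo_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=500)),
])
demo_pipeline.fit(X_tr, y_tr)

# Both access styles return the same fitted object
fitted_scaler = demo_pipeline.named_steps["scaler"]
same_object = demo_pipeline["scaler"] is fitted_scaler

print(fitted_scaler.mean_)    # per-column means learned during fit
print(fitted_scaler.scale_)   # per-column standard deviations learned during fit

# The learned means come from the training rows, not from the full dataset
matches_train = np.allclose(fitted_scaler.mean_, X_tr.mean().to_numpy())
matches_full = np.allclose(fitted_scaler.mean_, demo_X.mean().to_numpy())
print(matches_train, matches_full)
```

If matches_full ever came out True on a real project, that would be a hint that the scaler had seen rows it should not have.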

ColumnTransformer — Different Transforms for Different Columns

Real datasets have mixed types: some numeric columns need scaling, some categorical columns need one-hot encoding, some need target encoding. ColumnTransformer applies different transformers to different column subsets and concatenates the results.

import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split, cross_val_score

np.random.seed(0)
n = 600

customer_churn = pd.DataFrame({
    "customer_tenure_days": np.random.randint(1, 1825, n),
    "monthly_spend": np.random.exponential(2000, n),
    "num_products": np.random.randint(1, 5, n),
    "days_since_last_login": np.random.randint(0, 180, n),
    "city": np.random.choice(["Mumbai", "Delhi", "Bangalore", "Chennai", "Pune"], n),
    "plan_type": np.random.choice(["Basic", "Standard", "Premium"], n),
    "payment_method": np.random.choice(["Credit Card", "Debit Card", "UPI", "Net Banking"], n),
    "churn": np.random.binomial(1, 0.25, n)
})

# Introduce some missing values
customer_churn.loc[np.random.choice(n, 30, replace=False), "monthly_spend"] = np.nan
customer_churn.loc[np.random.choice(n, 20, replace=False), "days_since_last_login"] = np.nan

X = customer_churn.drop("churn", axis=1)
y = customer_churn["churn"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define column groups
numeric_cols = ["customer_tenure_days", "monthly_spend", "num_products", "days_since_last_login"]
categorical_cols = ["city", "plan_type", "payment_method"]

# Build sub-pipelines for each column type
numeric_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),   # handle missing values first
    ("scaler", StandardScaler())
])

categorical_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore", drop="first", sparse_output=False))
])

# Combine into a ColumnTransformer
preprocessor = ColumnTransformer(transformers=[
    ("numeric", numeric_pipeline, numeric_cols),
    ("categorical", categorical_pipeline, categorical_cols)
])

# Full pipeline: preprocessing + model
churn_pipeline = Pipeline([
    ("preprocessing", preprocessor),
    ("model", GradientBoostingClassifier(n_estimators=100, random_state=42))
])

# Fit on training data only — all transformers fit only on X_train
churn_pipeline.fit(X_train, y_train)

# Cross-validate (pipeline handles train/val split correctly in each fold)
cv_scores = cross_val_score(churn_pipeline, X_train, y_train, cv=5, scoring="roc_auc")
print(f"CV ROC-AUC: {cv_scores.mean():.3f} (+/- {cv_scores.std():.3f})")

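# Test set evaluation: score the pipeline fitted on X_train against the held-out rows.
# (Calling cross_val_score on X_test would refit the pipeline on test data instead.)
from sklearn.metrics import roc_auc_score

test_auc = roc_auc_score(y_test, churn_pipeline.predict_proba(X_test)[:, 1])
print(f"Test ROC-AUC: {test_auc:.3f}")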

remainder='drop' vs remainder='passthrough'

By default, ColumnTransformer drops any columns not mentioned in its transformers list. Use remainder='passthrough' to include them unchanged. In practice, be explicit — list every column you intend to use and let the rest drop. Columns you do not handle intentionally often carry identifiers or targets that should not be in your feature matrix.
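The difference is easy to see on a toy frame. This is a minimal sketch; the DataFrame and the transformer names are made up for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

toy = pd.DataFrame({
    "customer_id": [101, 102, 103, 104],      # identifier, should not reach the model
    "monthly_spend": [1200.0, 800.0, 2500.0, 400.0],
    "num_products": [1, 2, 4, 1],
})

# remainder="drop" is the default, so unlisted columns disappear
drop_rest = ColumnTransformer(
    transformers=[("scale", StandardScaler(), ["monthly_spend"])],
)
# remainder="passthrough" appends unlisted columns unchanged
keep_rest = ColumnTransformer(
    transformers=[("scale", StandardScaler(), ["monthly_spend"])],
    remainder="passthrough",
)

dropped_shape = drop_rest.fit_transform(toy).shape
kept_shape = keep_rest.fit_transform(toy).shape
print(dropped_shape)   # (4, 1): only the scaled column survives
print(kept_shape)      # (4, 3): customer_id and num_products pass through unscaled
```

Note that with passthrough the identifier column customer_id ends up in the feature matrix, which is the risk the paragraph above warns about.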

Inspecting a Fitted Pipeline

A fitted pipeline is a single serialisable object that holds every fitted preprocessing step together with the trained model, so you can inspect what it learned and save it to disk as one unit.

import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
import joblib

# (Assume churn_pipeline is the fitted pipeline from above)

# Get feature names after preprocessing
preprocessor_fitted = churn_pipeline.named_steps["preprocessing"]
ohe_feature_names = (
    preprocessor_fitted
    .named_transformers_["categorical"]
    .named_steps["encoder"]
    .get_feature_names_out(categorical_cols)
)
all_feature_names = numeric_cols + list(ohe_feature_names)
print(f"Total features after preprocessing: {len(all_feature_names)}")
# Output: Total features after preprocessing: 13  (4 numeric + 9 one-hot columns; drop="first" removes one level per categorical)

# Save the fitted pipeline to disk
joblib.dump(churn_pipeline, "churn_model_v1.pkl")

# Load and use at inference time — apply to new data, no refitting
loaded_pipeline = joblib.load("churn_model_v1.pkl")
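new_customer = pd.DataFrame({
    "customer_tenure_days": [365],
    "monthly_spend": [2500.0],
    "num_products": [2],
    "days_since_last_login": [14],
    # City not in training data: handle_unknown='ignore' encodes it as all zeros instead of raising.
    # With drop='first' that row is indistinguishable from the dropped baseline city, and sklearn warns.
    "city": ["Hyderabad"],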
    "plan_type": ["Standard"],
    "payment_method": ["UPI"]
})

churn_probability = loaded_pipeline.predict_proba(new_customer)[0][1]
print(f"Churn probability: {churn_probability:.3f}")
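What happens to an unseen category can be checked in isolation. The sketch below uses a standalone OneHotEncoder without drop, on made-up city values, to show that handle_unknown="ignore" produces an all-zero row rather than an error.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train_cities = pd.DataFrame({"city": ["Mumbai", "Delhi", "Pune", "Delhi"]})
new_cities = pd.DataFrame({"city": ["Delhi", "Hyderabad"]})   # Hyderabad was never seen in fit

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train_cities)

# Default output is a sparse matrix; convert to a dense array for printing
encoded = encoder.transform(new_cities).toarray()
print(encoder.categories_[0])   # ['Delhi' 'Mumbai' 'Pune']
print(encoded)
# [[1. 0. 0.]
#  [0. 0. 0.]]
```

The model therefore still returns a prediction for the new city, but it carries no city information for that row, so a growing share of unseen categories in production is worth monitoring.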

Version Your Pipeline with the Model

Save the pipeline version, training date, and dataset hash alongside the .pkl file. When a production model degrades, you need to know which pipeline version to roll back to and which training data it saw.
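One possible way to record that information is a small JSON sidecar. This is a sketch under assumptions: the helper dataset_fingerprint, the version label, and the field names are hypothetical, and the sklearn version is included as an extra field because pickled estimators are generally tied to the library version that produced them.

```python
import hashlib
import json
from datetime import datetime, timezone

import pandas as pd
import sklearn


def dataset_fingerprint(df: pd.DataFrame) -> str:
    """Return a SHA-256 hex digest built from pandas' row-wise hashes."""
    row_hashes = pd.util.hash_pandas_object(df, index=True).to_numpy()
    return hashlib.sha256(row_hashes.tobytes()).hexdigest()


# Stand-in for the real training frame
training_frame = pd.DataFrame({
    "monthly_spend": [1200.0, 800.0],
    "plan_type": ["Basic", "Premium"],
})

metadata = {
    "pipeline_version": "v1",                              # hypothetical version label
    "trained_at": datetime.now(timezone.utc).isoformat(),  # training timestamp (UTC)
    "dataset_sha256": dataset_fingerprint(training_frame),
    "sklearn_version": sklearn.__version__,
}

# In practice this string would be written next to the .pkl file
metadata_json = json.dumps(metadata, indent=2)
print(metadata_json)
```

The hash changes whenever any value or index label in the training frame changes, so two models with the same fingerprint were trained on identical rows.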

Data Leakage — The Full Taxonomy

Type 1 — Train/Test Contamination

Fitting any transformer on data that includes test rows.

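from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# WRONG
X_scaled = StandardScaler().fit_transform(X)                      # X includes test rows
X_train, X_test = train_test_split(X_scaled, random_state=42)     # too late, contamination happened

# RIGHT
X_train, X_test = train_test_split(X, random_state=42)            # split first
scaler = StandardScaler().fit(X_train)                            # fit only on training rows
X_train_s = scaler.transform(X_train)
X_test_s  = scaler.transform(X_test)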

Type 2 — Target Leakage

A feature is derived from the target or from events that happen after the prediction point.

# WRONG — "num_claims_this_year" is partially caused by the fraud event being predicted
fraud_df = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "num_claims_this_year": [5, 1, 8],  # includes the fraudulent claim you are predicting
    "is_fraud": [1, 0, 1]               # target
})

# RIGHT — use only pre-event features
fraud_df_clean = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "num_claims_prior_year": [3, 1, 5],  # last year's count — available at prediction time
    "account_age_days": [730, 365, 1095],
    "is_fraud": [1, 0, 1]
})
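A common way to keep such features honest is to compute them as of an explicit prediction date. The following sketch assumes a hypothetical claims table with one row per claim and a hypothetical cutoff of 2024-01-01; only claims dated before the cutoff are counted.

```python
import pandas as pd

claims = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 3, 3],
    "claim_date": pd.to_datetime([
        "2023-03-01", "2023-11-15", "2024-02-10",
        "2023-06-20", "2023-01-05", "2024-03-01",
    ]),
})

prediction_date = pd.Timestamp("2024-01-01")   # hypothetical cutoff: when the prediction is made

# Keep only claims that had already happened at the prediction date
known_claims = claims[claims["claim_date"] < prediction_date]
num_prior_claims = (
    known_claims.groupby("customer_id").size().rename("num_prior_claims")
)
print(num_prior_claims.to_dict())   # {1: 2, 2: 1, 3: 1}
```

The two claims dated after the cutoff are excluded, so the feature contains nothing that could have been caused by the event being predicted.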

Type 3 — Temporal Leakage (Time-Series Data)

Using future information to predict the past in time-series problems.

import pandas as pd
import numpy as np

dates = pd.date_range("2024-01-01", periods=100, freq="D")
sales = pd.DataFrame({
    "date": dates,
    "daily_sales": np.random.randint(100, 500, 100)
})

# WRONG — rolling mean uses data from the future
sales["rolling_7d_mean"] = sales["daily_sales"].rolling(window=7, center=True).mean()
#                                                                  ^^^^^^^^^^^^
# center=True means the window is centered — it uses 3 future days for every row

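# RIGHT: rolling mean uses only past data
sales["rolling_7d_mean"] = sales["daily_sales"].shift(1).rolling(window=7, min_periods=1).mean()
# center defaults to False, so the window is trailing. shift(1) also excludes the current
# day's value, which is the target itself if you are predicting daily_sales.
# min_periods=1 only controls how many observations are needed before a value is emitted.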
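The same rule applies to validation: folds for time-ordered data should always train on earlier rows and validate on later ones. A minimal sketch using sklearn's TimeSeriesSplit, assuming 100 rows already sorted by date (the row count and number of splits are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

n_days = 100                               # rows assumed sorted by date, oldest first
splitter = TimeSeriesSplit(n_splits=4)
folds = list(splitter.split(np.arange(n_days)))

for fold, (train_idx, val_idx) in enumerate(folds):
    # Every training row precedes every validation row
    print(
        f"fold {fold}: train rows 0-{train_idx.max()}, "
        f"validate rows {val_idx.min()}-{val_idx.max()}"
    )
```

A shuffled KFold on the same data would mix future rows into the training folds, which is temporal leakage at the evaluation level even when every individual feature is computed correctly.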

Leakage Is Often Hard to Spot

Leakage frequently hides in feature engineering steps that look innocent. A moving average using center=True. An imputation step fitted before the split. A target encoder fitted on all rows. The question to ask for every feature: "At inference time, what data would I actually have to compute this feature?" If the answer includes anything from after the prediction point or from the test set, it is leakage.
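For the target-encoder case, one common remedy is out-of-fold encoding: each row's encoded value is computed from the targets of other rows only. The sketch below is a hand-rolled illustration on a tiny made-up frame, not a reference implementation; the column name city_te and the fold count are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

train = pd.DataFrame({
    "city": ["Mumbai", "Delhi", "Mumbai", "Pune", "Delhi", "Mumbai", "Pune", "Delhi"],
    "churn": [1, 0, 1, 0, 1, 0, 0, 0],
})
train["city_te"] = np.nan                      # will hold the out-of-fold encoding
te_col = train.columns.get_loc("city_te")

kfold = KFold(n_splits=4, shuffle=True, random_state=0)
for fit_idx, enc_idx in kfold.split(train):
    fit_rows = train.iloc[fit_idx]
    # Category means and the fallback come from the other folds only
    fold_means = fit_rows.groupby("city")["churn"].mean()
    fallback = fit_rows["churn"].mean()
    encoded = train.iloc[enc_idx]["city"].map(fold_means).fillna(fallback)
    train.iloc[enc_idx, te_col] = encoded.to_numpy()

print(train)
```

No row's own churn value contributes to its city_te, so the encoded feature cannot simply memorise the target. At inference time the encoder would instead use category means computed from the whole training set.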

Leakage Checklist

Run through this before finalising any pipeline:

Check	Question
Split order	Did you split the data before fitting any transformer?
Imputation	Did you fit the imputer on training data only?
Scaling	Did you fit the scaler on training data only?
Encoding	Did you fit all encoders on training data only?
Target encoding	Did you use cross-validation or a leave-one-out method?
Rolling features	Do your rolling windows use only past data (no center=True)?
Reference dates	Is your reference date fixed, not computed from the data's date range?
Target features	Does any feature encode the target or a proxy of the target?
Identifier columns	Have you dropped customer_id, transaction_id, and similar columns?

Key Takeaway

A sklearn Pipeline is not just a convenience; it is a correctness tool. Provided you call fit on training data only, every transformer inside it is fitted on that data alone and is applied (not refitted) to validation, test, and production data. If your preprocessing is not inside a Pipeline, it is one mistake away from leakage.
