Cross-Validation¶

A single train/test split is a coin flip. Depending on which rows end up in each set, your accuracy estimate might be 5 percentage points higher or lower than the true generalisation performance. Cross-validation replaces that coin flip with a structured average across multiple splits — giving you a number you can actually trust, along with a measure of how unstable your estimate is.

Learning Objectives¶

Explain why a single split produces high-variance estimates
Implement K-Fold and StratifiedKFold cross-validation correctly
Read and interpret mean ± std output from cross-validation
Apply TimeSeriesSplit to temporal data and explain why standard K-Fold breaks on time series
Set up nested cross-validation for unbiased evaluation when tuning hyperparameters

Why a Single Split Is Not Enough¶

Imagine you split your 1000-row dataset 80/20. Your test set is 200 rows — a random sample of the full distribution. If you had drawn a slightly different 200 rows, your accuracy could easily be 2–5 points different, entirely by chance.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)  # fixed seed: same data on every run
model = LogisticRegression(max_iter=1000)

# Run 10 different random splits — same model, same data
split_scores = []
for seed in range(10):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    model.fit(X_train, y_train)
    split_scores.append(accuracy_score(y_test, model.predict(X_test)))

print(f"Scores across 10 splits: {[round(s, 3) for s in split_scores]}")
print(f"Range: {max(split_scores) - min(split_scores):.3f}")
# Output: Scores across 10 splits: [0.835, 0.870, 0.810, 0.855, 0.825, ...]
# Output: Range: 0.060   ← 6 percentage points of spread from the split alone

That variance is noise from the split, not signal about the model. Cross-validation averages it out.

Info

The variance problem gets worse with smaller datasets. With 200 rows and a 20% test split, the test set has only 40 rows, so every misclassified row moves accuracy by 2.5 percentage points and the estimate is almost meaningless on its own. Cross-validation is not optional for small datasets.
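A rough way to put numbers on this is the binomial standard error of an accuracy estimate. The sketch below assumes the test rows are independent and uses a true accuracy of 0.85, a hypothetical figure chosen only for illustration; the helper name is likewise made up for this example.

```python
import math


def accuracy_standard_error(true_accuracy: float, n_test: int) -> float:
    """Binomial standard error of an accuracy estimate from n_test independent rows."""
    return math.sqrt(true_accuracy * (1 - true_accuracy) / n_test)


# Hypothetical true accuracy of 0.85, chosen only for illustration
se_200 = accuracy_standard_error(0.85, 200)  # 20% of 1000 rows
se_40 = accuracy_standard_error(0.85, 40)    # 20% of 200 rows

print(f"n_test=200: standard error = {se_200:.3f}")  # 0.025
print(f"n_test=40:  standard error = {se_40:.3f}")   # 0.056
```

Under these assumptions, a 40-row test set carries an uncertainty of roughly ±5.6 points for one standard error, more than twice that of the 200-row test set.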

K-Fold Cross-Validation¶

K-Fold divides the data into k folds of (as near as possible) equal size. The model trains on k-1 folds and evaluates on the held-out fold, rotating until every fold has served as the held-out fold once. You get k scores and average them.

Fold 1:  [TEST] [train] [train] [train] [train]
Fold 2:  [train] [TEST] [train] [train] [train]
Fold 3:  [train] [train] [TEST] [train] [train]
Fold 4:  [train] [train] [train] [TEST] [train]
Fold 5:  [train] [train] [train] [train] [TEST]

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

cancer_data = load_breast_cancer()
X_cancer, y_cancer = cancer_data.data, cancer_data.target

cancer_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", RandomForestClassifier(n_estimators=100, random_state=42))
])

cv_scores = cross_val_score(
    cancer_pipeline,
    X_cancer,
    y_cancer,
    cv=5,
    scoring="f1"
)

print(f"F1 per fold: {cv_scores.round(3)}")
print(f"Mean F1:     {cv_scores.mean():.3f}")
print(f"Std F1:      {cv_scores.std():.3f}")
print(f"Result:      {cv_scores.mean():.3f} ± {cv_scores.std():.3f}")
# Output: F1 per fold: [0.971 0.943 0.957 0.971 0.964]
# Output: Mean F1:     0.961
# Output: Std F1:      0.011
# Output: Result:      0.961 ± 0.011

Reading the output:

0.961 — the model's expected F1 on unseen data
± 0.011 — how stable that estimate is; low std means the model performs consistently regardless of which data it trains on
A high std (e.g., ± 0.08) signals a model that is sensitive to which training examples it sees — worth investigating

Tip

k=5 and k=10 are standard choices. k=5 is faster; k=10 gives a slightly lower-bias estimate because each model trains on 90% of the data instead of 80%. For small datasets (< 1000 rows), consider k=10 or even leave-one-out (LOOCV). For large datasets, k=5 is usually sufficient.
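The trade-off can be made concrete by counting model fits and training rows for each strategy. The sketch below assumes a hypothetical 200-row dataset; only the row count matters, so the features are placeholders.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

n_rows = 200                     # hypothetical small dataset
X_small = np.zeros((n_rows, 1))  # placeholder features; only the row count matters

strategies = {
    "k=5": KFold(n_splits=5),
    "k=10": KFold(n_splits=10),
    "LOOCV": LeaveOneOut(),
}

cost_summary = {}
for name, splitter in strategies.items():
    n_fits = splitter.get_n_splits(X_small)
    train_rows = n_rows - n_rows // n_fits  # rows available for training in each fit
    cost_summary[name] = (n_fits, train_rows)
    print(f"{name}: {n_fits} model fits, {train_rows} training rows per fit")

# k=5   ->   5 fits, 160 training rows each
# k=10  ->  10 fits, 180 training rows each
# LOOCV -> 200 fits, 199 training rows each
```

More folds mean more training data per fit (lower bias) at the price of more fits.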

Getting More Detail with `cross_validate`¶

cross_val_score returns only the validation scores. cross_validate returns a dictionary that also includes fit and score times and, with return_train_score=True, the training scores, which lets you detect overfitting.

from sklearn.model_selection import cross_validate

cv_results = cross_validate(
    cancer_pipeline,
    X_cancer,
    y_cancer,
    cv=5,
    scoring="f1",
    return_train_score=True
)

print(f"Train F1: {cv_results['train_score'].mean():.3f} ± {cv_results['train_score'].std():.3f}")
print(f"Val F1:   {cv_results['test_score'].mean():.3f} ± {cv_results['test_score'].std():.3f}")
# Output: Train F1: 1.000 ± 0.000
# Output: Val F1:   0.961 ± 0.011

A large gap between train and validation score (e.g., 1.000 vs 0.750) is a clear sign of overfitting.

Warning

The test_score key in cross_validate output refers to the validation fold, not your held-out test set. This naming is confusing but consistent in scikit-learn. Your actual test set is separate and untouched.
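cross_validate can also score several metrics in one pass. The sketch below is one way to do that; LogisticRegression is swapped in for the random forest purely to keep the example fast, and the variable names are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_breast_cancer(return_X_y=True)

# LogisticRegression is used here only to keep the sketch fast
demo_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=1000)),
])

multi_results = cross_validate(
    demo_pipeline,
    X_demo,
    y_demo,
    cv=5,
    scoring=["f1", "roc_auc"],  # a list of scorer names yields one key pair per metric
    return_train_score=True,
)

for metric in ("f1", "roc_auc"):
    train_mean = multi_results[f"train_{metric}"].mean()
    val_mean = multi_results[f"test_{metric}"].mean()
    print(f"{metric}: train={train_mean:.3f} val={val_mean:.3f} gap={train_mean - val_mean:.3f}")

print(f"Mean fit time per fold: {multi_results['fit_time'].mean():.3f}s")
```

With multiple metrics the keys become test_f1, train_f1, test_roc_auc and so on, rather than test_score and train_score.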

StratifiedKFold for Imbalanced Classes¶

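Standard K-Fold ignores the class labels when it splits: by default it cuts the data into consecutive blocks, and with shuffle=True into random ones. On an imbalanced dataset (say, 95% class 0, 5% class 1) either variant can produce a fold with very few or even zero positive examples, which makes the score for that fold meaningless.

StratifiedKFold keeps the class proportions in each fold as close as possible to those of the full dataset.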
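Counting positives per validation fold shows the difference. The sketch below uses a deliberately extreme, hypothetical label array in which all positives sit at the end, so the unshuffled KFold behaves as badly as it can:

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Worst case for illustration: 950 negatives followed by 50 positives
y_sorted = np.array([0] * 950 + [1] * 50)
X_placeholder = np.zeros((len(y_sorted), 1))  # features are irrelevant to the split


def positives_per_fold(splitter):
    """Count positive labels in each validation fold."""
    return [
        int(y_sorted[test_idx].sum())
        for _, test_idx in splitter.split(X_placeholder, y_sorted)
    ]


plain_counts = positives_per_fold(KFold(n_splits=5))
stratified_counts = positives_per_fold(StratifiedKFold(n_splits=5))

print(f"KFold positives per fold:           {plain_counts}")       # [0, 0, 0, 0, 50]
print(f"StratifiedKFold positives per fold: {stratified_counts}")  # [10, 10, 10, 10, 10]
```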

from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.ensemble import GradientBoostingClassifier

# Simulated imbalanced dataset: roughly 95% class 0, 5% class 1.
# make_classification ties the labels to the features, so F1 has some signal to measure.
X_imbalanced, y_imbalanced = make_classification(
    n_samples=1000, n_features=10, weights=[0.95, 0.05], random_state=42
)

fraud_model = GradientBoostingClassifier(n_estimators=50, random_state=42)

# Stratified K-Fold preserves class balance in each fold
stratified_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

stratified_scores = cross_val_score(
    fraud_model,
    X_imbalanced,
    y_imbalanced,
    cv=stratified_cv,
    scoring="f1"          # F1 is appropriate for imbalanced data
)

print(f"Stratified CV F1: {stratified_scores.mean():.3f} ± {stratified_scores.std():.3f}")

Warning

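cross_val_score with cv=5 (an integer) uses StratifiedKFold automatically for classifiers with binary or multiclass targets, and plain KFold otherwise (for example, for regressors). That default does not shuffle. When the imbalance is severe, pass an explicit StratifiedKFold object so that stratification, shuffling, and random_state are visible in the code rather than implied by the default.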
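One way to inspect what an integer cv resolves to, assuming a reasonably recent scikit-learn, is the check_cv helper in sklearn.model_selection. The label arrays below are hypothetical and exist only to show the two cases:

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold, check_cv

y_binary = np.array([0, 1] * 10)          # hypothetical class labels
y_continuous = np.linspace(0.0, 1.0, 20)  # hypothetical regression target

classifier_cv = check_cv(5, y_binary, classifier=True)
regressor_cv = check_cv(5, y_continuous, classifier=False)

print(type(classifier_cv).__name__)  # StratifiedKFold
print(type(regressor_cv).__name__)   # KFold
print(classifier_cv.shuffle)         # False: the default splitter does not shuffle
```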

TimeSeriesSplit — When You Cannot Shuffle¶

Shuffling a time series before splitting is a form of data leakage. If your training fold contains data from December and your validation fold contains data from the preceding March, the model will have "seen the future" during training. In production, it never gets that advantage.

TimeSeriesSplit enforces chronological ordering: training always happens on older data, validation always happens on newer data.

Split 1:  [train Jan-Mar] | [val Apr]
Split 2:  [train Jan-Apr] | [val May]
Split 3:  [train Jan-May] | [val Jun]
Split 4:  [train Jan-Jun] | [val Jul]
Split 5:  [train Jan-Jul] | [val Aug]

Each validation fold is always in the future relative to its training fold.
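The diagram is simplified. To see the splits scikit-learn actually produces, you can print the indices. This sketch assumes 36 monthly rows stored oldest first (the array name is illustrative):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

month_positions = np.arange(36)  # 36 months in chronological order, oldest first
tscv_demo = TimeSeriesSplit(n_splits=5)

split_summary = []
for split_number, (train_idx, val_idx) in enumerate(tscv_demo.split(month_positions), start=1):
    # Every training index precedes every validation index
    assert train_idx.max() < val_idx.min()
    split_summary.append((len(train_idx), int(val_idx[0]), int(val_idx[-1])))
    print(f"Split {split_number}: train months 0-{train_idx[-1]}, "
          f"val months {val_idx[0]}-{val_idx[-1]}")

# Training sizes grow 6, 12, 18, 24, 30; each validation block is the next 6 months.
```

With 36 rows and n_splits=5, each validation block covers six months rather than the single month shown in the diagram.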

import pandas as pd
import numpy as np
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.linear_model import Ridge

# Monthly sales data for 36 months
np.random.seed(42)
monthly_sales = pd.DataFrame({
    "month":       pd.date_range("2021-01", periods=36, freq="MS"),
    "sales":       500 + np.cumsum(np.random.randn(36) * 20),
    "promo_spend": np.random.uniform(1000, 5000, 36),
    "month_index": range(36)
})

X_sales = monthly_sales[["promo_spend", "month_index"]].values
y_sales = monthly_sales["sales"].values

sales_model = Ridge(alpha=1.0)
tscv = TimeSeriesSplit(n_splits=5)

ts_scores = cross_val_score(
    sales_model,
    X_sales,
    y_sales,
    cv=tscv,
    scoring="neg_mean_absolute_error"
)

print(f"MAE per split: {(-ts_scores).round(1)}")
print(f"Mean MAE:      {(-ts_scores).mean():.1f}")
# Output: MAE per split: [18.3 22.7 15.4 31.2 19.8]
# Output: Mean MAE:      21.5

Warning

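Never use standard K-Fold on time series data. Even without shuffling, every fold except the last is validated on data that is older than some of its training data, and shuffling makes this worse. Future data in the training folds inflates your evaluation score: the model will look better than it actually is, then fail the moment it hits production.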

Info

TimeSeriesSplit is conservative by default: later splits have much more training data than earlier splits, which can make variance across splits look large. This is expected and normal — it reflects the growing training set, not model instability.
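If the growing training set makes splits hard to compare, one option is a rolling window via the max_train_size parameter of TimeSeriesSplit. The 12-month cap below is an arbitrary choice for illustration, not a recommendation:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

month_positions = np.arange(36)  # 36 months, oldest first

expanding = TimeSeriesSplit(n_splits=5)
rolling = TimeSeriesSplit(n_splits=5, max_train_size=12)  # cap training at 12 months

expanding_sizes = [len(train_idx) for train_idx, _ in expanding.split(month_positions)]
rolling_sizes = [len(train_idx) for train_idx, _ in rolling.split(month_positions)]

print(f"Expanding window train sizes: {expanding_sizes}")  # [6, 12, 18, 24, 30]
print(f"Rolling window train sizes:   {rolling_sizes}")    # [6, 12, 12, 12, 12]
```

A rolling window discards older data, so it trades a more even comparison across splits for less training data in the later ones.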

Nested Cross-Validation¶

Here is a subtle but important problem: if you use cross-validation to tune hyperparameters and then report the cross-validation score from that process as your final result, you have overfit to the validation folds. The CV score is now optimistic because you selected the model that happened to do best on those exact folds.

Nested CV separates hyperparameter tuning (inner loop) from model evaluation (outer loop):

Outer fold 1:
  └── Inner CV on outer training data → best hyperparameters
  └── Evaluate best model on outer test fold → score 1

Outer fold 2:
  └── Inner CV on outer training data → best hyperparameters
  └── Evaluate best model on outer test fold → score 2

...

Final result: mean of outer fold scores

from sklearn.model_selection import (
    cross_val_score,
    GridSearchCV,
    KFold
)
from sklearn.svm import SVC
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

cancer_data = load_breast_cancer()
X_cancer, y_cancer = cancer_data.data, cancer_data.target

svc_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("svc", SVC())
])

param_grid = {
    "svc__C": [0.1, 1, 10],
    "svc__kernel": ["linear", "rbf"]
}

# Inner loop: tune hyperparameters
inner_cv = KFold(n_splits=3, shuffle=True, random_state=42)
inner_search = GridSearchCV(svc_pipeline, param_grid, cv=inner_cv, scoring="f1")

# Outer loop: estimate true generalisation performance
outer_cv = KFold(n_splits=5, shuffle=True, random_state=42)
nested_scores = cross_val_score(inner_search, X_cancer, y_cancer, cv=outer_cv, scoring="f1")

print(f"Nested CV F1: {nested_scores.mean():.3f} ± {nested_scores.std():.3f}")
# Output: Nested CV F1: 0.974 ± 0.016
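Passing a GridSearchCV object to cross_val_score hides the two loops. Written out explicitly, the same idea looks roughly like the sketch below, which also shows the hyperparameters chosen in each outer fold. The variable names are illustrative and the scores may differ slightly from the compact version.

```python
from sklearn.base import clone
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_nested, y_nested = load_breast_cancer(return_X_y=True)

search_template = GridSearchCV(
    Pipeline([("scaler", StandardScaler()), ("svc", SVC())]),
    {"svc__C": [0.1, 1, 10], "svc__kernel": ["linear", "rbf"]},
    cv=KFold(n_splits=3, shuffle=True, random_state=42),
    scoring="f1",
)
outer_splitter = KFold(n_splits=5, shuffle=True, random_state=42)

outer_scores = []
for fold_number, (train_idx, test_idx) in enumerate(outer_splitter.split(X_nested), start=1):
    # Inner loop: tune on the outer training rows only
    search = clone(search_template).fit(X_nested[train_idx], y_nested[train_idx])
    # Outer loop: score the refitted best model on rows it has never seen
    fold_f1 = f1_score(y_nested[test_idx], search.predict(X_nested[test_idx]))
    outer_scores.append(fold_f1)
    print(f"Outer fold {fold_number}: best params={search.best_params_}, F1={fold_f1:.3f}")

print(f"Nested CV F1: {sum(outer_scores) / len(outer_scores):.3f}")
```

Different outer folds may select different hyperparameters. That is expected: nested CV evaluates the tuning procedure as a whole, not one fixed configuration.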

Tip

Nested CV is computationally expensive. For a quick sanity check, compare the nested CV score to the best CV score from a simple GridSearchCV. If the simple CV score is significantly higher, it means selection bias is inflating your estimate — use nested CV for the final report.
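The sanity check described in this tip can be sketched as follows. It reuses the same pipeline and grid as above under illustrative names; the two numbers come from different fold layouts, so treat the gap as a rough indicator only.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_check, y_check = load_breast_cancer(return_X_y=True)

check_pipeline = Pipeline([("scaler", StandardScaler()), ("svc", SVC())])
check_grid = {"svc__C": [0.1, 1, 10], "svc__kernel": ["linear", "rbf"]}
inner_splitter = KFold(n_splits=3, shuffle=True, random_state=42)
outer_splitter = KFold(n_splits=5, shuffle=True, random_state=42)

# Simple (non-nested) search: the best score is selected on the same folds it is reported on
simple_search = GridSearchCV(check_pipeline, check_grid, cv=inner_splitter, scoring="f1")
simple_search.fit(X_check, y_check)
simple_score = simple_search.best_score_

# Nested: the outer folds never influence hyperparameter selection
nested_score = cross_val_score(
    GridSearchCV(check_pipeline, check_grid, cv=inner_splitter, scoring="f1"),
    X_check,
    y_check,
    cv=outer_splitter,
    scoring="f1",
).mean()

optimism = simple_score - nested_score
print(f"Simple GridSearchCV best F1: {simple_score:.3f}")
print(f"Nested CV F1:                {nested_score:.3f}")
print(f"Gap (simple - nested):       {optimism:+.3f}")
```

On an easy dataset with a small grid the gap can be close to zero or even slightly negative. It tends to grow as the grid gets larger and the dataset gets smaller.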

Choosing the Right CV Strategy¶

| Situation | Recommended approach |
| --- | --- |
| Standard classification/regression, balanced classes | `KFold(n_splits=5)` or just `cv=5` |
| Imbalanced classes | `StratifiedKFold(n_splits=5)` |
| Time series | `TimeSeriesSplit(n_splits=5)` |
| Small dataset (< 500 rows) | `KFold(n_splits=10)` or `LeaveOneOut` |
| Hyperparameter tuning + evaluation | Nested CV |

Success

The mean ± std pattern is your standard way to report cross-validation results. The mean is the performance estimate. The std shows how much that estimate moves from fold to fold, so it is a rough gauge of how far to trust it. A model with mean 0.82 ± 0.02 is more trustworthy than a model with mean 0.85 ± 0.09.

What's Next¶

You've covered standard K-Fold and StratifiedKFold cross-validation, TimeSeriesSplit for sequential data, nested cross-validation for unbiased hyperparameter tuning, the mean ± std reporting format, and the strategy selection table for each data situation. Next up: 03-regression-evaluation — where you'll learn when to use MAE versus RMSE versus R² versus MAPE, build the three-plot residual diagnostic suite, and compare models against a DummyRegressor baseline to contextualise whether any improvement is actually meaningful.

Optional Deep Dive

Read the sklearn documentation on cross_validate (as distinct from cross_val_score) at https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html — it returns fit time, score time, and multiple metrics per fold simultaneously, which gives you the full picture you need for production model evaluation rather than a single aggregate score.
