Scikit-learn Workflow

Scikit-learn is built around one idea: every algorithm, regardless of how different it is mathematically, speaks the same interface. A logistic regression, a random forest, and a support vector machine are all called the same way. Once you learn the API contract, you can swap algorithms with a single line of code and focus your energy on what actually matters — data quality, features, and evaluation.

This note walks you through the complete sklearn workflow end-to-end, from raw data to a defensible evaluation score.

Learning Objectives

Explain the sklearn estimator API: fit, predict, transform, fit_transform, score
Distinguish transformers, estimators, and pipelines
Build a complete end-to-end ML pipeline with preprocessing and a model
Use cross_val_score and StratifiedKFold for reliable evaluation
Handle mixed-type data (numeric + categorical) inside a single pipeline
Interpret evaluation metrics appropriate to the task type

The Estimator API Contract

Sklearn has a single, consistent interface across all of its objects. Learn this once and you can use any algorithm.

Method	Who has it	What it does
`fit(X, y)`	All estimators	Learn parameters from training data
`predict(X)`	Models (classifiers, regressors)	Return predictions for new data
`predict_proba(X)`	Classifiers that support it	Return class probabilities
`transform(X)`	Transformers (scalers, encoders)	Apply the learned transformation
`fit_transform(X)`	Transformers	Fit then transform in one step (train only)
`score(X, y)`	Models (classifiers, regressors)	Compute the default metric (accuracy for classifiers, R² for regressors)

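from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Assumption: the breast cancer dataset stands in for the data here; an 80/20 split
# of its 569 rows yields the 114 test rows and 2 classes printed below
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)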

# Transformers: fit learns parameters, transform applies them
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std, then scale
X_test_scaled  = scaler.transform(X_test)        # apply learned mean/std, no refit

# Models: fit learns from data, predict applies
model = LogisticRegression(max_iter=10000, random_state=42)
model.fit(X_train_scaled, y_train)   # learn weights
y_pred = model.predict(X_test_scaled)  # apply weights to new data
proba  = model.predict_proba(X_test_scaled)  # class probabilities

print("Prediction shape:", y_pred.shape)      # Output: (114,)
print("Probability shape:", proba.shape)      # Output: (114, 2)
print("Classes:", model.classes_)             # Output: [0 1]

Info

The consistency of this API is a major reason sklearn is so widely used for tabular ML. You can write a function that accepts any sklearn-compatible model, and it works with logistic regression, random forests, gradient boosting, and SVMs alike. This is the power of a well-designed interface.
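
As a rough illustration of that point, the sketch below passes three different classifiers through one helper that relies only on the shared fit and score methods. The helper name fit_and_score and the synthetic data from make_classification are assumptions made for this example, not part of the workflow in this note.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data (assumption: any numeric X with a binary y would do)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)


def fit_and_score(model, X_fit, y_fit, X_eval, y_eval):
    """Hypothetical helper: uses only the shared fit/score interface."""
    model.fit(X_fit, y_fit)
    return model.score(X_eval, y_eval)  # default metric: accuracy for classifiers


demo_models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "SVM": SVC(),
}

# The same call works for every model because they share one interface
demo_scores = {
    name: fit_and_score(model, X_tr, y_tr, X_te, y_te)
    for name, model in demo_models.items()
}

for name, acc in demo_scores.items():
    print(f"{name:<20} accuracy: {acc:.3f}")
```

Swapping in another classifier only means adding one entry to the dictionary; the helper itself does not change.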

Complete End-to-End Workflow

Here is a full ML workflow on the Titanic dataset — a realistic example with missing values, categorical features, and class imbalance.

Step 1: Load and inspect

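import pandas as pd
import numpy as np

# Load Titanic data, a classic binary classification dataset
url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
columns = ["Pclass", "Age", "SibSp", "Fare", "Sex", "Embarked", "Survived"]

try:
    titanic = pd.read_csv(url)[columns]
except OSError:
    # If offline, fall back to this minimal synthetic version.
    # Caveat: Survived is drawn independently of the features here, so models
    # trained on it score near chance (AUC around 0.5), not the values shown later.
    np.random.seed(42)
    n = 891
    titanic = pd.DataFrame({
        "Pclass":   np.random.choice([1, 2, 3], n, p=[0.24, 0.21, 0.55]),
        "Age":      np.where(np.random.rand(n) < 0.2, np.nan,
                             np.random.normal(30, 14, n).clip(1, 80)),
        "SibSp":    np.random.poisson(0.5, n),
        "Fare":     np.random.exponential(32, n).round(2),
        "Sex":      np.random.choice(["male", "female"], n, p=[0.65, 0.35]),
        "Embarked": np.random.choice(["S", "C", "Q", None], n, p=[0.72, 0.18, 0.08, 0.02]),
        "Survived": np.random.choice([0, 1], n, p=[0.62, 0.38]),
    })
    # Store missing ports as NaN, matching what read_csv produces
    titanic["Embarked"] = titanic["Embarked"].where(titanic["Embarked"].notna(), np.nan)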

print(titanic.dtypes)
# Output:
# Pclass       int64
# Age         float64
# SibSp        int64
# Fare        float64
# Sex          object
# Embarked     object
# Survived      int64

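print(titanic.isnull().sum())
# Output (columns with missing values; all others are 0):
# Age         177 in the real dataset (~20% missing)
# Embarked    2 in the real dataset
# The synthetic version gives roughly 178 and 18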

Step 2: Define features and target

feature_cols = ["Pclass", "Age", "SibSp", "Fare", "Sex", "Embarked"]
target_col   = "Survived"

X = titanic[feature_cols]
y = titanic[target_col]

print("Class balance:")
print(y.value_counts(normalize=True).round(3))
# Output (approximate):
# 0    0.621
# 1    0.379

Step 3: Split

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y  # preserve class balance in both sets
)

print(f"Train: {X_train.shape[0]}, Test: {X_test.shape[0]}")
# Output: Train: 712, Test: 179
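
To see what stratify=y buys you, the following sketch checks the positive-class rate in each split. It uses a made-up label vector with roughly the 62/38 balance discussed above (an assumption for illustration), not the Titanic data itself.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative labels only (assumption): 891 rows, about 62% class 0 and 38% class 1
labels = np.array([0] * 552 + [1] * 339)
rows = np.arange(len(labels)).reshape(-1, 1)  # placeholder feature column

rows_tr, rows_te, labels_tr, labels_te = train_test_split(
    rows, labels, test_size=0.2, random_state=42, stratify=labels
)

overall_rate = labels.mean()
train_rate = labels_tr.mean()
test_rate = labels_te.mean()

print(f"Positive rate overall: {overall_rate:.3f}")
print(f"Positive rate train:   {train_rate:.3f}")
print(f"Positive rate test:    {test_rate:.3f}")
```

With stratification the three rates agree to within rounding; without it, the test rate can drift by a few percentage points on a dataset this small.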

Step 4: Build preprocessing pipeline for mixed data

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer

# Separate columns by type
numeric_cols     = ["Age", "Fare", "SibSp"]
categorical_cols = ["Pclass", "Sex", "Embarked"]

# Preprocessing for numeric features:
# 1. Impute missing values with median
# 2. Scale to zero mean, unit variance
numeric_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler",  StandardScaler()),
])

# Preprocessing for categorical features:
# 1. Impute missing values with the most frequent value
# 2. One-hot encode each category into its own 0/1 indicator column
# Note: sparse_output requires scikit-learn 1.2+; older releases call it sparse
categorical_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore", sparse_output=False)),
])

# ColumnTransformer applies different pipelines to different columns
preprocessor = ColumnTransformer([
    ("num", numeric_transformer,     numeric_cols),
    ("cat", categorical_transformer, categorical_cols),
])

Step 5: Build full model pipeline

from sklearn.linear_model import LogisticRegression

model_pipe = Pipeline([
    ("preprocessor", preprocessor),
    ("classifier",   LogisticRegression(max_iter=10000, random_state=42)),
])

model_pipe.fit(X_train, y_train)

Step 6: Evaluate

from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    roc_auc_score,
)

y_pred      = model_pipe.predict(X_test)
y_proba     = model_pipe.predict_proba(X_test)[:, 1]  # probability of class 1

print("Accuracy:", round(accuracy_score(y_test, y_pred), 4))  # round() also works when the metric is a plain float
# Output: Accuracy: 0.7989

print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=["Died", "Survived"]))
# Output:
#               precision    recall  f1-score   support
#         Died       0.83      0.87      0.85       111
#     Survived       0.75      0.69      0.72        68
#     accuracy                           0.80       179

print("ROC-AUC:", round(roc_auc_score(y_test, y_proba), 4))
# Output: ROC-AUC: 0.8541

Tip

ROC-AUC is often more informative than accuracy for classification. Accuracy becomes misleading when classes are imbalanced: a model that predicts "died" for everyone gets about 62% accuracy on Titanic without learning anything. AUC measures the model's ability to rank positives above negatives, regardless of threshold.
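
The majority-class argument can be checked directly. This sketch uses DummyClassifier on made-up labels with a 62/38 split (an assumption chosen to mirror the figure above) and compares the two metrics.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, roc_auc_score

# Illustrative labels (assumption): 62% class 0, 38% class 1
y_true = np.array([0] * 620 + [1] * 380)
X_dummy = np.zeros((len(y_true), 1))  # features are irrelevant to this baseline

# Always predicts the most frequent class, i.e. "died" for everyone
baseline = DummyClassifier(strategy="most_frequent").fit(X_dummy, y_true)
baseline_pred = baseline.predict(X_dummy)
baseline_proba = baseline.predict_proba(X_dummy)[:, 1]

baseline_acc = accuracy_score(y_true, baseline_pred)
baseline_auc = roc_auc_score(y_true, baseline_proba)

print(f"Accuracy: {baseline_acc:.2f}")  # 0.62
print(f"ROC-AUC:  {baseline_auc:.2f}")  # 0.50
```

The baseline looks respectable on accuracy, yet its AUC of 0.5 shows it cannot rank a single survivor above a non-survivor.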

Reading the Classification Report

# The classification report gives four numbers per class
#
# precision: of all predicted positive, how many were actually positive?
#            TP / (TP + FP)
#
# recall:    of all actual positives, how many did the model find?
#            TP / (TP + FN)
#
# f1-score:  harmonic mean of precision and recall
#            2 * (precision * recall) / (precision + recall)
#
# support:   number of true instances in this class
#
# When to optimise for precision: when false positives are costly
#   (e.g. flagging a good customer as fraudulent)
#
# When to optimise for recall: when false negatives are costly
#   (e.g. missing a cancer diagnosis)

# Confusion matrix: rows are actual, columns are predicted
cm = confusion_matrix(y_test, y_pred)
print("Confusion matrix:")
print(cm)
# Output:
# [[TN  FP]
#  [FN  TP]]
#
# Top-left: correctly predicted "Died"
# Bottom-right: correctly predicted "Survived"
# Top-right: predicted "Survived" but actually "Died" (false positives)
# Bottom-left: predicted "Died" but actually "Survived" (false negatives)
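
The formulas above can be verified by hand from the four confusion-matrix cells. The sketch below does that on ten invented labels (an assumption for illustration) and compares the results with sklearn's own metric functions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Hand-made example (assumption): 6 actual negatives, 4 actual positives
y_actual = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_predicted = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])

# ravel() flattens [[TN, FP], [FN, TP]] in row order
tn, fp, fn, tp = confusion_matrix(y_actual, y_predicted).ravel()

precision_manual = tp / (tp + fp)  # 3 / 5
recall_manual = tp / (tp + fn)     # 3 / 4
f1_manual = 2 * precision_manual * recall_manual / (precision_manual + recall_manual)

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"precision: {precision_manual:.3f} vs {precision_score(y_actual, y_predicted):.3f}")
print(f"recall:    {recall_manual:.3f} vs {recall_score(y_actual, y_predicted):.3f}")
print(f"f1:        {f1_manual:.3f} vs {f1_score(y_actual, y_predicted):.3f}")
```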

Cross-Validation for Reliable Evaluation

A single train/test split gives you one number. Cross-validation gives you a distribution — how much does performance vary across different subsets of the data? That variance is information.

from sklearn.model_selection import cross_val_score, StratifiedKFold

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

scores = cross_val_score(
    model_pipe,
    X, y,
    cv=cv,
    scoring="roc_auc",
)

print("Fold AUC scores:", scores.round(4))
# Output: Fold AUC scores: [0.8534 0.8412 0.8701 0.8623 0.8488]

print(f"Mean AUC: {scores.mean():.4f}")  # Output: Mean AUC: 0.8552
print(f"Std AUC:  {scores.std():.4f}")   # Output: Std AUC:  0.0100

Interpreting cross-validation output

Mean = 0.8552, Std = 0.0100

- Good mean, low variance → stable, reliable model
- Good mean, high variance → model works but results depend heavily on which data you use
- Low mean → model needs improvement (features, algorithm, hyperparameters)
- Very high mean (> 0.99) on a real dataset → check for leakage

Warning

A standard deviation close to zero is not always good. If all 5 folds give exactly 0.97, it might mean your dataset has very low variance — or that leakage is making every fold trivially easy to predict. Investigate either case.

Comparing Multiple Models

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=10000, random_state=42),
    "Decision Tree":       DecisionTreeClassifier(max_depth=5, random_state=42),
    "Random Forest":       RandomForestClassifier(n_estimators=100, random_state=42),
    "SVM":                 SVC(probability=True, random_state=42),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
results = {}

for name, clf in candidates.items():
    pipe = Pipeline([
        ("preprocessor", preprocessor),
        ("classifier",   clf),
    ])
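    # Compare on the training split only, so the test set stays untouched for the final check
    scores = cross_val_score(pipe, X_train, y_train, cv=cv, scoring="roc_auc")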
    results[name] = scores

    print(f"{name:<25} AUC: {scores.mean():.4f} ± {scores.std():.4f}")

# Output (approximate):
# Logistic Regression       AUC: 0.8552 ± 0.0100
# Decision Tree             AUC: 0.8101 ± 0.0231
# Random Forest             AUC: 0.8712 ± 0.0119
# SVM                       AUC: 0.8634 ± 0.0142

Tip

Choose your model based on cross-validation results, not test set results. After comparing models on cross-validation, pick one, retrain it on all training data, and evaluate it once on the test set. That single final test set evaluation is your honest reported performance.
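
One way to follow that advice is sketched here: cross-validate candidates on the training split, pick the best mean score, refit on all training data, and score the test split exactly once. The synthetic data and the names X_dev, X_hold, and options are assumptions for this example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption)
X_all, y_all = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=0
)
X_dev, X_hold, y_dev, y_hold = train_test_split(
    X_all, y_all, test_size=0.2, random_state=0, stratify=y_all
)

options = {
    "logreg": Pipeline([
        ("scaler", StandardScaler()),
        ("classifier", LogisticRegression(max_iter=1000)),
    ]),
    "forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# 1. Model selection uses cross-validation on the training split only
folds_sel = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_means = {
    name: cross_val_score(est, X_dev, y_dev, cv=folds_sel, scoring="roc_auc").mean()
    for name, est in options.items()
}
best_name = max(cv_means, key=cv_means.get)

# 2. Refit the winner on all training data
final_model = options[best_name].fit(X_dev, y_dev)

# 3. Touch the held-out test split exactly once
final_auc = roc_auc_score(y_hold, final_model.predict_proba(X_hold)[:, 1])

print("CV means:", {k: round(v, 4) for k, v in cv_means.items()})
print(f"Selected: {best_name}, test AUC: {final_auc:.4f}")
```

If the test AUC comes out lower than the cross-validation mean, report it anyway; going back to try another model at that point turns the test set into a second validation set.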

Common Workflow Mistakes

Warning

Calling fit on test data. This applies even to a transformer: scaler.fit_transform(X_test) recomputes the scaler's mean and standard deviation from the test data. This is leakage, and it fails silently because no error is raised. Always fit on train, then transform test.

Warning

Forgetting to include preprocessing in cross-validation. If you scale before passing to cross_val_score, all five folds use the same scaler fitted on the full dataset — that scaler has seen validation fold data. Always wrap preprocessing in a Pipeline before cross-validation.
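
To confirm that a Pipeline really refits its scaler inside every fold, you can ask cross_validate to return the fitted estimators and inspect them. This is a sketch on synthetic data (an assumption); the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption)
X_cv, y_cv = make_classification(n_samples=200, n_features=5, random_state=0)

cv_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=1000)),
])
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# return_estimator=True keeps the pipeline fitted on each training fold
cv_out = cross_validate(
    cv_pipe, X_cv, y_cv, cv=folds, scoring="roc_auc", return_estimator=True
)

full_mean = X_cv.mean(axis=0)  # what a scaler fitted on ALL rows would learn
fold_means = [est.named_steps["scaler"].mean_ for est in cv_out["estimator"]]

for i, fold_mean in enumerate(fold_means):
    print(f"Fold {i}: first-feature mean {fold_mean[0]:+.4f} "
          f"(full data: {full_mean[0]:+.4f})")
```

Each fold's scaler holds a slightly different mean because it only ever saw that fold's training rows, which is exactly the isolation that scaling before cross_val_score would lose.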

Warning

Interpreting a high score() result without checking what metric it uses. LogisticRegression.score() returns accuracy. On an imbalanced dataset, accuracy is misleading. Always specify the metric explicitly: cross_val_score(..., scoring="roc_auc").
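
A quick check makes the point concrete. In this sketch, on synthetic imbalanced data (an assumption, with about 10% positives), score() matches accuracy_score exactly, while ROC-AUC has to be requested explicitly.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (assumption): roughly 90% class 0, 10% class 1
X_imb, y_imb = make_classification(
    n_samples=1000, n_features=6, weights=[0.9, 0.1], random_state=0
)
Xi_tr, Xi_te, yi_tr, yi_te = train_test_split(
    X_imb, y_imb, test_size=0.25, random_state=0, stratify=y_imb
)

clf_imb = LogisticRegression(max_iter=1000).fit(Xi_tr, yi_tr)

default_score = clf_imb.score(Xi_te, yi_te)  # no metric named: accuracy
explicit_accuracy = accuracy_score(yi_te, clf_imb.predict(Xi_te))
explicit_auc = roc_auc_score(yi_te, clf_imb.predict_proba(Xi_te)[:, 1])

print(f"score():        {default_score:.4f}")
print(f"accuracy_score: {explicit_accuracy:.4f}")
print(f"roc_auc_score:  {explicit_auc:.4f}")
```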

Warning

Using fit_transform on test data. fit_transform(X_test) is equivalent to fit(X_test).transform(X_test) — it refits the transformer on the test set. For test data, always use .transform() only.
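
The difference is easy to see on a tiny example. In this sketch (toy numbers, chosen as an assumption so the arithmetic is visible), the test values lie well above the training range, and only transform() preserves that fact.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy single-feature data (assumption): test values sit above the training range
train_col = np.array([[10.0], [20.0], [30.0]])  # mean 20
test_col = np.array([[40.0], [50.0]])           # mean 45

# Correct: statistics come from the training data only
scaler_ok = StandardScaler().fit(train_col)
correct = scaler_ok.transform(test_col)

# Mistake: fit_transform refits on the test data itself
wrong = StandardScaler().fit_transform(test_col)

print("transform (train stats):    ", correct.ravel().round(4))
print("fit_transform (test stats): ", wrong.ravel().round(4))
```

The correct version reports both test points as far above the training mean (about 2.45 and 3.67 standard deviations). The refitted version recentres them to -1 and +1, hiding the shift from the model.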

The Workflow Checklist

Before you submit or present any ML result, verify each item:

[ ] Problem is clearly defined (what decision does the model support?)
[ ] Target column is correct and not leaky
[ ] Identifier columns removed from features
[ ] Data split BEFORE any preprocessing
[ ] Preprocessing is inside a Pipeline (or manually done on train, applied to test)
[ ] Model evaluated on held-out data (test set or cross-validation)
[ ] Reported metric is appropriate for the task and class balance
[ ] Test set was touched only once
[ ] Model limitations are stated (where does it fail? what is the error rate?)

Interview Questions

Q1: What does fit() do? What does transform() do?

fit() learns parameters from training data — for a StandardScaler, it computes the mean and standard deviation; for a LogisticRegression, it learns the weights. These parameters are stored in the object.

transform() applies the previously learned transformation to new data without relearning parameters. For a StandardScaler, it subtracts the mean and divides by the standard deviation computed during fit().

fit_transform() is a convenience that calls both in sequence. It should only be used on training data.

Q2: Why use a Pipeline instead of preprocessing manually?

Two reasons. First, correctness: a Pipeline ensures that preprocessing is always fitted on training data only. When you pass a pipeline to cross_val_score, each fold refits the transformers on its own training fold — preventing leakage. If you preprocess manually before cross-validation, all folds see the same transformers fitted on the full dataset, which is leakage.

Second, reproducibility: a pipeline bundles all preprocessing and modelling decisions into a single serialisable object. You can save it with joblib.dump() and reload it with joblib.load(), and know that predictions will always use the same preprocessing steps applied in the same order.
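
A minimal round trip might look like the following. The file name model_pipe.joblib and the synthetic data are assumptions for illustration; a temporary directory keeps the example from leaving files behind.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption)
X_save, y_save = make_classification(n_samples=100, n_features=4, random_state=0)

saved_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=1000)),
]).fit(X_save, y_save)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model_pipe.joblib")  # hypothetical file name
    joblib.dump(saved_pipe, path)    # writes scaler and model together
    loaded_pipe = joblib.load(path)  # restores the fitted pipeline

# The reloaded pipeline applies the same scaling and weights
same_predictions = np.array_equal(
    saved_pipe.predict(X_save), loaded_pipe.predict(X_save)
)
print("Predictions identical after reload:", same_predictions)
```

Only load joblib files from sources you trust, and reload them with the same scikit-learn version that wrote them where possible.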

Q3: You cross-validate a model and get mean AUC 0.92 ± 0.18. What does that tell you?

The high standard deviation (0.18) is concerning. It means model performance varies enormously across folds — some folds may score 0.74 and others 1.00. This suggests the model is unstable: it works well on some data distributions and poorly on others. Possible causes include: small dataset making each fold very different; class imbalance causing some folds to have very few positive examples; or leakage affecting some folds but not others. Investigate before trusting the mean.

Q4: What is ColumnTransformer and why do you need it?

ColumnTransformer applies different preprocessing pipelines to different columns simultaneously. In most real datasets, numeric and categorical features need different treatment — numerics need imputation and scaling, categoricals need imputation and one-hot encoding. ColumnTransformer lets you specify which transformer to apply to which columns, then combines the outputs into a single feature matrix. Without it, you would have to manually split, transform, and recombine columns — which is error-prone and breaks the Pipeline pattern.

What's Next

You've covered the complete sklearn workflow from data loading through ColumnTransformer pipelines to model comparison and the pre-submission checklist — including fit/transform semantics, cross-validation interpretation, and classification report reading. Next up: 05-exercises — where you'll apply the full workflow to real datasets, building pipelines from scratch and diagnosing common mistakes under exam-style constraints.

Optional Deep Dive

Work through the sklearn tutorial "An introduction to machine learning with scikit-learn" at https://scikit-learn.org/stable/tutorial/basic/tutorial.html — it uses the Iris and digits datasets to walk through the exact fit/predict/score workflow covered here, reinforcing the API patterns with hands-on examples you can run in a notebook.
