Modeling Pipeline — Churn Prediction

You have a clean dataset and a preprocessing pipeline. Now you build, compare, and select a model. The workflow is always the same: start with a dumb baseline so you know what "beating nothing" looks like, train candidates, compare them on cross-validation (not the test set), tune the best candidate, and select the final model with a written rationale.

The test set is touched exactly once — in 05-evaluation-and-report.md. Not here.

Step 1 — Majority Class Baseline

Every project starts here. If your model cannot beat a classifier that predicts the majority class for every row, you have not built anything useful.

from sklearn.dummy import DummyClassifier
from sklearn.metrics import f1_score, classification_report
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline   # needed for the Pipeline(...) calls below

baseline = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("classifier",   DummyClassifier(strategy="most_frequent", random_state=42)),
])

# Cross-validate on training data only
baseline_scores = cross_val_score(
    baseline, X_train, y_train,
    cv=5,
    scoring="f1",          # F1 on the positive class (churn=1)
)

print(f"Baseline F1 (CV): {baseline_scores.mean():.3f} ± {baseline_scores.std():.3f}")
# Output: Baseline F1 (CV): 0.000 ± 0.000

The majority class classifier predicts "no churn" for every customer. F1 for the churn class is 0 because recall is 0 — it never predicts a churn. Any model you build must beat this.
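To see why the score is exactly zero while accuracy still looks respectable, you can work the arithmetic by hand. The sketch below uses a made-up label vector (10 customers, 3 churners; not the project data) and a hand-written F1 helper that returns 0 when a ratio is undefined, which is also what scikit-learn reports by default (with a warning).

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Toy labels for illustration only: 3 churners out of 10 customers
y_true = [0, 0, 1, 0, 1, 0, 0, 1, 0, 0]
y_majority = [0] * len(y_true)   # "no churn" for every row

accuracy = sum(t == p for t, p in zip(y_true, y_majority)) / len(y_true)
precision, recall, f1 = precision_recall_f1(y_true, y_majority)
print(f"accuracy={accuracy:.2f}  precision={precision:.2f}  recall={recall:.2f}  f1={f1:.2f}")
# prints: accuracy=0.70  precision=0.00  recall=0.00  f1=0.00
```

This is why the pipeline scores on F1 for the churn class rather than accuracy: a model that finds no churners can still look 70% "correct" on these toy labels.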

Info

A more informative baseline is a single-feature rule: predict churn if contract_type == "Month-to-Month". Try it. It scores around 0.55 F1. That is the real bar: beating a rule-of-thumb heuristic, not just a classifier that always predicts the majority class.

# Heuristic baseline: predict churn if Month-to-Month contract
heuristic_preds = (X_train["contract_type"] == "Month-to-Month").astype(int)
print(f"Heuristic F1: {f1_score(y_train, heuristic_preds):.3f}")
# Output: Heuristic F1: 0.556  (approximate)
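The heuristic above is scored once on the full training set, so it has no fold-to-fold spread to compare against the cross-validated models. One possible way to get a comparable mean and standard deviation is to score the same rule on each validation fold. The sketch below assumes a synthetic stand-in for X_train and y_train (the churn probabilities are invented for illustration) and uses an unshuffled StratifiedKFold, which is what cv=5 resolves to for a classifier.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for X_train / y_train (NOT the project data)
rng = np.random.default_rng(42)
n = 1000
contract = rng.choice(["Month-to-Month", "One Year", "Two Year"], size=n, p=[0.55, 0.25, 0.20])
churn_prob = np.where(contract == "Month-to-Month", 0.45, 0.08)   # invented rates
X_demo = pd.DataFrame({"contract_type": contract})
y_demo = pd.Series((rng.random(n) < churn_prob).astype(int))

# Score the fixed rule on each validation fold; there is nothing to fit
skf = StratifiedKFold(n_splits=5)
fold_scores = []
for _, val_idx in skf.split(X_demo, y_demo):
    preds = (X_demo.iloc[val_idx]["contract_type"] == "Month-to-Month").astype(int)
    fold_scores.append(f1_score(y_demo.iloc[val_idx], preds))

fold_scores = np.array(fold_scores)
print(f"Heuristic F1 (5 folds): {fold_scores.mean():.3f} ± {fold_scores.std():.3f}")
```

In the notebook you would swap X_demo and y_demo for X_train and y_train; the rule itself does not change.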

Step 2 — Candidate Models

Train three models using 5-fold cross-validation. All share the same preprocessor from 03-feature-engineering.md.

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000, random_state=42),
    "Random Forest":       RandomForestClassifier(n_estimators=100, random_state=42),
    "Gradient Boosting":   GradientBoostingClassifier(n_estimators=100, random_state=42),
}

cv_results = {}
for name, clf in candidates.items():
    pipe = Pipeline(steps=[
        ("preprocessor", preprocessor),
        ("classifier",   clf),
    ])
    scores = cross_val_score(pipe, X_train, y_train, cv=5, scoring="f1")
    cv_results[name] = scores
    print(f"{name:25s}  F1: {scores.mean():.3f} ± {scores.std():.3f}")

# Output (approximate — exact values vary slightly):
# Logistic Regression       F1: 0.631 ± 0.028
# Random Forest             F1: 0.668 ± 0.024
# Gradient Boosting         F1: 0.692 ± 0.021

Success

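Gradient Boosting has the highest mean F1 and the lowest standard deviation across folds (0.021 versus 0.024 and 0.028), so it is both the strongest and the most consistent candidate here. Its lead over Random Forest (0.024) is about the size of one fold-to-fold standard deviation, so treat these as good signals rather than proof. Record these numbers; you will reference them in your report.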
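Means and standard deviations hide whether one model wins on most folds or only on one lucky fold. Because cv=5 produces the same unshuffled stratified folds on every call, the per-fold scores of two candidates can be compared pairwise. The helper below is a hypothetical sketch (paired_fold_summary is not a scikit-learn function) and the fold scores are placeholders, not results from this project.

```python
from statistics import mean, stdev

def paired_fold_summary(scores_a, scores_b):
    """Summarize per-fold differences (a - b) for two models scored on the same folds."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    return {
        "mean_diff": mean(diffs),
        "std_diff": stdev(diffs),            # sample standard deviation of the differences
        "folds_a_wins": sum(d > 0 for d in diffs),
        "n_folds": len(diffs),
    }

# Placeholder fold scores for illustration only; in the notebook, pass
# cv_results["Gradient Boosting"] and cv_results["Random Forest"] instead.
gb_folds = [0.70, 0.68, 0.71, 0.66, 0.69]
rf_folds = [0.67, 0.69, 0.68, 0.64, 0.66]

summary = paired_fold_summary(gb_folds, rf_folds)
print(f"mean diff: {summary['mean_diff']:+.3f}, "
      f"first model ahead in {summary['folds_a_wins']} of {summary['n_folds']} folds")
```

A model that is ahead in most folds gives you a stronger sentence for the rationale than a mean alone.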

Step 3 — Hyperparameter Tuning

Tune the best candidate (Gradient Boosting) with RandomizedSearchCV rather than GridSearchCV. For large search spaces, random sampling usually finds good settings in far fewer fits than an exhaustive grid.
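A rough cost comparison shows why. The sketch assumes a hypothetical grid with five values for each of the five hyperparameters tuned in this step; the grid sizes are an illustration, not a recommendation.

```python
from math import prod

# Hypothetical grid: 5 candidate values per hyperparameter (assumption)
values_per_param = {
    "n_estimators": 5,
    "max_depth": 5,
    "learning_rate": 5,
    "subsample": 5,
    "min_samples_split": 5,
}
cv_folds = 5
n_iter = 40   # random combinations, as in the search in this step

grid_combinations = prod(values_per_param.values())
grid_fits = grid_combinations * cv_folds
random_fits = n_iter * cv_folds

print(f"Grid search:   {grid_combinations} combinations, {grid_fits} fits")
print(f"Random search: {n_iter} combinations, {random_fits} fits")
# prints:
# Grid search:   3125 combinations, 15625 fits
# Random search: 40 combinations, 200 fits
```

The random search also samples continuous ranges such as learning_rate directly, where a grid would be limited to the handful of values you listed.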

from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform

param_dist = {
    "classifier__n_estimators":   randint(50, 300),
    "classifier__max_depth":      randint(2, 8),
    "classifier__learning_rate":  uniform(0.01, 0.29),   # samples from [0.01, 0.30]
    "classifier__subsample":      uniform(0.6, 0.4),      # samples from [0.6, 1.0]
    "classifier__min_samples_split": randint(2, 20),
}

gb_pipeline = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("classifier",   GradientBoostingClassifier(random_state=42)),
])

search = RandomizedSearchCV(
    gb_pipeline,
    param_distributions=param_dist,
    n_iter=40,              # 40 random combinations
    cv=5,
    scoring="f1",
    n_jobs=-1,              # use all CPU cores
    random_state=42,
    verbose=1,
)

search.fit(X_train, y_train)

print(f"\nBest CV F1:  {search.best_score_:.3f}")
print(f"Best params: {search.best_params_}")

# Output (approximate):
# Fitting 5 folds for each of 40 candidates, totalling 200 fits
#
# Best CV F1:  0.714
# Best params: {
#   'classifier__learning_rate': 0.087,
#   'classifier__max_depth': 4,
#   'classifier__min_samples_split': 6,
#   'classifier__n_estimators': 187,
#   'classifier__subsample': 0.81
# }
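The scipy distributions in param_dist are easy to misread: uniform takes (loc, scale), not (low, high), and randint excludes its upper bound. A quick sampling check, shown here as a standalone sketch, confirms the ranges before you spend 200 fits on them.

```python
from scipy.stats import randint, uniform

learning_rate = uniform(0.01, 0.29)   # loc=0.01, scale=0.29 -> values in [0.01, 0.30]
max_depth = randint(2, 8)             # integers 2..7; the upper bound 8 is excluded

lr_samples = learning_rate.rvs(size=10_000, random_state=0)
depth_samples = max_depth.rvs(size=10_000, random_state=0)

print(f"learning_rate range: {lr_samples.min():.3f} to {lr_samples.max():.3f}")
print(f"max_depth values:    {sorted(set(depth_samples.tolist()))}")
```

If you want max_depth to include 8, the upper argument would need to be 9.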

Warning

The CV score from RandomizedSearchCV is optimistic. You searched over 40 parameter combinations — some improvement is due to chance. The test set score in the next file is the only honest number. Do not report the CV score as your model's performance.
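The chance component can be isolated with a small simulation. It assumes, purely for illustration, that all 40 configurations have the same true F1 and that each CV estimate is that value plus Gaussian noise; TRUE_F1 and FOLD_NOISE are invented numbers, not measurements from this project.

```python
import random

random.seed(0)

TRUE_F1 = 0.69      # assumed: every configuration is equally good
FOLD_NOISE = 0.02   # assumed: spread of a CV estimate around the true value
N_CANDIDATES = 40   # matches n_iter=40
N_REPEATS = 1000    # repeat the whole "search" to average out luck

best_scores = []
for _ in range(N_REPEATS):
    cv_estimates = [random.gauss(TRUE_F1, FOLD_NOISE) for _ in range(N_CANDIDATES)]
    best_scores.append(max(cv_estimates))   # what best_score_ would report

avg_best = sum(best_scores) / N_REPEATS
print(f"True F1 of every candidate: {TRUE_F1:.3f}")
print(f"Average 'best CV F1':       {avg_best:.3f}")
```

Even though no configuration is actually better in this toy setup, the reported best score lands well above the true value. In a real search the candidates do differ, so only part of the gain is noise, but you cannot tell how much from best_score_ alone.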

Step 4 — Final Model Selection

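# The best estimator from RandomizedSearchCV is already a fitted Pipeline
# (refit=True by default, so it was retrained on all of X_train)
best_model = search.best_estimator_

# Collect all CV results for the final comparison table
import pandas as pd

results_df = pd.DataFrame({
    "Model":    ["Majority Class Baseline", "Heuristic (M2M contract)"] + list(candidates.keys()) + ["GB Tuned"],
    # Computed from the earlier steps rather than hard-coded. The heuristic has no
    # fit step, so its entry is F1 on the full training set, not a CV mean.
    "CV F1":    [baseline_scores.mean(), f1_score(y_train, heuristic_preds)]
                + [cv_results[name].mean() for name in candidates]
                + [search.best_score_],
    "Selected": [False, False, False, False, False, True],
}).round(3)
print(results_df.to_string(index=False))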

# Output:
#                    Model  CV F1  Selected
#  Majority Class Baseline  0.000     False
# Heuristic (M2M contract)  0.556     False
#      Logistic Regression  0.631     False
#            Random Forest  0.668     False
#        Gradient Boosting  0.692     False
#                 GB Tuned  0.714      True

Model selection rationale (write this in your notebook):

Gradient Boosting was selected over Logistic Regression and Random Forest because it achieved the highest cross-validated F1 on the churn class (0.692 before tuning, 0.714 after). Its slightly lower standard deviation across folds compared to Random Forest (0.021 versus 0.024) suggests it is no more sensitive to which rows end up in which fold. Logistic Regression is faster and more interpretable but gives up about 6 F1 points against untuned Gradient Boosting (0.631 versus 0.692) and about 8 against the tuned model, which in a business context translates to missing more at-risk customers, raising more false alarms, or both.
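When you quote a gap in "F1 points", state which two numbers it compares. The snippet below recomputes the gaps from the CV F1 values in the comparison table above, treating one point as 0.01 F1.

```python
# CV F1 values copied from the comparison table above
cv_f1 = {
    "Logistic Regression": 0.631,
    "Random Forest": 0.668,
    "Gradient Boosting": 0.692,
    "GB Tuned": 0.714,
}

reference = "Logistic Regression"
gaps = {name: round((score - cv_f1[reference]) * 100, 1) for name, score in cv_f1.items()}

for name, gap in gaps.items():
    print(f"{name:20s} {cv_f1[name]:.3f}  ({gap:+.1f} points vs {reference})")
# prints:
# Logistic Regression  0.631  (+0.0 points vs Logistic Regression)
# Random Forest        0.668  (+3.7 points vs Logistic Regression)
# Gradient Boosting    0.692  (+6.1 points vs Logistic Regression)
# GB Tuned             0.714  (+8.3 points vs Logistic Regression)
```

Note that the 8-point figure compares a tuned model against an untuned one; the like-for-like gap between untuned models is closer to 6 points.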

Tip

Always write your model selection rationale in prose, not just a table. Interviewers and reviewers ask "why did you pick that model?" If your answer is "it had the highest number," you have not done model selection — you have done model picking. The rationale should reference the business context, the metric, and the tradeoff you accepted.

Step 5 — What Not to Do Here

These are the mistakes that actually appear in student project reviews:

Fitting on test data. Even calling preprocessor.fit_transform(X_test) in a cell you never use in training is a red flag. Reviewers look for it.
Reporting test set scores during model selection. Once you look at the test set, it is no longer a test set — it is a validation set. Keep it sealed until the final evaluation.
Skipping cross-validation. Picking the model with the best single train/test split score is not model selection. A lucky split can make a weak model look strong.
Mismatched parameter keys in the search space. Keys in param_dist must carry the pipeline step prefix ("classifier__n_estimators", not "n_estimators"). scikit-learn does not silently fall back to default parameters for an unknown key: the pipeline's set_params raises a ValueError for an invalid parameter name, so a search with a bad key fails instead of running untuned. Still print search.best_params_ after every search and confirm that every parameter you meant to tune appears in it with a value inside the range you specified.
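For the parameter-key mistake in the list above, you can check your keys against the pipeline before launching a search. This is a minimal sketch that uses StandardScaler as a stand-in for the project's preprocessor (an assumption; any transformer works for the purpose of listing parameter names).

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in pipeline: StandardScaler replaces the project's preprocessor (assumption)
demo_pipeline = Pipeline(steps=[
    ("preprocessor", StandardScaler()),
    ("classifier", GradientBoostingClassifier(random_state=42)),
])

# Every tunable name, including nested step parameters such as classifier__max_depth
valid_keys = set(demo_pipeline.get_params().keys())
param_keys = ["classifier__n_estimators", "n_estimators"]   # second key lacks the step prefix

for key in param_keys:
    status = "ok" if key in valid_keys else "NOT a pipeline parameter"
    print(f"{key:28s} {status}")

# set_params rejects the unprefixed key instead of ignoring it
try:
    demo_pipeline.set_params(n_estimators=200)
    raised = False
except ValueError as exc:
    raised = True
    print(f"ValueError: {str(exc)[:60]}...")
```

Running the same membership check on the keys of param_dist against gb_pipeline.get_params() takes a second and catches typos before the 200 fits start.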
