Machine Learning Cheat Sheet¶
A dense reference for practitioners. Each entry covers when to reach for a tool, what to watch out for, and a runnable code snippet.
Core Concepts¶
Bias-Variance Tradeoff¶
High bias = model too simple, underfits training data, high training error. High variance = model memorizes training data, performs poorly on unseen data, large gap between train and test error. The goal is to find the complexity level where both are low enough that generalization is good.
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score
import numpy as np
# High bias: linear model on nonlinear data
linear_model = LinearRegression()
linear_scores = cross_val_score(linear_model, X_train, y_train, cv=5, scoring='r2')
print(f"Linear CV R²: {linear_scores.mean():.3f} ± {linear_scores.std():.3f}")
# High variance: unconstrained tree overfits
deep_tree = DecisionTreeRegressor(max_depth=None, random_state=42)
tree_scores = cross_val_score(deep_tree, X_train, y_train, cv=5, scoring='r2')
print(f"Deep Tree CV R²: {tree_scores.mean():.3f} ± {tree_scores.std():.3f}")
# Balanced: constrained tree
pruned_tree = DecisionTreeRegressor(max_depth=5, min_samples_leaf=10, random_state=42)
pruned_scores = cross_val_score(pruned_tree, X_train, y_train, cv=5, scoring='r2')
print(f"Pruned Tree CV R²: {pruned_scores.mean():.3f} ± {pruned_scores.std():.3f}")
Overfitting / Underfitting Signals¶
Watch the gap between training score and validation score.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
train_acc = accuracy_score(y_train, model.predict(X_train))
val_acc = accuracy_score(y_val, model.predict(X_val))
print(f"Train accuracy: {train_acc:.3f}")
print(f"Val accuracy: {val_acc:.3f}")
print(f"Gap: {train_acc - val_acc:.3f}")
# Output:
# Train accuracy: 0.998 ← suspiciously perfect
# Val accuracy: 0.834 ← large gap → overfitting
# Gap: 0.164
Warning
A near-perfect training score with a large train/val gap is the clearest signal of overfitting. Reduce model complexity, add regularization, or get more data.
The scikit-learn fit/predict API Pattern¶
Every estimator in scikit-learn follows the same interface. Learn this once and it applies everywhere.
from sklearn.linear_model import LogisticRegression
# 1. Instantiate with hyperparameters
model = LogisticRegression(C=1.0, max_iter=1000, random_state=42)
# 2. Fit on training data
model.fit(X_train, y_train)
# 3. Predict
y_pred = model.predict(X_test) # class labels
y_pred_proba = model.predict_proba(X_test) # probability per class
# 4. Score
score = model.score(X_test, y_test) # default metric for the estimator type
print(f"Accuracy: {score:.3f}")
Data Splitting¶
train_test_split with stratify¶
Use stratify whenever the target is categorical and you want each split to reflect the class distribution of the full dataset. Skipping it on imbalanced data leads to splits with missing or underrepresented classes.
from sklearn.model_selection import train_test_split
import pandas as pd
X = df.drop(columns=['churn'])
y = df['churn']
X_train, X_test, y_train, y_test = train_test_split(
X, y,
test_size=0.2,
stratify=y, # preserves class ratio in both splits
random_state=42
)
print(y_train.value_counts(normalize=True))
# Output:
# 0 0.855
# 1 0.145
print(y_test.value_counts(normalize=True))
# Output:
# 0 0.855
# 1 0.145
cross_val_score¶
Use when your dataset is too small to hold out a fixed validation set. Gives a less noisy estimate of generalization performance than a single train/val split.
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=100, random_state=42)
cv_scores = cross_val_score(model, X, y, cv=5, scoring='f1_weighted', n_jobs=-1)
print(f"F1 per fold: {cv_scores.round(3)}")
print(f"Mean: {cv_scores.mean():.3f} Std: {cv_scores.std():.3f}")
# Output:
# F1 per fold: [0.821 0.834 0.818 0.841 0.829]
# Mean: 0.829 Std: 0.008
KFold and StratifiedKFold¶
KFold for regression. StratifiedKFold for classification — ensures each fold has proportional class representation.
from sklearn.model_selection import KFold, StratifiedKFold, cross_validate
from sklearn.linear_model import Ridge
# Regression — KFold
kf = KFold(n_splits=5, shuffle=True, random_state=42)
results = cross_validate(Ridge(alpha=1.0), X, y, cv=kf,
scoring=['r2', 'neg_mean_absolute_error'])
print(f"R²: {results['test_r2'].mean():.3f}")
# Classification — StratifiedKFold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
from sklearn.linear_model import LogisticRegression
clf_results = cross_validate(LogisticRegression(max_iter=1000), X_cls, y_cls,
cv=skf, scoring='roc_auc')
print(f"ROC-AUC: {clf_results['test_score'].mean():.3f}")
Preprocessing¶
StandardScaler¶
Use when the algorithm is sensitive to feature scale and the data is roughly Gaussian (e.g., Logistic Regression, SVM, KNN, linear models). Centers to mean=0, scales to std=1.
Warning
Always fit the scaler on training data only. Fitting on the full dataset leaks test statistics into training.
from sklearn.preprocessing import StandardScaler
import numpy as np
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train) # fit + transform on train
X_test_scaled = scaler.transform(X_test) # transform only on test
print(f"Mean (train): {X_train_scaled.mean(axis=0).round(4)}") # ≈ 0
print(f"Std (train): {X_train_scaled.std(axis=0).round(4)}") # ≈ 1
MinMaxScaler¶
Scales features to a fixed range, default [0, 1]. Use when you need bounded output (neural networks, image pixel values). Sensitive to outliers — one extreme point compresses everything else.
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range=(0, 1))
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
print(f"Min: {X_train_scaled.min(axis=0)}") # Output: [0. 0. 0. ...]
print(f"Max: {X_train_scaled.max(axis=0)}") # Output: [1. 1. 1. ...]
RobustScaler¶
Uses median and IQR instead of mean and std. The right choice when your data has significant outliers that would distort StandardScaler.
from sklearn.preprocessing import RobustScaler
scaler = RobustScaler(quantile_range=(25.0, 75.0))
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
LabelEncoder vs OrdinalEncoder¶
LabelEncoder encodes the target column (1D). OrdinalEncoder encodes feature columns (2D) with optional category ordering.
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder
import pandas as pd
# LabelEncoder — for target y only
le = LabelEncoder()
y_encoded = le.fit_transform(y_raw)
print(le.classes_) # Output: ['High' 'Low' 'Medium']
print(le.transform(['Low'])) # Output: [1]
# OrdinalEncoder — for ordered features
size_order = [['Small', 'Medium', 'Large', 'XLarge']]
oe = OrdinalEncoder(categories=size_order)
X[['shirt_size']] = oe.fit_transform(X[['shirt_size']])
OneHotEncoder¶
Converts nominal (unordered) categorical variables into binary columns. Use when the algorithm cannot interpret integer-encoded categories (linear models, SVMs). Tree models can handle label-encoded categoricals directly.
from sklearn.preprocessing import OneHotEncoder
import pandas as pd
ohe = OneHotEncoder(sparse_output=False, handle_unknown='ignore', drop='first')
# note: with drop='first', an unknown category encodes as all zeros, identical to the dropped category
encoded = ohe.fit_transform(X[['city', 'payment_method']])
feature_names = ohe.get_feature_names_out(['city', 'payment_method'])
encoded_df = pd.DataFrame(encoded, columns=feature_names)
print(encoded_df.head())
# Output:
# city_Mumbai city_Pune payment_method_UPI ...
# 0 1.0 0.0 1.0
Regression Algorithms¶
LinearRegression¶
Baseline for any regression task. Use when the relationship between features and target is approximately linear. No hyperparameters to tune — if it underperforms, that's signal to try a more complex model.
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
model = LinearRegression(fit_intercept=True)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(f"Coefficients: {model.coef_}")
print(f"Intercept: {model.intercept_:.3f}")
print(f"R²: {r2_score(y_test, y_pred):.3f}")
Ridge (L2 Regularization)¶
Linear regression with L2 penalty on coefficient magnitude. Use when you have many correlated features — Ridge shrinks coefficients toward zero but rarely to exactly zero. alpha controls regularization strength; higher alpha = stronger shrinkage.
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
# Key hyperparameter: alpha
for alpha in [0.01, 0.1, 1.0, 10.0, 100.0]:
model = Ridge(alpha=alpha)
scores = cross_val_score(model, X_train, y_train, cv=5, scoring='r2')
print(f"alpha={alpha:6.2f} CV R²={scores.mean():.3f}")
best_model = Ridge(alpha=1.0)
best_model.fit(X_train, y_train)
Lasso (L1 Regularization)¶
Linear regression with L1 penalty. Drives some coefficients to exactly zero — performs built-in feature selection. Prefer over Ridge when you believe only a subset of features are relevant.
from sklearn.linear_model import Lasso
import pandas as pd
model = Lasso(alpha=0.1, max_iter=10000)
model.fit(X_train, y_train)
coef_series = pd.Series(model.coef_, index=feature_names)
selected = coef_series[coef_series != 0]
print(f"Features selected by Lasso: {len(selected)} of {len(feature_names)}")
print(selected.sort_values(key=abs, ascending=False))
ElasticNet¶
Combines L1 and L2 penalties. Use when you want Lasso's feature selection but with more stability when features are correlated (pure Lasso arbitrarily picks one of a correlated group). l1_ratio=1 is Lasso, l1_ratio=0 is Ridge.
from sklearn.linear_model import ElasticNet
model = ElasticNet(
alpha=0.1, # overall regularization strength
l1_ratio=0.5, # 50% L1, 50% L2
max_iter=10000,
random_state=42
)
model.fit(X_train, y_train)
print(f"R²: {model.score(X_test, y_test):.3f}")
DecisionTreeRegressor¶
Non-parametric, captures nonlinear relationships and interactions. Prone to overfitting without depth limits. Use as a baseline for tree-based approaches before moving to ensembles.
from sklearn.tree import DecisionTreeRegressor
model = DecisionTreeRegressor(
max_depth=5, # primary regularizer — start here
min_samples_leaf=10, # require at least 10 samples per leaf
min_samples_split=20, # require at least 20 samples to split a node
random_state=42
)
model.fit(X_train, y_train)
print(f"Train R²: {model.score(X_train, y_train):.3f}")
print(f"Test R²: {model.score(X_test, y_test):.3f}")
RandomForestRegressor¶
Ensemble of decision trees trained on bootstrap samples with random feature subsets. Reduces variance substantially vs. a single tree. A reliable default for tabular regression.
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor(
n_estimators=200, # more trees = lower variance, diminishing returns after ~200
max_depth=None, # let trees grow fully; bootstrap + feature randomness controls variance
max_features='sqrt', # features considered at each split — 'sqrt' is standard
min_samples_leaf=5,
oob_score=True, # score each tree on its out-of-bag samples — a free validation estimate
n_jobs=-1,
random_state=42
)
model.fit(X_train, y_train)
print(f"OOB R²: {model.oob_score_:.3f}")
print(f"Test R²: {model.score(X_test, y_test):.3f}")
Classification Algorithms¶
LogisticRegression¶
Use for binary and multiclass classification when you need probability outputs and model interpretability. Despite the name, it is a classifier. Works best with scaled features — the gradient-based solvers converge faster and more reliably.
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(
C=1.0, # inverse of regularization strength; smaller C = more regularization
penalty='l2', # 'l1', 'l2', 'elasticnet', or None
solver='lbfgs', # 'saga' for l1/elasticnet; 'lbfgs' for l2
max_iter=1000,
class_weight='balanced', # useful for imbalanced targets
random_state=42
)
model.fit(X_train_scaled, y_train)
print(f"Accuracy: {model.score(X_test_scaled, y_test):.3f}")
print(f"Coefficients shape: {model.coef_.shape}")
KNeighborsClassifier¶
Non-parametric: classifies by majority vote among k nearest neighbors. No training step — prediction is expensive on large datasets. Features must be scaled. Works well with small datasets and low-dimensional feature spaces.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
# Find best k
for k in [3, 5, 7, 11, 15]:
knn = KNeighborsClassifier(n_neighbors=k, metric='euclidean', weights='uniform')
scores = cross_val_score(knn, X_train_scaled, y_train, cv=5, scoring='accuracy')
print(f"k={k:2d} CV Accuracy={scores.mean():.3f}")
best_knn = KNeighborsClassifier(n_neighbors=7, weights='distance')
best_knn.fit(X_train_scaled, y_train)
DecisionTreeClassifier¶
Interpretable, handles mixed feature types, no scaling needed. Prone to overfitting. Use when explainability is required or as a component in an ensemble.
from sklearn.tree import DecisionTreeClassifier, export_text
model = DecisionTreeClassifier(
max_depth=4,
min_samples_leaf=15,
criterion='gini', # 'gini' or 'entropy'
class_weight='balanced',
random_state=42
)
model.fit(X_train, y_train)
# Human-readable tree
print(export_text(model, feature_names=list(feature_names), max_depth=3))
RandomForestClassifier¶
Strong general-purpose baseline. Handles high-dimensional data, implicit feature selection, and is robust to outliers. No scaling required.
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(
n_estimators=200,
max_depth=None,
max_features='sqrt',
class_weight='balanced', # addresses class imbalance
oob_score=True,
n_jobs=-1,
random_state=42
)
model.fit(X_train, y_train)
print(f"OOB accuracy: {model.oob_score_:.3f}")
print(f"Test accuracy: {model.score(X_test, y_test):.3f}")
GradientBoostingClassifier¶
Builds trees sequentially, each correcting the errors of the previous. Often the top performer on structured/tabular data. Slower to train than Random Forest, more hyperparameters to tune.
from sklearn.ensemble import GradientBoostingClassifier
model = GradientBoostingClassifier(
n_estimators=200, # number of boosting stages
learning_rate=0.05, # shrinkage — lower rate needs more trees
max_depth=4, # shallow trees work best for boosting
subsample=0.8, # stochastic boosting reduces variance
min_samples_leaf=10,
random_state=42
)
model.fit(X_train, y_train)
print(f"Test accuracy: {model.score(X_test, y_test):.3f}")
Tip
For large datasets, use HistGradientBoostingClassifier from scikit-learn or XGBoost/LightGBM — they are 10–100x faster than GradientBoostingClassifier.
SVC (Support Vector Classifier)¶
Maximizes the margin between classes. Effective in high-dimensional spaces and when classes are separable. Requires feature scaling. Slow on large datasets (>50k rows).
from sklearn.svm import SVC
model = SVC(
C=1.0, # regularization — larger C = smaller margin, fewer misclassifications
kernel='rbf', # 'linear', 'poly', 'rbf', 'sigmoid'
gamma='scale', # 'scale' = 1/(n_features * X.var()), 'auto' = 1/n_features
probability=True, # enables predict_proba — adds overhead
class_weight='balanced',
random_state=42
)
model.fit(X_train_scaled, y_train)
y_pred_proba = model.predict_proba(X_test_scaled)[:, 1]
Clustering¶
KMeans with Elbow Method¶
Partitions data into k spherical clusters by minimizing inertia (sum of squared distances to cluster centroids). Assumes clusters are roughly equal-sized and convex. Sensitive to outliers and initialization.
Use elbow method or silhouette score to choose k — there is no ground truth label to validate against.
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import numpy as np
import matplotlib.pyplot as plt
scaler = StandardScaler()
X_scaled = scaler.fit_transform(customer_features)
# Elbow method
inertias = []
k_range = range(2, 11)
for k in k_range:
km = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=42)
km.fit(X_scaled)
inertias.append(km.inertia_)
plt.plot(k_range, inertias, marker='o')
plt.xlabel('Number of clusters k')
plt.ylabel('Inertia')
plt.title('Elbow Method')
plt.show()
# Fit with chosen k
best_km = KMeans(n_clusters=4, init='k-means++', n_init=10, random_state=42)
cluster_labels = best_km.fit_predict(X_scaled)
customer_features['segment'] = cluster_labels
print(customer_features.groupby('segment').mean())
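The silhouette score mentioned above gives a second opinion on k. A minimal sketch on synthetic blobs standing in for the scaled customer features:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic blobs stand in for X_scaled from the snippet above
X_scaled, _ = make_blobs(n_samples=500, centers=4, random_state=42)

scores = {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    labels = km.fit_predict(X_scaled)
    scores[k] = silhouette_score(X_scaled, labels)  # in [-1, 1]; higher = better-separated clusters
    print(f"k={k}  silhouette={scores[k]:.3f}")

print(f"Best k by silhouette: {max(scores, key=scores.get)}")
```

Unlike inertia, silhouette does not decrease monotonically with k, so the best k can be read off directly instead of eyeballing an elbow.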
DBSCAN¶
Density-Based Spatial Clustering of Applications with Noise. Discovers clusters of arbitrary shape and automatically labels outliers as noise (label = -1). Does not require specifying k.
Use when: clusters are non-spherical, you have significant noise/outliers, you don't know the number of clusters.
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(location_data)
db = DBSCAN(
eps=0.5, # maximum distance between two points to be neighbors
min_samples=5, # minimum points to form a core point
metric='euclidean'
)
labels = db.fit_predict(X_scaled)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = (labels == -1).sum()
print(f"Clusters found: {n_clusters}")
print(f"Noise points: {n_noise} ({100 * n_noise / len(labels):.1f}%)")
Tip
Use NearestNeighbors to find a good eps value: sort the distances to the kth neighbor and look for the elbow in that plot.
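A sketch of that tip, assuming synthetic two-moons data in place of the location data:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled location data above
X, _ = make_moons(n_samples=500, noise=0.05, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

k = 5  # match DBSCAN's min_samples
nn = NearestNeighbors(n_neighbors=k + 1).fit(X_scaled)  # +1: each point is its own nearest neighbor
distances, _ = nn.kneighbors(X_scaled)
k_dist = np.sort(distances[:, -1])  # sorted distance to the k-th true neighbor

# Plot k_dist and pick eps at the elbow; as a rough starting point,
# look at a high percentile just before the curve shoots up
print(f"Suggested eps range: {np.percentile(k_dist, 90):.3f} to {np.percentile(k_dist, 99):.3f}")
```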
AgglomerativeClustering¶
Hierarchical bottom-up clustering. Starts with each point as its own cluster and merges the most similar pair at each step. No assumption about cluster shape. Computationally expensive for large datasets (O(n² log n)).
Use when: you need a hierarchy of clusters, or cluster count is uncertain and you want to inspect the dendrogram.
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage
import matplotlib.pyplot as plt
# Dendrogram for choosing n_clusters
linkage_matrix = linkage(X_scaled, method='ward')
plt.figure(figsize=(10, 5))
dendrogram(linkage_matrix, truncate_mode='level', p=5)
plt.title('Hierarchical Clustering Dendrogram')
plt.show()
# Fit with chosen number of clusters
model = AgglomerativeClustering(
n_clusters=4,
linkage='ward', # 'ward', 'complete', 'average', 'single'
metric='euclidean'
)
labels = model.fit_predict(X_scaled)
Regression Metrics¶
MAE, MSE, RMSE, R²¶
| Metric | Formula | Interpretation |
|---|---|---|
| MAE | mean(\|y - ŷ\|) | Average absolute error, same units as target, robust to outliers |
| MSE | mean((y - ŷ)²) | Penalizes large errors more heavily |
| RMSE | sqrt(MSE) | Same units as target, emphasizes large errors |
| R² | 1 - SS_res/SS_tot | Proportion of variance explained; 1.0 is perfect |
from sklearn.metrics import (
mean_absolute_error,
mean_squared_error,
r2_score
)
import numpy as np
y_test = [100, 200, 150, 300, 250]
y_pred = [110, 195, 160, 280, 260]
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)
print(f"MAE: {mae:.2f}") # Output: MAE: 11.00
print(f"MSE: {mse:.2f}") # Output: MSE: 145.00
print(f"RMSE: {rmse:.2f}") # Output: RMSE: 12.04
print(f"R²: {r2:.3f}") # Output: R²: 0.971
Adjusted R²¶
Penalizes adding features that do not improve the model. Use instead of R² when comparing models with different numbers of features.
from sklearn.metrics import r2_score
def adjusted_r2(r2, n_samples, n_features):
"""
r2 : R² score from sklearn
n_samples : number of rows in the test set
n_features : number of features used by the model
"""
return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)
r2 = r2_score(y_test, y_pred)
adj_r2 = adjusted_r2(r2, n_samples=len(y_test), n_features=X_test.shape[1])
print(f"R²: {r2:.4f}")
print(f"Adjusted R²: {adj_r2:.4f}")
Classification Metrics¶
Accuracy, Precision, Recall, F1¶
Use accuracy when classes are balanced. For imbalanced data, prefer precision, recall, and F1. Choose precision when false positives are expensive (spam filter). Choose recall when false negatives are expensive (cancer screening).
from sklearn.metrics import (
accuracy_score,
precision_score,
recall_score,
f1_score,
classification_report,
confusion_matrix,
roc_auc_score
)
y_test = [0, 0, 1, 1, 0, 1, 0, 1, 1, 0]
y_pred = [0, 1, 1, 0, 0, 1, 0, 1, 0, 0]
print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")
print(f"Precision: {precision_score(y_test, y_pred):.3f}") # TP / (TP + FP)
print(f"Recall: {recall_score(y_test, y_pred):.3f}") # TP / (TP + FN)
print(f"F1: {f1_score(y_test, y_pred):.3f}") # harmonic mean of P and R
# Output:
# Accuracy: 0.700
# Precision: 0.750
# Recall: 0.600
# F1: 0.667
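These values can be checked by hand from the confusion counts of the arrays above (TP=3, FP=1, FN=2, TN=4):

```python
# Confusion counts for the y_test / y_pred arrays above
TP, FP, FN, TN = 3, 1, 2, 4

accuracy = (TP + TN) / (TP + TN + FP + FN)   # 7/10
precision = TP / (TP + FP)                   # 3/4
recall = TP / (TP + FN)                      # 3/5
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(f"{accuracy:.3f} {precision:.3f} {recall:.3f} {f1:.3f}")
# → 0.700 0.750 0.600 0.667
```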
confusion_matrix and classification_report¶
from sklearn.metrics import confusion_matrix, classification_report
import pandas as pd
cm = confusion_matrix(y_test, y_pred)
cm_df = pd.DataFrame(
cm,
index=['Actual Negative', 'Actual Positive'],
columns=['Predicted Negative', 'Predicted Positive']
)
print(cm_df)
# Output:
# Predicted Negative Predicted Positive
# Actual Negative 4 1
# Actual Positive 2 3
print(classification_report(y_test, y_pred, target_names=['No Churn', 'Churn']))
ROC-AUC¶
Measures the model's ability to discriminate between classes across all thresholds. AUC = 0.5 means no discrimination (random). AUC = 1.0 means perfect. Use roc_auc_score with probability predictions, not binary predictions.
from sklearn.metrics import roc_auc_score, roc_curve
import matplotlib.pyplot as plt
y_pred_proba = model.predict_proba(X_test)[:, 1] # probability of positive class
auc = roc_auc_score(y_test, y_pred_proba)
print(f"ROC-AUC: {auc:.3f}")
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
plt.plot(fpr, tpr, label=f'ROC curve (AUC = {auc:.3f})')
plt.plot([0, 1], [0, 1], linestyle='--', color='gray', label='Random classifier')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.show()
Hyperparameter Tuning¶
GridSearchCV¶
Exhaustively searches all combinations of specified hyperparameters. Use when the search space is small and you need reproducible, thorough results.
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
param_grid = {
'n_estimators': [100, 200, 300],
'max_depth': [None, 5, 10],
'min_samples_leaf': [1, 5, 10],
'max_features': ['sqrt', 'log2']
}
grid_search = GridSearchCV(
estimator=RandomForestClassifier(random_state=42),
param_grid=param_grid,
cv=5,
scoring='f1_weighted',
n_jobs=-1,
verbose=1,
refit=True # refit best model on full training set
)
grid_search.fit(X_train, y_train)
print(f"Best params: {grid_search.best_params_}")
print(f"Best CV F1: {grid_search.best_score_:.3f}")
best_model = grid_search.best_estimator_
print(f"Test F1: {f1_score(y_test, best_model.predict(X_test), average='weighted'):.3f}")
RandomizedSearchCV¶
Samples a fixed number of hyperparameter combinations from distributions. Use when the search space is large — far more efficient than GridSearch and often finds equally good solutions.
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import GradientBoostingClassifier
from scipy.stats import randint, uniform
param_distributions = {
'n_estimators': randint(50, 500),
'learning_rate': uniform(0.01, 0.3),
'max_depth': randint(2, 8),
'subsample': uniform(0.6, 0.4), # samples from [0.6, 1.0]
'min_samples_leaf': randint(5, 50)
}
random_search = RandomizedSearchCV(
estimator=GradientBoostingClassifier(random_state=42),
param_distributions=param_distributions,
n_iter=50, # number of combinations to sample
cv=5,
scoring='roc_auc',
n_jobs=-1,
random_state=42,
verbose=1
)
random_search.fit(X_train, y_train)
print(f"Best params: {random_search.best_params_}")
print(f"Best CV AUC: {random_search.best_score_:.3f}")
Pipelines¶
Pipeline — canonical pattern¶
A Pipeline chains preprocessing and modeling steps so that fit and transform are called consistently: during cross-validation the preprocessing is re-fit on each training fold, preventing leakage from the held-out fold into training.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
pipeline = Pipeline(steps=[
('scaler', StandardScaler()),
('classifier', LogisticRegression(C=1.0, max_iter=1000, random_state=42))
])
# Entire pipeline participates in cross-validation correctly
cv_scores = cross_val_score(pipeline, X, y, cv=5, scoring='roc_auc')
print(f"CV ROC-AUC: {cv_scores.mean():.3f} ± {cv_scores.std():.3f}")
# Train final model
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
ColumnTransformer — mixed feature types¶
Apply different preprocessing to numeric and categorical columns in one step.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
numeric_features = ['age', 'annual_income', 'account_balance']
categorical_features = ['city', 'product_category', 'payment_method']
preprocessor = ColumnTransformer(transformers=[
('num', StandardScaler(), numeric_features),
('cat', OneHotEncoder(handle_unknown='ignore', sparse_output=False), categorical_features)
])
pipeline = Pipeline(steps=[
('preprocessor', preprocessor),
('classifier', RandomForestClassifier(n_estimators=200, random_state=42))
])
pipeline.fit(X_train, y_train)
print(f"Test accuracy: {pipeline.score(X_test, y_test):.3f}")
# Hyperparameter tuning still works end-to-end
from sklearn.model_selection import GridSearchCV
param_grid = {
'classifier__n_estimators': [100, 200],
'classifier__max_depth': [None, 5]
}
gs = GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy', n_jobs=-1)
gs.fit(X_train, y_train)
make_pipeline — shorthand¶
Identical to Pipeline but infers step names from class names automatically.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.svm import SVC
pipeline = make_pipeline(
RobustScaler(),
SVC(C=1.0, kernel='rbf', probability=True, random_state=42)
)
pipeline.fit(X_train, y_train)
print(pipeline.steps)
# Output: [('robustscaler', RobustScaler()), ('svc', SVC(probability=True, ...))]
Feature Importance¶
Impurity-Based Importance (.feature_importances_)¶
Available on all tree-based models. Fast to compute. Can be misleading when features have different cardinalities or when features are correlated — high-cardinality features (like IDs) are favored.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)
importance_df = pd.DataFrame({
'feature': feature_names,
'importance': model.feature_importances_
}).sort_values('importance', ascending=False)
print(importance_df.head(10))
importance_df.head(15).plot.barh(x='feature', y='importance', figsize=(8, 6))
plt.title('Feature Importances (Impurity-Based)')
plt.gca().invert_yaxis()
plt.show()
permutation_importance¶
Model-agnostic. Measures the drop in score when a feature's values are randomly shuffled. More reliable than impurity-based importance, especially for correlated features.
from sklearn.inspection import permutation_importance
import pandas as pd
result = permutation_importance(
model, X_test, y_test,
n_repeats=10, # number of shuffles per feature
scoring='roc_auc',
random_state=42,
n_jobs=-1
)
perm_df = pd.DataFrame({
'feature': feature_names,
'importance_mean': result.importances_mean,
'importance_std': result.importances_std
}).sort_values('importance_mean', ascending=False)
print(perm_df.head(10))
SHAP (brief)¶
SHAP (SHapley Additive exPlanations) provides consistent, locally accurate feature attributions for any model. The gold standard for model explainability. Requires pip install shap.
import shap
explainer = shap.TreeExplainer(model) # fast path for tree models
shap_values = explainer.shap_values(X_test)
# Global summary plot. For binary classifiers, older shap versions return a list
# with one array per class (index [1] = positive class); newer versions may
# return a single stacked array — check the shape before indexing.
shap.summary_plot(shap_values[1], X_test, feature_names=feature_names)
# Single prediction explanation
shap.waterfall_plot(
shap.Explanation(values=shap_values[1][0],
base_values=explainer.expected_value[1],
data=X_test.iloc[0],
feature_names=feature_names)
)
Imbalanced Data¶
class_weight='balanced'¶
The simplest fix. Automatically adjusts sample weights inversely proportional to class frequency. Supported by LogisticRegression, SVC, DecisionTree, RandomForest, and GradientBoosting.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
# Without balancing
unweighted = LogisticRegression(max_iter=1000, random_state=42)
unweighted.fit(X_train_scaled, y_train)
print(classification_report(y_test, unweighted.predict(X_test_scaled)))
# With balancing
balanced = LogisticRegression(class_weight='balanced', max_iter=1000, random_state=42)
balanced.fit(X_train_scaled, y_train)
print(classification_report(y_test, balanced.predict(X_test_scaled)))
# Expect higher recall for minority class, lower precision
SMOTE (Synthetic Minority Oversampling)¶
Generates synthetic minority class samples by interpolating between existing minority examples. Use when class_weight is insufficient and you have enough minority class samples to interpolate (>= 6). Requires pip install imbalanced-learn.
Warning
Apply SMOTE only to training data, never to validation or test data. Doing otherwise inflates performance metrics.
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.ensemble import RandomForestClassifier
# SMOTE must be inside a pipeline to avoid leaking into validation folds
pipeline = ImbPipeline(steps=[
('smote', SMOTE(sampling_strategy='auto', k_neighbors=5, random_state=42)),
('classifier', RandomForestClassifier(n_estimators=200, random_state=42))
])
from sklearn.model_selection import cross_val_score
scores = cross_val_score(pipeline, X_train, y_train, cv=5, scoring='f1')
print(f"CV F1 with SMOTE: {scores.mean():.3f} ± {scores.std():.3f}")
Threshold Adjustment¶
The default decision threshold for predict() is 0.5. Lowering the threshold increases recall at the cost of precision. Use when the cost of false negatives is higher than false positives (e.g., fraud, churn, medical diagnosis).
from sklearn.metrics import precision_recall_curve, f1_score
import numpy as np
model.fit(X_train_scaled, y_train)
y_pred_proba = model.predict_proba(X_test_scaled)[:, 1]
# Find threshold that maximizes F1
precisions, recalls, thresholds = precision_recall_curve(y_test, y_pred_proba)
f1_scores = 2 * precisions * recalls / (precisions + recalls + 1e-9)
# precisions/recalls have one more entry than thresholds, so drop the last point before indexing
best_threshold = thresholds[np.argmax(f1_scores[:-1])]
print(f"Default threshold (0.5) F1: {f1_score(y_test, y_pred_proba >= 0.5):.3f}")
print(f"Optimal threshold ({best_threshold:.2f}) F1: {f1_score(y_test, y_pred_proba >= best_threshold):.3f}")
y_pred_adjusted = (y_pred_proba >= best_threshold).astype(int)
Saving Models¶
joblib — preferred for scikit-learn¶
joblib serializes numpy arrays efficiently, making it faster than pickle for models that contain large arrays (most scikit-learn models). Use this by default.
import joblib
# Save
joblib.dump(model, 'churn_model_v1.joblib')
print("Model saved.")
# Save entire pipeline (always preferred — scaler + model together)
joblib.dump(pipeline, 'churn_pipeline_v1.joblib')
# Load
loaded_model = joblib.load('churn_model_v1.joblib')
y_pred = loaded_model.predict(X_test)
print(f"Predictions from loaded model: {y_pred[:5]}")
# Output: Predictions from loaded model: [0 1 0 0 1]
pickle — standard library fallback¶
Use when you cannot install joblib, or when interoperability with non-scikit-learn code is required. Slower for large numpy arrays. Loading also depends on the library versions used at save time; record the Python and scikit-learn versions alongside the file.
import pickle
# Save
with open('churn_model_v1.pkl', 'wb') as f:
pickle.dump(model, f)
# Load
with open('churn_model_v1.pkl', 'rb') as f:
loaded_model = pickle.load(f)
y_pred = loaded_model.predict(X_test)
Warning
Never load a pickle file from an untrusted source. Pickle can execute arbitrary code on deserialization. For model serving in production, prefer ONNX, PMML, or a framework-native format.
Tip
Always save the model and preprocessing pipeline together as a single object. If you save them separately, you risk applying the wrong scaler to incoming data at inference time.