Mini Case Study — Customer Churn EDA¶
This case study runs the full EDA workflow on a synthetic customer dataset. Everything is generated in code — no external files needed. Work through each step yourself before reading the commentary.
The business question: Which customers are most at risk of churning, and what signals predict it?
The Dataset¶
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
rng = np.random.default_rng(42)
n = 500
# Simulate customer data with realistic patterns
tenure = rng.integers(1, 60, n) # months as a customer
age = rng.integers(20, 65, n)
segment = rng.choice(["Basic", "Premium"], n, p=[0.65, 0.35])
region = rng.choice(["North", "South", "East", "West"], n)
calls = rng.integers(0, 12, n) # support calls last 6 months
monthly_fee = np.where(segment == "Premium",
rng.uniform(800, 1500, n),
rng.uniform(200, 600, n))
# Churn probability depends on tenure, calls, and segment
churn_prob = (
0.5
- tenure * 0.008 # longer tenure → lower churn
+ calls * 0.06 # more calls → higher churn
- (segment == "Premium") * 0.15
+ rng.normal(0, 0.05, n)
)
churn_prob = np.clip(churn_prob, 0.02, 0.95)
churn = (rng.uniform(size=n) < churn_prob).astype(int)
df = pd.DataFrame({
"customer_id": range(1001, 1001 + n),
"age": age,
"tenure_months": tenure,
"segment": segment,
"region": region,
"support_calls": calls,
"monthly_fee": monthly_fee.round(2),
"churn": churn # 1 = churned, 0 = retained
})
# Inject realistic messiness
df.loc[rng.choice(df.index, 18, replace=False), "age"] = np.nan
df.loc[rng.choice(df.index, 9, replace=False), "support_calls"] = np.nan
df.loc[rng.choice(df.index, 4, replace=False), "region"] = None
print(df.shape)
print(df.dtypes)
Step 1 — First Contact¶
print(df.head(8))
print(f"\nRows: {df.shape[0]}, Columns: {df.shape[1]}")
print(f"Churn rate: {df['churn'].mean():.1%}")
What each row represents: one customer. Target variable:
churn(1 = left, 0 = stayed).
Step 2 — Data Quality Audit¶
# Missing values
missing = pd.DataFrame({
"count": df.isna().sum(),
"pct": (df.isna().mean() * 100).round(1)
}).query("count > 0")
print(missing)
# Output:
# count pct
# age 18 3.6
# support_calls 9 1.8
# region 4 0.8
# Duplicates
print(f"Duplicate customer_ids: {df['customer_id'].duplicated().sum()}") # 0
print(f"Fully duplicate rows: {df.duplicated().sum()}") # 0
Decisions:
age: 3.6% missing. Impute with median; addage_missingindicator flag.support_calls: 1.8% missing. Missing likely means zero calls — impute with 0.region: 0.8% missing (4 rows). Impute with "Unknown" category.
df["age_missing"] = df["age"].isna().astype(int)
df["age"] = df["age"].fillna(df["age"].median())
df["support_calls"] = df["support_calls"].fillna(0)
df["region"] = df["region"].fillna("Unknown")
print("Remaining NaN values:", df.isna().sum().sum()) # 0
Step 3 — Univariate Analysis¶
Numeric Features¶
print(df[["age", "tenure_months", "support_calls", "monthly_fee"]].describe(
percentiles=[0.1, 0.25, 0.5, 0.75, 0.9]
))
fig, axes = plt.subplots(2, 2, figsize=(12, 8))
cols = ["age", "tenure_months", "support_calls", "monthly_fee"]
for ax, col in zip(axes.flatten(), cols):
sns.histplot(df[col], kde=True, ax=ax, color="#0D9488")
ax.set_title(col.replace("_", " ").title())
plt.suptitle("Univariate Distributions", fontsize=13, y=1.01)
plt.tight_layout()
plt.savefig("univariate.png")
Findings:
tenure_months: roughly uniform (1–60). No concentration near cancellation dates.support_calls: right-skewed. Most customers make 0–3 calls; a tail makes 8–12.monthly_fee: bimodal — two clear peaks matching the Basic and Premium segments.age: approximately normal, centred around 42.
Categorical Features¶
for col in ["segment", "region"]:
vc = df[col].value_counts()
print(f"\n{col}:\n{vc}")
print(f" Churn rate: {df.groupby(col)['churn'].mean().round(3)}")
Step 4 — Bivariate Analysis (Feature vs Target)¶
Numeric Features vs Churn¶
fig, axes = plt.subplots(1, 4, figsize=(16, 5))
cols = ["age", "tenure_months", "support_calls", "monthly_fee"]
for ax, col in zip(axes, cols):
sns.boxplot(data=df, x="churn", y=col, ax=ax,
palette={0: "#0D9488", 1: "#F59E0B"})
ax.set_title(col.replace("_", " ").title())
ax.set_xlabel("Churned")
plt.suptitle("Feature Distributions: Retained (0) vs Churned (1)", y=1.02)
plt.tight_layout()
plt.savefig("bivariate_target.png")
Findings:
for col in ["age", "tenure_months", "support_calls", "monthly_fee"]:
retained = df[df["churn"] == 0][col].mean()
churned = df[df["churn"] == 1][col].mean()
print(f"{col:20s} retained={retained:7.1f} churned={churned:7.1f} diff={churned-retained:+.1f}")
Expected output pattern:
- tenure_months: churned customers have ~15 months; retained have ~35. Strong negative relationship.
- support_calls: churned customers average ~5 calls; retained ~2. Strong positive relationship.
- monthly_fee: churned customers pay less (mostly Basic segment).
- age: minimal difference — probably not a strong predictor.
Categorical Features vs Churn¶
for col in ["segment", "region"]:
ct = pd.crosstab(df[col], df["churn"], normalize="index") * 100
ct.columns = ["Retained %", "Churned %"]
print(f"\n{col}:\n{ct.round(1)}")
Findings:
segment: Basic churns at ~38%; Premium churns at ~20%. Segment is a strong signal.region: Churn rates similar across regions (~29–33%). Region is probably weak.
Step 5 — Correlation Analysis¶
numeric_feats = ["age", "tenure_months", "support_calls", "monthly_fee", "age_missing"]
corr = df[numeric_feats + ["churn"]].corr()
fig, ax = plt.subplots(figsize=(7, 6))
mask = np.triu(np.ones_like(corr, dtype=bool))
sns.heatmap(corr, mask=mask, annot=True, fmt=".2f",
cmap="coolwarm", center=0, ax=ax)
ax.set_title("Correlation Matrix incl. Target")
plt.tight_layout()
plt.savefig("correlation.png")
Key correlations with churn:
tenure_months: strong negative (~-0.55). Longer-tenure customers churn less.support_calls: strong positive (~+0.50). More calls = higher churn.monthly_fee: moderate negative (~-0.30). Higher fee customers (Premium) churn less.age: near zero. Not predictive on its own.
Step 6 — Outlier Check¶
def iqr_bounds(s):
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
return q1 - 1.5 * iqr, q3 + 1.5 * iqr
for col in ["support_calls", "monthly_fee", "tenure_months"]:
lo, hi = iqr_bounds(df[col])
n_out = ((df[col] < lo) | (df[col] > hi)).sum()
print(f"{col}: {n_out} outliers (bounds: {lo:.1f} – {hi:.1f})")
No pathological outliers here — the ranges are consistent with the data generation. In a real dataset, always inspect the actual rows before deciding.
Written Insights¶
Using the Observation / Interpretation / Action format:
Insight 1 — Tenure is the dominant predictor
Observation: Churned customers have an average tenure of ~15 months vs ~35 months for retained customers.
Interpretation: New customers have not yet built switching costs or habitual use. The first year is the critical retention window.
Action: Include tenure as a top feature. Consider a tenure_bucket feature (0–6, 7–12, 13–24, 24+) to capture non-linear effects.
Insight 2 — Support calls signal dissatisfaction
Observation: Churned customers average ~5 support calls vs ~2 for retained. Customers with 8+ calls churn at over 70%.
Interpretation: High contact volume reflects unresolved problems. By the time someone has called 5 times, they are already disengaged.
Action: support_calls should be a primary feature. Consider an interaction feature: calls_per_tenure_month.
Insight 3 — Segment explains fee differences
Observation: Premium customers churn at 20% vs 38% for Basic. Monthly fee is correlated with churn only because segment is — not because high-paying customers are more loyal per se.
Interpretation: The product or service tier, not the price, drives retention. Premium customers likely have access to features that increase stickiness.
Action: Use segment directly. Explore whether there are Premium features that could be offered to Basic customers to reduce their churn.
Insight 4 — Region is not predictive Observation: Churn rates across North/South/East/West are 29–33% — effectively uniform. Interpretation: Churn is not a regional phenomenon in this dataset. Local factors do not dominate. Action: Do not include region as a primary feature. May still capture marginal signal as an interaction.
Insight 5 — Age has negligible effect Observation: Pearson r between age and churn is near zero. Age distributions of churned vs retained customers are nearly identical. Interpretation: This product churns for behavioural reasons (low engagement, problems) not demographic ones. Action: Exclude age from the initial model. Revisit if the model is poor on a specific age subgroup.
Modelling Recommendations¶
- Primary features:
tenure_months,support_calls,segment,monthly_fee - Engineer:
calls_per_tenure_month,tenure_bucket, binaryhigh_callsflag (calls > 6) - Handle class imbalance: if churn rate is ~30%, use
class_weight='balanced'in the model - Baseline model: Logistic Regression first — the relationships look roughly linear. Then compare against a tree-based model.
- Metric to optimise: Recall on the churned class — it is more costly to miss a churning customer than to incorrectly flag a retained one.
What this case study demonstrated
A structured EDA — even on 500 rows of synthetic data — produces clear, actionable findings in under an hour. Two features (tenure_months and support_calls) explain most of the churn signal. That is exactly the kind of focus that makes modelling efficient.