EDA and Cleaning — Churn Dataset

EDA is not exploration for its own sake. You are answering two questions: what does this data tell me about why customers churn, and what will break my model if I ignore it? Every chart and check should sharpen or challenge a hypothesis.

By the end of this phase you will have a clean DataFrame, a written list of findings, and zero surprises waiting to bite you during modeling.

Step 1 — Structure and Types

Run these four checks before anything else. They take seconds and catch most structural problems: wrong dtypes, unexpected cardinality, and repeated IDs.

# Assumes pandas is imported as pd and df is already loaded
print(df.shape)
# Output: (1020, 9)

print(df.dtypes)
# Output:
# customer_id        object
# tenure_months       int64
# monthly_charges    float64
# num_products        int64
# support_calls       int64
# has_tech_support    int64
# contract_type      object
# payment_method     object
# churn               int64

print(df.nunique())
# Output:
# customer_id        1000   ← unique IDs, but we have 1020 rows → duplicates exist
# tenure_months        71
# monthly_charges     958
# num_products          4
# support_calls        10
# has_tech_support      2
# contract_type         3
# payment_method        4
# churn                 2

df.head(3)

Tip

nunique() on customer_id returning 1000 when the DataFrame has 1020 rows tells you immediately that 20 rows repeat an existing ID. One line, no loop needed. Whether those rows are exact copies or conflicting records is what df.duplicated() checks in Step 3.
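A unique-count mismatch shows that some customer_id values repeat, but not whether the repeated rows are identical in every column. The sketch below separates the two cases on a small made-up frame (the rows are invented for illustration and are not from the churn dataset):

```python
import pandas as pd

# Toy data, invented for illustration: C2 is an exact duplicate row,
# C3 repeats the ID but with a different monthly_charges value.
toy = pd.DataFrame({
    "customer_id": ["C1", "C2", "C2", "C3", "C3"],
    "monthly_charges": [50.0, 70.0, 70.0, 30.0, 35.0],
})

n_repeated_ids = toy["customer_id"].duplicated().sum()  # rows whose ID was seen before
n_exact_dups = toy.duplicated().sum()                   # rows identical in every column

# IDs that still repeat after removing exact copies carry conflicting values
# and need a manual decision rather than drop_duplicates()
conflicting_ids = (
    toy.drop_duplicates()
       .loc[lambda d: d["customer_id"].duplicated(keep=False), "customer_id"]
       .unique()
       .tolist()
)

print(n_repeated_ids, n_exact_dups, conflicting_ids)
```

In this chapter's dataset the two counts agree (1020 - 1000 = 20 repeated IDs, and Step 3 finds 20 exact duplicate rows), so dropping duplicates is safe.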

Step 2 — Missing Values

missing = df.isna().sum()
missing_pct = (missing / len(df) * 100).round(2)

pd.DataFrame({"count": missing, "pct": missing_pct}).query("count > 0")
# Output:
#                  count   pct
# monthly_charges     40  3.92

monthly_charges has ~4% missing. That is low enough to impute rather than drop rows.

Decision: impute with the median, stratified by contract_type (customers on different contracts have systematically different charge levels — a global median would introduce bias).

# Impute monthly_charges with contract-type-specific median
df["monthly_charges"] = df.groupby("contract_type")["monthly_charges"].transform(
    lambda x: x.fillna(x.median())
)

df["monthly_charges"].isna().sum()
# Output: 0

Warning

Do not impute with the global mean or median when a grouping variable explains the variance. Here, Two Year contract customers pay systematically more — filling their missing values with the overall median underestimates their charges and distorts the feature.
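To see the size of the distortion, the following sketch compares both strategies on a small made-up frame. The values are illustrative only and are not taken from the churn dataset.

```python
import numpy as np
import pandas as pd

# Illustrative values only: Two Year customers pay more than Month-to-Month ones.
toy = pd.DataFrame({
    "contract_type": ["Month-to-Month"] * 4 + ["Two Year"] * 4,
    "monthly_charges": [20.0, 25.0, 30.0, np.nan, 80.0, 90.0, 100.0, np.nan],
})

# Strategy 1: one global median for every missing value
global_fill = toy["monthly_charges"].fillna(toy["monthly_charges"].median())

# Strategy 2: median of the row's own contract_type group
group_fill = toy.groupby("contract_type")["monthly_charges"].transform(
    lambda s: s.fillna(s.median())
)

print(global_fill.iloc[[3, 7]].tolist())  # [55.0, 55.0]: both gaps get the same value
print(group_fill.iloc[[3, 7]].tolist())   # [25.0, 90.0]: each gap gets its group's median
```

The global median of 55 sits between the two groups and describes neither. If you want a strictly leakage-free evaluation later, compute the group medians on the training split only and reuse them for the test split.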

Step 3 — Duplicate Rows

n_dups = df.duplicated().sum()
print(n_dups)
# Output: 20

df = df.drop_duplicates().reset_index(drop=True)
print(df.shape)
# Output: (1000, 9)

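Always reset the index after dropping rows. Positional access with iloc is unaffected by gaps, but label-based access (df.loc[i], or any code that assumes labels run from 0 to n-1) and index-aligned assignment can silently pick the wrong rows or produce NaN.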
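One way a gapped index causes trouble: pandas aligns on index labels when it combines objects. A minimal sketch on made-up data (the `score` column is a hypothetical stand-in for anything computed elsewhere, such as model output):

```python
import pandas as pd

toy = pd.DataFrame({"customer_id": ["C1", "C1", "C2", "C3"]})

# Without reset_index the surviving labels are 0, 2, 3
gappy = toy.drop_duplicates()

# A Series built elsewhere carries the default labels 0, 1, 2
scores = pd.Series([0.1, 0.2, 0.3])

# assign() aligns on labels, so C2 receives 0.3 and C3 receives NaN
misaligned = gappy.assign(score=scores)
print(misaligned["score"].tolist())

# After reset_index(drop=True) labels and positions agree again
aligned = gappy.reset_index(drop=True).assign(score=scores)
print(aligned["score"].tolist())
```

No exception is raised in the misaligned case, which is what makes the bug easy to miss.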

Success

After steps 2 and 3 you have a clean DataFrame: 1,000 rows, 9 columns, 0 missing values, 0 duplicates. Save a checkpoint: df_clean = df.copy().
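Before saving the checkpoint, it can help to turn those claims into assertions so a later change to the cleaning code fails loudly. The helper below is a hypothetical sketch (`assert_clean` is not part of pandas), shown on a tiny stand-in frame:

```python
import pandas as pd

def assert_clean(frame: pd.DataFrame, id_col: str = "customer_id") -> None:
    """Raise AssertionError if NaNs, duplicate rows, repeated IDs, or index gaps remain."""
    assert frame.isna().sum().sum() == 0, "missing values remain"
    assert not frame.duplicated().any(), "duplicate rows remain"
    assert frame[id_col].is_unique, f"{id_col} is not unique"
    # Labels 0..n-1 indicate that reset_index(drop=True) was applied
    assert frame.index.equals(pd.RangeIndex(len(frame))), "index has gaps"

# Tiny stand-in frame; in the notebook you would call assert_clean(df_clean)
demo = pd.DataFrame({"customer_id": ["C1", "C2"], "monthly_charges": [50.0, 70.0]})
assert_clean(demo)
print("checkpoint OK")
```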

Step 4 — Churn Rate (the single most important number)

churn_rate = df["churn"].mean()
print(f"Overall churn rate: {churn_rate:.1%}")
# Output: Overall churn rate: 29.4%   (exact value varies slightly with seed)

df["churn"].value_counts()
# Output:
# 0    706
# 1    294
# Name: churn, dtype: int64

The dataset is moderately imbalanced (~70/30), which is common for churn data. At this ratio you do not need SMOTE or class weights, but you must stratify the train/test split and report the churn-class F1 rather than accuracy: a model that predicts "no churn" for every customer already scores 70.6% accuracy while catching zero churners.
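The sketch below makes both points concrete using only the class counts printed above. The split uses pandas' per-group sampling as a stand-in; in practice scikit-learn's `train_test_split` with its `stratify` argument is the usual route. The 80/20 ratio and the random seed are assumptions for illustration.

```python
import pandas as pd

# Class counts from the value_counts() output above
n_retained, n_churned = 706, 294
toy = pd.DataFrame({"churn": [0] * n_retained + [1] * n_churned})

# Baseline that predicts "no churn" for everyone: decent accuracy,
# but zero true positives, so churn-class recall and F1 are both 0
baseline_accuracy = (toy["churn"] == 0).mean()

# Stratified 80/20 split: sample 20% within each class, keep the rest for training
test = toy.groupby("churn").sample(frac=0.2, random_state=42)
train = toy.drop(test.index)

print(f"baseline accuracy: {baseline_accuracy:.3f}")
print(f"train churn rate:  {train['churn'].mean():.3f}")
print(f"test churn rate:   {test['churn'].mean():.3f}")
```

Both splits keep a churn rate close to the overall 29.4%, which a plain random split does not guarantee on a dataset this small.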

Step 5 — Numeric Feature Distributions

import matplotlib.pyplot as plt
import seaborn as sns

numeric_cols = ["tenure_months", "monthly_charges", "num_products", "support_calls"]

fig, axes = plt.subplots(2, 2, figsize=(12, 8))
for ax, col in zip(axes.flat, numeric_cols):
    df[col].hist(bins=30, ax=ax, edgecolor="black", color="steelblue", alpha=0.8)
    ax.set_title(col)
    ax.set_xlabel("")
plt.suptitle("Numeric Feature Distributions", y=1.02)
plt.tight_layout()
plt.show()

What to look for:

- tenure_months: roughly uniform. No transformation needed.
- monthly_charges: slightly right-skewed. Fine for tree models; consider a log transform if using linear models.
- support_calls: right-skewed, concentrated at 0–3. Important predictor.
- num_products: discrete, roughly uniform over 1–4. Treat as ordinal or numeric.
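Skew can also be checked numerically rather than by eye. The sketch below uses synthetic lognormal values as a stand-in for a right-skewed column (they are far more skewed than the chapter's monthly_charges and are only meant to show the effect of the transform):

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed values standing in for a charges-like column
rng = np.random.default_rng(0)
charges = pd.Series(rng.lognormal(mean=4.0, sigma=0.6, size=1000), name="monthly_charges")

skew_raw = charges.skew()
# log1p = log(1 + x): defined at 0 and compresses the long right tail
skew_log = np.log1p(charges).skew()

print(f"skew before: {skew_raw:.2f}, after log1p: {skew_log:.2f}")
```

A skew near 0 after the transform indicates a roughly symmetric distribution, which is what linear models tend to handle better.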

Step 6 — Each Feature vs Churn

This is the most valuable analysis in EDA. Look at churn rate within each feature's segments.

Numeric features — box plots by churn

fig, axes = plt.subplots(1, 4, figsize=(16, 5))
for ax, col in zip(axes, numeric_cols):
    df.boxplot(column=col, by="churn", ax=ax)
    ax.set_title(col)
    ax.set_xlabel("Churn")
plt.suptitle("")
plt.tight_layout()
plt.show()

What you should see:

- Churned customers (churn=1) have shorter tenure_months: median around 15–20 months vs 35–40 for retained customers.
- Churned customers have higher support_calls: median around 5 vs 3.
- The monthly_charges difference is smaller but present.
- The num_products difference is modest.
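The medians read off the box plots can be tabulated directly, which is easier to quote in the written findings. A minimal sketch on made-up rows (only the shape of the table matters here, not the numbers):

```python
import pandas as pd

# Made-up rows, only to show the shape of the summary table
toy = pd.DataFrame({
    "tenure_months": [3, 8, 14, 30, 42, 55],
    "support_calls": [6, 5, 4, 2, 3, 1],
    "churn":         [1, 1, 1, 0, 0, 0],
})

# One row per churn class, one column per feature
medians = toy.groupby("churn")[["tenure_months", "support_calls"]].median()
print(medians)
```

On the real data, pass `numeric_cols` instead of the two-column list.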

Categorical features — churn rate by group

for col in ["contract_type", "payment_method", "has_tech_support"]:
    rates = df.groupby(col)["churn"].mean().sort_values(ascending=False)
    print(f"\n--- {col} ---")
    print(rates.round(3).to_string())

# Output:
# --- contract_type ---
# contract_type
# Month-to-Month    0.432
# One Year          0.112
# Two Year          0.054
#
# --- payment_method ---
# payment_method
# Electronic Check    0.358
# Mailed Check        0.271
# Bank Transfer       0.231
# Credit Card         0.217
#
# --- has_tech_support ---
# has_tech_support
# 0    0.374
# 1    0.241

Success

This is the most actionable finding so far: Month-to-Month customers churn at 43%, Two Year customers at 5%. Expect contract type to be among the most important features in your model, and treat that as a hypothesis to check against the model's feature importances. Note this in your findings.
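A churn rate is only as reliable as the group it is computed on, so it is worth printing the group size next to each rate. The sketch below uses made-up rows in which the Two Year group has just two customers:

```python
import pandas as pd

# Illustrative data: the "Two Year" group has only two customers
toy = pd.DataFrame({
    "contract_type": ["Month-to-Month"] * 6 + ["One Year"] * 4 + ["Two Year"] * 2,
    "churn":         [1, 1, 1, 0, 0, 0] + [1, 0, 0, 0] + [0, 0],
})

# Named aggregation: churn rate and number of customers per group
summary = (
    toy.groupby("contract_type")["churn"]
       .agg(churn_rate="mean", n_customers="count")
       .sort_values("churn_rate", ascending=False)
)
print(summary)
```

A 0% rate from two customers says very little; a 5% rate from a few hundred says a lot. Reporting both columns keeps that distinction visible.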

Step 7 — Correlation Matrix

corr = df[numeric_cols + ["churn"]].corr()

plt.figure(figsize=(7, 5))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation Matrix")
plt.tight_layout()
plt.show()

Expected findings:

- support_calls has the strongest positive correlation with churn (~0.35–0.45).
- tenure_months has a negative correlation with churn (~-0.30 to -0.40).
- monthly_charges has a small positive correlation (~0.10).
- num_products is roughly uncorrelated with churn.

Warning

Pearson correlation (the default in df.corr()) only captures linear relationships. A feature with low correlation (like num_products) can still be useful in a tree model that captures non-linear splits. Do not drop features based on correlation alone.
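A synthetic example shows how far this can go: below, the target is fully determined by the feature, yet the correlation is essentially zero because the relationship is U-shaped. The data and the 0.505 threshold are invented for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic feature, symmetric around 0; the target is 1 only at the extremes
x = np.linspace(-1, 1, 201)
toy = pd.DataFrame({"x": x, "y": (np.abs(x) > 0.505).astype(int)})

# Series.corr() computes Pearson by default; here it is close to 0
r = toy["x"].corr(toy["y"])

# Binning x exposes the pattern that the correlation hides
bins = pd.cut(toy["x"], bins=[-1.01, -0.505, 0.505, 1.01])
rate_by_bin = toy.groupby(bins, observed=True)["y"].mean()

print(f"Pearson r = {r:.3f}")
print(rate_by_bin)
```

A tree can split on `x` twice and separate the classes perfectly, which is why a near-zero correlation alone is not grounds for dropping a feature.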

Step 8 — Written Findings

Before moving to feature engineering, write three to five sentences summarising what you found. This becomes the "EDA findings" section of your final report.

Template — fill this in:

Churn rate: [X]%. The dataset has [n] rows after removing [k] duplicates and imputing
[col] with contract-type-specific medians.

Key drivers of churn:
1. Contract type is the strongest separator: Month-to-Month customers churn at [X]%
   vs [Y]% for Two Year customers.
2. Support calls correlate positively with churn (r = [X]). Customers with 5+ calls
   in the last 3 months churn at [X]%.
3. Short-tenure customers (< 12 months) have disproportionately high churn.

Features with weak signal: [list them]. They will be included in modeling but are
unlikely to drive predictions.
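The bracketed values in the template can be computed in one pass rather than copied from scattered cells. This is a sketch on made-up rows; the dictionary keys are hypothetical names chosen for this example, and in the notebook you would replace `toy` with `df_clean`.

```python
import pandas as pd

# Stand-in frame with made-up rows; replace with df_clean in the notebook
toy = pd.DataFrame({
    "contract_type": ["Month-to-Month", "Month-to-Month", "Two Year",
                      "Two Year", "One Year", "Month-to-Month"],
    "support_calls": [6, 5, 1, 0, 2, 7],
    "tenure_months": [4, 10, 48, 60, 24, 6],
    "churn":         [1, 1, 0, 0, 0, 0],
})

by_contract = toy.groupby("contract_type")["churn"].mean()
findings = {
    "churn_rate": toy["churn"].mean(),
    "m2m_rate": by_contract["Month-to-Month"],
    "two_year_rate": by_contract["Two Year"],
    "r_support_calls": toy["support_calls"].corr(toy["churn"]),
    "rate_5plus_calls": toy.loc[toy["support_calls"] >= 5, "churn"].mean(),
    "rate_short_tenure": toy.loc[toy["tenure_months"] < 12, "churn"].mean(),
}
for name, value in findings.items():
    print(f"{name}: {value:.3f}")
```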

Tip

Write the findings before you run any models. It forces you to commit to a hypothesis that you can then validate (or invalidate) with the model's feature importance scores.
