Bivariate Analysis¶
Two variables in isolation only tell half the story. Bivariate analysis is where patterns become actionable — you learn that older customers spend more, that churn is concentrated in one region, that ad spend stops driving sales past a certain threshold. These are the insights that go into presentations and drive model features.
Learning Objectives¶
By the end of this, you will be able to:
- Choose the right chart and statistic for any pair of variable types
- Interpret scatter plots, box plots, and cross-tabs with confidence
- Distinguish correlation from causation and explain why it matters
- Analyze relationships between a feature and a target variable
The Three Combinations¶
Every bivariate question involves two variables. There are three type pairings, and each has its own toolkit:
| Pair | Visual | Statistic |
|---|---|---|
| Numeric vs Numeric | Scatter plot, line plot | Pearson/Spearman correlation |
| Categorical vs Numeric | Box plot, violin plot, bar chart | Grouped mean/median, ANOVA |
| Categorical vs Categorical | Heat map, grouped bar | Cross-tab, chi-square |
Numeric vs Numeric¶
Scatter Plot¶
The scatter plot is the first thing you reach for when both variables are numbers. It shows the direction, strength, and shape of a relationship at a glance.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
rng = np.random.default_rng(42)
n = 200
df = pd.DataFrame({
"ad_spend": rng.uniform(1000, 10000, n),
"revenue": None,
"customer_age": rng.integers(20, 65, n),
"support_calls": rng.integers(0, 10, n),
"churn": rng.choice(["Yes", "No"], n, p=[0.3, 0.7]),
"region": rng.choice(["North", "South", "East", "West"], n),
"segment": rng.choice(["Basic", "Premium"], n, p=[0.6, 0.4]),
})
df["revenue"] = df["ad_spend"] * rng.uniform(1.5, 3.5, n) + rng.normal(0, 500, n)
fig, ax = plt.subplots(figsize=(8, 5))
ax.scatter(df["ad_spend"], df["revenue"], alpha=0.4, edgecolors="none")
ax.set_xlabel("Ad Spend (₹)")
ax.set_ylabel("Revenue (₹)")
ax.set_title("Ad Spend vs Revenue")
plt.tight_layout()
plt.savefig("scatter_ad_revenue.png")
What to look for:
- Direction — does revenue go up as ad spend goes up (positive) or down (negative)?
- Strength — are points tightly clustered around a line or scattered everywhere?
- Shape — is it linear, or does the relationship plateau after a point?
- Outliers — points far from the cluster often deserve individual investigation
Correlation Coefficient¶
A number between -1 and 1 that summarises the scatter plot. Pearson measures linear relationships; Spearman measures monotonic relationships and is robust to outliers.
pearson_r = df["ad_spend"].corr(df["revenue"], method="pearson")
spearman_r = df["ad_spend"].corr(df["revenue"], method="spearman")
print(f"Pearson r: {pearson_r:.3f}")
print(f"Spearman r: {spearman_r:.3f}")
# Output: both will be high (~0.85) given the data generation above
Interpreting r:
| |r| | Strength | |---|---| | 0.9 – 1.0 | Very strong | | 0.7 – 0.9 | Strong | | 0.5 – 0.7 | Moderate | | 0.3 – 0.5 | Weak | | 0.0 – 0.3 | Negligible |
Correlation Matrix¶
When you have many numeric columns, scan all pairs at once.
numeric_cols = ["ad_spend", "revenue", "customer_age", "support_calls"]
corr = df[numeric_cols].corr()
fig, ax = plt.subplots(figsize=(7, 5))
mask = np.triu(np.ones_like(corr, dtype=bool)) # hide upper triangle
sns.heatmap(corr, mask=mask, annot=True, fmt=".2f",
cmap="coolwarm", center=0, ax=ax)
ax.set_title("Correlation Matrix")
plt.tight_layout()
plt.savefig("corr_matrix.png")
Correlation ≠ causation
A correlation of 0.85 between ad spend and revenue does not mean ad spend causes revenue. Both might be driven by a third variable — time of year, new product launches, or simply the fact that bigger companies both spend more and earn more. Always ask: what else could explain this?
Anscombe's Quartet
Four datasets can have identical correlations (r = 0.816) but completely different scatter patterns — including one with a curved relationship and one with an outlier driving everything. Always plot. Never trust the number alone.
Categorical vs Numeric¶
Grouped Summary Statistics¶
Start with numbers before charts.
summary = df.groupby("segment")["revenue"].agg(["mean", "median", "std", "count"])
print(summary)
# Output:
# mean median std count
# segment
# Basic 12847 11923 5621 120
# Premium 19341 18104 7832 80
The median is often more honest than the mean here — one high-value customer in a small segment inflates the mean significantly.
Box Plot¶
Shows distribution shape, median, IQR, and outliers per group. The most information-dense chart for this combination.
fig, ax = plt.subplots(figsize=(8, 5))
sns.boxplot(data=df, x="segment", y="revenue", ax=ax)
ax.set_title("Revenue Distribution by Customer Segment")
ax.set_ylabel("Revenue (₹)")
plt.tight_layout()
plt.savefig("boxplot_segment_revenue.png")
Violin Plot¶
When you suspect the distribution is bimodal or has an unusual shape, a violin plot shows the full density.
fig, ax = plt.subplots(figsize=(8, 5))
sns.violinplot(data=df, x="region", y="revenue", ax=ax, inner="box")
ax.set_title("Revenue Distribution by Region")
plt.xticks(rotation=15)
plt.tight_layout()
plt.savefig("violin_region_revenue.png")
Bar chart vs box plot
A bar chart of means hides the distribution entirely. Two groups can have the same mean but completely different shapes — one tight and consistent, one wildly variable. Use box plots when distribution shape matters, which it almost always does.
Categorical vs Categorical¶
Cross-Tabulation¶
The workhorse for two categorical variables. Shows raw counts and, more usefully, percentages.
# Raw counts
counts = pd.crosstab(df["segment"], df["churn"])
print(counts)
# Output:
# churn No Yes
# segment
# Basic 78 42
# Premium 62 18
# Row percentages — what fraction of each segment churns?
pct = pd.crosstab(df["segment"], df["churn"], normalize="index") * 100
print(pct.round(1))
# Output:
# churn No Yes
# segment
# Basic 65.0 35.0
# Premium 77.5 22.5
The row percentages immediately tell the story: Premium customers churn at half the rate of Basic customers.
Grouped Bar Chart¶
Visualise the cross-tab.
fig, ax = plt.subplots(figsize=(7, 5))
pct.plot(kind="bar", ax=ax, rot=0, color=["#0D9488", "#F59E0B"])
ax.set_title("Churn Rate by Segment")
ax.set_ylabel("Percentage (%)")
ax.legend(title="Churn")
plt.tight_layout()
plt.savefig("bar_churn_segment.png")
Heatmap of Cross-Tab¶
More readable than a bar chart when there are many categories.
fig, ax = plt.subplots(figsize=(6, 4))
sns.heatmap(
pd.crosstab(df["region"], df["churn"], normalize="index") * 100,
annot=True, fmt=".1f", cmap="YlOrRd", ax=ax
)
ax.set_title("Churn Rate (%) by Region")
plt.tight_layout()
plt.savefig("heatmap_region_churn.png")
Target Variable Analysis¶
In a supervised ML project, one specific bivariate comparison matters more than all others: each feature against the target. Run this systematically before touching a model.
# Numeric features vs binary target
for col in ["ad_spend", "revenue", "customer_age", "support_calls"]:
churned_mean = df[df["churn"] == "Yes"][col].mean()
not_churned_mean = df[df["churn"] == "No"][col].mean()
print(f"{col:20s} churned={churned_mean:8.1f} retained={not_churned_mean:8.1f}")
If churned customers have significantly more support calls, that is a signal. If their ad spend is similar, it is probably not a useful feature.
The bivariate habit
Before building any model, produce a 2x2 grid: numeric features vs target (box plots), and categorical features vs target (cross-tabs). This 30-minute exercise will tell you which features are worth engineering and which to ignore.
Common Pitfalls¶
Ecological fallacy
Relationships that hold at the group level (regions with more ad spend have higher revenue) do not necessarily hold at the individual level. Aggregating then correlating inflates apparent relationships. Always verify at the appropriate grain.
Overplotting
With thousands of points, a scatter plot becomes a solid blob. Use alpha=0.1, hexbin plots (plt.hexbin()), or 2D KDE plots (sns.kdeplot(x=..., y=...)) instead.