
Statistical Tests — Choosing the Right One

Picking the wrong statistical test is one of the most common errors in applied data work. Using a t-test when you have three groups, or a chi-square test on continuous data, or ignoring assumptions about variance — these mistakes lead to invalid p-values and wrong conclusions. The good news: test selection follows a clear decision process. Once you know the rules, you can pick the right test in about thirty seconds.

Learning Objectives

  • Use a systematic decision process to select the correct test for any scenario
  • Run and interpret one-sample t-test, two-sample t-test (equal and unequal variance), paired t-test, ANOVA, and chi-square test
  • Understand the assumptions of each test and what to do when they fail
  • Know the non-parametric alternatives for each parametric test
  • Recognize when you need to apply multiple testing correction

The Test Selection Decision Tree

Answer three questions in order:

1. What is the outcome variable type?
   • Continuous/numeric → consider t-tests, ANOVA, correlation
   • Categorical/binary → consider chi-square, proportions tests

2. How many groups are you comparing?
   • One group vs a known value → one-sample t-test
   • Two groups → two-sample t-test (or paired if same subjects)
   • Three or more groups → ANOVA

3. Are the observations independent?
   • Independent → independent samples test
   • Same subjects measured twice → paired test

outcome is numeric?
├── yes → how many groups?
│   ├── 1 (vs. known value) → one-sample t-test
│   ├── 2 → are subjects paired?
│   │   ├── yes → paired t-test
│   │   └── no → independent two-sample t-test
│   └── 3+ → one-way ANOVA
└── no (categorical) → chi-square test
    ├── goodness of fit (one variable) → chi-square goodness of fit
    └── independence (two variables) → chi-square contingency table
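
If you prefer the logic as code, here is a toy helper that mirrors the tree above. The function name and signature are our own invention, purely for illustration:

def choose_test(outcome: str, n_groups: int = 2, paired: bool = False) -> str:
    """Illustrative selector following the decision tree above."""
    if outcome == 'numeric':
        if n_groups == 1:
            return 'one-sample t-test'
        if n_groups == 2:
            return 'paired t-test' if paired else "independent two-sample t-test (Welch's)"
        return 'one-way ANOVA'
    # categorical outcome
    return 'chi-square test (goodness of fit or independence)'

print(choose_test('numeric', n_groups=3))   # one-way ANOVA
print(choose_test('numeric', paired=True))  # paired t-test
print(choose_test('categorical'))           # chi-square test (goodness of fit or independence)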

Parametric vs Non-Parametric

Parametric tests (t-tests, ANOVA) assume the data is approximately normally distributed and make specific assumptions about variances. Non-parametric tests make fewer assumptions by working on ranks rather than raw values. Non-parametric tests are somewhat less powerful when parametric assumptions hold, but they remain valid when those assumptions do not.
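
A quick simulation illustrates the power trade-off. This is a sketch under assumed settings (normal data, 20 observations per group, a true shift of 0.8 standard deviations); exact figures will vary:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sims, n, shift = 2000, 20, 0.8
t_hits = mw_hits = 0
for _ in range(n_sims):
    a = rng.normal(0, 1, n)
    b = rng.normal(shift, 1, n)   # a real difference exists
    if stats.ttest_ind(a, b, equal_var=False)[1] < 0.05:
        t_hits += 1
    if stats.mannwhitneyu(a, b, alternative='two-sided')[1] < 0.05:
        mw_hits += 1

print(f"t-test power:       {t_hits / n_sims:.2f}")
print(f"Mann-Whitney power: {mw_hits / n_sims:.2f}")
# When normality holds, expect the t-test to win by a few percentage points;
# under heavy-tailed data the ranking typically reverses.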


One-Sample t-Test

Use when: You have one group of numeric measurements and want to test whether the population mean equals a specific value.

Example: Is the average delivery time different from our SLA of 30 minutes?

Assumptions:
  • Observations are independent
  • Data is approximately normally distributed (or n > 30 by CLT)
  • No extreme outliers

import numpy as np
from scipy import stats

# Delivery times (minutes) for 15 orders
delivery_times = np.array([
    28, 31, 29, 27, 30, 26, 32, 28, 29, 27,
    25, 33, 30, 28, 31
])

# H₀: population mean = 30 minutes
# H₁: population mean ≠ 30 minutes (two-tailed)
t_stat, p_value = stats.ttest_1samp(delivery_times, popmean=30)

print(f"Sample mean:  {delivery_times.mean():.2f} min")
print(f"t-statistic:  {t_stat:.4f}")
print(f"p-value:      {p_value:.4f}")
# Output:
# Sample mean:  28.93 min
# t-statistic:  -1.8352
# p-value:      0.0878

alpha = 0.05
result = "Reject H₀" if p_value <= alpha else "Fail to reject H₀"
print(f"\nDecision (α={alpha}): {result}")
# Output: Decision (α=0.05): Fail to reject H₀

# One-tailed test: is delivery time BELOW 30 minutes?
t_stat_1t, p_value_1t = stats.ttest_1samp(delivery_times, popmean=30, alternative='less')
print(f"\nOne-tailed (less) p-value: {p_value_1t:.4f}")
# Output: One-tailed (less) p-value: 0.0439

Note: the two-tailed test does not reject at α = 0.05 (p ≈ 0.088), while the one-tailed test of whether the mean is specifically below 30 does reject (p ≈ 0.044). Results that flip like this are exactly why the direction of a one-tailed test must be chosen before looking at the data. You would also want to look at the confidence interval and consider sample size before drawing conclusions, as sketched below.
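
A minimal sketch of that confidence-interval check, reusing delivery_times from the code above:

# 95% confidence interval for the mean delivery time
n = len(delivery_times)
se = stats.sem(delivery_times)   # standard error of the mean (ddof=1)
ci_low, ci_high = stats.t.interval(0.95, df=n - 1,
                                   loc=delivery_times.mean(), scale=se)
print(f"95% CI: ({ci_low:.2f}, {ci_high:.2f}) min")
# Output (approx): 95% CI: (27.69, 30.18) min
# The interval contains 30, consistent with the two-tailed non-rejection.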


Independent Two-Sample t-Test

Use when: You have two independent groups and want to compare their means.

Example: Does the new website design produce higher average session duration than the old design?

Two variants:
  • Equal variances assumed (Student's t): scipy's default; defensible when Levene's test is not significant
  • Unequal variances (Welch's t): pass equal_var=False in scipy; robust even when variances turn out to be equal

import numpy as np
from scipy import stats

# Session durations (seconds) for two versions of the homepage
old_design = np.array([180, 195, 210, 175, 200, 185, 220, 190, 165, 205,
                       195, 180, 210, 185, 200])
new_design = np.array([220, 235, 245, 215, 240, 230, 255, 225, 210, 248,
                       232, 218, 244, 228, 237])

# Step 1: Check whether variances are equal using Levene's test
levene_stat, levene_p = stats.levene(old_design, new_design)
print(f"Levene's test: stat={levene_stat:.4f}, p={levene_p:.4f}")
# Output (approx): Levene's test: stat=0.258, p=0.615
# p > 0.05: no evidence of unequal variances, but we use Welch's anyway (it's safer)

# Step 2: Run the t-test
# equal_var=False → Welch's t-test (recommended default)
t_stat, p_value = stats.ttest_ind(old_design, new_design, equal_var=False)

print(f"\nOld design mean: {old_design.mean():.2f}s")
print(f"New design mean: {new_design.mean():.2f}s")
print(f"Difference:      {new_design.mean() - old_design.mean():.2f}s")
print(f"t-statistic:     {t_stat:.4f}")
print(f"p-value:         {p_value:.6f}")
# Output:
# Old design mean: 193.00s
# New design mean: 232.13s
# Difference:      39.13s
# t-statistic:     -7.6344
# p-value:         0.000000

# Step 3: Effect size (Cohen's d)
pooled_std = np.sqrt((old_design.std(ddof=1)**2 + new_design.std(ddof=1)**2) / 2)  # ddof=1: sample SD
cohens_d = (new_design.mean() - old_design.mean()) / pooled_std
print(f"Cohen's d:       {cohens_d:.4f}  (large effect)")
# Output: Cohen's d:       2.7877  (large effect)

Always Use Welch's t-Test by Default

Welch's t-test (equal_var=False) performs well even when variances are equal, and it protects you when they are not. The traditional Student's t-test assumes equal variances, an assumption that often does not hold. Note that scipy's ttest_ind defaults to equal_var=True (Student's test), so you must pass equal_var=False explicitly. Use Welch's unless you have a specific reason not to.
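
To see why this matters, here is a small simulation sketch. The settings are our own choice: the smaller group gets the larger variance, which is the worst case for Student's t. Both groups share the same true mean, so every rejection is a false positive:

import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n_sims = 2000
fp_student = fp_welch = 0
for _ in range(n_sims):
    a = rng.normal(0, 40, 10)   # small group, large spread
    b = rng.normal(0, 5, 100)   # large group, small spread (same true mean)
    if stats.ttest_ind(a, b, equal_var=True)[1] < 0.05:
        fp_student += 1
    if stats.ttest_ind(a, b, equal_var=False)[1] < 0.05:
        fp_welch += 1

print(f"False-positive rate, Student's t: {fp_student / n_sims:.3f}")  # far above 0.05
print(f"False-positive rate, Welch's t:   {fp_welch / n_sims:.3f}")    # close to 0.05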


Paired t-Test

Use when: The same subjects are measured under two conditions, or measurements are matched by design (before/after, the same store in two different months, matched pairs).

Why it is different: The key insight is that with paired data, you can compute the difference per subject and test whether the mean difference is zero. This removes between-subject variability, making the test more powerful.

Example: Employee productivity (tasks completed per day) before and after ergonomic keyboard intervention.

import numpy as np
from scipy import stats

# Same 12 employees measured before and after new keyboards
before = np.array([42, 38, 51, 45, 39, 47, 52, 41, 43, 49, 46, 44])
after  = np.array([46, 43, 55, 48, 41, 52, 57, 44, 47, 53, 50, 49])

# The differences tell the actual story
differences = after - before
print(f"Mean difference: {differences.mean():.2f} tasks/day")
print(f"Differences: {differences}")
# Output:
# Mean difference: 4.00 tasks/day
# Differences: [4 5 4 3 2 5 5 3 4 4 4 5]

# H₀: mean difference = 0
# H₁: mean difference ≠ 0
t_stat, p_value = stats.ttest_rel(before, after)

print(f"\nt-statistic: {t_stat:.4f}")
print(f"p-value:     {p_value:.8f}")
# Output:
# t-statistic: -14.5327
# p-value:     0.00000002

print(f"\nAll employees improved. The intervention worked (p << 0.05).")
# Also check effect size for paired data: Cohen's dz
cohens_dz = differences.mean() / differences.std(ddof=1)
print(f"Cohen's dz: {cohens_dz:.4f}  (very large effect)")
# Output: Cohen's dz: 4.1952  (very large effect)

Using Independent Test on Paired Data

If you run an independent t-test on data that is actually paired, you lose all the power that pairing gives you. The within-subject variation cancels out in a paired test but adds noise in an independent test. Always ask: "Are the same subjects appearing in both groups?"
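
To make the cost concrete, here is a sketch running both tests on the before/after data above (the independent test is deliberately the wrong choice here):

# WRONG for this data: the independent t-test ignores the pairing
t_ind, p_ind = stats.ttest_ind(before, after)
# Correct: the paired t-test
t_rel, p_rel = stats.ttest_rel(before, after)

print(f"Independent t-test p: {p_ind:.4f}")
print(f"Paired t-test p:      {p_rel:.8f}")
# Output (approx):
# Independent t-test p: 0.0486   (barely significant)
# Paired t-test p:      0.00000002
# Between-employee variation swamps the 4-task improvement in the
# independent test but cancels out in the paired test.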


One-Way ANOVA

Use when: You want to compare means across three or more independent groups.

Why not just run multiple t-tests? If you run three t-tests to compare groups A vs B, A vs C, and B vs C, each at α = 0.05, your overall false-positive rate is no longer 5%. It is closer to 14%. ANOVA tests all groups simultaneously while controlling the Type I error rate.

What ANOVA tests: H₀ is that all group means are equal. A significant result tells you at least one group differs — not which one. For that, you need a post-hoc test.

Example: Customer spending across three loyalty tiers.

import numpy as np
from scipy import stats

np.random.seed(5)
# Monthly spend ($) for Bronze, Silver, and Gold tier customers
bronze = np.random.normal(loc=80,  scale=20, size=40)
silver = np.random.normal(loc=150, scale=25, size=40)
gold   = np.random.normal(loc=300, scale=40, size=40)

# H₀: μ_bronze = μ_silver = μ_gold
# H₁: At least one group mean differs
f_stat, p_value = stats.f_oneway(bronze, silver, gold)

print(f"Bronze mean: ${bronze.mean():.2f}")
print(f"Silver mean: ${silver.mean():.2f}")
print(f"Gold mean:   ${gold.mean():.2f}")
print(f"\nF-statistic: {f_stat:.4f}")
print(f"p-value:     {p_value:.8f}")
# Output:
# Bronze mean: $80.87
# Silver mean: $147.30
# Gold mean:   $299.54
# F-statistic: 578.3742
# p-value:     0.00000000

# ANOVA says "at least one group differs" — use Tukey HSD for which ones
from statsmodels.stats.multicomp import pairwise_tukeyhsd

all_data = np.concatenate([bronze, silver, gold])
labels   = ['Bronze'] * 40 + ['Silver'] * 40 + ['Gold'] * 40

tukey = pairwise_tukeyhsd(endog=all_data, groups=labels, alpha=0.05)
print("\nTukey HSD Post-Hoc Test:")
print(tukey.summary())
# Output: All three pairs are significantly different from each other

ANOVA Assumptions

  1. Observations are independent within and between groups
  2. Each group is approximately normally distributed
  3. The groups have approximately equal variances (homoscedasticity)

Check normality with Shapiro-Wilk (stats.shapiro). Check equal variances with Levene's test (stats.levene). If variances are unequal, use Welch's ANOVA instead.
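
A sketch of those checks on the three tiers from above. Note that scipy has no built-in Welch's ANOVA; pingouin.welch_anova is one outside option if you need it:

# Normality per group (Shapiro-Wilk)
for name, grp in [('Bronze', bronze), ('Silver', silver), ('Gold', gold)]:
    _, p_norm = stats.shapiro(grp)
    print(f"Shapiro-Wilk {name}: p={p_norm:.4f}")

# Equal variances (Levene's test)
_, p_lev = stats.levene(bronze, silver, gold)
print(f"Levene's test: p={p_lev:.4f}")
# These groups were generated with different scales (20, 25, 40), so expect
# Levene's test to flag unequal variances; that argues for Welch's ANOVA here.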


Chi-Square Test

The chi-square test is for categorical data. It has two common variants:

Chi-square goodness of fit: Tests whether observed frequencies match expected frequencies for a single categorical variable.

Chi-square test of independence: Tests whether two categorical variables are associated.

Goodness of Fit Example

Are customer complaints distributed equally across four product categories, or is one category disproportionately problematic?

import numpy as np
from scipy import stats

# Observed complaints per category
observed = np.array([45, 38, 52, 65])  # Electronics, Clothing, Food, Home

# Expected if complaints were uniformly distributed
n_total = observed.sum()
n_categories = len(observed)
expected = np.array([n_total / n_categories] * n_categories)

chi2_stat, p_value = stats.chisquare(f_obs=observed, f_exp=expected)

print(f"Observed: {observed}")
print(f"Expected: {expected}")
print(f"\nchi2-statistic: {chi2_stat:.4f}")
print(f"p-value:        {p_value:.4f}")
# Output:
# Observed: [45 38 52 65]
# Expected: [50. 50. 50. 50.]
# chi2-statistic: 7.9600
# p-value:        0.0468

print(f"\nComplaints are not equally distributed (p < 0.05).")
print(f"Home category has the most complaints.")

Test of Independence Example

Is purchase behavior associated with device type?

import pandas as pd
import numpy as np
from scipy import stats

# Contingency table: device type vs purchased (yes/no)
contingency_table = pd.DataFrame(
    data=[[250, 150], [180, 120], [90,  60]],
    index=['Mobile', 'Desktop', 'Tablet'],
    columns=['Purchased', 'Did Not Purchase']
)

print("Contingency Table:")
print(contingency_table)
# Output:
# Contingency Table:
#          Purchased  Did Not Purchase
# Mobile         250               150
# Desktop        180               120
# Tablet          90                60

chi2_stat, p_value, dof, expected_freq = stats.chi2_contingency(contingency_table)

print(f"\nchi2-statistic: {chi2_stat:.4f}")
print(f"Degrees of freedom: {dof}")
print(f"p-value:        {p_value:.4f}")
print(f"\nExpected frequencies:")
print(pd.DataFrame(expected_freq.round(1), index=contingency_table.index,
                   columns=contingency_table.columns))
# Output:
# chi2-statistic: 0.5573
# Degrees of freedom: 2
# p-value:        0.7568
# (Purchase rates are similar across devices: 62.5%, 60%, 60%. No evidence of association.)

Chi-Square Assumes Sufficient Cell Counts

The chi-square test is unreliable when expected cell counts are below 5. Check the expected_freq output. If any cell is below 5, consider combining categories or using Fisher's exact test instead.

# Fisher's exact test (for 2x2 tables with small samples)
table_2x2 = np.array([[15, 5], [10, 20]])
odds_ratio, p_fisher = stats.fisher_exact(table_2x2)
print(f"Fisher's exact test p-value: {p_fisher:.4f}")
# Output: Fisher's exact test p-value: 0.0086

Non-Parametric Alternatives

When your data violates the normality assumption (especially with small samples), reach for these:

Parametric Test       Non-Parametric Alternative   Use When
One-sample t-test     Wilcoxon signed-rank test    Small n, non-normal, ordinal data
Independent t-test    Mann-Whitney U test          Non-normal distributions, ordinal data
Paired t-test         Wilcoxon signed-rank test    Paired data, non-normal
One-way ANOVA         Kruskal-Wallis test          3+ groups, non-normal
Pearson correlation   Spearman correlation         Non-linear monotonic, ordinal

from scipy import stats
import numpy as np

# Mann-Whitney U: non-parametric alternative to two-sample t-test
group_a = np.array([12, 15, 11, 18, 10, 13, 16, 14])
group_b = np.array([19, 22, 17, 25, 20, 21, 23, 18])

u_stat, p_mw = stats.mannwhitneyu(group_a, group_b, alternative='two-sided')
print(f"Mann-Whitney U: stat={u_stat:.1f}, p={p_mw:.4f}")
# Output (approx): Mann-Whitney U: stat=1.5, p=0.0016

# Kruskal-Wallis: non-parametric ANOVA
kw_stat, p_kw = stats.kruskal(group_a, group_b, [15, 16, 14, 17, 15])
print(f"Kruskal-Wallis: stat={kw_stat:.4f}, p={p_kw:.4f}")
# Output (approx): Kruskal-Wallis: stat=13.8272, p=0.0010
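
The table's remaining entry, the Wilcoxon signed-rank test, handles paired data. A minimal sketch on made-up before/after scores (values chosen so the differences have no ties, keeping the test exact):

import numpy as np
from scipy import stats

# Wilcoxon signed-rank: non-parametric alternative to the paired t-test
pre  = np.array([42, 38, 51, 45, 39, 47, 52, 41])
post = np.array([43, 40, 54, 49, 44, 53, 59, 49])
w_stat, p_w = stats.wilcoxon(pre, post)
print(f"Wilcoxon signed-rank: stat={w_stat:.1f}, p={p_w:.4f}")
# Output: Wilcoxon signed-rank: stat=0.0, p=0.0078
# Every difference has the same sign, so the smaller rank sum is 0.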

Checking Normality

Before choosing between parametric and non-parametric tests, check your distribution.

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

data = np.array([23, 25, 28, 22, 26, 24, 27, 23, 29, 25, 22, 26, 24, 27, 28])

# Shapiro-Wilk test (best for small samples, n < 50)
shapiro_stat, shapiro_p = stats.shapiro(data)
print(f"Shapiro-Wilk: stat={shapiro_stat:.4f}, p={shapiro_p:.4f}")
# Output: Shapiro-Wilk: stat=0.9746, p=0.9128
# p > 0.05: no evidence against normality — proceed with parametric test

# Visual check: Q-Q plot
fig, ax = plt.subplots(figsize=(6, 5))
stats.probplot(data, dist="norm", plot=ax)
ax.set_title("Q-Q Plot — Does Data Follow Normal Distribution?")
plt.tight_layout()
plt.savefig("qqplot.png", dpi=150)
plt.show()

The Central Limit Theorem Is Your Backup

For large samples (n > 30), the sampling distribution of the mean is approximately normal regardless of the original distribution. This means t-tests are fairly robust to non-normality at reasonable sample sizes. For very small samples (n < 15), normality matters more.
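
A quick sketch of the CLT in action, using a deliberately skewed (exponential) population:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
raw = rng.exponential(scale=1.0, size=10_000)                       # skewness ≈ 2
means_n40 = rng.exponential(scale=1.0, size=(10_000, 40)).mean(axis=1)

print(f"Skewness of raw draws:    {stats.skew(raw):.2f}")       # ≈ 2.0
print(f"Skewness of means (n=40): {stats.skew(means_n40):.2f}")  # ≈ 0.3
# The sampling distribution of the mean is far closer to normal than the
# population it came from, which is what makes t-tests robust at larger n.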


Multiple Testing Correction

When you run many tests simultaneously — testing 50 features, or checking 20 subgroups — your family-wise error rate inflates. At α = 0.05 with 20 independent tests, you expect one false positive just by chance.

import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

np.random.seed(42)
# 20 independent tests, all truly null
p_values = [stats.ttest_ind(
    np.random.normal(0, 1, 30),
    np.random.normal(0, 1, 30)
)[1] for _ in range(20)]

# Bonferroni correction: strict, controls family-wise error rate
reject_bonferroni, p_bonferroni, _, _ = multipletests(p_values, alpha=0.05, method='bonferroni')

# Benjamini-Hochberg: less conservative, controls false discovery rate
reject_bh, p_bh, _, _ = multipletests(p_values, alpha=0.05, method='fdr_bh')

print(f"Uncorrected significant: {sum(p < 0.05 for p in p_values)}")
print(f"Significant (Bonferroni): {sum(reject_bonferroni)}")
print(f"Significant (BH):         {sum(reject_bh)}")
# Output (approx):
# Uncorrected significant: 2
# Significant (Bonferroni): 0
# Significant (BH):         0

Complete Decision Reference

Scenario                                   Data Type        Assumptions Met          Test
Sample mean vs known value                 Numeric          Normal / large n         One-sample t-test
Sample mean vs known value                 Numeric          Non-normal, small n      Wilcoxon signed-rank
Two independent groups                     Numeric          Normal                   Welch's two-sample t-test
Two independent groups                     Numeric          Non-normal               Mann-Whitney U
Same subjects, two conditions              Numeric          Normal differences       Paired t-test
Same subjects, two conditions              Numeric          Non-normal differences   Wilcoxon signed-rank
Three or more independent groups           Numeric          Normal, equal variance   One-way ANOVA
Three or more independent groups           Numeric          Non-normal               Kruskal-Wallis
Two categorical variables                  Categorical      Expected counts ≥ 5      Chi-square independence
Two categorical variables                  Categorical      Small n or small counts  Fisher's exact
One categorical variable vs distribution   Categorical      Expected counts ≥ 5      Chi-square goodness of fit
Linear relationship between two numerics   Numeric          Normal, linear           Pearson correlation
Monotonic relationship or ordinal          Numeric/Ordinal  Any                      Spearman correlation

Practice Exercises

Warm-up: You have three classes that took the same exam. Test whether average scores differ across classes.

from scipy import stats
class_a = [72, 68, 75, 70, 73, 69, 74, 71]
class_b = [80, 82, 79, 83, 81, 84, 78, 82]
class_c = [65, 67, 63, 68, 66, 64, 70, 65]
# Choose the right test. Report the result and your interpretation.

Main: A survey asked 300 respondents their preferred device (Mobile/Desktop/Tablet) and whether they completed a purchase. Test whether device preference is associated with purchase behavior. Create the contingency table from raw data and run the chi-square test.

import pandas as pd
import numpy as np
np.random.seed(1)
devices    = np.random.choice(['Mobile', 'Desktop', 'Tablet'], size=300, p=[0.5, 0.35, 0.15])
# Make purchase probability slightly different by device
purchase_prob = {'Mobile': 0.40, 'Desktop': 0.55, 'Tablet': 0.45}
purchased = [np.random.binomial(1, purchase_prob[d]) for d in devices]
survey_df = pd.DataFrame({'device': devices, 'purchased': purchased})

Stretch: Run the paired t-test and independent t-test on the same before/after data. Explain why the p-values differ. When would each be appropriate?


Key Takeaways

  • Test selection depends on: (1) data type, (2) number of groups, (3) whether subjects are paired.
  • Welch's t-test is the safe default for two independent numeric groups.
  • ANOVA tests whether any group mean differs — not which one. Use Tukey HSD post-hoc for pairwise comparisons.
  • Chi-square requires expected cell counts ≥ 5. Use Fisher's exact test for small samples.
  • Non-parametric tests are your fallback when normality assumptions fail.
  • Multiple testing inflates false positives. Apply Bonferroni or Benjamini-Hochberg correction.
