Statistics Cheat Sheet¶

A dense reference for Day 4 concepts. Every entry has the formula, the Python one-liner, and when to use it.

Central Tendency¶

Measure	Formula	Python	Use When
Mean	`Σx / n`	`s.mean()`	Symmetric data, no extreme outliers
Median	Middle value (sorted)	`s.median()`	Skewed data, outliers present
Mode	Most frequent value	`s.mode()[0]`	Categorical data, finding the peak
Weighted mean	`Σ(w·x) / Σw`	`np.average(arr, weights=w)`	Values have different importance

import pandas as pd
import numpy as np

s = pd.Series([22, 25, 29, 35, 40, 95])

print(s.mean())              # 41.0 — pulled up by 95
print(s.median())            # 32.0 — more representative
print(s.mode()[0])           # all unique here; .mode() returns all tied peaks
print(np.average(s, weights=[1,1,1,1,1,3]))  # 3x weight on the last value

Diagnosing Skew Quickly¶

gap = s.mean() - s.median()

# Gap > 0   → right-skewed (long right tail — income, prices)
# Gap < 0   → left-skewed  (long left tail  — easy exam scores)
# Gap ≈ 0   → roughly symmetric

print(s.skew())   # positive = right, negative = left

`skew` value	Shape	Prefer
`< -1` or `> 1`	Strongly skewed	Median
`-1` to `-0.5` or `0.5` to `1`	Moderately skewed	Median
`-0.5` to `0.5`	Roughly symmetric	Mean

Spread¶

Measure	Formula	Python	Use When
Range	`max - min`	`s.max() - s.min()`	Quick sanity check
Variance (sample)	`Σ(x - x̄)² / (n-1)`	`s.var()`	Squared units; input to std
Variance (population)	`Σ(x - μ)² / n`	`s.var(ddof=0)`	Full population data
Std dev (sample)	`√variance`	`s.std()`	Spread in original units
Std dev (population)	`√variance`	`s.std(ddof=0)`	Full population data
IQR	`Q3 - Q1`	`s.quantile(.75) - s.quantile(.25)`	Skewed data, outlier-robust
Coefficient of variation	`(std / mean) × 100`	`(s.std() / s.mean()) * 100`	Comparing spread across scales

s = pd.Series([60, 70, 75, 80, 95])

print(f"Range:      {s.max() - s.min()}")
print(f"Variance:   {s.var():.2f}")        # sample (ddof=1 default)
print(f"Std:        {s.std():.2f}")
print(f"IQR:        {s.quantile(.75) - s.quantile(.25):.2f}")
print(f"CV:         {(s.std() / s.mean()) * 100:.1f}%")

Pandas vs NumPy — Default Behavior¶

Operation	pandas	NumPy
Variance default	`ddof=1` (sample)	`ddof=0` (population)
Std dev default	`ddof=1` (sample)	`ddof=0` (population)
NaN handling	skips NaN by default	propagates NaN by default

arr = np.array([10, 20, 30, 40, 50])
s   = pd.Series(arr)

print(np.std(arr))         # 14.14 — population std (ddof=0)
print(s.std())             # 15.81 — sample std (ddof=1)
print(np.std(arr, ddof=1)) # 15.81 — force sample in NumPy

Outlier Detection — IQR Method¶

q1  = s.quantile(0.25)
q3  = s.quantile(0.75)
iqr = q3 - q1

lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers    = s[(s < lower) | (s > upper)]
clean_data  = s[(s >= lower) & (s <= upper)]

This is what a box plot uses to draw its whiskers and flag outliers.

Outlier Detection — Z-Score Method¶

from scipy import stats

z_scores = np.abs(stats.zscore(s))
outliers = s[z_scores > 3]   # values more than 3 std from the mean

Use Z-scores only for roughly normal data. Use IQR for skewed data. Mixing them gives wrong results.

Probability Rules¶

Rule	Formula	Code
Complement	`P(not A) = 1 - P(A)`	`1 - p_a`
AND (independent)	`P(A ∩ B) = P(A) × P(B)`	`p_a * p_b`
OR (general)	`P(A ∪ B) = P(A) + P(B) - P(A ∩ B)`	`p_a + p_b - p_ab`
Conditional	`P(A\\|B) = P(A ∩ B) / P(B)`	`p_ab / p_b`
Bayes	`P(A\\|B) = P(B\\|A) × P(A) / P(B)`	see below

# Bayes' theorem
p_a             = 0.01    # prior: P(disease)
p_b_given_a     = 0.95    # likelihood: P(positive | disease)
p_b_given_not_a = 0.05    # false positive rate

p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)
p_a_given_b = (p_b_given_a * p_a) / p_b

print(f"P(disease | positive test) = {p_a_given_b:.4f}")  # 0.1610

Distributions Quick Reference¶

Distribution	Parameters	Mean	Variance	Use Case
Normal	μ, σ	μ	σ²	Symmetric measurements
Binomial	n, p	np	np(1-p)	Count of successes in n trials
Poisson	λ	λ	λ	Events per unit time/area
Uniform	a, b	(a+b)/2	(b-a)²/12	All outcomes equally likely

Generate and query distributions with scipy.stats¶

from scipy import stats

# Normal
n = stats.norm(loc=70, scale=10)       # mean=70, std=10
print(n.pdf(70))                        # density at 70
print(n.cdf(80))                        # P(X ≤ 80) ≈ 0.841
print(n.ppf(0.95))                      # 95th percentile ≈ 86.4

# Binomial
b = stats.binom(n=100, p=0.3)
print(b.pmf(30))                        # P(X = 30)
print(b.cdf(25))                        # P(X ≤ 25)

# Poisson
p = stats.poisson(mu=5)
print(p.pmf(5))                         # P(X = 5)
print(1 - p.cdf(9))                     # P(X > 9)

The 68-95-99.7 Rule (Normal Only)¶

mu, sigma = 100, 15

print(f"68%: {mu - sigma:.0f} to {mu + sigma:.0f}")     # 85 to 115
print(f"95%: {mu - 2*sigma:.0f} to {mu + 2*sigma:.0f}") # 70 to 130
print(f"99.7%: {mu - 3*sigma:.0f} to {mu + 3*sigma:.0f}") # 55 to 145

Z-Score¶

# Standardize a single value
z = (value - mean) / std

# Standardize a full series
z_series = (s - s.mean()) / s.std()

# Look up the probability (for normal data)
from scipy.stats import norm
p_below = norm.cdf(z)   # P(X ≤ value)
p_above = 1 - p_below   # P(X > value)

Comprehensive Data Summary¶

def full_summary(s, name="Series"):
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    print(f"\n=== {name} ===")
    print(f"  Count:   {s.count()}   | Missing: {s.isnull().sum()}")
    print(f"  Mean:    {s.mean():.3f}  | Median: {s.median():.3f}")
    print(f"  Std:     {s.std():.3f}  | IQR:    {q3 - q1:.3f}")
    print(f"  Min:     {s.min():.3f}  | Max:    {s.max():.3f}")
    print(f"  Skew:    {s.skew():.3f}  | Kurt:   {s.kurtosis():.3f}")

full_summary(pd.Series([22, 25, 30, 35, 200]), "Customer Ages")

Choosing Mean vs Median — Decision Rule¶

def recommend_central_tendency(s):
    skew = abs(s.skew())
    if skew > 1:
        return "Use MEDIAN — strongly skewed data"
    elif skew > 0.5:
        return "Use MEDIAN — moderately skewed; also report mean for context"
    else:
        return "Use MEAN — roughly symmetric data"

print(recommend_central_tendency(pd.Series([100, 120, 130, 125, 1000])))
# Output: Use MEDIAN — strongly skewed data

Back: Distributions | Next: Practice Questions