Mean, Median, and Mode
Salary surveys regularly report an "average" salary that almost nobody earns. This is rarely a coincidence: it follows from choosing, deliberately or carelessly, the wrong summary statistic. Knowing when the mean misleads you is one of the first skills that separates a data scientist from someone who just runs .mean() on everything.
Learning Objectives
- Compute mean, median, and mode in Python using NumPy and pandas
- Explain why the mean is misleading for skewed data
- Choose the correct measure of central tendency for a given dataset
- Understand weighted mean and when to use it
- Recognize how outliers shift the mean while leaving the median stable
What Is Central Tendency?
Central tendency answers the question: "What is the typical value in this dataset?" There are three common answers, and they give different results depending on the shape of your data.
| Measure | Definition | Best when |
|---|---|---|
| Mean | Sum divided by count | Data is symmetric, no extreme outliers |
| Median | Middle value when sorted | Data is skewed or has outliers |
| Mode | Most frequent value | Categorical data, or finding the peak of a distribution |
The right choice is not a matter of preference — it follows from the shape of your data.
Mean — The Arithmetic Average
The mean adds everything up and divides by the number of values.
Formula
mean = (x₁ + x₂ + ... + xₙ) / n
In NumPy: np.mean(arr) | In pandas: series.mean()
import numpy as np
import pandas as pd
salaries = pd.Series([45000, 48000, 52000, 55000, 60000, 58000, 950000])
print("Mean: ", salaries.mean())
print("Median:", salaries.median())
# Output:
# Mean: 181142.857...
# Median: 55000.0
One CEO salary of 950,000 pulls the mean to about 181,000, a number that describes nobody in this dataset. The median of 55,000 is what a typical employee actually earns.
The mean lies on skewed data
Income, house prices, customer spend, website response times — these are all right-skewed. Reporting the mean for these variables without also checking the median is a common mistake. In any EDA, always compute both.
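One way to see the outlier effect directly is to compute both statistics with and without the extreme value. The sketch below reuses the salary figures from this section; the variable names `staff` and `with_ceo` are illustrative, not part of the original example.

```python
import pandas as pd

# Six regular salaries from the example above (illustrative names).
staff = pd.Series([45000, 48000, 52000, 55000, 60000, 58000])

# The same data with the single 950,000 salary appended.
with_ceo = pd.concat([staff, pd.Series([950000])], ignore_index=True)

# How far does each statistic move when the outlier is added?
mean_shift = with_ceo.mean() - staff.mean()
median_shift = with_ceo.median() - staff.median()

print(f"Mean shift:   {mean_shift:,.0f}")
print(f"Median shift: {median_shift:,.0f}")
```

With these numbers the mean moves by roughly 128,000 while the median moves by 1,500, which is the stability the median is valued for.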
Why the Mean Still Matters
The mean is not wrong. It is the right tool for symmetric data, and it is the foundation of many statistical methods (regression, t-tests). It is also the single value that minimizes squared error, which is why a model trained with mean squared error learns to predict an average. Know when to trust it.
# Symmetric data — mean and median are close
exam_scores = pd.Series([68, 72, 74, 75, 76, 77, 79, 81, 83, 85])
print("Mean: ", exam_scores.mean()) # Output: 77.0
print("Median:", exam_scores.median()) # Output: 76.5
# Close to each other — either is fine here
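The link between the mean and squared error can be checked numerically. The following is a minimal sketch, assuming a small made-up sample and a brute-force grid of candidate values: it looks for the constant with the lowest total squared error, and for comparison the constant with the lowest total absolute error.

```python
import numpy as np

data = np.array([1, 2, 3, 4, 10])       # made-up skewed sample: mean 4, median 3
candidates = np.linspace(0, 10, 1001)   # candidate "typical values", step 0.01

# Difference between every candidate and every data point (shape 1001 x 5).
diffs = candidates[:, None] - data[None, :]
squared_loss = (diffs ** 2).sum(axis=1)
absolute_loss = np.abs(diffs).sum(axis=1)

best_squared = candidates[squared_loss.argmin()]
best_absolute = candidates[absolute_loss.argmin()]

print("Lowest squared error at: ", best_squared, "| mean:  ", np.mean(data))
print("Lowest absolute error at:", best_absolute, "| median:", np.median(data))
```

Up to the grid resolution, the squared-error winner should land on the mean and the absolute-error winner on the median. This is one way to read the choice between the two: it is a choice of how heavily large deviations are penalized.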
Median — The Middle Value
Sort your data. The median is the value that sits at the exact center. Half the values are below it, half above.
# Odd number of values — middle element
odd_series = pd.Series([10, 20, 30, 40, 50])
print(odd_series.median()) # Output: 30.0
# Even number of values — average of two middle elements
even_series = pd.Series([10, 20, 30, 40])
print(even_series.median()) # Output: 25.0
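To make the sort-then-pick-the-middle rule concrete, here is a small hand-written version. The function name `median_by_sorting` is hypothetical and exists only for this sketch; in practice `series.median()` is the call to use.

```python
import pandas as pd

def median_by_sorting(values):
    """Return the median of a non-empty sequence of numbers."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: the single middle element.
        return float(ordered[mid])
    # Even count: average of the two middle elements.
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_result = median_by_sorting([50, 10, 40, 20, 30])
even_result = median_by_sorting([40, 10, 30, 20])

print(odd_result, pd.Series([50, 10, 40, 20, 30]).median())
print(even_result, pd.Series([40, 10, 30, 20]).median())
```

The input order does not matter, because the function sorts a copy before picking the middle.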
The classic skew test
Compute mean - median. If the result is positive and large relative to the spread of the data, the data is probably right-skewed. Large and negative suggests left skew. Near zero suggests rough symmetry. This quick comparison is a good first hint about which summary statistic to report.
housing_prices = pd.Series([250000, 275000, 290000, 310000, 320000, 340000, 2500000])
gap = housing_prices.mean() - housing_prices.median()
print(f"Mean: {housing_prices.mean():,.0f}") # Output: 612,143
print(f"Median: {housing_prices.median():,.0f}") # Output: 310,000
print(f"Gap: {gap:,.0f}") # Output: 302,143 (heavily right-skewed)
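A raw gap is in the units of the data, so a gap of 300,000 means something different for house prices than for exam scores. One possible way to compare across datasets, shown here as a sketch rather than a standard recipe, is to divide the gap by the standard deviation. The helper name `relative_gap` is an assumption made for this example.

```python
import pandas as pd

def relative_gap(series: pd.Series) -> float:
    """(mean - median) / standard deviation. Hypothetical helper for this sketch."""
    return (series.mean() - series.median()) / series.std()

housing_prices = pd.Series([250000, 275000, 290000, 310000, 320000, 340000, 2500000])
exam_scores = pd.Series([68, 72, 74, 75, 76, 77, 79, 81, 83, 85])

print(f"Housing: {relative_gap(housing_prices):.2f}")
print(f"Exams:   {relative_gap(exam_scores):.2f}")
```

The ratio always falls between -1 and 1, so values well away from zero (as for the housing data) point to skew, while values near zero (as for the exam scores) point to rough symmetry. pandas also provides `Series.skew()` for a moment-based measure.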
Mode — The Most Common Value
The mode is the value that appears most often. It is the only one of the three measures that works for unordered (nominal) categorical data such as city names.
customer_cities = pd.Series(["Mumbai", "Delhi", "Mumbai", "Bangalore", "Mumbai", "Delhi"])
print(customer_cities.mode())
# Output:
# 0 Mumbai
# dtype: object
For discrete numeric data, the mode finds the peak of the distribution, which is useful when you want to know the most popular price point or the most common transaction size.
purchase_amounts = pd.Series([99, 149, 99, 199, 99, 249, 149, 99])
print(purchase_amounts.mode())
# Output:
# 0 99
# dtype: int64
Multiple modes
series.mode() returns all modes if there are ties. Always check len(series.mode()) before assuming a single peak. A dataset with two modes (bimodal) often signals two distinct subgroups — worth investigating.
bimodal = pd.Series([1, 1, 2, 3, 4, 5, 5])
print(bimodal.mode())
# Output:
# 0 1
# 1 5
# dtype: int64
# Two modes — suggests two clusters in the data
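To see where the modes come from, it can help to look at the full frequency table and then branch on the number of modes. The sketch below uses `value_counts()` alongside `mode()`; the branching logic is one possible pattern, not the only one.

```python
import pandas as pd

bimodal = pd.Series([1, 1, 2, 3, 4, 5, 5])

counts = bimodal.value_counts()   # frequency of each distinct value
modes = bimodal.mode()            # every value tied for the highest frequency

print(counts)

if len(modes) > 1:
    print(f"{len(modes)} modes found: {modes.tolist()}")
else:
    print(f"Single mode: {modes.iloc[0]}")
```

Checking `len(modes)` before indexing with `[0]` avoids silently reporting only the first of several tied values.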
Weighted Mean
The standard mean treats every value equally. The weighted mean lets you give some values more importance — used in GPA calculations, index construction, and recommendation systems.
Weighted Mean Formula
weighted_mean = sum(value × weight) / sum(weights)
# A student's grades across courses with different credit hours
grades = np.array([85, 90, 78, 92])
credits = np.array([4, 3, 4, 2])
weighted_mean = np.average(grades, weights=credits)
simple_mean = np.mean(grades)
print(f"Simple mean: {simple_mean:.2f}") # Output: 86.25
print(f"Weighted mean: {weighted_mean:.2f}") # Output: 85.08
# The 4-credit course with 78 drags the weighted mean below the simple mean
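The formula can also be written out by hand, which makes it easy to verify what `np.average` is doing. This sketch recomputes the weighted mean for the same grades and credits and compares the two results.

```python
import numpy as np

grades = np.array([85, 90, 78, 92])
credits = np.array([4, 3, 4, 2])

# sum(value * weight) / sum(weights), written out directly:
# (85*4 + 90*3 + 78*4 + 92*2) / (4 + 3 + 4 + 2) = 1106 / 13
manual = (grades * credits).sum() / credits.sum()
library = np.average(grades, weights=credits)

print(f"Manual:  {manual:.2f}")
print(f"Library: {library:.2f}")
```

Both lines should print the same value, about 85.08. If every weight is equal, the weighted mean reduces to the simple mean.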
Real-World Example: Salary Data
This is the canonical example. Run this and internalize the numbers.
import pandas as pd
import numpy as np
np.random.seed(42)
# Simulate a small company: 20 employees + 1 founder
employee_salaries = np.random.normal(loc=60000, scale=8000, size=20)
founder_salary = np.array([2_500_000])
all_salaries = pd.Series(np.concatenate([employee_salaries, founder_salary]))
print(f"Mean salary: ${all_salaries.mean():>12,.0f}")
print(f"Median salary: ${all_salaries.median():>12,.0f}")
print(f"Mode salary: ${all_salaries.mode()[0]:>12,.0f}") # Misleading: every value is unique, so mode() returns all 21 values and [0] is just the smallest
print(f"Min: ${all_salaries.min():>12,.0f}")
print(f"Max: ${all_salaries.max():>12,.0f}")
# Output (with seed 42):
# Mean salary: $     174,885
# Median salary: $      58,127
# Mode salary: $      44,694 (same as Min, so not a meaningful mode)
# Min: $      44,694
# Max: $   2,500,000
The mean of about $175K describes none of the 20 employees and misrepresents the company's compensation. The median of about $58K is close to what most people actually earn. This gap is why public salary discussions usually rely on the median, and why some companies prefer to report the mean.
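Continuous values such as salaries rarely repeat exactly, so `mode()` on the raw numbers tends to return every value. A common workaround, sketched here with made-up salaries and an assumed bin width of 10,000, is to group values into bands first and take the mode of the bands.

```python
import pandas as pd

# Hypothetical salaries for illustration; no value repeats exactly.
salaries = pd.Series([44700, 46200, 48700, 51900, 52700, 55500, 56200,
                      56300, 58100, 58900, 61900, 62500, 64000, 72200, 2500000])

# Every value is unique, so mode() returns all of them.
n_raw_modes = len(salaries.mode())
print(f"Raw modes: {n_raw_modes} of {len(salaries)} values")

# Floor each salary to the start of its 10,000-wide band, then take the mode.
band_start = (salaries // 10_000) * 10_000
modal_band = int(band_start.mode()[0])
print(f"Most common band: {modal_band:,} to {modal_band + 9_999:,}")
```

The bin width is a judgment call: bands that are too narrow bring back the all-unique problem, and bands that are too wide hide the shape of the distribution.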
Computing All Three in a DataFrame
df = pd.DataFrame({
"employee_id": range(1, 8),
"department": ["Engineering", "Engineering", "Sales", "Sales", "HR", "Engineering", "Sales"],
"annual_salary": [70000, 85000, 55000, 60000, 48000, 95000, 52000]
})
# Per-column summaries
print(df["annual_salary"].mean()) # Output: 66428.57
print(df["annual_salary"].median()) # Output: 60000.0
print(df["department"].mode()[0]) # Output: Engineering (tied with Sales at 3 each; [0] is the alphabetically first mode)
# Group-level summaries — more useful in practice
print(df.groupby("department")["annual_salary"].agg(["mean", "median"]))
# Output:
#                      mean   median
# department
# Engineering  83333.333333  85000.0
# HR           48000.000000  48000.0
# Sales        55666.666667  55000.0
Always group before summarizing
A company-wide average salary hides the fact that Engineering earns roughly 74% more than HR on average (about 83,000 versus 48,000). Always break your summary statistics down by relevant groups, because that is usually where the real story is.
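Group summaries are easier to trust when the group size is reported next to them. The sketch below rebuilds the same department data and uses pandas named aggregation to add a count column; the output column names (`mean`, `median`, `n`) and the variable `eng_vs_hr` are choices made for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "department": ["Engineering", "Engineering", "Sales", "Sales", "HR", "Engineering", "Sales"],
    "annual_salary": [70000, 85000, 55000, 60000, 48000, 95000, 52000],
})

# One row per department: mean, median, and how many employees back each number.
summary = df.groupby("department")["annual_salary"].agg(
    mean="mean", median="median", n="count"
)
print(summary.round(0))

# Relative difference between two group means.
eng_vs_hr = summary.loc["Engineering", "mean"] / summary.loc["HR", "mean"] - 1
print(f"Engineering mean is {eng_vs_hr:.0%} above HR")
```

Here HR has a single employee, so its mean and median are just that one salary. A comparison against such a small group deserves extra caution.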
Missing Values
Pandas skips NaN by default when computing these statistics. This is usually what you want, but it means the statistic is computed over fewer values than the Series contains.
data_with_gaps = pd.Series([100, 200, np.nan, 400, 500])
print(data_with_gaps.mean()) # Output: 300.0 (ignores NaN, computes over 4 values)
print(data_with_gaps.count()) # Output: 4 (not 5)
NaN changes your denominator silently
If you have 30% missing values and compute the mean, you are computing the mean of the 70% that remain. Whether that is a valid estimate depends on WHY the values are missing. Never report a mean without checking .isnull().sum() first.
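A short check before summarizing might look like the sketch below. It counts the missing values, then contrasts the pandas default with `skipna=False` and with plain NumPy, which does not skip NaN unless you use the `nan`-prefixed functions.

```python
import numpy as np
import pandas as pd

data_with_gaps = pd.Series([100, 200, np.nan, 400, 500])

n_missing = data_with_gaps.isnull().sum()
share_missing = data_with_gaps.isnull().mean()      # fraction of entries that are NaN

default_mean = data_with_gaps.mean()                # pandas default: NaN is skipped
strict_mean = data_with_gaps.mean(skipna=False)     # NaN as soon as anything is missing
numpy_mean = np.mean(data_with_gaps.to_numpy())     # plain NumPy propagates NaN
nan_aware = np.nanmean(data_with_gaps.to_numpy())   # NumPy's NaN-skipping variant

print(f"Missing: {n_missing} ({share_missing:.0%})")
print(default_mean, strict_mean, numpy_mean, nan_aware)
```

Using `skipna=False` is one way to make missing data fail loudly instead of silently shrinking the denominator.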
Key Takeaways
- Mean is sensitive to outliers. Use it on symmetric data.
- Median is robust. Use it when data is skewed or has outliers.
- Mode works on categories and tells you the most popular value.
- The gap between mean and median is your first diagnostic for skew.
- Weighted mean matters when values have different levels of importance.
- Always check for missing values before computing any summary statistic.