GROUP BY and HAVING¶

Raw rows tell you what happened. Aggregations tell you what it means. GROUP BY is how you turn a table of 10 million order records into a two-line summary that shows which region is growing and which is declining. Every analyst uses it constantly.

Learning Objectives¶

Use aggregate functions (COUNT, SUM, AVG, MIN, MAX) to summarise data
Understand why GROUP BY is required when you mix raw and aggregated columns
Filter before grouping with WHERE and after grouping with HAVING
Group by multiple columns to build multi-dimensional summaries
Recognise the non-aggregated column trap that produces misleading results

The Schema¶

-- orders(order_id, customer_id, product_id, amount, status, created_at)
-- customers(customer_id, name, country, email, created_at)
-- products(product_id, name, category, price)

Aggregate Functions¶

Aggregate functions collapse many rows into one number. They always ignore NULLs — except COUNT(*), which counts rows regardless.

-- Overall order metrics
SELECT
    COUNT(*)           AS total_orders,        -- counts every row
    COUNT(amount)      AS orders_with_amount,  -- skips NULLs
    SUM(amount)        AS total_revenue,
    AVG(amount)        AS average_order_value,
    MIN(amount)        AS smallest_order,
    MAX(amount)        AS largest_order
FROM orders;

Warning

COUNT(*) and COUNT(column) are not the same. COUNT(amount) skips rows where amount is NULL. If 10% of your rows have a NULL amount, your COUNT(amount) will be 10% lower than COUNT(*). Always decide which one you actually want.

GROUP BY — Aggregating by Category¶

Without GROUP BY, aggregate functions collapse the entire table into one row. With GROUP BY, they collapse per group.

-- Revenue and order count by status
SELECT
    status,
    COUNT(*)      AS order_count,
    SUM(amount)   AS total_revenue,
    AVG(amount)   AS avg_order_value
FROM orders
GROUP BY status;

Output:

status      | order_count | total_revenue | avg_order_value
------------|-------------|---------------|----------------
completed   | 4821        | 9,432,000     | 1,956.42
pending     | 1203        | 1,840,000     | 1,529.51
cancelled   | 387         | 643,000       | 1,661.24

Warning

Every column in SELECT that is not inside an aggregate function must appear in GROUP BY. If you select a column without grouping by it, the database either throws an error (PostgreSQL, MySQL with strict mode) or picks an arbitrary value from the group (MySQL without strict mode). The arbitrary-value case is silently wrong — your numbers look plausible but are meaningless.

-- This fails in PostgreSQL (and should fail everywhere)
SELECT customer_id, name, SUM(amount)
FROM orders
GROUP BY customer_id;
-- ERROR: column "name" must appear in GROUP BY or be used in an aggregate

-- Fix: group by all non-aggregated columns
SELECT customer_id, name, SUM(amount)
FROM orders
GROUP BY customer_id, name;

Grouping by Multiple Columns¶

Add columns to GROUP BY to break down your summary further.

-- Revenue by country and status
SELECT
    c.country,
    o.status,
    COUNT(*)    AS order_count,
    SUM(amount) AS total_revenue
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
GROUP BY c.country, o.status
ORDER BY c.country, o.status;

-- Monthly revenue by product category
SELECT
    DATE_TRUNC('month', o.created_at) AS month,
    p.category,
    SUM(o.amount)                      AS monthly_revenue
FROM orders o
JOIN products p ON o.product_id = p.product_id
GROUP BY DATE_TRUNC('month', o.created_at), p.category
ORDER BY month, category;

Tip

When grouping by dates, use date truncation functions (DATE_TRUNC in PostgreSQL, DATE_FORMAT in MySQL, STRFTIME in SQLite) to group by month or year. Grouping by a raw datetime column will create one group per unique timestamp, which is almost never what you want.

HAVING — Filtering After Aggregation¶

WHERE filters rows before aggregation. HAVING filters groups after aggregation. The distinction matters because you cannot use aggregate functions in WHERE.

-- Find customers who have spent more than 5000 in total
-- (Cannot use WHERE here — SUM does not exist yet at the WHERE stage)
SELECT
    customer_id,
    COUNT(*)    AS order_count,
    SUM(amount) AS total_spent
FROM orders
GROUP BY customer_id
HAVING SUM(amount) > 5000
ORDER BY total_spent DESC;

WHERE and HAVING in the Same Query¶

Use WHERE to reduce the rows going into the aggregation, and HAVING to filter the groups after.

-- High-value completed orders: only count completed orders,
-- and only return customers who spent more than 5000 in completed orders
SELECT
    customer_id,
    COUNT(*)    AS completed_orders,
    SUM(amount) AS completed_revenue
FROM orders
WHERE status = 'completed'          -- runs first: eliminates non-completed rows
GROUP BY customer_id
HAVING SUM(amount) > 5000          -- runs after: keeps only high-value customers
ORDER BY completed_revenue DESC;

Info

The full execution order with GROUP BY is: FROM → WHERE → GROUP BY → HAVING → SELECT → ORDER BY → LIMIT. HAVING is the only clause that runs after GROUP BY, which is why it is the only place where aggregate functions can be used as filters.

The WHERE vs HAVING Decision¶

Ask yourself: "Am I filtering individual rows or am I filtering summarised groups?"

What you want to filter	Use
Only look at orders from 2024	WHERE (row-level filter)
Only customers with > 10 orders	HAVING (group-level filter)
Only completed orders	WHERE (row-level filter)
Only categories with avg price > 50	HAVING (group-level filter)

-- Both together: completed orders in 2024, only high-spending customers
SELECT
    c.name,
    c.country,
    COUNT(*)    AS orders_in_2024,
    SUM(o.amount) AS revenue_in_2024
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
WHERE o.status = 'completed'
  AND o.created_at >= '2024-01-01'
  AND o.created_at <  '2025-01-01'
GROUP BY c.customer_id, c.name, c.country
HAVING SUM(o.amount) > 10000
ORDER BY revenue_in_2024 DESC;

Useful Patterns¶

Count Distinct Values¶

-- How many unique customers placed orders each month?
SELECT
    DATE_TRUNC('month', created_at) AS month,
    COUNT(DISTINCT customer_id)     AS unique_customers
FROM orders
GROUP BY DATE_TRUNC('month', created_at)
ORDER BY month;

Percentage of Total¶

-- What share of revenue does each status represent?
SELECT
    status,
    SUM(amount)                               AS status_revenue,
    SUM(amount) * 100.0 / SUM(SUM(amount)) OVER () AS pct_of_total
FROM orders
GROUP BY status;
-- Note: OVER() here is a window function — covered in file 06

A simpler approach without window functions:

WITH totals AS (
    SELECT SUM(amount) AS grand_total FROM orders
)
SELECT
    o.status,
    SUM(o.amount)                             AS status_revenue,
    ROUND(SUM(o.amount) * 100.0 / t.grand_total, 2) AS pct_of_total
FROM orders o
CROSS JOIN totals t
GROUP BY o.status, t.grand_total
ORDER BY status_revenue DESC;

Find Duplicate Records¶

-- Find customer_ids that appear more than once in the customers table
-- (should not happen if customer_id is a proper primary key — but check anyway)
SELECT
    email,
    COUNT(*) AS occurrences
FROM customers
GROUP BY email
HAVING COUNT(*) > 1;

Tip

This duplicate-finding pattern — GROUP BY the column(s) that should be unique, HAVING COUNT(*) > 1 — is one of the most useful SQL patterns in data quality work. Run it before every JOIN to verify key uniqueness.

Common Aggregate Functions Reference¶

Function	What it does	NULL behaviour
`COUNT(*)`	Count of all rows	Includes NULLs
`COUNT(col)`	Count of non-NULL values	Skips NULLs
`COUNT(DISTINCT col)`	Count of unique non-NULL values	Skips NULLs
`SUM(col)`	Total	Skips NULLs
`AVG(col)`	Mean	Skips NULLs (denominator is non-NULL count)
`MIN(col)`	Smallest value	Skips NULLs
`MAX(col)`	Largest value	Skips NULLs

Warning

AVG skips NULLs in both numerator and denominator. AVG(amount) on [100, 200, NULL] returns 150, not 100. If NULL should be treated as 0, use AVG(COALESCE(amount, 0)) instead.

Practice Exercises¶

Warm-up

Count the total number of orders in the table.
Find total revenue and average order value by status.
Find the number of orders per product category (join to products).

Main

Find all countries that have generated more than 50,000 in total revenue (join to customers).
Find customers who placed more than 5 orders. Show their customer_id and order count.
Find the month with the highest total revenue. Show month and revenue.
Find product categories where the average order amount is above the overall average order amount.

Stretch

Find customers who placed at least 3 completed orders in 2024. Use both WHERE and HAVING.
For each country, find the percentage of orders that were cancelled. Show country, cancelled count, total count, and percentage.
A table has a user_id and email column. Write a query to find emails that belong to more than one user_id (potential duplicates). Explain what you would do with the results.

Interview Questions¶

Q: What is the difference between WHERE and HAVING?

WHERE filters individual rows before the database groups them. HAVING filters groups after aggregation. You cannot use aggregate functions like SUM or COUNT in WHERE — those values do not exist yet at that execution stage.

Q: Why can you use a SELECT alias in ORDER BY but not in WHERE or HAVING?

Because of execution order: FROM → WHERE → GROUP BY → HAVING → SELECT → ORDER BY. SELECT runs before ORDER BY, so aliases are defined by the time ORDER BY executes. WHERE and HAVING run before SELECT, so the aliases do not exist yet.

Q: What does COUNT(*) vs COUNT(column) return when there are NULLs?

COUNT(*) counts every row. COUNT(column) skips rows where that column is NULL. On a table where 20% of rows have a NULL amount, COUNT(amount) will be 20% lower than COUNT(*).

Success

Key takeaway: WHERE filters rows (before grouping), HAVING filters groups (after aggregation). Every non-aggregated column in SELECT must appear in GROUP BY. AVG and SUM silently ignore NULLs.

What's Next¶

You've covered GROUP BY with COUNT, SUM, AVG, MIN, and MAX, the WHERE vs HAVING distinction, COUNT() vs COUNT(column) NULL behavior, and duplicate-finding with HAVING COUNT() > 1. Next up: 03-joins — where you'll learn to combine data from multiple tables, understand the four join types, and diagnose the row explosion problem that silently corrupts results.

Optional Deep Dive

Read "SQL for Data Analysis" by Cathy Tanimura (O'Reilly), Chapter 2 — it covers GROUP BY patterns used specifically in data analysis (cohort analysis, time-series aggregation, percentages of totals) rather than generic aggregate functions.

Previous: SELECT and WHERE | Next: JOINs