Data Science & Analytics — Beginner
Ask clear questions, write simple SQL, and get real answers from data.
SQL is a simple language for asking questions of data stored in databases. If you have ever searched, filtered, sorted, or summarized a spreadsheet, you already understand the goal. SQL helps you do those same tasks reliably—even when the data is large, split across multiple tables, or updated every day.
This beginner course is written like a short, practical book. You will start from first principles (what a table is and why it exists), then build step by step toward writing useful queries that answer real questions. You do not need any coding background. You will learn by practicing small queries, reading the results, and using basic checks so you can trust your answers.
This course is for absolute beginners who want to understand data and get answers quickly. It is designed for learners in any industry—students, office workers, analysts in training, and anyone who works with reports.
By the end, you will be able to explore a dataset, filter it to what matters, summarize it into clear numbers, and combine tables using joins. Most importantly, you will learn how to think: how to translate a business question into a query, and how to sanity-check the output so you avoid common beginner mistakes.
Chapter 1 builds your foundation: what databases and tables are, and how a query returns a result set. Chapter 2 focuses on finding what you need with filtering and sorting—skills you will use in every later chapter. Chapter 3 teaches you to make results more usable by creating calculated fields and simple categories. Chapter 4 turns detailed rows into summary answers using grouping and aggregation. Chapter 5 connects multiple tables using joins, which is where many real-world questions live. Chapter 6 gives you repeatable patterns for writing, organizing, and checking queries so your work is trustworthy and easy to reuse.
If you are ready to begin, you can Register free and start learning right away. Prefer to explore first? You can also browse all courses on Edu AI.
After completing the course, you will have a practical skill you can use immediately: you will be able to ask clear questions, write straightforward SQL, and produce results you can explain. This is the starting point for reporting, analytics, dashboards, and more advanced data work—built on a strong, beginner-friendly foundation.
Data Analytics Instructor (SQL and Reporting)
Sofia Chen teaches SQL and practical analytics for people starting from zero. She has helped teams and learners turn messy spreadsheets into clear answers using simple, repeatable query habits. Her focus is confidence: understanding what you’re asking, writing it in SQL, and checking results for accuracy.
SQL is a tool for turning real-world questions into precise data questions, then getting reliable answers quickly. In this chapter you’ll build the mental model that makes SQL feel predictable: you’ll learn how data is stored in tables, how to read rows and columns, and how a query produces a “result set” you can inspect without changing the underlying data.
We’ll keep the workflow practical. You’ll start with a real-world question (Milestone 1), map it to the tables and fields that could answer it (Milestone 2), run your first SELECT query (Milestone 3), and practice exploring safely—reading only, not modifying (Milestone 4). Along the way you’ll build a small glossary of SQL words you’ll see constantly (Milestone 5). By the end, you should be able to look at a dataset, pull out a slice of it, and notice early warning signs like duplicates and missing values—before you trust any summary numbers.
To make examples concrete, imagine an online store database with tables like customers, orders, and order_items. You don’t need to memorize schema details; the important skill is how to reason from a question to the data you need, then ask for it using SQL.
While you learn, stick to read-only queries (SELECT) and small samples (LIMIT). The rest of the chapter is organized into six sections that build this foundation step by step.
Practice note for Milestone 1: Turn a real-world question into a data question: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone 2: Understand tables, rows, columns, and IDs: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone 3: Run your first SELECT and read the results: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone 4: Explore a dataset safely without changing anything: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone 5: Build a simple glossary of common SQL words: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
SQL (Structured Query Language) is the standard language for working with data stored in relational databases. Most beginners learn it as “the language to get data out,” and that’s a good starting point. A SQL query describes the result you want—rows and columns—rather than the exact step-by-step algorithm to compute it. The database engine figures out how to execute your request efficiently.
SQL is not a spreadsheet replacement, even though the output often looks like a grid. In a spreadsheet, you typically open a file and manipulate cells directly; in a database, the data lives in managed tables, often shared by many systems at once. SQL is also not a data visualization tool and not a programming language like Python—though many analytics workflows combine SQL for extraction and Python for modeling.
Engineering judgment matters early: use SQL for what it’s best at—filtering, joining, aggregating, and summarizing large datasets close to where they are stored. Avoid copying millions of rows into a spreadsheet “just to look” if the database can answer your question with a 10-line query. Also, treat SQL results as evidence: if you don’t understand how the rows were selected, you don’t yet understand the answer.
Two cautions before you start: without ORDER BY, row order is not guaranteed, and never trust an average (AVG) without checking for missing or outlier values. This course starts with read-only querying: SELECT, WHERE, ORDER BY, LIMIT, and core summary functions (COUNT, SUM, AVG, MIN, MAX). Those are the building blocks you’ll use daily.
Milestone 1 is learning to translate a real-world question into a data question. Consider: “Which products are driving the most revenue this month?” In a spreadsheet mindset, you might hunt for a tab that “looks right.” In a database mindset, you ask: Which table records product sales? Which table stores dates? Which field represents revenue (unit price, quantity, discounts)? The database forces you to be explicit—and that’s a strength.
A database is an organized collection of tables. Each table is designed to store one type of thing consistently: customers, orders, products, support tickets. This structure scales: you can store millions of orders and still query them quickly, with many people and applications reading and writing at the same time.
Tables matter because they separate concerns and reduce duplication. Instead of repeating a customer’s address on every order row, an orders table can store a customer_id that links to a single row in customers. That design improves accuracy (one address to update) and enables richer questions (“revenue by customer region”) without messy copy/paste work.
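To make the link concrete, here is a minimal sketch using Python’s built-in sqlite3 module. The table layouts and sample rows are invented for illustration, but they mirror the chapter’s customers/orders design: the city lives once in customers, and orders only carry a customer_id.

```python
import sqlite3

# In-memory SQLite database sketching the chapter's customers/orders design.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
    INSERT INTO customers VALUES (1, 'Alex Kim', 'Toronto'), (2, 'Sam Lee', 'Vancouver');
    INSERT INTO orders VALUES (101, 1, 50.0), (102, 1, 75.0), (103, 2, 20.0);
""")

# "Revenue by city" works because each order links to exactly one customer row.
rows = conn.execute("""
    SELECT c.city, SUM(o.total) AS revenue
    FROM orders o
    JOIN customers c ON o.customer_id = c.id
    GROUP BY c.city
    ORDER BY revenue DESC
""").fetchall()
print(rows)  # [('Toronto', 125.0), ('Vancouver', 20.0)]
```

Because the city is stored once per customer, correcting it in customers automatically fixes every order-level report that joins through customer_id.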
When you explore data, adopt a “small, safe, repeatable” workflow (Milestone 4). Start by selecting a tiny sample, confirm the columns mean what you think they mean, and only then widen your query. SQL encourages this discipline because it is easy to rerun queries, adjust filters, and keep a clear record of what you asked.
A common beginner mistake is summing totals from orders while also joining order_items and accidentally duplicating totals. As your questions get more specific, the table structure becomes your map. The next step is understanding what rows and columns represent—and why IDs exist.
Milestone 2 is building a clear mental model of tables, rows, columns, and IDs. A table is like a carefully defined grid. A row is one record—one customer, one order, one product. A column is one attribute—like email, order_date, or price. Good tables keep each column consistent: one data type, one meaning.
Keys are the glue that connects tables. A primary key uniquely identifies a row in its own table, often named id (for example, customers.id). A foreign key is a column that points to a row in another table (for example, orders.customer_id refers to customers.id). You don’t need to memorize terminology to start; you just need to recognize that these ID columns create reliable links.
Why IDs instead of names? Because names change and are not guaranteed unique. Two customers might share “Alex Kim,” but they should not share the same customer_id. IDs support stable relationships and prevent ambiguity when you join tables later.
For example, order_id is unique in orders but repeats in order_items because multiple items belong to one order. This is also why comparing COUNT(*) vs COUNT(DISTINCT id) is a check you’ll use often. Once you can point to the right table and the right columns, you’re ready for the core idea of SQL: a query that returns a result set.
A query is a request for data that returns a result set: a temporary table produced by the database engine. This is Milestone 3 in spirit—your first real interaction with a database. The important mindset shift is that your query does not “open the table”; it asks for a view of it. If you only use SELECT, you are reading data, not changing it.
Most beginner queries follow a predictable pattern:
SELECT: which columns do you want to see?
FROM: which table are they in?
WHERE: which rows qualify? (numbers, text, dates)
ORDER BY: in what order should results be returned?
LIMIT: how many rows do you want to inspect?
As your questions evolve, you’ll summarize data using aggregates: COUNT (how many rows), SUM (total), AVG (average), MIN/MAX (range). When you add GROUP BY, the database produces one row per group (such as one row per product or per day). And when you need to filter groups after aggregation, you use HAVING (for example, “only products with at least 50 orders”).
Engineering judgment shows up in choosing the right level of detail. If your question is “How many orders were placed yesterday?” you want an aggregate result (a single number). If your question is “Which orders were placed yesterday?” you want detailed rows. Beginners often fetch detailed rows and then manually count—SQL can usually answer the question more directly and more reliably.
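The detail-versus-aggregate distinction can be sketched with a tiny invented orders table, run through SQLite via Python’s sqlite3 module (all names and values here are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, order_date TEXT, total REAL);
    INSERT INTO orders VALUES
        (1, '2026-01-14', 40.0),
        (2, '2026-01-15', 90.0),
        (3, '2026-01-15', 15.0);
""")

# Detail question: WHICH orders were placed on 2026-01-15? (rows)
detail = conn.execute(
    "SELECT order_id FROM orders WHERE order_date = '2026-01-15' ORDER BY order_id"
).fetchall()

# Aggregate question: HOW MANY orders were placed on 2026-01-15? (one number)
count = conn.execute(
    "SELECT COUNT(*) FROM orders WHERE order_date = '2026-01-15'"
).fetchone()[0]

print(detail, count)  # [(2,), (3,)] 2
```

Letting COUNT answer the “how many” question directly avoids the error-prone habit of fetching rows and counting them by eye.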
In the next section you’ll run a first SELECT and learn when SELECT * helps—and when it gets in your way.
This milestone is about momentum: write a query, run it, and learn to read the output. A classic first step is selecting all columns from a table:
SELECT *
FROM customers
LIMIT 10;
SELECT * means “give me every column.” With LIMIT, it’s a safe way to preview what’s inside a table: column names, example values, and the overall shape of the data. This is part of Milestone 4—exploring without changing anything. Add ORDER BY to make your sampling intentional, such as viewing the most recent records:
SELECT *
FROM orders
ORDER BY order_date DESC
LIMIT 20;
However, SELECT * is rarely a good default for analysis. It can return more data than you need, slow down queries on wide tables, and make it harder to spot what matters. Prefer selecting specific columns once you know them:
SELECT id, email, created_at
FROM customers
ORDER BY created_at DESC
LIMIT 20;
Filtering with WHERE is where SQL starts answering real questions. You can filter numbers (total > 100), text (status = 'shipped'), and dates (order_date >= '2026-01-01'). Be careful with text: SQL usually requires single quotes for string literals, and spelling/case must match the stored values (depending on the database).
Be careful using = with a column that has time components (timestamps). Filtering “yesterday” often needs a date range, not equality. Finally, remember the difference between WHERE and HAVING: WHERE filters individual rows before grouping; HAVING filters grouped results after aggregation. You’ll practice this heavily later, but keeping the distinction in mind prevents a lot of confusion.
Running a query is only half the skill. The other half is reading the result set critically. Beginners often assume the database output is “clean” because it came from a system. In reality, most datasets contain duplicates, missing values, and edge cases. Catching these early is what makes your later summaries trustworthy.
Start with duplicates. If you expect one row per customer, scan for repeated emails or names in your sample. More systematically, compare COUNT(*) to COUNT(DISTINCT ...) when you begin aggregating. If COUNT(DISTINCT customer_id) is much smaller than COUNT(*) in a table you thought was one-row-per-customer, your assumption about granularity is wrong—or the data needs cleaning.
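The granularity check can be run in one query. This sketch uses an invented signups table in SQLite (through Python’s sqlite3 module) where customer 1 was accidentally imported twice:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# A table you might assume is one-row-per-customer, but customer 1 appears twice.
conn.executescript("""
    CREATE TABLE signups (customer_id INTEGER, email TEXT);
    INSERT INTO signups VALUES
        (1, 'a@example.com'),
        (1, 'a@example.com'),
        (2, 'b@example.com');
""")

total_rows, distinct_customers = conn.execute(
    "SELECT COUNT(*), COUNT(DISTINCT customer_id) FROM signups"
).fetchone()

# A mismatch is the early warning: this table is NOT one-row-per-customer.
print(total_rows, distinct_customers)  # 3 2
```
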
Next, look for missing values. In SQL, missing is usually represented as NULL, which behaves differently than empty strings or zero. For example, NULL is not equal to anything (even another NULL), and many aggregates ignore NULL values. That can be helpful or misleading depending on your question. If you compute AVG(price) and many prices are NULL, the average is over fewer rows than you might assume.
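You can see the NULL-skipping behavior of aggregates directly. In this sketch (SQLite via Python’s sqlite3 module, with an invented products table), half the prices are missing:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (id INTEGER PRIMARY KEY, price REAL);
    INSERT INTO products VALUES (1, 10.0), (2, 20.0), (3, NULL), (4, NULL);
""")

avg_price, priced_rows, total_rows = conn.execute(
    "SELECT AVG(price), COUNT(price), COUNT(*) FROM products"
).fetchone()

# AVG ignores NULLs: (10 + 20) / 2 = 15.0, averaged over only 2 of 4 rows.
print(avg_price, priced_rows, total_rows)  # 15.0 2 4
```

Comparing COUNT(price) to COUNT(*) tells you how many rows actually contributed to the average, which is the context you need before reporting it.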
Use sorting to find suspicious values: ORDER BY total DESC to spot extreme totals, and ORDER BY created_at ASC to find unexpectedly old records. Combine this with LIMIT to inspect the “top” and “bottom” of key columns. This simple habit prevents common analytics failures like reporting a “record revenue day” caused by a duplicated order import.
Milestone 5 is building your glossary as you go. At minimum, you should be comfortable with these words in context: table, row, column, primary key, foreign key, query, result set, filter, aggregate, group. Keep a running note with definitions in your own words; that personal glossary becomes a quick reference when you hit more complex queries.
Practical outcome: by the end of this chapter, you can safely preview tables, select relevant columns, filter by numbers/text/dates, sort and limit to find recent and extreme records, and approach outputs with healthy skepticism. That combination—asking good questions and verifying results—is what makes SQL valuable in real analytics work.
1. Which workflow best matches the chapter’s approach to answering a real-world question with SQL?
2. In the chapter’s mental model, what does a SELECT query produce?
3. What is the main reason the chapter recommends exploring with small samples (e.g., using LIMIT) while learning?
4. Before trusting any summary numbers, what early warning signs does the chapter suggest checking for?
5. In an online store database (customers, orders, order_items), what is the key skill the chapter says matters more than memorizing schema details?
In Chapter 1 you learned how to SELECT columns and get a feel for what’s inside a table. That’s useful, but real work starts when you can quickly narrow a big table down to the rows that matter. This chapter is about building that “search muscle” in SQL: filtering with WHERE, combining conditions, using pattern and list matching for fast lookups, and then sorting and limiting results to produce short, actionable lists.
Think like an analyst or engineer: you rarely want “all rows.” You want “orders from last week,” “customers in a certain region,” “products with low inventory,” or “tickets still open.” Filtering is also a safety habit—reducing result sets prevents you from scanning thousands of irrelevant rows and helps you validate your understanding of the data step by step.
We’ll work in a workflow that scales: start with a broad query to verify columns and units, add one filter at a time, and only then add sorting and limiting. Along the way you’ll learn why some filters behave unexpectedly (especially around NULL), and how to avoid the most common beginner mistakes that lead to empty results or “why is this row missing?” surprises.
Practice note for Milestone 1: Filter rows with WHERE confidently: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone 2: Use multiple conditions with AND/OR: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone 3: Match patterns and lists for quick searching: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone 4: Sort and limit to get actionable lists: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone 5: Avoid common beginner filter mistakes: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
The WHERE clause is your row filter. It answers: “Which rows should be included in the result?” A good mental model is a checklist applied to every row. If the condition evaluates to true, the row is kept; otherwise it’s dropped. The basic shape looks like this:
SELECT column1, column2
FROM table_name
WHERE condition;
For example, if you have an orders table, you might start with:
SELECT order_id, customer_id, status, order_date
FROM orders;
Then filter to shipped orders:
SELECT order_id, customer_id, status, order_date
FROM orders
WHERE status = 'shipped';
This “add one filter, rerun, confirm” loop is a professional habit. It helps you catch issues early (wrong column, unexpected status values, different spelling, trailing spaces) before your query gets complex. It also helps you build confidence that the filter does what you think it does.
Use WHERE early to reduce noise and validate assumptions quickly, and watch for rows that silently drop out because of NULL values. Milestone 1 is simply this: become comfortable writing and reading WHERE conditions until filtering feels like “searching the database,” not “writing code.”
Most filters use comparison operators. For numbers, this is straightforward: =, >, <, >=, <=, and <> (not equal). Example: find products with low inventory:
SELECT product_id, name, quantity_on_hand
FROM products
WHERE quantity_on_hand <= 10;
Text comparisons also use these operators, but you must use quotes around string literals. Example:
SELECT customer_id, email
FROM customers
WHERE country = 'Canada';
Two practical cautions matter here. First, string comparisons can be case-sensitive depending on the database and collation settings. If you see missed matches, check whether the data uses 'canada' vs 'Canada', or consider normalizing case (for example, comparing LOWER(country) to 'canada') if your SQL dialect supports it.
Second, date filtering requires that the column truly be a date/time type (not a text column that “looks like” a date). If order_date is a date, you can do:
SELECT order_id, order_date
FROM orders
WHERE order_date >= '2026-01-01';
Engineering judgment: prefer inclusive/exclusive boundaries that avoid ambiguity. For timestamps, a common pattern is “greater than or equal to the start, and less than the next day” to avoid time-of-day surprises.
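Here is a small illustration of why equality fails on timestamps, using an invented orders table in SQLite through Python’s sqlite3 module (timestamps stored as text for simplicity, which still compares correctly in ISO format):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, order_date TEXT);
    INSERT INTO orders VALUES
        (1, '2026-01-14 23:59:00'),
        (2, '2026-01-15 00:00:00'),
        (3, '2026-01-15 18:30:00'),
        (4, '2026-01-16 00:00:00');
""")

# Equality misses everything: no timestamp equals the bare date string.
eq_rows = conn.execute(
    "SELECT order_id FROM orders WHERE order_date = '2026-01-15'"
).fetchall()

# Start-inclusive, end-exclusive range captures the whole day exactly once.
range_rows = conn.execute("""
    SELECT order_id FROM orders
    WHERE order_date >= '2026-01-15' AND order_date < '2026-01-16'
    ORDER BY order_id
""").fetchall()

print(eq_rows, range_rows)  # [] [(2,), (3,)]
```
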
Real questions usually require multiple conditions. AND means “must satisfy both,” and OR means “either is acceptable.” This is Milestone 2: combining conditions without getting tricked by precedence.
Example: find high-value orders that are still open:
SELECT order_id, total_amount, status
FROM orders
WHERE total_amount >= 500
AND status = 'open';
Now suppose you want orders that are open or pending, but only for a specific country. Without parentheses, you can accidentally include too much. Compare:
-- Risky if you forget precedence
WHERE status = 'open' OR status = 'pending' AND country = 'US'
Most SQL engines evaluate AND before OR, so this becomes: status = 'open' OR (status = 'pending' AND country = 'US'). If you intended “(open or pending) and US,” you must write it explicitly:
WHERE (status = 'open' OR status = 'pending')
AND country = 'US';
A practical workflow is to translate your filter into plain language, then add parentheses that match the language. If you can read it aloud and it still matches your intent, you’re much less likely to ship a wrong report.
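The precedence trap is easy to demonstrate. This sketch runs both versions of the filter against a small invented orders table (SQLite via Python’s sqlite3 module):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, status TEXT, country TEXT);
    INSERT INTO orders VALUES
        (1, 'open',    'CA'),
        (2, 'pending', 'US'),
        (3, 'open',    'US'),
        (4, 'shipped', 'US');
""")

# AND binds tighter than OR: this keeps EVERY open order, even the Canadian one.
risky = conn.execute("""
    SELECT order_id FROM orders
    WHERE status = 'open' OR status = 'pending' AND country = 'US'
    ORDER BY order_id
""").fetchall()

# Parentheses state the real intent: (open or pending) AND US.
intended = conn.execute("""
    SELECT order_id FROM orders
    WHERE (status = 'open' OR status = 'pending') AND country = 'US'
    ORDER BY order_id
""").fetchall()

print(risky, intended)  # [(1,), (2,), (3,)] [(2,), (3,)]
```

The risky version quietly includes order 1 from Canada, which is exactly the kind of wrong-but-plausible report that parentheses prevent.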
Remember: AND narrows results; OR expands results. Use parentheses whenever you mix AND and OR, and if a result looks wrong, remove one condition at a time from the WHERE clause to see which condition removes rows. Milestone 3 is learning a few high-leverage operators that cover many everyday “search” tasks. IN is a clean way to match a list of values without writing repetitive OR conditions:
SELECT order_id, status
FROM orders
WHERE status IN ('open', 'pending', 'backorder');
BETWEEN is convenient for ranges and is inclusive on both ends. It’s often used for numbers and dates:
SELECT order_id, total_amount
FROM orders
WHERE total_amount BETWEEN 100 AND 200;
For dates, remember “inclusive” can matter. If you store timestamps, BETWEEN with a date literal may exclude late-day times depending on how your database casts values. Many teams prefer explicit boundaries (start inclusive, end exclusive) for time ranges.
LIKE handles simple pattern matching, which is great for “starts with,” “contains,” or “ends with” searches on text. Use % as a wildcard for any sequence of characters, and _ for a single character:
SELECT customer_id, email
FROM customers
WHERE email LIKE '%@edu.ai';
Practical caution: leading wildcards (like '%text') can be slow on large tables because they often prevent efficient index use. Use them when you need them, but prefer prefix searches (like 'text%') when possible, and consider whether a dedicated search field or full-text search is appropriate in production systems.
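Both pattern shapes can be tried side by side. This sketch uses an invented customers table in SQLite (via Python’s sqlite3 module) with emails matching the chapter’s '%@edu.ai' example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, email TEXT);
    INSERT INTO customers VALUES
        (1, 'ana@edu.ai'),
        (2, 'ben@example.com'),
        (3, 'ana.b@edu.ai');
""")

# '%' matches any run of characters, so this means "ends with @edu.ai".
ends_with = conn.execute(
    "SELECT customer_id FROM customers WHERE email LIKE '%@edu.ai' ORDER BY customer_id"
).fetchall()

# Prefix searches ("starts with ana") are usually friendlier to indexes.
starts_with = conn.execute(
    "SELECT customer_id FROM customers WHERE email LIKE 'ana%' ORDER BY customer_id"
).fetchall()

print(ends_with, starts_with)  # [(1,), (3,)] [(1,), (3,)]
```

One portability note: SQLite’s LIKE is case-insensitive for ASCII letters by default, while other databases differ, so always test patterns against real values.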
Filtering finds the right rows; sorting and limiting turns them into decisions. This is Milestone 4: produce short lists like “top 10 customers,” “most recent tickets,” or “lowest inventory items.” Use ORDER BY to sort results, and LIMIT (or your database’s equivalent) to cap how many rows you return.
Example: most recent orders first:
SELECT order_id, order_date, total_amount
FROM orders
WHERE status = 'shipped'
ORDER BY order_date DESC
LIMIT 20;
DESC means descending (largest/newest first); ASC means ascending (smallest/oldest first). You can sort by multiple columns to make ordering stable and readable:
SELECT customer_id, last_name, first_name
FROM customers
ORDER BY last_name ASC, first_name ASC;
Engineering judgment: always ask “what does ‘top’ mean?” Top revenue is not the same as most orders, and “most recent” should be anchored on the right timestamp column (created vs updated vs closed). Also, if you use LIMIT without a deterministic ORDER BY, the database may return any arbitrary subset—fine for quick exploration, risky for reporting.
Pair ORDER BY + LIMIT to produce actionable shortlists. Milestone 5 is avoiding the filter mistakes that come from misunderstanding NULL. In SQL, NULL does not mean zero, empty string, or false. It means “unknown” or “missing.” That single idea changes how comparisons work, because SQL uses three-valued logic: true, false, and unknown.
If a column value is NULL, comparisons like = and <> do not evaluate to true; they evaluate to unknown. For example, this will not return rows where shipped_date is missing:
SELECT order_id, shipped_date
FROM orders
WHERE shipped_date = NULL; -- wrong
The correct way is IS NULL (and IS NOT NULL):
SELECT order_id, shipped_date
FROM orders
WHERE shipped_date IS NULL;
This matters in combined conditions too. A filter like WHERE shipped_date >= '2026-01-01' automatically excludes NULL shipped dates (because unknown isn’t true). That might be correct—or it might silently drop “not shipped yet” orders you actually needed. When results look suspiciously small, check whether the filtered column contains NULL values.
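Running the wrong and right versions side by side makes the behavior memorable. This sketch uses an invented orders table in SQLite through Python’s sqlite3 module:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, shipped_date TEXT);
    INSERT INTO orders VALUES (1, '2026-01-10'), (2, NULL), (3, '2026-01-12');
""")

# '= NULL' evaluates to unknown for every row, so this returns nothing.
wrong = conn.execute(
    "SELECT order_id FROM orders WHERE shipped_date = NULL"
).fetchall()

# IS NULL is the correct test for missing values.
right = conn.execute(
    "SELECT order_id FROM orders WHERE shipped_date IS NULL"
).fetchall()

print(wrong, right)  # [] [(2,)]
```
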
Never write = NULL; use IS NULL. Remember that NULL values often disappear from comparisons and can shrink your result set, so a quick WHERE some_column IS NULL query is a good way to learn how missingness appears in your table. Once you’re fluent with WHERE, logical combinations, common search operators, and careful handling of NULL, you can move from “I can view data” to “I can consistently find the exact slice of data that answers the question.” That’s the foundation for grouping and summarizing in the next chapter stages.
1. Which workflow best matches the chapter’s recommended approach for building a reliable filtered query?
2. Why does the chapter describe filtering as a “safety habit”?
3. You need rows that match multiple requirements (e.g., customers in a region AND with low inventory items). Which concept from the chapter applies?
4. When you want quick searching such as “names that start with a prefix” or “values in a known set,” what feature does the chapter highlight?
5. What is the purpose of adding ORDER BY and LIMIT after filtering, according to the chapter?
When you first run a SELECT query, the result can look “technically correct” but still be hard to use. Columns may have cryptic names, values might include extra spaces, and key metrics (like revenue per order) aren’t stored anywhere—you need to calculate them. This chapter is about turning raw database fields into a clean, readable, shareable result table.
You’ll practice a workflow that many analysts use daily: start by selecting the fields you need, rename them so humans can read them, build calculated fields, and then make your calculations and formatting safer. You’ll also learn beginner-friendly tools for cleaning text and working with dates, and you’ll use CASE to create categories and flags that make results easier to filter and group later.
To keep examples consistent, imagine a table named orders with columns like order_id, customer_name, order_date, status, quantity, unit_price, discount, and shipping_cost. The specific functions vary a little by database (PostgreSQL, MySQL, SQL Server, etc.), but the habits and reasoning transfer well.
By the end of the chapter, you should be able to craft a result table you could confidently paste into an email or dashboard without apology.
Practice note for Milestone 1: Create readable columns with aliases: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone 2: Build calculated fields for simple metrics: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone 3: Work with text and dates at a beginner level: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone 4: Use CASE to label and bucket values: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone 5: Produce a clean, shareable result table: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Milestone 1 is simple but high leverage: rename columns so the output reads like a report. Databases often use short, technical names (for example, cust_nm or order_dt) to satisfy storage or legacy constraints. Your job in analysis is different: make meaning obvious to the next person who reads the results (including future you).
Use AS to create aliases. Aliases don’t change the database—only the result set. This is important engineering judgment: you can make a query readable without requesting schema changes, and you can standardize naming across multiple queries.
Example:
SELECT
order_id AS "Order ID",
customer_name AS "Customer",
order_date AS "Order Date",
status AS "Order Status"
FROM orders;
Two practical notes. First, quoting rules differ by database: double quotes, backticks, or square brackets may be required for names with spaces. If you’re not sure, avoid spaces and use simple snake_case aliases (for example, keep order_date as-is, or alias it to order_date_clean). Second, be consistent: if one query uses Customer and another uses Client for the same field, your team will waste time reconciling.
Common mistakes include aliasing two columns to the same name (confusing in exports), using overly short aliases (defeats the purpose), and forgetting that aliases typically can’t be referenced in the same SELECT list (you may need to repeat the expression or use a subquery).
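The last point is worth seeing concretely. A sketch of the “repeat the expression” workaround, using the chapter’s example columns (exact behavior varies by engine; a few databases do allow alias reuse):

```sql
-- gross_revenue is an alias, so most engines won't let you write
-- "gross_revenue - discount" in the same SELECT list.
-- Instead, repeat the expression (or wrap the query in a subquery).
SELECT
  order_id,
  quantity * unit_price AS gross_revenue,
  (quantity * unit_price) - discount AS net_revenue
FROM orders;
```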
Milestone 2 is building calculated fields: metrics that are implied by the data but not stored directly. SQL supports basic arithmetic: +, -, *, and /. The trick is to write calculations that match real-world definitions and to label them clearly.
A classic example is line revenue (quantity times unit price) and a net total after discount and shipping:
SELECT
order_id AS "Order ID",
quantity,
unit_price,
quantity * unit_price AS gross_revenue,
(quantity * unit_price) - discount + shipping_cost AS net_revenue
FROM orders;
Percentages are just ratios multiplied by 100, but you must decide what the denominator should be. For example, discount rate is often discount / gross_revenue. If you want a percent value:
SELECT
order_id,
discount,
quantity * unit_price AS gross_revenue,
(discount / (quantity * unit_price)) * 100 AS discount_pct
FROM orders;
Engineering judgment matters here: confirm whether the business defines “discount” as a dollar amount or already as a fraction. Also consider numeric types. In some databases, integer division truncates decimals (e.g., 1/2 becomes 0). If you see a lot of 0% and 100% outputs, you may be accidentally dividing integers. A common fix is to cast one side to a decimal type (syntax varies) or multiply first using a decimal constant like 100.0 instead of 100.
Finally, think about rounding for readability. Many SQL engines have ROUND(value, 2). Use it for presentation, but keep full precision when you need accurate downstream aggregation.
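Putting both points together, a sketch of a decimal-safe, presentation-ready percentage (syntax may vary slightly by engine; column names follow the chapter’s example table):

```sql
-- Multiplying by 100.0 (not 100) forces decimal math even if both
-- columns are integers; ROUND trims the result for presentation only.
SELECT
  order_id,
  quantity * unit_price AS gross_revenue,
  ROUND((discount * 100.0) / (quantity * unit_price), 2) AS discount_pct
FROM orders;
```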
Real data is messy. Two issues cause most “my query broke” moments: divide-by-zero and missing values (NULLs). Milestone 3 is learning to make calculations resilient so the query runs and the output is interpretable.
First, divide-by-zero. If gross revenue can be zero, this breaks:
discount / gross_revenue
A common defensive pattern is NULLIF, which turns a zero into NULL so the division returns NULL instead of an error:
SELECT
order_id,
discount,
quantity * unit_price AS gross_revenue,
discount / NULLIF(quantity * unit_price, 0) AS discount_rate
FROM orders;
Second, NULL behavior. In SQL, NULL means “unknown,” and most arithmetic with NULL results in NULL. If shipping_cost is NULL, then gross - discount + shipping_cost becomes NULL even if the rest is valid. If your business rule is “missing shipping means zero,” use COALESCE(shipping_cost, 0) (or similar) to replace NULL with a default.
SELECT
order_id,
(quantity * unit_price)
- COALESCE(discount, 0)
+ COALESCE(shipping_cost, 0) AS net_revenue
FROM orders;
Use defaults carefully. Replacing NULL with 0 is only correct when “not provided” truly means “none.” If NULL means “not yet recorded,” filling with 0 can hide data quality problems. A practical approach is to compute both: a clean metric for reporting and a flag that shows missing components, so you can monitor data health.
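One way to sketch that “clean metric plus flag” idea (boolean expressions in the SELECT list work in PostgreSQL; other engines may need CASE instead):

```sql
-- Report the cleaned metric for reporting, plus a flag that shows
-- when any component was missing, so data issues stay visible.
SELECT
  order_id,
  (quantity * unit_price)
    - COALESCE(discount, 0)
    + COALESCE(shipping_cost, 0) AS net_revenue,
  (discount IS NULL OR shipping_cost IS NULL) AS has_missing_components
FROM orders;
```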
Milestone 4 begins with lightweight cleaning that prevents subtle filtering and grouping errors. Text fields often include leading/trailing spaces, inconsistent capitalization, or multiple fields that need to be combined for display.
TRIM removes extra spaces around a value. This matters because 'Acme' and 'Acme ' look the same but won’t match in an equality filter and will form separate groups in GROUP BY. Example:
SELECT
TRIM(customer_name) AS customer_clean,
status
FROM orders;
UPPER and LOWER standardize casing. This is useful before comparisons (again, database behavior varies: some collations are case-sensitive). Example:
SELECT
order_id,
UPPER(TRIM(status)) AS status_norm
FROM orders;
Concatenation combines text. Syntax varies: many systems use ||, others use CONCAT(a, b) or +. The goal is the same: create a presentation-friendly label such as “Order 12345 - SHIPPED.”
SELECT
order_id,
CONCAT('Order ', order_id, ' - ', UPPER(TRIM(status))) AS order_label
FROM orders;
Common mistakes: cleaning in some places but not others (leading to mismatched joins or groupings), and forgetting that NULL can “poison” concatenation in some databases (you may need COALESCE to protect it). In practice, build a consistent “normalized” version of key text fields in your query when you plan to filter, group, or share the results.
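A sketch of protecting a concatenated label from NULL (CONCAT behavior with NULL varies by engine, so the COALESCE makes the intent explicit; the 'UNKNOWN' placeholder is an illustrative choice):

```sql
-- Without the COALESCE, a NULL status can turn the whole label NULL
-- in some databases.
SELECT
  order_id,
  CONCAT('Order ', order_id, ' - ',
         COALESCE(UPPER(TRIM(status)), 'UNKNOWN')) AS order_label
FROM orders;
```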
Dates are where beginner queries often become unreliable. The good news is you can do a lot with a few patterns: use “today,” filter by ranges, and extract parts like month or day for reporting.
Most databases provide a “current date” function such as CURRENT_DATE (or GETDATE()/CURRENT_TIMESTAMP). If you want recent orders, compare against a range. The most portable habit is to filter using a half-open interval: include the start, exclude the end. This avoids time-of-day edge cases when your column is a timestamp.
SELECT
order_id,
order_date
FROM orders
WHERE order_date >= CURRENT_DATE - INTERVAL '30 day'
AND order_date < CURRENT_DATE + INTERVAL '1 day';
(Exact interval syntax varies; the concept does not.)
To summarize by month, extract a month component or (better) truncate to the month. Different engines use EXTRACT(MONTH FROM order_date), MONTH(order_date), or DATE_TRUNC('month', order_date). The practical outcome: you can group reliably without string hacks.
SELECT
EXTRACT(YEAR FROM order_date) AS order_year,
EXTRACT(MONTH FROM order_date) AS order_month,
COUNT(*) AS orders
FROM orders
GROUP BY 1, 2
ORDER BY 1, 2;
Common mistakes include comparing dates as strings (can sort incorrectly depending on format), mixing timestamps and dates without realizing it, and using “between” with end dates that accidentally exclude late-day records. When in doubt, inspect a few raw values and confirm the data type your database reports for the column.
Milestone 5 is producing a clean, shareable result table, and CASE is one of the best tools for that. It lets you label records, bucket numeric values, and create boolean-style flags—all inside the query—so the output is immediately usable for filtering and grouping.
A simple label example: normalize multiple statuses into a smaller set:
SELECT
order_id,
status,
CASE
WHEN UPPER(TRIM(status)) IN ('SHIPPED', 'DELIVERED') THEN 'Fulfilled'
WHEN UPPER(TRIM(status)) IN ('CANCELLED', 'REFUNDED') THEN 'Not fulfilled'
ELSE 'Open'
END AS fulfillment_group
FROM orders;
Bucketing numeric values is equally common. Suppose you want order size bands based on net revenue:
SELECT
order_id,
(quantity * unit_price) - COALESCE(discount,0) + COALESCE(shipping_cost,0) AS net_revenue,
CASE
WHEN (quantity * unit_price) <= 0 THEN 'Check'
WHEN (quantity * unit_price) - COALESCE(discount,0) + COALESCE(shipping_cost,0) < 50 THEN 'Small'
WHEN (quantity * unit_price) - COALESCE(discount,0) + COALESCE(shipping_cost,0) < 200 THEN 'Medium'
ELSE 'Large'
END AS order_size
FROM orders;
Two judgment calls matter. First, ordering: CASE stops at the first matching WHEN, so put the most specific rules first (and include a thoughtful ELSE). Second, “clean table” mindset: give the derived column a clear name, keep categories mutually exclusive, and include a “Check/Unknown” bucket so data issues don’t silently disappear.
Once you have aliases, safe calculations, cleaned text, date logic, and CASE categories, you’re no longer just viewing data—you’re shaping it into an analysis-ready table that others can trust.
1. A SELECT query returns correct results, but the column names are cryptic. What is the best Chapter 3 technique to make the output more readable without changing the underlying data?
2. Revenue per order isn’t stored in the orders table. According to the chapter’s workflow, what should you do to include it in your results?
3. Which situation best matches a common pitfall the chapter warns about when doing calculations?
4. You want to create a new column that labels orders as categories (for example, based on status or price range) to make filtering and grouping easier later. Which SQL tool from this chapter fits that goal?
5. What best describes the chapter’s end goal for your query output?
So far you’ve been asking questions by looking at individual rows: “Which orders happened last week?” or “Which customers are in Canada?” In real analysis, you often need the next level: an answer that is already summarized. Managers rarely want to scan 50,000 rows; they want a total, an average, a breakdown by category, or a compact KPI table they can compare week to week.
This chapter is about moving from rows to answers. You’ll learn the core aggregate functions (COUNT, SUM, AVG, MIN, MAX), then how to compare categories with GROUP BY. You’ll also learn two common sources of wrong results—counting duplicates and filtering at the wrong time—and how to fix them with DISTINCT and HAVING. Finally, you’ll build a simple KPI-style summary and practice the habit that separates careful analysts from reckless ones: quick sanity checks and reconciling totals.
Throughout the chapter, keep one workflow in mind: (1) define the population (often with WHERE), (2) choose your metrics (aggregates), (3) decide whether you need category comparisons (GROUP BY), (4) filter groups (HAVING), and (5) validate the answer with a cross-check. That workflow maps to the milestones in this chapter: summarize a table, group to compare, filter groups correctly, create a KPI summary, and validate with fast checks.
In the sections below, examples use a typical dataset like orders (order_id, customer_id, order_date, status, total_amount) and order_items (order_id, product_id, quantity, unit_price). Your table names may differ; the patterns do not.
Practice note for Milestone 1: Summarize a table with basic aggregates: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone 2: Group results to compare categories: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone 3: Filter groups correctly with HAVING: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone 4: Create a simple KPI-style summary table: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone 5: Validate totals with quick cross-checks: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Aggregate functions compress many rows into a smaller number of results—often a single row. They are your fastest way to answer “How many?”, “How much?”, and “What’s typical?” The most common aggregates are:
- COUNT(*): number of rows in the result set.
- COUNT(column): number of non-NULL values in that column.
- SUM(column): total of a numeric column (ignores NULL).
- AVG(column): average of a numeric column (ignores NULL).
- MIN(column)/MAX(column): smallest/largest value (works for numbers, dates, and text by sort order).
A basic “summarize the table” milestone is to produce a one-row profile of a dataset. For example, to summarize paid orders in 2025:
SELECT
COUNT(*) AS order_count,
SUM(total_amount) AS revenue,
AVG(total_amount) AS avg_order_value,
MIN(order_date) AS first_order_date,
MAX(order_date) AS last_order_date
FROM orders
WHERE status = 'PAID'
AND order_date >= '2025-01-01'
AND order_date < '2026-01-01';
Common mistakes here are subtle but expensive. First, confusing COUNT(*) with COUNT(column): if total_amount is sometimes NULL, COUNT(total_amount) will undercount orders. Second, averaging the wrong thing: AVG(total_amount) is an average per order, but AVG(unit_price) in an item table is an average per item line, not per order. Third, if you join tables before aggregating, you can accidentally multiply rows and inflate COUNT and SUM. When in doubt, aggregate at the correct “grain” (per order, per customer, per day) and only then join summaries together.
Practical tip: when you run an aggregate query, always read the aliases like a sentence. “order_count, revenue, avg_order_value” is easy to interpret and much harder to misreport than raw function outputs.
GROUP BY changes the shape of your result: instead of one answer for the whole table, you get one answer per category (or per combination of categories). This is the core skill behind “compare categories,” the second milestone of the chapter.
Suppose you want revenue and order counts by status:
SELECT
status,
COUNT(*) AS orders,
SUM(total_amount) AS revenue
FROM orders
GROUP BY status;
Think of GROUP BY as: “make piles of rows with the same status, then compute aggregates inside each pile.” Everything in your SELECT list must be either (a) grouped columns (like status) or (b) aggregates (like SUM(total_amount)). Beginners often try to include a non-grouped column such as order_date; SQL rejects that because there isn’t a single date per status.
You can group by multiple columns to create a two-dimensional summary. For example, orders per month and status:
SELECT
DATE_TRUNC('month', order_date) AS order_month,
status,
COUNT(*) AS orders,
SUM(total_amount) AS revenue
FROM orders
WHERE order_date >= '2025-01-01'
AND order_date < '2026-01-01'
GROUP BY DATE_TRUNC('month', order_date), status
ORDER BY order_month, status;
Engineering judgment: group only on fields you can explain. Grouping by a highly unique field (like order_id) produces nearly as many groups as rows, which defeats the purpose of summarizing. A good group field is one that creates a manageable number of categories: status, country, product category, month, sales rep.
Also watch for “group explosion” when grouping by two or three fields at once. If each field has many values, their combinations can create thousands of groups, which is slower and harder to interpret. Start simple (one grouping column), confirm the result, then add dimensions if they support a clear question.
Many real datasets contain repeated identifiers. If you count rows when you meant to count unique entities, your results can be wildly wrong. DISTINCT is how you tell SQL: “treat duplicates as one.” This is especially important after joins and in detail tables like order_items.
Example: “How many customers placed at least one paid order?” If you do COUNT(customer_id) on orders, you will count repeat customers multiple times. The correct pattern is COUNT(DISTINCT customer_id):
SELECT
COUNT(DISTINCT customer_id) AS unique_customers
FROM orders
WHERE status = 'PAID';
Now consider a join scenario. Suppose you join orders to order_items to analyze products. Each order may have multiple items, so the join multiplies rows. If you then run COUNT(*), you are counting item lines, not orders. If your question is “How many orders contained at least one item?”, use COUNT(DISTINCT orders.order_id) after the join:
SELECT
COUNT(DISTINCT o.order_id) AS orders_with_items
FROM orders o
JOIN order_items oi ON oi.order_id = o.order_id
WHERE o.status = 'PAID';
Common mistake: using SELECT DISTINCT ... as a “fix” for messy joins without understanding the grain. Sometimes DISTINCT hides a join bug rather than solving it. Use it deliberately: when the question is inherently “unique customers,” “unique orders,” or “unique products,” COUNT(DISTINCT ...) is appropriate. When the question is “how many line items,” you should not use DISTINCT.
Practical outcome: once you’re comfortable with DISTINCT, you can build trustworthy KPIs like “active customers” and “orders per customer” without overcounting.
WHERE filters rows before aggregation. HAVING filters groups after aggregation. This distinction is not academic—it determines whether your summary is correct. The third milestone in this chapter is learning to filter groups correctly with HAVING.
Example: “Which customers have spent more than $1,000 in paid orders?” You must first group by customer, compute spend, then filter on that computed total. That means HAVING:
SELECT
customer_id,
SUM(total_amount) AS total_spend
FROM orders
WHERE status = 'PAID'
GROUP BY customer_id
HAVING SUM(total_amount) > 1000
ORDER BY total_spend DESC;
If you try to put SUM(total_amount) > 1000 in WHERE, SQL will reject it because SUM doesn’t exist until after grouping. More importantly, even if your database allowed some workaround, you would be mixing up “row-level conditions” with “group-level conditions.”
Use this mental model: WHERE decides which rows are allowed into each group before anything is computed; HAVING decides which finished groups are allowed into the output after the aggregates exist.
One common mistake is applying a row filter that changes the meaning of the group. For instance, if you filter out low-value orders in WHERE, then compute “customer total spend,” you are no longer measuring total spend—you are measuring spend on high-value orders only. That may be correct, but it must be intentional and clearly named (e.g., high_value_spend).
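A sketch of making that intent explicit in the query itself (the $100 threshold is an illustrative assumption):

```sql
-- The WHERE clause changes what the metric means, so the alias says so:
-- this is spend on orders over $100, not total spend.
SELECT
  customer_id,
  SUM(total_amount) AS high_value_spend
FROM orders
WHERE status = 'PAID'
  AND total_amount > 100
GROUP BY customer_id
ORDER BY high_value_spend DESC;
```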
Practical outcome: HAVING lets you produce lists like “top customers,” “products with low sales,” or “months with unusually high refund rates” in a single query.
Once you can group, the next practical step is to rank: “top categories,” “bottom regions,” “most common statuses.” This is where ORDER BY and LIMIT turn a summary into an actionable list. The key is to sort by the aggregate, not by the category label.
Example: top 10 product categories by revenue (assuming you have a products table with category):
SELECT
p.category,
SUM(oi.quantity * oi.unit_price) AS revenue
FROM order_items oi
JOIN products p ON p.product_id = oi.product_id
GROUP BY p.category
ORDER BY revenue DESC
LIMIT 10;
Two judgment calls matter here. First, decide which metric defines “top”: revenue, units, margin, number of orders, or number of unique customers. Different metrics tell different stories. For example, a category might be “top by units” but not “top by revenue” if it contains low-priced items.
Second, be consistent about filtering. If “top categories” should reflect only completed business, filter status in WHERE via the orders table (and join carefully). A common mistake is to compute revenue from items but forget to restrict to paid orders, accidentally counting canceled transactions. Another mistake is ordering by SUM(...) without an alias and then changing the expression later; using an alias like revenue makes maintenance safer.
Milestone 4—building a KPI-style summary—often uses this same pattern, but produces a compact table rather than a ranked list. For instance, you might group by month to produce monthly revenue, orders, and unique customers, then order by month to form a time series you can chart.
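A sketch of that monthly KPI table (DATE_TRUNC is PostgreSQL-style; other engines use EXTRACT or MONTH(), as noted earlier):

```sql
-- One row per month: orders, unique customers, and revenue,
-- restricted to paid orders and ordered as a time series.
SELECT
  DATE_TRUNC('month', order_date) AS order_month,
  COUNT(*) AS orders,
  COUNT(DISTINCT customer_id) AS unique_customers,
  SUM(total_amount) AS revenue
FROM orders
WHERE status = 'PAID'
GROUP BY DATE_TRUNC('month', order_date)
ORDER BY order_month;
```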
Aggregations feel authoritative because they return clean numbers—but they can be clean and wrong. The fifth milestone is adopting quick cross-checks so you can trust your results before sharing them. This is a professional habit: validate totals, reconcile between levels of detail, and spot-check a few groups.
A simple reconciliation is: “Do grouped totals add up to the overall total?” Compute the grand total once, then compute a grouped breakdown and sum it. If they differ, you likely filtered differently, joined incorrectly, or changed the grain.
Grand total revenue:
SELECT SUM(total_amount) AS revenue
FROM orders
WHERE status = 'PAID';
Breakdown by status (should match if the WHERE is the same and status partitions the data):
SELECT status, SUM(total_amount) AS revenue
FROM orders
WHERE status = 'PAID'
GROUP BY status;
In this example, it will not “match” across statuses because you filtered to only PAID; you’ll only see one status. That’s the point: your sanity check should reflect the logic you intended. If you want all statuses, remove the status filter and confirm the sum across all groups equals the ungrouped sum.
Spot-checking is equally effective. If a category shows unusually high revenue, pull a few underlying rows to see whether it’s real or a join multiplication issue. Another reliable check is comparing COUNT(*) to COUNT(DISTINCT order_id) after joins; if the gap is large, you are likely counting detail rows rather than entities.
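That last check can be a single query. A sketch, using the chapter’s example tables:

```sql
-- If detail_rows is much larger than unique_orders, the join has
-- multiplied rows and you are counting item lines, not orders.
SELECT
  COUNT(*) AS detail_rows,
  COUNT(DISTINCT o.order_id) AS unique_orders
FROM orders o
JOIN order_items oi ON oi.order_id = o.order_id;
```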
Two habits support these checks: filter date ranges with half-open intervals (>= start and < next day/month) to avoid missing end-of-day data, and know whether revenue is stored directly (total_amount) or must be computed from items. The practical outcome is confidence. When you can explain how your numbers were computed—and show that grouped totals reconcile—you can move fast without being careless, which is exactly what SQL is for.
1. Which workflow best matches the chapter’s recommended steps for producing a correct summary answer?
2. A manager wants a single number: the total revenue from all orders. Which approach best fits the chapter’s idea of a “one-row answer”?
3. You want to compare average order value across order status (e.g., shipped, cancelled). What is the key SQL feature to create a “one-row-per-category answer”?
4. You need to keep only categories whose totals meet a threshold (e.g., statuses with COUNT(*) >= 100). Which clause is designed to filter at the group level?
5. Your count of customers seems too high because a customer appears multiple times in the data. Which technique from the chapter is meant to prevent overcounting duplicates?
So far, you have been able to answer many questions using a single table: filter rows, sort, aggregate, and group. Real databases, however, rarely store everything you need in one place. Instead, data is spread across multiple tables on purpose. This chapter teaches you how to “connect the dots” with joins—combining related rows from different tables into one result set you can analyze.
Joining tables is the moment SQL starts to feel like you can ask real business questions: “Which customers placed the most orders?”, “What products drive revenue?”, “Which orders have no shipment yet?” These questions require relationships between tables and a careful approach to avoid subtle mistakes like duplicated totals or missing rows.
In this chapter you’ll work through five practical milestones: understanding how tables relate, writing your first INNER JOIN, using LEFT JOIN to keep unmatched rows, preventing duplicate inflation, and building a multi-table query to answer a realistic question end-to-end.
A recurring decision is choosing INNER vs LEFT based on whether “missing matches” should disappear or stay visible. We’ll use a typical commerce-style example: customers, orders, order_items, and products. The exact names don’t matter—what matters is the pattern. Once you learn it, you’ll recognize it in almost any dataset.
As you read, keep one guiding idea: a join is not magic—it is a matching operation between columns. If you know what should match, and how many rows you expect, you can debug almost any join problem.
Practice note for Milestone 1: Understand relationships between tables: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone 2: Write your first INNER JOIN to combine data: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone 3: Use LEFT JOIN to keep unmatched rows: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone 4: Prevent duplicate inflation when joining: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone 5: Build a multi-table query for a real question: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Databases split data into multiple tables to keep information clean, consistent, and efficient. Imagine storing customer data (name, email, address) inside every order row. The first time a customer updates their email, you would need to update it in every past order. That’s slow, error-prone, and a recipe for conflicting values.
Instead, a database typically stores “things” in separate tables: customers in customers, orders in orders, products in products. Each table focuses on a single subject and keeps repeated facts in one place. Orders then reference customers, and order items reference both orders and products. This design reduces duplication and makes updates safer.
This structure also improves performance. Smaller tables with fewer repeated columns are faster to scan and easier to index. Finally, splitting data supports multiple uses: the product catalog may be maintained by a merchandising team, while orders are generated by the checkout system.
Your practical milestone here is to look at a schema and recognize the relationships: “One customer can have many orders,” and “One order can have many order_items.” Once you see these patterns, you know joins are required to answer questions that cross boundaries, like “revenue by product category” or “orders per customer.”
A join needs a reliable way to connect rows. That’s what keys provide. A primary key is a column (or set of columns) that uniquely identifies each row in its table. In customers, the primary key is often customer_id. In orders, it’s often order_id. Unique means no duplicates and no ambiguity: one key value points to exactly one row.
A foreign key is a column in one table that refers to the primary key in another table. For example, orders.customer_id refers to customers.customer_id. This is the “linking idea”: the order stores the customer’s ID, not a copy of the customer’s full profile.
In practice, you don’t always need the database to enforce foreign keys for you, but you should think as if they exist. When you join, you are making an assumption: “these columns represent the same entity.” Engineering judgment comes from validating that assumption with quick checks: confirm the key is unique on the “one” side, and look for rows whose key value has no match on the other side.
Common mistakes include joining on the wrong column (like customer_name instead of customer_id) or joining a “code” to a “label” (e.g., status_id to status_name). Names change; IDs are meant to be stable. When available, always join on keys or key-like columns.
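One quick validation is an orphan check: list child rows whose foreign key has no parent. A sketch using LEFT JOIN, which this chapter covers in depth shortly:

```sql
-- Orders whose customer_id matches no row in customers ("orphans").
-- An empty result means every order links to a known customer.
SELECT o.order_id, o.customer_id
FROM orders o
LEFT JOIN customers c ON c.customer_id = o.customer_id
WHERE c.customer_id IS NULL;
```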
Once you can spot primary keys and foreign keys, you’re ready for your first real join: combine customer attributes with order attributes in a single query.
An INNER JOIN returns rows only when there is a match in both tables. Think of it as “give me the overlap.” This is perfect when you only care about records that connect cleanly—for example, orders that belong to known customers.
Start with a base table and add one join at a time. A clear first join is orders to customers:
SELECT o.order_id, o.order_date, c.customer_id, c.email
FROM orders o
INNER JOIN customers c
ON o.customer_id = c.customer_id;
Notice the pattern: FROM selects the left table, JOIN selects the right table, and ON defines the match. Use table aliases (o, c) to keep column references readable and unambiguous.
Practical workflow: after writing an inner join, validate it quickly. First, check a small sample with LIMIT. Second, compare counts. If you expect one row per order, verify it:
SELECT COUNT(*) FROM orders;
SELECT COUNT(*)
FROM orders o
INNER JOIN customers c ON o.customer_id = c.customer_id;
If the join count is lower than the number of orders, some orders don’t have matching customers (or the key values don’t line up). That may be acceptable (e.g., deleted customers), or it may indicate a data issue. An INNER JOIN hides unmatched records, so it’s great for “clean” analysis but risky if you need completeness.
This milestone is about correctness: write the join, qualify columns (o.order_date not just order_date), and confirm your row counts align with your intent.
A LEFT JOIN keeps all rows from the left table, even if there is no match on the right. When there is no match, the right table’s columns appear as NULL. This join is essential for finding missing relationships: orders without shipments, customers without orders, products never purchased.
Example: list all customers and show their most recent order date if they have one. Start by keeping all customers:
SELECT c.customer_id, c.email, MAX(o.order_date) AS last_order_date
FROM customers c
LEFT JOIN orders o
ON o.customer_id = c.customer_id
GROUP BY c.customer_id, c.email;
Here, customers with no orders still appear, and last_order_date will be NULL. That is useful information: it distinguishes “no activity” from “activity we failed to join.”
Common mistake: placing a filter on the right table in the WHERE clause, which can accidentally turn your left join into an inner join. For example:
... LEFT JOIN orders o ON o.customer_id = c.customer_id
WHERE o.status = 'COMPLETE';
This removes rows where o.status is NULL, eliminating customers with no orders—exactly what you were trying to keep. A safer pattern is to put the filter into the join condition:
LEFT JOIN orders o
ON o.customer_id = c.customer_id
AND o.status = 'COMPLETE'
This milestone is about intent: decide what “should be preserved,” put that table on the left, and be careful where you place filters so you don’t silently drop unmatched rows.
The most common join pitfall is duplicate inflation. It happens when you join a “one” table to a “many” table and then aggregate at the wrong level. For example, one order has many order items. If you join orders to order_items, you will get one row per item, not one row per order.
This is not inherently wrong—often it’s exactly what you want. The problem appears when you sum order-level values after expanding the rows. If orders has a column like order_total and you join to order_items, each item row repeats the same order_total, and summing it will overcount.
Safer patterns:
- Aggregate the “many” side first: group order_items by order_id (or product) and sum quantity * unit_price, not an order header total.
- COUNT(DISTINCT o.order_id) can fix counts, but it doesn’t fix inflated sums.

Example: compute revenue per order from items (the “many” side), then you can safely sum across orders:
SELECT oi.order_id, SUM(oi.quantity * oi.unit_price) AS order_revenue
FROM order_items oi
GROUP BY oi.order_id;
Then join:
SELECT o.order_id, o.order_date, r.order_revenue
FROM orders o
LEFT JOIN (
SELECT oi.order_id, SUM(oi.quantity * oi.unit_price) AS order_revenue
FROM order_items oi
GROUP BY oi.order_id
) r ON r.order_id = o.order_id;
This milestone is about engineering judgment: always ask, “What is one row in my result supposed to represent?” If the join changes that grain, your aggregations must change too.
When a join result looks wrong, treat it like debugging. Your goal is to discover whether the issue is (1) the join condition, (2) the data, or (3) the level of detail. A reliable approach is to check row counts and inspect a few “problem” rows.
First, use LEFT JOIN plus a NULL check to find mismatches. Example: orders with no matching customer:
SELECT o.order_id, o.customer_id
FROM orders o
LEFT JOIN customers c
ON o.customer_id = c.customer_id
WHERE c.customer_id IS NULL;
This isolates the rows that fail to link. If you expected zero, you’ve found a data integrity issue (missing customers, bad IDs, type mismatch). If you expected some, you now know exactly how many and can decide whether to exclude them (use INNER JOIN) or report them.
Second, watch for accidental filtering. If you left join shipments to orders but then put WHERE shipments.shipped_date >= '2026-01-01', you’ll drop unshipped orders because their shipped_date is NULL. Prefer conditions in the ON clause when you are trying to preserve left-side rows.
Third, validate the “grain” after each join. If you start with orders (one row per order) and join to order_items, you now have one row per item. That’s expected, but you must not interpret counts as orders anymore unless you use COUNT(DISTINCT o.order_id) or re-aggregate back to orders.
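A quick way to validate the grain, using this chapter’s example tables, is to compare total rows against distinct orders after the join:

```sql
-- After joining to order_items, rows grow but distinct orders should not.
SELECT COUNT(*) AS row_count,
       COUNT(DISTINCT o.order_id) AS order_count
FROM orders o
INNER JOIN order_items oi ON oi.order_id = o.order_id;
```

If row_count is larger than order_count, each result row now represents an item, not an order, and any order-level metric must account for that.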
Finally, build multi-table queries incrementally. A realistic question might be: “Which product categories generated the most revenue last month?” You can do this by joining items to products (to get category), joining to orders (to filter by date/status), then aggregating:
SELECT p.category, SUM(oi.quantity * oi.unit_price) AS revenue
FROM order_items oi
INNER JOIN products p ON p.product_id = oi.product_id
INNER JOIN orders o ON o.order_id = oi.order_id
WHERE o.order_date >= '2026-02-01' AND o.order_date < '2026-03-01'
GROUP BY p.category
ORDER BY revenue DESC;
If the revenue looks too high, revisit Section 5.5: are you summing at the item level (good) or repeating an order-level value (bad)? If categories are missing, revisit Section 5.4: should a left join be used to keep products with zero sales? Reading joins correctly means interpreting NULL as “no match,” and treating each join as a deliberate change to what each row represents.
1. Why do real databases often require joins to answer business questions?
2. What is the most accurate way to think about a SQL join?
3. How should you choose between INNER JOIN and LEFT JOIN in this chapter’s workflow?
4. When building a multi-table query, what workflow does the chapter recommend to reduce mistakes?
5. What is a common risk the chapter warns about when joining tables, especially for counts and sums?
In the first chapters, you learned the “what” of SQL: select columns, filter rows, sort, and summarize. This chapter is about the “how” professionals work: writing queries you can trust, read later, and reuse safely. Trustworthy queries are not only correct today—they remain correct when data grows, when new categories appear, when someone else runs them, and when you return in three weeks and can’t remember what you meant.
You’ll practice a repeatable workflow (Milestone 1), learn habits that make long queries readable (Milestone 2), and turn everyday questions into a small report (Milestone 3). You’ll also learn quick tests to catch mistakes and edge cases (Milestone 4) and simple ways to save and reuse your work as templates (Milestone 5). The goal is practical: move from “I can write SQL” to “I can deliver results with confidence.”
Throughout this chapter, think like an analyst and an engineer at the same time. Analysts translate questions into metrics and slices. Engineers reduce ambiguity and prevent silent errors. SQL sits exactly in the middle, so your habits matter as much as your syntax.
Practice note for Milestone 1: Use a repeatable checklist to write any query: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone 2: Organize longer queries for readability: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone 3: Turn questions into a small report with SQL: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone 4: Catch mistakes with quick tests and edge cases: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone 5: Save and reuse queries as templates: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
When you feel stuck in SQL, it’s usually not a syntax problem—it’s a planning problem. A repeatable checklist prevents “random walking” through the database and helps you explain your logic to others. Use this workflow every time: question → tables → query → check.
1) Clarify the question. Write it as a sentence with a time window and definitions. “Revenue last month” should become “total paid order amount for completed orders in February 2026, in USD.” Decide what “completed” means and whether refunds count. If you don’t define terms, SQL will happily produce a number that looks authoritative but is meaningless.
2) Identify the tables and keys. Ask: where do the facts live (orders, payments, events) and where do the attributes live (customers, products)? Find the join keys (order_id, customer_id). If you join without understanding the relationship (one-to-one vs one-to-many), you risk duplicated rows and inflated totals.
3) Build the query in layers. Start with a small SELECT that returns raw rows you recognize. Add WHERE filters next. Then add grouping and aggregates. Only after the result is correct should you add formatting (aliases, rounding) for presentation.
4) Check your work. Verify row counts, spot-check a few records, and compare against a known truth if available (a dashboard, an invoice total, yesterday’s result). You can also run “sanity” queries: count distinct keys, check min/max dates, and confirm there are no unexpected NULLs.
- Check MIN(date) and MAX(date) to confirm you hit the intended range.
- Compare SUM(amount) against COUNT(*) * AVG(amount) to see if they roughly align.

This workflow is your reliability baseline. It also makes your SQL easier to debug because each step has a clear purpose and a natural “checkpoint” you can return to.
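A sketch of these sanity queries in one pass, assuming an orders table with order_date and amount columns as in earlier examples:

```sql
-- Did the filter hit the intended window, and is anything unexpectedly NULL?
SELECT MIN(order_date) AS first_date,
       MAX(order_date) AS last_date,
       COUNT(*) AS row_count,
       COUNT(DISTINCT order_id) AS distinct_orders,
       SUM(CASE WHEN amount IS NULL THEN 1 ELSE 0 END) AS null_amounts
FROM orders
WHERE order_date >= '2026-02-01' AND order_date < '2026-03-01';
```

If first_date or last_date falls outside the window, or null_amounts is nonzero when you expected complete data, stop and investigate before aggregating further.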
Readable SQL is maintainable SQL. As queries get longer—more joins, more filters, more metrics—clarity becomes a feature. Good formatting and comments don’t just help teammates; they help you catch errors. Many mistakes are visible only when the query is laid out in a predictable structure.
Formatting pattern: put each clause on its own line and align lists vertically. For example, list selected columns one per line, list join conditions clearly, and indent subqueries. This reduces the chance you miss a comma or accidentally apply a filter to the wrong table alias.
Use consistent aliases. Prefer short, meaningful aliases like o for orders, c for customers, p for products. Avoid reusing t1, t2 across different queries—those names communicate nothing. In result columns, choose clear names with AS: total_revenue, orders_count, avg_order_value. These names become the “API” of your report.
Comment with intent, not narration. Good comments explain why a filter exists (“exclude test accounts”) or define a metric (“revenue excludes tax and shipping”). Avoid comments that simply repeat the SQL. Many editors support -- single-line and /* multi-line */ comments.
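For example (the status value and the internal_accounts table here are illustrative, not part of the chapter’s schema):

```sql
-- Revenue excludes tax and shipping (finance definition).
SELECT SUM(oi.quantity * oi.unit_price) AS total_revenue
FROM order_items oi
INNER JOIN orders o ON o.order_id = oi.order_id
WHERE o.status = 'COMPLETE'          -- exclude cancelled and pending orders
  AND o.customer_id NOT IN (         /* exclude internal test accounts;
                                        internal_accounts is hypothetical */
      SELECT customer_id FROM internal_accounts);
```

Notice that each comment states a decision or a definition, not a restatement of the syntax below it.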
- Avoid SELECT * in “final” queries—explicit columns make changes safer.

These habits support Milestone 2: organizing longer queries. They also make reuse easier later (Milestone 5), because a well-named, well-commented query is already halfway to being a template.
As your questions become more report-like, you’ll often need “a query inside a query.” That’s where subqueries and common table expressions (CTEs) help. Conceptually, both let you create an intermediate result set, then build on it. Think of them as temporary tables that exist only while the query runs.
Subquery: often used inline, for example in a FROM clause (a “derived table”) or in a WHERE condition. Subqueries are powerful but can become hard to read when nested deeply.
CTE (WITH clause): a named step at the top of your query: WITH step_name AS (...). A CTE is usually the best beginner choice for readability because it lets you label each stage: filtered_orders, customer_totals, monthly_metrics, etc. This supports Milestone 2 (readability) and Milestone 3 (report-building) because your query starts to look like a small pipeline.
Practical pattern: use a CTE to filter and shape your data first, then aggregate later. For example, a first CTE might select only completed orders in a date range with the exact columns you need. A second CTE might join to customers to add region. The final SELECT groups and computes metrics. When each step is named, it’s easier to test each step independently (Milestone 4): run the CTE’s SELECT alone to verify row counts and sample records.
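A sketch of that pipeline, reusing the orders and customers tables from earlier chapters (the region column on customers is an assumption for illustration):

```sql
-- Step 1: one row per completed order in February 2026.
WITH filtered_orders AS (
    SELECT order_id, customer_id, order_date
    FROM orders
    WHERE status = 'COMPLETE'
      AND order_date >= '2026-02-01' AND order_date < '2026-03-01'
),
-- Step 2: still one row per order, now with the customer's region.
orders_with_region AS (
    SELECT f.order_id, c.region
    FROM filtered_orders f
    INNER JOIN customers c ON c.customer_id = f.customer_id
)
-- Step 3: aggregate to one row per region.
SELECT region, COUNT(*) AS orders_count
FROM orders_with_region
GROUP BY region;
```

To test a single stage, run its SELECT on its own and check the row count before moving on.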
One warning: intermediate steps can hide duplication if you join incorrectly. Always know what a row represents at each stage (“one row per order” vs “one row per line item”). Write that as a comment above the CTE to prevent accidental metric inflation later.
You don’t need to be a database administrator to write efficient queries, but you should learn one core principle: reduce data early. Query performance is often about how many rows and columns the database must scan, join, and group. If you can shrink the dataset before expensive operations, your query is usually faster and more reliable.
Reduce columns: avoid SELECT * when working with large tables. Pull only the columns you need for joins, filters, grouping, and final output. This makes results easier to inspect and can reduce work for the database, especially when tables have wide text or JSON columns.
Reduce rows: apply WHERE filters as early as possible. Filter by date range, status, country, or other high-impact conditions before joining to other tables. For example, filtering orders to “last 30 days” before joining to customers is usually cheaper than joining all historical orders first and filtering later.
Be careful with joins. Joining large tables multiplies work. Join only what you need, and join on correct keys. If a join should be one-to-one but is actually one-to-many, grouping later may produce inflated sums. A quick performance-and-correctness test is to compare COUNT(*) before and after the join, and also COUNT(DISTINCT primary_key).
- Use LIMIT to iterate quickly, but remove or adjust it for final reporting.

These basics won’t replace indexing or query plans, but they will prevent the most common beginner problem: slow, confusing queries that are hard to validate and even harder to reuse.
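One way to apply “reduce data early,” assuming the same orders and customers tables (the country column is illustrative): filter and trim columns in a CTE before joining.

```sql
-- Shrink orders to the needed rows and columns before the join.
WITH recent_orders AS (
    SELECT order_id, customer_id, order_date
    FROM orders
    WHERE order_date >= '2026-02-01'
)
SELECT c.country, COUNT(*) AS orders_count
FROM recent_orders r
INNER JOIN customers c ON c.customer_id = r.customer_id
GROUP BY c.country
LIMIT 20;   -- keep output small while iterating; remove for the final report
```

The database now joins only recent orders instead of the full history, which is usually cheaper and easier to validate.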
A “one-page report” is a single query (often with CTEs) that answers a business question with a few key metrics and slices—something you could paste into an email or put into a spreadsheet. This is Milestone 3: turning questions into a small report with SQL.
Start by deciding the report shape: what are the dimensions (how you slice) and what are the metrics (what you measure)? Dimensions might be month, product category, country, or plan type. Metrics might include COUNT(*) orders, SUM(amount) revenue, and AVG(amount) average order value. Then decide your global filters: date window, only completed transactions, exclude internal users.
A practical pattern is:
- Filter rows first with WHERE (date window, completed status).
- GROUP BY your dimensions and compute metrics.
- Filter the grouped results with HAVING (e.g., only show categories with at least 50 orders).

Use HAVING carefully: it filters after grouping, so it’s perfect for “only show meaningful groups.” Use WHERE for row-level filters like date range or status. For trust, include a small “total row” as a separate query or use a second aggregation to validate that grouped sums match overall sums.
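The WHERE-then-HAVING split can be sketched with the chapter’s category-revenue example:

```sql
-- WHERE filters rows before grouping; HAVING filters groups after.
SELECT p.category,
       COUNT(DISTINCT o.order_id) AS orders_count,
       SUM(oi.quantity * oi.unit_price) AS revenue
FROM order_items oi
INNER JOIN products p ON p.product_id = oi.product_id
INNER JOIN orders o ON o.order_id = oi.order_id
WHERE o.order_date >= '2026-02-01' AND o.order_date < '2026-03-01'
GROUP BY p.category
HAVING COUNT(DISTINCT o.order_id) >= 50;   -- only show meaningful groups
```

Swapping the HAVING condition into WHERE would be an error here: before grouping, there is no per-category order count to compare against.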
Finally, make it reusable (Milestone 5): parameterize the date range in comments (e.g., “set start_date/end_date here”), keep dimension choices obvious, and avoid embedding one-off exclusions without explanation. A report query is a living artifact—treat it like a small piece of code you expect to run again.
Trustworthy queries come from catching mistakes early. Here are common beginner errors and quick troubleshooting moves (Milestone 4), along with habits that prevent repeats.
1) Inflated counts or sums (join duplication). Symptom: revenue is too high, order count jumps after a join. Fix: check join cardinality. Run COUNT(*) and COUNT(DISTINCT o.order_id) before and after the join. If the distinct count stays stable but total rows grow, you likely joined to a one-to-many table (like order_items). Decide whether you should aggregate the many-side first, or change the metric to match the row grain.
2) Wrong date filtering. Symptom: missing recent records or including an extra day. Fix: confirm the column type (date vs timestamp) and use a consistent pattern (e.g., inclusive start, exclusive end). Validate with MIN/MAX of the filtered results.
3) NULL surprises. Symptom: averages look off, groups appear as blank. Fix: check how aggregates treat NULL (most ignore NULLs). Decide whether to keep NULLs as “unknown” or replace with a label in reporting. Always quantify NULLs with a quick count.
4) GROUP BY errors. Symptom: SQL complains about non-aggregated columns, or results are too granular. Fix: ensure every selected non-aggregated column is in GROUP BY. Reconfirm what each row should represent in the final output.
5) Hard-to-reuse queries. Symptom: you can’t safely change the date range without breaking logic. Fix: apply Milestone 5—save a clean template version with clear comments, consistent aliasing, and a base CTE that defines the population.
Troubleshooting mindset: reduce the problem. Comment out parts, run intermediate CTEs, add LIMIT to inspect raw rows, and compare results to a simplified version of the same logic. SQL is deterministic: if you isolate the step where the numbers change, you will find the cause.
1. What best defines a “trustworthy” SQL query in this chapter?
2. Milestone 1 recommends a repeatable checklist primarily to:
3. According to the chapter, organizing longer queries for readability (Milestone 2) is valuable because it:
4. Turning questions into a small report with SQL (Milestone 3) most closely reflects which analyst skill?
5. What is the main purpose of quick tests and edge cases (Milestone 4)?