
Data Cleaning Basics: Fix Missing Values, Duplicates & Typos

Data Science & Analytics — Beginner


Turn messy spreadsheets into trustworthy data—step by step.

Beginner data-cleaning · missing-values · duplicates · typos

Make messy data usable—even if you’ve never done this before

Real-world data is rarely neat. It often arrives as a spreadsheet with empty cells, repeated rows, inconsistent labels, and small typing mistakes that quietly break totals, charts, and decisions. This beginner course is a short, book-style guide that teaches data cleaning from the ground up—using plain language and simple steps you can apply in Excel or Google Sheets.

You’ll learn how to recognize the most common issues that appear in everyday datasets: missing values, duplicates, and typos. More importantly, you’ll learn safe ways to fix them without “guessing” or accidentally changing meaning. The goal is not perfection—it’s trustworthy, consistent data that is good enough for analysis and reporting.

What you’ll be able to do by the end

By the final chapter, you’ll have a repeatable workflow you can use on almost any spreadsheet dataset. You’ll know how to inspect data before changing it, apply clear rules for fixes, and verify that your changes improved quality.

  • Identify missing values and decide when to leave, remove, or fill them
  • Find duplicates and keep the correct record (not just delete blindly)
  • Fix typos and standardize text so categories group correctly
  • Run simple “before vs. after” checks to confirm your results
  • Document your cleaning steps so others can trust your work

How the course is structured (like a short technical book)

This course is organized into exactly six chapters. Each chapter builds on the last, so you start with the idea of “clean data,” then learn to profile a dataset, then fix one problem type at a time, and finally combine everything into a single workflow.

You’ll begin with first principles: what rows and columns represent, what counts as an error, and why making a safe copy matters. Next, you’ll learn quick ways to scan a dataset for trouble spots and record what you find. Then you’ll tackle missing values, duplicates, and typos using simple rules and checks designed for beginners. The final chapter brings it all together into a repeatable checklist and a small quality report you can reuse at school, at work, or on home projects.

Who this is for

This course is for absolute beginners—no coding, no statistics, and no prior data science experience required. If you can open a spreadsheet, sort a column, and save a copy of a file, you can follow along.

  • Students learning analytics for the first time
  • New analysts and operations staff working with spreadsheets
  • Teams in business or government who need reliable reporting
  • Anyone who wants fewer errors and more confidence in their data

Get started

If you want to turn messy spreadsheets into clean, dependable datasets, this course will guide you step by step. Register free to begin, or browse all courses to find the right learning path for your goals.

What You Will Learn

  • Explain what data cleaning is and why it affects results and decisions
  • Spot common data problems: missing values, duplicates, typos, and inconsistent formats
  • Choose safe ways to handle missing values (leave, remove, fill) for simple datasets
  • Find and remove duplicates while keeping the correct record
  • Standardize text and fix common typos using simple, repeatable rules
  • Create a basic data cleaning checklist you can reuse on any spreadsheet
  • Document what you changed so others can trust and reproduce your work

Requirements

  • No prior AI, coding, or data science experience required
  • Basic computer skills (copy/paste, saving files)
  • Access to a spreadsheet tool (Excel, Google Sheets, or similar)
  • Willingness to practice with small sample datasets

Chapter 1: What “Clean Data” Means (and Why It Matters)

  • Understand messy vs. clean data with everyday examples
  • Learn the costs of dirty data (wrong totals, wrong decisions)
  • Meet the three core problems: missing values, duplicates, typos
  • Set up your first cleaning workspace and make a safe copy
  • Build your first simple “data dictionary” (what columns mean)

Chapter 2: Profiling Your Dataset Before You Change Anything

  • Scan columns to find likely problem areas quickly
  • Count blanks and unusual values without advanced tools
  • Detect inconsistent formats (dates, phone numbers, casing)
  • Create a simple issue log to track what you find
  • Decide what “good enough” means for your goal

Chapter 3: Fixing Missing Values Without Guessing

  • Learn why values go missing and what it can mean
  • Choose between leaving blank, removing rows, or filling in
  • Fill missing text and categories using simple rules
  • Handle missing numbers with safe beginner methods
  • Validate results so you don’t introduce new mistakes

Chapter 4: Finding and Removing Duplicates the Right Way

  • Understand duplicates: exact vs. “same person, slightly different”
  • Pick the columns that define a unique record
  • Identify duplicate groups and review them safely
  • Remove duplicates while keeping the correct row
  • Prevent duplicates with simple input rules

Chapter 5: Fixing Typos and Standardizing Text

  • Spot common typo patterns and inconsistent naming
  • Standardize casing, spacing, and punctuation
  • Fix frequent misspellings using a reference list
  • Normalize categories so charts and counts make sense
  • Run quality checks to confirm text cleanup worked

Chapter 6: Putting It All Together: A Repeatable Cleaning Workflow

  • Clean a small dataset from start to finish with the full checklist
  • Document every change so it’s transparent and repeatable
  • Create a “cleaned data” output and keep raw data untouched
  • Build a simple quality report (before vs. after metrics)
  • Plan next steps: automation ideas without needing code

Sofia Chen

Data Analytics Instructor, Data Quality & Reporting

Sofia Chen is a data analytics instructor who helps beginners turn messy spreadsheet data into reliable reports. She has supported teams in operations and public programs by building simple, repeatable data quality workflows that reduce errors and rework.

Chapter 1: What “Clean Data” Means (and Why It Matters)

Before you can fix missing values, remove duplicates, or correct typos, you need a clear definition of what you’re aiming for. “Clean data” is not “perfect data.” It is data that is consistent enough, complete enough, and well-understood enough for a specific purpose—like calculating totals, producing a report, or training a simple model—without silently introducing errors.

This chapter establishes the mental model you’ll use throughout the course: data usually lives in tables; problems usually fall into a few repeatable categories; and safe cleaning is as much about workflow (copies, versions, documentation) as it is about changing cells. You’ll also learn why dirty data is expensive: it can lead to wrong totals, wrong conclusions, and wasted time when people argue over numbers instead of improving them.

We’ll keep the examples grounded in everyday spreadsheet scenarios: a customer list with missing emails, a sales report with duplicated invoices, or a product catalog with inconsistent names like “USB-C Cable,” “USBC cable,” and “USB C cabel.” By the end of the chapter, you’ll have your first simple data dictionary and a reusable checklist that helps you clean methodically rather than “click and hope.”

Practice note for Understand messy vs. clean data with everyday examples: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Learn the costs of dirty data (wrong totals, wrong decisions): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Meet the three core problems: missing values, duplicates, typos: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Set up your first cleaning workspace and make a safe copy: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Build your first simple “data dictionary” (what columns mean): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.


Sections in this chapter
Section 1.1: Data as a table (rows, columns, and records)

Most beginner-friendly datasets—especially in spreadsheets—are tables. A table is a grid where each row is usually one record (one customer, one invoice, one product), and each column is a field (customer name, invoice date, product category). Cleanliness starts with agreeing on what a row represents. If some rows are “customers” and other rows are “customer contacts,” the table is already ambiguous, and any totals or counts can be misleading.

Here’s a concrete example: imagine a sales table where each row is meant to be one transaction. If someone inserts a “subtotal” row after each month, your pivot table might treat those subtotals as real transactions and inflate revenue. That’s not a “typo” problem—it’s a table-structure problem. Clean data, in practice, is data that behaves predictably when filtered, sorted, grouped, and summarized.

When you open a dataset for the first time, do a quick structural scan:

  • Does every row represent the same kind of thing?
  • Do columns have a single meaning each (not “Notes + Status” combined)?
  • Are there header rows repeated in the middle (common in merged exports)?
  • Are there blank rows, extra title lines, or footers that will break formulas?

This matters because many “cleaning” steps depend on the record definition. You can only detect duplicates if you know what “the same record” means. You can only judge missing values if you know which fields are required for a valid record.
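The course itself requires no code, but for readers curious how a structural scan looks programmatically, here is a minimal Python sketch. The rows, column names, and the "valid record" rule are all hypothetical illustrations of the subtotal-row example above, not part of the course material:

```python
# Structural scan sketch: keep only rows that are real transaction records,
# excluding embedded subtotal rows and blank separator rows.
# Column names ("date", "amount") and the rule below are assumptions.

rows = [
    {"date": "2026-01-05", "amount": "120.00"},
    {"date": "2026-01-18", "amount": "80.00"},
    {"date": "January subtotal", "amount": "200.00"},  # embedded subtotal row
    {"date": "", "amount": ""},                        # blank separator row
    {"date": "2026-02-02", "amount": "55.00"},
]

def is_record(row):
    """A row counts as one transaction only if it has an ISO-style
    date (YYYY-MM-DD) and a non-empty amount."""
    date, amount = row["date"].strip(), row["amount"].strip()
    return len(date) == 10 and date[4] == "-" and date[7] == "-" and amount != ""

records = [r for r in rows if is_record(r)]
print(len(records))  # 3 real transactions; subtotal and blank rows excluded
```

The point is not the code itself but the discipline it encodes: the filter makes the record definition explicit, which is exactly what a spreadsheet "one row = one transaction" check does by eye.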

Section 1.2: What counts as an error vs. a valid value

Not every unusual value is an error. Data cleaning requires judgement: you must decide what is invalid (impossible or out-of-scope) versus what is simply rare. For example, an age of 250 is almost certainly invalid, but an age of 98 might be valid. A quantity of 0 might be an error in some contexts (you can’t sell zero units), but it might be valid in others (a backordered item listed with quantity 0 to reserve a line item).

Missing values are especially tricky because “missing” can be represented in many ways: blank cells, “N/A,” “NA,” “unknown,” “-”, or even a single space. Some systems also use sentinel values like 0 or 9999. Your job is to convert these messy representations into a consistent one, then decide what to do with them. The safe options are usually:

  • Leave missing values as missing when filling would invent information.
  • Remove records only when the missingness makes the record unusable for your task (and you can justify the loss).
  • Fill (impute) with a defensible rule for simple datasets, such as “fill missing country with ‘Unknown’” or “fill missing price with the median price within the same product category.”

Common mistake: filling missing numeric values with 0 “so formulas work.” This can distort totals and averages. If you must keep calculations running, prefer leaving blanks and handling them in formulas (e.g., using IF or COALESCE-like logic), or fill with an explicit placeholder that signals “not provided.” A clean dataset is one where errors are corrected, but uncertainty is not hidden.
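For the spreadsheet-free curious, the same rule can be sketched in a few lines of Python. The list of placeholder tokens is an assumption you would adapt to your own data:

```python
# Sketch: normalize common "missing" placeholders to None instead of 0,
# so totals and averages are not silently distorted.
# The MISSING_TOKENS set is an assumption, not a standard list.

MISSING_TOKENS = {"", "na", "n/a", "null", "-", "unknown"}

def normalize_missing(value):
    """Return None for any recognized missing-value placeholder."""
    if value is None or str(value).strip().lower() in MISSING_TOKENS:
        return None
    return value

ages = ["34", "N/A", " ", "-", "51", "unknown"]
cleaned = [normalize_missing(a) for a in ages]

# Average over present values only (never fill with 0).
present = [int(a) for a in cleaned if a is not None]
print(sum(present) / len(present))  # 42.5, not ~14.2 as zero-filling would give
```

Notice the difference: treating the four missing entries as zeros would drag the average from 42.5 down to about 14.2, which is exactly the "uncertainty hidden as data" failure described above.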

Section 1.3: The idea of “data quality” in plain language

“Data quality” sounds abstract, but you can think of it as answering one practical question: Can I trust this table for the decision I’m about to make? Dirty data is costly because it produces confident-looking outputs that are wrong. Wrong totals can lead to under-ordering inventory, overpaying commissions, or missing revenue targets. In a small business setting, it can mean calling the wrong customers, shipping to old addresses, or counting the same invoice twice.

Data quality has a few simple dimensions that show up in everyday work:

  • Completeness: Are required fields present (email, date, amount)?
  • Uniqueness: Are records duplicated (same customer entered twice, same transaction imported twice)?
  • Consistency: Do formats match (dates, currency symbols, capitalization, units)?
  • Validity: Do values fall within allowed ranges and categories?

In this course, you’ll meet three core problems again and again: missing values, duplicates, and typos/inconsistent text. They often interact. For instance, duplicates might differ only by a typo in a name; missing values might prevent you from matching records correctly; inconsistent formats (“03/04/25” meaning different dates) can create false duplicates or hide real ones.

The practical outcome of thinking in “quality dimensions” is that you stop cleaning randomly. Instead, you clean with a purpose: make the dataset reliable for the next step—reporting, analysis, or automation—and record what you changed so someone else can reproduce it.

Section 1.4: Backups, versions, and working on copies

The fastest way to create a data crisis is to “just fix a few things” in the only copy of your file. A professional cleaning workflow starts with safety: keep the raw data unchanged and do all edits on a copy. This is not bureaucracy—it’s how you avoid irreversible mistakes and how you earn trust when people ask, “Where did that number come from?”

Set up a simple workspace even if you’re only using a spreadsheet:

  • Create a folder with subfolders like raw, working, and output.
  • Save the original export in raw and never edit it.
  • Copy it into working and add a date or version number (e.g., sales_working_v01_2026-03-28).
  • Only share files from output once checks pass.

When you make a change—remove duplicates, standardize a column, fill missing values—write down what you did. In a spreadsheet, a “Change Log” tab is enough. In more advanced tools, you might use scripts and version control, but the principle is the same: cleaning should be repeatable. Common beginner mistake: using “Find and Replace” across the entire sheet and accidentally changing values in the wrong column (e.g., replacing “CA” with “Canada” and corrupting product codes). Working on copies and tracking versions makes these mistakes recoverable.
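The raw/working/output layout above can also be set up with a short script, which is handy if you re-download data often. This is a sketch using only the Python standard library; the folder and file names are illustrative:

```python
# Sketch of the raw/working/output workspace. Names are illustrative.
import shutil
from datetime import date
from pathlib import Path

base = Path("cleaning_project")
for sub in ("raw", "working", "output"):
    (base / sub).mkdir(parents=True, exist_ok=True)

raw_file = base / "raw" / "sales.csv"
raw_file.write_text("invoice_id,amount\n1001,250\n")  # stand-in for a real export

# Never edit raw: copy into working with a version + date stamp.
working_file = base / "working" / f"sales_working_v01_{date.today()}.csv"
shutil.copy(raw_file, working_file)
print(working_file.exists())  # edits happen on the copy; raw stays untouched
```

Whether you do this by hand in a file manager or with a script, the principle is identical: the raw export is read-only, and every working file carries a version you can trace.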

Your goal is to be able to answer: “If I re-download the data tomorrow, can I produce the same cleaned dataset again?” That mindset turns cleaning from one-off tidying into a reliable process.

Section 1.5: Column names, units, and allowed values

Clean data is also understood data. If column headers are vague or inconsistent, every later step becomes guesswork. A column named “Value” could mean revenue, profit, quantity, or a rating score. A column named “Date” might be order date, ship date, or invoice date. Ambiguity creates silent errors because your analysis might run successfully while using the wrong meaning.

This is where you build your first simple data dictionary: a short description of each column’s meaning and rules. Keep it lightweight—one table on a new sheet is enough. For each column, capture:

  • Description: what it represents (e.g., “Order date in customer’s local time”).
  • Type: text, integer, decimal, date/time.
  • Units: USD vs. EUR, kilograms vs. pounds, percentages vs. fractions.
  • Allowed values: especially for categories (e.g., Status ∈ {Pending, Shipped, Cancelled}).
  • Missing allowed? and if not, what to do when missing.

Typos and inconsistent formats often show up as “new categories.” For example, a Status column might include “Shipped,” “shipped,” “Shippped,” and “SHIPPED.” Without a defined set of allowed values, you won’t know which one is correct, and your totals by status will fragment across multiple spellings. Standardizing text becomes much easier once you’ve declared the target values.

Practical tip: rename columns early to be specific and consistent (e.g., order_date, ship_date, amount_usd). Clear names reduce mistakes when filtering, joining, or building charts.
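A data dictionary becomes even more useful when it can check values for you. Here is a minimal Python sketch of that idea, using the Status example above; the column rules are hypothetical:

```python
# Sketch: a lightweight data dictionary as a Python dict, used to flag
# cells outside the allowed set. Column names and rules are hypothetical.

DATA_DICTIONARY = {
    "status": {
        "allowed": {"Pending", "Shipped", "Cancelled"},
        "missing_allowed": False,
    },
}

def check_value(column, value):
    """Return a list of problems for one cell (empty list = OK)."""
    rules = DATA_DICTIONARY[column]
    problems = []
    if value is None or str(value).strip() == "":
        if not rules["missing_allowed"]:
            problems.append("missing not allowed")
        return problems
    if value not in rules["allowed"]:
        problems.append(f"unexpected value: {value!r}")
    return problems

print(check_value("status", "Shipped"))   # []
print(check_value("status", "Shippped"))  # ["unexpected value: 'Shippped'"]
```

In a spreadsheet, the equivalent is a data-validation dropdown built from the allowed-values list; the dictionary is what makes "Shippped" detectable as an error rather than a new category.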

Section 1.6: A beginner’s cleaning checklist overview

A checklist turns “cleaning” from an intimidating, open-ended task into a predictable routine. You’ll refine this checklist throughout the course, but you can start with a beginner-friendly version that works on almost any spreadsheet dataset.

  • 1) Protect the raw file: save an untouched copy; work on a versioned duplicate.
  • 2) Confirm table structure: one header row, one record per row, no embedded subtotals or footers.
  • 3) Understand each column: draft a simple data dictionary with meaning, type, and units.
  • 4) Standardize formats: dates, currency, decimal separators, capitalization, and whitespace.
  • 5) Check missing values: identify how missing is encoded; decide leave/remove/fill with a documented rule.
  • 6) Find duplicates: define the “duplicate key” (e.g., invoice_id, or name+date+amount); remove extras while keeping the correct record.
  • 7) Fix typos and inconsistent text: use repeatable rules (trim spaces, consistent case, mapping table for known misspellings).
  • 8) Validate results: re-check totals, counts, and key distributions to ensure cleaning didn’t introduce new errors.

The key engineering judgement is knowing that “clean” depends on your goal. If you’re emailing customers, missing email addresses are critical; if you’re analyzing store performance, missing emails might not matter. Similarly, removing duplicates is safe only when you can explain why one record is the true one—maybe the newest timestamp wins, or maybe the record with the most complete fields wins. A checklist helps you make these decisions explicitly and consistently, which is what makes your work trustworthy.
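Step 8 of the checklist, validation, is worth a concrete sketch. The idea is to compare simple summary numbers before and after cleaning so that a change you cannot explain becomes visible. The data and column names here are hypothetical:

```python
# Sketch: a "before vs. after" check comparing row count and total amount,
# so a cleaning pass cannot silently delete valid revenue.
# Rows and column names are illustrative.

def summarize(rows):
    return {"rows": len(rows), "total": sum(r["amount"] for r in rows)}

before = [{"id": 1, "amount": 250},
          {"id": 1, "amount": 250},  # duplicate import of the same invoice
          {"id": 2, "amount": 90}]
after = [{"id": 1, "amount": 250}, {"id": 2, "amount": 90}]

b, a = summarize(before), summarize(after)
print(b)  # {'rows': 3, 'total': 590}
print(a)  # {'rows': 2, 'total': 340}

# Expected change: exactly one duplicate removed, total drops by 250.
assert b["rows"] - a["rows"] == 1 and b["total"] - a["total"] == 250
```

In a spreadsheet, the same check is two SUM and COUNT cells on the raw tab compared against the cleaned tab; if the difference is not what your cleaning rule predicts, stop and investigate before sharing the output.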

With this foundation—table structure, error definitions, quality dimensions, safe workflow, and a basic data dictionary—you’re ready to start hands-on cleaning in the next chapter.

Chapter milestones
  • Understand messy vs. clean data with everyday examples
  • Learn the costs of dirty data (wrong totals, wrong decisions)
  • Meet the three core problems: missing values, duplicates, typos
  • Set up your first cleaning workspace and make a safe copy
  • Build your first simple “data dictionary” (what columns mean)
Chapter quiz

1. In this chapter, what best describes “clean data”?

Correct answer: Data that is consistent enough, complete enough, and well-understood enough for a specific purpose without silently introducing errors
The chapter defines clean data as fit-for-purpose (consistent, complete, and understood), not perfect.

2. Why does the chapter say dirty data is expensive?

Correct answer: It can lead to wrong totals, wrong conclusions, and wasted time arguing over numbers
Dirty data creates incorrect results and wastes time because people debate numbers instead of improving them.

3. Which set lists the three core problems introduced in the chapter?

Correct answer: Missing values, duplicates, typos
The chapter centers on three repeatable categories of issues: missing values, duplicates, and typos.

4. What is the main reason to make a safe copy before cleaning?

Correct answer: To avoid irreversible changes and support a versioned, safe workflow
Safe cleaning emphasizes workflow: copies/versions allow you to recover and compare changes.

5. What is the purpose of a simple data dictionary in this chapter’s workflow?

Correct answer: To document what columns mean so the data is well-understood for its intended use
A data dictionary provides shared understanding of columns, supporting consistent, reliable cleaning and use.

Chapter 2: Profiling Your Dataset Before You Change Anything

Before you delete rows, fill blanks, or “fix” spelling, you need a quick, disciplined read of what you actually have. This step is called profiling: a structured scan of the dataset to discover patterns, problems, and risks. Profiling is not busywork. It prevents two common failures in data cleaning: (1) making changes that silently damage the meaning of your data, and (2) spending time polishing columns that do not matter for your goal.

Think of profiling as an inspection checklist you run before repairs. You are looking for likely problem areas (missing values, duplicates, typos, inconsistent formats), counting how often they occur, and deciding what “good enough” means for your task. In a business setting, a dataset used for monthly trend reporting can tolerate different imperfections than a dataset used to send invoices or calculate commissions.

A practical profiling workflow is simple: scan columns quickly, count blanks and placeholders, look for outliers and suspicious values, check formats (dates, numbers, IDs), check consistency (categories and spelling patterns), and write down what you find in an issue log. The issue log is your bridge from observation to action. It keeps your cleaning steps explainable and repeatable, especially if you need to defend decisions later.

In this chapter, you will learn how to profile data using basic spreadsheet actions: sorting, filtering, simple counts, and a careful eye. No advanced tools are required—just a safe approach that avoids accidental edits.

Practice note for Scan columns to find likely problem areas quickly: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Count blanks and unusual values without advanced tools: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Detect inconsistent formats (dates, phone numbers, casing): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Create a simple issue log to track what you find: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Decide what “good enough” means for your goal: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.


Sections in this chapter
Section 2.1: Quick column scans and sorting safely


Start by scanning your columns like an auditor. Read the column headers, confirm what each field should represent, and note where ambiguity exists (for example, does “Date” mean order date, ship date, or signup date?). This sounds basic, but unclear meaning is a major cause of “clean” data producing wrong results.

Next, do quick column scans using sorting and filtering—but do it safely. The number-one mistake is sorting a single column and breaking the alignment between columns (turning one customer’s email into another customer’s order). To avoid this, always select the entire table (or use your spreadsheet’s “Format as Table” feature) before sorting. Confirm that “Expand the selection” is enabled so all columns move together.

  • Sort A→Z / Z→A to bring blanks, symbols, and odd values to the top or bottom.
  • Filter to view unique values and get a feel for category variety.
  • Freeze the header row so you can scroll without losing context.

When scanning, look for “obvious seams”: sudden shifts in value style (some entries in ALL CAPS, some in Title Case), mixed separators (hyphens vs spaces), and entries that look like they belong to a different field. Also pay attention to columns that are unexpectedly empty or unexpectedly full. A “Phone” column with 95% blanks might be normal for a web signup; it might also mean the column was never populated due to an import issue.

Practical outcome: by the end of this step, you should know which 3–5 columns are most likely to cause trouble in analysis or operations, and you should have avoided any destructive “quick fixes” while exploring.
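The "expand the selection" rule has a direct programmatic analogue: sort whole records, never a single column. A minimal Python sketch, with illustrative data:

```python
# Sketch: sorting rows as whole records keeps columns aligned, which is
# what "Expand the selection" guarantees in a spreadsheet. Data is made up.

rows = [
    {"customer": "Ana",  "email": "ana@example.com"},
    {"customer": "",     "email": "zed@example.com"},   # blank surfaces first
    {"customer": "Marc", "email": "marc@example.com"},
]

# Safe: each row moves as a unit, so emails stay with their customers.
rows_sorted = sorted(rows, key=lambda r: r["customer"])
print([r["customer"] for r in rows_sorted])  # ['', 'Ana', 'Marc']
assert rows_sorted[0]["email"] == "zed@example.com"  # alignment preserved
```

Sorting also doubles as a scanning tool here: the blank customer name rises to the top exactly as described above, without any risk of shuffling one column independently of the others.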

Section 2.2: Counting blanks and spotting placeholders (NA, -, 0)


Missing data is rarely just “blank cells.” Real-world datasets often encode missingness using placeholders: NA, N/A, null, a dash (-), unknown, or even 0. If you only count blanks, you will underestimate missing values and make unsafe decisions about filling or deleting rows.

In a spreadsheet, you can do a basic missingness profile with simple counts. Use a blank count (for example, COUNTBLANK in many tools) and then separately count common placeholders with straightforward criteria (filter for “NA”, filter for “-”, filter for “0”). Keep the counts in a small side table so you can compare columns. You are not trying to be perfect; you are trying to understand magnitude and patterns.

  • Blank vs placeholder: decide whether placeholders should be treated as missing or meaningful. A “0” in Quantity might be real; a “0” in Age is suspicious.
  • Missingness pattern: missing values clustered in certain dates, regions, or sources may indicate a system issue.
  • Critical fields: mark which columns are required for your purpose (e.g., invoice needs customer ID and amount; marketing email needs email address).

A common mistake is to immediately fill missing values “to make charts work.” That can bias results. For example, filling missing revenue with 0 can understate totals if the revenue is missing due to tracking gaps, not true zeros. Profiling helps you choose between leaving missing values, removing records, or filling—based on what the column means and how the dataset will be used.

Practical outcome: you will know the true extent of missingness and the specific tokens used to represent it, which prepares you to standardize missing values later (e.g., turning “-” and “N/A” into a consistent blank).
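The side table of counts described above can be sketched in a few lines of Python. The placeholder token list is an assumption to adapt per dataset:

```python
# Sketch: a missingness profile counting true blanks and placeholder
# tokens separately for one column. PLACEHOLDERS is an assumption.

from collections import Counter

PLACEHOLDERS = {"na", "n/a", "-", "null", "unknown"}

def missingness_profile(rows, column):
    counts = Counter()
    for r in rows:
        v = str(r.get(column, "")).strip()
        if v == "":
            counts["blank"] += 1
        elif v.lower() in PLACEHOLDERS:
            counts["placeholder"] += 1
        else:
            counts["present"] += 1
    return dict(counts)

rows = [{"phone": "555-0101"}, {"phone": ""}, {"phone": "N/A"},
        {"phone": "-"}, {"phone": "555-0102"}]
print(missingness_profile(rows, "phone"))
# {'present': 2, 'blank': 1, 'placeholder': 2}
```

This is the same logic as COUNTBLANK plus a few filters in a spreadsheet: counting blanks alone would report 1 missing value here, while the true number of unusable entries is 3.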

Section 2.3: Finding outliers and suspicious values


Outliers are values that do not fit the general pattern. Some outliers are legitimate (a large enterprise order), and some are data errors (an extra zero, a misplaced decimal, a negative quantity). Your job in profiling is not to delete outliers automatically—it is to identify them and decide which ones require investigation.

Use quick, non-technical checks first: sort numeric columns ascending and descending and look at the extremes. For dates, sort earliest to latest and check for impossible ranges (orders in 2099, birthdates in 1800). For text fields that should be short (state codes, country codes), sort and look for unusually long strings that might indicate notes were pasted into the wrong column.

  • Boundary checks: values below 0 where negatives are impossible; percentages above 100; ages above a plausible maximum.
  • Unit confusion: some rows in dollars, others in cents; weights in kg mixed with lbs.
  • Copy/paste artifacts: repeated values down a column from a misfilled cell; trailing spaces that create “different” categories.

Engineering judgement matters here. If you are cleaning data for a high-level trend report, you might flag a handful of extreme values and proceed. If the dataset drives financial payouts, you may need a stricter process: verify against source systems, require approvals, and document every correction.

Common mistake: deleting outliers because they “look wrong” without verifying what “wrong” means. In many domains, the unusual values are the most important signals (fraud detection, high-value customers, system failures). Profiling helps you separate “rare but real” from “rare because broken.”

Practical outcome: you will have a short list of suspicious records and columns, ready to be logged and handled with appropriate caution.

Section 2.4: Format checks: dates, numbers, currency, IDs

Inconsistent formats are a hidden cause of analysis errors because they often look fine to the eye but behave differently in calculations and filters. Profiling formats means checking whether a column is consistently stored as the type you expect: dates as true dates (not text), numbers as numbers (not strings), currency as a numeric value plus a currency indicator (not mixed symbols), and IDs as stable identifiers (often best kept as text).

Start with dates. Look for mixed date patterns such as MM/DD/YYYY and DD/MM/YYYY, or a mix of “2026-03-28” and “3/28/26”. Sorting can reveal this quickly: text dates often sort alphabetically, causing “10/…” to appear before “2/…”. Also watch for time components where you do not expect them (timestamps in a “Date” field).

  • Numbers: check for commas, spaces, and parentheses (e.g., “(1,200)” as negative). These can convert numbers into text.
  • Currency: look for “$1,200”, “1200 USD”, and “1.200,00 €” in the same column—mixed locale formats require deliberate handling.
  • IDs: beware of leading zeros (e.g., “001234”) disappearing if the column is treated as numeric.
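As a scripting analog of these format checks, the sketch below flags IDs at risk of losing leading zeros and parses spreadsheet-style amounts (commas, parentheses-as-negative). The sample values are hypothetical:

```python
# Profile format risks without changing the data: spot IDs whose leading
# zeros would vanish if the column became numeric, and parse text amounts.
ids = ["001234", "8872", "000051"]
amounts = ["1,200", "(450)", "300"]

# An ID is at risk if stripping leading zeros changes it.
leading_zero_ids = [i for i in ids if len(i) > 1 and i != i.lstrip("0")]

def parse_amount(text):
    """Parse a spreadsheet-style amount: strip commas, treat (x) as negative."""
    t = text.replace(",", "").strip()
    if t.startswith("(") and t.endswith(")"):
        return -float(t[1:-1])
    return float(t)

parsed = [parse_amount(a) for a in amounts]
print(leading_zero_ids)  # IDs that must stay stored as text
print(parsed)
```

The profiling output (which IDs are at risk, which amounts are really text) becomes the evidence for a systematic rule later, rather than cell-by-cell retyping.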

A common mistake is “fixing” format by retyping a few cells, creating inconsistent manual edits. Instead, profiling should tell you whether a column needs a systematic rule later (for example, parse all dates using one known source format, store IDs as text, strip currency symbols into a separate column).

Practical outcome: you will know which columns have type/format risks that could break merges, comparisons, or calculations—even if the values look similar on screen.

Section 2.5: Consistency checks: categories and spelling patterns

Text inconsistency is where duplicates and typos hide. Category fields (country, product line, department) should have a manageable set of values. Profiling is about discovering whether the categories are clean enough to group reliably. If “NY”, “N.Y.”, “New York”, and “newyork” appear, your pivot tables will split what should be one group into four.

Use filters to view unique values and scan for patterns: differences in casing, extra spaces, punctuation, and common misspellings. Sorting A→Z often clusters near-duplicates together, making them easier to spot. Also check for “other” buckets that have become dumping grounds (“Misc”, “Unknown”, “TBD”) and decide whether those are acceptable for your goal.

  • Casing: “HR” vs “Hr” vs “hr” can create separate groups.
  • Whitespace: trailing spaces make values look identical but compare as different.
  • Synonyms: “United States”, “USA”, “US” need a standard.
  • Typos: repeated patterns (e.g., “Califronia”) suggest a rule-based fix later.
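The same near-duplicate clustering can be sketched in Python by grouping raw values under a normalized form (lowercase, punctuation and spaces removed). The sample values are illustrative:

```python
# Surface near-duplicate categories: group raw spellings by a normalized
# key; any key with more than one raw spelling needs a standard.
import re
from collections import defaultdict

raw = ["NY", "N.Y.", "New York", "newyork ", "CA", "California"]

def normalize(value):
    # Lowercase and drop everything except letters and digits.
    return re.sub(r"[^a-z0-9]", "", value.lower())

groups = defaultdict(list)
for v in raw:
    groups[normalize(v)].append(v)

collisions = {k: vs for k, vs in groups.items() if len(vs) > 1}
print(collisions)
```

Like sorting A→Z in a spreadsheet, this clusters variants together — but it also catches casing and punctuation differences that alphabetical sorting can miss.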

This is also the right time to spot potential duplicates conceptually: two records may refer to the same entity with slightly different spelling (“Acme Inc” vs “ACME Incorporated”). Profiling does not remove duplicates yet; it identifies which fields could serve as reliable matching keys and which fields are too messy to match safely.

Practical outcome: you will know whether a “standardization pass” is required, and you will have a shortlist of the highest-impact categories to normalize first.

Section 2.6: Defining cleaning rules based on the task

Profiling ends with a decision: what does “good enough” mean for this dataset and this purpose? Data cleaning is not a moral virtue; it is risk management. You choose rules that reduce error in the output you care about while minimizing unintended side effects.

Start by writing a simple issue log. A basic table is enough: column name, issue type (missing, duplicate, typo, format), examples, estimated count, impact, and proposed action. This log prevents random, one-off edits and creates a repeatable checklist for future datasets.

  • Goal-driven thresholds: for a dashboard, you might accept 1–2% missing in non-critical fields; for billing, missing customer ID may be unacceptable.
  • Safe defaults: prefer non-destructive steps first (standardize placeholders, trim spaces) before dropping rows.
  • Document assumptions: if you decide “0 means missing for Age,” record it. Someone will ask later.

Then define cleaning rules in plain language before you implement them. Examples: “Treat ‘NA’, ‘N/A’, and ‘-’ as missing in the Notes field,” “Store CustomerID as text to preserve leading zeros,” “Standardize State to two-letter codes,” “For duplicate emails, keep the most recent signup date.” You are translating profiling discoveries into consistent actions.
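Two of those plain-language rules can be translated into a small, repeatable pass like the sketch below (the tokens and sample records are hypothetical):

```python
# Rule 1: treat "NA", "N/A", and "-" as missing in the Notes field.
# Rule 2: keep CustomerID as text (never convert to int) to preserve
#         leading zeros.
MISSING_TOKENS = {"na", "n/a", "-", ""}

def clean_notes(value):
    """Map placeholder tokens to None (missing); otherwise trim whitespace."""
    return None if value.strip().lower() in MISSING_TOKENS else value.strip()

records = [
    {"CustomerID": "00731", "Notes": "N/A"},
    {"CustomerID": "00045", "Notes": "called back "},
]
for r in records:
    r["Notes"] = clean_notes(r["Notes"])

print(records)
```

Writing the rule as one function (instead of scattered find/replace edits) is exactly the traceability the issue log is meant to support: the rule is visible, testable, and reusable on the next dataset.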

Common mistake: using one universal rule like “drop all rows with blanks.” That can wipe out valuable records and bias your analysis toward more complete sources. Instead, tie each rule to the task: analysis, reporting, outreach, or operations. Profiling gives you the evidence to choose wisely.

Practical outcome: you finish this chapter with an issue log and a set of proposed rules, ready to execute in the next cleaning steps with confidence and traceability.

Chapter milestones
  • Scan columns to find likely problem areas quickly
  • Count blanks and unusual values without advanced tools
  • Detect inconsistent formats (dates, phone numbers, casing)
  • Create a simple issue log to track what you find
  • Decide what “good enough” means for your goal
Chapter quiz

1. What is the main purpose of profiling a dataset before making any cleaning changes?

Show answer
Correct answer: To discover patterns, problems, and risks so you avoid damaging meaning and wasting effort
Profiling is a structured inspection to understand what you have and prevent harmful or unnecessary cleaning.

2. According to the chapter, which pair of failures does profiling help prevent?

Show answer
Correct answer: Silently damaging the meaning of the data and polishing columns that don’t matter for the goal
Profiling reduces the risk of changes that alter meaning and avoids spending time on low-impact columns.

3. Which action best matches the chapter’s recommended quick profiling workflow?

Show answer
Correct answer: Scan columns, count blanks/placeholders, check for outliers and format consistency, and record findings
The chapter emphasizes a simple, disciplined scan and documentation before any fixes.

4. Why does the chapter recommend creating an issue log during profiling?

Show answer
Correct answer: To bridge observation to action and keep cleaning steps explainable and repeatable
An issue log documents what you found and supports defensible, repeatable cleaning decisions.

5. How should you decide what “good enough” data quality means during profiling?

Show answer
Correct answer: Base it on the goal, since different tasks tolerate different imperfections
The acceptable level of imperfections depends on how the dataset will be used (e.g., trend reporting vs invoicing).

Chapter 3: Fixing Missing Values Without Guessing

Missing values are one of the most common reasons analyses go wrong in simple spreadsheets. A blank cell can silently change totals, hide patterns, or break a pivot table. The goal of this chapter is not to “make blanks disappear,” but to make deliberate, documented choices about what a blank means and how your dataset should behave after cleaning.

Begin with a mindset shift: missing data is information. It can signal a process problem (someone skipped a field), a data integration issue (a lookup failed), or a meaningful outcome (a question didn’t apply). If you treat every blank the same way—always deleting or always filling—you will eventually create confident-looking results that are simply untrue.

A safe workflow for beginners is: (1) identify where values are missing and how often, (2) decide what each missing field means, (3) choose one of three actions—leave, remove, or fill—based on the role of that field, then (4) validate with before/after checks so you don’t introduce new mistakes. The sections below give you practical rules you can apply immediately, plus warnings about common traps.

  • Leave missing values when the blank is meaningful or you can’t justify a replacement.
  • Remove rows only when the missingness makes the record unusable and removal won’t bias results.
  • Fill with controlled labels (for categories) or simple summaries (for numbers) only when it supports your analysis and you can explain it.
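One way to make the leave/remove/fill choice explicit is a per-column policy table, sketched below in Python. The policies shown are illustrative examples, not a recommendation for every dataset:

```python
# The leave / remove / fill decision, expressed as a per-column policy.
POLICY = {
    "customer_id": "remove_row",    # record is unusable without its key
    "segment":     "fill_unknown",  # category: label missingness explicitly
    "phone":       "leave",         # not needed for this analysis
}

def apply_policy(row):
    """Apply the policy; return None if the row should be dropped."""
    for col, action in POLICY.items():
        if row.get(col) in (None, ""):
            if action == "remove_row":
                return None
            if action == "fill_unknown":
                row[col] = "Unknown"
            # "leave": do nothing
    return row

rows = [
    {"customer_id": "A1", "segment": "", "phone": None},
    {"customer_id": "",   "segment": "SMB", "phone": "555"},
]
cleaned = [r for r in (apply_policy(dict(r)) for r in rows) if r is not None]
print(cleaned)
```

Because every action is named in one place, the policy doubles as documentation: anyone reading it can see which fields were left blank on purpose.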

Most importantly, “fixing” missing values should not be about guessing what the value was. It should be about ensuring the dataset behaves consistently and transparently for the decision you’re trying to support.

Practice note for Learn why values go missing and what it can mean: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Choose between leaving blank, removing rows, or filling in: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Fill missing text and categories using simple rules: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Handle missing numbers with safe beginner methods: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Validate results so you don’t introduce new mistakes: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Sections in this chapter
Section 3.1: Types of missingness (blank vs. unknown vs. not applicable)

Not all blanks mean the same thing. Before you touch the data, classify missing values into a few practical types. This reduces guesswork and helps you choose a safe action later.

Blank / not recorded means the value could exist, but it wasn’t captured. Example: a customer email cell is empty because the form allowed submission without email. This type often points to process issues (optional fields, training gaps, system bugs). Treat it as “missing,” not as a meaningful category.

Unknown means the data collector tried but couldn’t obtain it, or it wasn’t available at the time. Example: “Age” might be unknown for some users. In spreadsheets, unknown is often represented as blank, but it is conceptually different from “not recorded,” because it can be a legitimate outcome of collection.

Not applicable (N/A) means the field does not apply to that record. Example: “Refund reason” is not applicable when there was no refund. If you leave these blank without labeling, you may later confuse “no refund” with “refund happened but reason missing.”

Practical tip: create a short data dictionary note for each column: what does blank mean here—missing, unknown, or not applicable? If different meanings appear in the same column, that’s a red flag. You may need to standardize inputs (e.g., use “Not Applicable” for cases where the field truly doesn’t apply) to avoid mixing categories.

Section 3.2: When to keep missing values as-is

Keeping missing values is often the safest choice, especially early in a project. You should leave blanks as-is when replacing them would add assumptions you cannot defend or when the absence itself is informative.

Leave missing values untouched when: (1) the field is not required for your analysis, (2) the missingness might indicate a real-world condition you want to measure, or (3) you don’t have a trustworthy rule for replacement. For example, if you’re analyzing total sales by region, you might not need to fill missing “Customer phone number.” Filling it with anything adds noise and can create false completeness.

Another common case: time series or event logs. If a “last_login_date” is blank, that may indicate the user never logged in. Filling with today’s date would be misleading; filling with an average date would destroy the meaning entirely. In these situations, leaving blanks preserves the ability to interpret behavior correctly.

Be careful with tools that automatically coerce blanks to zeros. In many spreadsheets and BI tools, a blank numeric cell may be treated as zero in some calculations but ignored in others. If you decide to leave values missing, document how your formulas handle blanks (e.g., AVERAGE ignores blanks; SUM treats blanks as zero; COUNT vs. COUNTA behave differently). A safe practice is to explicitly calculate missing counts per column so you don’t forget they exist.
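The difference between "ignore blanks" and "treat blanks as zero" is easy to see with a small numeric sketch (using None to stand in for a blank cell):

```python
# Blank handling changes answers: AVERAGE-style (ignore blanks) vs.
# treating blanks as zero, computed on the same column.
values = [10, None, 20, None, 30]  # None stands in for a blank cell

observed = [v for v in values if v is not None]
avg_ignoring_blanks = sum(observed) / len(observed)  # like AVERAGE
avg_blanks_as_zero = sum(observed) / len(values)     # blanks counted as 0

print(avg_ignoring_blanks, avg_blanks_as_zero)
```

Two defensible-looking averages, one column — which is why documenting how blanks are handled matters more than which convention you pick.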

Outcome: keeping missing values can be a deliberate, transparent decision—not an incomplete cleanup. Your dataset remains honest about what you do and don’t know.

Section 3.3: When removing rows is acceptable (and when it’s risky)

Deleting rows feels clean and final, but it’s also the easiest way to bias results. Removal is acceptable only when the row cannot serve the purpose of the dataset and when the missingness is small and plausibly random.

Acceptable examples: you are building a contact list and rows missing the primary identifier (e.g., customer_id or email) cannot be used; or you are summarizing transactions and rows missing the transaction amount are clearly corrupted imports. In these cases, the record is effectively unusable, and removal may be better than filling with guesses.

Risky examples: removing rows where “income” is missing in a demographic survey. People with missing income may be systematically different (privacy-conscious, higher income, certain regions). If you delete them, your averages and distributions can shift in a way that looks “more complete” but is less accurate.

Practical rule: before removing anything, compute how many rows you would drop and where those rows are concentrated. Are missing rows clustered in a particular month, store, or category? If yes, deletion is likely to distort comparisons. If the missingness is under a small threshold (for beginner projects, think a few percent) and appears spread out, removal may be acceptable—still document it.

Safer workflow: instead of deleting immediately, create a filtered view or a separate “excluded_rows” sheet. Record the reason for exclusion (e.g., “missing primary key,” “amount missing”), and keep a count. This preserves traceability and makes it easy to revisit the decision if the analysis changes.
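The "excluded_rows sheet" pattern translates directly into a script: split rows into kept and excluded, attaching a recorded reason to each exclusion. The field names and reasons below are illustrative:

```python
# Don't delete in place: split rows into kept vs. excluded, with a
# recorded reason — the scripting analog of an excluded_rows sheet.
rows = [
    {"customer_id": "C1", "amount": 120.0},
    {"customer_id": "",   "amount": 75.0},   # missing primary key
    {"customer_id": "C3", "amount": None},   # missing amount
]

kept, excluded = [], []
for r in rows:
    if not r["customer_id"]:
        excluded.append({**r, "reason": "missing primary key"})
    elif r["amount"] is None:
        excluded.append({**r, "reason": "amount missing"})
    else:
        kept.append(r)

print(len(kept), len(excluded))
```

Nothing is destroyed: the excluded records and their reasons remain available if the analysis changes or a stakeholder asks why a row disappeared.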

Section 3.4: Filling missing categories with “Unknown” and controlled labels

For text fields and categorical columns (department, product type, channel), filling can be useful—but only with controlled labels that do not pretend to know the true value. The safest fill for genuine missingness is a label such as “Unknown” or “Missing”, and for non-applicable cases, a label like “Not Applicable”.

Why fill categories at all? Because many summaries (pivot tables, charts, grouping) work better when every row has a category. If blanks remain, you may unintentionally drop records from grouped views or make the “(blank)” bucket ambiguous. A controlled label makes the missingness visible and countable.

Key practice: use a limited, consistent set of labels. Don’t allow “unknown,” “Unknown,” “UNK,” and “?” to coexist. Pick one spelling and one capitalization. If you’re cleaning a spreadsheet, create a small reference list (a mini codebook) such as: Known values: {Online, Store, Partner}; Missing: Unknown; Not applicable: Not Applicable.

Also watch for “hidden missingness” in categories: cells with spaces, “-”, “N/A”, or “none” can all behave like separate categories. Standardize them intentionally. For example, treat “N/A” as “Not Applicable” only if that meaning is correct in your context; otherwise, it may be better mapped to “Unknown.”
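That standardization can be expressed as one explicit mapping from placeholder tokens to controlled labels. The mapping below is context-specific and illustrative — it assumes "N/A" really means Not Applicable in this dataset:

```python
# Standardize hidden-missingness tokens into controlled labels.
LABEL_MAP = {
    "": "Unknown",
    "-": "Unknown",
    "none": "Unknown",
    "n/a": "Not Applicable",  # only if that meaning is correct in context
}

def standardize(value):
    key = value.strip().lower()
    return LABEL_MAP.get(key, value.strip())

channel = ["Online", " ", "N/A", "none", "Store", "-"]
print([standardize(v) for v in channel])
```

Because the mapping is written down, the decision ("'-' means Unknown here") is visible and reviewable instead of buried in manual edits.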

Outcome: you gain cleaner group-by results without inventing data. Missingness becomes a measurable category you can report, not a silent gap.

Section 3.5: Filling missing numbers with simple summaries (and limits)

Filling missing numeric values is where people most often “guess.” Beginner-safe methods use simple summaries, but even these must be applied carefully and only when the analysis requires complete numbers.

Common simple fills include: median (often safer than mean when there are outliers), mean (works when values are symmetric and missingness is light), or a group-specific summary (e.g., median by product category). Another practical option is a constant that clearly signals “not provided,” such as 0 only when 0 is a valid real-world value and means “none.”

Limits you must understand: filling numbers changes distributions and reduces variability. If you replace many blanks with the same median, charts can show an artificial spike at that value. Correlations and regression results can be distorted because filled values create patterns that were not observed. This is why “fill everything” is rarely correct.

Practical guideline: fill numeric values only when (1) the numeric field is required for the computation you’re doing, (2) the missing rate is low or you can justify a group-based fill, and (3) you will clearly label the method (e.g., “price_filled_median”) or track it with a companion flag column like “price_was_missing = TRUE/FALSE.”

In spreadsheets, you can implement a cautious fill with an IF statement: if blank, use median; else use original. Keep the original column unchanged when possible. Outcome: your analysis can proceed while preserving transparency about which values were observed versus filled.
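In a script, the same cautious pattern — median fill plus a companion flag column, originals untouched — can be sketched like this (the price values are hypothetical):

```python
# Median fill with a companion flag column, keeping the original intact.
from statistics import median

price = [12.0, None, 15.0, 14.0, None, 13.0]  # None = blank cell

observed = [p for p in price if p is not None]
fill_value = median(observed)

price_was_missing = [p is None for p in price]                    # flag column
price_filled = [fill_value if p is None else p for p in price]    # new column

print(price_filled, price_was_missing)
```

The flag column preserves the observed-vs-filled distinction, so later analysis can exclude filled values or report how many there were.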

Section 3.6: Before/after checks: totals, counts, and spot checks

Any missing-value fix can introduce new errors: accidental overwrites, misapplied formulas, or shifting row alignment. Validation is how you prevent “cleaning” from becoming silent corruption. Always do before/after checks that match the risks of your chosen method.

Counts: record the number of rows before cleaning and after. If you removed rows, the difference should match exactly what you intended. Also count missing cells per column before and after (COUNTBLANK in spreadsheets). If you filled categories with “Unknown,” the blank count should drop while the “Unknown” count rises by the same amount.

Totals and summaries: for numeric columns, compare totals (SUM) and central tendencies (mean/median) before and after. If you filled missing numbers, totals will change—so compute both the original total (ignoring blanks) and the filled total, and write down why the change is acceptable for your analysis. If totals change unexpectedly when you did not intend to fill numbers, investigate formula behavior with blanks and zeros.
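The counts check has a simple reconciliation rule: blanks removed must equal controlled labels added. A minimal sketch, using illustrative category data:

```python
# Before/after check: blank count should drop by exactly the number of
# "Unknown" labels added — nothing more, nothing less.
before = ["A", "", "B", "", "C"]
after = ["A", "Unknown", "B", "Unknown", "C"]

blanks_before = before.count("")                 # like COUNTBLANK, before
blanks_after = after.count("")                   # like COUNTBLANK, after
unknown_added = after.count("Unknown") - before.count("Unknown")

# Reconciliation: blanks removed == Unknowns added.
assert blanks_before - blanks_after == unknown_added
print(blanks_before, blanks_after, unknown_added)
```

If that assertion ever fails, a fill touched cells it shouldn't have — exactly the silent corruption these checks exist to catch.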

Spot checks: manually inspect a small sample of rows that were missing before. Confirm they were handled correctly (Unknown vs Not Applicable; median fill applied only where blank). Also spot-check rows that were not missing to ensure they were not altered by a broad find/replace or a dragged formula range.

Reconciliation: if your data comes from a source system, compare key metrics (e.g., transaction count per day) to a trusted report. A simple reconciliation catches issues like an entire day becoming “missing” due to a failed import.

Outcome: you finish with data you can defend. Cleaning decisions become auditable steps—counts, totals, and samples—rather than hope that the spreadsheet “looks better.”

Chapter milestones
  • Learn why values go missing and what it can mean
  • Choose between leaving blank, removing rows, or filling in
  • Fill missing text and categories using simple rules
  • Handle missing numbers with safe beginner methods
  • Validate results so you don’t introduce new mistakes
Chapter quiz

1. Why does the chapter say missing values should be treated as “information” rather than just errors to fix?

Show answer
Correct answer: Because blanks can signal process problems, integration failures, or meaningful outcomes
Missingness can reveal real issues or meaningful non-applicability, so treating all blanks the same can create untrue results.

2. What is the main goal when fixing missing values in this chapter?

Show answer
Correct answer: Make deliberate, documented choices so the dataset behaves consistently and transparently
The focus is not “making blanks disappear,” but choosing actions you can justify and explain.

3. Which workflow best matches the chapter’s recommended beginner-safe approach to missing values?

Show answer
Correct answer: Identify missingness, decide what it means, choose leave/remove/fill, then validate with before/after checks
The chapter outlines a four-step workflow that prioritizes understanding meaning and validating changes.

4. When is it most appropriate to leave a missing value blank according to the chapter?

Show answer
Correct answer: When the blank is meaningful or you can’t justify a replacement
Leaving blanks is recommended when missingness itself carries meaning or replacement would be unjustified.

5. Which choice best describes a safe way to fill missing values without guessing, as described in the chapter?

Show answer
Correct answer: Use controlled labels for categories or simple summaries for numbers when you can explain why it supports the analysis
Filling should use controlled, explainable rules (labels for categories; simple summaries for numbers), not guesses.

Chapter 4: Finding and Removing Duplicates the Right Way

Duplicates look harmless—“it’s just the same row twice”—but they quietly change totals, averages, and decisions. A duplicated customer can inflate revenue, a duplicated incident report can trigger a false alarm trend, and a duplicated inventory line can cause you to reorder what you already have. The goal of this chapter is not only to delete repeated lines, but to do it safely: identify what “same record” means in your dataset, review groups of suspected duplicates, and keep the right row on purpose.

A reliable duplicate workflow has three parts. First, define uniqueness: which column (or combination of columns) should identify one real-world thing. Second, find duplicates: exact copies and “same person with slight differences.” Third, resolve them: choose which record to keep, remove the rest, and add simple rules so duplicates stop appearing. Throughout, the key engineering judgment is this: deleting is irreversible, so you should always isolate duplicates, review them in a controlled way, and keep an audit trail (even if it’s just a backup copy of the sheet).

One practical habit will save you repeatedly: before you remove anything, make a copy of the original tab (e.g., rename it “raw_backup_2026-03-28”). Then do duplicate work in a new tab. That small step turns a risky cleaning step into a safe, repeatable process.

Practice note for Understand duplicates: exact vs. “same person, slightly different”: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Pick the columns that define a unique record: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Identify duplicate groups and review them safely: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Remove duplicates while keeping the correct row: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Prevent duplicates with simple input rules: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Sections in this chapter
Section 4.1: What a duplicate is (and why it happens)

A duplicate is two or more rows that represent the same real-world record. Sometimes they are literally identical (every cell matches), and sometimes they are “the same person, slightly different.” Both types matter, and the difference changes how you detect them. Exact duplicates are usually caused by copy/paste errors, importing the same file twice, appending data without de-duplication, or accidental double form submissions. Near-duplicates usually come from human variation: “Robert” vs “Bob,” “Acme Inc.” vs “ACME, Incorporated,” or extra spaces and punctuation that make two entries look different to a computer.

It’s worth naming the real harm: duplicates bias your analysis. Counts become inflated, conversion rates change, and downstream joins can explode rows (one duplicate on each side can multiply into many rows). A common mistake is to run a “Remove duplicates” tool on the whole sheet without thinking. If one column contains a timestamp or a notes field, then two rows that are truly the same customer may not be identical, so they won’t be removed. The opposite mistake is worse: removing “duplicates” based on too few columns and accidentally merging two different people who share a name.

Practical outcome: treat duplicates as a data modeling question first (“what is one record?”) and only then a cleanup action. When you can explain what a duplicate means in one sentence, you’re ready to move on.

Section 4.2: Unique keys: IDs, composite keys, and why they matter

The safest way to define duplicates is by a unique key: a column (or set of columns) that should be unique per record. If you have a stable ID like Customer_ID, Order_ID, or Ticket_Number, use it. IDs are designed for this job: they are consistent, short, and less ambiguous than names. In many spreadsheets, though, you don’t have a single ID. That’s when you create a composite key—a combination of columns that together define uniqueness.

For example, an attendance sheet might be unique by (Student Email + Date + Class). A sales line might be unique by (Invoice Number + Line Item). A contact list might be unique by (Email) rather than (First Name + Last Name), because names collide. The key design choice is balancing false positives and false negatives: too broad (many columns) and you miss duplicates; too narrow (few columns) and you incorrectly merge distinct records.

In practice, create a helper column named something like unique_key that concatenates fields in a consistent way. For instance: =LOWER(TRIM(A2))&"|"&LOWER(TRIM(B2)) to combine email and date. Using TRIM removes extra spaces; using LOWER avoids case-based mismatches. Once you have this key, you can count and filter duplicates with far more confidence than eyeballing rows.
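The helper-column formula has a direct scripting analog: build the normalized key, then do a COUNTIF-style tally to flag duplicate groups. The sample rows are hypothetical:

```python
# A composite key like =LOWER(TRIM(A2))&"|"&LOWER(TRIM(B2)), plus a
# COUNTIF-style tally marking every row that sits in a duplicate group.
from collections import Counter

rows = [
    {"email": "Ana@x.com ", "date": "2026-03-01"},
    {"email": "ana@x.com",  "date": "2026-03-01"},  # same person, messier input
    {"email": "bo@y.com",   "date": "2026-03-01"},
]

def unique_key(row):
    # TRIM + LOWER on email, joined to the date with a separator.
    return row["email"].strip().lower() + "|" + row["date"].strip()

counts = Counter(unique_key(r) for r in rows)
flags = [counts[unique_key(r)] > 1 for r in rows]  # True = in a duplicate group
print(flags)
```

The separator ("|") matters: without one, concatenated fields can collide ("ab"+"c" vs "a"+"bc"), which is the same pitfall as a spreadsheet concatenation without a delimiter.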

Section 4.3: Exact duplicates in spreadsheets (sort, filter, highlight)

Exact duplicates are the easiest to handle, but you still want a controlled workflow. Start by sorting or filtering so duplicates appear next to each other. If you have a unique key or ID column, sort by it first, then by a timestamp (if available) so you can see which entry is newer. If you do not have a key yet, temporarily sort by the most identifying columns (like email, order number, or address) to cluster likely matches.

Next, highlight duplicates rather than deleting immediately. Most spreadsheet tools have Conditional formatting → Duplicate values for a selected column. Apply it to your key column (preferred) or to a combination approach: highlight duplicates on Email, then separately on Phone, then compare. A more explicit method is to add a count column: use COUNTIF or COUNTIFS on the key and mark any row where the count is greater than 1. Once marked, filter to show only duplicate groups. This creates a “review queue” you can work through systematically.

Only after you can see the duplicate groups should you remove them. A common mistake is to remove duplicates across the entire table, which can preserve rows that differ by an irrelevant notes field (so nothing gets removed) or delete rows that differ in a meaningful field (so you lose real information). Practical outcome: you should be able to produce a filtered view showing just the duplicate groups, with a clear reason why each group is considered a duplicate.

Section 4.4: Near-duplicates (small differences) and manual review tactics

Near-duplicates are where most “real life” de-duplication effort goes. These are rows that refer to the same entity but don’t match exactly: typos, abbreviations, swapped first/last names, missing apartment numbers, or old vs new emails. Automated “Remove duplicates” tools usually won’t catch these, so your job is to make them easier to spot and safer to review.

Start with normalization helpers: create columns that standardize the fields you plan to compare. For names and emails: LOWER + TRIM. For phone numbers: remove spaces, dashes, and parentheses (many spreadsheets can do this with repeated SUBSTITUTE). For addresses: standardize common abbreviations (St vs Street) if your dataset needs it. Then compare the normalized versions rather than the raw input.

Next, use grouping tactics. Sort by normalized last name, then first name; or sort by normalized address; or sort by the first 5–8 characters of an email. This “clusters” similar records. Filter for blanks too—near-duplicates often have missing data in one row that is present in the other. When you find a suspected pair, don’t immediately delete: add a temporary review_status column with values like “keep,” “remove,” “merge,” “unclear.” That small step prevents accidental loss and forces you to make an explicit decision.
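The normalization-plus-flagging tactic above can be sketched as follows; the data, regex, and `review_status` values are illustrative, and the point is to compare normalized companion columns while leaving the raw values untouched:

```python
import pandas as pd

df = pd.DataFrame({
    "name":  ["  Ana Pop ", "ana pop"],
    "phone": ["(072) 260-6166", "0722606166"],
})

# Normalized companion columns: compare these, keep the raw columns as-is
df["name_norm"]  = df["name"].str.strip().str.lower()
df["phone_norm"] = df["phone"].str.replace(r"[()\s-]", "", regex=True)

# Flag suspected near-duplicates for manual review instead of deleting them
df["review_status"] = ""
dupe_mask = df.duplicated(subset=["name_norm", "phone_norm"], keep=False)
df.loc[dupe_mask, "review_status"] = "unclear"
print(df)
```

Both rows normalize to the same name and phone, so both get routed into the review queue; a human still decides keep, remove, or merge.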

Common mistake: trying to perfect-match near-duplicates without context. If two rows share a name but have different emails and cities, they may be two different people. Practical outcome: you can consistently surface likely near-duplicates and route them into a manual review process with clear labels.

Section 4.5: Choosing which record to keep (latest, most complete, trusted source)

Once you have a duplicate group, the real work is deciding which row is the “winner.” This should be a rule, not a gut feeling, and the rule should match your business purpose. Three practical tie-breakers cover most simple datasets: keep the latest record, keep the most complete record, or keep the record from the most trusted source.

Latest is common for contact lists and customer profiles: the newest entry may have an updated phone number or address. To apply it, sort duplicate groups by a timestamp (Created_At, Updated_At) descending and mark the top row as keep. Most complete is useful when duplicates exist because different systems capture different fields: one row has phone, another has address. You can score completeness by counting non-blank cells across key columns, then keep the highest score. Trusted source matters when you merge exports: a CRM entry might be more reliable than a marketing signup sheet. Add a Source column if you can, then prefer the authoritative system.

Sometimes you don’t want to “keep one and delete the rest”—you want to merge information. In a spreadsheet, merging can be manual: keep one row and copy missing fields from the other duplicates into it (documenting what you changed). If merging is frequent, consider adding a rule: “keep latest, but fill blanks from older rows before deleting them.” Common mistake: deleting duplicates without checking which row contains the only non-empty value for an important column. Practical outcome: every duplicate group ends with an intentional kept row and a defensible reason.
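The "latest" and "most complete" tie-breakers can be combined in a short pandas sketch (column names and sample data are illustrative; here the latest row also happens to score highest on completeness):

```python
import pandas as pd

df = pd.DataFrame({
    "email":      ["a@x.com",    "a@x.com",    "b@x.com"],
    "updated_at": ["2026-01-05", "2026-03-01", "2026-02-10"],
    "phone":      [None,         "0722606166", None],
    "city":       ["Cluj",       "Cluj",       "Iasi"],
})
df["updated_at"] = pd.to_datetime(df["updated_at"])

# Completeness score: count of non-blank key fields per row
df["score"] = df[["phone", "city"]].notna().sum(axis=1)

# "Keep latest" rule: sort so the newest row in each group comes first,
# then drop the rest of the group
winners = (
    df.sort_values(["email", "updated_at"], ascending=[True, False])
      .drop_duplicates(subset=["email"], keep="first")
)
print(winners[["email", "updated_at", "score"]])
```

Swapping the sort column from `updated_at` to `score` turns the same code into the "most complete wins" rule, which keeps the decision explicit rather than a gut feeling.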

Section 4.6: Simple prevention: validation lists and consistent forms

The best duplicate removal is the one you don’t have to do again. Prevention in spreadsheets can be simple and high impact. First, use data validation to limit free-text variation. For fields like Country, State, Department, or Product Category, use a validation list so people choose from the same options instead of typing “CA,” “Calif,” or “California.” Consistent categories reduce near-duplicates and make grouping more reliable.

Second, make input forms consistent. If you collect data via a shared sheet, define which columns are required, where IDs come from, and how dates should be entered. Even a short “How to enter data” note at the top of the sheet can reduce accidental re-entry. If you use a form tool, configure it to prevent multiple submissions where possible (for example, limiting one response per email) and ensure it captures a stable identifier like email or employee ID.

Third, add lightweight uniqueness checks. A practical approach is conditional formatting on your unique key column to highlight duplicates immediately. You can also add a “duplicate_flag” formula: if the key appears more than once, mark it. This turns duplicates into a visible issue at entry time instead of a surprise later.

  • Use IDs whenever possible; avoid relying on names for uniqueness.
  • Standardize key fields (trim spaces, consistent case) before comparing.
  • Highlight duplicates as data is entered, not only during cleanup.

Practical outcome: your dataset becomes easier to maintain, and duplicate cleanup becomes an occasional check rather than a recurring crisis.

Chapter milestones
  • Understand duplicates: exact vs. “same person, slightly different”
  • Pick the columns that define a unique record
  • Identify duplicate groups and review them safely
  • Remove duplicates while keeping the correct row
  • Prevent duplicates with simple input rules
Chapter quiz

1. Why does Chapter 4 treat duplicates as a serious data issue rather than a harmless repeated row?

Show answer
Correct answer: They can distort totals/averages and lead to wrong decisions like inflated revenue or false trends
Duplicates quietly change metrics and can trigger incorrect business actions.

2. What is the first step in a reliable duplicate-handling workflow described in the chapter?

Show answer
Correct answer: Define which column(s) make a record unique in your dataset
You must define what “same record” means before you can safely detect or remove duplicates.

3. Which approach best matches the chapter’s guidance for removing duplicates safely?

Show answer
Correct answer: Isolate suspected duplicate groups, review them in a controlled way, and keep an audit trail/backup
Deleting is irreversible, so the chapter emphasizes controlled review and an audit trail.

4. When duplicates include “same person, slightly different,” what does the chapter suggest you do?

Show answer
Correct answer: Treat them as part of duplicate finding, not just exact copies
Duplicate detection includes both exact matches and near-duplicates representing the same real-world entity.

5. What practical habit does the chapter recommend before removing any duplicates?

Show answer
Correct answer: Make a backup copy of the original tab (e.g., "raw_backup_2026-03-28") and work in a new tab
A backup plus working in a new tab makes the process safe and repeatable.

Chapter 5: Fixing Typos and Standardizing Text

Text fields look “easy” because they are readable, but they often create the most confusion in analysis. A numeric column typically has one way to be wrong (a bad number). A text column has many ways to disagree with itself: different spellings, extra spaces, inconsistent capitalization, punctuation differences, abbreviations, and even invisible characters copied from other systems. These issues quietly corrupt grouping and reporting—your pivot tables split one category into many, charts show duplicate-looking labels, and filters miss records you expected to match.

This chapter gives you a repeatable workflow for text cleanup that is safe and auditable. The goal is not to make text “pretty”; it is to make it consistent so counts, joins, and category-based decisions are trustworthy. You will learn how to spot typo patterns, standardize casing and whitespace, fix frequent misspellings with a reference list, normalize categories, and run quality checks to confirm the cleanup worked.

A practical mindset helps: (1) decide what the “standard” should be, (2) make changes using rules you can re-run, (3) avoid over-aggressive replacements that change meaning, and (4) verify with simple checks like unique lists and frequency counts. When you treat cleaning as a small engineering task—documented steps, limited scope, and validation—you reduce the risk of introducing new errors while removing old ones.

Practice note: the same discipline applies to every skill in this chapter (spotting common typo patterns and inconsistent naming; standardizing casing, spacing, and punctuation; fixing frequent misspellings with a reference list; normalizing categories; and running quality checks to confirm the cleanup worked). For each one, document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Section 5.1: Why typos break grouping and reporting

Grouping depends on exact matches. Computers do not know that “New York”, “new york”, “New York” (double space), and “New York,” (trailing comma) are “the same” unless you standardize them. This is why typos and inconsistent naming break pivot tables, dashboards, and summary statistics. Your “top customers” chart can quietly split one customer into multiple rows; your “orders by region” report can create extra regions that are really misspellings.

Start by spotting common patterns rather than hunting random mistakes. Common patterns include: extra spaces at the start/end; multiple internal spaces; inconsistent abbreviations ("St" vs "Street"); punctuation differences ("Co." vs "Co"); and a few recurring misspellings ("Califronia"). Another major pattern is inconsistent category granularity, such as mixing “USA”, “United States”, and “United States of America”. These are not “typos” exactly, but they create the same reporting problem.

Use an “inspect first” workflow: sort the column A→Z, scan for near-duplicates, and generate a unique list to see all distinct values. If your tool supports it, also review frequency counts; the rare values often contain the errors. Engineering judgement matters here: sometimes rare values are legitimate edge cases (a new product line), and sometimes they are just mistakes. Your job is to decide which is which before changing data.

Section 5.2: Trimming spaces and removing hidden characters

Whitespace is the most common invisible cause of mismatches. A value that looks correct can still fail to match because it contains leading/trailing spaces, non-breaking spaces, or line breaks copied from emails or PDFs. Fixing whitespace should be one of your first text-cleaning steps because it is low risk and improves downstream checks.

In spreadsheets, apply a trim operation to remove leading and trailing spaces and collapse repeated internal spaces when appropriate. In many tools, a TRIM-like function removes regular spaces but may not remove non-breaking spaces (often introduced by web pages). For those, you may need a replace step that targets the special character (commonly CHAR(160)) and replaces it with a normal space before trimming. Also watch for hidden newlines (line breaks) that can appear inside cells; these can be removed by replacing line break characters with a space.

  • Safe default: remove leading/trailing spaces everywhere in text columns.
  • Be careful: collapsing internal spaces can change meaning in free-text fields (comments), but is usually fine for categories like city, state, product, department.
  • Tip: after trimming, re-check the unique list—many “duplicates” disappear immediately.

Common mistake: trimming only a sample. If you clean 20 rows but forget the rest, your report still breaks. Prefer repeatable column-wide operations (formulas filled down, power query steps, scripts) rather than manual edits.
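A repeatable column-wide version of the whitespace cleanup, sketched in pandas with illustrative city values; the key steps are replacing non-breaking spaces (CHAR(160) in spreadsheets, U+00A0 here) and line breaks before trimming:

```python
import pandas as pd

df = pd.DataFrame({"city": ["New York ", "New\u00a0York", "New\nYork", "New  York"]})

# Replace non-breaking spaces and line breaks with normal spaces,
# collapse repeated internal spaces, then strip leading/trailing spaces
df["city_clean"] = (
    df["city"]
      .str.replace("\u00a0", " ")
      .str.replace("\n", " ")
      .str.replace(r" +", " ", regex=True)
      .str.strip()
)
print(df["city_clean"].unique())
```

All four variants collapse to a single value, which is exactly the "re-check the unique list after trimming" effect described in the tip above.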

Section 5.3: Standard casing rules (upper/lower/title case)

Case differences create fake categories: “apple” and “Apple” group separately in many systems. Pick a casing rule and apply it consistently. The “best” rule depends on how the text is used. If the column is a join key (e.g., customer email), you may want to keep the original but also create a normalized version for matching. If the column is a category label for charts, choose a readable standard like Title Case.

Practical casing guidelines:

  • Lower case for identifiers where case is not meaningful and consistency matters (emails, usernames). This simplifies matching.
  • Upper case for short codes (state/province abbreviations, airport codes) when that is the domain convention.
  • Title case for proper names and display labels (customer names, cities), with the caution that automated title-casing can mishandle “McDonald”, “O’Neil”, or acronyms.

Engineering judgement: avoid “beautifying” data that is not meant for display. For example, product SKUs or ticket IDs should not be title-cased. Also be wary of columns where capitalization carries meaning (some programming languages, certain part numbers). When uncertain, keep the raw column and create a cleaned companion column (e.g., city_clean) so your transformations are reversible and auditable.

After applying casing rules, re-run frequency counts. You want to see categories merge (fewer unique values) without losing legitimate distinctions.
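The companion-column pattern for casing can be sketched briefly (column names are illustrative; the choice of lower for identifiers and upper for short codes follows the guidelines above):

```python
import pandas as pd

df = pd.DataFrame({
    "email": ["Ana@Example.COM", "ana@example.com"],
    "state": ["ca", "Ca"],
})

# Keep the raw columns; add normalized companions so changes stay reversible
df["email_clean"] = df["email"].str.lower()   # identifiers: lower case
df["state_clean"] = df["state"].str.upper()   # short codes: upper case

# Fewer unique values after casing rules means categories merged as intended
print(df["email_clean"].nunique(), df["state_clean"].nunique())
```

Re-running the unique count on the cleaned columns is the quick check that categories merged without losing legitimate distinctions.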

Section 5.4: Find/replace safely and avoid accidental replacements

Find/replace is powerful and dangerous. It is easy to fix hundreds of rows in seconds—and also easy to change the wrong thing everywhere. Safe replacement is about scoping, testing, and using precise patterns. Always preview matches before applying changes, and keep a copy of the original column or file so you can roll back.

Common accidental replacements happen when the match pattern is too broad. Replacing “St” with “Street” can turn “St Louis” into “Street Louis”. Replacing “CA” with “California” can break words that contain “ca” (depending on case sensitivity). The fix is to match whole words, exact values, or specific positions (e.g., only if “St” appears at the end of a street address).

  • Prefer exact-value replacements for categories: replace only when the entire cell equals a known variant (e.g., “Califronia” → “California”).
  • Use case-sensitive rules when casing conveys meaning, or normalize case first and then replace.
  • Work in passes: trim/clean whitespace, then standardize case, then fix misspellings, then normalize categories.

Practical outcome: your replacement operations should be explainable as a list of rules. If you cannot describe the rule clearly (“I changed whatever looked wrong”), it will be hard to reproduce and hard to trust.
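The exact-value rule can be demonstrated in pandas, where `Series.replace` with a dict matches whole cell values only (sample data is illustrative, including the deliberately risky "St" mapping):

```python
import pandas as pd

df = pd.DataFrame({
    "address": ["12 St Louis Ave", "St"],
    "state":   ["Califronia", "CA"],
})

# Exact-value mapping: a cell changes only when its ENTIRE value matches,
# so the "St" inside "12 St Louis Ave" is never touched
df["state"]   = df["state"].replace({"Califronia": "California"})
df["address"] = df["address"].replace({"St": "Street"})
print(df)
```

The same mapping done as a substring find/replace would have produced "12 Street Louis Ave", which is the accidental-replacement failure mode described above.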

Section 5.5: Building a simple “lookup table” for standard names

When you have repeated variants (especially across multiple files or monthly reports), a lookup table is the most reliable approach. A lookup table is a two-column reference: raw_value and standard_value. Instead of manually fixing the same typo every week, you add it once to the table and apply the mapping each time you import data.

Build the table by extracting a unique list from the dirty column, then reviewing it for variants. Start with the highest-frequency values and the obvious near-duplicates. Map all known variants to a single standard name. For example, map “NY”, “New York”, “NewYork”, and “New York ” (with trailing space) to “New York”. For business entities, decide whether punctuation matters: do you want “Acme Inc” and “Acme, Inc.” to be the same? Make the decision explicit and consistent.

Implementation options:

  • Spreadsheet: use VLOOKUP/XLOOKUP to return the standard value; if not found, return the original so you can see unmapped values.
  • SQL/Python/Power Query: join/merge the dataset to the lookup table and replace values based on the match.

Key practice: track “unmapped” values (those not found in the lookup). Each refresh, review the new unmapped set and decide whether to add mappings or keep them as legitimate new categories. This turns text cleaning into a controlled, incremental process instead of a one-time cleanup.
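As the implementation options above mention, the lookup-table approach maps cleanly to Python; a minimal sketch with illustrative values, including the "return the original if not found" fallback and an unmapped flag:

```python
import pandas as pd

data = pd.DataFrame({"state": ["NY", "NewYork", "New York ", "Texas"]})
lookup = pd.DataFrame({
    "raw_value":      ["ny", "newyork", "new york"],
    "standard_value": ["New York", "New York", "New York"],
})

# Normalize before matching so casing and stray spaces don't defeat the lookup
data["key"] = data["state"].str.strip().str.lower()
mapping = dict(zip(lookup["raw_value"], lookup["standard_value"]))
data["state_std"] = data["key"].map(mapping)

# Track unmapped values for review, then fall back to the original
# (the same behavior as an XLOOKUP with an if-not-found argument)
data["unmapped"] = data["state_std"].isna()
data["state_std"] = data["state_std"].fillna(data["state"])
print(data[["state", "state_std", "unmapped"]])
```

"Texas" survives unchanged and is flagged, so each refresh you can decide whether it needs a new mapping or is a legitimate new category.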

Section 5.6: Final text validation: unique lists and frequency counts

Text cleaning is not done until you validate it. Validation catches two problems: (1) you did not fix enough (variants remain), or (2) you fixed too much (distinct values were incorrectly merged). Two simple checks cover most cases: unique lists and frequency counts.

First, generate a unique list for the cleaned column and scan it quickly. Look for near-duplicates that survived: differences in punctuation, spacing, pluralization, or abbreviations. This is where hidden characters show up—if you still see duplicates that “look” identical, investigate for non-printing characters or mixed Unicode punctuation (different apostrophes or dashes).

Second, compute frequency counts (value → number of rows). Compare counts before and after cleaning. You should see: fewer total unique values, higher counts on the standardized categories, and fewer singletons caused by typos. Pay attention to big shifts: if one category’s count drops unexpectedly, you may have accidentally mapped it elsewhere or removed characters that mattered.

  • Sanity check joins: if the cleaned text is used to join tables, measure join success rate (matched vs unmatched) before/after.
  • Spot-check samples: filter to a standardized value and inspect a handful of underlying rows to confirm they truly belong.
  • Document rules: record trimming, casing, replacements, and lookup mappings so the same cleanup can be repeated.

The practical outcome is confidence: your charts won’t double-count, your category filters will behave, and your dataset becomes easier to maintain. By combining safe text normalization steps with a lookup table and basic validation, you create a repeatable method you can use on any spreadsheet or simple dataset.
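The before-vs-after frequency check can be expressed as a couple of lines; the category values here are illustrative, standing in for a column before and after cleanup:

```python
import pandas as pd

before = pd.Series(["CA", "Calif.", "california", "NY", "CA"])
after  = pd.Series(["CA", "CA", "CA", "NY", "CA"])

# Expected signs of a good cleanup: fewer unique values,
# higher counts on the standardized categories
print("unique before:", before.nunique(), "| unique after:", after.nunique())
print("before:", before.value_counts().to_dict())
print("after: ", after.value_counts().to_dict())

# Sanity check: text cleanup must not change the number of rows
assert len(before) == len(after)
```

If the "after" counts show a category that dropped unexpectedly, that is the signal to check whether a mapping merged things it should not have.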

Chapter milestones
  • Spot common typo patterns and inconsistent naming
  • Standardize casing, spacing, and punctuation
  • Fix frequent misspellings using a reference list
  • Normalize categories so charts and counts make sense
  • Run quality checks to confirm text cleanup worked
Chapter quiz

1. Why can text fields cause more confusion in analysis than numeric fields?

Show answer
Correct answer: Text can disagree in many subtle ways (spelling, spaces, casing, punctuation), splitting groups and counts
The chapter explains that small text inconsistencies quietly corrupt grouping, reporting, and filtering.

2. What is the main purpose of cleaning and standardizing text in this chapter?

Show answer
Correct answer: To make text consistent so counts, joins, and category-based decisions are trustworthy
The goal is consistency for reliable analysis, not aesthetics.

3. Which workflow best matches the chapter’s recommended mindset for text cleanup?

Show answer
Correct answer: Decide a standard, apply repeatable rules, avoid meaning-changing replacements, then validate with checks
The chapter emphasizes defining standards, using re-runnable rules, being cautious about meaning, and verifying results.

4. How do typos and inconsistent naming typically show up in analysis outputs?

Show answer
Correct answer: Pivot tables and charts split one category into multiple duplicate-looking labels
Inconsistent text creates fragmented categories, misleading counts, and confusing labels.

5. Which quality checks are suggested to confirm text cleanup worked?

Show answer
Correct answer: Review unique value lists and frequency counts to ensure categories are normalized
The chapter recommends simple validation like unique lists and frequency counts after applying cleanup rules.

Chapter 6: Putting It All Together: A Repeatable Cleaning Workflow

In earlier chapters you learned how to spot missing values, duplicates, typos, and inconsistent formats—and how to fix each problem safely. This chapter turns those individual techniques into a workflow you can repeat on any small dataset (especially spreadsheets). The goal is not “perfect data.” The goal is a reliable process that produces a clearly improved dataset, keeps the original untouched, and leaves an audit trail that another person (or future you) can understand.

A repeatable cleaning workflow has four practical benefits. First, it reduces risk: you avoid accidental overwrites, hidden filter mistakes, and “fixes” that change meaning. Second, it makes your work faster: you don’t reinvent steps each time. Third, it improves trust: you can show what changed and what didn’t. Finally, it makes downstream analysis more stable: when categories are standardized and duplicates removed, charts and pivot tables stop “drifting” from one refresh to the next.

To keep this chapter concrete, imagine a small customer-orders spreadsheet with columns like Order_ID, Date, Email, State, Product, Quantity, and Revenue. It has blanks in Quantity, duplicate Order_IDs, “CA / Calif. / california” as multiple state values, and a few obvious typos in Product names. You will clean it end-to-end with a full checklist, document every change, generate a cleaned output, and produce a simple before-vs-after quality report—without needing code.

The core mindset: cleaning is a series of small decisions made with engineering judgment. Every decision has tradeoffs. Removing rows might reduce bias in some contexts, but it can also throw away rare cases. Filling missing values might be reasonable for a “State” field, but dangerous for “Revenue.” This chapter teaches you to make those tradeoffs explicit, consistent, and visible.

Practice note: the same discipline applies to each milestone in this chapter (cleaning a small dataset end-to-end with the full checklist; documenting every change so it is transparent and repeatable; producing a "cleaned data" output while keeping raw data untouched; building a simple before-vs-after quality report; and planning no-code automation next steps). For each one, document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.


Section 6.1: A practical order of operations (profile → fix → validate)

A cleaning workflow works best when you follow a consistent order of operations. A practical sequence for small datasets is: profile → fix → validate. Profiling means measuring and scanning before you touch anything. Fixing means applying changes in a controlled way. Validating means checking that your changes actually improved the dataset and didn’t create new problems.

Profile first. Do a quick shape check (rows/columns), then scan for common issues: blanks per column, obvious outliers (e.g., negative revenue), inconsistent formats (dates, phone numbers), and suspicious categories (e.g., “Calif.” plus “CA”). In a spreadsheet, use filters, sort, and pivot tables to quickly understand distributions. A common mistake is to begin “correcting” typos immediately; you might standardize a value that is actually meaningful (for example, “WA” could mean “Washington” or “Warranty Add-on” depending on the column).

Fix in an order that reduces rework. Often, start with structural issues that affect identity and joining: standardize IDs and key fields, then handle duplicates, then missing values, then text standardization and formats. For example, if duplicates exist, decide the “correct record” before filling blanks—otherwise you might fill data into a row you later delete. When fixing, use consistent rules: trim extra spaces, standardize case, apply the same mapping table for categories, and make one change type at a time so you can verify its impact.

Validate last. Re-run the same profiling checks you started with: blank counts per column, duplicate counts on your key fields, and unique category lists. Compare them to the "before" numbers so every fix can be shown to move a metric in the expected direction. If a count shifts in a way you cannot explain, stop and investigate before moving on; an unexplained change usually means a rule did more than you intended.

Section 6.2: Creating a cleaning log (what changed and why)

Cleaning is only trustworthy when it’s transparent. A cleaning log is your lightweight documentation that answers: What changed? Why did we change it? How was it changed? What was the impact? For small projects, a simple table in a separate sheet (or a text document) is enough. Treat it like a lab notebook: short, factual entries that someone else can replay.

A practical cleaning log format includes: Date, Step, Column(s), Rule/Action, Reason, Before count, After count, and Notes/Exceptions. Example entries: “Trim leading/trailing spaces in Product” (Reason: duplicates caused by whitespace); “Map State values {Calif., california} → CA” (Reason: standardize categories); “Remove duplicate Order_ID keeping latest Date” (Reason: de-duplication rule to keep most recent update); “Leave Revenue blanks as blank” (Reason: do not invent financial values).

Documenting “why” is where engineering judgment shows. If you fill missing values, record the method (median, mode, or a business rule) and the assumption behind it. If you remove rows, record the criteria and how many were removed. A common mistake is logging only the action (“deleted duplicates”) without the rule that defines the correct record. Without the rule, the result is not repeatable and cannot be audited.

The log also protects you from subtle spreadsheet errors. If a filter was on when you pasted changes, you may have edited only a subset of rows. By recording before/after counts and spot-checking, you catch these issues early. The practical outcome: your cleaned dataset becomes defensible, and collaboration becomes easier because disagreements can focus on rules, not on guessing what happened.
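The log format above can also be kept as a small table built in code, which makes the before/after counts easy to review or export. A minimal sketch; the dates, counts, and rules are illustrative:

```python
import pandas as pd

# Cleaning log as a list of factual entries: one dict per rule applied
log_entries = []

def log_step(date, step, column, rule, reason, before, after):
    """Record what changed, why, how, and the row counts around the change."""
    log_entries.append({"date": date, "step": step, "column": column,
                        "rule": rule, "reason": reason,
                        "before": before, "after": after})

log_step("2026-03-28", 1, "Product", "trim leading/trailing spaces",
         "whitespace caused duplicate-looking values", before=412, after=412)
log_step("2026-03-28", 2, "Order_ID", "drop duplicates, keep latest Date",
         "de-duplication rule: keep most recent update", before=412, after=405)

log = pd.DataFrame(log_entries)  # review or export alongside the cleaned data
print(log)
```

Because each entry records the rule, not just the action, another person can replay the cleanup or challenge a specific decision.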

Section 6.3: Separating raw, working, and cleaned files

A repeatable workflow always separates raw, working, and cleaned data. This is less about bureaucracy and more about safety. The raw file is your immutable source of truth. The working file is where you experiment and apply fixes. The cleaned file is the output you share or analyze.

Start by saving the original dataset as RAW and never editing it. If you’re using spreadsheets, store raw data in a protected tab (or a separate file) and restrict edits. Then create a WORKING copy where you perform the profile → fix → validate steps. Finally, export a CLEANED version with only necessary sheets and clear column names, ready for analysis.

Use simple naming conventions: orders_RAW_2026-03-28.xlsx, orders_WORKING_2026-03-28.xlsx, orders_CLEANED_2026-03-28.xlsx. If you do multiple iterations, add a version suffix (v1, v2) and keep a short changelog in the file or folder. A common mistake is cleaning “in place” and then discovering later that you need to revisit an assumption; without raw data, you cannot recover what was changed.

When producing the cleaned output, remove intermediate helper columns unless they are useful to downstream users (for example, keep a “State_Standardized” column if stakeholders want to compare original vs standardized). If you must keep both, label clearly: State_raw and State_clean. The practical outcome is a pipeline you can trust: raw stays untouched, working remains flexible, and cleaned is stable and shareable.

Section 6.4: Basic QA metrics: blanks, duplicates, category counts

A cleaning workflow is incomplete without a basic quality report. You don’t need complex statistics; you need a few simple metrics that show “before vs. after” improvements and highlight remaining risks. Think of this as your QA snapshot: fast to compute, easy to understand, and directly tied to common data problems.

At minimum, track three families of metrics. Blanks: count missing values per column (and optionally the percentage). In spreadsheets, you can use COUNTBLANK or a pivot-like summary. Be careful: blanks can hide as spaces (“ ”) or special values like “N/A” and “null.” Part of standardization is turning those placeholders into true blanks or into a consistent “Unknown” category—depending on what your analysis needs.
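A blank count that also catches hidden placeholders can be sketched as follows. The marker list, column names, and sample rows are assumptions to adapt to your own data; the key idea is normalizing (trim, lowercase) before testing for blankness.

```python
from collections import Counter

# Placeholders that often hide as "blanks" in exported spreadsheets.
BLANK_MARKERS = {"", "n/a", "na", "null", "none", "-"}

def count_blanks(rows, columns):
    """Count blank-like values per column, treating whitespace-only
    cells and common placeholder strings as blank."""
    blanks = Counter()
    for row in rows:
        for col in columns:
            value = str(row.get(col, "") or "").strip().lower()
            if value in BLANK_MARKERS:
                blanks[col] += 1
    return blanks

# Tiny illustrative dataset (column names are hypothetical).
sample_rows = [
    {"State": "CA",   "Quantity": "3"},
    {"State": "  ",   "Quantity": "N/A"},
    {"State": "null", "Quantity": ""},
]
print(count_blanks(sample_rows, ["State", "Quantity"]))
```

Run this once before cleaning and once after; the drop per column is your "before vs. after" blanks metric.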

Duplicates: count duplicates of key fields (e.g., Order_ID, Email+Date). Report how many duplicate groups exist and how many rows are involved. After de-duplication, confirm the duplicate count is zero (or explain why some duplicates are allowed, such as multiple items per order). A common mistake is using “Remove Duplicates” without defining the key; you might delete legitimate rows that share the same customer email.
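The "defined key plus keep-rule" pattern can be sketched like this, assuming a hypothetical orders table where the keep-rule is "latest Date wins" (ISO date strings compare correctly as text):

```python
from collections import defaultdict

def dedupe_keep_latest(rows, key_fields, date_field):
    """Keep one row per key: the one with the latest date.
    Also report how many keys had duplicates, for the QA metric."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[f] for f in key_fields)
        groups[key].append(row)
    # ISO-format dates (YYYY-MM-DD) sort correctly as strings.
    kept = [max(g, key=lambda r: r[date_field]) for g in groups.values()]
    dup_groups = sum(1 for g in groups.values() if len(g) > 1)
    return kept, dup_groups

# Hypothetical orders; Order_ID A1 appears twice.
orders = [
    {"Order_ID": "A1", "Date": "2026-03-01", "Revenue": 10},
    {"Order_ID": "A1", "Date": "2026-03-05", "Revenue": 10},  # later copy wins
    {"Order_ID": "B2", "Date": "2026-03-02", "Revenue": 25},
]
kept, dup_groups = dedupe_keep_latest(orders, ["Order_ID"], "Date")
print(len(kept), dup_groups)  # 2 rows kept, 1 duplicate group
```

Note that the key is explicit (`["Order_ID"]`): switching it to `["Email"]` would merge legitimate rows, which is exactly the mistake the paragraph above warns about.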

Category counts: list unique values and frequencies for fields like State, Product, Channel, or Status. Before cleaning, you might have 30 state values because of typos and casing; after cleaning, you should have a smaller, expected set. Compare top categories before vs. after to ensure mappings did not accidentally merge distinct groups. Add one or two “sanity totals” where appropriate (sum of revenue, total quantity). If totals change, the report should explain why (for example, removed duplicate orders reduced revenue inflation).
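A before/after category comparison can be sketched with a frequency counter and a mapping table. The state labels and mapping below are illustrative assumptions, not a complete table:

```python
from collections import Counter

# Hypothetical mapping table from messy labels to standard codes.
STATE_MAP = {"calif.": "CA", "california": "CA", "ca": "CA",
             "n.y.": "NY", "ny": "NY"}

raw_states = ["Calif.", "CA", "california", "NY", "n.y."]

before = Counter(s.strip().lower() for s in raw_states)
after = Counter(STATE_MAP.get(s.strip().lower(), s.strip())
                for s in raw_states)

print(len(before), "distinct raw labels")   # 5 distinct raw labels
print(after)                                # collapses to CA and NY
```

Comparing `before` and `after` side by side is the check that mappings collapsed typos without merging genuinely distinct categories.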

The practical outcome is a one-page report you can attach when sharing cleaned data: it demonstrates improvement, communicates remaining limitations, and gives stakeholders confidence that the dataset is fit for its intended use.

Section 6.5: Sharing cleaned data responsibly (notes, assumptions, limits)

Cleaned data is not just a file—it’s a set of decisions. Sharing responsibly means packaging the dataset with the context needed to use it correctly. This is especially important when cleaning involved filling missing values, dropping records, or standardizing categories based on judgment.

Include a short Read Me (a worksheet tab or a separate document) that lists: purpose of the dataset, date of extraction, cleaning date, who performed the cleaning, and where the raw data lives. Then summarize key assumptions: “Duplicates defined as same Order_ID; kept the row with latest Date,” “State standardized to two-letter codes,” “Revenue blanks left blank; no imputation performed,” or “Removed 12 test records where Email ended with @example.com.” These notes prevent misinterpretation and reduce repeated questions.

Also state limits. For example: “Some missing Quantity values remain because there was no reliable source to infer them,” or “Product names were standardized using a manual mapping table; rare items may still contain typos.” A common mistake is presenting cleaned data as “final” without acknowledging these edges; downstream users may treat it as ground truth and build fragile decisions on top.

When sharing, prefer the cleaned output plus the quality report and cleaning log, not the working file full of intermediate steps. If you must share working materials, clearly label them as non-final. The practical outcome is a cleaned dataset that can be reused safely, audited later, and extended without undoing earlier decisions.

Section 6.6: Your reusable template: checklist + rules + sign-off

To make this workflow repeatable, turn it into a reusable template: a checklist, a small set of standard rules, and a sign-off step. This is the bridge from “I cleaned this once” to “I can clean any similar spreadsheet reliably.” Keep it short enough that you’ll actually use it.

A practical checklist can follow the same three phases.
  • Profile: confirm row/column counts, identify key fields, count blanks per column, list top categories for major text fields, and check for obvious invalid values.
  • Fix: standardize formats (dates, casing, whitespace), resolve duplicates using a defined key and keep-rule, address missing values using approved methods (leave/remove/fill) per column, and apply typo/category mapping consistently.
  • Validate: re-run blank/duplicate/category metrics, spot-check a sample of records, and confirm totals or pivot results match expectations.

Next, define a small set of rules you reuse across datasets: “Trim whitespace in all text columns,” “Standardize unknown markers (N/A, none, null) to blank,” “Use a mapping table for categories rather than manual find/replace,” and “Never impute financial fields without an explicit business rule.” These rules prevent common mistakes like inconsistent edits across columns or silent meaning changes.
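Several of these standard rules can be composed into a single text-cleaning function. This is a minimal sketch, assuming a hypothetical marketing-channel column and mapping table; the marker set mirrors the rules listed above.

```python
# Unknown markers that the standard rules convert to a true blank.
UNKNOWN_MARKERS = {"", "n/a", "na", "none", "null", "-"}

def clean_text(value, mapping=None):
    """Apply the standard text rules in order: trim whitespace,
    convert unknown markers to a true blank, then apply a mapping
    table (if given) instead of ad-hoc find/replace."""
    v = str(value).strip()
    if v.lower() in UNKNOWN_MARKERS:
        return ""
    if mapping:
        return mapping.get(v.lower(), v)
    return v

# Hypothetical mapping table for a Channel column.
channel_map = {"fb": "Facebook", "facebook ads": "Facebook",
               "ig": "Instagram"}

print(clean_text("  FB ", channel_map))   # Facebook
print(clean_text("N/A"))                  # "" (true blank)
print(clean_text("Email", channel_map))   # Email (no mapping entry, kept as-is)
```

Because the mapping lives in one table, extending it (for a newly spotted typo) changes every affected cell consistently, which is exactly what manual find/replace cannot guarantee.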

Finally, add a sign-off step: you (or a reviewer) confirm that the cleaning log is complete, the cleaned file has been generated, and the QA report is attached. If you’re planning next steps toward automation without code, start small: build reusable mapping tables, standard pivot-based QA summaries, and consistent file naming. Later, these same steps can be translated into repeatable tools, but the workflow itself stays the same. The practical outcome is a template you can apply in minutes, producing cleaned data that is transparent, stable, and ready for analysis.

Chapter milestones
  • Clean a small dataset from start to finish with the full checklist
  • Document every change so it’s transparent and repeatable
  • Create a “cleaned data” output and keep raw data untouched
  • Build a simple quality report (before vs. after metrics)
  • Plan next steps: automation ideas without needing code
Chapter quiz

1. What is the main goal of the repeatable cleaning workflow in Chapter 6?

Show answer
Correct answer: Use a reliable process that improves the dataset, keeps raw data untouched, and leaves an audit trail
The chapter emphasizes a repeatable, transparent process—not perfection—while preserving the original data and documenting changes.

2. Which practice best supports transparency and repeatability when cleaning a dataset?

Show answer
Correct answer: Documenting every change so another person (or future you) can understand what happened
An audit trail requires recording what changed and why, not just the final output.

3. Why does the chapter recommend creating a "cleaned data" output while keeping the raw data untouched?

Show answer
Correct answer: To reduce the risk of accidental overwrites and preserve a recoverable baseline
Keeping raw data unchanged lowers risk and lets you trace or redo steps if needed.

4. What is the purpose of a simple quality report with before-vs-after metrics?

Show answer
Correct answer: To show what improved (and what didn’t) to build trust and stability for downstream analysis
Before/after metrics make improvements visible and help others trust and reuse the cleaned dataset.

5. Which statement best reflects the chapter’s mindset about cleaning decisions?

Show answer
Correct answer: Cleaning is a series of small judgment calls where tradeoffs should be made explicit and consistent
The chapter stresses engineering judgment: decisions like removing rows or filling values can help or harm depending on context, so tradeoffs must be explicit.