AI Research & Academic Skills — Beginner
Recreate a real mini-study with open data and report it clearly.
AI and data results can look impressive, but they only become trustworthy when someone else (or you, three weeks later) can rerun the work and get the same outcome. This course teaches AI reproducibility from first principles, without assuming any background in AI, coding, or statistics. You will recreate a simple, beginner-friendly study using a public dataset, then document your process so it can be repeated reliably.
Think of this course as a short technical book with six chapters. Each chapter adds one essential building block: understanding what reproducibility means, choosing and checking open data, setting up a repeatable workspace, running a baseline model, troubleshooting differences, and finally writing a clear reproducibility report.
You will complete a small “reproducibility bundle” that includes: a frozen copy (or version reference) of the dataset, a step-by-step run guide, saved outputs (tables/figures), and a short report that explains exactly what you did and what you found. The goal is not to build the fanciest AI model. The goal is to produce a result that can be rerun and checked.
Reproducibility often fails because of small, invisible details: file names, missing steps, different software versions, or randomness in model training. This course treats those details as the main lesson—not an advanced side topic. You will learn practical habits like freezing dataset versions, naming files consistently, recording run settings, and writing instructions that another person can follow without guessing.
Whenever a new term appears (like “feature,” “baseline,” or “train/test split”), it is explained in plain language with a concrete purpose: helping you recreate the same study again.
This course is designed for absolute beginners: students, early-career professionals, analysts, policy staff, and anyone who wants to understand how to verify AI-related results responsibly. You do not need prior coding experience. You only need a computer, internet access, and the willingness to follow a step-by-step process.
If you want to build research credibility and learn a skill that applies across AI, analytics, and reporting, start here. You can begin right away and follow the chapters like a short book. Register free to access the course, or browse all courses to compare learning paths.
Data Science Educator, Research Methods Specialist
Sofia Chen designs beginner-friendly programs that teach research and data skills from first principles. She has supported cross-functional teams in documenting analyses, validating results, and building repeatable workflows for reports and audits.
In AI, a “result” is rarely just a number. It is a chain of decisions: which dataset you used, how you cleaned it, how you split it, which model you trained, which metrics you reported, and which defaults your tools quietly chose for you. Reproducibility is the discipline of making that chain visible and runnable, so another person (or your future self) can follow it and reach the same outcome. This course is about building that discipline with beginner-friendly tools and open data.
You will practice reproducibility in a realistic way: by recreating a small, published-style analysis end-to-end. That means you will need to choose a dataset with clear documentation, set up a workspace you can rerun later, keep careful notes on changes you make to the data, and interpret baseline model results with plain-language metrics. This first chapter sets expectations: what reproducibility is, what it is not, and what you must capture so results match.
As you read, keep a simple framing question in mind: “If I stopped today, could I come back in six months, rerun everything on a new computer, and confidently explain any differences?” Reproducibility is answering “yes” to that question—not through luck, but through process.
Practice note for Define reproducibility in everyday language: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Spot common reasons results don’t match: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Choose a simple “study” to recreate: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Create a reproducibility checklist you’ll use all course: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
AI research and applied ML both rely on trust. When a paper claims “Model A improves accuracy by 3%,” or a blog post claims “this feature reduces error,” the audience assumes the result came from a method that can be repeated. If results cannot be repeated, you cannot tell whether a finding reflects a real pattern in the world or a one-time accident of data handling, random seeds, or hidden defaults.
Repeatability matters even when you are not publishing. In a work setting, it protects teams from “it worked on my laptop” failures, and it allows you to debug intelligently. In a learning setting, it turns experiments into knowledge: when you can rerun the same pipeline, you can change one thing at a time and see what truly caused the metric to move.
Practical example: suppose you trained a classifier on an open dataset and got 0.87 AUC. A week later you rerun the notebook and get 0.81. If the process was not repeatable, you waste time guessing. If it was repeatable, you can pinpoint the difference: a new dataset version, a different library version, a changed preprocessing step, or a different random split.
In this course, “repeatable” is not a vague ideal. You will create a small reproducibility checklist and use it throughout, so each study recreation is something you can run again with confidence.
These terms are often mixed up, so we will use simple, consistent meanings.
Repeatable means: the same person can rerun the same code on the same data, in the same environment, and get the same results. This is the “can I run it again tomorrow?” level. It depends heavily on capturing tool versions, settings, and randomness controls.
Reproducible (in the everyday sense used in many ML projects) means: someone else can take your materials—data access instructions, code, and documentation—and obtain the same results (or extremely close) without needing private knowledge. This is the “can a stranger rerun it?” level. It depends on clarity, not just correctness.
Replicable usually means: an independent team can run a new study (often with new data or a new implementation) and reach the same conclusion. This is harder and less mechanical. A result can be reproducible but not replicable if it was overfit to one dataset, or if the effect is weak.
Why do these distinctions matter for beginners? Because you can control repeatability and reproducibility directly through good workflow, while replicability depends more on the underlying phenomenon and study design. This course focuses on reproducibility: you will recreate a reference analysis using open data, and you will learn to describe any mismatch instead of hand-waving it away.
Reproducibility is not about writing a novel. It is about capturing the minimum information required to rerun the work without guessing. When results do not match, the cause is often missing context rather than incorrect code.
At minimum, a rerunnable project needs four categories of information:
(1) Data: where the dataset came from, which version or snapshot date you used, and how to obtain it. (2) Code: the scripts or notebooks that perform every step, from raw data to final metrics. (3) Environment: the tool and library versions you used (captured in requirements.txt or environment.yml). (4) Run instructions: the exact commands or execution order (e.g., “python train.py” or “Run notebook cells top-to-bottom”). Avoid hidden manual steps.
For a beginner-friendly workflow, aim for a single folder with: README.md (steps + assumptions), requirements.txt, data/ (raw data kept separate from processed), notebooks/ or src/, and results/ (metrics and plots). Keep a small “data cleaning log” that lists each transformation, why it was done, and how many rows were affected.
Common mistake: editing the raw dataset file directly. If you overwrite raw data, you lose the ability to explain differences later. Instead, keep raw immutable and write cleaned outputs as new files with dated names or clear version labels.
Practical outcome: by the end of this course, you will be able to hand your folder to someone else and they can rerun the baseline model and compare metrics to a reference without hunting for missing steps.
When your recreated results don’t match a reference, treat it like a debugging problem, not a personal failure. Most mismatches come from a small set of variation sources. Learning to spot them quickly is a core reproducibility skill.
Data variation is the most common. Public datasets can be updated, rehosted, or preprocessed differently across mirrors. Even column types can shift (e.g., an ID column read as integer vs string). Practical protections include: saving a local copy, recording dataset version/date, and validating row/column counts before modeling.
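As a concrete sketch of that protection, a small validation step can catch a silently changed dataset before any modeling starts. The column names, row count, and sample data below are hypothetical; replace them with the values you recorded at download time.

```python
import csv
import io

# Hypothetical expected shape, copied from your README notes at download time.
EXPECTED_COLS = ["id", "age", "income"]
EXPECTED_ROWS = 3

def validate_dataset(text: str) -> None:
    """Fail fast if the file no longer matches the recorded shape."""
    rows = list(csv.reader(io.StringIO(text)))
    header, body = rows[0], rows[1:]
    assert header == EXPECTED_COLS, f"columns changed: {header}"
    assert len(body) == EXPECTED_ROWS, f"row count changed: {len(body)}"

# Stand-in for the downloaded file's contents.
sample = "id,age,income\n1,34,52000\n2,41,61000\n3,29,48000\n"
validate_dataset(sample)  # raises AssertionError if the file drifted
print("dataset matches recorded shape")
```

Running this at the top of every notebook turns "the data changed under me" from a mystery into an immediate, explainable error.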
Code variation includes accidental changes (a cell re-run out of order) and invisible differences (using a different function that defaults to a different behavior). A simple habit helps: make your pipeline run from a clean start (restart kernel, run all), and store the core steps in scripts/functions instead of scattered notebook cells.
Settings variation is subtle. A different train/test split, a different metric definition, or a different preprocessing (e.g., scaling before vs after split) can shift results. Always specify: split ratio, stratification choice, feature set, and evaluation metric computation details.
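Those split details can be made explicit in code. The sketch below is a minimal pure-Python stratified split with a fixed seed, not any particular library's API; in practice you might use scikit-learn's train_test_split with its stratify and random_state parameters instead. The toy data and the seed value are arbitrary.

```python
import random
from collections import defaultdict

def stratified_split(rows, label_key, test_ratio=0.2, seed=42):
    """Split rows into train/test, preserving label proportions, with a fixed seed."""
    rng = random.Random(seed)          # local RNG: the split depends only on the seed
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)
    train, test = [], []
    for label, group in sorted(by_label.items()):  # sorted => label order is stable
        rng.shuffle(group)
        cut = int(len(group) * test_ratio)
        test.extend(group[:cut])
        train.extend(group[cut:])
    return train, test

# Hypothetical toy data: 10 rows with a binary label.
data = [{"x": i, "y": i % 2} for i in range(10)]
train, test = stratified_split(data, "y")
print(len(train), len(test))  # 8 2
```

Because the ratio, the stratification, and the seed are all visible in one place, anyone rerunning your pipeline gets the identical split.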
Randomness is expected in ML. Random initialization, stochastic training, and random shuffling can move metrics. Beginners often set one random seed and assume it is enough. In practice, you should: set seeds for all relevant libraries, record them, and consider running multiple seeds to estimate variability. For baseline recreation, one fixed seed plus a note about expected noise is usually acceptable.
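For example, a minimal seed-setting helper might look like the following. The seed value 42 is an arbitrary choice you should record in your run notes; the numpy and PyTorch calls mentioned in the comments apply only if your project uses those libraries.

```python
import random

SEED = 42  # arbitrary; record whichever value you use

def set_seeds(seed: int) -> None:
    """Seed the stdlib RNG (and, in a fuller project, each ML library's RNG too)."""
    random.seed(seed)
    # If your pipeline uses these libraries, seed them here as well, e.g.:
    #   numpy.random.seed(seed)
    #   torch.manual_seed(seed)

set_seeds(SEED)
first_run = [random.random() for _ in range(3)]

set_seeds(SEED)
second_run = [random.random() for _ in range(3)]

print(first_run == second_run)  # True: same seed, same draws
```

The habit to take away is the function boundary: all seeding happens in one named place, so it cannot be forgotten or half-applied.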
This mindset also supports good engineering judgment: not every mismatch is meaningful. Your job is to determine whether a difference is explained by a known source of variation or whether it suggests a real methodological gap.
Open data is powerful because it lowers barriers to learning and verification. But “publicly available” does not mean “free to use however you want.” Reproducible AI work is also responsible AI work: you must know what you are allowed to do with the data, what you must attribute, and what privacy constraints apply.
Start with the dataset’s license and documentation page, not just the download link. Identify: (1) the license name (e.g., CC BY, CC BY-SA, CC0, ODC-BY), (2) attribution requirements, (3) restrictions on commercial use or redistribution, and (4) any special clauses for derived works. If a dataset has no clear license, treat it as “not safe to reuse” for a course project that you might share publicly.
Documentation matters because it tells you what the fields mean and what the known limitations are. For example, a target label might be collected through self-report, which introduces bias; or a timestamp might be in local time, affecting seasonality analyses. Reproducibility includes citing these limitations so readers do not overinterpret your baseline model.
Also consider privacy and sensitivity. Even when data is legally shareable, it may contain attributes that deserve careful handling (health, location, protected classes). As a beginner, prefer datasets that are already widely used for teaching and have low risk of re-identification. Avoid scraping personal data for this course; it is difficult to do ethically and reproducibly.
Ethical data use supports long-term reproducibility: if your project depends on questionable access, others cannot rerun it responsibly.
To recreate a study end-to-end, you need a question that is small enough to complete but real enough to teach you the full workflow. Beginners often pick questions that are too ambitious (“predict stock prices,” “detect disease from images”), which introduces complex modeling, unclear baselines, and large compute requirements—each a reproducibility risk.
Choose a question with these properties: the dataset is small enough to inspect by hand and comes with clear documentation; the outcome (label) is clearly defined; a simple, widely understood baseline model is reasonable; and the whole pipeline can run on an ordinary laptop in minutes, not hours. If a candidate question fails any of these, simplify it before starting.
Next, define success criteria before you code. Reproducibility is easier when “done” is objective. For a recreation project, success might mean: (1) you can run the pipeline from scratch and regenerate the same metrics, (2) your metrics are within a reasonable tolerance of a reference (because minor differences can occur), and (3) you can explain any gap using the variation sources from Section 1.4.
This is also where you build your course-long reproducibility checklist. Keep it short and actionable. Example items you will reuse: dataset URL + version/date recorded; license checked and noted; raw data preserved; cleaning steps logged with row counts; random seed set and recorded; environment captured in requirements.txt; one-command (or one-notebook) rerun works; results exported to a timestamped file.
Common mistake: changing multiple things at once (new features, new model, new split) and then being unable to attribute why results moved.
Practical outcome: by the end of this chapter, you should be ready to pick a simple “study” to recreate—one that teaches reproducibility fundamentals without drowning you in complexity.
1. In this chapter, what does “reproducibility” primarily mean in an AI workflow?
2. Why does the chapter say an AI “result” is rarely just a single number?
3. Which situation best reflects the chapter’s standard for reproducibility?
4. Which action is most aligned with practicing reproducibility “end-to-end” as described in the chapter?
5. According to the chapter, what is a common reason results don’t match when someone tries to reproduce an analysis?
Reproducibility starts long before you train a model. It starts the moment you choose a dataset and click “download.” In practice, most failed reproductions are not caused by fancy algorithms—they’re caused by mismatched data versions, unclear column meanings, silent preprocessing, or a missing note about how a value was measured.
This chapter walks you through a beginner-friendly, repeatable workflow: find and download a public dataset safely, understand what each column means by writing a plain-English data dictionary, perform a first-pass quality check (missing values, duplicates, obvious outliers), and then save a clean, “frozen” copy that you can always return to. These steps are not busywork. They are the foundation that lets you re-run the same analysis months later, compare your results to a reference study, and explain differences with evidence instead of guesses.
As you work through the sections, keep a simple goal in mind: if someone else (or future you) opens your project folder, they should be able to answer three questions quickly: (1) Where did the data come from? (2) What do the columns mean? (3) Exactly which version of the data did you use?
Practice note for Find and download a public dataset safely: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Create a data dictionary in plain English: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Check for missing values and obvious issues: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Save a clean “frozen copy” for your study: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Public datasets come from many places: government agencies (census, health, transport), research labs (curated benchmark datasets), nonprofits, companies releasing challenge data, and community repositories. “Public” only means you can access it; it does not automatically mean you can reuse it freely. In reproducible AI work, “open” usually means the data is available without special permission and comes with a license that spells out what you may do (share, modify, use commercially) and what you must do (cite, keep notices, avoid re-identification).
Use engineering judgment when selecting a dataset for your first reproduction project. Prefer datasets that are (a) widely used, (b) small enough to inspect manually, and (c) accompanied by clear documentation. A good dataset page should tell you who produced the data, how it was collected, when it was last updated, and what each file contains. If the dataset is a “living” dataset that changes weekly, you must plan for versioning (covered in Section 2.6) or you may not be able to match a published result.
Downloading safely is part of reproducibility. Avoid random re-uploads of popular datasets with unclear provenance. Download from the official host when possible (e.g., an agency portal or the dataset creator’s repository). Record the download URL, the date, and any version identifier in a README. If the download requires an API, save the exact API query and parameters you used. Common mistakes include: using a different train/test split file than the study used, downloading a “cleaned” community copy without noticing, or grabbing a dataset with hidden access terms that prevent sharing your results.
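One lightweight habit is to append a provenance line to a README at download time, so the exact snapshot can be retrieved later. This is only a sketch: the URL, version tag, and README file name below are placeholders for your project's real values.

```python
import datetime
from pathlib import Path

# Hypothetical values; use your dataset's real URL and version identifier.
SOURCE_URL = "https://example.org/data/credit_default.csv"
VERSION_TAG = "v1"

def record_download(readme: Path, url: str, version: str) -> None:
    """Append a provenance line: what was downloaded, which version, and when."""
    stamp = datetime.date.today().isoformat()
    readme.open("a", encoding="utf-8").write(
        f"Downloaded {url} (version {version}) on {stamp}\n"
    )

record_download(Path("README_data.md"), SOURCE_URL, VERSION_TAG)
print(Path("README_data.md").read_text(encoding="utf-8"))
```

If the download instead goes through an API, the same function can record the exact query and parameters in place of the URL.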
Dataset pages are more than marketing text—they are your first “methods section.” Train yourself to scan for specific items: a schema (list of fields), units (meters vs. feet, dollars vs. euros), measurement methods (sensor type, survey wording), and caveats (known missingness, changes in collection). If you skip this step, you risk building a model that looks accurate but is actually learning from mislabeled or misunderstood variables.
Start by locating the field descriptions. Many portals provide a table of columns, sometimes called “schema,” “metadata,” or “data dictionary.” Note which columns are identifiers (IDs), which are inputs (features), and which might be outcomes (labels). Pay special attention to time-related fields: time zones, date formats, and whether timestamps reflect event time or processing time. If the data includes geographic fields, check coordinate systems and whether locations are obfuscated for privacy.
Caveats often explain why your counts may not match a paper. For example, a dataset might exclude records below a reporting threshold, or a variable might have been redefined mid-year. Another common caveat is sampling: a “representative” survey may involve weights; ignoring them can change averages and model behavior. You do not need to master every nuance on day one, but you should capture the key caveats in your notes so you can justify decisions later.
Reproducibility includes legal and scholarly hygiene. A dataset license tells you what you are allowed to do; a citation tells others where the data came from. Beginners often assume “free to download” means “free to republish,” but that is not always true. Before you build on a dataset, find the license section on the dataset page or in a LICENSE file.
Common license patterns you may encounter include Creative Commons (e.g., CC BY requires attribution; CC BY-SA requires sharing derivatives under the same terms; CC BY-NC restricts commercial use), Open Data Commons licenses for databases, or custom terms from agencies and platforms. If the license is missing or unclear, treat that as a risk: you may still be able to learn privately, but sharing data or derived files may be restricted. For a course reproduction project, prefer datasets with clear reuse rights.
Citation is simpler than it sounds. Most dataset pages provide a “Cite this dataset” block or a recommended citation in BibTeX/APA format. Save it in your project (for example, in a REFERENCES.md file). Also record the version and access date, because dataset content may change over time. If the dataset comes with a related paper, you may need to cite both: the dataset itself and the study describing its creation.
A common mistake is citing only the platform (e.g., “Kaggle”) instead of the dataset creator, or failing to include a version/date so others cannot retrieve the same snapshot. Another mistake is accidentally violating terms by uploading the raw data to a public repository. When in doubt, share your code and instructions, not the raw data.
Once you have the data, resist the urge to jump straight into modeling. Your first job is to understand the shape and meaning of the table(s). Start with the basics: how many rows and columns? What does one row represent (a person, a transaction, a day)? Are there multiple files that must be joined, and if so, what keys connect them?
Next, inspect types. Many reproducibility problems begin when a numeric column is accidentally read as text, or when dates are parsed differently on different machines. Check a small sample of rows and look for surprises: numbers with commas, “NA” strings, mixed units, or categorical values that differ only by capitalization (“Male” vs “male”). If you are using Python or R, print a schema summary and a few example rows; then write down anything that looks ambiguous.
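A quick check like the following flags columns that mix numeric-looking and text values, such as an "NA" string hiding in a numeric column. It is a pure-Python sketch with hypothetical sample rows; real projects often use a dataframe library's type summary for the same purpose.

```python
def mixed_type_columns(rows):
    """Return columns whose non-empty values mix numeric and non-numeric strings."""
    def looks_numeric(value):
        try:
            float(value.replace(",", ""))  # tolerate thousands separators like "1,234"
            return True
        except ValueError:
            return False

    flagged = []
    for col in rows[0]:
        values = [r[col] for r in rows if r[col] != ""]
        if len({looks_numeric(v) for v in values}) > 1:
            flagged.append(col)
    return flagged

sample = [
    {"age": "34", "city": "Lyon"},
    {"age": "NA", "city": "Oslo"},  # "NA" string hiding in a numeric column
]
print(mixed_type_columns(sample))  # ['age']
```

Anything this check flags belongs in your notes as an ambiguity to resolve before modeling.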
This is where you create your own plain-English data dictionary. Even if one exists, your dictionary should reflect how you will use the data in your reproduction. For each column you plan to use, write: (1) what it measures, (2) units, (3) allowed values or typical range, (4) whether missing values occur and how they are encoded, and (5) whether you will treat it as input, label, or metadata. Keep it short but specific. For example: “age: integer years at time of survey; valid 0–100; missing encoded as blank; used as feature.”
Common mistakes include: using an ID as a feature (leading to leakage), treating an ordinal code (1–5) as a real number without understanding what it represents, or using a post-outcome field that would not be available at prediction time. Your dictionary helps you catch these early.
“Clean the data” should not mean “change things until the model works.” In reproducible work, cleaning is a controlled, documented set of decisions. Start by looking for three basic quality flags: missing values, duplicates, and outliers. Your goal is not perfection; your goal is to understand what issues exist and handle them consistently.
Missing values: Count missingness per column and check how it is encoded (empty strings, “NA,” -999). Then decide what to do. For a beginner reproduction, choose simple, defensible rules: drop rows only if the label is missing; otherwise consider basic imputation (mean/median for numeric, most frequent for categorical) and record the method. Be careful: dropping rows can change class balance and may be the reason your results differ from a reference.
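A per-column missingness count can be sketched in a few lines. The sentinel codes below are assumptions; replace them with whatever your dataset's documentation says actually encodes "missing."

```python
# Assumed sentinel encodings for missing values; check your dataset's docs.
MISSING_CODES = {"", "NA", "N/A", "-999"}

def missing_counts(rows):
    """Count missing values per column, including sentinel encodings."""
    counts = {col: 0 for col in rows[0]}
    for row in rows:
        for col, value in row.items():
            if str(value).strip() in MISSING_CODES:
                counts[col] += 1
    return counts

sample = [
    {"age": "34", "income": "52000"},
    {"age": "",   "income": "-999"},
    {"age": "NA", "income": "61000"},
]
print(missing_counts(sample))  # {'age': 2, 'income': 1}
```

Whatever rule you then apply (drop, impute) belongs in your data preparation log next to these counts.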
Duplicates: Determine what “duplicate” means. Is it identical rows, or repeated IDs? Some datasets have multiple entries per person or per day; those are not duplicates. If true duplicates exist, decide whether to keep the first occurrence, aggregate, or remove them. Document the rule and why it matches the study’s intent.
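The distinction between identical rows and repeated IDs can be checked mechanically; the rows below are hypothetical.

```python
from collections import Counter

rows = [
    ("p1", "2026-01-03", 120),
    ("p1", "2026-01-10", 118),  # same person, different visit: not a duplicate
    ("p2", "2026-01-03", 130),
    ("p2", "2026-01-03", 130),  # identical row: a true duplicate
]

# Rows that appear more than once, byte-for-byte.
exact_dupes = [r for r, n in Counter(rows).items() if n > 1]
# IDs that appear more than once (often legitimate repeated measurements).
repeated_ids = [i for i, n in Counter(r[0] for r in rows).items() if n > 1]

print(exact_dupes)   # [('p2', '2026-01-03', 130)]
print(repeated_ids)  # ['p1', 'p2']
```

Only the first list is a candidate for removal; the second is a structural fact about the data that your dictionary should record.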
Outliers: Outliers can be real (rare but valid) or errors (a height of 999). Use basic checks: min/max, histograms, and simple thresholds informed by domain knowledge and dataset notes. Avoid deleting outliers just because they hurt performance; instead, define a rule such as “clip to the 1st–99th percentile” or “remove values outside documented valid ranges,” and record it.
Common mistakes include mixing cleaning with feature engineering in an untracked notebook, silently changing the label definition, or cleaning the full dataset before splitting (which can leak information if you compute statistics on all data). Even in a simple project, keep cleaning steps explicit and reproducible.
To reproduce a study, you need a stable target. That means freezing the dataset version you used and naming files so the relationship between raw and cleaned data is obvious. Think like a lab: raw inputs are preserved, transformations are recorded, and outputs are labeled with enough context to be rerun.
Start with a folder structure that separates raw from processed data, for example: data/raw/, data/interim/, and data/processed/. Never edit files in data/raw/. Treat them as read-only evidence. Put your cleaned “frozen copy” in data/processed/ and ensure it can be recreated from raw data using a script or a clearly documented notebook.
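Creating this layout can itself be a small script, so the structure is reproducible too. The project name below is a placeholder.

```python
from pathlib import Path

def init_project(root: Path) -> None:
    """Create the standard layout; rerunning is harmless (exist_ok=True)."""
    for sub in ["data/raw", "data/interim", "data/processed", "results"]:
        (root / sub).mkdir(parents=True, exist_ok=True)

init_project(Path("my_study"))  # hypothetical project folder name
print(sorted(p.as_posix() for p in Path("my_study").rglob("*")))
```

Because mkdir uses exist_ok=True, the script doubles as documentation: anyone can run it on a fresh machine and get the same skeleton.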
File naming is a small habit with a big payoff. Include: dataset name, a version or date, and a short descriptor. For example: credit_default_raw_2026-03-28.csv and credit_default_clean_v1.csv. If the dataset host provides a release number, include it. If not, create your own version tag and record the source URL and download date in a README. For extra rigor, store a checksum (e.g., SHA-256) for the raw file so you can verify later that nothing changed.
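Computing that checksum takes a few lines with the standard library. The file name below follows the convention above, and its contents are a stand-in for your real raw data.

```python
import hashlib
from pathlib import Path

def file_sha256(path: Path) -> str:
    """Return the SHA-256 hex digest so the raw file can be verified later."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):  # stream: works on large files
            h.update(chunk)
    return h.hexdigest()

raw = Path("credit_default_raw_2026-03-28.csv")  # hypothetical raw-file name
raw.write_bytes(b"id,age\n1,34\n")               # stand-in contents
print(file_sha256(raw))  # record this digest in your README
```

If a later run produces a different digest for the "same" raw file, you know immediately that the data, not your code, is the source of the mismatch.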
Finally, write a short “data preparation log”: what rows were removed (and why), what columns were renamed, how missing values were handled, and any type conversions. This log is what lets you compare your pipeline to a reference study and pinpoint why results differ. A common mistake is to overwrite the only copy of the data with a cleaned file; another is to forget which cleaning version was used to train the baseline model. Freezing avoids both.
1. According to Chapter 2, why do many reproduction attempts fail more often than because of complex algorithms?
2. What is the main purpose of creating a plain-English data dictionary?
3. Which set of checks best matches the chapter’s recommended first-pass data quality review?
4. What does saving a clean “frozen copy” of the dataset enable you to do?
5. If someone opens your project folder later, which three questions should they be able to answer quickly?
Reproducing an AI result is rarely blocked by “hard math.” It’s usually blocked by small, avoidable ambiguities: Where did this file come from? Which version of the dataset did you use? Did you run the notebook top-to-bottom, or did you skip a cell? Which Python environment was active? This chapter turns those ambiguities into explicit, repeatable choices.
Your goal is simple: create a workspace where (1) you can run your first analysis notebook or script successfully today, and (2) you can run it again later with the same outputs, without relying on memory. You will set folder and naming rules, choose the right execution style (notebook vs script), capture your environment and key settings, and finish with a one-page “how to run” guide.
Think of this chapter as building the lab bench before you start mixing chemicals. A clean bench doesn’t guarantee perfect science, but a messy bench guarantees confusion.
The rest of this chapter breaks the workspace into six practical parts you can apply to any open-data study.
Practice note for Set up your project folders and naming rules: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Run a first analysis notebook/script successfully: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Record your environment and key settings: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Create a one-page “how to run” guide: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
A reproducible “project” is not just a folder with code in it. It is a small system with clear boundaries: inputs come in, transformations happen, outputs come out, and the decisions are documented. If you can’t point to where each of those lives, the project will slowly turn into guesswork.
Use a simple folder layout that separates original inputs from generated outputs. This prevents a very common mistake: accidentally overwriting raw data with cleaned data and losing the ability to start over. A beginner-friendly layout looks like this:
Adopt naming rules that reduce ambiguity. Prefer lowercase, hyphen/underscore, and date stamps only when needed. For example, data/raw/titanic.csv is better than FinalData(2).csv. For outputs, include enough context to identify what produced them: outputs/tables/baseline_metrics_v1.csv or outputs/figures/roc_curve_logreg.png.
Engineering judgement: keep folders stable, but allow content to evolve. When you change the cleaning logic, regenerate data/processed/ from code rather than editing files manually. If you must inspect or sanity-check in a spreadsheet, do it on a copy and record that it was only for inspection. Your project should make the “right thing” the easy thing.
Beginners often choose tools based on comfort, then wonder why results won’t reproduce. Reproducibility improves when you match the tool to the job and keep the boundary clear between exploration and repeatable execution.
Notebooks (Jupyter, Colab) are ideal for exploration, quick plots, and learning. They are also where reproducibility goes wrong: running cells out of order, hidden state in memory, and “it worked on my machine” dependencies. Use notebooks to discover the workflow and to run a first analysis successfully, but aim to make the final workflow runnable from top-to-bottom in one go. Practical habit: use “Restart kernel and run all” (or the equivalent) before you trust outputs.
Scripts (Python files you run from a terminal) are better for repeatability. A script starts from a clean state every run, and it’s easy to record parameters and create logs. A good pattern is: explore in notebooks/, then copy stable steps into src/ (for example, src/clean.py, src/train.py, src/evaluate.py). This does not require advanced software engineering—just a separation between “thinking space” and “execution space.”
Spreadsheets are useful for viewing data, filtering to understand columns, and making quick sanity checks. They are dangerous for cleaning because changes can be invisible and hard to replay. If you do any editing in a spreadsheet, you must treat it like a manual transformation and document it precisely (what changed, why, and how to redo it). In most reproducible projects, spreadsheets should be “read-only viewers.”
The pattern in practice: clean raw data into data/processed/; train and evaluate consistently; save outputs to outputs/. Common mistake: leaving critical steps only in notebook cells with no clear execution order. If a step matters to the final result, it should be runnable end-to-end without manual clicks or remembering "which cell I ran last time."
Your “runtime” is the place where code executes: Python version, libraries, and the operating system environment. For beginners, the goal is not to build a perfect DevOps system—it is to pick one runtime option and make it consistent.
Two beginner-friendly paths work well: a hosted notebook service such as Google Colab (nothing to install, but record the runtime's Python and library versions), or a local Python environment you control.
If you choose local Python, a practical baseline is: install Python (from python.org or via conda), create a project-specific environment, and install dependencies into it. The key reproducibility rule is: one project, one environment. Don’t reuse a global Python setup across unrelated projects.
To run a first analysis successfully, choose one “entry point” and make it boring. For example:
a notebook, notebooks/01_baseline.ipynb, that loads data/raw/..., creates data/processed/..., trains a baseline model, and writes metrics to outputs/tables/; or a command, python -m src.run_baseline, that does the same steps and prints where outputs are saved. Engineering judgement: prefer fewer moving parts. Avoid "clever" solutions early (complex pipelines, many configuration layers). At beginner level, repeatability comes from clarity: one environment, one dataset location, one command to run.
Common mistakes include installing packages in the wrong environment (then wondering why imports fail), mixing different Python versions, or hard-coding file paths like C:\Users\Name\Desktop\data.csv. Instead, build paths relative to the project folder (e.g., data/raw/data.csv) so another person can run it on their machine.
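Relative paths can be handled cleanly with `pathlib` from the standard library. This is a sketch, assuming the folder layout used in this chapter; the `__file__` anchor works in scripts, while notebooks fall back to the current working directory.

```python
from pathlib import Path

# Build paths relative to a single project root instead of hard-coding
# absolute paths like C:\Users\Name\Desktop\data.csv.
# In a script, anchor on the file's own location; in a notebook, use cwd.
PROJECT_ROOT = Path(__file__).resolve().parent if "__file__" in globals() else Path.cwd()

RAW_DATA = PROJECT_ROOT / "data" / "raw" / "data.csv"
METRICS_OUT = PROJECT_ROOT / "outputs" / "tables" / "baseline_metrics_v1.csv"

# The structure below the root is identical on every machine;
# only the root itself differs.
print(RAW_DATA)
```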
When a result changes, the first question is: “What changed?” Capturing versions turns that from detective work into a quick check. In reproducible AI work, you typically need to track three categories: the dataset version, the code version, and the environment version.
Data version: If your dataset is downloaded from a public source, record the URL, access date, and any dataset version identifier provided by the host (release number, DOI, commit hash, or “last updated” date). Save the raw file under data/raw/ and do not edit it. If the dataset is large and you don’t store it in your repo, store a small text file like notes/data_source.txt with the exact download instructions. For extra safety, record a checksum (e.g., SHA256) so you can confirm later that the file is identical.
Library versions: Your code might rely on specific behavior of pandas, scikit-learn, PyTorch, etc. Capture installed versions in a requirements file. In Python, common approaches are:
a pinned requirements.txt (exact versions), or a conda environment.yml that includes the Python version and packages. Pinning versions is a judgement call. For learning projects, pin the major libraries at least (e.g., pandas==..., scikit-learn==...) so a rerun next month doesn't silently change behavior. If you leave versions unpinned, you are implicitly accepting "latest" behavior, which is often the opposite of reproducibility.
Operating system and hardware: For most beginner CPU-based studies, OS details are enough. Record: OS name/version, Python version, and whether you used CPU or GPU. Some models (especially deep learning) can vary across hardware and GPU libraries. Even if you can’t fully control that yet, writing it down makes differences explainable.
Practical outcome: create a small “run record” file in outputs/logs/ each time you run the baseline—date/time, git commit (if using git), dataset identifier, and package versions. This turns your project into a series of traceable runs rather than a single mysterious result.
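A run record can be as simple as a small dictionary written to JSON. This is a minimal sketch using only the standard library; the field names and example values are illustrative, and you would add a git commit hash if you use git.

```python
import json
import platform
import sys
from datetime import datetime, timezone

def build_run_record(dataset_id, seed, metrics):
    """Collect the facts needed to trace a run later: when it ran,
    on what, and with which settings."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "python_version": sys.version.split()[0],
        "os": f"{platform.system()} {platform.release()}",
        "dataset_id": dataset_id,
        "seed": seed,
        "metrics": metrics,
    }

record = build_run_record(
    dataset_id="credit_default_raw_2026-03-28.csv",  # your frozen file
    seed=42,
    metrics={"accuracy": 0.81},  # illustrative value
)
print(json.dumps(record, indent=2))
# In a real run, write this to outputs/logs/ with a timestamped filename.
```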
Many AI workflows involve randomness: splitting data into train/test sets, shuffling rows, initializing model weights, sampling mini-batches, or using randomized algorithms. If you rerun the same code and get slightly different results, it is often not a bug—it’s an uncontrolled random process.
A seed is a starting point for a pseudo-random number generator. If you set the same seed and keep other conditions stable, you usually get the same “random” sequence again. In practice, you should set seeds in all libraries you use (for example, Python’s random, NumPy, and your ML framework) and pass a random_state or equivalent parameter to functions like train/test split.
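The seeded-split idea can be demonstrated with the standard library alone. This sketch uses a local random generator so no global state leaks between steps; with NumPy you would use `np.random.default_rng(seed)`, and with scikit-learn you would pass `random_state=seed` to `train_test_split` and to the model.

```python
import random

SEED = 42  # keep the seed in one visible place

def make_split_indices(n_rows, test_fraction, seed):
    """Deterministically shuffle row indices and split them.
    With the same seed and n_rows, the split is identical every run."""
    rng = random.Random(seed)          # local generator: no hidden state
    indices = list(range(n_rows))
    rng.shuffle(indices)
    n_test = int(n_rows * test_fraction)
    return indices[n_test:], indices[:n_test]  # (train, test)

train_a, test_a = make_split_indices(100, 0.2, SEED)
train_b, test_b = make_split_indices(100, 0.2, SEED)
assert train_a == train_b and test_a == test_b  # same seed, same split
```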
However, seeds are not magic. Some operations are nondeterministic due to parallelism, GPU behavior, or low-level math libraries. That means you can do everything "right" and still see tiny metric differences, especially in deep learning. The reproducibility goal at this level is stability you can explain: identical results where your pipeline is deterministic, and small, documented variation where it is not.
Engineering judgement: decide what “repeatable” means for your study. For a baseline model in a beginner reproduction, it’s often reasonable to fix the split seed and expect the same metrics on rerun within the same environment. If your results still change, check common causes: you reshuffled data without noticing, you are reading files in a different order, you forgot to set random_state in one key function, or your notebook used cached variables from a previous run.
Practical habit: store the seed (and any other key settings like test size) in one place—either a small config section at the top of your notebook/script or a simple config file. Then print it (or log it) during the run so it becomes part of the evidence trail.
A project is not reproducible until another person can run it without asking you questions. The simplest test is: imagine you are helping “future you” three months from now. A one-page “how to run” guide (typically README.md) is the difference between a reusable study and an abandoned folder.
Your run guide should be short, concrete, and command-focused. Include:
where to place the dataset (data/raw/) and how to confirm it's correct (filename, row count, checksum if you have it); the exact command(s) to run; and what to expect afterward: which files appear in outputs/, typical runtime, and a reference metric range if available. Be explicit about "no manual steps." If a user must click around in a UI, say exactly what to click and what it should produce. Better yet, convert that step into code. Also list the most common failure modes and fixes: missing file path, environment not activated, package install errors, or trying to run from the wrong working directory.
Practical outcome: once your README exists, use it yourself immediately. Close your notebook, restart your runtime, and follow your own instructions exactly. If you can’t follow them, nobody else can. This self-check is the fastest way to catch hidden assumptions—and it’s the moment your workspace becomes truly repeatable.
1. According to Chapter 3, what most often blocks reproducing an AI result?
2. What is the chapter’s core goal for your workspace?
3. Which practice best turns “guesswork” into explicit, repeatable choices?
4. What problem is the chapter trying to prevent when it asks whether you ran a notebook top-to-bottom or skipped a cell?
5. Which outcome best matches “a project folder that explains itself” in Chapter 3?
This chapter is where reproducibility becomes real: you will take an open dataset and run the same cleaning steps every time, split the data fairly, train a baseline model, and produce a small set of outputs you can use in a report. A “baseline” is not a fancy model; it is a trustworthy starting point that proves your pipeline works and gives you a reference number to beat later. Many failed reproductions are not caused by advanced statistics—they fail because the workflow is inconsistent: a cleaning step is applied once but not again, a random split changes each run, or results are not saved in a way that allows later verification.
To keep this chapter practical, think in terms of a repeatable run: you start from raw data, produce a cleaned dataset, split it, fit a simple model, compute plain-language metrics (like accuracy or error), and save 2–3 clear charts/tables plus a small results file. If you can do that twice on your machine and get the same answers, you are already practicing strong AI reproducibility.
Engineering judgment matters here: you will make small choices (how to handle missing values, what to do with outliers, which features to include). Reproducibility does not mean “no choices.” It means the choices are explicit, justified, and applied consistently so another person (or future-you) can follow the same path and arrive at the same results.
Practice note for Apply the same cleaning steps every time: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Split data into training and testing fairly: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Train a baseline model and get results: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Create 2–3 clear charts/tables for the report: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Confirm the run is repeatable on your machine: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
A reproducible study starts with a clear mapping from the research question to what goes into the model (inputs) and what comes out (outputs). Beginners often begin by “loading a dataset and trying models,” but a study is easier to recreate when you can answer three concrete questions before writing much code.
1) What is the unit of prediction? Is each row a person, a transaction, a day, or a document? Your unit determines how you split and how you interpret metrics. If each row is a patient visit, you must be careful not to put the same patient in both train and test (that would inflate performance).
2) What is the target? This is the outcome you want to predict or explain. For a classification study, the target is a category (e.g., “spam” vs “not spam”). For regression, it is a numeric value (e.g., house price). Write it down exactly as the column name and define allowable values. If the target requires derivation (e.g., “late payment” from a due date and payment date), implement the derivation in code and treat it as part of your pipeline.
3) What counts as an input? Inputs are the columns you allow the model to use. A common reproducibility mistake is accidentally including “future information” that would not be available at prediction time (for example, including “discharge outcome” when predicting “risk at admission”). Decide what information is realistically available at the moment of prediction, and restrict inputs accordingly.
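The derived-target idea from point 2 above can be sketched with a tiny, hypothetical "late payment" rule. The function name, date columns, and grace period are all illustrative, not taken from any reference study; the point is that the derivation lives in code so every run labels rows identically.

```python
from datetime import date

def is_late_payment(due, paid, grace_days=0):
    """Derive a binary target from two date columns.
    'grace_days' is an explicit, documented modeling choice."""
    return (paid - due).days > grace_days

# One row: due April 1, paid April 5 -> late (with no grace period)
assert is_late_payment(date(2024, 4, 1), date(2024, 4, 5)) is True
assert is_late_payment(date(2024, 4, 1), date(2024, 4, 1)) is False
```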
Finally, define the outputs you will produce each run: the cleaned dataset (or a summary), the split definition (or seed), model parameters, metrics, and a small set of figures/tables. When you can list these outputs up front, it becomes much easier to confirm later that your run is repeatable on your machine.
Cleaning is where many reproductions quietly break. Someone filters out rows “that look wrong,” changes a column type by hand, or tests multiple missing-value strategies without recording which one was used. The rule of thumb: if it changes the data, it must be written as a deterministic step that can be rerun.
Start by categorizing common issues: missing values, inconsistent formats, duplicates, impossible values (negative ages), and outliers. For each category, define a rule and a record.
A practical pattern is a “cleaning log” produced automatically each run. For example: total rows at load; rows dropped due to missing target; number of duplicates removed; per-column missingness before/after; summary stats for key numeric columns. Save this as a small CSV or JSON next to your results. That log turns cleaning into something reviewable instead of mysterious.
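The automatic cleaning log can be sketched in plain Python. This toy version works on a list of dicts so it is self-contained; in a pandas pipeline the same counts come from DataFrame operations, and the column name "label" is just an example.

```python
import json

def clean_with_log(rows, target_col):
    """Drop rows with a missing target and exact duplicates, returning
    the cleaned rows plus a log of exactly what changed."""
    log = {"rows_loaded": len(rows)}
    kept = [r for r in rows if r.get(target_col) is not None]
    log["dropped_missing_target"] = len(rows) - len(kept)
    seen, deduped = set(), []
    for r in kept:
        key = tuple(sorted(r.items()))  # hashable fingerprint of the row
        if key not in seen:
            seen.add(key)
            deduped.append(r)
    log["duplicates_removed"] = len(kept) - len(deduped)
    log["rows_after_cleaning"] = len(deduped)
    return deduped, log

data = [
    {"age": 34, "label": 1},
    {"age": 34, "label": 1},      # exact duplicate
    {"age": 51, "label": None},   # missing target
]
cleaned, log = clean_with_log(data, "label")
print(json.dumps(log))  # save this next to your results, e.g. in outputs/logs/
```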
Two important reproducibility judgments: choose one missing-value strategy (drop or impute) and record it, and set any outlier threshold explicitly, in writing, before removing rows.
Common mistakes include silently converting strings to numbers with errors becoming missing values, “fixing” outliers by deleting them without a documented threshold, and applying different cleaning steps depending on what the model needs. Your goal this chapter is simple: apply the same cleaning steps every time, and be able to prove what changed.
A feature is just a piece of information you feed to a model to help it make a prediction. If your dataset row is a single house, then “number of bedrooms” is a feature. If your row is an email, then “contains the word ‘free’” could be a feature. Thinking this way keeps you grounded: features are not mystical—they are measurable signals.
For reproducibility, you want features to be (1) well-defined, (2) computed the same way every run, and (3) available at prediction time. In beginner projects, most features come directly from columns after light preparation: turning categories into numbers, filling missing values, and scaling numeric columns if needed.
Three practical feature types you will meet often: numeric columns used directly (age, price), categorical columns encoded as numbers (city, product type), and simple derived flags (such as "contains the word 'free'").
A beginner-friendly guideline is: start with minimal feature engineering. The goal of Chapter 4 is not to win a benchmark; it is to establish a clean baseline that someone else can reproduce. Choose a small, justified set of features, document why you included them, and avoid “target leakage” features that act like hidden answers (for example, a column that is derived from the target or recorded after the outcome happens).
Also make feature decisions stable. If you drop columns, list them explicitly. If you create new columns, name them consistently and keep the transformation code in one place. This will make your later comparison to a reference study much easier, because you can explain differences in results as differences in features or preprocessing—not as accidental drift.
The train/test split is a fairness rule: the model must be evaluated on data it did not learn from. From first principles, training is “practice,” testing is “exam.” If the exam questions leak into practice—directly or indirectly—your score becomes meaningless.
A reproducible split has two properties: it is appropriate for the data structure and it is repeatable. Repeatable usually means fixing a random seed and recording it. Appropriate means choosing a split strategy that matches how the data is generated.
Leakage is not only about the split function. It also happens when preprocessing is fit on the full dataset. If you compute scaling parameters, imputation values, or category vocabularies using all rows, you have indirectly used test information. The safe mental model is: everything that “learns” parameters must be fit on training data only, then applied to test data unchanged.
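The fit-on-training-only rule can be shown with manual standardization, using only the standard library. This is a sketch of the principle, not a replacement for scikit-learn, whose pattern is the same idea: `scaler.fit(X_train)` then `scaler.transform(X_test)`.

```python
from statistics import mean, stdev

def fit_scaler(train_values):
    """'Learn' the scaling parameters from training data only."""
    return mean(train_values), stdev(train_values)

def apply_scaler(values, mu, sigma):
    """Apply the already-fitted parameters unchanged to any split."""
    return [(v - mu) / sigma for v in values]

train = [10.0, 12.0, 14.0, 16.0]
test = [11.0, 20.0]

mu, sigma = fit_scaler(train)                 # fit: training rows only
train_scaled = apply_scaler(train, mu, sigma)
test_scaled = apply_scaler(test, mu, sigma)   # transform: no refitting
# Refitting on the test set (or on train+test combined) would leak
# test information into preprocessing.
```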
To confirm the split is fair, save basic diagnostics: counts of rows in train/test, target distribution in each, and (if grouped) counts of unique groups. This makes it easier to debug when your reproduced score differs from a reference: sometimes the difference is simply a different split strategy or random seed.
Finally, write the split in code and store the seed (or even the exact row IDs for train/test) as an output. That is how you “lock” the exam so you can take it again later and compare results honestly.
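Saving the exact row IDs can be a few lines of JSON I/O. The IDs and filename below are illustrative; in a real run the IDs come from your split step and the file belongs under outputs/.

```python
import json

def save_split(path, seed, train_ids, test_ids):
    """Persist the exact row IDs of a split so the 'exam' can be
    retaken later, even if a library's shuffling behavior changes."""
    with open(path, "w") as f:
        json.dump({"seed": seed, "train_ids": train_ids, "test_ids": test_ids}, f)

def load_split(path):
    with open(path) as f:
        return json.load(f)

# Illustrative IDs; use your real index values.
save_split("split_ids.json", seed=42, train_ids=[0, 2, 3, 5], test_ids=[1, 4])
split = load_split("split_ids.json")
assert set(split["train_ids"]).isdisjoint(split["test_ids"])  # no overlap
```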
A baseline model answers: “What performance do we get with a simple, standard approach?” It is your first reproducible checkpoint. If your baseline cannot be reproduced, adding complexity will not help—your pipeline is unstable.
Start with two baselines: a trivial baseline that ignores the inputs (predict the majority class for classification, or the mean value for regression), and a simple standard model such as logistic regression or linear regression.
Use beginner-friendly metrics and explain them in plain language: accuracy is the share of predictions that were correct; error (for regression) is the typical size of the gap between predicted and true values.
Report metrics on the test set only, and keep training metrics separate. A common mistake is to tune decisions (features, thresholds, cleaning) based on test performance. If you iterate, either keep a separate validation split or use cross-validation—but for beginners recreating a simple study, a fixed train/test split plus a baseline is often enough.
Interpretation is part of reproducibility. Record: model type, key hyperparameters (even if default), feature list, and the exact metric computation. Save the raw predictions alongside the true values for the test set. If your results differ from a reference, you can compare prediction-by-prediction rather than arguing over a single number.
A reproducible run leaves artifacts behind: files that prove what happened and can be rechecked. This section turns your work into a small “report bundle” that you can regenerate on demand.
Create 2–3 clear charts/tables that match the story of the study: for example, a chart of the target distribution, a summary table of the cleaned data, and a table of final test metrics.
Save these outputs with stable filenames (e.g., fig_target_distribution.png, table_data_summary.csv, metrics.json). Also save a single machine-readable results file that includes: dataset version or download date, cleaning log counts, split seed/strategy, model name and parameters, and final test metrics. When you later compare your work to a reference, you can line up these fields and quickly spot where differences originate.
To confirm the run is repeatable on your machine, rerun the full pipeline from scratch at least once: delete intermediate outputs (or use a clean output directory), run again, and verify that metrics and saved artifacts match. If they do not match, look for hidden randomness (unfixed seeds), nondeterministic operations, or steps that depend on row order. Reproducibility is not a slogan; it is demonstrated by identical outputs from the same inputs.
When this chapter is complete, you should be able to point to a folder and say: “Everything needed to verify my cleaning, split, baseline model, and results is here, and I can regenerate it.” That is the foundation for recreating a study end-to-end.
1. In Chapter 4, what is the main purpose of training a baseline model?
2. Which workflow best matches the chapter’s definition of a repeatable run?
3. Why does the chapter stress applying the same cleaning steps every time and writing them in code?
4. What does splitting data into training and testing 'fairly' primarily aim to prevent?
5. According to the chapter, what does reproducibility mean when you must make engineering choices (e.g., missing values, outliers, features)?
You ran the pipeline, got a score, and now you’re staring at a reference number from the original study (or a tutorial notebook) that doesn’t quite match yours. This is the moment where beginners often assume they “did it wrong.” In real reproducibility work, small differences are normal—and your job is to diagnose them methodically, not emotionally. This chapter gives you a practical workflow for comparing results to a reference, running checks to catch mistakes early, explaining why numbers differ without panic, and improving stability with small, safe adjustments.
Think like an engineer and a researcher at the same time: you want the analysis to be correct, and you also want it to be repeatable. “Correct” here means your data processing and evaluation are faithful to the intended design; “repeatable” means you can rerun the same steps tomorrow and get the same outputs (or the same distribution of outputs, when randomness is part of the method). We’ll move from defining what a “match” is, to quick sanity checks, to common failure points, to logging changes, and finally to a clean, locked “final run” you can archive or share.
As you read, keep your own reproduction attempt open. You’ll get the most value by applying each section immediately: compare one metric, run one check, fix one discrepancy, rerun, and record what changed. This habit—small, verified steps—is what makes reproducibility approachable.
Practice note for Compare your results to a reference result: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Run checks to catch mistakes early: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Explain why numbers differ without panic: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Improve stability with small, safe adjustments: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Before debugging, define what you are trying to match. In beginner projects, people often aim for a single number (e.g., accuracy = 0.873). But studies usually report results under a particular evaluation setup: a specific split, metric definition, averaging method, and sometimes multiple runs. If any of those differ, your number can differ while your work is still valid.
A practical definition of “match” depends on context. If the reference uses a fixed train/test split and deterministic training, you should match very closely (often identical up to rounding). If the reference uses cross-validation, random initialization, or stochastic training, you should match within a reasonable tolerance (e.g., within a couple of tenths of a percent for accuracy, or within a small margin for RMSE). When the reference reports a mean and standard deviation across runs, your goal is to land in that range, not to hit the exact mean on the first try.
Start your comparison by aligning three things: (1) the dataset version and preprocessing steps, (2) the evaluation protocol, and (3) the metric definition. For example, “F1 score” can mean micro, macro, weighted, or binary F1 with a specific positive class; “AUC” requires a probability score, not hard labels. If the paper uses stratified splitting and you used a random split, you should expect drift—especially with imbalanced classes.
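How much the metric definition matters can be shown with a small, self-contained F1 implementation. This is a sketch for intuition, not a replacement for `sklearn.metrics.f1_score`; the toy labels are made up.

```python
def binary_f1(y_true, y_pred, positive=1):
    """F1 for one chosen positive class. Which class counts as
    'positive' (and whether F1 is binary, macro, micro, or weighted)
    changes the number, so record the exact definition you used."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == p == positive)
    pred_pos = sum(1 for p in y_pred if p == positive)
    true_pos = sum(1 for t in y_true if t == positive)
    if pred_pos == 0 or true_pos == 0:
        return 0.0
    precision = tp / pred_pos
    recall = tp / true_pos
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1]
# Swapping the positive class gives a different score from the very
# same predictions -- one common source of "mismatched" metrics.
print(binary_f1(y_true, y_pred, positive=1))  # ~0.667
print(binary_f1(y_true, y_pred, positive=0))  # 0.5
```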
Write down your target: “match within ±0.5% accuracy” or “match within the reported 95% CI.” This prevents endless tinkering and keeps you focused on reproducibility rather than perfection.
Sanity checks are the fastest way to catch mistakes early—before you waste time interpreting a misleading score. They are simple, mechanical checks that confirm your data and labels look like they should after every major step (load → clean → split → transform → train). The goal is to detect “obviously wrong” states: empty datasets, duplicated rows, swapped labels, or incorrect scaling.
Start with counts: number of rows, number of columns, number of missing values per column, and the class distribution (for classification). Compare these to the dataset documentation and to any intermediate counts reported in the reference. A common surprise is that your cleaning step removes far more rows than intended (e.g., dropping all rows with any missing value instead of imputing).
Then check ranges and types. Numeric features should have plausible min/max values (e.g., ages shouldn’t be 350; probabilities should be between 0 and 1). Categorical columns should have expected unique values and not include “nan” as a string. After encoding and scaling, verify the transform behaved as expected (e.g., standardized columns have roughly mean 0 and standard deviation 1 on the training set only).
Spot checks are underrated: pick a single record and manually trace it through preprocessing. If a date parsing step changes “03/04/2020” to the wrong month/day order, your model may still run but will train on incorrect signals. These quick checks make debugging concrete and prevent “mystery differences” later.
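The three kinds of checks above (counts, ranges, spot checks) can be sketched in a few lines. This version uses only the standard library so it runs anywhere; the records and column names ("age", "label") are illustrative stand-ins for your real data.

```python
from collections import Counter

# Illustrative records after a load/clean step.
rows = [
    {"age": 34, "label": "yes"},
    {"age": 51, "label": "no"},
    {"age": 29, "label": "yes"},
]

# 1) Counts: rows, missing values, class distribution.
assert len(rows) > 0, "dataset is empty"
missing_age = sum(1 for r in rows if r["age"] is None)
label_counts = Counter(r["label"] for r in rows)

# 2) Ranges: plausible min/max for numeric features.
ages = [r["age"] for r in rows if r["age"] is not None]
assert 0 <= min(ages) and max(ages) <= 120, "implausible age values"

# 3) Spot check: print one record and trace it by hand through each step.
print(rows[0], missing_age, dict(label_counts))
```

Rerun the same checks after every major pipeline stage, comparing the counts to the dataset documentation and any intermediate counts in the reference.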
When your result differs from the reference, the cause is often mundane. Reproducibility failures usually cluster into a small set of issues: file paths and data versions, dependency versions, randomness and nondeterminism, and data leakage. Build a habit of checking these in a consistent order.
Paths and data versions: confirm you are reading the dataset you think you are. If you downloaded a “latest” CSV but the reference used an older snapshot, counts and distributions can shift. Ensure your code uses explicit paths (or a config variable) and prints the resolved path and file hash. A silent path bug—loading a cached file from a different folder—is very common.
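Printing the resolved path and a file hash at load time is a one-function habit. A minimal sketch with the standard library (the dataset path shown is a hypothetical example):

```python
import hashlib
from pathlib import Path

def file_sha256(path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

# Log both at load time so a silent path bug (reading a cached copy
# from another folder) is visible immediately.
data_path = Path("data/raw/dataset.csv")  # illustrative path
# print(data_path.resolve(), file_sha256(data_path))
```

If the hash you log today differs from the one in your report, you are not reading the same file, and no amount of metric debugging will fix that.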
Library versions: small changes in scikit-learn, pandas, numpy, or a deep learning framework can change defaults (solver choices, regularization interpretation, tokenization behavior). Record your environment (e.g., a requirements.txt or conda env export) and compare it to the reference. If you must upgrade, note it and expect some drift.
Randomness and nondeterminism: if you split data randomly, initialize model weights randomly, shuffle batches, or use GPU operations, you can get different results run-to-run. Set random seeds in all relevant places and keep them in your log. Also note that some operations are not fully deterministic across hardware; in that case, aim for a stable distribution rather than identical values.
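"Set seeds in all relevant places" looks like this in practice. The sketch seeds Python's built-in RNG and notes, in comments, where the common library RNGs would be seeded if you use them (those lines are indicative, not exhaustive):

```python
import random

SEED = 42  # keep the seed in your config and your run log

random.seed(SEED)  # Python's built-in RNG
# If you use them, also seed the library RNGs, for example:
#   numpy:   np.random.seed(SEED) or rng = np.random.default_rng(SEED)
#   sklearn: pass random_state=SEED to splitters and models
#   torch:   torch.manual_seed(SEED)

# With the seed fixed, a "random" draw repeats run-to-run.
first_draw = random.random()
random.seed(SEED)
assert random.random() == first_draw
```

Note that seeding does not guarantee bit-identical results across different hardware or library versions; it only removes one major source of variation.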
Data leakage: this is a correctness issue, not just a reproducibility issue. Leakage happens when information from the test set sneaks into training—often via preprocessing. Examples include fitting a scaler on the full dataset instead of only the training set, selecting features using the full label set, or performing imputation using global statistics. Leakage can make your score “too good” and also make it differ from the reference if they avoided leakage.
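The scaler example is the clearest case of leakage, and it is easy to show in miniature. This standard-library sketch (with made-up values) fits the scaling statistics on the training split only, then applies the same statistics to both splits:

```python
from statistics import mean, stdev

# Illustrative feature values; the split is assumed already made.
train_vals = [2.0, 4.0, 6.0, 8.0]
test_vals = [5.0, 10.0]

# Fit the scaler on the TRAINING data only...
mu, sigma = mean(train_vals), stdev(train_vals)

# ...then apply the same statistics to both splits. Recomputing mu and
# sigma on train + test together would leak test-set information.
train_scaled = [(v - mu) / sigma for v in train_vals]
test_scaled = [(v - mu) / sigma for v in test_vals]
```

The same fit-on-train-only rule applies to imputation statistics, encoders, and feature selection; scikit-learn's Pipeline exists largely to enforce it automatically.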
When troubleshooting, ask: “Could this difference be explained by (1) different inputs, (2) different code, (3) different randomness, or (4) a bug or leakage?” That framing keeps you calm and systematic.
Once you start fixing differences, you need a lightweight change-tracking system. Without it, you will lose track of what you changed, why the score moved, and whether you can reproduce your own best run. You do not need a heavy process—just a simple log plus consistent reruns.
Create a plain text file (e.g., repro_log.md) and record every intentional change as an entry. Each entry should include: date/time, what changed, why you changed it, and the resulting key metrics. Keep it short but specific. For example: “Changed train/test split to stratified with seed=42 to match reference; accuracy moved from 0.861 to 0.872.” This transforms debugging from guesswork into an experiment history.
Pair the log with a consistent rerun procedure. Decide what “one run” means: delete derived artifacts (or not), rerun the pipeline from raw data, and recompute metrics. If your pipeline caches intermediate files, confirm the cache is either cleared or properly invalidated when upstream steps change. Many confusing differences come from stale artifacts (e.g., an old preprocessed dataset reused after you changed cleaning rules).
This approach also helps you “explain differences without panic.” When you can point to a logged change that explains a metric shift, the work becomes controlled and teachable—exactly what reproducibility is about.
Even when you do everything “right,” results can differ because many modeling pipelines are sensitive to small choices. Sensitivity analysis is the practice of checking how robust your outcome is to reasonable variations. For beginners, this does not need to be statistical heavy lifting—just a few targeted reruns that reveal whether your result is stable or fragile.
Start with the choices that commonly matter most: the train/test split (or CV folds), preprocessing parameters, and model hyperparameters. If changing the random seed swings accuracy by 5 percentage points, your model is not stable under the current setup, and matching a single reference number becomes less meaningful. In that situation, compare ranges or averages across multiple seeds instead.
Next, check preprocessing sensitivity. Examples: imputation strategy (mean vs. median), categorical encoding (one-hot vs. target encoding), text tokenization rules, or outlier handling. These are “small, safe adjustments” you can make while keeping the overall study design intact. The key is to vary one decision at a time and keep everything else fixed.
Sensitivity results help you interpret differences intelligently. If the reference score sits comfortably inside your measured variation, you have likely reproduced the study in spirit. If it lies far outside, focus on protocol alignment or bugs rather than endlessly tuning.
Once you are satisfied that your results match the reference within your defined tolerance—and you understand any remaining differences—finish with a “final reproducible run.” This is a clean, documented execution of the entire pipeline that you can rerun later and get the same outputs. Think of it as packaging your work for your future self (or a reviewer).
First, lock the inputs and environment. Use an exact dataset snapshot (store the file, record its URL and hash, or use a versioned data source). Freeze your dependencies with a lockfile (e.g., requirements.txt with pinned versions) and record the Python version and OS basics. If you rely on GPU behavior, note the hardware and driver/CUDA versions as well, since nondeterminism can appear there.
Second, lock the run configuration: put seeds, split parameters, preprocessing options, and model hyperparameters into a single config file. Avoid “magic numbers” scattered in notebooks. Then produce a single command (or notebook cell sequence) that runs from raw data to metrics without manual intervention.
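One way to gather the seeds, split parameters, and hyperparameters into a single file is a small JSON config that every stage reads. The keys and values below are illustrative, not a required schema:

```python
import json
from pathlib import Path

# One config file holds every run setting; names are illustrative.
config = {
    "seed": 42,
    "split": {"test_size": 0.2, "stratify": True},
    "preprocess": {"impute": "median", "scale": "standard"},
    "model": {"type": "logistic_regression", "C": 1.0},
}
Path("config.json").write_text(json.dumps(config, indent=2))

# Every pipeline stage loads the same file instead of hard-coding values.
cfg = json.loads(Path("config.json").read_text())
```

Because the config is a plain file, it can be versioned, diffed between runs, and copied into the output folder alongside the metrics it produced.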
Third, lock the outputs. Save (1) metrics in a machine-readable format (JSON/CSV), (2) a confusion matrix or error breakdown, (3) model artifacts if appropriate, and (4) a small table of predictions with IDs so you can compare across runs. Name outputs with a run ID (timestamp or git commit hash) and keep them in a predictable folder structure.
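Tagging outputs with a run ID and writing metrics in a machine-readable format can look like this (a timestamp is used as the run ID here; a git commit hash works equally well, and the metric values are placeholders):

```python
import json
import time
from pathlib import Path

# Tag every artifact with a run ID and keep a predictable folder layout.
run_id = time.strftime("%Y%m%d-%H%M%S")
out_dir = Path("outputs") / run_id
out_dir.mkdir(parents=True, exist_ok=True)

# Illustrative metric values; write JSON so later runs can be diffed.
metrics = {"accuracy": 0.872, "f1_macro": 0.858, "seed": 42}
(out_dir / "metrics.json").write_text(json.dumps(metrics, indent=2))
```

Predictions with IDs, confusion matrices, and figures go in the same run folder, so any number in your report can be traced to exactly one run.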
This final step converts your reproduction from “I got a similar number once” into a repeatable, auditable result. That is the practical outcome of reproducibility: not just matching the reference, but being able to explain, rerun, and trust your workflow.
1. When your reproduced score doesn’t exactly match the reference result, what is the most appropriate next step?
2. In this chapter, what does “repeatable” mean for a pipeline run?
3. Which pairing best reflects the chapter’s guidance to “think like an engineer and a researcher at the same time”?
4. What workflow habit does the chapter recommend to make reproducibility approachable?
5. According to the chapter, why should you avoid panicking when numbers differ from a reference?
Reproducing a study is only “finished” when another person can rerun your work and understand what you did, why you did it, and what the results mean. This chapter turns your notebook-and-files into a reproducibility report and a shareable bundle. The goal is not to impress with complexity; it is to remove ambiguity. A good reproducibility report reads like a set of careful instructions, backed by evidence, with honest caveats about what the work does and does not prove.
In practice, you will write a methods section that someone can follow without guessing, summarize results in plain language, and package everything needed to rerun (or at least re-check) your analysis. You will also add citations and licenses so your work is shareable. Think of your report as a handoff: if you stepped away, your future self—or a reviewer—could still reproduce the run.
As you write, keep engineering judgment front and center. Reproducibility is not only about “same code, same outputs.” It is also about documenting decisions: which rows you dropped, how you handled missing values, what random seed you used, and why. The most common failure mode is leaving these choices implicit. Your report is where you make them explicit.
Practice note for Write a clear methods section anyone can follow: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Summarize results with limits and honest caveats: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Package files for rerun and include citations: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Publish or submit your reproducibility bundle confidently: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Start with a predictable structure. Readers should not hunt for essentials. A simple template works well: (1) research question and reference study, (2) data source and access, (3) methods and steps to reproduce, (4) results, and (5) limitations. If you follow this order consistently, you make it easy for someone to compare your work to the original study and to rerun your pipeline.
Question. Write one paragraph defining what you are recreating. Name the original paper/blog/report, the task (e.g., binary classification), and the target metric (e.g., accuracy, AUC). Specify what “success” means: “I consider this reproduction successful if my metric is within X of the reference and the qualitative conclusion matches.” This prevents you from moving the goalposts later.
Data. Provide a crisp description: dataset name, version/date accessed, link, license, and any filters (e.g., “only U.S. records, years 2018–2020”). State how the data is obtained (download, API, Kaggle) and any credentials needed. If data cannot be redistributed, say so and provide pointers rather than including the raw file.
Steps. Give a numbered “runbook” that mirrors your folder structure: create environment, install dependencies, run preprocessing, train model, evaluate, generate figures. Include exact commands and expected runtime. A reader should be able to copy-paste.
For example: conda env create -f environment.yml; python src/make_dataset.py; python src/train.py --seed 42; python src/evaluate.py. State where outputs are written (reports/figures, models/, data/processed).
Results and limits. Separate “what happened” from “what it means.” Put metrics and plots in the results section, and reserve interpretation and caveats for limitations. A common mistake is burying limitations in footnotes; instead, treat them as a first-class section so your conclusions stay honest and reviewable.
Most reproduction gaps come from undocumented decisions rather than missing code. Your report must capture the “decision trail” for data cleaning and modeling parameters so someone can recreate your exact conditions. Think of it as translating your internal reasoning into a checklist others can audit.
Cleaning rules. Describe each transformation as a rule with a reason. Avoid vague phrases like “cleaned the data.” Instead: “Dropped rows where age is missing because the model requires numeric age; this removed 2.3% of records.” If you imputed missing values, specify the method (median, most frequent, KNN), the columns affected, and whether imputation statistics were computed on the training set only (best practice to prevent leakage).
Safe record-keeping. Report counts before and after each step: number of rows, columns, duplicates removed, invalid entries fixed. Readers should be able to sanity-check that their run matches yours. A practical way is to include a small table in the report and a machine-readable log (CSV/JSON) generated during preprocessing.
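A machine-readable preprocessing log takes only a few lines. In the sketch below, the step names and counts are illustrative; the helper appends one entry per step and writes the whole trail as JSON at the end:

```python
import json

# Append one entry after each preprocessing step so a reader can
# sanity-check their counts against yours.
log = []

def record(step: str, n_rows: int, n_cols: int) -> None:
    log.append({"step": step, "rows": n_rows, "cols": n_cols})

# Illustrative counts for a hypothetical pipeline.
record("load_raw", 10000, 14)
record("drop_missing_age", 9770, 14)   # removed 2.3% of records
record("encode_categoricals", 9770, 31)

with open("preprocess_log.json", "w") as f:
    json.dump(log, f, indent=2)
```

The same entries can be rendered as the small table in your report, so the human-readable and machine-readable records never drift apart.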
Parameter choices. Document model type, hyperparameters, random seed(s), train/validation/test split strategy, and evaluation protocol. If you used cross-validation, state the number of folds and whether it was stratified. If you used a baseline, explain why it is a reasonable baseline (e.g., logistic regression as a transparent starting point). When you diverge from the reference study (different split, different library version, different preprocessing), call it out clearly and explain the impact you expect.
Engineering judgment matters: it is acceptable to make pragmatic choices, but it is not acceptable to hide them. Your reader should never have to guess why you dropped a column or chose a threshold.
Results are communication, not just numbers. A reproducibility report should let a reader quickly answer: Did the reproduction match the reference? If not, how did it differ, and what might explain the gap? Present results in layers: a short takeaway, a small set of core metrics, and supporting visuals.
Plain-language takeaways. Start the results section with 2–4 sentences that translate metrics into meaning. Example: “The baseline model correctly identified positives 78% of the time on the held-out test set (AUC 0.84). This is within 0.02 of the reference AUC, suggesting the main conclusion reproduces.” Avoid overclaiming; describe what the metric indicates and the context (test set, split method).
Core metrics, consistently defined. Provide the metric definitions or links (especially for beginners): accuracy, precision/recall, F1, AUC, RMSE—whichever matches the study. State the decision threshold if relevant. Include confidence intervals if you can (even via bootstrap), or at minimum repeat runs with different seeds and report mean and standard deviation. This helps readers understand stability.
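A percentile bootstrap is one accessible way to attach an interval to a metric. The sketch below resamples per-example correctness on a hypothetical test set (the 78%-accuracy data is made up) and is a rough illustration, not a substitute for a proper statistical treatment:

```python
import random
from statistics import mean

# Per-example correctness on the held-out test set (1 = correct).
# Illustrative: 78% accuracy on 100 examples.
correct = [1] * 78 + [0] * 22

def bootstrap_ci(values, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    stats = sorted(
        mean(rng.choices(values, k=len(values))) for _ in range(n_boot)
    )
    lo = stats[int(n_boot * alpha / 2)]
    hi = stats[int(n_boot * (1 - alpha / 2)) - 1]
    return lo, hi

lo, hi = bootstrap_ci(correct)
print(f"accuracy = {mean(correct):.2f}, 95% CI approx [{lo:.2f}, {hi:.2f}]")
```

Reporting the interval, not just the point estimate, tells the reader how much run-to-run or sample-to-sample wobble to expect.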
Visuals that support diagnosis. Include at least one plot that helps debug differences: confusion matrix, ROC curve, calibration plot, residual plot, or feature importance for interpretable models. Label axes, include units, and specify the dataset split used. A frequent mistake is making plots that look nice but cannot be traced to a specific run; tie each figure to an output file produced by your pipeline.
Keep visuals honest: do not crop axes to exaggerate improvements, and do not show only the best run. Reproducibility is strengthened when you show the typical outcome and the variability.
A strong reproduction report is explicit about what is unknown. Limitations are not an apology; they are boundary markers that prevent misuse of your results. Separate limitations into categories so readers can act on them: data, methods, and interpretation.
Data limitations. Mention sampling bias, missingness patterns, label quality, and any filtering you applied. If the dataset represents a specific population or time period, say so. If you used a public dataset with imperfect documentation, explain which assumptions you made. Also state license constraints: whether others can redistribute, modify, or use commercially. This is part of ethical sharing.
Method limitations. Address threats like data leakage, metric mismatch with the reference, nondeterminism, and differences in library versions. If your pipeline depends on external APIs that can change, note it and pin versions where possible. If you could not reproduce a specific preprocessing step from the original study, say exactly what was missing and what substitute you used.
Interpretation limits. Be careful with claims of causality or generalization. A reproduction that matches metrics does not prove the model is “true”; it shows that under similar conditions you obtained similar outputs. Likewise, a mismatch does not automatically disprove the original result—it may indicate hidden dependencies, unclear methods, or sensitivity to seeds and splits.
Conclude the limitations section with practical next steps: what experiment would reduce uncertainty (e.g., rerun with stratified splits, test robustness across seeds, compare preprocessing variants). This turns limitations into a roadmap rather than a dead end.
Your reproducibility bundle is the deliverable that makes rerun possible. It should be organized, lightweight, and legally shareable. The key idea is to include everything needed to reproduce the workflow while respecting dataset licenses and privacy.
Recommended folder layout. A practical structure is: README.md, environment.yml or requirements.txt, src/ (scripts), notebooks/ (optional, but keep scripts as the source of truth), data/ (usually with raw/ empty and a DATA_SOURCES.md pointer), models/, reports/ (final report + figures), and outputs/ (metrics JSON/CSV).
Data pointers, not always data. If the license allows redistribution, you may include raw data. If not, include: the exact URL, dataset version, checksum if available, and the download instructions. When possible, include a small “sample” dataset that is license-safe for smoke tests, and ensure your code can run end-to-end on the sample.
Code and rerun commands. Provide one command that rebuilds everything from scratch (or as close as practical), such as make all or a single Python entry point. Pin dependencies to versions. Record system info (Python version, OS) in a reproducibility.txt file generated automatically during runs.
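Generating the reproducibility.txt file automatically is a short snippet you can call at the start of every run; only standard-library modules are needed:

```python
import platform
import sys
from pathlib import Path

# Write basic system info at the start of each run so the environment
# is recorded without relying on anyone remembering to do it.
info = "\n".join([
    f"python: {sys.version.split()[0]}",
    f"os: {platform.system()} {platform.release()}",
    f"machine: {platform.machine()}",
])
Path("reproducibility.txt").write_text(info + "\n")
```

Extend the same idea to library versions (e.g., by writing the output of pip freeze next to it) so the environment record is complete.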
Outputs and citations. Include generated metrics files and the final figures so readers can compare without rerunning immediately. Add citations for the reference study, dataset documentation, and key libraries. A CITATIONS.bib or a short “References” section in your report helps reviewers verify provenance.
This bundle is what you will publish or submit. If you treat it as a product (clear structure, clear entry point, clear licensing), you make reproducibility the default rather than the exception.
Before you publish or submit, run a final “cold start” test: pretend you are a stranger. Clone your project into a new folder (or a clean environment), follow your README exactly, and see where you get stuck. Every confusion point you hit is a documentation bug you can fix immediately.
Final checklist. Use a short, practical checklist to build confidence: dataset snapshot (or URL plus hash) recorded; dependencies pinned; one command reruns the pipeline from raw data to metrics; seeds and settings live in a single config file; outputs are saved with a run ID; the report covers question, data, steps, results, and limitations; licenses and citations are included.
Publishing options. Choose a sharing route that matches your audience: a GitHub repository for open collaboration, an OSF project for research workflows, a course submission portal, or an institutional repository. Include a short release note describing what the bundle contains and how to cite it.
Next-step replications. Once the baseline reproduction works, extend responsibly: run robustness checks across multiple random seeds, compare two preprocessing variants, test a second baseline model, or replicate on a related dataset to probe generalization. Keep the same reporting discipline: document decisions, present results plainly, and update limitations. Reproducibility is a habit—your report and bundle are the start of a reusable research workflow.
1. What does Chapter 6 describe as the point when a reproduction is truly “finished”?
2. What is the primary goal of a reproducibility report according to the chapter?
3. Which best describes what the methods section should enable a reader to do?
4. What is a key part of summarizing results in a reproducibility report?
5. What common failure mode does the chapter warn about, and how does the report address it?