Reproducible AI for Beginners: Replicate a Study & Report Changes

AI Research & Academic Skills — Beginner

Repeat an AI study step-by-step and explain exactly what changed.

Beginner reproducible-ai · replication · ai-research · beginner-friendly

Reproducible AI, explained for absolute beginners

AI results often look solid in a paper, a blog post, or a demo—until you try to run the same steps yourself. This course teaches reproducible AI from first principles, with no coding background required. You will learn how to repeat (replicate) a small AI study and report exactly what stayed the same, what changed, and why that happened.

Instead of overwhelming you with advanced tools, we focus on a simple, reliable workflow: choose a safe target study, set up a clean workspace, run a baseline, make controlled changes, compare outcomes, and write a short replication report. You finish with a clear project you can show in school, at work, or in a research setting.

What you will do in this course

This is a book-style course with six chapters that build in a straight line. Each chapter includes milestones that move your project forward.

  • Pick a small study to replicate and define what “success” means for your replication
  • Set up a repeatable folder structure and a run log so you can track every attempt
  • Run a baseline replication and save the evidence you’ll need later
  • Change one factor at a time (like randomness, data handling, or versions) and observe what moves
  • Compare results across runs and turn differences into clear findings
  • Write a short replication report and package your work so others can rerun it

Why beginners struggle with replication—and how we make it easier

Beginners often think replication fails only because they “did something wrong.” In reality, AI studies can change because of randomness, small differences in data, hidden defaults in tools, or missing details in the original description. You will learn a calm, practical way to troubleshoot without guesswork: record what you did, change one thing at a time, and keep outputs organized so comparison is easy.

We also teach you how to communicate uncertainty responsibly. Sometimes you can’t fully replicate a study due to missing data, unclear steps, or tool constraints. That is not failure—if you document it well, it becomes a useful result that others can learn from.

Who this is for

This course is for absolute beginners: students, professionals, and curious learners who want a concrete way to understand AI research claims. It also fits teams in business or government who need repeatable internal experiments and clear audit trails.

How to get started

You can begin immediately and build your replication project step by step. If you’re ready to start, register for free; if you’d like to compare options first, you can also browse all courses.

What you will have at the end

By the final chapter, you will have a beginner-friendly replication package: a clean folder structure, a run log, saved outputs, a comparison table, and a short report that states what changed and what likely caused it. This is a practical foundation for future AI study, better academic writing, and more trustworthy AI work.

What You Will Learn

  • Explain what “reproducible AI” means in plain language and why it matters
  • Pick a small, beginner-friendly AI study to replicate safely and ethically
  • Set up a simple, repeatable workspace (files, folders, versions) without advanced tools
  • Run a baseline replication and record inputs, settings, and outputs
  • Identify common causes of changes (randomness, data differences, environment, choices)
  • Track experiments with a clear log so someone else can repeat your steps
  • Compare results and summarize what stayed the same vs what changed
  • Write a short replication report with limitations and next-step recommendations

Requirements

  • No prior AI, coding, or data science experience required
  • A laptop or desktop computer with internet access
  • Willingness to follow step-by-step instructions and take notes
  • Optional: ability to install free software (we provide alternatives if you can’t)

Chapter 1: Reproducibility—What It Is and Why It Breaks

  • Milestone 1: Define reproducible vs repeatable vs replicable (with examples)
  • Milestone 2: Identify the “moving parts” in an AI study
  • Milestone 3: Spot common failure points using a simple checklist
  • Milestone 4: Choose your target study for this course project
  • Milestone 5: Set your success criteria and time box

Chapter 2: Build a Repeatable Workspace (No Fancy Tools Required)

  • Milestone 1: Create a clean project folder structure
  • Milestone 2: Record environment details in a simple “setup note”
  • Milestone 3: Capture data and source information (provenance) the easy way
  • Milestone 4: Create a run log template you’ll reuse
  • Milestone 5: Do a dry run to confirm everything is accessible

Chapter 3: Run the Baseline Replication and Capture Evidence

  • Milestone 1: Run the study exactly as described (baseline run)
  • Milestone 2: Save outputs in a compare-ready format
  • Milestone 3: Take “evidence screenshots” and minimal artifacts
  • Milestone 4: Summarize baseline results in plain language
  • Milestone 5: Confirm another person could follow your steps

Chapter 4: Make Controlled Changes and Observe What Moves

  • Milestone 1: Plan one-change-at-a-time experiments
  • Milestone 2: Change randomness settings and compare
  • Milestone 3: Change data handling slightly and compare
  • Milestone 4: Change environment/tool versions (or document constraints)
  • Milestone 5: Log results and label changes clearly

Chapter 5: Compare Results and Turn Differences into Findings

  • Milestone 1: Build a simple comparison table across runs
  • Milestone 2: Create a “what changed/what didn’t” list
  • Milestone 3: Classify causes: data, randomness, environment, choices
  • Milestone 4: Write limitations and confidence statements
  • Milestone 5: Produce a final reproducibility checklist for the study

Chapter 6: Write the Replication Report and Share Responsibly

  • Milestone 1: Draft a 1–3 page replication report structure
  • Milestone 2: Add method, logs, and artifacts as an appendix
  • Milestone 3: Write a clear results and discussion section
  • Milestone 4: Create a shareable package (files + instructions)
  • Milestone 5: Do a final self-audit and publish or submit

Sofia Chen

AI Research Methods Educator

Sofia Chen teaches practical research skills for working with AI systems, focusing on reproducibility, documentation, and clear reporting. She has supported student and workplace teams in turning messy experiments into repeatable workflows and publishable results.

Chapter 1: Reproducibility—What It Is and Why It Breaks

Reproducible AI is not a buzzword; it is the skill of making your work re-runnable so a result can be checked, trusted, and built on. In this course you will replicate a small AI study and report what changes—openly and carefully—rather than hoping you match a headline number on the first try. That mindset matters because AI results are often fragile: a different random seed, library version, or data file can shift the output enough to change a conclusion.

This chapter sets your foundation and your workflow. You’ll learn three related terms (repeatable, reproducible, replicable), map the “moving parts” of an AI study, and use a checklist to spot where breakage tends to happen. You will also choose a beginner-friendly target study and define success criteria with a time box—so your project stays safe, ethical, and finishable. The goal is not perfection; the goal is a clear, auditable trail: what you ran, with what inputs and settings, and what you got.

Most beginner frustration comes from an invisible gap: you think you “did the same thing,” but you didn’t capture the details that make “same” meaningful. Reproducibility is the habit of turning those details into a simple system: a clean folder structure, named runs, saved configs, and an experiment log a stranger could follow.

Practice note for Milestones 1–5: for each milestone, document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Sections in this chapter
Section 1.1: What counts as an AI study (in beginner terms)

An “AI study,” in beginner terms, is any documented attempt to answer a question using a model and evidence. The question might be “Can a classifier detect spam?” or “Does model A outperform model B on dataset X?” A study usually includes (1) a dataset (or a way to collect it), (2) a method (preprocessing + model + training/evaluation procedure), and (3) a reported result (metrics, plots, examples, or a table). The “paper” could be a formal academic publication, a conference workshop report, a blog post with code, or even a well-structured GitHub repository with a README and an experiment table.

For this course, you want a study that is small enough to run on a laptop (or free cloud notebook) and concrete enough that you can compare your outcome with the author’s. Good beginner studies often use classic datasets (MNIST, CIFAR-10 subset, IMDb reviews) or small tabular datasets (UCI). They usually report one or two main metrics (accuracy, F1, RMSE) and include a baseline model that trains quickly.

Practical test: if you can describe the study in one sentence—“Train a simple CNN on MNIST and report test accuracy”—you’re likely in the right scope. If you need specialized hardware, huge datasets, proprietary APIs, or complex data access agreements, it is not beginner-friendly. Your target is a study with clear steps and publicly accessible resources.

  • Study artifact: paper/blog/README that states the goal and the result.
  • Runnable method: code, notebook, or enough detail to implement.
  • Comparable output: metric(s) you can reproduce and record.

Think of a study as a recipe plus a taste test. Your job is to see whether the recipe works in your kitchen, and to document what changed if it doesn’t.

Section 1.2: Reproducible, repeatable, replicable—clear meanings

These terms are often used loosely, so you need a practical set of meanings you can apply while working. In this course, treat them as different levels of “same-ness,” from easiest to hardest.

Repeatable means you can run the same code, in the same environment, with the same inputs, and get the same outputs (or statistically indistinguishable outputs if randomness is expected). Example: you rerun a notebook in the same Python environment with fixed seeds and the accuracy matches within 0.1%.

Reproducible means someone else can do that too using your shared artifacts and instructions. Example: a classmate clones your repo, installs dependencies, runs one command, and gets the same evaluation table. This requires documentation, pinned versions, and recorded parameters—not just working code on your machine.

Replicable (or replication) means an independent re-implementation can reach the same scientific conclusion, even if the codebase is different. Example: you implement the described model from the paper from scratch (or using different libraries) and still observe the claimed performance trend (e.g., model A > model B) on the stated dataset.

  • Repeatable: same person, same setup, rerun.
  • Reproducible: different person, same shared setup.
  • Replicable: different implementation, same conclusion.

Engineering judgment: beginners should aim first for repeatability, then reproducibility, and only then attempt deeper replication. If you skip straight to “replicate the paper,” you may not know whether a mismatch is due to your new implementation, a missing preprocessing step, or a library difference.

Milestone 1 in this chapter is simply being able to label what you are doing at each step. When you run the baseline code as provided, that is usually repeatability/reproducibility work. When you intentionally change elements (re-implement, swap libraries), that is replication work.

Section 1.3: Why results change even when you “do the same thing”

Results change because “the same thing” is rarely fully specified. AI pipelines have hidden degrees of freedom—random seeds, data ordering, nondeterministic GPU kernels, and preprocessing defaults. Even small shifts can move a metric enough to matter, especially on small datasets or near decision boundaries.

Start with the most common cause: randomness. Training often uses random initialization, shuffled minibatches, dropout, and data augmentation. If the original study ran five seeds and reported the mean, but you run one seed, your number can legitimately differ. Fixing seeds helps, but note that seeds do not guarantee identical results across hardware or library versions.
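In a typical Python stack, fixing seeds looks like the following minimal sketch. The deep-learning framework call is left as a comment because each framework (PyTorch, TensorFlow) has its own seeding API and may not be installed:

```python
import random

import numpy as np

def set_seeds(seed: int = 42) -> None:
    """Seed the common sources of randomness in a Python pipeline."""
    random.seed(seed)     # Python's built-in RNG (shuffling, sampling)
    np.random.seed(seed)  # NumPy RNG (array ops, many sklearn defaults)
    # If you use a deep learning framework, seed it too, e.g.:
    # torch.manual_seed(seed)  # plus torch.cuda.manual_seed_all(seed) on GPU

set_seeds(42)
a = np.random.rand(3)
set_seeds(42)
b = np.random.rand(3)
assert np.allclose(a, b)  # same seed, same draws on the same machine
```

The final assertion holds on one machine with one set of library versions; as noted above, it is not guaranteed across different hardware or framework versions.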

Data differences are next. You might download a newer dataset version, use a different train/test split, or apply slightly different tokenization. “Same dataset” can still vary if the study used a curated subset, filtered examples, or a particular preprocessing script. A common beginner mistake is to rely on dataset names rather than verifying checksums, record counts, label distributions, and split logic.
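Verifying a dataset rather than trusting its name takes only a few lines. This sketch (function names are my own, not from any particular study) records a checksum and a record count you can compare against the original:

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Checksum a data file so 'same dataset' is verifiable, not assumed."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # read in 1 MB chunks
            h.update(chunk)
    return h.hexdigest()

def describe_csv(path: Path) -> dict:
    """Record the basics you would compare against the original study."""
    lines = path.read_text(encoding="utf-8").splitlines()
    return {
        "file": path.name,
        "sha256": sha256_of(path),
        "rows": max(len(lines) - 1, 0),  # assumes one header row
    }
```

Save the resulting dictionary in your notes; if a later download produces a different hash or row count, you have found a data difference before it confuses your metrics.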

Environment drift is subtle but powerful. A minor change in Python, NumPy, PyTorch/TensorFlow, CUDA, or even BLAS libraries can change numerical behavior and therefore training dynamics. Another common failure: dependency installation silently picks newer versions than the author used, because versions were not pinned.

Choices not written down also matter: early stopping patience, learning-rate schedules, batch size, number of epochs, and evaluation details. Two people can “train the same model” but evaluate on different checkpoints, compute metrics differently, or include/exclude preprocessing in the evaluation pipeline.

  • Randomness: seeds, shuffling, augmentation, dropout.
  • Data: versions, splits, preprocessing, filtering.
  • Environment: library versions, hardware, nondeterministic ops.
  • Procedure: hyperparameters, stopping rules, metric calculation.

Milestone 3 is learning to see mismatches as diagnostic signals, not personal failure. When your results differ, your task is to identify which moving part changed, then record it. Your final report can be valuable even if you do not match the original metric exactly—because you will explain why it shifted.

Section 1.4: Inputs, process, outputs—your reproducibility map

To replicate safely and systematically, you need a “reproducibility map” that breaks an AI study into three buckets: inputs, process, outputs. This is Milestone 2 (identify moving parts) expressed as a practical diagram you can keep in your project README.

Inputs include data files, labels, splits, and any external resources (pretrained embeddings, checkpoints). Also include configuration inputs: hyperparameters, random seeds, and command-line arguments. A good habit is to treat every run as a function call: if you can’t list the function’s inputs, you can’t expect the same output later.

Process includes preprocessing code, model architecture, training loop, evaluation procedure, and the environment in which it runs. For beginners, you can track the environment without advanced tools by recording: OS, Python version, core library versions, and whether you used CPU or GPU. Also record the exact command used to run training and evaluation.

Outputs include metrics, plots, logs, and saved artifacts (model weights). Outputs must be stored in a predictable place and named so you can compare runs. Avoid overwriting results; instead, create a new run folder per attempt.

  • Minimal folder structure: project/data/, src/ (or notebooks/), runs/, reports/.
  • Run naming: runs/2026-03-28_seed42_baseline/ with config.txt, metrics.json, stdout.log.
  • Experiment log: a simple experiment_log.md table: date, goal, changes, command, result, notes.
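The experiment log above can be maintained by hand, or with a tiny helper like this sketch (the filename and columns follow the list above; the function name is my own):

```python
from datetime import date
from pathlib import Path

LOG = Path("experiment_log.md")
HEADER = "| date | goal | changes | command | result | notes |\n|---|---|---|---|---|---|\n"

def log_run(goal: str, changes: str, command: str, result: str, notes: str = "") -> None:
    """Append one run to the single-source-of-truth experiment log."""
    if not LOG.exists():
        LOG.write_text(HEADER, encoding="utf-8")  # create the table header once
    row = f"| {date.today().isoformat()} | {goal} | {changes} | {command} | {result} | {notes} |\n"
    with open(LOG, "a", encoding="utf-8") as f:
        f.write(row)
```

Usage: `log_run("baseline", "none", "python train.py", "acc=0.91")` adds one row per attempt, so the log grows in the same order as your runs.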

Milestone 4 (choose a target study) and Milestone 5 (success criteria/time box) become easier when you can point to what you will hold constant (inputs/process) and what you will compare (outputs). This map also protects you from “accidental improvements” that are really untracked changes.

Section 1.5: A beginner reproducibility checklist

This checklist is designed for beginners who are not using containers or heavy experiment platforms. It is intentionally lightweight: a handful of files and habits that prevent most confusion. Use it before you claim a run “matches” or “doesn’t match.”

  • Study reference captured: save the paper/blog link and commit hash (if code). Copy key claims (metric, dataset, model) into reports/notes.md.
  • Data pinned: record dataset source URL, download date, file counts, and (when possible) checksums. Record exact split procedure or store split files.
  • Environment recorded: write down OS, Python version, and key package versions. Keep a requirements.txt (even if not perfect) generated after installation.
  • Randomness controlled: set seeds where possible (Python, NumPy, framework). Note if GPU nondeterminism remains.
  • Commands saved: store the exact command(s) used for training and evaluation in the run folder.
  • Configs preserved: save hyperparameters and any config files alongside outputs.
  • Outputs not overwritten: each run gets its own directory with metrics and logs.
  • Single source of truth log: maintain experiment_log.md with what changed and why.
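The "environment recorded" item can be automated with Python's standard library alone. A minimal sketch (the function name and output filename are suggestions, not a standard):

```python
import importlib.metadata
import json
import platform
import sys

def environment_snapshot(packages=("numpy",)) -> dict:
    """Capture the environment details the checklist asks for."""
    versions = {}
    for name in packages:
        try:
            versions[name] = importlib.metadata.version(name)
        except importlib.metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return {
        "os": platform.platform(),
        "python": sys.version.split()[0],
        "packages": versions,
    }

# Save alongside your run outputs, e.g. as runs/<run_id>/environment.json
print(json.dumps(environment_snapshot(), indent=2))
```

Pass the libraries your study actually uses (e.g. `("numpy", "torch", "scikit-learn")`) and save one snapshot per run; two snapshots that differ point directly at environment drift.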

Common mistakes this checklist prevents: reusing the same output folder (so you compare the wrong results), “fixing” a bug but forgetting to note the change, and changing two variables at once (e.g., new seed and new learning rate). Engineering judgment here is simple: change one thing per run unless you are intentionally doing a larger migration, and then write it down clearly.

This checklist is also your bridge from repeatable to reproducible: once your log is clean and your run folders are consistent, another person can follow your steps with far fewer questions.

Section 1.6: Selecting a safe, small replication target

Your project study should be safe (ethically and legally), small (computationally), and clear (method and metric). Milestone 4 is choosing the target; Milestone 5 is setting success criteria and a time box so you finish. Aim for something you can run end-to-end in 1–3 hours once the environment is set up, and iterate on in short cycles.

Safety and ethics first. Avoid studies involving private user data, scraping that violates terms of service, medical/biometric identification, or anything requiring access-controlled datasets. Prefer widely used public datasets with clear licenses. Also avoid “harmful capability” targets (e.g., malware generation). Your goal is academic skill-building, not deploying a high-risk system.

Beginner-friendly targets often include: a baseline classifier on a standard dataset; a simple sentiment model; a small image model; or a tabular regression. Choose a study that reports a baseline and provides either code or enough detail to re-create it. If the study’s result depends on extensive hyperparameter search, it’s harder to time-box.

  • Good target signals: public data, single GPU not required, clear metric, clear preprocessing, a stated baseline.
  • Red flags: “we tuned extensively” with no details, proprietary data, heavy compute, unclear evaluation protocol.

Set success criteria: define what “success” means before you run anything. For example: “I can run the author’s code and get within ±1–2 percentage points of the reported accuracy,” or “I reproduce the ranking between two models even if absolute numbers differ.” Also define what you will deliver: a filled experiment log, a baseline run folder, and a short report describing differences.
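A criterion like "within ±2 percentage points" is easy to make mechanical, which keeps you honest when the numbers come in. A small sketch (assuming metrics on a 0–100 percent scale):

```python
def within_tolerance(reported: float, observed: float, pct_points: float = 2.0) -> bool:
    """Check a success criterion like 'within ±2 percentage points'.

    Metrics are assumed to be on a 0-100 percent scale; convert first if
    yours are fractions (e.g. 0.91 -> 91.0).
    """
    return abs(reported - observed) <= pct_points

# Example: the study reports 92.4% accuracy, your baseline run got 91.1%
assert within_tolerance(92.4, 91.1, pct_points=2.0)      # success by your criterion
assert not within_tolerance(92.4, 88.0, pct_points=2.0)  # outside the band: investigate
```

Writing the check down (and the tolerance you chose) before running anything prevents moving the goalposts afterward.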

Time box: commit to a fixed window (for example, two evenings or one weekend). If you hit a blocker (missing data, broken code), record it and choose a backup study rather than sinking the whole project. Replication work is real research work: knowing when to stop and document is part of the skill.

Chapter milestones
  • Milestone 1: Define reproducible vs repeatable vs replicable (with examples)
  • Milestone 2: Identify the “moving parts” in an AI study
  • Milestone 3: Spot common failure points using a simple checklist
  • Milestone 4: Choose your target study for this course project
  • Milestone 5: Set your success criteria and time box
Chapter quiz

1. In this chapter, what is the core purpose of reproducible AI?

Correct answer: Make work re-runnable so results can be checked, trusted, and built on
The chapter defines reproducibility as making your work re-runnable with enough detail for others to verify and extend it.

2. Which situation best illustrates why AI results can be fragile?

Correct answer: Changing a random seed, library version, or data file shifts the output enough to change a conclusion
The chapter highlights that small changes (seed, versions, data) can alter outputs and even conclusions.

3. What is the mindset the course emphasizes when replicating a study?

Correct answer: Report what changes openly and carefully rather than only chasing a single best number
The course goal is to replicate a small study and transparently report differences, not just hunt a matching metric.

4. According to the chapter, what causes much beginner frustration in reproducibility work?

Correct answer: An invisible gap: thinking you did the same thing without capturing the details that make 'same' meaningful
The chapter points to missing or uncaptured details as the main source of 'I did the same thing' confusion.

5. Which combination best matches the chapter’s recommended habits for an auditable trail?

Correct answer: Clean folder structure, named runs, saved configs, and an experiment log a stranger could follow
The chapter frames reproducibility as a simple system of organizing and logging runs so others can follow exactly what happened.

Chapter 2: Build a Repeatable Workspace (No Fancy Tools Required)

Reproducible work starts long before you train a model. It starts with a workspace that makes it hard to lose files, forget settings, or accidentally “improve” something without noticing. In this chapter you’ll build a repeatable workspace using plain folders, simple text files, and habits you can follow on any computer. The goal is not perfection—it’s being able to answer, later, “What exactly did I run, with what data, and what did I get?”

Beginner replications often fail for boring reasons: a dataset got re-downloaded and changed, you overwrote outputs, a library version updated silently, or you can’t remember which script produced which chart. None of these are “AI problems”; they’re workflow problems. The fixes are also boring—in a good way. You will (1) create a clean folder structure, (2) record environment details in a setup note, (3) capture data/source provenance, (4) create a run log template, and (5) do a dry run to prove everything is accessible.

One engineering judgment to adopt early: you are building evidence, not just code. If someone else can follow your evidence trail and get the same outputs (or understand why they differ), you’re doing reproducible AI.

  • Milestone 1: Create a clean project folder structure
  • Milestone 2: Record environment details in a simple “setup note”
  • Milestone 3: Capture data and source information (provenance) the easy way
  • Milestone 4: Create a run log template you’ll reuse
  • Milestone 5: Do a dry run to confirm everything is accessible

Keep your tools minimal: a file manager, a text editor, and whatever you use to run the study (Python, R, a notebook, or even a browser). You do not need containers, pipeline frameworks, or experiment-tracking platforms to make meaningful progress. You need consistency.

Practice note for Milestones 1–5: for each milestone, document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Sections in this chapter
Section 2.1: Project folders that prevent confusion

Start with a single project folder dedicated to the replication. Do not mix it into “Downloads,” a class folder with unrelated assignments, or a drive full of old experiments. Confusion is the number-one enemy of repeatability because it leads to accidental overwrites and missing context. Your project folder should separate: (a) inputs you obtained, (b) code you wrote or adapted, (c) outputs you generated, and (d) notes describing what happened.

Use a small, predictable structure. Here is a beginner-friendly layout that works for most replications:

  • 01_readme/ (what this project is, links to the study, how to run)
  • 02_setup/ (environment notes, install notes, version snapshots)
  • 03_data/
    • raw/ (downloaded or original data—treat as read-only)
    • processed/ (your cleaned/filtered data)
  • 04_code/ (scripts, notebooks, small utilities)
  • 05_runs/ (one subfolder per run, with outputs and run log)
  • 06_reports/ (figures you’ll include, write-up drafts, final report)
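This layout can be created by hand in a file manager, or with a short script if you prefer. A sketch using only Python's standard library:

```python
from pathlib import Path

FOLDERS = [
    "01_readme",
    "02_setup",
    "03_data/raw",        # treat as read-only once populated
    "03_data/processed",
    "04_code",
    "05_runs",
    "06_reports",
]

def create_project(root: str) -> Path:
    """Create the beginner replication layout described above."""
    base = Path(root)
    for folder in FOLDERS:
        (base / folder).mkdir(parents=True, exist_ok=True)
    return base
```

Usage: `create_project("my_replication")` builds the full tree in one call and is safe to rerun, since existing folders are left alone.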

This structure supports Milestone 1: creating a clean folder structure. The key design choice is the “raw vs processed” split. If raw data is ever edited in place, you can’t reliably re-create the processed version. Treat 03_data/raw as read-only: if you must change something, save a new file in processed and document the transformation.

Common mistake: storing outputs directly next to code (“results.png” next to “train.py”). That invites overwriting and makes it unclear which settings produced which file. Instead, each run gets its own folder under 05_runs. This single habit prevents many replication headaches.

Section 2.2: Naming files so you can find and compare runs

File names are part of your reproducibility system. Good names let you compare runs without opening files and guessing. Aim for names that answer three questions: what is it, when was it produced, and under which configuration (or run ID).

A practical naming scheme for runs is a timestamp plus a short label:

  • 2026-03-28_1530_baseline/
  • 2026-03-28_1610_seed42/
  • 2026-03-29_0905_lr-0p001/

Inside each run folder, keep a consistent set of filenames so you can diff runs quickly:

  • run_log.md (your filled-in template)
  • metrics.json or metrics.csv (accuracy, loss, etc.)
  • stdout.txt (captured terminal output, if relevant)
  • fig_*.png (plots with meaningful prefixes)
  • model.bin or checkpoint/ (only if needed)

For individual files, avoid spaces and ambiguous names like “final,” “new,” or “results2.” Prefer lowercase with underscores, and encode key distinctions: train_split_v1.csv, train_split_v2_stratified.csv. If you export a figure, name it for the comparison it supports, not its aesthetics: fig_loss_curve_baseline.png is better than plot1.png.

Engineering judgment: don’t over-encode everything in the filename. If you find yourself creating names like run_seed42_lr0p001_bs32_augStrong_dropout0p3_v7.png, move those details into the run log instead. The filename should help you navigate; the run log should preserve the full truth.

Common mistake: reusing the same output path each time (“outputs/”). That silently mixes artifacts from multiple runs. If you must keep a single output folder, force it to include a run ID subfolder (even manually).

Section 2.3: Recording versions (OS, browser, libraries, tools)

Milestone 2 is a “setup note”: a simple text file that records your environment. This matters because small version differences can change results—sometimes slightly (floating-point behavior, default seeds) and sometimes dramatically (model implementations, preprocessing defaults). Your goal is not to freeze the world; it is to leave a clear trail so another person (or future you) can approximate your environment and understand discrepancies.

Create 02_setup/setup_note.md and include the following, filled in as best you can:

  • Date created and your name/initials
  • Operating system (e.g., Windows 11 23H2, macOS 14.3, Ubuntu 22.04)
  • Hardware (CPU model; GPU model if used; RAM optional but helpful)
  • Python/R version (or “browser only” if that’s the tool)
  • Key libraries with versions (e.g., numpy, pandas, torch, tensorflow, scikit-learn)
  • How you installed (pip, conda, system packages) and where (system, venv)
  • Browser version if you rely on web tools or hosted notebooks

Keep it simple: copy-paste version outputs from your terminal or “About” pages. If you use Python, it’s usually enough to save a list of installed packages at the time of the run (for example, exporting a “requirements” style list). If that feels advanced, record only the top 5–10 libraries you directly import. The point is to cover the likely causes of changed results.
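If you prefer to capture the note in code, here is a minimal sketch using only the Python standard library. The package names in the loop are examples; substitute the libraries you actually import:

```python
import platform
import sys
from importlib import metadata

lines = [
    f"- Python: {sys.version.split()[0]}",
    f"- OS: {platform.platform()}",
]

# Record versions of the libraries you directly import
# (these names are examples, not a required list).
for pkg in ["numpy", "pandas"]:
    try:
        lines.append(f"- {pkg}: {metadata.version(pkg)}")
    except metadata.PackageNotFoundError:
        lines.append(f"- {pkg}: not installed")

note = "\n".join(lines)
print(note)
# In practice you would append `note` to 02_setup/setup_note.md.
```

This covers the most common causes of environment drift (interpreter, OS, key libraries) without requiring any extra tooling.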

Common mistake: recording versions once and never updating them. Instead, update the setup note when you notice something changed (you upgraded Python, your library auto-updated, you moved machines). Reproducibility is a living record, not a one-time ceremony.

Practical outcome: when your replication differs from the paper, you can rule in/out environment drift as a cause rather than guessing.

Section 2.4: Data provenance: where data came from and what you changed

Milestone 3 is about data provenance: documenting where your data came from, which exact version you used, and what you did to it. In replications, “same dataset” is often the hidden variable—especially when data is hosted online and can be updated, reprocessed, or mirrored. Provenance is your defense against invisible data drift.

Create 03_data/data_provenance.md (or put this in 01_readme if you prefer) and record:

  • Source: URL(s), paper citation, or repository link
  • Date accessed: when you downloaded or copied it
  • Identity: filename(s), dataset version tag, commit hash (if available)
  • Integrity: checksum if you can (optional), or at least file size and row counts
  • License/terms: usage permissions and any restrictions
  • Transformations: what you changed to create processed data
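The “integrity” bullet above is the easiest to automate. A minimal sketch with Python’s standard library; `file_fingerprint` is a hypothetical helper and `demo_raw.csv` is a made-up example file:

```python
import csv
import hashlib
from pathlib import Path

def file_fingerprint(path: Path) -> dict:
    """Checksum, size, and row count for a CSV-style data file."""
    data = path.read_bytes()
    with path.open(newline="") as f:
        rows = sum(1 for _ in csv.reader(f))
    return {
        "file": path.name,
        "sha256": hashlib.sha256(data).hexdigest(),
        "bytes": len(data),
        "rows": rows,
    }

# Demo with a tiny made-up raw file (header plus two rows).
demo = Path("demo_raw.csv")
demo.write_text("label,text\npos,good\nneg,bad\n")
print(file_fingerprint(demo))
```

Paste the resulting dictionary into data_provenance.md; if a re-download later produces a different checksum, you have caught data drift before it contaminates a comparison.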

Keep raw data untouched in 03_data/raw. Put any cleaning scripts in 04_code and write outputs to 03_data/processed. Then, in your provenance note, describe transformations at the level a careful peer would need: “Removed rows with missing label; lowercased text; tokenized with X; split 80/10/10 using stratified sampling; random seed 42.” If you used a tool that does preprocessing automatically (some notebooks and libraries do), write that down too—defaults count as choices.

Common mistakes include: renaming raw files without recording the original name, copying data into a notebook cell (hard to track), or using “latest” links that change over time. Prefer stable links and archived releases when possible. If the study used a dataset snapshot, try to find that snapshot; if you can’t, record that mismatch explicitly because it may explain differences in results.

Practical outcome: you can re-run the same preprocessing months later and know whether a difference came from data content, data cleaning decisions, or model training.

Section 2.5: The run log: inputs, settings, outputs, notes

Milestone 4 is the heart of beginner reproducibility: a run log template you reuse every time. Many “replications” are not reproducible because the researcher cannot reconstruct what they did between attempt #1 and attempt #6. A run log turns that fog into a sequence of testable steps.

Create 05_runs/run_log_template.md and copy it into each run folder as run_log.md. Keep it short enough that you will actually fill it in. Here is a practical template:

  • Run ID: (folder name)
  • Date/time:
  • Goal: (baseline replication, change seed, match paper preprocessing, etc.)
  • Inputs: dataset files used (paths), counts (rows, classes), any downloaded resources
  • Code: script/notebook name(s), plus commit hash if you use git (optional)
  • Settings: key hyperparameters, random seed, number of epochs, batch size, learning rate
  • Environment: reference to setup note; note any deviations
  • Command: exact command or steps to run (copy-pasteable)
  • Outputs: paths to metrics/figures/models; headline numbers
  • Notes: warnings, errors, surprises, interpretation, “what to try next”
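The template above is plain text, so you can also fill it in from code at the end of each run. A sketch assuming Python; the field names, paths, and values are illustrative, not prescribed by the course:

```python
from datetime import datetime
from pathlib import Path

# A cut-down version of the run log template above.
TEMPLATE = """\
# Run log
- Run ID: {run_id}
- Date/time: {when}
- Goal: {goal}
- Settings: seed={seed}, epochs={epochs}
- Command: {command}
"""

def write_run_log(run_dir: Path, **fields) -> Path:
    """Create the run folder (if needed) and write its run_log.md."""
    run_dir.mkdir(parents=True, exist_ok=True)
    log = run_dir / "run_log.md"
    log.write_text(TEMPLATE.format(**fields))
    return log

log_path = write_run_log(
    Path("05_runs/2026-03-28_1530_baseline"),
    run_id="2026-03-28_1530_baseline",
    when=datetime.now().isoformat(timespec="minutes"),
    goal="baseline replication",
    seed=42,
    epochs=10,
    command="python train.py --seed 42",
)
```

Filling the log programmatically guarantees the “Command” field is the command you actually ran, not a later reconstruction from memory.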

This log directly supports the course outcomes “Run a baseline replication and record inputs, settings, and outputs” and “Track experiments with a clear log so someone else can repeat your steps.” It also prepares you for Chapter 3’s work of explaining changes. When results differ, you can compare run logs line-by-line: was the seed different, the split different, the library version different, or did you change preprocessing?

Common mistake: writing narrative notes but not recording the exact command or the exact data file paths. If a step cannot be copied and executed by a stranger, it is not yet reproducible. Another mistake is recording metrics without recording how they were computed (e.g., macro vs micro F1). Put evaluation choices in “Settings.”

Section 2.6: Backups and snapshots for beginners

Milestone 5 is a dry run: confirm that everything is accessible and you could repeat your own steps. Before that, you need one more protection: simple backups and snapshots. Beginners often postpone backups until after something breaks. In reproducible AI, backups are not just safety—they’re part of the evidence.

You can do snapshots without fancy tools:

  • Copy the whole project folder to a second location (external drive or cloud) at key moments: after baseline run, after matching paper settings, after writing the report draft.
  • Export important artifacts (metrics files, final figures, run logs) into 06_reports so they survive cleanup in 05_runs.
  • Use “read-only” protection for raw data and for baseline outputs once confirmed, to avoid accidental changes.
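The first bullet (copy the whole project folder at key moments) can be scripted with the standard library. A minimal sketch, assuming Python; `demo_project` and `backups` are placeholder paths:

```python
import shutil
from datetime import datetime
from pathlib import Path

def snapshot_project(project: Path, backups: Path) -> Path:
    """Copy the whole project folder to a dated backup location."""
    stamp = datetime.now().strftime("%Y-%m-%d_%H%M")
    dest = backups / f"{project.name}_{stamp}"
    shutil.copytree(project, dest)  # fails loudly if dest already exists
    return dest

# Demo with a tiny made-up project.
proj = Path("demo_project")
(proj / "05_runs").mkdir(parents=True, exist_ok=True)
(proj / "05_runs" / "run_log.md").write_text("baseline\n")

dest = snapshot_project(proj, Path("backups"))
print(dest)
```

Point `backups` at an external drive or synced cloud folder and call this after your baseline run, after matching paper settings, and after the report draft.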

Now do the dry run. Pick your baseline run folder (or create one), then verify:

  • You can locate the paper/study link from 01_readme.
  • Your setup note contains enough detail to reinstall the main tools.
  • Your data provenance note identifies the exact raw files present.
  • Your run log includes a copy-pasteable command or step list.
  • Outputs are stored inside the run folder, not scattered elsewhere.

If the dry run fails, treat it as a gift: it found a reproducibility gap early. Fix the gap immediately (update the note, rename the file, move the output). This is the practical mindset shift: you are not “done” when the model runs; you are done when the work can be rerun on purpose.

Common mistake: backing up only code. In replication work, the data snapshot and the run logs are often more valuable than the scripts, because they capture the conditions that produced your observed results.

Chapter milestones
  • Milestone 1: Create a clean project folder structure
  • Milestone 2: Record environment details in a simple “setup note”
  • Milestone 3: Capture data and source information (provenance) the easy way
  • Milestone 4: Create a run log template you’ll reuse
  • Milestone 5: Do a dry run to confirm everything is accessible
Chapter quiz

1. What is the main goal of building a repeatable workspace in this chapter?

Show answer
Correct answer: To be able to answer later what you ran, with what data, and what you got
The chapter emphasizes being able to reconstruct exactly what was run, on which data, and what outputs resulted.

2. Which situation best reflects a common reason beginner replications fail, according to the chapter?

Show answer
Correct answer: A library version updates silently and changes results
The chapter highlights workflow issues like silent library updates, overwritten outputs, and changed datasets—not core AI complexity.

3. How does the chapter recommend you make meaningful progress toward reproducibility without “fancy tools”?

Show answer
Correct answer: Rely on consistent folders, simple text notes, and repeatable habits
It stresses minimal tooling (file manager, text editor, and your run method) and consistency over complex platforms.

4. Why does the chapter treat capturing data/source information (provenance) as a milestone?

Show answer
Correct answer: So you can prove what data you used and detect if it changed
Provenance helps you document which data was used and prevents failures caused by re-downloaded or changed datasets.

5. What is the purpose of doing a “dry run” at the end of the chapter’s workflow?

Show answer
Correct answer: To confirm everything needed is accessible and the workflow runs end-to-end
The dry run is meant to verify that files, data, scripts, and outputs are accessible and the process is repeatable.

Chapter 3: Run the Baseline Replication and Capture Evidence

This chapter is where your replication becomes real: you will run the study exactly as described (your baseline run) and capture evidence that proves what you did and what you got. Beginners often think “running the code” is the hard part. In practice, the hard part is running it in a way someone else can verify, and saving outputs so you can compare later without re-running everything from memory.

We will treat the paper’s method section like an instruction manual, turn it into a checklist, run the baseline once without “improving” anything, and then save compare-ready outputs. You will also collect minimal artifacts—screenshots and small files that confirm your steps—without drowning in clutter. Finally, you’ll write a plain-language baseline summary and do a quick “could another person follow this?” check before moving on to changes in later chapters.

  • Milestone 1: Run the study exactly as described (baseline run)
  • Milestone 2: Save outputs in a compare-ready format
  • Milestone 3: Take evidence screenshots and minimal artifacts
  • Milestone 4: Summarize baseline results in plain language
  • Milestone 5: Confirm another person could follow your steps

Two key mindsets will keep you out of trouble: (1) baseline means “no creativity yet”—match the paper as closely as you reasonably can; (2) evidence beats confidence—record what happened, not what you think happened.

Practice note (applies to each milestone above): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Sections in this chapter
Section 3.1: Translating a method section into a to-do list

The method section in a study is rarely written like a tutorial. Your job is to translate it into an actionable to-do list that you can execute and later prove you executed. Start by scanning for anything that sounds like a setting, a choice, or a measurable outcome. Then rewrite those details into steps you can check off.

A practical way is to create a “baseline checklist” in a plain text file (for example, runs/2026-03-28_baseline/checklist.md). Use short, testable statements such as “Download dataset X version Y,” “Use model architecture Z,” “Train for N epochs,” “Evaluate on test split,” “Report metric M.” If the paper provides a command or pseudo-code, copy it verbatim into your checklist and mark what must be substituted (paths, GPU/CPU, etc.).

  • Inputs: dataset name, source URL, version/date, preprocessing rules, train/val/test split rules
  • Model: architecture, initialization, pretrained weights, tokenizer/vocabulary, loss function
  • Training: optimizer, learning rate schedule, batch size, epochs/steps, early stopping, augmentation
  • Evaluation: which split, which metric(s), averaging method, thresholding rules
  • Hardware/software: OS, Python version, library versions, GPU/CPU notes

Then run the baseline exactly as described—no extra regularization, no “better” split, no updated hyperparameters. If a detail is missing, make the smallest reasonable assumption and write it down as an explicit decision (e.g., “Paper does not state random seed; defaulted to 0”). Common mistake: silently “fixing” things (like changing a deprecated API or using a newer dataset mirror) and forgetting to record it. A baseline is not about perfection; it is about establishing a documented reference run you can compare against later.

Section 3.2: Seeds and randomness (explained simply)

Most AI experiments contain randomness: shuffled data order, random weight initialization, dropout, data augmentation, and sometimes non-deterministic GPU operations. A seed is a number that initializes the random number generator so that “random” choices become repeatable. If you use the same code, same data, same environment, and the same seed, you often (not always) get the same result.

For a beginner baseline replication, you should do two things: (1) set the seed wherever the framework allows; (2) record the seed value and any determinism settings. In practice this may include Python’s random, NumPy, and your ML framework (PyTorch/TensorFlow/JAX). Some frameworks also need flags for deterministic behavior, and some GPU kernels can still vary slightly run-to-run. That’s okay—your goal is to reduce avoidable randomness and make remaining randomness visible.
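A minimal sketch of seeding, shown for Python’s built-in `random` module; in a real project you would seed NumPy and your ML framework the same way, where those libraries are present:

```python
import random

def set_seed(seed: int) -> None:
    """Seed the random generator(s) your code uses.

    Shown for Python's built-in `random`. Where available, also call
    numpy.random.seed(seed) and e.g. torch.manual_seed(seed) so every
    source of randomness starts from the same point.
    """
    random.seed(seed)

set_seed(42)
first = [random.random() for _ in range(3)]

set_seed(42)
second = [random.random() for _ in range(3)]

# Same seed, same "random" draws.
assert first == second
```

Note that this demonstrates repeatability of the generator only; GPU kernels and some framework operations can still introduce small run-to-run differences, which is exactly why you record determinism settings rather than assume them.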

Engineering judgment: if the paper reports a mean and standard deviation across multiple runs, you should ideally match that later, but your baseline can start with a single run if time or compute is limited. Just label it clearly as “1-run baseline.” If the paper reports the best of several seeds, note that too—“best-of” results are usually higher than a single typical run.

  • Record: seed values, whether deterministic mode was enabled, number of runs
  • Watch for: different results when switching CPU vs GPU, or when changing library versions
  • Don’t panic: tiny metric differences can happen even with the same seed

Common mistake: setting a seed in one place but not others, then assuming the run is deterministic. Another mistake: re-running until you “get the paper’s number” and calling that replication. That approach hides the variability you are supposed to understand.

Section 3.3: What to record during a run (and what to ignore)

During the baseline run, you are collecting evidence. Evidence should answer: “What exactly did I run?” and “What did it output?” The simplest approach is to capture three categories: configuration, environment, and results. You do not need to record everything; you need to record the right things so someone else can reproduce your baseline without guessing.

Configuration includes command-line arguments, config files, and any defaults that matter (learning rate, batch size, epochs, dataset path, model name). If the code uses a config file, save a copy of the resolved config (the final config after defaults are applied). Environment includes OS, Python version, and library versions; also note CPU/GPU type if relevant. If you can, save a small environment snapshot: a pip freeze output or equivalent is usually enough for beginners.

Results include the final metric(s), plus any training curves or intermediate evaluations that help diagnose issues later. A run log is valuable evidence; save the console output to a text file (redirect stdout/stderr) so you can prove what happened without relying on memory.

  • Record (minimum): date/time, git commit or code version, command run, seed(s), dataset version, final metrics
  • Record (helpful): training/validation curves, runtime, warnings, and any failed attempts with brief notes
  • Ignore (usually): every single batch-level printout, massive debug dumps, or duplicative logs that you will never read

Milestone 3 (evidence screenshots) fits here: take screenshots only when they add trust—e.g., the terminal showing the command and final metric, the dataset version page, or a configuration screen. Screenshot discipline matters: name them predictably (e.g., evidence_terminal_final-metric.png) and capture the smallest area that proves the point.

Section 3.4: Saving outputs: tables, plots, metrics, and samples

To compare baseline vs changed runs later, you need outputs in formats that are stable, diff-friendly, and easy to load. Think in layers: (1) raw metrics in a machine-readable file; (2) a human-readable summary table; (3) visualizations; (4) small qualitative samples when applicable.

Metrics: Save a single metrics.json (or metrics.csv) containing the final numbers you will compare: accuracy/F1/loss, plus any key settings (seed, epochs, dataset version). Keep it one row per run for easy later aggregation. Tables: If the paper reports a table, recreate it with the same columns and save it as CSV so you can compare values precisely. Plots: Save training curves as PNG or PDF with consistent axes labels and units; also save the underlying data (CSV) if possible so you can re-plot later without rerunning.
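A sketch of what that one-row metrics.json might look like, assuming Python; the run ID, settings, and numbers are illustrative:

```python
import json
from pathlib import Path

# One row of compare-ready metrics for a run (values are made up).
metrics = {
    "run_id": "2026-03-28_1530_baseline",
    "seed": 42,
    "epochs": 10,
    "dataset_version": "1.2",
    "accuracy": 0.842,
    "loss": 0.41,
}

out = Path("metrics.json")
out.write_text(json.dumps(metrics, indent=2))

# Loading it back is one line, which is the whole point:
# later comparison scripts read these files instead of screenshots.
print(json.loads(out.read_text())["accuracy"])
```

Keeping key settings (seed, epochs, dataset version) in the same file as the numbers means each metrics row is interpretable on its own, without hunting through logs.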

Samples: For tasks like text generation or classification, save a small set of representative predictions (e.g., 20 examples) as a CSV with input, model output, and ground truth. Choose a fixed sample set (same IDs) so comparisons are meaningful. This is especially helpful when metrics are close but behavior differs.

  • Compare-ready rule: prefer .json/.csv for numbers, .png for plots, and small, fixed-size sample files for qualitative checks
  • Name outputs clearly: include run ID and content (e.g., baseline_metrics.json, baseline_confusion-matrix.png)
  • Avoid: screenshots of tables as the only record—screenshots are evidence, not data

This section completes Milestone 2: once outputs are in stable formats, you can compare without reinterpreting pictures or re-reading logs.

Section 3.5: Organizing artifacts for later comparison

Artifacts are everything your run produced that you might need later: logs, metrics files, plots, model checkpoints, configs, and evidence screenshots. Without organization, you will mix files from different runs, overwrite outputs, and lose the ability to explain differences. Your goal is to make each run self-contained.

A beginner-friendly structure is “one folder per run,” with a short run ID that encodes date and purpose:

  • runs/2026-03-28_baseline/
    • config/ (resolved config, command used)
    • logs/ (console output, warnings)
    • metrics/ (metrics.json, metrics.csv)
    • plots/ (curves, confusion matrix)
    • samples/ (predictions on fixed examples)
    • evidence/ (screenshots)

Keep checkpoints only if they are needed for later analysis; large model files can bloat your project. If you do keep them, note the file size and why it’s kept (e.g., “needed for error analysis in Chapter 5”). Add a short README.md inside the run folder stating: what this run is, how to reproduce it, and where the key results are stored.

Milestone 5 is basically an organization test: hand the run folder to someone else (or pretend you are that person one week later). If they cannot locate the command, the config, and the final metric in under two minutes, your artifacts are not organized enough. Common mistake: storing everything in the project root or relying on “latest.log” files that get overwritten every run.

Section 3.6: Baseline summary: what you got and how confident you are

After the baseline run, you will write a plain-language summary that captures both the result and your confidence in it. This is Milestone 4, and it should be understandable to a beginner who has not read the paper. Keep it short but specific: what you tried, what you matched, what you could not match, and what you observed.

A useful template is:

  • Goal: “Replicate Table 1 accuracy on dataset X using model Y.”
  • What I ran: “Used authors’ code commit abc123, dataset version 1.2, seed 0, trained 10 epochs on GPU.”
  • What I got: “Test accuracy 0.842 (paper reports 0.850).”
  • Differences/assumptions: “Paper did not specify tokenizer version; used default from repository.”
  • Confidence: “Medium—metric is close; warnings about non-deterministic ops present.”

Confidence is not a feeling; it’s a judgment based on evidence. You can be confident even when numbers differ if you can explain why differences are plausible (seed variance, minor version differences, dataset mirror changes). Conversely, you should be low-confidence even when numbers match if you relied on undocumented tweaks or reran until it matched.

End the chapter by re-checking Milestone 1–5: you ran the baseline as described, saved outputs in compare-ready formats, captured minimal evidence, wrote a plain summary, and confirmed another person could repeat your steps. With that baseline locked in, you are ready to explore changes intentionally in later chapters—without losing track of what “baseline” really means.

Chapter milestones
  • Milestone 1: Run the study exactly as described (baseline run)
  • Milestone 2: Save outputs in a compare-ready format
  • Milestone 3: Take “evidence screenshots” and minimal artifacts
  • Milestone 4: Summarize baseline results in plain language
  • Milestone 5: Confirm another person could follow your steps
Chapter quiz

1. What is the primary goal of the baseline run in this chapter?

Show answer
Correct answer: Run the study exactly as described so results are verifiable
A baseline run matches the paper as closely as possible and produces evidence someone else can verify.

2. Why does the chapter emphasize saving outputs in a compare-ready format?

Show answer
Correct answer: So you can compare later changes without relying on memory or re-running everything
Compare-ready outputs let you evaluate changes later without reconstructing what happened from memory.

3. Which action best reflects the mindset 'evidence beats confidence'?

Show answer
Correct answer: Capture screenshots and minimal artifacts showing what you did and what you got
The chapter stresses recording what happened (evidence) rather than trusting what you think happened.

4. How should you treat the paper’s method section during Chapter 3?

Show answer
Correct answer: As an instruction manual to convert into a checklist for the baseline run
The method section is used to build a checklist that guides an exact baseline replication.

5. What is the purpose of Milestone 5 ('Confirm another person could follow your steps')?

Show answer
Correct answer: Ensure your process is reproducible and verifiable before making changes later
Milestone 5 is a quick reproducibility check: someone else should be able to follow your steps.

Chapter 4: Make Controlled Changes and Observe What Moves

You have a baseline run that works. That’s a big deal: it means you can execute the study’s code (or your simplified version) and get outputs you can compare. Now you’ll do the most educational part of reproducible AI: make small, controlled changes and observe what moves. The goal is not to “improve” the model. The goal is to learn which parts of the system are sensitive, which parts are stable, and how to report those findings so another person can reproduce your steps.

This chapter is built around a simple rule: treat your replication like a science experiment. You will plan one-change-at-a-time trials, change randomness settings, tweak data handling, optionally adjust environment/tool versions (or document why you can’t), and keep a clear experiment log. By the end, you should be able to say: “When I changed X, Y changed by about this much, under these conditions,” and back it up with files someone else can run.

Keep your scope beginner-friendly. Do not change ten things and then guess what caused the difference. Choose small, controlled changes, run them, and write down exactly what you did. This is how you build trustworthy evidence about what is reproducible, what is fragile, and what requires tighter documentation.

Practice note (applies to each milestone above): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Sections in this chapter
Section 4.1: Controlled experiments without math-heavy jargon

In reproducible AI, a “controlled experiment” simply means you change something on purpose and keep everything else as close to the baseline as possible. You’re trying to answer questions like: “If I change the random seed, do my results wobble a little or a lot?” or “If I shuffle the dataset differently, does accuracy change in a meaningful way?” You do not need advanced statistics to do this well; you need discipline and good notes.

Start by defining your baseline as a snapshot: the exact code version, data files, environment details, and the command you ran. If you can’t recreate the baseline on demand, you can’t interpret any differences later. A good practical baseline includes: a saved configuration file (or documented parameters), the dataset version or checksum, and the resulting metrics plus any plots or example outputs.

  • Control: what you keep constant (same dataset, same training steps, same evaluation script).
  • Treatment: the one thing you change (seed, split ratio, library version).
  • Outcome: what you measure (accuracy, loss curve, runtime, sample outputs).

Common mistake: “silent changes.” For example, you rerun the notebook and it downloads a newer dataset version, or your code uses the current date/time to name output folders and you later compare the wrong runs. Practical outcome: create a consistent run folder structure (e.g., runs/ with timestamps plus a short label), and always save the configuration and metrics in the same place for every run.
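The run-folder convention above can be sketched in a few lines of Python. Names like runs/, config.json, and metrics.json are illustrative assumptions, not a prescribed layout:

```python
# Sketch of a consistent run layout: runs/<timestamp>_<label>/ with the
# same two files saved for every run. All names here are assumptions.
import json
import time
from pathlib import Path

def make_run_dir(base="runs", label="baseline"):
    """Create runs/<timestamp>_<label>/ so runs sort chronologically."""
    stamp = time.strftime("%Y%m%d-%H%M%S")
    run_dir = Path(base) / f"{stamp}_{label}"
    run_dir.mkdir(parents=True, exist_ok=True)
    return run_dir

def save_run(run_dir, config, metrics):
    """Always store config and metrics under the same filenames."""
    (run_dir / "config.json").write_text(json.dumps(config, indent=2))
    (run_dir / "metrics.json").write_text(json.dumps(metrics, indent=2))

run_dir = make_run_dir(label="seed0")
save_run(run_dir, {"seed": 0, "epochs": 3}, {"accuracy": 0.84})
```

Because every run saves the same filenames in its own folder, later comparison is a matter of reading the same two files per run.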

Section 4.2: Changing one variable at a time

Milestone 1 is planning one-change-at-a-time experiments. This sounds obvious, but most replication failures come from changing multiple factors at once. If you change the seed, the data split, and the learning rate, then see a difference, you won’t know which change caused it. Your job is to make attribution easy.

A practical workflow is to write a tiny “experiment plan” before you run anything. Make a table in a text file called experiment_plan.md with rows like: baseline, seed change, data split change, preprocessing change, version change. For each row, include: the exact parameter you will alter, what you expect might happen, and what files you will save.

  • Choose a small knob: a single config value or a single line edit.
  • Keep runtime manageable: if training takes hours, reduce epochs consistently for all runs and say so.
  • Use a fixed evaluation: same metric script, same thresholding, same test set.

Common mistake: accidentally changing two variables through a hidden dependency. Example: changing batch size can also change the number of optimization steps per epoch or memory usage, which can trigger different backend kernels. If you must change something that has side effects, document those side effects explicitly and consider it “more than one change.” Practical outcome: you end up with a sequence of runs you can line up and compare, each with a clear cause.

Section 4.3: Randomness changes: seeds, sampling, and “small drift”

Milestone 2 is changing randomness settings and comparing results. Many ML pipelines contain randomness: weight initialization, data shuffling, dropout, augmentation, sampling, and sometimes nondeterministic GPU operations. Even with “the same code,” results can drift. Your job is to observe whether that drift is small (expected noise) or large (a warning sign).

First, locate where randomness is controlled. In beginner setups this usually means setting seeds for Python, NumPy, and your ML framework (such as PyTorch or TensorFlow). Also check for a shuffle=True in your data loader and any randomized augmentations. Then run a mini-seed sweep: baseline seed (e.g., 0), plus two or three other seeds (e.g., 1, 2, 3). Keep everything else identical.
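Setting the common seed sources in one place might look like this. This is a sketch: the torch import is guarded so the snippet also runs in environments without PyTorch installed:

```python
# Set the usual seed sources together, then sweep a few seed values.
import random
import numpy as np

def set_seeds(seed):
    random.seed(seed)
    np.random.seed(seed)
    try:
        import torch            # optional; only seeded if installed
        torch.manual_seed(seed)
    except ImportError:
        pass

for seed in [0, 1, 2, 3]:       # baseline seed plus a mini sweep
    set_seeds(seed)
    # ... run your training and evaluation here, everything else identical ...
```

Calling set_seeds at the very start of each run, before any data loading or model construction, is what keeps the sweep comparable.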

  • Record: seed value(s), whether determinism flags were used, and whether you ran on CPU/GPU.
  • Compare: final metric, but also the training curve shape and qualitative outputs (a few predictions).
  • Look for “small drift”: tiny metric differences that don’t change conclusions are normal.

Common mistake: believing a fixed seed guarantees identical results across machines. In practice, different hardware, different library versions, or parallelism can break bit-for-bit determinism. Practical outcome: you can report a “range” of outcomes under different seeds (for example, accuracy varies by about ±0.5 points), which is more honest than a single number.
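Reporting a range instead of a single number takes only a few lines. The accuracy values below are hypothetical:

```python
# Summarize a mini seed sweep as mean +/- sample standard deviation.
accs = {0: 0.842, 1: 0.838, 2: 0.845, 3: 0.840}   # hypothetical results
vals = list(accs.values())
mean = sum(vals) / len(vals)
std = (sum((v - mean) ** 2 for v in vals) / (len(vals) - 1)) ** 0.5
summary = f"accuracy = {mean:.3f} +/- {std:.3f} over {len(vals)} seeds"
print(summary)
```

A sentence like this in your report is the honest version of a single headline number.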

Section 4.4: Data changes: splits, preprocessing, and leakage (basic)

Milestone 3 is changing data handling slightly and comparing. Data differences are one of the biggest causes of replication gaps, and they’re easy to introduce accidentally. Start with “safe” changes that teach you sensitivity without breaking ethics or invalidating the task: change the train/validation split seed, adjust a preprocessing step, or modify how you handle missing values—always one at a time.

Splits: If the original study specifies a split, match it. Then try a controlled variation: keep the same split ratio but change the split random seed. Some datasets are small enough that the specific split matters a lot. Save the exact indices (or a file listing the IDs) for each split so others can reproduce it.
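Saving the exact split can be done with a short helper. The file name and JSON format here are assumptions:

```python
# Persist the exact train/validation indices for a given split seed.
import json
import numpy as np

def make_split(n_examples, val_fraction=0.2, seed=0):
    rng = np.random.default_rng(seed)        # the split seed, recorded
    idx = rng.permutation(n_examples)
    n_val = int(n_examples * val_fraction)
    return {
        "seed": seed,
        "val": sorted(int(i) for i in idx[:n_val]),
        "train": sorted(int(i) for i in idx[n_val:]),
    }

split = make_split(100, seed=0)
with open("split_seed0.json", "w") as f:
    json.dump(split, f)
```

Anyone with split_seed0.json can now reproduce your partition exactly, even without knowing the seed.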

Preprocessing: Change one preprocessing step in isolation—e.g., normalize inputs vs. not, lowercase text vs. keep case, resize images with a different method. Document the exact transformation and where it happens (before split vs. after split).

  • Leakage warning: do not “learn” preprocessing from the full dataset if it should be learned from training only (e.g., scaling parameters, vocabulary, feature selection).
  • Label hygiene: ensure labels aren’t accidentally included in features (common in CSV merges).

Common mistake: applying preprocessing using statistics computed on the entire dataset (train + test), then reporting test performance. That makes results look better and undermines reproducibility. Practical outcome: you learn which data handling choices are fragile and can explain differences responsibly.
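The leakage rule above, sketched with a simple normalization on toy numbers:

```python
# Leakage-safe normalization: statistics come from the training split
# only and are applied unchanged to the test split.
import numpy as np

train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[10.0]])

mu = train.mean(axis=0)              # train-only statistics
sigma = train.std(axis=0)
train_scaled = (train - mu) / sigma
test_scaled = (test - mu) / sigma    # never recomputed from test data
```

The same pattern applies to vocabularies, feature selection, and any other "learned" preprocessing: fit on train, apply frozen to test.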

Section 4.5: Environment changes: versions, hardware, and settings

Milestone 4 is changing environment/tool versions (or documenting constraints). Even if you never touch the code, results can change because the software stack changes. Library updates can alter default behaviors, numerical precision, or random number generators. Hardware differences (CPU vs GPU, different GPUs) can affect speed and sometimes determinism.

Begin by writing down what you have, not what you wish you had. Capture: operating system, Python version, key libraries (framework + NumPy/Pandas), and whether you used CPU or GPU. A lightweight approach is to save pip freeze (or a short curated list of important packages) into a file like environment_baseline.txt. If you’re using a hosted environment (Colab, Kaggle), note that it can change over time; include the date.
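A minimal capture script for that snapshot, assuming the curated package list and the environment_baseline.txt filename from the text:

```python
# Write a small, dated snapshot of the environment you actually have.
import platform
import sys
from datetime import date

lines = [
    f"date: {date.today().isoformat()}",
    f"os: {platform.platform()}",
    f"python: {sys.version.split()[0]}",
]
for pkg in ("numpy", "pandas"):        # curated list; extend as needed
    try:
        mod = __import__(pkg)
        lines.append(f"{pkg}: {mod.__version__}")
    except ImportError:
        lines.append(f"{pkg}: not installed")

with open("environment_baseline.txt", "w") as f:
    f.write("\n".join(lines) + "\n")
```

Rerun this at the start of every session on a hosted environment; the date line is what lets you connect a result change to a platform change.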

  • Version bump experiment: upgrade exactly one library (e.g., your ML framework) and rerun the baseline.
  • Hardware change experiment: run once on CPU if you used GPU (or vice versa) and compare metrics and runtime.
  • Constraints: if you cannot change versions, document that limitation and treat it as part of your replication report.

Common mistake: “mystery upgrades” where a new environment is created and many packages change at once. If you do need a fresh setup, rebuild it stepwise and log each change. Practical outcome: you can explain whether a result difference might plausibly come from a version change, and you can help others rebuild your environment.

Section 4.6: Interpreting differences: noise vs meaningful change

Milestone 5 is logging results and labeling changes clearly, then interpreting what you see. Not every difference matters. Your job is to separate expected noise from changes that alter the study’s conclusion. This is where engineering judgment shows up: you decide what comparisons are fair, what variability is normal, and what needs deeper investigation.

Create an experiment log that is easy to scan. A simple format works: one row per run in experiment_log.csv (or a Markdown table). Include: run ID, change label, command/config used, seed, data version, environment note, key metrics, and a link/path to artifacts (plots, model file, predictions). Use consistent naming like R03_seed2, R04_split_seed99, R05_torch2.2.
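Appending one row per run to experiment_log.csv could look like this; the column names are illustrative:

```python
# Append one labeled row per run; write the header only once.
import csv
from pathlib import Path

FIELDS = ["run_id", "change", "seed", "data_version",
          "env_note", "accuracy", "artifacts"]

def log_run(row, path="experiment_log.csv"):
    is_new = not Path(path).exists()
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow(row)

log_run({"run_id": "R03_seed2", "change": "seed only", "seed": 2,
         "data_version": "v1", "env_note": "torch 2.2 / GPU",
         "accuracy": 0.838, "artifacts": "runs/R03/"})
```

Because the fields are fixed, the log stays scannable in any spreadsheet tool, and each row points back to the run folder holding the full evidence.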

  • Noise indicators: small metric shifts across seeds, similar curves, similar error types.
  • Meaningful change indicators: large metric shifts, instability (diverging loss), or different qualitative failure modes.
  • Fairness check: ensure evaluation data and metric code are identical across runs.

Common mistake: chasing the “best” run and reporting only that. Replication is about transparency, not leaderboard behavior. Practical outcome: you can write a clear statement like: “Changing only the seed produced minor variation; changing the data split produced larger swings; upgrading the framework changed results slightly and slowed training,” backed by your logged evidence.

Chapter milestones
  • Milestone 1: Plan one-change-at-a-time experiments
  • Milestone 2: Change randomness settings and compare
  • Milestone 3: Change data handling slightly and compare
  • Milestone 4: Change environment/tool versions (or document constraints)
  • Milestone 5: Log results and label changes clearly
Chapter quiz

1. What is the main goal of making small, controlled changes after you have a working baseline run?

Show answer
Correct answer: Learn which parts of the system are sensitive or stable and how to report reproducible findings
Chapter 4 emphasizes learning what changes affect results and documenting it so others can reproduce your steps, not optimizing performance.

2. Which approach best follows the chapter’s “treat your replication like a science experiment” rule?

Show answer
Correct answer: Change one factor at a time, run, and compare results to the baseline
One-change-at-a-time trials make it possible to attribute differences to a specific change.

3. Why does the chapter recommend changing randomness settings as a controlled experiment?

Show answer
Correct answer: To see how sensitive outputs are to random variation under the same conditions
Adjusting randomness helps reveal whether outcomes shift due to stochasticity and how stable the system is.

4. If you tweak data handling slightly and observe different results, what should you be able to say by the end of the chapter?

Show answer
Correct answer: When I changed X, Y changed by about this much, under these conditions, supported by runnable files
The chapter’s target outcome is a clear, condition-specific claim backed by reproducible artifacts.

5. What is the purpose of logging results and labeling changes clearly during these experiments?

Show answer
Correct answer: So another person can reproduce your exact steps and understand what caused differences
A clear experiment log links each controlled change to its outcomes, enabling trustworthy reproduction.

Chapter 5: Compare Results and Turn Differences into Findings

You ran a baseline replication. You logged settings, inputs, and outputs. Now comes the part that turns “I ran it” into “I learned something”: comparing results and translating differences into clear, defensible findings. Beginners often stop at “my accuracy is lower,” but reproducible AI work goes one step further: you show what changed, how much it changed, and why it likely changed, using evidence you collected.

This chapter gives you a workflow you can repeat for almost any small ML paper or tutorial. You will build a simple comparison table across runs (Milestone 1), create a “what changed/what didn’t” list (Milestone 2), classify causes (Milestone 3), write limitations and confidence statements (Milestone 4), and finish with a reproducibility checklist (Milestone 5). The goal is not perfection. The goal is clarity: someone else should be able to understand your differences, and you should be able to defend your conclusions without guessing.

Think of differences as data. In reproducible AI, discrepancies are not automatically failures; they are signals. Some signals are expected (randomness), some indicate a meaningful methodological gap (data preprocessing), and some are simply reporting issues (metric mismatch). Your job is to separate these categories carefully and communicate them with the right level of confidence.

Practice note for Milestone 1 (Build a simple comparison table across runs): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Milestone 2 (Create a “what changed/what didn’t” list): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Milestone 3 (Classify causes: data, randomness, environment, choices): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Milestone 4 (Write limitations and confidence statements): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Milestone 5 (Produce a final reproducibility checklist for the study): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Sections in this chapter
Section 5.1: Metrics and outcomes: reading them as a beginner

Start by confirming you are comparing the same outcome. Many replication “failures” are metric misunderstandings. A paper might report test accuracy, while your notebook prints validation accuracy; a leaderboard might use macro F1, while you used micro F1; the reported number might be averaged over 5 seeds, while you ran once.

As a beginner, you can read metrics with a few simple questions: What is the unit (accuracy %, loss, BLEU points)? Which split (train/validation/test)? Which aggregation (single run, mean±std, best checkpoint)? Which decision threshold (0.5 by default, tuned on validation)? If any of these differ, you are not yet comparing like-with-like.

  • Match the split: confirm that your evaluation is on the same dataset partition as the study.
  • Match the definition: for F1, specify macro/micro/weighted; for AUC, specify ROC vs PR.
  • Match the checkpoint rule: “last epoch” vs “best validation” can change results substantially.
  • Match preprocessing: tokenization, normalization, resizing, and label mapping are part of the metric pipeline.

Practical outcome: write one short “Metric spec” paragraph in your log (or README) that states exactly what you computed. This becomes the anchor for all later comparisons and prevents you from chasing differences that are only bookkeeping.

Section 5.2: Comparing outputs side-by-side (practical approach)

Milestone 1 is a simple comparison table across runs. Keep it boring and explicit: one row per run, one column per variable or outcome you care about. The table is not just for metrics—it should include the key inputs and settings that could plausibly affect outcomes.

A practical minimum table for beginners includes: run ID, date/time, code version (commit hash or zip name), dataset version (file checksum or download URL + date), environment (Python and library versions), seed, training steps/epochs, and primary metric(s). Add 1–2 “sanity check” outputs such as number of training examples, label distribution, or a small sample prediction.

  • Run IDs: use a consistent naming convention (e.g., run_003_seed42_augOff).
  • Keep raw outputs: save evaluation JSON/CSV, not just screenshots.
  • Include failure runs: crashes and partial runs often reveal environment issues.

Milestone 2 is the “what changed/what didn’t” list. After you fill the table, write two bullet lists: (A) fields that differ across runs (e.g., seed, GPU type, tokenizer version), and (B) fields that stayed constant (e.g., dataset split file, metric function). This list is your guardrail against over-explaining: you can’t attribute a change to something that did not change.

Common mistake: comparing only the final metric. If your accuracy dropped, but your number of training examples also changed because of a filtering step, the “result difference” is downstream of a data difference. Your side-by-side view should expose that immediately.
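The “what changed/what didn’t” list can even be generated mechanically from two run-config dictionaries. A sketch, with hypothetical field names:

```python
# Produce "what changed" and "what didn't" from two run-config dicts.
def diff_configs(a, b):
    changed = {k: (a.get(k), b.get(k))
               for k in sorted(set(a) | set(b)) if a.get(k) != b.get(k)}
    same = sorted(k for k in set(a) & set(b) if a[k] == b[k])
    return changed, same

baseline = {"seed": 0, "split_file": "split_seed0.json", "torch": "2.1"}
run = {"seed": 2, "split_file": "split_seed0.json", "torch": "2.1"}
changed, same = diff_configs(baseline, run)
# changed == {"seed": (0, 2)}; same == ["split_file", "torch"]
```

If your runs already save a config file each, diffing those files directly is the most trustworthy version of this list.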

Section 5.3: Effect size in plain language (how big is the change?)

Once you’ve confirmed metrics match and you have a comparison table, you need to answer: is the change small noise or a meaningful shift? “Effect size” here does not require advanced statistics. For beginner replications, you can use clear, plain-language comparisons that still communicate magnitude responsibly.

Start with absolute difference and relative difference. Example: accuracy went from 0.84 to 0.82 (absolute −0.02, relative −2.4%). For loss, lower is better, so state direction explicitly. Then add one stability check: run multiple seeds (even 3 is helpful) or rerun evaluation if training is expensive but evaluation is cheap.
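The absolute/relative arithmetic from the example, as a small reusable helper:

```python
# State direction explicitly: signed absolute and relative difference.
def describe_change(old, new, name="accuracy"):
    abs_diff = new - old
    rel_diff = abs_diff / old * 100
    return f"{name}: {old} -> {new} ({abs_diff:+.3f} abs, {rel_diff:+.1f}% rel)"

print(describe_change(0.84, 0.82))
# accuracy: 0.84 -> 0.82 (-0.020 abs, -2.4% rel)
```

The signed formatting (+/−) forces you to state direction, which matters for metrics like loss where lower is better.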

  • Small change: within the typical run-to-run wiggle you observe across seeds (e.g., ±0.5% accuracy).
  • Medium change: larger than seed variation but still plausible from minor preprocessing or library updates.
  • Large change: clearly outside observed variability; likely a pipeline mismatch, data shift, or major environment change.

If the original study reports mean±std over multiple runs, compare your score to that range. If it reports only a single number, be careful: you don’t know its variability. In that case, your best move is to report your own variability (“Across 3 seeds, mean=…, std=…”) and interpret differences with humility.

Practical outcome: in your report, include one sentence that quantifies size and one sentence that interprets it. Example: “Our test F1 is 0.71 vs 0.74 reported (−0.03). Across 3 seeds our std is 0.01, so this gap is larger than our observed randomness and suggests a methodological difference.”

Section 5.4: Root-cause thinking for result differences

Milestone 3 is classifying causes: data, randomness, environment, and choices. This is engineering judgment: you won’t always “prove” the cause, but you can build a strong case by testing one change at a time and keeping careful notes.

Data causes include different dataset versions, different split files, missing examples, preprocessing differences (tokenization, normalization, resizing), label mapping, and leakage (train/test contamination). Data issues often show up as changed dataset counts, different class balance, or surprising examples when you spot-check.

Randomness causes include different seeds, nondeterministic GPU operations, data loader shuffling, and dropout. A telltale sign is that results move around with seeds but stay within a narrow band.

Environment causes include library version differences, different CUDA/cuDNN behavior, CPU vs GPU execution, and OS-level differences. These can change performance or even numerics. Record versions in every run so you can connect a score change to a version change.

Choices causes are your decisions: hyperparameters, early stopping criteria, batch size, learning rate schedule, augmentation settings, and evaluation thresholds. These are not “mistakes,” but they must be labeled as deviations from the original protocol.

  • Make one controlled change: if you suspect preprocessing, rerun with only that preprocessing aligned.
  • Use ablation logic: revert changes until the result returns, then reapply to confirm.
  • Prefer simplest explanations first: metric mismatch, split mismatch, seed differences, version drift.

Common mistake: changing five things to “get closer” to the paper and then not knowing what mattered. Your comparison table plus controlled edits keeps the story traceable.

Section 5.5: When you can’t replicate: documenting blockers

Sometimes replication fails for reasons that have nothing to do with your skill. Data may be unavailable, code may not run, a dependency may be deprecated, or the study may omit key details. Milestone 4 is writing limitations and confidence statements that are honest and useful.

Document blockers as specific, testable statements. “Couldn’t reproduce” is vague; “Dataset download link returns 404 as of 2026-03-28; attempted mirrors X and Y; no checksum provided in paper” is actionable. For missing hyperparameters, list what you tried and why: “Paper did not specify max sequence length; tested {128, 256} based on GPU memory; results varied by 0.02 F1.”

  • Blocker types: unavailable data, incomplete method description, unreleased code, licensing/ethics constraints, compute limits.
  • Evidence to include: error logs, screenshots of missing links, environment files, minimal failing example.
  • Confidence language: “We are confident the metric implementation matches; we are uncertain about preprocessing because …”

Your limitations section should separate what you verified from what you inferred. This protects you from overclaiming and makes your replication valuable even when it’s incomplete: future readers can pick up exactly where you left off.

Section 5.6: Creating a reproducibility “scorecard” for your project

Milestone 5 is a final reproducibility checklist for the study—your “scorecard.” This turns your work into something others can quickly trust and reuse. Keep it short enough to fit in a README, but concrete enough that someone can rerun your pipeline without a meeting.

A practical scorecard has two parts: (1) replication status (did you match the headline metric within an acceptable range?), and (2) reproducibility readiness (can a new person rerun your exact results?). Include both, because you can sometimes match results without being reproducible (e.g., manual steps), and you can be reproducible even if results differ (e.g., clear documented deviations).

  • Runs logged: run table includes code version, data version, env versions, seeds, metrics.
  • Outputs saved: raw predictions/metrics files stored with run IDs.
  • Determinism: seeds set; nondeterminism noted; multi-seed variability reported.
  • Data traceability: source URLs, checksums (or counts), and split method documented.
  • Protocol match: deviations from the original study listed (choices) and justified.
  • Limitations: blockers and uncertainties stated with evidence.
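The data-traceability item above can be backed by a short checksum helper. The sha256 choice and file name are illustrative:

```python
# Compute a file checksum so "data version" is verifiable by others.
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk_size):
            h.update(block)
    return h.hexdigest()

with open("data_sample.csv", "w") as f:   # stand-in for your dataset
    f.write("id,label\n1,0\n2,1\n")
print("sha256:", sha256_of("data_sample.csv"))
```

Put the resulting hex digest in your scorecard next to the data source URL; anyone who downloads the data can confirm they have the same bytes.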

End your chapter deliverable with a brief “findings” paragraph that ties everything together: what stayed the same, what changed, how big the change was, and your best-supported cause category. When you can do that clearly, you have moved from “I tried it” to “I produced a reproducible research artifact.”

Chapter milestones
  • Milestone 1: Build a simple comparison table across runs
  • Milestone 2: Create a “what changed/what didn’t” list
  • Milestone 3: Classify causes: data, randomness, environment, choices
  • Milestone 4: Write limitations and confidence statements
  • Milestone 5: Produce a final reproducibility checklist for the study
Chapter quiz

1. In this chapter’s workflow, what turns “I ran it” into “I learned something”?

Show answer
Correct answer: Comparing results and translating differences into clear, defensible findings using collected evidence
The chapter emphasizes showing what changed, how much, and why—based on evidence—not just reporting a metric.

2. Which sequence best matches the milestones described in Chapter 5?

Show answer
Correct answer: Build a comparison table → make a what changed/what didn’t list → classify causes → write limitations/confidence → produce a reproducibility checklist
The milestones are presented as a repeatable workflow in that order from comparison to checklist.

3. What is the chapter’s main goal when reporting differences between runs?

Show answer
Correct answer: Clarity so others can understand the differences and you can defend conclusions without guessing
The chapter explicitly states the goal is not perfection; it is clarity and defensible conclusions.

4. How does the chapter suggest you should treat discrepancies between your replication and the original?

Show answer
Correct answer: As signals that may be expected or meaningful, to be categorized and communicated with appropriate confidence
Discrepancies are framed as data/signals: some expected (randomness), some meaningful (method gaps), some reporting issues.

5. According to Milestone 3, which set best represents the cause categories you should use to explain differences?

Show answer
Correct answer: Data, randomness, environment, and choices
Milestone 3 focuses on classifying causes into data, randomness, environment, and choices.

Chapter 6: Write the Replication Report and Share Responsibly

You ran the replication, logged your settings, and observed where results match (or drift). Now you need to turn that work into a report another person can trust and rerun. A good replication report is not a dramatic story; it is a careful, minimal explanation of what you attempted, what you changed, and what happened. It also acts as a “map” to your artifacts: code, configs, seeds, outputs, and notes.

This chapter walks you through five practical milestones: drafting a 1–3 page structure, moving the messy details into an appendix, writing results and discussion clearly, packaging the work for reruns, and completing a final self-audit before publishing or submitting. The goal is a beginner-friendly deliverable that is honest, checkable, and respectful of data rights and privacy.

Keep your audience in mind: a classmate, a reviewer, or your future self in three months. They do not want prose first—they want decisions, versions, and commands. When in doubt, prefer specifics over adjectives. “Used Python 3.11.6 and seed=42” is better than “used the latest Python and fixed randomness.”

Practice note for Milestone 1 (Draft a 1–3 page replication report structure): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Milestone 2 (Add method, logs, and artifacts as an appendix): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Milestone 3 (Write a clear results and discussion section): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Milestone 4 (Create a shareable package with files + instructions): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Milestone 5 (Do a final self-audit and publish or submit): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Sections in this chapter
Section 6.1: Replication report template (beginner-friendly)

Milestone 1 is to draft a 1–3 page structure that you can fill in quickly. A replication report is different from a typical research paper: you are not introducing a new method, you are validating (or stress-testing) a published claim. Your structure should make it easy to see (1) what the original study did, (2) what you did, and (3) how close the outcomes are.

Use a template like this (keep headings, keep it short):

  • Title + citation: “Replication of [Paper Title], [Year]” + full citation.
  • What you tried to replicate: one paragraph describing the dataset/task/metric and the specific result/figure/table you targeted.
  • Replication scope: what you did not replicate (e.g., “no hyperparameter sweep,” “smaller subset due to compute limits”).
  • Environment summary: OS, Python version, key libraries, hardware basics (CPU/GPU), and where these details are recorded (appendix or files).
  • Method: your pipeline steps, aligned with the original paper’s steps; note any necessary substitutions.
  • Results: side-by-side comparison (original vs yours) with the metric definition.
  • Discussion: reasons for matches/differences (random seeds, data versions, preprocessing, hardware, library changes).
  • How to rerun: one command sequence; point to your packaged instructions.

Common beginner mistake: mixing “what I changed” into every paragraph. Instead, write the method as if it is the official procedure, then call out deviations explicitly in a short “Deviations from the original” bullet list. This reduces confusion and makes your engineering judgment visible.
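If you like working from files rather than blank pages, the template above can be generated as a Markdown skeleton. This is a minimal sketch, not part of the official template: the function name and the placeholder text are illustrative.

```python
# Minimal sketch: scaffold the replication report template above as a
# Markdown skeleton you can fill in. Section names mirror the bullet list;
# the "TODO" placeholders are illustrative.

SECTIONS = [
    "Title + citation",
    "What we tried to replicate",
    "Replication scope",
    "Environment summary",
    "Method",
    "Results",
    "Discussion",
    "Deviations from the original",
    "How to rerun",
]

def report_skeleton(paper_title: str, year: int) -> str:
    """Return a Markdown skeleton with one heading per template section."""
    lines = [f"# Replication of {paper_title} ({year})", ""]
    for name in SECTIONS:
        lines += [f"## {name}", "", "TODO", ""]
    return "\n".join(lines)

skeleton = report_skeleton("Example Paper", 2023)
```

Save the output as your report draft; the "Deviations from the original" heading keeps changes in one explicit place, as recommended above.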

Section 6.2: How to cite sources, data, and tools correctly

Replication work stands or falls on traceability. Milestone 2 (appendix with logs and artifacts) becomes more valuable when you cite every dependency precisely: the paper, the dataset, the codebase, and the tools. Proper citation is not just academic style—it is the reproducibility “address” that lets someone fetch the same inputs you used.

At minimum, cite:

  • Original paper: include DOI/URL and the exact version you read (conference version vs arXiv preprint can differ).
  • Datasets: name, official homepage, license, and version/date. If the dataset is dynamic, record the snapshot date or checksum.
  • Code: repository URL plus commit hash/tag. If you used your fork, cite both upstream and your fork.
  • Tools and libraries: key packages and versions (PyTorch/TensorFlow, NumPy, scikit-learn). Cite major tools if requested by their maintainers (some provide “Cite this software” text).

Include a short “Provenance” subsection listing where each artifact came from and how you verified it (e.g., checksum, official release). If you used a pre-trained model, treat it like data: cite the model card, version, and any constraints. Avoid vague statements like “used CIFAR-10” without specifying which source and split procedure.
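A checksum is easy to compute yourself. Here is a minimal sketch using Python's standard library; the recorded value and file path in the usage comment are hypothetical.

```python
# Minimal sketch: record and verify a SHA-256 checksum for a data file,
# so a "Provenance" entry can state exactly which bytes were used.
import hashlib

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so large datasets need not fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

# Usage (hypothetical path and value): compare against the checksum you
# recorded when you first downloaded the data.
# recorded = "9f86d0..."  # from your provenance notes
# assert sha256_of("data/train.csv") == recorded, "dataset bytes changed"
```

Put the hex digest in your Provenance subsection next to the source URL and access date.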

Common mistake: citing only URLs. URLs rot. Add a date accessed and, when possible, a permanent identifier (DOI, Zenodo record, commit hash). This small habit prevents future confusion when an online file changes silently.

Section 6.3: Presenting comparisons: tables, bullets, and visuals

Milestone 3 is writing results and discussion so the reader can evaluate “same or different” quickly. Your job is not to overwhelm with plots; it is to present the core comparison clearly, then back it up with evidence.

Start with a compact comparison table. Include: metric name, original reported value, your value (mean ± std if you ran multiple seeds), difference (absolute and/or relative), and notes (e.g., “different tokenizer,” “dataset version mismatch”). Tables force precision and reduce the temptation to over-explain.

  • Use bullets for deviations: one list labeled “Differences from original implementation,” each item starting with a noun (“Preprocessing: …”, “Training schedule: …”).
  • Use visuals only when they add meaning: a learning curve plot can reveal underfitting; a confusion matrix can show class-specific drift. Don’t add charts that merely restate the table.
  • Report uncertainty: if randomness matters, run at least 3 seeds when feasible and report variability. If you cannot, say so and explain why.
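The comparison-table row described above can be computed mechanically. A minimal sketch, assuming you have per-seed metric values as plain floats; the metric name and numbers are illustrative:

```python
# Minimal sketch: turn per-seed metrics into one comparison-table row
# (original reported value vs ours, with mean ± std and absolute difference).
from statistics import mean, stdev

def comparison_row(metric: str, original: float, ours: list[float]) -> dict:
    """Summarize multiple seeds into the row format used in the report table."""
    m = mean(ours)
    s = stdev(ours) if len(ours) > 1 else 0.0
    return {
        "metric": metric,
        "original": original,
        "ours": f"{m:.2f} ± {s:.2f}",
        "abs_diff": round(m - original, 2),
    }

# Illustrative numbers: paper reports 94.5; our three seeds gave these values.
row = comparison_row("accuracy", 94.5, [92.8, 93.1, 92.6])
```

One row per metric keeps the table compact; the "notes" column from the template is added by hand, since it records judgments, not numbers.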

In the discussion, connect changes to likely causes. Good discussion uses testable language: “Accuracy is 1.8 points lower; likely due to data preprocessing mismatch (normalization differs). Future check: rerun with the paper’s mean/std.” Avoid claims like “the paper is wrong” unless you have ruled out common causes (seed control, data version, metric definition, evaluation code).

Common mistake: comparing against the wrong metric or split. Always restate: which dataset split, which thresholding, which averaging method (macro vs micro), and which evaluation script. If you used a different evaluation function, put the reasoning and code location in the appendix.

Section 6.4: Ethics basics: privacy, licenses, and responsible claims

Replication is not only technical; it is also about responsible sharing. Before you publish anything, verify you have the right to redistribute code, data, and outputs. This is where beginners often make avoidable mistakes—especially when datasets include people, text scraped from the web, or images with unclear rights.

Three practical checks:

  • Privacy: do not share raw personal data, identifiers, or model outputs that could reveal private information. If you used sensitive data, share only aggregated metrics and a description of how access is controlled.
  • Licenses: read dataset and code licenses. Some allow research use but forbid redistribution. If redistribution is restricted, provide scripts to download from the official source instead of bundling the files.
  • Claims: phrase conclusions carefully. A replication with small compute is evidence, not a verdict. Say “In our setup…” and list constraints. Avoid overstating generality.

Also check whether the original paper has known issues or retractions; note them neutrally. If you discovered a potential bug, describe it precisely and respectfully, and provide a minimal reproduction. Responsible reporting improves the community; accusatory language shuts down collaboration.

Common mistake: uploading a full “results/” folder that includes copyrighted images or private text. A safer pattern is to share derived artifacts (plots, summary tables) and the code to regenerate them, plus a clear README explaining what is intentionally excluded.

Section 6.5: Packaging your work so others can rerun it

Milestone 4 is to create a shareable package: files plus instructions that allow a rerun with minimal guessing. Think of packaging as a product: if a stranger cannot run it in 15–30 minutes (excluding training time), your package is incomplete.

A beginner-friendly folder layout might look like this:

  • README.md: what this is, what it replicates, prerequisites, and the exact commands to run.
  • environment.txt or requirements.txt: pinned versions where possible.
  • src/: code.
  • configs/: JSON/YAML configs used for runs; name them with dates or experiment IDs.
  • logs/: experiment log (CSV/Markdown) and console outputs.
  • results/: final metrics tables and plots (avoid raw restricted data).
  • appendix/: full method notes, command history, and references to artifacts.

Include a “Quickstart” section with copy-paste commands: create environment, download data (from official source), run training/evaluation, and reproduce the main table/figure. If the project requires a GPU, state it and provide a CPU fallback if possible (even if slower or lower accuracy). Record random seeds in config files and ensure your code actually uses them (set seeds for Python, NumPy, and your ML framework).
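The seed discipline mentioned above can live in one small helper. A minimal sketch: it seeds Python's `random` and, when installed, NumPy; the PyTorch hook is an assumption — swap in your framework's own seeding call.

```python
# Minimal sketch: seed every source of randomness the project uses.
# NumPy and PyTorch are seeded only if installed; PyTorch here is an
# assumption -- replace with your framework's seeding call if different.
import random

def set_seeds(seed: int) -> None:
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
    except ImportError:
        pass

set_seeds(42)  # in practice, read the seed from your config file
```

Call this once at startup, before any data shuffling or weight initialization, and record the seed value in the run's config file.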

Common mistake: packaging only code, not configs. If someone cannot see your hyperparameters, preprocessing, and evaluation settings, they cannot replicate your replication. Treat configs as first-class artifacts.

Section 6.6: Final audit: completeness, clarity, and next steps

Milestone 5 is a final self-audit before you publish or submit. Your goal is to catch “silent gaps” that make reruns fail or make your conclusions ambiguous. Use a checklist and be strict—this is how you earn trust.

Completeness checks:

  • One-click path: can you reproduce the main result from a clean folder using only the README?
  • Versions recorded: OS, Python, libraries, and any external tools (CUDA version if relevant).
  • Data provenance: dataset source, version/snapshot date, and any filtering steps.
  • Experiment log: each run has an ID, config reference, seed, and output location.
  • Appendix present: method details, deviations, and links to artifacts.
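The experiment-log check above is easier to pass if logging is automatic. A minimal sketch, assuming a CSV log in `logs/`; the field names match the checklist, and the paths in any call are illustrative:

```python
# Minimal sketch: append one row per run to a CSV experiment log,
# covering the checklist fields: run ID, config reference, seed, output location.
import csv
import os

LOG_FIELDS = ["run_id", "config", "seed", "output_dir"]

def log_run(log_path: str, run_id: str, config: str,
            seed: int, output_dir: str) -> None:
    """Append a run record; write the header only when creating the file."""
    new_file = not os.path.exists(log_path)
    with open(log_path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=LOG_FIELDS)
        if new_file:
            writer.writeheader()
        writer.writerow({"run_id": run_id, "config": config,
                         "seed": seed, "output_dir": output_dir})
```

Calling `log_run` at the end of each experiment keeps the log append-only, so earlier runs are never silently overwritten.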

Clarity checks:

  • Metric definition: explicitly state how the metric is computed and on which split.
  • Comparison is readable: a single table or figure that answers “how close are we?”
  • Discussion is constrained: you separate observed facts from hypotheses and note limitations.

Next steps: if you plan to share publicly, choose an appropriate venue (class submission, GitHub, OSF, or a simple zip). If you found a likely issue in the original work, consider contacting the authors with a short, courteous summary and a minimal reproduction package. Replication is a skill; every careful report you ship makes your future projects faster, safer, and more credible.

Chapter milestones
  • Milestone 1: Draft a 1–3 page replication report structure
  • Milestone 2: Add method, logs, and artifacts as an appendix
  • Milestone 3: Write a clear results and discussion section
  • Milestone 4: Create a shareable package (files + instructions)
  • Milestone 5: Do a final self-audit and publish or submit
Chapter quiz

1. What is the primary purpose of a replication report according to Chapter 6?

Correct answer: To provide a careful, minimal explanation of what you attempted, what you changed, and what happened so others can trust and rerun it
The chapter emphasizes a replication report as a trustable, rerunnable record—not a dramatic story or persuasive essay.

2. How should you handle “messy details” like logs, configs, and outputs in the report?

Correct answer: Put them mainly in an appendix so the main report stays clear while remaining checkable
Milestone 2 recommends moving method details, logs, and artifacts into an appendix while keeping the main report readable.

3. Which writing choice best matches Chapter 6’s guidance on clarity?

Correct answer: Use specifics like versions and seeds (e.g., “Python 3.11.6 and seed=42”) rather than adjectives
The chapter advises preferring concrete, reproducible specifics over vague language and prose-first writing.

4. What does Chapter 6 mean when it says the report should act as a “map” to your artifacts?

Correct answer: It should point readers to the code, configs, seeds, outputs, and notes needed to rerun and verify the work
A key function of the report is to guide others to the exact materials needed for checking and rerunning.

5. Which sequence best reflects the five milestones in Chapter 6?

Correct answer: Draft a 1–3 page structure → add method/logs/artifacts as an appendix → write results and discussion clearly → create a shareable package → do a final self-audit and publish/submit
The chapter outlines a practical progression from report structure to appendix, clear results/discussion, packaging, and a final self-audit.