
Proctored-Ready ML Portfolio Project: Tests, Docs, and CI

AI Certifications & Exam Prep — Intermediate

Build and ship a cert-grade ML project that holds up under proctoring.

Intermediate machine-learning · certification · portfolio · testing

Build a certification-grade ML project that stands up to proctoring

This course is a short, book-style build sprint focused on one outcome: a proctored-ready machine learning portfolio project you can confidently submit for certification reviews, technical screenings, or capstone evaluations. Instead of “toy notebooks,” you’ll ship a structured repository with repeatable training, defensible validation, automated tests, and professional documentation—exactly the evidence reviewers look for when they assess competence under time pressure.

You’ll work through a coherent 6-chapter progression that mirrors real-world ML delivery: define the problem and evidence plan, build a reproducible baseline, improve modeling with rigorous validation, harden reliability with tests, publish documentation that communicates intent and limitations, and finish with CI plus a submission-ready release. Throughout, the emphasis is on clarity, traceability, and reproducibility—so your project can be evaluated fairly and rerun on demand.

What makes this “proctored-ready”

Proctored and certification-style assessments often require you to explain your work and prove it functions without hidden steps. This course bakes those constraints into the build:

  • Evidence-first planning: you’ll define what must be proven (metrics, quality gates, reproducibility) before coding.
  • Deterministic runs: pinned dependencies, controlled randomness, and consistent artifacts.
  • Quality gates: unit + integration tests, data checks, and regression safeguards around performance.
  • Reviewer-friendly docs: README, model card, data documentation, and runbooks that answer predictable questions.
  • Automation: CI that runs checks on every change so reviewers can trust the repository state.

Who this is for

This is for learners who already know basic Python and have seen scikit-learn workflows, but want to level up into certification-ready execution. If you’ve built models before yet struggle to package them as a credible portfolio artifact—with tests, documentation, and repeatability—this is your bridge from “it works on my machine” to “it passes review.”

How the 6 chapters flow

You’ll start by choosing a tractable ML problem and defining acceptance criteria and an evidence map. Next, you’ll scaffold a clean repo and implement a baseline with a reproducible data flow. Then you’ll improve the model with rigorous validation and responsible evaluation practices. After that, you’ll add a full testing strategy tailored to ML (including data validation and metric regression checks). You’ll document the project like a professional, producing artifacts that reviewers can scan quickly. Finally, you’ll automate checks in CI, package the project for repeatable execution, and assemble a submission-ready release with a proctored walkthrough script.

What you’ll finish with

  • A clean, structured ML repository you can share publicly or privately
  • Repeatable training/evaluation entry points (scripted, not manual)
  • Automated tests and quality gates aligned to ML failure modes
  • Professional documentation: README + model card + data documentation
  • CI automation to validate every commit/PR
  • A final checklist and walkthrough narrative for timed reviews

Get started

If you want a portfolio project that reads like a professional submission and behaves like a reliable software product, this course is designed to get you there quickly. Register free to begin, or browse all courses to compare options in AI certification prep.

What You Will Learn

  • Design a certification-grade ML project plan with clear scope, risks, and evaluation criteria
  • Set up a reproducible Python ML repository structure with environment pinning and data version strategy
  • Implement a baseline-to-improvement modeling workflow with defensible metrics and validation
  • Write robust unit, integration, and data validation tests for ML code and pipelines
  • Create professional documentation: README, model card, data sheet, and API/CLI usage guides
  • Add CI pipelines for linting, tests, and build checks to meet proctored-review expectations
  • Produce a final portfolio package: release notes, reproducibility checklist, and review-ready evidence
  • Practice a proctored-style walkthrough: explain design choices, tradeoffs, and results under time constraints

Requirements

  • Basic Python (functions, modules, virtual environments)
  • Familiarity with pandas and scikit-learn at an introductory level
  • Git and GitHub basics (clone, commit, push, pull request)
  • A laptop capable of running local Python environments

Chapter 1: Define the Cert-Grade Project and Evidence Plan

  • Select an ML problem and define success metrics
  • Write the scope statement and acceptance criteria
  • Create the evidence map (tests, docs, reproducibility)
  • Draft the proctored walkthrough script and rubric
  • Set the repo charter: licensing, ethics, and constraints

Chapter 2: Build a Reproducible Repo, Data Flow, and Baseline

  • Scaffold the repository with a clean ML project layout
  • Pin environments and add one-command setup
  • Implement data ingest and deterministic splits
  • Train and log a baseline model with a repeatable script
  • Capture initial results and known limitations

Chapter 3: Modeling, Validation, and Responsible Evaluation

  • Upgrade from baseline to a stronger model with justification
  • Implement cross-validation and hyperparameter search
  • Add error analysis and slice-based evaluation
  • Calibrate metrics and thresholds for the use case
  • Freeze a candidate model for release

Chapter 4: Tests for ML: Unit, Data, and Pipeline Reliability

  • Write unit tests for transforms, metrics, and utilities
  • Add data validation tests for schema and ranges
  • Create integration tests for train/evaluate scripts
  • Add regression tests to detect metric drift in PRs
  • Generate coverage and enforce quality gates

Chapter 5: Documentation That Passes Reviews: README, Cards, and Usage

  • Write a portfolio-ready README with quickstart and evidence links
  • Publish a model card and a data sheet
  • Document API/CLI usage with examples and edge cases
  • Create runbooks: troubleshooting, reproducibility, and FAQs
  • Add a changelog and release checklist

Chapter 6: CI, Packaging, and Proctored-Ready Submission

  • Set up CI to run linting, tests, and build checks on every PR
  • Add reproducible packaging and a single entry-point command
  • Create a release artifact and tag a versioned submission
  • Run a timed proctored-style walkthrough and fix gaps
  • Assemble the final portfolio packet and review evidence

Sofia Chen

Senior Machine Learning Engineer (MLOps, Model Quality)

Sofia Chen is a senior machine learning engineer who builds production ML systems with strong testing, reproducibility, and governance. She has mentored candidates preparing certification portfolios and code reviews, focusing on measurable reliability and clear documentation.

Chapter 1: Define the Cert-Grade Project and Evidence Plan

A certification-grade ML portfolio project is not judged only by the final metric. In a proctored or reviewer-led evaluation, you are assessed on whether your work is reproducible, defensible, and communicable under time pressure. That means you must plan your project like an engineering deliverable: clear scope, measurable success, documented constraints, and a premeditated “evidence trail” that proves you built and validated what you claim.

This chapter turns the common “I built a model” story into a cert-ready artifact: a scope statement with acceptance criteria, an evidence map (tests, docs, CI, reproducibility), and a short walkthrough script you can follow during a proctored review. You will also set your repository charter—license, privacy posture, ethics considerations, and constraints—so a reviewer can quickly determine whether your project is safe, legitimate, and professionally executed.

  • Outcome for this chapter: a written project brief (problem + metrics + scope), a risk log, an evidence map, and a delivery plan you can execute in short sprints.

The rest of the course will implement the plan. But the plan is the part that prevents wasted work: it stops you from chasing fancy models without data quality checks, building APIs without tests, or claiming performance improvements without appropriate validation. Think of Chapter 1 as designing your “proof strategy” before you write code.

Practice note for Select an ML problem and define success metrics: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Write the scope statement and acceptance criteria: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Create the evidence map (tests, docs, reproducibility): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Draft the proctored walkthrough script and rubric: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Set the repo charter: licensing, ethics, and constraints: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Sections in this chapter
Section 1.1: What proctors and reviewers look for

Proctored reviews typically reward clarity and reliability over novelty. A reviewer wants to verify that you can make good engineering decisions, explain trade-offs, and produce artifacts that another person could run and trust. In practice, that means your project must be auditable: inputs, transformations, modeling choices, and results should be traceable and reproducible.

Expect reviewers to look for: (1) a well-defined ML problem with business or user relevance, (2) an explicit baseline and a justified improvement path, (3) strong validation discipline (train/validation/test separation and leakage controls), and (4) professional hygiene—linting, tests, pinned environments, and clear documentation. They also look for an “operator mindset”: can someone else clone your repo and run make test or python -m ... without guessing?

  • Common mistake: showing a single notebook with results but no explanation of data splits, no fixed seeds, and no way to reproduce the same numbers.
  • Common mistake: presenting an impressive metric but no error analysis, no baseline, or no reason the metric matches the real goal.

Start thinking like a reviewer from day one. If you cannot explain your project in five minutes, you will struggle in a proctored setting. Write down your “walkthrough story” early: what problem, what data, what baseline, what improvement, what evidence you will show (tests, docs, CI runs), and what risks you considered. This chapter’s deliverables become your script and your checklist.

Section 1.2: Problem framing and measurable outcomes

Select an ML problem that is narrow enough to complete, but rich enough to demonstrate core competencies: data handling, modeling, evaluation, and reliability. Good candidates include tabular classification/regression, text classification, or time-series forecasting with careful validation. Avoid problems where success depends on massive compute, proprietary data, or complex labeling pipelines unless your course or exam specifically expects it.

Define success metrics before modeling. A cert-grade plan includes (a) a primary metric aligned to the task and cost of errors (e.g., F1 for imbalanced classification, MAE for forecasting), (b) secondary metrics (calibration, latency, memory, fairness slices if relevant), and (c) a baseline target. The baseline is not “state of the art”; it is a simple, defensible reference such as logistic regression, a decision tree, or a naive forecast. Your “improvement” is measured relative to that baseline with consistent evaluation.

  • Write metric definitions: name, formula reference (or library function), averaging strategy (macro/micro), and threshold selection rules.
  • Define acceptance thresholds: e.g., “F1 ≥ 0.78 on held-out test; calibration error ≤ 0.05; inference ≤ 50ms per sample on CPU.”

Engineering judgment: pick metrics you can defend. For example, accuracy is often misleading on imbalanced datasets; AUC can be useful but can hide poor precision at relevant thresholds. If your project includes an API/CLI, you may also define a “non-ML” acceptance criterion, such as “CLI returns a prediction with schema validation and helpful errors.” Your goal is to turn “works on my machine” into “meets written acceptance criteria.”
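To make this concrete, here is a minimal sketch of turning a written metric definition into code: macro-averaged F1 computed from scratch (so the averaging strategy is explicit rather than hidden in a library default) plus an acceptance check against a pre-registered threshold. The function names and the 0.78 threshold are illustrative, not prescribed by the course.

```python
def macro_f1(y_true, y_pred):
    """Macro F1: unweighted mean of per-class F1 scores.

    Writing the averaging out explicitly documents the metric
    definition a reviewer will ask about (macro vs micro).
    """
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if (precision + recall) else 0.0)
        scores.append(f1)
    return sum(scores) / len(scores)

# Acceptance criteria written down before modeling (illustrative values).
ACCEPTANCE = {"macro_f1_min": 0.78}

def meets_acceptance(y_true, y_pred):
    return macro_f1(y_true, y_pred) >= ACCEPTANCE["macro_f1_min"]
```

In a real repo you would likely delegate to scikit-learn's metric functions; the point of the sketch is that the definition, averaging, and threshold live in code a reviewer can read, not in prose.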

Section 1.3: Dataset selection, constraints, and risk log

Dataset selection is part of the evidence plan. Choose a dataset that is stable, legally usable, and sized appropriately for local development. Public datasets from reputable sources (UCI, Kaggle with clear licensing, Hugging Face datasets with documented provenance, government open data portals) are typically acceptable—provided you record the dataset version and access method.

List constraints explicitly. Constraints often include: limited compute (CPU-only), timebox (2–4 weeks), no external internet during proctored runs, or restricted dependencies. These constraints influence your tooling choices, such as using a small model family, caching artifacts, and pinning a Python environment. Also plan a data version strategy: will you store raw data in the repo (usually no), download via script, or use DVC/Git LFS? A reviewer wants to see that you thought about reproducibility and storage limits.

  • Risk log template: risk, likelihood, impact, mitigation, evidence.
  • Typical ML risks: data leakage via timestamp or target-derived features; label noise; train/test contamination; class imbalance; distribution shift; privacy concerns in text datasets; non-determinism due to GPU/parallelism.

Common mistake: picking a dataset with unclear licensing or personal data and only realizing late that you cannot publish the repo. Another mistake is ignoring leakage until the model “looks too good,” then losing days reworking the pipeline. Create the risk log now, and keep it in the repo (e.g., docs/risk_log.md). Reviewers appreciate seeing risks acknowledged and mitigations implemented—especially if you connect each mitigation to tests or validation checks you will build later.

Section 1.4: Evidence-first planning (what to prove and how)

Evidence-first planning means you decide what you need to prove, then you design artifacts that make the proof easy to verify. Your “evidence map” ties each claim to concrete checks: tests, documentation, and reproducibility steps. This section is where you translate the lessons “create the evidence map” and “write the scope statement and acceptance criteria” into a practical verification plan.

Start with claims you expect to make in your README or walkthrough, such as: “the pipeline is reproducible,” “data is validated,” “the model outperforms the baseline,” and “the API behaves correctly.” For each claim, specify the evidence type and where it lives in the repo.

  • Reproducibility evidence: pinned dependencies (pyproject.toml/requirements.txt + lockfile), deterministic seeds, and a single command to reproduce results (e.g., make train + make eval).
  • Testing evidence: unit tests for feature functions; integration tests for the training pipeline; data validation tests (schema, missingness, ranges, drift checks for a fixed reference split).
  • Documentation evidence: README with setup/run steps; model card describing intended use, limitations, metrics; datasheet describing collection, preprocessing, and known issues; API/CLI usage guide with examples.

Draft a proctored walkthrough script and rubric now. The script is a timed outline (often 5–10 minutes) that points to evidence: “Here is the scope and acceptance criteria; here is the baseline; here is the improvement; here are the tests; here is the CI run; here is the model card.” The rubric is your self-check: can a reviewer verify each item without assumptions? A powerful habit is to keep a docs/walkthrough.md that references exact commands and file paths. In later chapters you will implement CI so that the evidence is automatically regenerated on every push.
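The "deterministic seeds" item in the reproducibility evidence can be as simple as one helper that every entry point calls first. A minimal sketch, assuming only the standard library (the function name is illustrative; numpy or framework seeding would be added the same way if those libraries enter the project):

```python
import os
import random

def set_determinism(seed: int = 42) -> None:
    """Pin the random sources this project uses so reruns match.

    PYTHONHASHSEED is recorded here for subprocesses; note it only
    affects in-process hash randomization if set before the
    interpreter starts. Extend with np.random.seed / framework
    seeding as those dependencies are added.
    """
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)

# Same seed, same draws -- the property a reviewer will check.
set_determinism(42)
first = [random.random() for _ in range(3)]
set_determinism(42)
second = [random.random() for _ in range(3)]
```

Calling this at the top of every training and evaluation script is cheap insurance: when a proctor reruns your pipeline, the reported numbers come back identical.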

Section 1.5: Repository charter (license, conduct, privacy)

A repository charter is a short set of commitments and constraints that govern how the project is built and shared. For cert-grade work, this is not bureaucratic overhead; it prevents disqualification risks (license incompatibility, privacy violations, missing attribution) and signals professionalism. Your charter typically lives in the README and supporting files.

Start with licensing. Pick a standard open-source license appropriate for your goals (MIT, Apache-2.0, BSD-3). If you include third-party code or pretrained models, check their licenses and document attribution. Next, add a code of conduct (even a minimal one) if you expect public contributions, and include a security contact or note about responsible disclosure if relevant.

  • Privacy posture: confirm whether the dataset contains personal data; state how you handle it (e.g., no raw PII committed, only derived features, or fully public anonymized data).
  • Ethics and intended use: describe what the model should and should not be used for; note potential harms and bias risks; include evaluation slices if applicable.
  • Constraints: e.g., “Runs on CPU in <10 minutes,” “No external services required,” “No internet required after data download.”

Common mistake: omitting dataset terms or using a dataset that forbids redistribution, then committing raw data to GitHub. Another mistake is writing vague ethics language that does not connect to evaluation. Keep it concrete: if bias is a concern, define what group attribute exists (if any), what metric you will compute, and what limitations remain. A reviewer does not expect perfection; they expect you to recognize and manage constraints responsibly.

Section 1.6: Delivery plan and timeboxing for exam prep

Certification prep succeeds when you timebox aggressively and prioritize evidence-producing work. A useful delivery plan is milestone-based: each milestone produces a reviewer-visible artifact (a passing CI run, a documented baseline result, a completed model card). This prevents the common trap of spending a week tuning models before you have reliable data splits or tests.

Plan your work in short sprints (1–3 days) and define “done” in terms of acceptance criteria and evidence. Example sequence: (1) repository scaffold + environment pinning + minimal README, (2) data ingestion script + data validation checks, (3) baseline model + evaluation report, (4) improved model + error analysis, (5) tests expanded + CI enforced, (6) documentation polishing + walkthrough rehearsal.

  • Timebox modeling: allocate a fixed window for baseline and a fixed window for one improvement; resist endless hyperparameter searching.
  • Build the walkthrough early: rehearse the script after milestone (3), then refine it as you add CI, tests, and docs.
  • Proctored readiness check: can you clone from scratch, run one setup command, reproduce metrics, and show CI/test output within a short session?

Keep a single “source of truth” for results—preferably a versioned evaluation report in reports/ generated by code, not copied into slides manually. As you progress through the course, treat every new feature as needing evidence: a new data transform needs a unit test; a new training step needs an integration test; a new claim in the README needs a command that reproduces it. By the end, your project won’t just be impressive—it will be verifiable, which is exactly what proctored assessments reward.
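A machine-written evaluation report is easy to sketch. The snippet below, with illustrative file names and metric keys, shows the idea: evaluation code writes a timestamped JSON artifact, and the README links to that file rather than restating numbers by hand.

```python
import json
import time
from pathlib import Path

def write_report(metrics: dict, report_dir: str = "reports") -> Path:
    """Persist evaluation results as a code-generated artifact.

    The report file, not a slide or a hand-edited README table,
    is the single source of truth for results.
    """
    out = Path(report_dir)
    out.mkdir(parents=True, exist_ok=True)
    payload = {
        "generated_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "metrics": metrics,
    }
    path = out / "evaluation.json"
    path.write_text(json.dumps(payload, indent=2))
    return path

# Illustrative numbers; in practice these come from the eval pipeline.
report_path = write_report({"macro_f1": 0.81, "baseline_macro_f1": 0.74})
```

Because the file is regenerated by the same command every time, the metric-regression tests in Chapter 4 can read it directly instead of parsing prose.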

Chapter milestones
  • Select an ML problem and define success metrics
  • Write the scope statement and acceptance criteria
  • Create the evidence map (tests, docs, reproducibility)
  • Draft the proctored walkthrough script and rubric
  • Set the repo charter: licensing, ethics, and constraints
Chapter quiz

1. In a proctored or reviewer-led evaluation, what is Chapter 1 emphasizing beyond achieving a strong final model metric?

Correct answer: Demonstrating that the work is reproducible, defensible, and communicable under time pressure
The chapter frames certification-grade work as an engineering deliverable evaluated on reproducibility, defensibility, and clear communication—not just metrics.

2. Which combination best represents the core artifacts Chapter 1 aims to produce to make the project cert-ready?

Correct answer: A scope statement with acceptance criteria, an evidence map, and a walkthrough script
Chapter 1 focuses on planning: scope + acceptance criteria, evidence trail (tests/docs/CI/repro), and a proctored walkthrough script.

3. What is the primary purpose of creating an 'evidence map' in Chapter 1?

Correct answer: To premeditate the proof you will provide (tests, docs, CI, reproducibility) that supports your claims
An evidence map defines how you will prove correctness and validation through planned tests, documentation, CI, and reproducible workflows.

4. How do a scope statement and acceptance criteria help prevent wasted work, according to Chapter 1?

Correct answer: They keep the project focused on measurable outcomes and required validation rather than chasing unverified improvements or unnecessary features
Clear scope and acceptance criteria reduce aimless iteration (e.g., fancy models without data checks or claims without validation).

5. Why does Chapter 1 ask you to set a repository charter (license, privacy posture, ethics, constraints)?

Correct answer: So a reviewer can quickly judge whether the project is safe, legitimate, and professionally executed
The repo charter makes expectations and safeguards explicit, helping reviewers assess legitimacy, safety, and professionalism.

Chapter 2: Build a Reproducible Repo, Data Flow, and Baseline

Proctored reviews reward projects that behave like production systems: predictable setup, repeatable runs, and traceable results. In this chapter you will turn an idea into a repository that someone else can clone, install, and execute with one or two commands—then obtain the same baseline metrics you report. That means committing to a clear project layout, pinning environments, defining how data is accessed and versioned, and building a baseline pipeline that can be rerun on demand.

The “baseline” is not a throwaway model; it is the reference point that makes improvements defensible. If your baseline cannot be reproduced, every later improvement becomes suspect. You will also start capturing known limitations early (dataset gaps, label noise, class imbalance, leakage risks) so that your portfolio reads like an honest engineering report rather than a demo.

By the end of this chapter you should be able to: scaffold a clean ML repo, install dependencies deterministically, ingest data in a controlled way, create deterministic splits, train a baseline via a single repeatable script, and record initial results with enough metadata to be audited.

Practice note for Scaffold the repository with a clean ML project layout: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Pin environments and add one-command setup: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Implement data ingest and deterministic splits: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Train and log a baseline model with a repeatable script: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Capture initial results and known limitations: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Sections in this chapter
Section 2.1: Project structure (src, tests, docs, configs)

A certification-grade repository is easy to navigate. Reviewers should immediately see where code lives, how it is executed, how it is tested, and where documentation and configuration are stored. The most common mistake is “notebook sprawl”: logic split across ad-hoc notebooks with hidden state and implicit paths. Your goal is to make notebooks optional and keep the runnable workflow in scripts and importable modules.

A practical structure that scales from baseline to full project looks like this:

  • src/<package_name>/ importable Python package (e.g., data loading, features, models, evaluation)
  • scripts/ entry points (train, evaluate, ingest) that call into src
  • tests/ unit and integration tests (start small now; expand in later chapters)
  • configs/ YAML/TOML config files for dataset paths, split seeds, model params
  • docs/ README assets, model card, data sheet drafts, diagrams
  • data/ typically with subfolders like raw/, interim/, processed/ (often gitignored)
  • artifacts/ model outputs, metrics, and run logs (gitignored; optionally tracked via DVC)

Two rules keep this clean. First, code in src/ should not hardcode relative paths like ../data; it should accept paths from configuration or environment variables. Second, scripts should be thin: parse arguments, load config, call library functions, save outputs. This separation makes testing easier and prevents “works on my machine” path bugs.

Finally, define naming conventions now. For example: src/<pkg>/data/ingest.py, src/<pkg>/data/split.py, src/<pkg>/train.py, src/<pkg>/metrics.py. When reviewers see predictable modules, they infer engineering maturity—and it reduces the cognitive load when you expand to more complex pipelines.
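The "thin script" rule can be sketched in a few lines. In this example, load_config and run_training are stand-ins for functions that would live in your src/ package (all names here are illustrative); the script itself only parses arguments, loads configuration, and delegates.

```python
import argparse
import json
from pathlib import Path

# In the real repo these are imported from the src/ package,
# e.g. `from myproject.train import run_training`. Inline stubs
# keep this sketch self-contained.
def load_config(path: Path) -> dict:
    return json.loads(path.read_text())

def run_training(config: dict) -> dict:
    return {"trained": True, "seed": config.get("seed", 42)}

def main(argv=None) -> dict:
    parser = argparse.ArgumentParser(description="Train the baseline model.")
    parser.add_argument("--config", type=Path, required=True)
    args = parser.parse_args(argv)
    config = load_config(args.config)  # parse args, load config...
    return run_training(config)        # ...then delegate to library code

if __name__ == "__main__":
    print(main())
```

Because main() accepts an argv list, an integration test can call it directly without spawning a subprocess, which is exactly the testability payoff of keeping scripts thin.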

Section 2.2: Environment management (uv/pip-tools/poetry) and lockfiles

Reproducibility begins with dependency pinning. A proctor (or hiring manager) may run your project weeks later on a different machine. If you depend on floating versions, small upstream changes can silently alter results or break installs. The fix is a lockfile-driven workflow and a one-command setup.

You have three common, acceptable options:

  • uv: fast resolver/installer; pairs well with pyproject.toml and produces lockfiles. Great for “clone and run” experiences.
  • pip-tools: classic approach with requirements.in compiled to fully pinned requirements.txt. Simple and explicit.
  • poetry: manages dependencies and packaging with poetry.lock. Good if you want consistent tooling and publishing-friendly metadata.

Whichever you choose, document it in the README as the canonical path. “One-command setup” typically means something like: create a virtual environment, install from lock, run a smoke test. For example, a Makefile target such as make setup can wrap these steps. If you support GPU/CPU variants, be explicit; hidden CUDA assumptions are a frequent failure point during proctored evaluation.

Engineering judgment: pin tightly for applications, allow flexibility for libraries. For a portfolio project you want strict pins so others can reproduce your exact baseline. A practical approach is: pin all direct dependencies in pyproject.toml (or requirements.in), then rely on the lockfile to pin transitive dependencies. Commit the lockfile to version control. Also record the Python version (e.g., 3.11) and enforce it via tooling (pyproject requires-python, or a .python-version file).

Common mistake: installing interactively until it “works,” then exporting a requirements file afterwards. That produces a pile of unreviewed pins and platform-specific packages. Start with intentional dependencies and let the lockfile be the single source of truth.

Section 2.3: Data access patterns and version strategy

Data is the most common reproducibility gap in ML portfolios. You need to answer two questions clearly: (1) where does the data come from, and (2) how can a reviewer confirm they are using the same version you used? Proctored reviewers often cannot access private buckets or proprietary datasets, so your repo must support a controlled “data acquisition” step and a fallback path (sample data, synthetic data, or instructions to download a public dataset).

Adopt a simple access pattern: a single ingest script that produces a canonical raw dataset file (or folder) and records metadata about the source. For example, scripts/ingest.py can download from a URL, verify checksums, and write to data/raw/. If the data is provided manually, the ingest script can validate expected filenames, schemas, and row counts rather than downloading.
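The checksum-verification step of such an ingest script can be sketched as follows (the function names and error policy are illustrative, not a prescribed API):

```python
# A checksum gate for ingested data: the pipeline refuses to proceed if the
# raw file's SHA256 doesn't match the value recorded in configuration.
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Stream the file in chunks so large datasets don't exhaust memory."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_raw_data(path: Path, expected_sha256: str) -> None:
    """Raise instead of warning: a silent mismatch defeats the purpose."""
    actual = sha256_of(path)
    if actual != expected_sha256:
        raise RuntimeError(
            f"Checksum mismatch for {path}: expected {expected_sha256}, got {actual}"
        )
```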

For version strategy, choose one of these defensible approaches:

  • Immutable snapshot + checksum: store raw data outside git (e.g., in a release asset) and record SHA256 checksums in configs/data.yaml or a datasheet.md. Your pipeline refuses to run if checksums mismatch.
  • DVC (Data Version Control): track data artifacts and their versions without committing large files to git. This is strong evidence of MLOps practice.
  • Public dataset + pinned version: if using a dataset hosted on a platform that provides versions (Kaggle, Hugging Face datasets, OpenML), record the dataset identifier and version/hash.

Regardless of tool, treat processed data as derived artifacts: you can regenerate it from raw data plus code. Keep the transformation steps deterministic and logged. Another common mistake is to perform cleaning in a notebook and save a “final.csv” without provenance. Instead, create src/<pkg>/data/preprocess.py that transforms raw to processed and writes to data/processed/ with a recorded schema and summary statistics.

Deterministic splits belong here too: define a single split function that takes a seed, stratification rules, and group leakage constraints (if applicable). Save the resulting train/val/test indices to disk so subsequent runs reuse the exact same split unless explicitly regenerated.
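A minimal sketch of a persisted deterministic split, using only the standard library (a real project would likely add stratification and group constraints via scikit-learn):

```python
# Make a split once, save it to disk, and reuse it on every subsequent run
# unless the file is explicitly regenerated.
import json
import random
from pathlib import Path


def make_or_load_split(n_rows: int, seed: int, test_frac: float, path: Path) -> dict:
    if path.exists():
        # Reuse the saved split so metrics stay comparable across runs.
        return json.loads(path.read_text())
    rng = random.Random(seed)  # local RNG: no global-state side effects
    indices = list(range(n_rows))
    rng.shuffle(indices)
    n_test = int(n_rows * test_frac)
    split = {"seed": seed, "test": indices[:n_test], "train": indices[n_test:]}
    path.write_text(json.dumps(split))
    return split
```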

Section 2.4: Baseline training pipeline and configuration

Your baseline pipeline should be runnable end-to-end from the command line and driven by configuration. That means a reviewer can execute python scripts/train.py --config configs/baseline.yaml and receive a trained model plus metrics. Keep the baseline intentionally simple: a logistic regression for classification, a linear model for regression, or a small tree-based model. The baseline is about establishing a floor with minimal moving parts.

Use configuration to separate “what we run” from “how we implement.” A baseline config typically includes: dataset location/version, split parameters (seed, test size, stratify key), feature settings (which columns, encoding choices), model hyperparameters, and evaluation metrics. YAML is popular because it is readable in reviews, but TOML/JSON are fine if you standardize.

A robust baseline training script generally follows this sequence:

  • Load config and validate required fields.
  • Load raw/processed data via a single data access module.
  • Create or load deterministic split indices.
  • Fit preprocessing (e.g., scaling/encoding) on train only, then transform val/test.
  • Train the baseline model with explicit hyperparameters.
  • Evaluate with agreed metrics and confidence intervals where appropriate.
  • Persist artifacts: trained model, preprocessing pipeline, metrics JSON, config copy.
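The sequence above can be condensed into one runnable sketch; the synthetic dataset and config keys are illustrative stand-ins for your real data and config file:

```python
# Baseline training driven entirely by a config dict: seed, split, model
# hyperparameters, and (here) a synthetic dataset in place of real data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def train_baseline(config: dict) -> dict:
    X, y = make_classification(n_samples=300, random_state=config["seed"])
    # Deterministic, stratified split controlled by the config.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=config["test_size"], random_state=config["seed"], stratify=y
    )
    # Scaling lives inside the Pipeline, so it is fit on train data only.
    model = Pipeline([
        ("scale", StandardScaler()),
        ("clf", LogisticRegression(**config["model_params"])),
    ])
    model.fit(X_tr, y_tr)
    metrics = {"accuracy": accuracy_score(y_te, model.predict(X_te))}
    # Return the config alongside metrics so every decision is recorded.
    return {"metrics": metrics, "config": config}


config = {"seed": 42, "test_size": 0.2, "model_params": {"max_iter": 1000}}
result = train_baseline(config)
```

Because every source of randomness flows from the config's seed, rerunning the script reproduces the same metrics.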

Engineering judgment: “baseline” does not mean “sloppy.” Avoid leakage by fitting transforms only on training data. Avoid optimistic reporting by using a validation set for iteration and reserving test for final reporting. If you only have enough data for cross-validation, implement it explicitly and log fold-level results. A frequent portfolio mistake is to tune on the test set and report it as final performance; proctors will flag this immediately.

Practical outcome: after this section, you should have a baseline artifact that anyone can regenerate, and a config file that captures every decision needed to reproduce it.

Section 2.5: Reproducibility controls (seeds, determinism, artifacts)

Reproducibility is a spectrum: you may not get bit-for-bit identical results across all hardware, but you should make runs stable enough that a reviewer can match your reported metrics within a reasonable tolerance. Start with explicit seed control and deterministic data splits, then add artifact discipline.

At minimum, set seeds in every library you use (e.g., Python’s random, NumPy, and the ML framework). Also pass random_state in scikit-learn estimators and splitting utilities. Record the seed in the config and write it into your run metadata. If you use GPU frameworks (PyTorch/TensorFlow), understand that some operations are nondeterministic unless you enable deterministic modes—often with performance tradeoffs. Document what level of determinism you guarantee (e.g., “CPU runs deterministic; GPU runs best-effort”).
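A minimal seed-control helper covering the common sources of randomness; framework-specific calls (e.g., torch.manual_seed, tf.random.set_seed) would be added conditionally in a real project:

```python
# Seed Python's random module, NumPy's legacy global RNG, and the hash seed
# in one place, so the value can be recorded in run metadata.
import os
import random

import numpy as np


def set_global_seeds(seed: int) -> None:
    random.seed(seed)
    np.random.seed(seed)
    # PYTHONHASHSEED only affects processes started after this point.
    os.environ["PYTHONHASHSEED"] = str(seed)
```

Note that scikit-learn estimators still need an explicit random_state; the global seeds above do not cover every library's internal RNG.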

Artifacts are the second half of reproducibility. Decide which outputs are “source of truth” for a run:

  • Model artifact: serialized estimator and preprocessing pipeline.
  • Split artifact: saved indices or a split manifest file.
  • Metrics artifact: JSON with metric names, values, and evaluation dataset identifiers.
  • Run manifest: config snapshot, git commit hash, dependency lock hash, timestamp.

Common mistake: overwriting artifacts in-place (e.g., always writing to artifacts/model.pkl). Instead, write to a unique run directory such as artifacts/runs/<timestamp>_<shortsha>/. This prevents accidental mixing of old models with new metrics and makes comparisons auditable.
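The unique-run-directory convention can be sketched like this; in a real repo the short SHA would come from `git rev-parse --short HEAD`, while here it is passed in to keep the example self-contained:

```python
# Create artifacts/runs/<timestamp>_<shortsha>/ instead of overwriting
# a fixed artifact path.
from datetime import datetime, timezone
from pathlib import Path


def new_run_dir(root: Path, short_sha: str) -> Path:
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    run_dir = root / "runs" / f"{stamp}_{short_sha}"
    run_dir.mkdir(parents=True, exist_ok=False)  # refuse to clobber a run
    return run_dir
```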

Practical outcome: if someone checks out your commit and runs the training script twice, they should either get the same result or understand exactly why not, based on the determinism statement and run manifests.

Section 2.6: Result logging and experiment traceability

Proctored-ready projects treat results as evidence. It is not enough to say “accuracy is 0.91”; you must show how it was computed, on what split, with what data version, and with what code and dependencies. This is experiment traceability: the ability to trace a number in your README back to an executable run.

Start with lightweight logging before adopting heavier platforms. A practical baseline is:

  • Console logs for progress (epochs/iterations, dataset shapes, warnings).
  • A machine-readable metrics.json file saved per run.
  • A params.json or config snapshot saved per run.
  • A run.json manifest containing git commit hash, python version, platform info, and lockfile fingerprint.
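Writing the run manifest can be sketched as follows; the git lookup is best-effort so the code still works in environments without git, and the manifest fields mirror the list above:

```python
# Persist a run.json manifest with commit hash, Python version, platform
# info, and a lockfile fingerprint supplied by the caller.
import json
import platform
import subprocess
import sys
from pathlib import Path


def git_commit() -> str:
    """Return the current commit hash, or 'unknown' outside a git repo."""
    try:
        out = subprocess.run(
            ["git", "rev-parse", "HEAD"], capture_output=True, text=True, check=True
        )
        return out.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return "unknown"


def write_run_manifest(run_dir: Path, lock_fingerprint: str) -> Path:
    manifest = {
        "git_commit": git_commit(),
        "python_version": sys.version.split()[0],
        "platform": platform.platform(),
        "lockfile_fingerprint": lock_fingerprint,
    }
    path = run_dir / "run.json"
    path.write_text(json.dumps(manifest, indent=2))
    return path
```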

If you already use an experiment tracker (MLflow, Weights & Biases, Aim), keep it optional and ensure a “no external service required” path exists for reviewers. For proctored settings, offline-first logging is safer: the run directory becomes your audit trail.

Capturing initial results also means capturing limitations. Add a short “Baseline findings” note (in the README or a docs/ page) that includes: metric values, data coverage issues, suspected leakage risks, error slices (e.g., worst-performing classes), and constraints (small dataset, noisy labels, class imbalance). This demonstrates professional judgment: you are not just optimizing metrics—you are evaluating reliability.

Common mistake: only logging the best metric and ignoring variance. Where feasible, log distributional information (per-class precision/recall, confusion matrix, calibration metrics) and run-to-run variability (multiple seeds or CV folds). Even if you do not implement all of that yet, establish the fields in your metrics schema so later chapters can extend it without breaking readers’ expectations.

Practical outcome: every baseline result you publish can be regenerated from a single command, and every number can be traced to a specific run folder containing config, code reference, data version evidence, and evaluation outputs.

Chapter milestones
  • Scaffold the repository with a clean ML project layout
  • Pin environments and add one-command setup
  • Implement data ingest and deterministic splits
  • Train and log a baseline model with a repeatable script
  • Capture initial results and known limitations
Chapter quiz

1. Why does the chapter emphasize that your repository should be runnable by someone else with one or two commands?

Correct answer: To ensure predictable setup and repeatable runs that produce the same reported baseline metrics
Proctored reviews value production-like behavior: deterministic setup and repeatable execution that reproduces reported results.

2. What is the primary purpose of a baseline model in this chapter’s workflow?

Correct answer: To serve as a reproducible reference point so later improvements are defensible
A baseline is the reference that makes improvements meaningful; if it can’t be reproduced, later gains are questionable.

3. Which approach best supports deterministic setup as described in the chapter?

Correct answer: Pin dependencies/environments so installs are repeatable across machines
Pinned environments enable deterministic installation, which is required for reproducible runs.

4. What does it mean to implement deterministic splits in the data flow?

Correct answer: Creating train/validation/test splits that are reproducible on reruns
Deterministic splits ensure the same data partitions are produced each time, making metrics comparable and auditable.

5. Why does the chapter recommend capturing known limitations early (e.g., leakage risks, label noise, class imbalance)?

Correct answer: So results read like an honest engineering report with traceable constraints
Recording limitations up front makes results interpretable and trustworthy, aligning the portfolio with production-style reporting.

Chapter 3: Modeling, Validation, and Responsible Evaluation

This chapter turns your repository into something a proctored reviewer can trust: a modeling workflow that moves from baseline to improvement without “metric fishing,” a validation plan that holds up under scrutiny, and evaluation practices that acknowledge real-world risk. You are not just trying to score well—you are trying to demonstrate sound engineering judgment and responsible evaluation.

A common failure mode in portfolio projects is an impressive notebook that cannot defend why the model was chosen, how it was validated, or what happens when conditions change. A certification-grade project makes those decisions explicit. You will define feature boundaries to prevent leakage, implement cross-validation and hyperparameter search correctly, perform error analysis (including slice-based evaluation), calibrate thresholds to the use case, and then freeze a candidate model for release with reproducible artifacts.

As you work, keep two principles in mind: (1) evaluation is part of the system design, not an afterthought; and (2) the “best” model is the one you can justify, reproduce, monitor, and ship safely. The following sections walk you through the practical steps and the pitfalls reviewers look for.

Practice note for Upgrade from baseline to a stronger model with justification: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Implement cross-validation and hyperparameter search: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Add error analysis and slice-based evaluation: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Calibrate metrics and thresholds for the use case: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Freeze a candidate model for release: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Sections in this chapter
Section 3.1: Feature engineering boundaries and leakage prevention

Before you upgrade from a baseline to a stronger model, lock down what information the model is allowed to see at prediction time. Leakage is the fastest way to produce a high score that fails a proctored review. Treat feature engineering as a contract: every feature must be computable from inputs available at inference, at the same timestamp, without peeking at labels or future data.

Start by documenting feature sources and timing. If your dataset includes “post-event” fields (e.g., resolution codes, refund status, future engagement), explicitly exclude them. If your task is time-dependent, build features using only history up to the cutoff. Put that cutoff into code, not prose.

  • Use a pipeline: implement preprocessing inside a scikit-learn Pipeline / ColumnTransformer so the same transformations are applied in training and inference.
  • Fit on train only: scalers, imputers, target encoders, and text vectorizers must be fit only on training folds; never pre-fit them on the full dataset.
  • Guardrails: add checks that ban label-derived columns and detect suspicious correlations (e.g., a feature with near-perfect AUC on its own).
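The pipeline guidance above can be sketched with scikit-learn's Pipeline and ColumnTransformer; the column names and synthetic frame are illustrative:

```python
# All preprocessing lives inside the Pipeline, so cross-validation refits
# scalers and encoders on each fold's training portion automatically.
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["age", "income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["region"]),
])
model = Pipeline([("prep", preprocess),
                  ("clf", LogisticRegression(max_iter=1000))])

# Tiny synthetic frame: fit and predict go through identical transforms.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(18, 80, size=100),
    "income": rng.normal(50_000, 10_000, size=100),
    "region": rng.choice(["north", "south"], size=100),
})
y = (df["income"] > 50_000).astype(int)
model.fit(df, y)
preds = model.predict(df)
```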

Engineering judgment: prefer simple, auditable features over clever but fragile ones. Reviewers will favor a clear feature list with rationale and anti-leakage tests over marginal gains from risky transformations. Your baseline model is useful here: if a simple model suddenly performs “too well,” investigate leakage before you celebrate.

Section 3.2: Validation design (CV strategy, holdout, stratification)

Validation is where you prove your model generalizes. A defensible plan usually includes (1) a final holdout test set that is never touched until the end, and (2) cross-validation (CV) on the training set to compare models and tune hyperparameters. This structure prevents “accidental training on the test set” through repeated iteration.

Pick the CV strategy that matches the data generating process. For i.i.d. tabular data, use stratified k-fold for classification to keep class balance consistent across folds. For grouped data (multiple rows per user, device, patient), use group-aware splitting so information from the same entity doesn’t appear in both train and validation. For time series, use time-based splits (rolling/expanding window) and avoid shuffling.

  • Holdout: set aside 10–20% as a final test set early, store the indices, and never use it for model selection.
  • CV + search: use RandomizedSearchCV or Bayesian optimization for efficiency; reserve grid search for small spaces. Log the search space and random seed.
  • Nested CV (optional): if you need the cleanest estimate and can afford compute, use nested CV; otherwise, be transparent about using a holdout test for the final estimate.
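A sketch of CV plus randomized search with preprocessing kept inside the pipeline, so each fold fits the scaler on its own training portion; the search space and synthetic data are illustrative:

```python
# StratifiedKFold keeps class balance per fold; RandomizedSearchCV samples
# from the logged search space with a fixed seed.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, random_state=0)
pipe = Pipeline([("scale", StandardScaler()),
                 ("clf", LogisticRegression(max_iter=1000))])
search = RandomizedSearchCV(
    pipe,
    param_distributions={"clf__C": np.logspace(-3, 3, 20)},
    n_iter=5,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    scoring="f1",
    random_state=0,  # log this seed alongside the search space
)
search.fit(X, y)
```

Per-fold scores live in `search.cv_results_`, which is what you would use to report mean ± standard deviation rather than a single best-fold number.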

Common mistakes include leaking preprocessing outside the CV loop, selecting hyperparameters based on the holdout test, and reporting only the best-fold score. Instead, report mean ± standard deviation across folds and preserve per-fold predictions for later error analysis. This becomes critical when you calibrate thresholds and compare candidate models fairly.

Section 3.3: Metric selection aligned to business/exam criteria

Metrics are not decoration; they encode what “good” means. A proctored-ready project states the primary metric, secondary metrics, and the decision threshold policy. Choose metrics that reflect the cost of errors and the constraints of the use case (or the exam prompt), then stick to them throughout baseline, improvement, and final evaluation.

For imbalanced classification, accuracy is often misleading. Prefer PR-AUC, F1, recall at a fixed precision, or cost-weighted metrics when false negatives/false positives have asymmetric impact. For ranking and retrieval tasks, use MAP or NDCG. For regression, consider MAE vs RMSE depending on whether outliers should dominate the penalty. For probabilistic models, include calibration metrics like Brier score or expected calibration error (ECE).

  • Primary metric: the one you optimize during hyperparameter search.
  • Guardrail metrics: metrics that must not degrade beyond a tolerance (e.g., recall must stay above 0.85).
  • Operational target: a threshold rule (fixed threshold, top-k, or threshold chosen to satisfy precision/recall constraints).

Calibrating metrics and thresholds is where engineering meets policy. If your model outputs probabilities, decide whether they must be well-calibrated (e.g., for risk scoring). You can apply Platt scaling or isotonic regression on CV predictions, but do it within the training process (fit calibrator on validation folds or via an internal split) to avoid leakage. Document the chosen threshold and show how it was derived—reviewers often reject projects that “optimize the threshold on the test set” or silently change thresholds between experiments.
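A sketch of deriving a threshold from a precision floor using validation predictions; the 0.9 floor and the toy scores are illustrative policy choices, not recommendations:

```python
# Pick the threshold with the highest recall among those meeting a
# precision floor, computed from validation (never test) predictions.
import numpy as np
from sklearn.metrics import precision_recall_curve


def threshold_for_precision(y_true, y_prob, min_precision: float) -> float:
    precision, recall, thresholds = precision_recall_curve(y_true, y_prob)
    # precision/recall have one more entry than thresholds; drop the last.
    ok = precision[:-1] >= min_precision
    if not ok.any():
        return 1.0  # no threshold meets the floor
    candidates = thresholds[ok]
    best = candidates[np.argmax(recall[:-1][ok])]
    return float(best)


y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.2, 0.9, 0.6])
t = threshold_for_precision(y_true, y_prob, min_precision=0.9)
```

Recording the derived threshold (and the floor it came from) in the run manifest is what lets a reviewer confirm it was not tuned on the test set.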

Section 3.4: Error analysis, confusion analysis, and slices

Once you have a baseline and a stronger model, don’t stop at aggregate metrics. Error analysis explains why the model fails, guides feature work, and demonstrates responsible evaluation. Start with confusion analysis: inspect the confusion matrix at the chosen threshold, and quantify false positives vs false negatives. If your use case has different costs, translate errors into expected cost or workload.

Next, perform slice-based evaluation: measure metrics across meaningful subgroups (slices) such as region, device type, tenure band, language, or any domain-relevant segmentation. This is not just for fairness—it is also for robustness. A model can improve overall AUC while collapsing on a small but important slice.

  • Slice selection: define slices before looking at results when possible (to reduce cherry-picking). Include both business-critical and data-quality-relevant slices (e.g., missingness patterns).
  • Top errors: review a sample of the highest-confidence wrong predictions; they often reveal label noise, ambiguous cases, or leaked proxies.
  • Stability: compare per-slice metrics across CV folds to see if performance is consistent or driven by variance.
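A minimal sketch of per-slice metric reporting with pandas; the slice column, metric, and sample-size cutoff are illustrative, and the small-sample flag guards against over-reading noisy slices:

```python
# Compute accuracy per slice and flag slices too small to trust.
import pandas as pd


def slice_accuracy(df: pd.DataFrame, slice_col: str, min_n: int = 30) -> pd.DataFrame:
    rows = []
    for value, group in df.groupby(slice_col):
        rows.append({
            slice_col: value,
            "n": len(group),
            "accuracy": (group["y_true"] == group["y_pred"]).mean(),
            "small_sample": len(group) < min_n,
        })
    return pd.DataFrame(rows)


df = pd.DataFrame({
    "y_true": [1, 0, 1, 1, 0, 0, 1, 0],
    "y_pred": [1, 0, 0, 1, 0, 1, 1, 0],
    "region": ["n", "n", "n", "s", "s", "s", "s", "s"],
})
report = slice_accuracy(df, "region", min_n=5)
```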

Common mistakes include treating one confusing slice result as definitive (small sample sizes can be noisy) and doing “manual relabeling” without a documented process. Practical outcome: you should end this section with a short list of failure modes, a hypothesis for each (data issue, feature gap, threshold choice), and a concrete next step (collect data, adjust preprocessing, add a feature, or change the decision rule). This is the evidence that your model improvements were justified rather than accidental.

Section 3.5: Bias/fairness checks and ethical considerations

Responsible evaluation means acknowledging that models can cause harm even when metrics look strong. In a certification context, reviewers expect you to demonstrate basic bias/fairness checks and to document limitations. Start with a simple question: who could be negatively impacted by errors, and how?

Run fairness checks on protected or sensitive attributes only if you are allowed to use them and have a legitimate reason; otherwise, use proxy slices carefully and note their limitations. Evaluate group-wise metrics (e.g., recall, false positive rate, calibration) and report disparities. If you cannot access sensitive attributes, be explicit: “We cannot measure demographic parity; we instead test robustness across available segments and monitor post-deployment.”

  • Disparity metrics: compare TPR/FPR gaps, precision gaps, and calibration curves by group.
  • Threshold impacts: a single global threshold can produce unequal error rates; document the trade-off if considering group-specific thresholds (often restricted by policy).
  • Ethical risks: identify misuse cases, feedback loops (model decisions affecting future data), and privacy concerns (PII in features or logs).
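Group-wise TPR/FPR gaps from the first bullet can be sketched as below; the group column and toy data are illustrative, and such checks apply only where the attribute may legitimately be analyzed:

```python
# Compute true-positive and false-positive rates per group, then the gap.
import pandas as pd


def tpr_fpr_by_group(df: pd.DataFrame, group_col: str) -> pd.DataFrame:
    out = []
    for g, grp in df.groupby(group_col):
        pos = grp[grp["y_true"] == 1]
        neg = grp[grp["y_true"] == 0]
        out.append({
            group_col: g,
            "tpr": (pos["y_pred"] == 1).mean() if len(pos) else float("nan"),
            "fpr": (neg["y_pred"] == 1).mean() if len(neg) else float("nan"),
        })
    return pd.DataFrame(out)


df = pd.DataFrame({
    "y_true": [1, 1, 0, 0, 1, 1, 0, 0],
    "y_pred": [1, 0, 0, 0, 1, 1, 1, 0],
    "group":  ["a", "a", "a", "a", "b", "b", "b", "b"],
})
report = tpr_fpr_by_group(df, "group")
tpr_gap = report["tpr"].max() - report["tpr"].min()
```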

Engineering judgment here is about trade-offs and governance. You may not “solve fairness” in a portfolio project, but you can show a responsible process: predefine acceptable disparity thresholds if applicable, add monitoring recommendations, and document mitigations (data balancing, reweighting, constrained optimization, or decision review processes). The practical deliverable is a short responsible evaluation note that you can later incorporate into the model card: intended use, out-of-scope use, known limitations, and group performance summary.

Section 3.6: Model selection and artifact packaging

After comparing candidates, you must freeze a model for release. “Freeze” means you can recreate the exact artifact from pinned code, pinned environment, and versioned data—and you can explain why it was chosen. Use your CV results to select the best model under the primary metric while satisfying guardrails (e.g., minimum recall, maximum latency, or interpretability requirements).

Then package artifacts for reproducibility and review. At minimum, save: the fitted pipeline (including preprocessing), the chosen threshold or decision policy, label mapping, and metadata describing training data version and metrics. Prefer joblib for scikit-learn pipelines; for deep learning, save weights plus architecture config. Include a predict entry point (CLI or Python function) that loads the artifact and runs inference consistently.

  • Model registry folder: store artifacts under a versioned path (e.g., models/v1/) with a manifest file (JSON/YAML) containing git commit hash, data hash, metric summary, and hyperparameters.
  • Reproducible training: fix random seeds where possible; log non-determinism sources; keep a single training script that can be run in CI.
  • Promotion rule: define criteria for moving from “candidate” to “release” (e.g., pass all tests, meet metric thresholds on holdout, no critical fairness regression).
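Artifact packaging can be sketched as below; the folder layout, manifest fields, and fixed threshold are illustrative, and the synthetic training stands in for your real pipeline:

```python
# Export a fitted pipeline (preprocessing included) plus a manifest into a
# versioned models/<version>/ folder.
import json
from pathlib import Path

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def export_candidate(model, metrics: dict, version: str, root: Path) -> Path:
    out = root / "models" / version
    out.mkdir(parents=True, exist_ok=True)
    joblib.dump(model, out / "pipeline.joblib")  # preprocessing travels with it
    manifest = {"version": version, "metrics": metrics, "threshold": 0.5}
    (out / "manifest.json").write_text(json.dumps(manifest, indent=2))
    return out


X, y = make_classification(n_samples=100, random_state=0)
model = Pipeline([("scale", StandardScaler()),
                  ("clf", LogisticRegression(max_iter=1000))]).fit(X, y)
```

Loading `pipeline.joblib` and scoring with it should reproduce the training-time predictions exactly, which is the core check a predict entry point must pass.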

Common mistakes include saving only raw model weights without preprocessing, changing feature order between training and inference, and relying on notebook state. The practical outcome is a release-ready, reviewable model package: a single command can train, evaluate, and export the candidate; another command can load it and score new data. This is the bridge from experimentation to a professional ML deliverable.

Chapter milestones
  • Upgrade from baseline to a stronger model with justification
  • Implement cross-validation and hyperparameter search
  • Add error analysis and slice-based evaluation
  • Calibrate metrics and thresholds for the use case
  • Freeze a candidate model for release
Chapter quiz

1. Which workflow best aligns with the chapter’s goal of avoiding “metric fishing” while upgrading from a baseline model?

Correct answer: Predefine validation and selection criteria, then justify improvements over a baseline using that fixed plan
The chapter emphasizes making validation and selection decisions explicit up front to prevent chasing metrics without defensible methodology.

2. Why does the chapter treat evaluation as part of system design rather than an afterthought?

Correct answer: Because evaluation choices affect real-world risk and must hold up under scrutiny
The chapter frames evaluation as central to safe, responsible deployment decisions and reviewer trust.

3. What is the main purpose of defining feature boundaries in the modeling workflow described?

Correct answer: To prevent leakage and keep validation credible
Feature boundaries are called out specifically to avoid leakage that would invalidate results.

4. What combination of practices does the chapter recommend for responsible evaluation beyond aggregate metrics?

Correct answer: Error analysis including slice-based evaluation to understand failures under changing conditions
The chapter highlights error analysis and slice-based evaluation to reveal risks hidden by overall scores.

5. After calibrating metrics and thresholds for the use case, what final step makes the project “proctored-reviewer” ready for release?

Correct answer: Freeze a candidate model with reproducible artifacts so it can be justified and reproduced
The chapter emphasizes freezing a release candidate with reproducible artifacts to support justification, reproducibility, and safe shipping.

Chapter 4: Tests for ML: Unit, Data, and Pipeline Reliability

In a proctored-ready ML portfolio project, tests are not “nice to have.” They are how you prove to a reviewer that your results are reproducible, your pipeline behaves as described, and future changes won’t silently break the system. ML projects fail differently than typical software: data changes, distribution shifts, and seemingly harmless refactors can alter evaluation numbers without throwing an exception. This chapter gives you a practical testing stack: unit tests for transforms and metrics, data validation tests for schema and ranges, integration tests for training/evaluation scripts, regression tests for metric drift in pull requests, and quality gates like coverage thresholds.

The goal is engineering judgment, not maximal test volume. You will design tests that are fast, deterministic, and meaningful. Your project should have a “tight loop” test suite that runs in seconds on every PR, plus a slower suite (optional) that runs nightly or on demand. If your project can pass tests in CI without requiring secret data or GPU access, you are already aligning with proctored-review expectations.

Throughout the chapter, treat tests as executable documentation. Each test clarifies assumptions about input formats, feature engineering, metric computation, and pipeline contracts. A reviewer doesn’t need to trust your narrative when they can run your tests and observe the guarantees.

Practice note for Write unit tests for transforms, metrics, and utilities: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Add data validation tests for schema and ranges: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Create integration tests for train/evaluate scripts: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Add regression tests to detect metric drift in PRs: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Generate coverage and enforce quality gates: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Sections in this chapter
Section 4.1: Test strategy for ML (what to test vs what not to test)

Start by separating ML work into three layers: pure code (deterministic functions), data assumptions (schema and constraints), and pipeline behavior (scripts and orchestration). Your test strategy should mirror these layers. Unit tests cover deterministic utilities, transforms, and metric functions. Data validation tests enforce what “valid input” means. Integration tests assert that your train/evaluate entrypoints run end-to-end on a tiny dataset and produce expected artifacts. Finally, regression tests protect key metrics from accidental degradation.

What to test: (1) Feature transforms that can be expressed as pure functions (tokenization wrappers, normalization, categorical encoding maps, label conversion, time-window logic). (2) Metric computations (F1, AUROC wrappers, threshold selection, calibration) because small bugs here invalidate your claims. (3) File/IO utilities that build paths, read configs, and load datasets—especially when used by CLI scripts. (4) Contracts: “given inputs X, pipeline produces artifact Y with keys Z.”

What not to test (or test lightly): training convergence, exact model weights, or exact floating-point predictions on large datasets. Those are brittle and will fail due to non-determinism, hardware differences, or dependency updates. Also avoid testing third-party libraries directly (e.g., scikit-learn internals). Instead, test your usage: input validation, parameter passing, and postconditions (shapes, ranges, metrics).

  • Common mistake: writing slow tests that run full training, then disabling them in CI. A test that never runs provides no guarantee.
  • Common mistake: asserting exact metric values from a stochastic model without fixing seeds and tolerances; this produces flaky CI.
  • Practical outcome: a two-tier suite, with pytest -m unit for fast deterministic tests and pytest -m integration for end-to-end checks on a tiny dataset.

In a certification-grade repo, explicitly document this strategy in CONTRIBUTING.md or a “Testing” section in the README: what runs on PRs, what runs nightly, and how to reproduce locally.
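To make the two-tier split concrete, you can register the markers in your pytest configuration so unknown-marker warnings become errors you catch early. A minimal sketch, assuming a pyproject.toml-based setup (marker names and descriptions are illustrative):

```toml
[tool.pytest.ini_options]
markers = [
    "unit: fast, deterministic tests that run on every PR",
    "integration: end-to-end checks on a tiny dataset",
    "slow: long-running tests, excluded from PR runs",
]
```

With this in place, a PR job runs `pytest -m unit` and a nightly job runs `pytest -m "integration or slow"`, matching the documented strategy.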

Section 4.2: Pytest fundamentals and fixtures for ML projects

Pytest is the practical default for Python ML projects because fixtures let you manage temporary data, configs, and models cleanly. Structure your tests as tests/unit, tests/integration, and optionally tests/data. Keep naming consistent: test_*.py files, and tests as test_* functions.

Use fixtures to remove duplication and to make tests deterministic. Typical fixtures in ML projects include: a small pandas DataFrame with representative edge cases, a temporary directory for artifacts, a config object/dict with known parameters, and a fixed random seed.

  • Seed control: provide a fixture that sets numpy, random, and any framework seeds (e.g., PyTorch) at the start of each test module. This reduces flakiness.
  • tmp_path: rely on pytest’s tmp_path fixture to write model artifacts, metrics JSON, or cached features without polluting the repo.
  • Markers: mark slow tests (@pytest.mark.slow) and integration tests (@pytest.mark.integration) so CI can choose what to run.

Example pattern: for a transform function build_features(df), write a unit test that asserts output column names, dtypes, no unexpected NaNs, and stable behavior on edge cases (empty strings, out-of-range values, unknown categories). For a metric function, test known toy inputs where the correct result is hand-computable (e.g., binary labels with a fixed threshold). These tests become your defense when a reviewer asks how you ensured metric correctness.

Engineering judgment: keep unit tests small and single-purpose. If a test requires training a model, it is not a unit test; move it to integration and make it run on a tiny dataset with a small number of iterations.
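The patterns above can be sketched in one small test module. Here `build_features` is a hypothetical stand-in for your own transform, and the fixtures mirror the seed-control and edge-case DataFrame described earlier:

```python
import random

import numpy as np
import pandas as pd
import pytest


@pytest.fixture(autouse=True)
def fixed_seed():
    # Seed every RNG source so tests are deterministic across runs.
    random.seed(42)
    np.random.seed(42)


@pytest.fixture
def tiny_df():
    # Representative edge cases: empty string, negative value, missing value.
    return pd.DataFrame(
        {
            "text": ["hello world", "", "UNSEEN"],
            "amount": [10.0, -1.0, np.nan],
            "label": [1, 0, 1],
        }
    )


def build_features(df):
    # Hypothetical transform under test: clip negatives, impute missing amounts.
    out = df.copy()
    out["amount"] = out["amount"].clip(lower=0.0).fillna(0.0)
    return out


def test_build_features_contract(tiny_df):
    out = build_features(tiny_df)
    assert list(out.columns) == ["text", "amount", "label"]
    assert out["amount"].notna().all()
    assert (out["amount"] >= 0).all()
```

Each assertion documents one contract (columns, no NaNs, non-negative values), so a failure message points directly at the broken assumption.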

Section 4.3: Data validation (pandera/great expectations patterns)

Most ML breakages are data breakages: missing columns, type changes, unexpected categories, or out-of-range values that distort features. Data validation tests make these assumptions explicit and fail fast. Two common tools are Pandera (Pythonic schema validation for pandas) and Great Expectations (suite-based validation with rich reporting). Choose one; reviewers care more that you validate than which library you pick.

With Pandera, define a schema for each dataset boundary: raw input, cleaned intermediate, and model-ready features. Include column presence, dtype, nullability, allowed ranges, and categorical sets when feasible. Then write tests that validate a sample file (or a synthetic DataFrame) against the schema. Make failures actionable by including clear error messages and by validating at the earliest point in your pipeline (often right after loading data).

With Great Expectations, create expectations like “column age is between 0 and 120,” “label is in {0,1},” and “timestamp is not null and parseable.” Commit the expectation suite to the repo. In CI, run validations on a small checked-in sample dataset (not the full training data) so the test is fast and portable.

  • Schema tests to include: required columns exist; no duplicate column names; dtypes match; IDs are unique if required; label distribution sanity checks (e.g., not all one class); numeric ranges (non-negative counts); text length constraints when relevant.
  • Common mistake: validating only training data, not inference inputs. Production failures often happen at prediction time when upstream formats drift.

Practical outcome: you can point to a single source of truth for “what valid data looks like.” This reduces hidden assumptions and makes your pipeline robust under refactors, new data pulls, or different environments.

Section 4.4: Pipeline integration tests and fast test design

Integration tests prove that your CLI/scripts actually work together: load data, build features, train a model, evaluate, and write artifacts. For proctored review, this matters because reviewers often run your entrypoints rather than importing internal functions. Your integration tests should mimic that behavior using subprocess calls or your script functions directly.

Design for speed. Create a tiny deterministic dataset fixture (for example, 200 rows) and a “test config” that reduces computation: fewer estimators, fewer epochs, smaller vectorizers, or limited feature sets. If your training script supports arguments like --max-rows, --limit, or --smoke-test, integration tests become straightforward and your project becomes more usable.

  • What to assert: the command exits with code 0; metrics file is created; model artifact is created; evaluation report contains required keys (e.g., accuracy, f1, roc_auc); and the run is reproducible given a fixed seed.
  • Contract tests: if your pipeline produces metrics.json, assert its schema (types and keys). If it produces a model file, assert it can be loaded and used for a single prediction.

Fast test design techniques: avoid network calls; avoid downloading large datasets; pin versions to prevent different default behaviors; and isolate tests with tmp_path so they don’t depend on local state. If you use DVC or another data version strategy, integration tests should use a small sample tracked in Git, not large remote data, so CI can run without credentials.

Practical outcome: every PR proves the project still trains and evaluates end-to-end, which is the core promise of a portfolio ML repository.
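The subprocess-style integration test can be sketched as follows. In a real repo the command would be your actual entrypoint (something like `python -m yourpkg.train --smoke-test`); here a stub script is generated first so the sketch is self-contained and runnable anywhere:

```python
import json
import subprocess
import sys
import tempfile
from pathlib import Path

# Setup for the sketch: a stub stands in for the real training script.
outdir = Path(tempfile.mkdtemp())
stub = outdir / "train.py"
stub.write_text(
    "import json, sys\n"
    "with open(sys.argv[1], 'w') as f:\n"
    "    json.dump({'accuracy': 0.91, 'f1': 0.88, 'roc_auc': 0.93}, f)\n"
)
metrics_path = outdir / "metrics.json"

# Run the entrypoint the way a reviewer would: as a subprocess.
proc = subprocess.run(
    [sys.executable, str(stub), str(metrics_path)],
    capture_output=True, text=True,
)

# Contract assertions: exit code, artifact existence, metrics schema.
assert proc.returncode == 0, proc.stderr
metrics = json.loads(metrics_path.read_text())
assert {"accuracy", "f1", "roc_auc"} <= metrics.keys()
assert all(0.0 <= v <= 1.0 for v in metrics.values())
```

Note that the assertions check the contract (exit code, keys, value ranges), not exact model behavior, which keeps the test fast and stable.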

Section 4.5: Golden metrics, snapshot testing, and tolerances

Regression tests for ML focus on “golden signals” rather than exact outputs. The purpose is to detect metric drift in pull requests: if a refactor or feature change reduces your baseline performance (or unexpectedly inflates it due to leakage), you want CI to flag it. The key is to choose stable evaluation conditions and to set realistic tolerances.

Define a golden evaluation dataset: a small, fixed split that is checked in (or generated deterministically) and never used for training in tests. Run your evaluation script on this dataset and store a snapshot of the metrics (for example, golden_metrics.json). In CI, compare the newly produced metrics to the snapshot with tolerances (e.g., F1 must not drop by more than 0.01). This avoids brittle “exact match” assertions.

  • Tolerance patterns: absolute tolerance for bounded metrics (accuracy/F1); relative tolerance for unbounded losses; and “floor” checks for minimum acceptable performance (e.g., roc_auc >= 0.75).
  • Leakage guard: add a test that ensures no overlap between train and eval IDs, and that feature generation doesn’t peek at label columns.
  • Common mistake: fitting preprocessing on the full dataset inside the test; ensure pipelines fit only on the training split and then transform the eval split.

Snapshot testing can also apply to non-metric artifacts: expected feature column lists, expected label mapping, or expected JSON report structure. When snapshots change intentionally, update them in the PR with a clear explanation. A reviewer will interpret this as disciplined change control, which is exactly what certification-grade projects should demonstrate.
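A tolerance-based comparison against the golden snapshot might look like this sketch (the metric names, tolerance, and floor values are illustrative; tune them to your observed run-to-run variance):

```python
import json


def check_against_golden(current, golden, abs_tol=0.01, floors=None):
    # Return a list of violations; an empty list means the PR passes.
    problems = []
    for name, baseline in golden.items():
        value = current.get(name)
        if value is None:
            problems.append(f"missing metric: {name}")
        elif value < baseline - abs_tol:
            problems.append(
                f"{name} regressed: {value:.3f} < {baseline:.3f} (tol {abs_tol})"
            )
    for name, floor in (floors or {}).items():
        if current.get(name, 0.0) < floor:
            problems.append(f"{name} below floor {floor}")
    return problems


# golden_metrics.json is the checked-in snapshot; simulated inline here.
golden = json.loads('{"f1": 0.82, "roc_auc": 0.90}')
current = {"f1": 0.815, "roc_auc": 0.91}
violations = check_against_golden(current, golden, floors={"roc_auc": 0.75})
assert violations == []  # f1 is within tolerance, roc_auc above its floor
```

In CI, a non-empty violation list fails the job and prints every problem at once, so contributors see the full picture instead of fixing one metric per run.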

Section 4.6: Linting, type checks, and coverage thresholds

Tests alone are not the full quality story. Proctored-ready projects typically include automated checks that enforce baseline engineering hygiene: linting, formatting, type checks, and coverage gates. These are “cheap” signals for reviewers that your repo is maintained professionally and that changes won’t degrade readability or safety.

At minimum, add: (1) formatting (Black or Ruff format), (2) linting (Ruff), (3) import sorting (often covered by Ruff), and (4) type checking (mypy or pyright). Configure them to run in CI and locally via make targets or a task runner. Keep configs in pyproject.toml so the setup is discoverable.

Coverage is your enforcement mechanism. Use pytest-cov to generate a coverage report and fail CI if coverage drops below a threshold (for example, 75–85% depending on project size). Do not chase 100% coverage—ML code often includes thin wrappers around libraries where coverage adds little value. Instead, ensure high coverage on your critical logic: transforms, metrics, data validation, and pipeline glue.

  • Quality gate design: enforce coverage for your package (--cov=src_pkg) rather than the whole repo; exclude notebooks; and treat warnings as errors where appropriate.
  • Common mistake: allowing lint/type failures to be “advisory” in CI. If a check matters, make it blocking.

Practical outcome: every pull request runs the same battery—lint, type checks, unit tests, integration smoke tests, metric regression checks, and coverage gates—creating a defensible, review-ready ML project workflow.
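A sketch of the corresponding pyproject.toml configuration (the tool tables are standard; the line length, threshold, and package name are assumptions to adapt):

```toml
[tool.ruff]
line-length = 100

[tool.ruff.lint]
select = ["E", "F", "I"]  # pycodestyle errors, pyflakes, import sorting

[tool.mypy]
ignore_missing_imports = true

[tool.pytest.ini_options]
addopts = "--cov=src_pkg --cov-report=term-missing --cov-fail-under=80"
```

Keeping all four tools configured in one file makes the quality gates discoverable: a reviewer can read pyproject.toml and know exactly what CI enforces.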

Chapter milestones
  • Write unit tests for transforms, metrics, and utilities
  • Add data validation tests for schema and ranges
  • Create integration tests for train/evaluate scripts
  • Add regression tests to detect metric drift in PRs
  • Generate coverage and enforce quality gates
Chapter quiz

1. Why does the chapter argue that tests are essential (not optional) in an ML portfolio project?

Correct answer: They prove reproducibility and prevent silent pipeline/evaluation breakage from data shifts or refactors
ML systems can change behavior without errors due to data and distribution shifts; tests demonstrate reproducibility and protect pipeline contracts.

2. Which testing approach best validates assumptions about input data formats and acceptable value ranges?

Correct answer: Data validation tests for schema and ranges
Schema and range checks are data validation tests that confirm the data meets expected structure and constraints.

3. What is the main purpose of integration tests in this chapter’s testing stack?

Correct answer: Verify that train/evaluate scripts work together correctly as a pipeline
Integration tests validate end-to-end behavior of training/evaluation scripts and their contracts across components.

4. How do regression tests for metric drift help during pull requests (PRs)?

Correct answer: They detect unexpected changes in evaluation metrics caused by code changes before merging
Metric-drift regression tests catch unintended evaluation changes early, preventing silent performance shifts from entering main.

5. Which setup best matches the chapter’s recommended test strategy for proctored-ready projects?

Correct answer: A fast, deterministic suite that runs in seconds on every PR plus an optional slower suite run nightly/on demand
The chapter emphasizes a tight-loop PR suite and optional slower runs, avoiding reliance on secret data or GPUs for CI reliability.

Chapter 5: Documentation That Passes Reviews: README, Cards, and Usage

In proctored or certification-style reviews, documentation is treated as evidence. Reviewers use it to answer a small set of questions quickly: Can they reproduce your results? Do you understand risk, limitations, and intended use? Is the interface usable without reading your source code? This chapter turns documentation from an afterthought into a review-ready artifact set: a portfolio-grade README, a model card, a data sheet, usage guides for CLI/API, and maintenance documents that prove you can operate the project responsibly over time.

The key mindset shift is that documentation is part of your system boundary. If your model requires a specific dataset snapshot, a fixed preprocessing version, and a particular random seed strategy, then those are requirements of the system and must be written down where a reviewer will look first. If your workflow includes baseline-to-improvement iterations, then the narrative of what changed, why, and what evidence improved should be visible through links: experiment reports, metrics tables, CI badges, and release notes.

Throughout this chapter, you will build a documentation “stack” that works together: the README provides the landing page and quickstart; the model card explains behavior and limitations; the data sheet explains what the data is and what it is not; the usage docs make it runnable via CLI/API with edge cases; and maintenance docs (runbooks, changelog, and release checklist) show operational maturity. By the end, a reviewer should be able to clone, install, run, evaluate, and understand limitations in under 15 minutes.

Practice note for Write a portfolio-ready README with quickstart and evidence links: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Publish a model card and a data sheet: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Document API/CLI usage with examples and edge cases: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Create runbooks: troubleshooting, reproducibility, and FAQs: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Add a changelog and release checklist: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Sections in this chapter
Section 5.1: Documentation standards for certification portfolios

Certification and proctored portfolio reviews tend to reward clarity, reproducibility, and responsible disclosure. “Good docs” are not long docs; they are docs that minimize reviewer guesswork. Start by adopting a few standards you can point to explicitly in the repository: a consistent template per document type, a definition of “done” for documentation, and evidence links that connect claims to artifacts.

A practical standard is to treat every document as a contract with three promises: (1) it states prerequisites and assumptions, (2) it provides exact commands to reproduce key outputs, and (3) it records limitations and risks without defensiveness. Reviewers distrust projects that hide tradeoffs. They trust projects that list caveats and show mitigations (e.g., data leakage checks, baseline comparisons, and failure modes).

  • Single-source entry point: README is the landing page; everything else is linked from it.
  • Evidence-first: include links to metrics reports, CI runs, release tags, and generated docs. If results are in a notebook, export a static HTML/PDF artifact and link it.
  • Reproducibility checklist: pin Python and dependency versions, declare data snapshot IDs, and document seeding and determinism expectations.
  • Audience targeting: separate “user” docs (how to run) from “auditor” docs (why decisions were made, risks, limitations).

Common mistakes: burying install steps in a long narrative, omitting the exact dataset version, presenting metrics without defining the split, and failing to state intended use. In a review, those omissions look like gaps in engineering judgment. A good standard is: if a reviewer asks “What would I need to know to reproduce this?”, the answer should already be in the docs.

Section 5.2: README architecture (setup, usage, results, limitations)

Your README should read like an onboarding script. Reviewers skim; they want a predictable layout. A portfolio-ready README typically has: project summary and scope, quickstart, repository structure, dataset and environment notes, training/evaluation commands, results with evidence, limitations/risks, and contribution/maintenance links.

Begin with a short “What this is” and “What this is not.” Then include a Quickstart that is copy/pasteable and uses the same interface you test in CI (often a CLI entry point). Prefer 6–10 lines that get to a concrete output (e.g., running evaluation on a small sample dataset) over 30 lines that describe options.

  • Setup: Python version, environment creation, dependency install, and how to fetch data (or a synthetic/sample dataset). Mention GPU requirements only if truly necessary.
  • Usage: one or two canonical commands: train, evaluate, and predict. Link to deeper usage docs later.
  • Results: a small table with baseline vs improved model metrics, with the dataset split and metric definition. Link to the full report artifact.
  • Limitations: known failure modes, data constraints, fairness/privacy notes, and out-of-scope use cases.

Include “evidence links” near claims: CI badge for tests, a link to a release tag used for the reported results, and a link to a frozen environment file (e.g., requirements.txt with hashes or poetry.lock). If you publish a model artifact, include its checksum and where it was produced (local vs CI). A frequent reviewer complaint is “results not traceable to code state”; solve this by tying results to a commit SHA or release.

Engineering judgment shows in what you omit. Don’t paste long logs into README. Don’t list every experiment; summarize the key narrative (baseline → improvement) and link to supporting artifacts. Your README is not your lab notebook; it is your product front door.

Section 5.3: Model card: intended use, metrics, and caveats

A model card is your formal disclosure about what the model is for, how it was evaluated, and where it can fail. Reviewers look for responsibility and rigor: clear intended use, explicit out-of-scope use, transparent metrics, and caveats tied to data and deployment conditions.

Start with Intended Use: who should use the model, in what setting, and with what human oversight. Then add Out-of-Scope Use: decisions the model must not make (or must not make alone). This is not legalese; it is risk control. For example, if you built a text classifier trained on forum data, state that it may not generalize to formal writing, other languages, or sensitive domains.

  • Model details: algorithm family, key hyperparameters, training regime, and version identifiers.
  • Metrics and evaluation: define metrics (e.g., F1, AUROC) and why they match the problem; report confidence intervals if feasible; document the split strategy and any cross-validation.
  • Thresholding: if classification thresholds are used, record the selection method and tradeoffs (precision/recall impact).
  • Limitations and failure modes: known confusing classes, performance drops on certain subgroups, sensitivity to missing values or input length.

A practical tip: include “minimum viable evaluation” and “recommended evaluation.” Minimum viable might be running make eval against a fixed test split. Recommended evaluation might include stress tests (e.g., corrupted inputs), slice metrics, and drift checks. When reviewers see that you can articulate caveats and verification steps, they infer you can operate the model safely.

Common mistakes: reporting a single metric without context, not stating the evaluation dataset provenance, and implying real-world performance from offline tests. Your model card should explicitly differentiate offline evaluation from expected production behavior and name what would need monitoring (input distribution shifts, latency constraints, and feedback loops).

Section 5.4: Data documentation: provenance, splits, and privacy

Data documentation (often called a data sheet) is where you prove that you understand the dataset lifecycle: where it came from, what transformations you applied, how you split it, and what privacy or licensing constraints apply. In certification contexts, this is frequently the difference between a “toy project” and a professional project.

Start with Provenance: source URLs, access date, license, and any collection methodology. If the data is internal or simulated, document how it was generated and why it is representative. Then document Schema and semantics: feature definitions, units, label meaning, missing value conventions, and any known label noise.

  • Splits: define train/validation/test, including whether the split is random, stratified, time-based, or group-based. Justify the choice to prevent leakage (e.g., user-level grouping).
  • Versioning: record dataset snapshot identifiers (DVC hash, checksum, or storage version). Note any preprocessing versions that affect the final dataset.
  • Privacy: identify PII fields, anonymization steps, retention policy, and whether data can be redistributed. If you cannot share the full data, provide a small sample dataset and a clear “bring-your-own-data” pathway.

Engineering judgment appears in how you handle constraints. If licensing prevents redistribution, say so plainly and provide scripts that validate a user-supplied dataset matches the expected schema. If privacy considerations exist, document threat models (re-identification risk, membership inference concerns) at an appropriate depth for the project scope. Reviewers want to see you can name risks and design around them.

Common mistakes: forgetting to document how labels were produced, not specifying the split seed or strategy, and quietly filtering rows without recording criteria. Your data sheet should make every significant transformation discoverable and reproducible.

Section 5.5: Usage docs: CLI/API, configs, and examples

Usage documentation is where a reviewer verifies the system is runnable and stable. If your project has both a CLI and a Python API, document both, but pick one as canonical (usually the CLI) and ensure it matches your CI smoke tests. Usage docs should include examples, expected outputs, and edge cases—especially the edge cases your tests cover.

For a CLI, document the contract: required arguments, optional flags, defaults, and exit codes. Show at least three examples: a minimal run, a typical run with configuration, and an evaluation-only run. If you use configuration files (YAML/TOML), include a documented example config and explain precedence rules (CLI overrides config, environment variables override both, etc.).

  • Input validation: document what happens with missing columns, NaNs, wrong dtypes, empty files, or unseen categories.
  • Determinism: document seeds and which operations are nondeterministic (e.g., GPU kernels) and how you mitigate it.
  • Performance considerations: batch size, streaming vs loading into memory, and how to run a “small mode” for quick verification.

For a Python API, provide a short, stable snippet: import, load model, preprocess, predict. State what types are accepted and returned. If your interface changes, the usage docs must change in the same PR—this is why many teams treat documentation as part of the definition of done and add doc checks in CI (for example, failing builds when CLI help text drifts from documented examples).
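A self-contained sketch of such a snippet follows. A tiny scikit-learn model is trained and saved first so the example runs anywhere; in your actual usage docs, only the load-and-predict half would appear, and the path and model type are assumptions:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# Setup for the sketch: produce a model artifact to load.
X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([0, 0, 1, 1])
model_path = os.path.join(tempfile.mkdtemp(), "model.joblib")
joblib.dump(LogisticRegression().fit(X_train, y_train), model_path)

# --- the short, stable snippet your usage docs would show ---
model = joblib.load(model_path)      # accepts a filesystem path
X_new = np.array([[0.2], [2.8]])     # accepts a 2D float array
predictions = model.predict(X_new)   # returns a 1D integer array
```

Stating the accepted and returned types inline, as the comments do here, is exactly the contract the usage docs should make explicit.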

Common mistakes: docs that only show happy paths, examples that rely on files not in the repo, and commands that diverge from what CI runs. A good practice is to copy commands directly from CI workflow steps into the docs, so they cannot drift without someone noticing.

Section 5.6: Maintenance docs: changelog, runbooks, and governance

Maintenance documentation is what convinces reviewers your project is not a one-off script. Even solo portfolio projects benefit from lightweight operational discipline: a changelog to track user-visible changes, runbooks to handle failures, and simple governance rules for changes and releases.

A CHANGELOG should follow a consistent format (commonly “Keep a Changelog”) and map entries to versions and dates. Focus on user-impacting changes: new features, bug fixes, breaking changes, and security/privacy notes. Avoid dumping commit messages; instead, summarize what changed and how to adapt. Tie releases to Git tags so results and artifacts can be traced to a version.
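A "Keep a Changelog"-style fragment might look like this (the version, date, and entries are invented for illustration):

```markdown
## [1.1.0] - 2025-01-15
### Added
- `--smoke-test` flag for a fast end-to-end verification run.
### Changed
- Classification threshold selection method; see the updated model card.
### Fixed
- Missing-value handling in feature building for empty text fields.
```

Each entry names the user-visible change and, for breaking or behavioral changes, points to the document that explains how to adapt.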

Runbooks are “what to do when something goes wrong.” Include at least three: (1) Troubleshooting (installation issues, missing system libraries, dataset download failures), (2) Reproducibility (how to rerun training with the same seed and data snapshot, how to verify environment), and (3) FAQs (common usage questions, expected runtime, where outputs are stored). Make runbooks command-oriented: symptoms → likely causes → diagnostics → fixes.

  • Release checklist: update version, run tests and lint, regenerate metrics report, update model card/data sheet if data or behavior changed, update changelog, tag release.
  • Governance: define how changes are proposed (issues/PRs), minimum CI checks, and documentation updates required for interface changes.

Common mistakes: changing model behavior without updating the model card, shipping breaking CLI flags without a changelog entry, and treating troubleshooting as ad-hoc. A small, explicit governance section (even in a solo project) signals maturity: you have a process for keeping the project correct as it evolves.

Chapter milestones
  • Write a portfolio-ready README with quickstart and evidence links
  • Publish a model card and a data sheet
  • Document API/CLI usage with examples and edge cases
  • Create runbooks: troubleshooting, reproducibility, and FAQs
  • Add a changelog and release checklist
Chapter quiz

1. In a proctored or certification-style review, what is the primary role of documentation?

Correct answer: Evidence that lets reviewers quickly verify reproducibility, intended use, and usability without reading source code
The chapter frames documentation as evidence reviewers use to confirm they can reproduce results, understand risks/limits, and use the interface quickly.

2. What does the chapter mean by the mindset shift that documentation is part of the “system boundary”?

Correct answer: All operational requirements (data snapshot, preprocessing version, seed strategy) are system requirements and must be documented where reviewers will look first
If the project depends on specific data versions, preprocessing, or seeding, those are part of the system and must be explicitly documented.

3. Which pairing best matches each document type to its purpose in the documentation “stack” described in the chapter?

Correct answer: README: landing page and quickstart; Model card: behavior/limitations; Data sheet: what the data is and is not
The chapter assigns distinct roles: README for quickstart, model card for behavior and limits, and data sheet for dataset scope and constraints.

4. To make baseline-to-improvement iterations reviewable, what should documentation emphasize according to the chapter?

Correct answer: A narrative of what changed and why, plus evidence links such as experiment reports, metrics tables, CI badges, and release notes
Reviewers should be able to see what changed, the rationale, and evidence of improvement through linked artifacts and indicators.

5. What is the target reviewer experience the chapter aims for by the end of the documentation work?

Correct answer: A reviewer can clone, install, run, evaluate, and understand limitations in under 15 minutes
The chapter’s goal is fast reviewability: clone, run, evaluate, and grasp limitations quickly (under 15 minutes).

Chapter 6: CI, Packaging, and Proctored-Ready Submission

In earlier chapters you built a defensible ML workflow: pinned environments, data strategy, tests, and documentation. This chapter turns that work into a proctored-ready deliverable. Proctors and reviewers are not only evaluating whether your model “works”; they are checking whether your project is reproducible, reviewable, and safe to run on their machine under time pressure. That means you need three things working together: (1) continuous integration (CI) that runs linting, tests, and build checks on every pull request (PR), (2) packaging with a single entry-point command that a reviewer can run without guesswork, and (3) a release artifact and version tag that freezes what you submitted.

A common mistake is treating CI and packaging as “nice to have” add-ons. In a certification context, they are evidence. CI proves your tests are real and run automatically. Packaging proves your repository is installable and provides a stable interface (CLI/API). Releases prove exactly what code you submitted. The goal is not perfection; the goal is reducing reviewer uncertainty. If a reviewer has to infer how to run your project, you have already lost valuable credibility.

We will also simulate the proctored experience: a timed walkthrough where you explain your choices, run core commands, and respond to typical reviewer questions. Expect to find gaps—missing instructions, brittle scripts, or tests that pass locally but fail in clean environments. Fixing those gaps is part of making your portfolio “proctored-ready.”

  • Outcome: every PR runs linting, tests, and a build/import check in CI.
  • Outcome: one command to reproduce the baseline and improved runs (training/evaluation) and generate artifacts.
  • Outcome: a versioned release with a downloadable artifact and release notes.
  • Outcome: a rehearsed narrative that defends scope, metrics, validation, and risk decisions.
  • Outcome: a final submission packet that anticipates reviewer questions and provides evidence.

The rest of the chapter is structured as six practical sections you can implement directly in your repository.

Practice note for each milestone in this chapter (CI on every PR, reproducible packaging with a single entry-point command, a versioned release artifact, the timed proctored-style walkthrough, and the final portfolio packet): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Sections in this chapter
Section 6.1: CI pipeline design (GitHub Actions) and caching
Section 6.2: Secrets, dependency scanning, and supply-chain basics
Section 6.3: Packaging patterns (pyproject, console scripts, make targets)
Section 6.4: Release workflow: tags, artifacts, and release notes
Section 6.5: Proctored walkthrough practice: narrative and defense of choices
Section 6.6: Final submission checklist and reviewer Q&A playbook

Section 6.1: CI pipeline design (GitHub Actions) and caching

Your CI pipeline is a contract: “Any code merged into main meets minimum quality and reproducibility checks.” For proctored review, design it to run on every PR and on pushes to main. Keep it deterministic and fast. Reviewers often skim CI logs; make the job names readable (e.g., lint, tests, build) and fail early with clear error output.

A practical baseline GitHub Actions workflow usually includes: (1) checkout, (2) set up Python, (3) install dependencies, (4) run linters/formatters, (5) run unit + integration tests, and (6) run a minimal “build check” that ensures your package installs and imports, and that the CLI entry point responds (e.g., --help). In ML repos, also add a quick data validation test that runs against a tiny fixture dataset or schema-only checks, not the full training data.

Caching is the difference between a 12-minute pipeline and a 2-minute pipeline. Use the built-in pip caching in actions/setup-python, or cache your virtualenv/poetry/pip-tools artifacts directly. Cache keys should incorporate your lockfile hash (requirements.txt, poetry.lock, or uv.lock) so the cache invalidates correctly when dependencies change. A common mistake is caching too aggressively and accidentally hiding dependency issues; if you hit suspicious behavior, temporarily disable caching to confirm the workflow itself is correct.

  • Lint stage: run ruff (or flake8) and optionally black --check. Fail fast.
  • Test stage: run pytest with coverage thresholds; keep integration tests bounded in time by using small fixtures.
  • Build stage: python -m build (PEP 517) and pip install dist/*.whl, then run your-cli --help.
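The stages above can be sketched as a single GitHub Actions workflow. This is a minimal illustration, not a fixed template: the job names, Python version, coverage threshold, and the `mlp` CLI name are all assumptions you should adapt to your own repository.

```yaml
name: ci
on:
  pull_request:
  push:
    branches: [main]
jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
          cache: "pip"          # keyed on requirements files by default
      - run: pip install ruff && ruff check .
  tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
          cache: "pip"
      - run: pip install -r requirements.txt
      - run: pytest -q --cov --cov-fail-under=80
  build:
    runs-on: ubuntu-latest
    needs: [lint, tests]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install build && python -m build
      - run: pip install dist/*.whl && mlp --help   # smoke-check the entry point
    # build runs only after lint and tests pass, so a green check means all three held
```

Splitting lint, tests, and build into separate jobs keeps the log output readable and lets a reviewer see at a glance which gate failed.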

Engineering judgment: don’t run full training in CI unless it is genuinely lightweight. Instead, validate that pipelines execute end-to-end on a miniature dataset and that metrics calculation code is correct. This proves correctness without burning minutes and compute. If you need GPU training, CI should still run CPU smoke tests and unit tests; document the “full run” instructions separately.
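The "prove correctness cheaply" idea looks like this in practice: unit-test the metric code itself on hand-checked values rather than re-running training. The `accuracy` function and test names below are illustrative, not this project's actual API.

```python
# Smoke-level checks that run in seconds on hand-checked fixtures,
# instead of full training in CI. All names here are illustrative.

def accuracy(y_true, y_pred):
    """Fraction of matching labels; fails loudly on malformed input."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("inputs must be non-empty")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def test_accuracy_known_values():
    # 3 of 4 labels match, so accuracy must be exactly 0.75
    assert accuracy([1, 0, 1, 1], [1, 0, 0, 1]) == 0.75

def test_accuracy_rejects_mismatched_lengths():
    try:
        accuracy([1, 0], [1])
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError for mismatched lengths")
```

Tests like these prove your metric computation is correct without any dataset or GPU, which is exactly the level of evidence CI should guarantee.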

Practical outcome: when a reviewer opens your PR history, they see automated checks that consistently pass and guard the same standards you claim in your README.

Section 6.2: Secrets, dependency scanning, and supply-chain basics

Proctored reviewers will notice if your project leaks credentials or uses risky dependency practices. “Secrets hygiene” is not just security theater; it demonstrates professional discipline. First, ensure your repository never requires real secrets to run core workflows. If you need API keys (for optional data downloads or experiment tracking), design the code so it runs without them and clearly documents the optional path.

Use GitHub Actions secrets only for CI tasks that truly require them (e.g., publishing a release to PyPI, uploading an artifact to cloud storage). Never print secrets in logs. Avoid commands that echo environment variables. In your documentation, instruct users to export keys locally or use a .env file that is gitignored. Add a .env.example that contains placeholder variable names but no values.
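A .env.example can be as small as this; the variable names below are placeholders invented for illustration, and the whole point is that no real values ever appear in it:

```bash
# .env.example — copy to .env (which is gitignored) and fill in locally.
# Core workflows must run without these; they gate optional paths only.
EXPERIMENT_TRACKER_API_KEY=
DATA_DOWNLOAD_TOKEN=
```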

Dependency and supply-chain basics are increasingly part of professional review. At minimum: pin dependencies (lockfile or exact versions), prefer reputable sources (PyPI, official wheels), and scan for known vulnerabilities. GitHub Dependabot can open PRs for dependency updates; GitHub’s dependency graph and alerts will then attach to your repo. If you are using GitHub Advanced Security features (where available), enable secret scanning and code scanning. If not available, you can still add lightweight checks such as running pip-audit in CI to flag vulnerable packages.
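Where Advanced Security features are not available, a small audit job keeps vulnerability checking visible in CI. This fragment (assuming a pinned requirements.txt) would sit under the `jobs:` key of your workflow; the job name is arbitrary:

```yaml
  audit:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install pip-audit
      # flags packages with known vulnerabilities in the pinned set
      - run: pip-audit -r requirements.txt
```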

  • Secret scanning mindset: treat any token as compromised once committed; rotate immediately and purge history if needed.
  • Supply-chain mindset: minimize dependency count; prefer well-maintained libraries; avoid installing from random Git URLs in CI.
  • Reproducibility mindset: lock versions; ensure CI and local dev use the same install path.

Common mistake: adding a “download data” step in CI that pulls from a private bucket or requires credentials. This causes flaky CI and blocks reviewers. Instead, use small public sample data, generated fixtures, or schema-only checks in CI. Keep the “full dataset” process documented but separate from mandatory checks.

Practical outcome: a reviewer can safely run your repo, and your CI demonstrates that you understand basic security expectations without overcomplicating the project.

Section 6.3: Packaging patterns (pyproject, console scripts, make targets)

Packaging is how you turn a folder of code into a reproducible tool. In proctored settings, reviewers want a single, predictable entry point. The modern standard is pyproject.toml (PEP 621 metadata, PEP 517 builds). Whether you use Hatch, Poetry, setuptools, or uv, the key is consistency: one build system, one lock strategy, and one install command documented in the README.

A strong pattern is to expose console scripts. For example, define a CLI like mlp with subcommands: mlp data-validate, mlp train, mlp evaluate, mlp predict. This gives reviewers a stable interface and lets CI do meaningful smoke checks (mlp --help, mlp data-validate --sample). Your CLI should accept config via a YAML/JSON file and allow overriding key parameters via flags. Keep defaults safe and fast (small runs by default; full runs behind an explicit flag).
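In pyproject.toml terms, a CLI like that might be wired up as follows. The `mlp` name, the `mlp.cli:main` module path, the pinned versions, and the setuptools backend are all assumptions for illustration; any PEP 517 backend works the same way:

```toml
[build-system]
requires = ["setuptools>=68"]
build-backend = "setuptools.build_meta"

[project]
name = "mlp"
version = "1.0.0"
requires-python = ">=3.10"
dependencies = ["pyyaml==6.0.1", "scikit-learn==1.4.2"]

[project.optional-dependencies]
dev = ["ruff", "pytest", "pytest-cov", "build"]

[project.scripts]
mlp = "mlp.cli:main"   # installs an `mlp` command that calls mlp.cli.main()
```

After `pip install .`, the `mlp` command exists on PATH in any environment, which is what makes CI smoke checks and reviewer quickstarts reliable.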

Pair packaging with make targets (or just documented shell commands) that standardize developer actions. A minimal Makefile can wrap: make install, make lint, make test, make build, make smoke. The benefit is not “Make” itself; it’s eliminating ambiguity. Reviewers are often time-boxed—your job is to reduce their cognitive load.

  • pyproject: defines metadata, dependencies, optional extras like dev (lint/test/docs).
  • console scripts: provide a single entry-point command; avoid calling internal modules directly.
  • make targets: provide consistent verbs; map them to CI steps.
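A minimal Makefile in that spirit might look like this (the target bodies are illustrative, and the `mlp` command and `dev` extra are assumed to exist in your packaging setup; recipe lines must be indented with tabs):

```make
install:
	pip install -e ".[dev]"

lint:
	ruff check src tests

test:
	pytest -q --cov --cov-fail-under=80

build:
	python -m build && pip install dist/*.whl

smoke:
	mlp --help
	mlp data-validate --sample
```

The verbs here intentionally mirror CI job names, so `make lint` locally and the `lint` job in CI enforce the same standard.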

Common mistakes: (1) relying on notebook execution as the primary interface, (2) hardcoding file paths that only exist on your machine, and (3) mixing multiple environment tools without explaining which is authoritative. If you include notebooks, treat them as supplementary; the “official” workflow should run from the CLI.

Practical outcome: the reviewer can install your package in a clean environment, run one command to reproduce baseline outputs, and see artifacts appear in documented locations.

Section 6.4: Release workflow: tags, artifacts, and release notes

A proctored-ready submission is a snapshot, not a moving target. That’s what releases and tags provide. Use semantic versioning (v1.0.0, v1.0.1) or a clear dated tag (2026.03-submission)—the key is that your submission references an immutable Git tag. Reviewers should be able to check out that tag and reproduce the same behavior you claim.

Create a release artifact that bundles what the reviewer needs. Typically: source archive (automatic), built wheel/sdist (python -m build), and optionally a “portfolio packet” zip containing your model card, data sheet, metrics report, and example outputs. If you generate an evaluation report (HTML/Markdown/JSON), include it as an artifact as well. Artifacts are not a substitute for reproducibility, but they provide quick evidence during review.
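A tag-triggered release workflow might look like the sketch below. The third-party release action and the reports/ path are assumptions; swap in whichever publishing step and evidence files your repository actually produces:

```yaml
name: release
on:
  push:
    tags: ["v*"]            # fires only on version tags like v1.0.0
jobs:
  release:
    runs-on: ubuntu-latest
    permissions:
      contents: write       # required to create the GitHub release
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install build && python -m build
      - uses: softprops/action-gh-release@v2
        with:
          files: |
            dist/*
            reports/evaluation_report.md
```

Because the workflow only fires on tags, every release is pinned to one immutable commit, which is the property reviewers rely on.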

Release notes should read like a professional changelog entry. Include: scope, dataset version, metric definitions, validation approach, and known limitations. Call out reproducibility instructions: Python version, install command, and the exact commands to regenerate key results. Keep it short but concrete.

  • Tagging: tag the commit that passed CI; avoid tagging “dirty” states.
  • Artifacts: attach build outputs and a small evidence bundle; verify downloads work.
  • Notes: document what changed and how to reproduce; list known issues transparently.

Common mistake: releasing without verifying the release checkout works. Before publishing, do a “cold start” test: in a fresh directory (or container), clone the repo, checkout the tag, create a new environment, install, run mlp --help, and execute a smoke pipeline. This mirrors the reviewer experience and catches missing files, unpinned dependencies, or undocumented steps.

Practical outcome: your submission becomes a durable reference that can be audited later—exactly what certification-grade review expects.

Section 6.5: Proctored walkthrough practice: narrative and defense of choices

Proctored evaluation rewards clarity under time constraints. Practice a timed walkthrough (e.g., 20–30 minutes) where you explain the repository as if the reviewer has never seen it. Your goal is to present a coherent narrative: problem → data → baseline → improvements → evaluation → safeguards → how to run. This is not marketing; it is a defensible technical story.

Start with scope and constraints: what the model does, what it explicitly does not do, and the risks you mitigated. Then show reproducibility: open the README “Quickstart,” create the environment, install, and run a single entry-point command that produces a visible result (a metrics report, a saved model, a prediction output). If your project has multiple modes (train/evaluate/predict), demonstrate the smallest credible end-to-end path.

Next, defend evaluation choices. Be prepared to explain why your metric matches the use case, how you split data (and how you prevented leakage), and why your validation scheme is appropriate. If you used cross-validation, justify the fold strategy. If you used a time split or group split, explain what entity you protected. Show where this logic is tested (unit tests for metric computation; integration tests for pipeline wiring; data validation for schema expectations).
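A leakage guard for a group split can be a few lines you point to during the walkthrough. The function and group names below are hypothetical, not this project's actual API; the idea is simply that no group identifier may appear on both sides of the split:

```python
# Illustrative leakage check for a group-based split: if any group id
# (e.g., a user or patient) appears in both train and test, fail loudly.

def assert_no_group_leakage(train_groups, test_groups):
    """Raise AssertionError if a group id crosses the train/test boundary."""
    overlap = set(train_groups) & set(test_groups)
    if overlap:
        raise AssertionError(f"groups leak across split: {sorted(overlap)}")

# A disjoint split passes silently...
assert_no_group_leakage(["user_a", "user_a", "user_b"], ["user_c"])

# ...while an overlapping one fails, which is exactly what a CI test wants.
leak_detected = False
try:
    assert_no_group_leakage(["user_a"], ["user_a", "user_c"])
except AssertionError:
    leak_detected = True
assert leak_detected
```

Wiring this into pytest gives you a concrete answer to "how do you prevent leakage?": point at the test, not at your intentions.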

  • Walkthrough script: 1) repo tour, 2) install, 3) lint/test, 4) run smoke pipeline, 5) open report artifact, 6) point to docs (model card/data sheet).
  • Defense points: leakage prevention, baseline rationale, improvement justification, error analysis, and known limitations.
  • Time discipline: prefer showing one complete run over describing five partial ideas.

Common mistakes: over-indexing on model novelty, ignoring operational concerns (how to run), and being unable to locate evidence quickly (where are the metrics? where is the model card?). Fix gaps by adding “evidence pointers” in your README: links to reports, commands, and file paths. If you stumble during the rehearsal, that is a signal to simplify interfaces or improve documentation.

Practical outcome: you can confidently run and explain your project in a clean environment, with a crisp narrative that aligns with your CI, docs, and release tag.

Section 6.6: Final submission checklist and reviewer Q&A playbook

Your final submission should be a packet of evidence, not just a GitHub link. Assemble it intentionally so a reviewer can verify claims quickly. Think in terms of: “What would I need to trust this project without running it?” and “If I do run it, what commands guarantee success?” Your checklist should be short, binary, and testable.

Checklist (repository): CI passes on main; PR checks include lint, tests, and build/import checks; dependencies are pinned; secrets are not required for core runs; README includes a Quickstart with exact commands; model card and data sheet are present; licensing and citation notes are clear; example outputs are included (or generated by a documented command). Ensure the single entry-point command works from a fresh install and produces artifacts in documented locations.

Checklist (release): tag exists and matches the submission; release notes include reproduction commands and dataset/version identifiers; artifacts are attached (wheel/sdist, evidence bundle); the release checkout has been verified in a clean environment. If your course/exam expects a PDF packet, generate it from your docs and include it in the release.

Prepare a reviewer Q&A playbook: short answers with pointers to evidence. Examples: “How do you prevent data leakage?” (point to split function, tests, and docs), “Why this metric?” (point to evaluation section and business alignment), “What happens with missing values?” (point to preprocessing code and data validation tests), “How reproducible is training?” (point to seed control, environment pinning, and CI).
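For the reproducibility question, a seed-control helper is an easy evidence pointer. This is a common pattern shown without framework-specific seeding; real projects would also seed numpy/torch here:

```python
import os
import random

def set_global_seed(seed: int = 42) -> None:
    """Seed the RNG sources this sketch covers. PYTHONHASHSEED is recorded
    for subprocesses (it only affects hashing if set before the interpreter
    starts); extend this function to seed numpy/torch if present."""
    random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)

# Same seed, same draws: the property a reviewer will check.
set_global_seed(7)
first = [random.random() for _ in range(3)]
set_global_seed(7)
second = [random.random() for _ in range(3)]
assert first == second
```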

  • Evidence-first answers: always reference a file path, command, or CI job.
  • Known limitations: state them plainly and show mitigations or future work.
  • Failure modes: mention what breaks and how you detect it (tests, validation, monitoring hooks).

Common mistake: submitting “main” without a tag, which makes the project mutable after submission. Another is having instructions that only work on your machine due to unstated OS assumptions. If possible, test on at least one alternate environment (e.g., Linux CI plus local macOS/Windows) and document any platform notes.

Practical outcome: you submit a versioned, reproducible ML project with clear evidence and a practiced defense—exactly what “proctored-ready” means in a certification-grade portfolio.

Chapter milestones
  • Set up CI to run linting, tests, and build checks on every PR
  • Add reproducible packaging and a single entry-point command
  • Create a release artifact and tag a versioned submission
  • Run a timed proctored-style walkthrough and fix gaps
  • Assemble the final portfolio packet and review evidence
Chapter quiz

1. In this chapter’s context, why are CI and packaging treated as required evidence rather than “nice-to-have” add-ons?

Correct answer: They reduce reviewer uncertainty by proving tests run automatically and the project is installable with a stable interface
CI shows lint/tests/build checks run on every PR, and packaging provides a reproducible, reviewable entry point—both are evidence in a certification review.

2. Which combination best matches the three components the chapter says must work together for a proctored-ready deliverable?

Correct answer: CI on every PR, packaging with a single entry-point command, and a versioned release artifact/tag
The chapter explicitly calls out CI + single-command packaging + a release artifact and version tag as the proctored-ready trio.

3. What is a key risk if a reviewer has to infer how to run your project during evaluation?

Correct answer: You lose credibility because the project is no longer clearly reproducible and reviewable under time pressure
The chapter emphasizes that ambiguity about how to run the project increases reviewer uncertainty and reduces credibility.

4. What is the main purpose of the timed proctored-style walkthrough described in the chapter?

Correct answer: To surface gaps like missing instructions, brittle scripts, or CI failures in clean environments and then fix them
The walkthrough simulates the proctored experience and is meant to reveal and correct issues that appear under time pressure or clean setups.

5. Which outcome best reflects what the chapter expects from your final submission packet?

Correct answer: A packet that anticipates reviewer questions and provides evidence, alongside a versioned release artifact and notes
The chapter’s outcomes include a final packet with evidence and a versioned release, designed to be reviewable and defensible.