AI Ethics, Safety & Governance — Intermediate
Classify your AI system and produce EU AI Act-ready documentation.
This course is a short technical book disguised as a hands-on lab: you will take one AI system (a real one from your organization or a realistic case study) and walk it from “What are we building?” to an audit-friendly EU AI Act documentation package. The focus is practical execution—classification, obligations mapping, controls, and technical documentation structure—so you can collaborate effectively with legal, security, product, and engineering without getting lost in abstract policy.
Instead of treating the EU AI Act as a wall of text, you’ll learn a repeatable workflow that turns requirements into concrete artifacts: a risk classification memo, a control checklist with owners, an evidence register, and a technical file index that makes audits and internal reviews faster. By the end, you’ll have a blueprint you can reuse for future systems, model updates, and supplier-integrated components.
This lab is built for product managers, ML engineers, compliance owners, risk teams, and startup leaders who need to move from “we’ve heard about the EU AI Act” to “we can demonstrate compliance readiness.” It assumes you can describe an AI system’s purpose and deployment context, but it does not require a legal background.
Chapter 1 sets the foundation: the definitions that drive everything else (roles, intended purpose, boundaries) and the documentation discipline you’ll need. Chapter 2 applies a structured decision path to classify your system and capture the rationale you’ll later defend. Chapter 3 converts obligations into a manageable control framework with ownership, versioning, and traceability.
With the plan in place, Chapter 4 focuses on the technical documentation spine—architecture, data governance, evaluation evidence, robustness, logging, and how to organize a technical file for review. Chapter 5 makes the system deployable responsibly: human oversight design, transparency touchpoints, and instructions that operators can actually follow. Finally, Chapter 6 prepares you for real operations: post-market monitoring, incidents, corrective actions, and an audit-ready packaging approach.
Every chapter ends with milestones that produce artifacts. You’ll be encouraged to write in plain language, link every claim to evidence, and maintain a single source of truth for assumptions and version changes. This is the same style used by teams that need to move quickly while staying defensible under scrutiny.
If you want to practice the workflow immediately, register for free and begin with the Chapter 1 scoping exercises. To explore related governance and safety material, you can also browse all courses.
AI Governance Lead & Compliance Documentation Specialist
Sofia Chen leads AI governance programs for product teams operating in regulated markets. She specializes in translating EU AI Act obligations into practical engineering workflows, technical documentation, and audit-ready evidence. Her work focuses on risk classification, post-market monitoring, and human oversight design.
The EU AI Act is not “an ethics checklist.” It is a product-and-process regulation that asks builders to define what they are shipping, who is responsible for which obligations, what risks it creates in real use, and what evidence proves you did the required work. This course is a compliance lab, so we’ll treat the Act like an engineering spec: clarify boundaries, map roles, classify risk, and then build a documentation set that can survive external scrutiny.
This chapter establishes the builder’s workflow you will repeat throughout the lab. You will (1) define your system’s intended purpose and boundaries; (2) identify where you sit in the AI value chain (provider, deployer, importer, distributor, product manufacturer); (3) create an obligations map you can maintain as the system evolves; (4) set up a compliance evidence workspace with naming conventions; and (5) establish a reusable documentation template pack that mirrors EU AI Act expectations.
Two practical principles will guide you. First, classification is only as good as your scoping assumptions—if you don’t write them down, you will keep re-litigating the same decisions. Second, “documentation” is not prose; it is an evidence index that links claims (what you say is true) to artifacts (what proves it). You will learn to write like an auditor will read: quickly, skeptically, and with a bias toward traceability.
Practice note for Milestone: Define your system’s intended purpose and boundaries: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone: Identify where your system sits in the AI value chain: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone: Create an obligations map you can maintain: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone: Set up a compliance evidence workspace and naming conventions: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone: Establish a reusable documentation template pack: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
The EU AI Act regulates the placing on the market, putting into service, and use of AI systems in the EU, with obligations that depend on risk and on your role. For builders, the first practical task is to identify what you are actually delivering: an AI system, a component of a larger product, or a more general model used downstream. This is not wordplay; it changes which technical documentation, instructions, and lifecycle controls are expected and who must maintain them.
In engineering terms, think in layers. A “system” is the end-to-end capability as used in a context (inputs, processing, outputs, and integration points). A “model” is a trained artifact (or family of artifacts) that may be embedded in many systems. A “component” is a module—possibly AI-driven—within a larger product (for example, an AI-based risk scoring component inside a loan origination platform). Your scoping milestone here is to draw the boundary: what is inside your responsibility (training, evaluation, configuration, monitoring hooks) and what is outside (customer data pipelines, business rules you do not control, downstream fine-tuning).
Common mistake: treating a model card as “the documentation.” Model cards are useful, but the Act expects system-level thinking: intended purpose, foreseeable misuse, integration constraints, and operational controls. Another mistake is assuming that because your product is “just an API,” you are exempt from system obligations. If you supply an API that determines or meaningfully influences decisions in regulated contexts, you will still need to document the intended purpose, performance limits, and safe integration requirements.
EU AI Act compliance starts with vocabulary. Teams fail audits not because they lack controls, but because they cannot consistently answer “who is the provider here?” and “what is the intended purpose?” across versions, markets, and customer deployments. Your second milestone—identify where your system sits in the AI value chain—depends on these definitions.
Provider is typically the entity that develops an AI system (or has it developed) and places it on the market or puts it into service under its own name or trademark. If you ship a hosted service under your brand, you are usually the provider. If you white-label a vendor system under your brand, you can still become the provider. Deployer is the entity using the system under its authority (often your customer). Importer and distributor matter when systems enter the EU supply chain; they inherit specific duties around verification, traceability, and cooperation with authorities. Product manufacturer matters when AI is part of a regulated product ecosystem (e.g., machinery, medical devices), where AI Act obligations interact with existing conformity regimes.
Intended purpose is your anchor definition: the specific use for which the system is meant, as described by the provider in instructions and marketing. Builders should treat intended purpose like a requirements baseline: it drives risk classification, required controls, evaluation design, and the content of user instructions. If you keep it vague (“improves productivity”), you will struggle to justify why you are not in a higher-risk category, and you will be unable to specify appropriate human oversight.
Common mistake: writing intended purpose from a product marketing perspective rather than an operational one. The intended purpose should specify the decision or workflow it supports, the target users, the deployment environment, and what the output should and should not be used for.
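To keep the intended purpose operational rather than promotional, it can help to capture it as a small structured record. The sketch below is a minimal illustration in Python; the field names (supported_decision, excluded_uses, and so on) are assumptions for this lab, not terms prescribed by the Act.

```python
from dataclasses import dataclass

@dataclass
class IntendedPurpose:
    """Operational intended-purpose statement (illustrative structure)."""
    system_name: str
    supported_decision: str       # the decision or workflow the output feeds
    target_users: list[str]       # who consumes the output
    deployment_environment: str   # where and how it runs
    permitted_uses: list[str]     # what the output may be used for
    excluded_uses: list[str]      # what the output must not be used for

purpose = IntendedPurpose(
    system_name="CV screening assistant",
    supported_decision="drafting interview questions from a submitted CV",
    target_users=["recruiters in the internal talent team"],
    deployment_environment="internal web app, EU-hosted, SSO-gated",
    permitted_uses=["suggesting interview topics for human review"],
    excluded_uses=["ranking or rejecting applicants", "automated shortlisting"],
)

# A vague statement like "improves productivity" would leave most fields empty,
# which is exactly the signal that the intended purpose is not yet defined.
print(purpose.excluded_uses)
```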
The Act uses a risk-based approach: prohibited practices (unacceptable risk), high-risk systems (stringent requirements), limited-risk systems (primarily transparency duties), and minimal-risk systems (few explicit obligations, but still subject to general legal requirements). Your job is not to “pick the lowest risk label.” Your job is to produce a defensible classification record with assumptions, boundaries, and evidence links—the kind of record you can maintain through product evolution.
Start classification from use case + context, not from model type. The same underlying model can be minimal-risk in one context and high-risk in another. For example, a text generator used for internal drafting may be limited-risk with transparency measures, while a system used to screen job applicants or evaluate creditworthiness can trigger high-risk obligations depending on how it influences decisions and whether it falls into listed high-risk domains.
Engineering judgement matters most in three places. (1) Foreseeable misuse: what users will realistically do, not what your terms of service hope they will do. (2) Decision influence: whether outputs materially shape outcomes in sensitive domains. (3) System boundaries: whether you control deployment configurations, thresholds, and monitoring—or whether customers can repurpose the system in ways that change the risk.
Common mistake: relying on a single sentence like “not for high-stakes use” without enforcement mechanisms. If you claim a use is excluded, you should show how you prevent or discourage it (technical constraints, contractual terms, user prompts, access controls, customer vetting, monitoring triggers).
Compliance readiness is a scheduling problem as much as a legal one. The EU AI Act’s obligations phase in over time, and organizations that treat compliance as a “last month before launch” effort usually fail because evidence cannot be manufactured retroactively. Training data provenance, evaluation baselines, and change logs must exist when the work happens.
Practical readiness means watching enforcement signals: regulator guidance, harmonized standards, and the behaviors of large buyers who will demand documentation in procurement. Even before formal deadlines, enterprise customers often require a “compliance posture” package: role mapping, intended purpose statement, risk classification record, and a documentation index. If you cannot provide these, sales cycles slow down and security reviews expand.
Translate the timeline into engineering milestones. Add “compliance gates” to your product lifecycle: (1) concept gate—intended purpose and boundary; (2) data gate—data governance plan and dataset inventory; (3) model gate—evaluation report and known limitations; (4) release gate—user instructions, human oversight measures, and evidence index; (5) post-release gate—monitoring plan, incident intake, and change control.
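One lightweight way to make these gates enforceable is to express them as data your release process can check. The sketch below assumes a simple pipeline; the gate names and required artifacts mirror the list above but are otherwise illustrative.

```python
# Minimal sketch of lifecycle "compliance gates" as data; names are illustrative.
COMPLIANCE_GATES = {
    "concept":      ["intended_purpose_statement", "system_boundary"],
    "data":         ["data_governance_plan", "dataset_inventory"],
    "model":        ["evaluation_report", "known_limitations"],
    "release":      ["user_instructions", "human_oversight_measures", "evidence_index"],
    "post_release": ["monitoring_plan", "incident_intake", "change_control_log"],
}

def gate_status(gate: str, available_artifacts: set[str]) -> list[str]:
    """Return the artifacts still missing before a gate can be passed."""
    return [a for a in COMPLIANCE_GATES[gate] if a not in available_artifacts]

print(gate_status("release", {"user_instructions", "evidence_index"}))
# -> ['human_oversight_measures']
```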
Common mistake: assuming that “we will document later” is acceptable. Under audit pressure, teams discover gaps like missing dataset licenses, no record of why certain thresholds were chosen, or no traceability from a known issue to a corrective action. The earlier you design for evidence, the cheaper compliance becomes.
Documentation under the EU AI Act should be treated like an engineering control system: it provides repeatable traceability from requirements to implementation to verification. The milestone in this chapter is to set up a compliance evidence workspace and naming conventions, because good evidence is discoverable. If evidence cannot be found quickly, it effectively does not exist.
Think in three layers. (1) Claims: statements you make about the system (intended purpose, risk tier, performance, limitations, human oversight). (2) Controls: processes and technical measures that make those claims true (data governance procedures, evaluation pipelines, access controls, review workflows). (3) Artifacts: concrete outputs (dataset inventories, evaluation reports, model version logs, UI screenshots, incident records). Your documentation pack should connect these via stable IDs and links.
Set up an evidence index—a single table that lists artifact name, owner, version, location, and which requirement/claim it supports. Use naming conventions that survive time and team changes, such as: AIAC-TECHDOC-01-SystemDescription-v1.2.pdf, AIAC-EVAL-05-BiasAudit-2026-02-15.md, and AIAC-DATA-03-DatasetRegister.xlsx. Store immutable snapshots for releases, and keep working documents separately to avoid overwriting historical proof.
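A minimal sketch of an evidence index entry and a naming-convention check follows, assuming the AIAC prefix and category codes from the examples above; adapt the pattern to your own conventions.

```python
import re
from dataclasses import dataclass

# Illustrative pattern matching IDs such as AIAC-EVAL-05-BiasAudit-2026-02-15.md;
# the "AIAC" prefix and category codes are taken from the examples above.
NAME_PATTERN = re.compile(
    r"^AIAC-(TECHDOC|EVAL|DATA)-\d{2}-[A-Za-z]+(-(v\d+\.\d+|\d{4}-\d{2}-\d{2}))?\.\w+$"
)

@dataclass
class EvidenceEntry:
    artifact_name: str
    owner: str
    version: str
    location: str
    supports: list[str]   # requirement or claim IDs this artifact proves

def validate_name(name: str) -> bool:
    """Check that an artifact name follows the agreed naming convention."""
    return bool(NAME_PATTERN.match(name))

entry = EvidenceEntry(
    artifact_name="AIAC-EVAL-05-BiasAudit-2026-02-15.md",
    owner="ml-evaluation-team",
    version="2026-02-15",
    location="grc://evidence/eval/05",
    supports=["CLAIM-PERF-02", "REQ-DATA-GOV-04"],
)
assert validate_name(entry.artifact_name)
```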
Common mistake: mixing “policy statements” with “evidence.” A policy that says “we test for bias” is not evidence; an evaluation report with methodology, results, and sign-off is. Another mistake is scattering artifacts across personal drives and chat threads. Centralize and control access; you will need to show an audit trail, not just final PDFs.
This lab works best when you pick a concrete case and keep it stable while you learn the mechanics. Your final milestone for Chapter 1 is to select a lab case and apply scoping rules so your classification and documentation remain coherent. Choose a system you can describe end-to-end in one page, with a clear user, a clear workflow, and at least one measurable output (score, label, recommendation, generated text) that affects an action.
Use scoping rules to prevent “compliance sprawl.” Rule 1: define one primary intended purpose and up to three supported use scenarios; list everything else as non-intended. Rule 2: freeze the version under assessment (model version, prompts, thresholds, UI flow, integration points). Rule 3: specify data boundaries: what training data you used, what runtime data you expect, and what you explicitly prohibit (e.g., special category data unless justified and controlled). Rule 4: identify human roles: who sees outputs, who can override, and what happens when the system is uncertain.
As you scope, also identify where you sit in the value chain for this case. Are you the provider of the full system, or are you a component supplier? Are you also the deployer in a managed service, or does the customer operate it? These answers determine what your obligations map will include and which documents must be customer-facing (instructions, transparency notices) versus internal (development logs, evaluation protocols).
Common mistake: picking a case that is too abstract (“a general chatbot”) or too broad (“the whole platform”). Pick one deployable capability and one deployment context. You can expand later, but you cannot classify what you cannot scope.
1. Why does the chapter emphasize that the EU AI Act is not “an ethics checklist”?
2. What is the main purpose of defining your system’s intended purpose and boundaries at the start of the workflow?
3. Which set of roles reflects the AI value chain positions the chapter says you must identify for your system?
4. What does the chapter mean by saying “documentation” is not prose?
5. Which sequence best matches the repeatable builder workflow established in Chapter 1?
This chapter is your working method for classifying an AI system under the EU AI Act and producing a record you can defend in an audit, procurement review, or internal risk committee. The goal is not only to label a system as prohibited, high-risk, limited-risk, or minimal-risk, but to show how you reached that conclusion: what you assumed, what evidence you checked, where the boundaries are (what is “in scope” vs. “out of scope”), and who must act (provider, deployer, importer, distributor, product manufacturer).
In practice, classification is a sequence of gates. First you screen for prohibited practices and document the rationale (Milestone: prohibited screening). If you pass, you run a high-risk decision tree and record outcomes (Milestone: high-risk decision tree). If not high-risk, you assess whether transparency duties apply (Milestone: limited-risk transparency). Finally, you produce a signed-off risk classification memo that captures the decision, evidence links, and approvals (Milestone: signed memo).
Engineering judgement matters. Many systems are “almost” high-risk because they are used in a high-impact context, integrated into a regulated product, or influence a decision without being the final decision-maker. Common mistakes include: classifying based on the vendor’s marketing name rather than intended purpose; ignoring downstream integration; forgetting that “who is the provider” can shift when a deployer makes a substantial modification; and treating transparency notices as optional UX copy rather than compliance artifacts tied to user instructions and human oversight.
Use the six sections below as a repeatable workflow. Treat your classification record like a technical artifact: versioned, evidence-linked, and signed by accountable roles.
Practice note for Milestone: Screen for prohibited practices and document the rationale: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone: Run the high-risk decision tree and record outcomes: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone: Classify transparency duties for limited-risk systems: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone: Produce a signed-off risk classification memo: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Your first gate is to screen for prohibited practices. This is not a “quick sanity check”; it is a documentable decision. The EU AI Act bans certain uses outright, so your compliance posture starts by proving you are not building, supplying, or deploying one of those uses.
Work from the system’s intended purpose and real deployment context, not from model type. A generic classifier can become prohibited if deployed to manipulate vulnerable groups or to enable covert scoring. Your screening deliverable should be a short rationale with evidence links: product requirements, user stories, screenshots, contracts, and deployment policies.
Milestone: Screen for prohibited practices and document the rationale. Your output is a one-page “Prohibited Practices Screen” in your risk classification record: each prohibited bucket, “Applicable? (Y/N)”, rationale, and evidence links. Common mistake: writing “Not applicable” without specifying the operational boundary (e.g., “No use in public spaces; no law-enforcement customers; contract prohibits use for identification”). Make your boundary explicit.
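If it helps to make the screen concrete, the sketch below shows one illustrative row of a Prohibited Practices Screen as a structured record; the field names are assumptions, and the prohibited buckets themselves must come from the Act.

```python
from dataclasses import dataclass

@dataclass
class ProhibitedScreenRow:
    """One row of the Prohibited Practices Screen (illustrative structure)."""
    practice: str             # the prohibited bucket being checked
    applicable: bool
    rationale: str            # why it does or does not apply, with the boundary stated
    evidence_links: list[str]

screen = [
    ProhibitedScreenRow(
        practice="biometric identification in publicly accessible spaces",
        applicable=False,
        rationale=("No use in public spaces; no law-enforcement customers; "
                   "contract clause 4.2 prohibits use for identification."),
        evidence_links=["contracts/msa-v3.pdf#clause-4.2", "docs/deployment-policy.md"],
    ),
]

# A screen with rationale but no evidence links is the most common audit gap.
assert all(row.evidence_links for row in screen)
```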
If the system is not prohibited, you run the high-risk decision tree. High-risk classification commonly comes from two practical routes: (1) the AI is a safety component of a regulated product (or itself a regulated product), or (2) the AI is used in a listed high-impact domain (the “Annex-style” use cases such as employment, education, essential services, law enforcement, migration, or administration of justice).
Start with a mapping worksheet that ties intended purpose → decision influenced → domain → user group → impact. Then evaluate triggers. For example, an AI that screens job applicants, ranks candidates, or predicts performance is typically in the employment domain even if the system is branded as “productivity analytics.” Likewise, an AI that determines eligibility for credit or housing can fall into essential services.
Milestone: Run the high-risk decision tree and record outcomes. Your record should include: the branch taken, the trigger (domain/product), the exact system function in that context, and evidence (process diagrams, SOPs, integration architecture). Common mistakes: assuming “human in the loop” prevents high-risk classification (it does not), and under-describing influence (e.g., “advisory only” when the UI presents a single recommended action). Write the operational truth.
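A deliberately simplified sketch of the two routes, written as code so the branch taken is recorded alongside the answer, is shown below. The domain list and logic are illustrative and incomplete; the real determination must follow the Act’s annexes and current guidance.

```python
# Simplified sketch of the two practical high-risk routes described above.
HIGH_IMPACT_DOMAINS = {
    "employment", "education", "essential_services",
    "law_enforcement", "migration", "justice",
}

def high_risk_screen(is_safety_component: bool,
                     domain: str,
                     influences_decision: bool) -> dict:
    """Record the branch taken, not just the yes/no answer."""
    if is_safety_component:
        return {"high_risk": True, "branch": "safety component of regulated product"}
    if domain in HIGH_IMPACT_DOMAINS and influences_decision:
        return {"high_risk": True, "branch": f"listed high-impact domain: {domain}"}
    return {"high_risk": False, "branch": "no trigger identified (document why)"}

# "Productivity analytics" that ranks job candidates still lands in employment.
print(high_risk_screen(False, "employment", influences_decision=True))
```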
Practical outcome: a clear yes/no high-risk determination plus the list of compliance obligations you will need next (technical documentation structure, risk management, data governance, logging, transparency, human oversight, accuracy/robustness, post-market monitoring). Even if you are not writing the full high-risk file yet, you should identify the missing evidence now.
Many teams build on general-purpose AI (GPAI) models or offer capabilities that are reused across multiple products. Classification must account for downstream integration: a base model may not itself be deployed into a high-impact decision, but a downstream system might be.
Operationally, treat your system as a chain: data → base model → adapters/fine-tunes → orchestration/prompting → tools → UI → business process. Determine who is the provider at each step. A vendor providing a base model may be the provider of the model, while your organization becomes the provider of the integrated system if you place it on the market or put it into service under your name. If you make a substantial modification (e.g., change intended purpose, materially affect compliance characteristics, or materially change performance in a regulated context), responsibilities can shift.
A common mistake is to classify only the “model” and ignore the “system.” Under the EU AI Act approach, the relevant object is typically the AI system as deployed for an intended purpose. Your risk classification record should therefore include a downstream-use statement: “This classification applies to deployment in X process for Y users; if integrated into Z domain (e.g., hiring), classification must be re-run.” This is essential for teams who ship platforms and APIs.
If the system is not prohibited and not high-risk, it may still have transparency duties. Limited-risk obligations are frequently triggered by how the system interacts with people: users must be informed they are interacting with AI in certain contexts, synthetic content may need labeling, and outputs that could be mistaken for authentic content may require disclosures.
Milestone: Classify transparency duties for limited-risk systems. Do this as a set of concrete UX and documentation requirements, not abstract legal notes. Start by listing every user interaction surface: chat UI, email generation, call center scripts, image/video generation, voice output, and API responses. Then specify what notice appears, to whom, when, and how it is logged.
Common mistakes: burying notices in Terms of Service; using vague language (“powered by AI”) without stating what the system does; and failing to align notices with actual behavior (e.g., the system drafts a denial message that looks final, while policy says it’s only a draft). Practical outcome: a “Transparency Obligations Table” in your record: trigger, notice text, placement, owners, and evidence (mockups, screenshots, localization plan).
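As an illustration, one row of such a table could be captured like this; the notice text and placement are example values, not required wording.

```python
from dataclasses import dataclass

@dataclass
class TransparencyObligation:
    """One row of the Transparency Obligations Table (illustrative fields)."""
    surface: str          # chat UI, email draft, voice output, API response, ...
    trigger: str          # why a notice is required here
    notice_text: str      # the exact wording shown to the user
    placement: str        # where and when the notice appears
    owner: str
    evidence: list[str]   # mockups, screenshots, localization plan

row = TransparencyObligation(
    surface="customer support chat UI",
    trigger="user interacts directly with an AI system",
    notice_text="You are chatting with an AI assistant. A human agent can take over at any time.",
    placement="persistent banner above the chat input, shown before the first message",
    owner="product-design",
    evidence=["figma/chat-banner-v4", "screenshots/chat-banner-prod.png"],
)
```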
Minimal-risk systems still benefit from disciplined documentation because classification can change as scope expands. The best teams treat minimal-risk as “low regulatory burden,” not “no governance.” Your goal is to keep a lightweight, evidence-ready package that can scale if the system moves into a high-impact domain.
Adopt a voluntary documentation set aligned with EU AI Act-style technical documentation structure, but sized to your system. A practical minimal set includes: intended purpose statement, system architecture diagram, data sources and licenses, evaluation summary, known limitations, user instructions, and an incident/reporting pathway. This also supports procurement, security reviews, and customer trust.
Common mistake: skipping evaluation because “it’s only internal.” Internal deployments can still cause harm or become high-risk if used for employment decisions, access control, or customer eligibility. Practical outcome: a minimal-risk dossier you can reuse when you later draft full technical documentation and an evidence index.
Classification is rarely binary on the first pass. You will face ambiguity: mixed-use platforms, customers who can configure workflows, and systems that sit adjacent to regulated decisions. The correct response is not to guess; it is to document uncertainty, define assumptions, and escalate to the right governance forum.
Build your risk classification record as a set of testable statements. Example: “Assumption A1: Outputs are not used to make final hiring decisions; they are used only to draft interview questions.” Then define how A1 is enforced (permissions, product UI, contracts, training, monitoring). If you cannot enforce an assumption, it is not an assumption—it is a risk.
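A small sketch of an assumption captured as a testable, enforced statement follows; the field names and controls are illustrative.

```python
from dataclasses import dataclass

@dataclass
class ScopedAssumption:
    """A classification assumption plus the controls that enforce it."""
    assumption_id: str
    statement: str
    enforcement: list[str]   # technical, contractual, or process controls
    monitored_by: str        # the signal that would show the assumption is violated

a1 = ScopedAssumption(
    assumption_id="A1",
    statement=("Outputs are not used to make final hiring decisions; "
               "they are used only to draft interview questions."),
    enforcement=["UI exposes draft questions only, no candidate scores",
                 "contract clause restricting use to interview preparation",
                 "role-based access limited to recruiters"],
    monitored_by="quarterly deployer attestation plus usage log review",
)

# If the enforcement list is empty, this is a risk to manage, not an assumption.
assert a1.enforcement
```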
Milestone: Produce a signed-off risk classification memo. This memo should include: final risk category, rationale, evidence links, assumptions and enforcement controls, open questions, and named approvers (product owner, engineering lead, compliance/legal, and—where relevant—deploying business owner). Common mistake: treating the memo as static. Re-run classification upon major changes: new markets, new customer segments, new integrations, retraining/fine-tuning, or expanded intended purpose.
1. Which sequence best matches the chapter’s recommended “gates” for classifying an AI system under the EU AI Act?
2. Beyond assigning a risk label (prohibited/high/limited/minimal), what must the classification record demonstrate to be defensible in an audit or review?
3. Which situation is highlighted as a reason a system can be “almost” high-risk and requires careful engineering judgment?
4. Which is identified as a common mistake when classifying an AI system’s risk category?
5. How does the chapter advise treating transparency notices for limited-risk systems?
In Chapters 1–2 you classified the use case and drafted the outline of technical documentation. This chapter turns that classification into an execution plan: a control framework that tells you what you must do, who must do it, when it must happen in the lifecycle, and what evidence proves it happened.
The EU AI Act is obligation-heavy by design, and most teams fail not because they disagree with the obligations, but because they never translate them into an engineerable workflow. The goal here is to convert obligations into a control checklist with owners, define a minimal but credible quality management and change control workflow, create a traceability matrix from requirements to evidence, draft a gap analysis and remediation plan, and finally prepare an internal review packet for sign-off.
Think of your compliance plan as a “control plane” over your product lifecycle: requirements flow into controls; controls flow into procedures; procedures generate evidence; evidence supports technical documentation and sign-off. If you build that pipeline early, audits become a retrieval task instead of a fire drill.
Use the sections below as building blocks. Each section ends with practical outcomes you can lift directly into your lab deliverables.
Practice note for Milestone: Convert obligations into a control checklist with owners: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone: Define your quality management and change control workflow: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone: Create a traceability matrix from requirements to evidence: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone: Draft a gap analysis and remediation plan: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone: Prepare an internal review packet for sign-off: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
The EU AI Act responsibilities change sharply depending on your role. Start your compliance plan by mapping the real-world actors to statutory roles: provider (develops or places on the market), deployer (uses under its authority), and sometimes importer, distributor, or product manufacturer. Do not assume “we are the provider” just because you built a model; if you integrate a third-party model into a product and place the system on the market under your name, you may still be the provider of the system.
Convert this role mapping into a control checklist with owners. A practical way is to create a table where each obligation (risk management, data governance, technical documentation, logging, transparency, human oversight, post-market monitoring) is a row, and each role is a column. Then assign a single accountable owner (a person or team) for each control in your organization, even if multiple teams contribute.
Engineering judgement matters when a control seems “shared.” Example: if your customer (deployer) can change prompts or decision thresholds, you as provider should specify allowed configuration ranges and test boundaries; the deployer should document the actual chosen configuration and operational checks. A common mistake is leaving configurability ungoverned and later discovering the deployer’s configuration invalidates your performance claims.
Practical outcome: produce a one-page role map plus an obligations-by-role checklist with an owner, a frequency/trigger (e.g., “before release,” “each model update,” “quarterly”), and the evidence artifact (link placeholder) for each row. This becomes the spine of your compliance plan and feeds later sections on traceability and internal sign-off.
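A minimal sketch of such checklist rows, assuming simple string identifiers for owners and evidence locations, might look like this.

```python
from dataclasses import dataclass

@dataclass
class ControlRow:
    """One row of the obligations-by-role control checklist (illustrative)."""
    obligation: str
    responsible_role: str    # provider, deployer, importer, distributor, manufacturer
    internal_owner: str      # one accountable person or team
    trigger: str             # "before release", "each model update", "quarterly", ...
    evidence_artifact: str   # link placeholder to the proving artifact

checklist = [
    ControlRow("risk management", "provider", "ml-platform-lead",
               "each model update", "grc://risk-register"),
    ControlRow("logging", "provider + deployer", "sre-team",
               "continuous", "grc://logging-config"),
    ControlRow("human oversight", "deployer", "customer-success",
               "before go-live per customer", "grc://oversight-checklist"),
]
```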
A Quality Management System (QMS) can sound heavyweight, but a minimal, credible set is enough if it is specific, used, and evidenced. For AI, the QMS must cover not only software quality but also data and model lifecycle. Your aim is to define a workflow that turns obligations into repeatable practice—then make it easy to follow.
Start with five “minimum viable QMS” procedures, each with a template and a routing rule: for example, document and evidence control, change control, risk management, data and model lifecycle management, and incident handling with corrective action.
Integrate change control as the operational heartbeat of the QMS (expanded in Section 3.4). The best practice is to define “gates” aligned with your SDLC/ML lifecycle: design review, pre-release evaluation, release approval, and post-release monitoring review. These gates generate the evidence you will later index in the technical documentation.
Common mistake: writing a QMS that mirrors generic ISO language without AI specifics. For instance, a software-only change procedure won’t capture model retraining, dataset refreshes, prompt template updates, or evaluation drift. Another mistake: having no explicit owner for QMS artifacts; if “the company” owns it, no one does.
Practical outcome: define a minimal QMS workflow diagram (even a simple box-and-arrow) and a RACI for the five procedures. Then draft your first “internal review packet” outline: what documents are required at release gate (risk report, evaluation report, instructions, logging plan, evidence index) and who signs each.
Risk management is where compliance becomes product thinking. The EU AI Act expects a continuous risk management process, not a one-time checklist. Your job is to define a routine that is specific to your intended purpose and operating context, and that produces traceable outputs.
Use a structured chain: hazard → hazardous situation → harm → affected stakeholder. A hazard is a source of potential harm (e.g., hallucinated medical advice, biased scoring, data leakage). The hazardous situation is how the hazard manifests in context (e.g., a user treats a generated answer as clinical instruction). The harm is the consequence (injury, discrimination, financial loss, rights infringement).
Score each risk using likelihood and severity, but define what those terms mean in your domain. Teams often copy a 1–5 matrix without calibrating it. Calibrate with examples: “Severity 5 = irreversible harm or major rights impact,” “Likelihood 5 = expected weekly in typical usage.” If you cannot justify a score, treat it as unknown and plan data collection.
This is where you create a traceability matrix: each identified risk links to (1) a control, (2) an implementation artifact (design doc, code module, configuration), and (3) an evaluation artifact (test plan, results). Do not let mitigations live only in narrative form—make them a set of testable requirements.
Common mistake: listing only model-centric hazards. Many real harms come from workflow: poorly designed UI, unclear instructions, or incentives that push users to misuse the system. Another mistake: not updating the risk register after changes or incidents. Risk management must be tied to change control and post-market signals.
Practical outcome: establish a risk register template with columns for hazard chain, scores, mitigations, verification evidence links, residual risk, and monitoring signals. Then add a monthly/quarterly review cadence and triggers (new release, supplier change, incident, drift).
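The sketch below shows one illustrative risk register entry with the hazard chain, calibrated scores, and evidence links in a single record; the anchor definitions and values are examples, not a recommended calibration.

```python
from dataclasses import dataclass

# Calibration is domain-specific; these anchors are illustrative examples only.
SEVERITY_ANCHORS = {5: "irreversible harm or major rights impact", 1: "negligible"}
LIKELIHOOD_ANCHORS = {5: "expected weekly in typical usage", 1: "rare, < once a year"}

@dataclass
class RiskEntry:
    hazard: str
    hazardous_situation: str
    harm: str
    affected_stakeholder: str
    severity: int            # 1-5, per calibrated anchors
    likelihood: int          # 1-5, per calibrated anchors
    mitigations: list[str]
    verification_evidence: list[str]
    residual_risk: str
    monitoring_signal: str

r = RiskEntry(
    hazard="hallucinated medical advice",
    hazardous_situation="user treats a generated answer as clinical instruction",
    harm="delayed or wrong treatment",
    affected_stakeholder="end user",
    severity=5, likelihood=2,
    mitigations=["medical-topic refusal filter", "prominent 'not medical advice' notice"],
    verification_evidence=["eval/refusal-suite-v3", "screenshots/notice-placement"],
    residual_risk="low, per the refusal-suite results linked above",
    monitoring_signal="rate of medical-topic prompts reaching the model",
)
```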
AI systems change in more ways than traditional software. Your change management process must cover: model version changes, prompt template changes, retrieval corpus updates, feature engineering changes, data pipeline modifications, threshold and routing changes, and even user instruction updates. Treat each as a potentially material change to performance and risk.
Define a change taxonomy with three tiers (for example, minor, material, and major), each mapped to the depth of impact assessment, re-evaluation, and approval it requires.
Then create a change control workflow: request → impact assessment → required tests → approvals → deployment → monitoring → closure. The impact assessment should explicitly ask: Does this change alter the intended purpose? Does it change input data assumptions? Does it affect any documented limitations? Does it require updated instructions for use?
Rollbacks are part of compliance. If you cannot quickly roll back a model or prompt configuration, you cannot credibly claim ongoing risk control. Maintain a rollback plan with: last-known-good version, deployment toggles, data migration considerations, and a communication plan to deployers/users when behavior changes.
Common mistakes include: updating prompts in production “quietly,” not versioning retrieval data, and not rerunning evaluations when the environment shifts (e.g., new user population, new language, seasonal data). Another frequent gap is failing to connect change tickets to evidence updates—your technical documentation index becomes stale.
Practical outcome: implement change tickets that automatically require links to the risk register items affected and the evidence artifacts produced. Add explicit sign-off gates for material/major changes, and define monitoring “watch metrics” to confirm the change behaves as expected post-release (error rates, drift indicators, complaint types).
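One way to make that linkage mechanical is to model the change ticket so it cannot be closed without evidence and approvals; the sketch below is illustrative, with assumed tier names and IDs.

```python
from dataclasses import dataclass, field

@dataclass
class ChangeTicket:
    """Change request that cannot be closed without evidence links (illustrative)."""
    change_id: str
    description: str
    tier: str                      # e.g. minor / material / major
    affected_risks: list[str]      # risk register IDs touched by this change
    required_tests: list[str]
    evidence_produced: list[str] = field(default_factory=list)
    approved_by: list[str] = field(default_factory=list)

    def can_close(self) -> bool:
        # A change is closed only when tests produced evidence and sign-off exists.
        return bool(self.evidence_produced) and bool(self.approved_by)

ticket = ChangeTicket(
    change_id="CHG-2026-041",
    description="Update prompt template for refusal behavior",
    tier="material",
    affected_risks=["RISK-007"],
    required_tests=["eval/refusal-suite", "eval/regression-core"],
)
assert not ticket.can_close()   # blocked until evidence and approvals are attached
```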
Most AI systems are composites: a foundation model API, open-source libraries, hosted vector databases, labeling vendors, or external datasets. Under the EU AI Act, you cannot outsource accountability; you must manage supplier risk. Your control framework should identify critical suppliers—components whose failure could cause harm, compliance failure, or inability to provide evidence.
Start by building a supplier inventory with: component name, purpose, data flows, where processing occurs, versioning method, SLAs, and substitution options. Then apply controls tiered by criticality: the more critical the component, the deeper the due diligence, the tighter the contractual and usage constraints, and the more frequent the re-evaluation.
Engineering judgement is required when the supplier is a black box. If you cannot see training data or internal evaluations, compensate with stronger external testing, tighter constraints on use, narrower intended purpose, and more robust monitoring. Document these compensating controls explicitly; auditors are looking for reasoned decisions, not perfect visibility.
Common mistake: treating third-party model updates as “their problem.” If the provider changes behavior, your system’s risk profile can change overnight. Another mistake is failing to record which supplier version was in use for a given decision—without that, incident investigation and CAPA become guesswork.
Practical outcome: add supplier controls to your control checklist with clear owners (e.g., Vendor Manager + ML Lead), define a supplier change trigger that feeds your change management workflow, and create a standard onboarding packet that produces evidence (due diligence notes, test results, approved use constraints).
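A minimal supplier inventory entry might be captured as follows; the fields mirror the list above and the values are illustrative.

```python
from dataclasses import dataclass

@dataclass
class SupplierRecord:
    """Supplier inventory entry feeding the change and evidence workflows."""
    component: str
    purpose: str
    data_flows: str
    processing_location: str
    version_pinning: str        # how the exact supplier version in use is recorded
    sla: str
    substitution_option: str
    criticality: str            # e.g. "critical" triggers deeper compensating controls

vendor = SupplierRecord(
    component="hosted foundation model API",
    purpose="text generation for draft responses",
    data_flows="prompts and retrieved context sent; no customer PII permitted",
    processing_location="EU region per contract",
    version_pinning="model version recorded per request in the decision log",
    sla="99.9% availability, 30-day deprecation notice",
    substitution_option="fallback to a smaller self-hosted model with reduced features",
    criticality="critical",
)
```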
Compliance lives or dies by evidence. An evidence register is not a folder of PDFs; it is a curated index that maps each obligation and each control to a versioned artifact and an approval record. This section is where you operationalize “evidence-ready” technical documentation.
Define your evidence register with the same three layers introduced in Chapter 1: claims or requirements (what you say is true), controls (the processes and measures that make it true), and artifacts (the versioned outputs that prove the controls ran).
Versioning rules should be explicit. At minimum: every artifact has a unique ID, semantic version, date, owner, and status (draft/in review/approved/retired). For artifacts tied to releases, also record the product/model version and deployment environment. Store immutable snapshots for approved versions; do not rely on “latest” links.
Approvals should mirror your QMS gates. Define what requires sign-off (e.g., risk report, evaluation results, instructions for use, post-release monitoring plan), who signs (product owner, ML lead, compliance/legal, security/privacy where applicable), and what constitutes acceptance. This is how you prepare an internal review packet for sign-off: a consistent bundle of artifacts, each with its evidence register entry, ready to approve.
Now perform a gap analysis and remediation plan. Compare your current artifacts against the evidence register and mark each item as: available, incomplete, missing, or not applicable (with justification). For each gap, assign an owner, a remediation task, a due date, and the test/evidence that will close it. Common mistakes include calling items “not applicable” without a scope rationale, and “closing” gaps with plans instead of results.
Practical outcome: create your evidence register spreadsheet (or GRC tool equivalent) and populate it with at least one complete end-to-end trace: requirement → control → artifact → approval. When you can trace a single obligation all the way to approved evidence, you have a working compliance machine—not just documents.
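The sketch below shows what one such end-to-end trace could look like as data, assuming simple string IDs; real registers usually live in a spreadsheet or GRC tool.

```python
# Minimal sketch of one end-to-end trace in the evidence register.
trace = {
    "requirement": "REQ-DATA-GOV-04: training data provenance is documented",
    "control":     "CTL-011: dataset register maintained and reviewed at the data gate",
    "artifact":    "AIAC-DATA-03-DatasetRegister.xlsx (v2.1, approved snapshot)",
    "approval":    "sign-off: data-governance-lead, 2026-03-02, ticket GRC-318",
}

def trace_complete(t: dict) -> bool:
    """A working compliance machine can produce all four links for any requirement."""
    return all(t.get(k) for k in ("requirement", "control", "artifact", "approval"))

assert trace_complete(trace)
```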
1. What is the primary purpose of Chapter 3 in the course workflow?
2. Which sequence best represents the “control plane” pipeline described in the chapter?
3. Why do teams commonly fail at EU AI Act compliance, according to the chapter?
4. What is the role of a traceability matrix in the Chapter 3 framework?
5. What outcome best matches the chapter’s goal of making audits “a retrieval task instead of a fire drill”?
This chapter turns your classification work into provider-grade technical documentation: a technical file that can survive scrutiny by a notified body, a regulator, and your own incident response team months later. The goal is not to produce “pretty” narrative text; the goal is to produce an evidence-ready record that explains what you built, why it is allowed, what it is intended to do, how it can fail, and how you will detect and control those failures. Think of the technical documentation as a map: it lets an independent reader trace from an AI Act obligation to a concrete control, then to an artifact (policy, dataset lineage, evaluation report, logging configuration), and finally to a named owner and a date.
Provider-style documentation is different from a generic product spec. It must be bounded: clear intended purpose, defined target users, known operating environment, and explicit “out of scope” use. It must be reproducible: dataset versions, model versions, configuration, and test procedures must be identifiable. It must be auditable: evidence links, decision records, and change history. In this chapter you will complete five practical milestones: write the system description and intended purpose; document data governance and dataset lineage; capture model development, evaluation, and performance evidence; produce the documentation index with cross-references; and run a completeness check against your control checklist.
The engineering judgment here is in choosing the right level of detail. Too little detail makes the file un-auditable. Too much detail (like dumping raw training data or internal secrets) creates security and privacy risk and becomes unmaintainable. The safe middle is: document what a competent third party needs to assess compliance and safety, and provide controlled references to sensitive artifacts (with access controls) rather than embedding them directly.
As you draft, keep one discipline: every claim needs an artifact. If you write “the model is robust,” point to robustness tests and acceptance criteria. If you write “human oversight is provided,” point to UI designs, training materials, and escalation procedures. The rest of this chapter gives you a structured template and the common mistakes to avoid.
Practice note for Milestone: Write the system description and intended purpose section: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone: Document data governance and dataset lineage: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone: Capture model development, evaluation, and performance evidence: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone: Produce the technical documentation index and cross-references: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone: Run a completeness check against your control checklist: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
The technical file should read like a controlled dossier, not a wiki. Start with an index that mirrors your control checklist: system description and intended purpose, role mapping, risk classification record, data governance, model lifecycle evidence, human oversight and user instructions, robustness/cybersecurity, logging/traceability, and post-market monitoring hooks. Treat this as the milestone where you “produce the technical documentation index and cross-references.” Build it as a table with: document ID, title, version, owner, storage location, and which AI Act requirement(s) it supports.
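A minimal sketch of index rows as structured data follows; document IDs and requirement names are illustrative placeholders.

```python
from dataclasses import dataclass

@dataclass
class TechFileIndexRow:
    """One row of the technical documentation index (illustrative fields)."""
    doc_id: str
    title: str
    version: str
    owner: str
    location: str
    supports_requirements: list[str]   # which AI Act requirements this document covers

index = [
    TechFileIndexRow("AIAC-TECHDOC-01", "System description and intended purpose",
                     "v1.2", "product-owner", "grc://techfile/01",
                     ["technical documentation", "transparency"]),
    TechFileIndexRow("AIAC-TECHDOC-07", "Logging and traceability design",
                     "v0.9", "platform-team", "grc://techfile/07",
                     ["record-keeping", "post-market monitoring"]),
]

# An index row with no supported requirement is a hint that the file belongs in
# product documentation rather than the technical file.
assert all(row.supports_requirements for row in index)
```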
What goes in: concise descriptions, process summaries, acceptance criteria, test results, and decision records. Include diagrams (architecture, data lineage, model pipeline), configuration manifests (model version, dependencies), and references to controlled repositories (e.g., an internal GRC tool, ticketing system, or model registry). What stays out: raw personal data, full training corpora, secrets (API keys), exploit details, and anything that would materially increase security risk if leaked. Instead, include hashes, dataset IDs, and access-controlled links.
Use a two-layer pattern: (1) a narrative “front matter” that makes the intended purpose and boundaries unmissable, and (2) an evidence layer with artifacts. For the milestone “write the system description and intended purpose,” keep the main text crisp: intended purpose, target users, deployment setting, automation level, and prohibited uses. Common mistake: describing capabilities instead of intent (e.g., “can rank applicants” without stating whether it is intended to rank applicants). Regulators care about intended purpose, not what the model could be repurposed to do.
Practical outcome: by the end of this section you should have a document tree and an index that can be handed to an auditor, where each folder and file has a reason to exist. If a file doesn’t support a requirement, an assumption, or a risk control, it likely belongs in product documentation—not the technical file.
Architecture documentation is where you make the system legible. Start with a component diagram that distinguishes: user interface (or API), orchestration layer, model serving, data stores, feature pipelines, monitoring/logging, and human oversight touchpoints. Then add interface contracts: what data enters, what outputs leave, and what guardrails sit on those boundaries (validation, rate limiting, content filtering, policy checks). This section supports the milestone “write the system description and intended purpose” because architecture must align to what you claim the system is for and how it is used.
Document each component with four fields: purpose, inputs/outputs, failure modes, and owners. For example, an “ingestion service” might accept CSV uploads; failure modes include schema drift and missing consent metadata; owner is the data engineering team. Include deployment topology (cloud region, on-prem nodes, mobile edge) because it affects privacy, cybersecurity, and logging retention. If you rely on third-party models or APIs, document them as dependencies with versioning and contractual constraints; don’t hide them behind “vendor service.”
Interfaces are where most compliance gaps occur. Typical mistakes include: not specifying whether prompts, uploaded documents, or telemetry are used for retraining; not documenting where personal data is stored; and not describing human-in-the-loop decisions (who can override, when, and how). Be explicit about the automation boundary: what the system recommends vs. what it decides. If the deployer can configure thresholds or policies, list every configurable parameter that changes risk (e.g., decision threshold, class labels, confidence cutoffs, escalation routes) and how changes are governed.
Practical outcome: a reviewer should be able to trace from an input (e.g., a user record) through transformations to a model output and onward to a human decision, with clear control points. If you cannot trace the path, you cannot credibly claim traceability, oversight, or data governance later.
This section is your data story: where data came from, what it contains, why it is appropriate, and how it is controlled. It directly covers the milestone “document data governance and dataset lineage.” Start with dataset lineage tables: dataset ID, source, collection window, legal basis/permissions, preprocessing steps, labeling method, quality checks, and retention rules. Include a diagram showing how raw sources become training/validation/test sets, and how each split is versioned and frozen.
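One illustrative lineage row, with assumed field names and example values, might look like this.

```python
from dataclasses import dataclass

@dataclass
class DatasetLineageRow:
    """One dataset lineage entry (illustrative; field names are assumptions)."""
    dataset_id: str
    source: str
    collection_window: str
    legal_basis: str
    preprocessing: list[str]
    labeling_method: str
    quality_checks: list[str]
    retention_rule: str
    split_versions: dict   # frozen train/validation/test versions

row = DatasetLineageRow(
    dataset_id="DS-CV-2025Q4",
    source="customer-provided CVs under a data processing agreement",
    collection_window="2025-01-01 to 2025-09-30",
    legal_basis="contract + DPA clause 7; no special category data permitted",
    preprocessing=["PII pseudonymization", "language filter: en, de", "deduplication"],
    labeling_method="two raters per item, adjudication by a senior recruiter",
    quality_checks=["inter-rater agreement threshold", "schema validation"],
    retention_rule="delete raw uploads after 30 days; keep derived features 24 months",
    split_versions={"train": "v3", "validation": "v3", "test": "v3-frozen"},
)
```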
Sourcing: document whether data is first-party, customer-provided, scraped, purchased, or synthetic. For each source, record the usage rights and constraints (e.g., “no model training,” “only internal testing,” “delete after 30 days”). Labeling: explain the label ontology, instructions, rater qualifications, inter-rater agreement metrics, and adjudication process. Common mistake: treating labels as ground truth without documenting ambiguity. If labels are subjective (toxicity, suitability, risk), write down how disagreement is resolved and how uncertainty is represented.
Representativeness: define the target population and compare it to your dataset. You do not need perfect representativeness, but you must show awareness of gaps and mitigations. Provide slice definitions relevant to your intended purpose (e.g., language, region, device type, job category) and evidence that you checked performance and error rates across slices. Privacy: record data minimization decisions, PII handling, de-identification/pseudonymization steps, access controls, and DPIA/legitimate interest assessments where applicable. If personal data is used, describe how you prevent re-identification and how you honor deletion requests.
Practical outcome: a reader should be able to answer “What data trained this model?” without guessing, and also answer “Should this data have been used?” without hunting for legal or policy artifacts.
Model lifecycle documentation is where you “capture model development, evaluation, and performance evidence.” Treat the lifecycle as a controlled pipeline: design → training → tuning → validation → release → change management. For each stage, document inputs, outputs, gates, and sign-offs. A practical format is a release record per model version: training code commit, environment (libraries, hardware), dataset versions, hyperparameters, evaluation suite version, and approval ticket.
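A minimal sketch of such a release record, assuming a hypothetical model name, ticket ID, and acceptance criteria; the point is that every field is pinned to a version or an artifact rather than described loosely.

```python
release_record = {
    "model_version": "risk-scorer 1.4.2",
    "training_code_commit": "<git SHA>",
    "environment": {"python": "3.11", "scikit-learn": "1.4", "hardware": "1x GPU node"},
    "dataset_versions": {"train": "DS-2024-007/v3", "test": "DS-2024-007/v3-test"},
    "hyperparameters": {"max_depth": 6, "learning_rate": 0.1},
    "evaluation_suite": "eval-suite v5",
    "acceptance_criteria": {"min_recall": 0.85, "max_calibration_error": 0.05},
    "approval_ticket": "GOV-1321",
}
```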
Training and tuning: specify model type, objective function, feature set or prompting strategy, and any constraints (fairness regularization, calibration, safety filters). Record why key decisions were made; auditors look for reasoned trade-offs, not just numbers. Testing and validation: define metrics aligned to the intended purpose (accuracy, F1, calibration error, false positive/negative costs) and include confidence intervals or repeated runs where variability matters. Add decision thresholds and how they were chosen (e.g., cost-based optimization, policy-based minimum recall). If you use foundation models, document adaptation method (prompting, RAG, fine-tuning), safety evaluations, and limitations.
Include negative testing: adversarial prompts, out-of-distribution inputs, and stress tests. Common mistakes: evaluating only on a “clean” test set; not freezing the test set; or reporting aggregate metrics while hiding failure clusters. Also document human oversight integration tests: does the UI present uncertainty, reasons, and escalation options? If humans can override, is it logged and fed into monitoring (without silently becoming training data)?
Practical outcome: you can reproduce results for a given model version and demonstrate that release decisions were gated by pre-defined acceptance criteria rather than informal approval.
This section translates “the system is safe” into concrete failure modes and defenses. Start with a failure mode catalog (often an FMEA-style table): failure mode, cause, effect, severity, likelihood, detectability, and mitigations. Tie mitigations to architecture controls (input validation, sandboxing), model controls (confidence thresholds, refusal behavior), and process controls (incident response, patching SLAs). If the AI system supports a high-stakes workflow, include fallbacks: manual processing, graceful degradation, or safe defaults.
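If you follow the common FMEA convention of a risk priority number (severity × likelihood × detectability, each on a 1-10 scale), a single catalog row might look like the sketch below. The scores and mitigation names are illustrative, and the Act does not mandate this particular scoring scheme.

```python
def risk_priority(severity: int, likelihood: int, detectability: int) -> int:
    """Classic FMEA risk priority number on 1-10 scales; higher means more urgent."""
    return severity * likelihood * detectability

failure_mode = {
    "failure_mode": "confident score on out-of-distribution input",
    "cause": "distribution shift after a new document format appears",
    "effect": "operator accepts an unreliable recommendation",
    "severity": 7, "likelihood": 4, "detectability": 5,
    "mitigations": ["OOD check on input features", "confidence threshold with escalation route"],
}
failure_mode["rpn"] = risk_priority(7, 4, 5)   # 140: prioritise mitigation and re-test
```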
Robustness evidence should include: distribution shift checks, perturbation tests (noise, missing fields, formatting changes), and resilience to data quality issues. Cybersecurity should cover: threat model, attack surfaces, dependency management, access control, secrets handling, and vulnerability monitoring. For ML-specific threats, document mitigations for prompt injection, data poisoning, model inversion, membership inference, and supply-chain compromise (malicious model artifacts). Common mistake: listing security policies without linking to system-specific controls and test results.
Also document “known limitations” plainly. A provider-style file should not oversell. If performance degrades in certain languages, or the model is not suitable for certain user groups or contexts, state it and route it into user instructions and deployer guidance. This section should connect to post-market monitoring by specifying what signals indicate degradation and what triggers a rollback or re-validation.
Practical outcome: you can show that failures were anticipated, tested where feasible, and bounded with controls—reducing both regulatory and operational risk.
Logging is the backbone of traceability, incident investigation, and regulatory defensibility. Define what events are logged, at what granularity, and for how long—then align it to privacy and minimization. At minimum, log: model/version identifier, configuration parameters that affect outputs, timestamp, input metadata (not necessarily full content), output(s), confidence/uncertainty, user action (accept/override/escalate), and system warnings or safety filter decisions. If you cannot log raw inputs due to sensitivity, log hashed fingerprints, feature summaries, or redacted excerpts with a reproducible redaction method.
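A hedged example of one logged event following the minimum fields above; identifiers and values are placeholders, and raw input content is deliberately absent.

```python
log_event = {
    "trace_id": "trace-000123",               # links this request to downstream model calls
    "timestamp": "2024-06-01T10:42:13Z",
    "model_version": "risk-scorer 1.4.2",
    "config": {"decision_threshold": 0.72},   # every parameter that affects outputs
    "input_metadata": {"doc_type": "invoice", "input_hash": "sha256:<digest>"},  # no raw content
    "output": {"label": "needs_review", "score": 0.64},
    "uncertainty": {"score_interval": [0.55, 0.73]},
    "operator_action": "escalated",
    "safety_flags": ["pii_redaction_applied"],
}
```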
Traceability means you can reconstruct “why did the system behave this way?” Create an end-to-end trace ID that links UI/API requests to downstream model calls, retrieval results (for RAG), and final responses. Maintain a record of changes: dataset updates, threshold changes, model swaps, prompt template changes, and policy/config updates. This is where you “run a completeness check against your control checklist”: verify that every required record is actually captured by your telemetry pipeline and that retention and access controls are enforced.
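The completeness check itself can be a few lines of code run against sampled telemetry: compare the fields your control checklist requires with what the pipeline actually captured. The required-field set below is illustrative.

```python
REQUIRED_FIELDS = {
    "trace_id", "timestamp", "model_version", "config",
    "input_metadata", "output", "operator_action",
}

def completeness_check(event: dict) -> set:
    """Return the required record fields that a captured event is missing."""
    return REQUIRED_FIELDS - set(event)

sample_event = {"trace_id": "trace-000124", "timestamp": "2024-06-01T10:45:02Z",
                "output": {"label": "approve"}}
print(sorted(completeness_check(sample_event)))
# -> ['config', 'input_metadata', 'model_version', 'operator_action']
```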
Common mistakes include: logging too much sensitive data; logging too little to debug; missing version identifiers; and having logs that exist but are not searchable or exportable for audits. Define operational playbooks: how to retrieve logs for an incident, who can access them, how you handle deletion requests, and how you produce an audit package.
Practical outcome: your technical file can point to specific log schemas and dashboards as evidence, and your organization can investigate complaints or anomalies without guesswork.
1. What is the primary goal of provider-style technical documentation in this chapter?
2. Which set of characteristics best distinguishes provider-style documentation from a generic product specification?
3. When deciding the level of detail to include, what “safe middle” does the chapter recommend?
4. What does the chapter mean by the discipline “every claim needs an artifact”?
5. Which activity best reflects the “map” function of the technical documentation described in the chapter?
This chapter turns your classification work into operational reality. Under the EU AI Act, it is not enough to label a system “high-risk” (or “limited-risk”) and produce a tidy technical file. You must show that people can supervise the system, understand what it is doing, intervene effectively, and use it within safe boundaries. In practice, this means designing intervention points (human oversight measures), drafting instructions that constrain use to the intended purpose, producing transparency notices, and validating that real operators can follow the guidance.
A useful mindset is: oversight, transparency, and instructions are not “documentation tasks.” They are control mechanisms. They help prevent foreseeable misuse, reduce over-reliance on model outputs, and make responsibility assignment credible across provider and deployer roles. Throughout this chapter you will draft artifacts that can be handed off to a deployer as an evidence-ready pack: an oversight plan, user instructions, disclosure artifacts, usability validation notes, and a deployment readiness checklist with clear go/no-go criteria.
The most common failure mode is writing generic policies (“a human will review outputs”) without specifying who, when, with what information, how quickly, and with what authority to override or stop the system. The second failure is writing user instructions that are correct in theory but unusable in the field: too long, too ambiguous, or missing the exact steps operators need at the moment of decision.
Practice note for Milestone “Specify human oversight measures and intervention points”: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone “Draft user instructions and operational constraints”: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone “Create transparency notices and disclosure artifacts”: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone “Validate usability: can operators follow the instructions?”: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone “Finalize a deployer handoff pack”: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Human oversight is an engineered workflow, not a slogan. Start by mapping the end-to-end decision chain: input capture → model inference → output presentation → operator action → downstream effect. Then mark intervention points where a human can (a) prevent harmful use, (b) detect model failure, or (c) limit impact. Your milestone in this section is to specify human oversight measures and intervention points with enough detail that another team could implement them.
Common human-in-the-loop (HITL) patterns include: pre-approval (model suggests, human must approve before any action), exception review (human reviews only high-risk/confidently negative/low-confidence cases), dual control (two-person sign-off for sensitive actions), and circuit breaker (operators can pause the system, revert to a safe baseline, or switch to manual processing). Select patterns based on severity, reversibility, and time-to-harm. If an outcome is hard to reverse (e.g., denial of a service, reporting to authorities), use pre-approval or dual control; if time-critical but reversible, exception review may be acceptable.
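The selection rule can be written down so it is applied consistently rather than case by case. The sketch below encodes the rule of thumb from this paragraph; the labels and branching are illustrative, not a definitive policy.

```python
def select_oversight_pattern(severity: str, reversible: bool, time_critical: bool) -> str:
    """Rule-of-thumb mapping from severity, reversibility, and time-to-harm to a HITL pattern."""
    if severity == "high" and not reversible:
        return "pre-approval or dual control"          # hard-to-reverse outcomes get a human gate
    if time_critical and reversible:
        return "exception review with circuit breaker"
    if severity == "high":
        return "pre-approval"
    return "exception review"

print(select_oversight_pattern(severity="high", reversible=False, time_critical=True))
# -> pre-approval or dual control
```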
Design explicitly against automation bias: people tend to over-trust confident outputs, especially under time pressure. Mitigations include: showing uncertainty and key contributing factors (where appropriate), presenting alternatives (top-2 options), requiring an operator rationale for high-impact decisions, and forcing “active confirmation” rather than one-click acceptance. Avoid UI patterns that nudge acceptance, such as defaulting to “approve,” hiding dissenting evidence, or using authoritative language (“the system determined…”). A practical control is to log acceptance rates and investigate teams with unusually high “accept-without-edit” behavior.
Document oversight in a short “Oversight Control Sheet”: roles (operator, supervisor), required checks, escalation thresholds, maximum time to intervene, tools available (audit view, data provenance, explanation view), and stop conditions (e.g., drift alerts, anomaly spikes, policy changes). Pair it with a simple RACI for override authority—who can block a decision, who can pause deployment, and who owns incident response.
Oversight only works when the people doing it are competent and empowered. Treat operator competence as a safety requirement: define what an operator must know, how they prove it, and how access is restricted if they are untrained. This section’s practical output is a training-and-access appendix suitable for technical documentation and for a deployer handoff pack.
Start with a role taxonomy: operator (uses outputs), supervisor (reviews escalations), admin (configures thresholds/workflows), and auditor (reads logs). For each role, specify: prerequisites (domain knowledge, legal constraints), training modules (system overview, limitations, bias/failure modes, data handling, escalation process), and competency checks (scenario-based assessment, minimum passing score, periodic re-certification). Include training frequency triggers: model updates, policy updates, or detected performance drift.
Access control should reflect the risk profile. Use least-privilege principles: operators should not be able to change thresholds that determine escalation; admins should have change control; auditors should have read-only access. Document authentication and authorization mechanisms (SSO, MFA, role-based access control), and log what matters: who accessed the system, what inputs were processed, what output was shown, what action was taken, and whether an override occurred. If your system supports “shadow mode” or limited pilots, document how access is segmented and how test outputs are prevented from affecting real decisions.
Common mistakes: “training available” without proof of completion; sharing admin credentials; and failing to train for edge cases (e.g., missing data, conflicting evidence, adversarial prompts). A practical check is to run a tabletop exercise: give operators five borderline cases and observe whether they follow escalation rules, identify limitations, and avoid over-reliance.
User instructions are where you lock the system to its intended purpose. Your milestone here is to draft user instructions and operational constraints that a deployer can actually enforce. Write them as if they will be used on a busy day by someone who did not build the model.
Begin with an “Intended Use” block: domain, user group, supported decisions, and where the system may be used (channels, jurisdictions, languages). Then add “Not Intended For” items that prevent foreseeable misuse (e.g., “not for autonomous decisions,” “not for diagnosing medical conditions,” “not for selecting candidates without human review”). Follow with “Required Inputs and Preconditions”: data freshness, minimum fields, acceptable data sources, and the steps to take when data is missing or suspected to be incorrect.
Make limitations concrete. Instead of “may be inaccurate,” state conditions: “Performance degrades on out-of-distribution documents (scanned images with heavy compression),” or “The system may underperform for minority dialects; escalate when confidence is below X or when the user disputes the result.” List known risks and their mitigations: hallucination risk, proxy discrimination risk, privacy leakage risk, and security risks (prompt injection, data poisoning). Provide operational constraints: maximum allowed automation level, mandatory review for certain categories, and prohibited prompts or data types.
Include step-by-step procedures: (1) verify input data, (2) run inference, (3) review output with supporting evidence, (4) decide/override with rationale, (5) record action, (6) escalate if thresholds met. Add “Intervention Playbooks”: what to do if the system outputs disallowed content, contradicts authoritative sources, or shows drift indicators. The most common mistake is burying these steps in prose. Use short numbered procedures and keep “what to do now” visible.
Transparency is not just a banner that says “AI is used.” It is a set of disclosures tailored to who needs to know what, when they need to know it, and in what format. Your milestone here is to create transparency notices and disclosure artifacts that can be deployed across UI, policies, and customer communications.
Draft three layers of transparency. Layer 1: Point-of-use notice (in the interface): concise disclosure that the user is viewing AI-assisted output, the output’s role (recommendation vs. decision), and a link to more detail. Layer 2: Operator disclosure: what signals the system uses at a high level, key limitations, confidence/uncertainty meaning, and how to challenge or override results. Layer 3: External/user-facing disclosure (when the AI affects individuals): an explanation of the AI’s involvement, contact channels, and how to contest outcomes where applicable.
For AI-generated content, add a “Content Provenance” artifact: how generated text/images are marked, when they are stored, and how downstream recipients are informed. If the system provides assistance (e.g., drafting, triage, summarization), specify boundaries: “assistant output is a draft; operator must verify against source records.” Avoid misleading anthropomorphic language. Use neutral phrasing: “The system generated a summary based on the provided documents.”
Engineering judgment matters in what you disclose: too little increases misuse; too much can expose security-sensitive details. A practical compromise is to disclose categories of features and data sources, not exact weights or exploit-relevant thresholds. Finally, connect transparency to logging: if the UI shows a confidence score or explanation, record what was displayed so later investigations can reconstruct what the operator saw.
Complaints and feedback are not “customer support”; they are part of post-market monitoring and continuous risk control. Build a feedback loop that captures real-world failures, routes them to accountable owners, and produces evidence for corrective actions. This section’s practical outcome is a complaint-handling workflow and a feedback schema that can be implemented with minimal ambiguity.
Define intake channels: in-product “report issue” button, email/ticketing system, and (where relevant) formal channels for affected persons. For each channel, define required fields: system version, timestamp, context, input snapshot (with privacy controls), output snapshot, operator action taken, and user-reported harm type. Classify issues into categories: factual error, bias/fairness concern, unsafe content, privacy/security incident, usability/instructions confusion, and performance drift.
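A hedged example of a single feedback record using the required fields above; the channel name, snapshot reference, and harm description are hypothetical.

```python
feedback_report = {
    "channel": "in-product report button",
    "system_version": "risk-scorer 1.4.2",
    "timestamp": "2024-06-03T08:15:00Z",
    "context": "operator reviewing a queue of flagged applications",
    "input_snapshot_ref": "redaction-store://<reference>",  # stored under privacy controls, not inline
    "output_snapshot": {"label": "reject", "score": 0.91},
    "operator_action": "override to approve",
    "issue_category": "bias/fairness concern",              # one of the categories listed above
    "reported_harm": "applicants from one region repeatedly rejected",
}
```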
Create triage SLAs based on severity and reversibility. Example: potential unlawful discrimination or safety harm gets immediate escalation to compliance and incident response; minor usability complaints go to product backlog. Link complaints to a corrective action process: reproduce, root-cause (data, model, UI, workflow), mitigate (patch, retrain, threshold change, instruction update), verify, and close with evidence. Record whether an issue indicates the instructions were unclear or whether oversight failed; this is how you improve both documentation and controls.
Common mistakes include collecting feedback without enough context to reproduce, failing to version outputs, and not informing deployers when instructions change. Your handoff pack should include a one-page “How to Report Issues” guide and a decision tree for escalation.
The final milestone is to finalize a deployer handoff pack and to validate usability: can operators follow the instructions? Deployment readiness is where you turn documents into a decision: ship, pilot with constraints, or stop until gaps are closed.
Build a deployment readiness checklist that ties directly to risk classification and controls. At minimum include: oversight measures implemented (not just described), operator training completed with records, access controls configured, transparency notices placed in the right user journeys, logging and audit trails verified, feedback/complaint workflow operational, and incident response contacts confirmed. Add technical gates: model/version pinned, evaluation metrics meet thresholds, drift monitoring configured, data governance controls in place, and rollback plan tested.
Define explicit go/no-go criteria. Examples of no-go: operators cannot correctly follow escalation steps in a usability test; override authority is unclear; key disclaimers are missing at point-of-use; audit logs do not capture operator actions; known high-severity failure mode lacks a mitigation. For conditional go (pilot), list constraints: limited user group, shadow mode, manual approval required, reduced scope, increased monitoring frequency.
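A readiness checklist is easy to make executable: every unmet item becomes a named blocker attached to the go/no-go decision. The items and outcome below are illustrative.

```python
readiness = {
    "oversight_measures_implemented": True,
    "operator_training_records_on_file": True,
    "access_controls_configured": True,
    "point_of_use_notices_in_user_journeys": False,
    "audit_logs_capture_operator_actions": True,
    "high_severity_failure_modes_mitigated": True,
    "rollback_plan_tested": True,
}

blockers = [item for item, done in readiness.items() if not done]
decision = "go" if not blockers else "no-go or conditional pilot with constraints"
print(decision, blockers)
# -> no-go or conditional pilot with constraints ['point_of_use_notices_in_user_journeys']
```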
Run a usability validation that mirrors reality: give operators realistic cases under time limits, measure completion rates, error types (missed escalation, over-acceptance, misinterpretation of confidence), and time-to-intervention. Update instructions and UI prompts based on observed failures, then re-test. Package the final handoff as a structured bundle: Oversight Control Sheet, User Instructions, Transparency Notices, Training Plan, Access Control Summary, Logging Map, Feedback Workflow, and the signed go/no-go decision with owners and dates.
1. In Chapter 5, why are human oversight, transparency, and user instructions treated as more than “documentation tasks”?
2. Which set of deliverables best matches the evidence-ready deployer handoff pack described in the chapter?
3. Which oversight plan element most directly addresses the chapter’s critique of generic policies like “a human will review outputs”?
4. What is the primary goal of drafting user instructions and operational constraints in this chapter?
5. What does the chapter identify as the second most common failure mode after vague oversight statements?
Shipping an AI system is not the finish line under the EU AI Act—it is the moment your compliance posture starts being tested by real-world behavior, real users, and real failure modes. Post-market monitoring is where your assumptions meet operational reality: data shifts, new user strategies, emergent misuse, changed regulations, patched dependencies, and evolving cyber threats. This chapter turns the “after launch” phase into a disciplined loop you can run every week, and package every quarter, without panic.
Two habits separate teams that pass audits from teams that scramble: (1) measuring the right signals with clear triggers and ownership, and (2) writing evidence narratives that connect those signals to decisions. You will design post-market monitoring KPIs and drift triggers, draft an incident reporting and corrective action workflow, assemble an audit-ready evidence package with an index, run a mock audit to produce an improvement backlog, and end with a practical 90-day compliance maintenance plan. The goal is not to create more paperwork; it is to make sure every meaningful decision leaves a trail that is easy to follow and hard to misunderstand.
Throughout this chapter, treat “audit-ready packaging” as an engineering product: an index, versioning, traceability, and well-defined inputs/outputs. Your future auditors (or internal reviewers) should be able to answer three questions quickly: What is the system supposed to do? How do you know it’s doing that safely and fairly? What did you do when it didn’t?
Practice note for Milestone “Design post-market monitoring KPIs and drift triggers”: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone “Draft an incident reporting and corrective action workflow”: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone “Assemble an audit-ready evidence package with an index”: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone “Run a mock audit and produce an improvement backlog”: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone “Create a 90-day compliance maintenance plan”: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Start by translating your intended purpose into measurable outcomes. A monitoring plan is not a dashboard of “interesting metrics”; it is a set of KPIs tied to risk controls, plus drift triggers that force action. Build your plan as a table with: metric name, definition, data source, segmentation, threshold, trigger severity, response owner, and evidence location.
Cover five monitoring domains: model performance against the intended purpose; input and prediction drift; fairness and error rates across segments; safety and misuse signals (including unsafe output rates); and security and operational health (patching, dependency scans, logging coverage).
For the milestone “design post-market monitoring KPIs and drift triggers,” make triggers operationally specific. Avoid vague triggers like “significant drift.” Prefer: “PSI > 0.25 for two consecutive weeks on top-10 features,” “FNR increases by 20% vs baseline on segment X,” or “unsafe output rate exceeds 0.5% in any single-day window.” Define what happens next: who reviews, what gets frozen, whether you roll back, and when you notify stakeholders.
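For example, the PSI trigger can be computed directly from binned baseline and current distributions using the standard population stability index formula. The bin proportions below are invented, and a real trigger would also enforce the two-consecutive-week persistence condition.

```python
import math

def population_stability_index(expected: list, actual: list) -> float:
    """Standard PSI over aligned, pre-binned proportions; epsilon guards against log(0)."""
    eps = 1e-6
    return sum((a - e) * math.log((a + eps) / (e + eps)) for e, a in zip(expected, actual))

baseline = [0.20, 0.30, 0.25, 0.15, 0.10]    # frozen, versioned baseline distribution
this_week = [0.05, 0.20, 0.25, 0.25, 0.25]   # current week's distribution for one feature

psi = population_stability_index(baseline, this_week)
if psi > 0.25:   # the illustrative threshold from the text; persistence check not shown
    print(f"PSI={psi:.2f}: open a drift review, notify the response owner, log the Evidence ID")
```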
Common mistakes: monitoring only aggregate metrics (hides segment harms), collecting metrics without response playbooks, and failing to version baselines (you can’t prove drift if you can’t prove what ‘normal’ was). Practical outcome: a monitoring spec that can be handed to engineering and SRE, plus an “Evidence ID” for each metric pipeline and alert rule.
Incident handling must be pre-decided. Under the EU AI Act, you should be prepared to identify and report serious incidents and malfunctioning that may breach obligations. Your job is to define criteria that your team can apply consistently at 2 a.m., and an escalation pathway that does not depend on a single person’s judgment.
Define serious incident criteria using a decision tree that considers: impact severity (harm to health, safety, fundamental rights, or significant economic harm), scope (number of affected persons), persistence (one-off vs recurring), and detectability (silent failure vs obvious). Include “near misses” as a separate category: events that could have caused harm but were caught by controls. Near misses are gold for prevention and auditors often ask for them because they show learning.
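One way to make the criteria consistently applicable at 2 a.m. is to encode the decision tree. The sketch below uses the four factors above; the numeric scope cutoff is an invented placeholder your compliance team would set, and the output is triage guidance, not a legal determination.

```python
def classify_incident(harms_rights_or_safety: bool, affected_persons: int,
                      recurring: bool, caught_by_controls: bool) -> str:
    """Illustrative triage only; whether something is a reportable serious incident
    still requires compliance/legal review."""
    if caught_by_controls and not harms_rights_or_safety:
        return "near miss: record it and feed it into prevention"
    if harms_rights_or_safety or recurring or affected_persons > 100:  # scope cutoff is a placeholder
        return "serious: escalate to compliance and incident response immediately"
    return "standard: triage within the normal SLA"
```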
For the milestone “draft an incident reporting and corrective action workflow,” specify roles and timelines: L1 triage (support/on-call), L2 technical assessment (ML engineer + product), L3 compliance/legal review, and executive sign-off when needed. Add a single “Incident Coordinator” role who owns the clock and evidence capture. Your workflow should require: preserving logs, freezing relevant versions (model, prompts, data, code), capturing a minimal reproducible example, and documenting user-visible impact.
Common mistakes: over-classifying everything as “serious” (burnout and noise), under-classifying because teams fear blame, and failing to retain the exact artifacts that allow root-cause analysis. Practical outcome: an incident SOP with a clear escalation matrix (who to page, who to inform, who can approve rollback) and an incident record template that links to evidence (logs, runbooks, model card version, monitoring alerts).
CAPA turns incidents and monitoring triggers into durable improvements. Corrective action fixes the current problem; preventive action reduces the chance it happens again. For AI systems, CAPA must cover not just code changes, but data, evaluation, human oversight, user instructions, and deployment controls.
Implement CAPA as a structured record with these fields: problem statement, impact assessment, containment action (immediate mitigation), root cause analysis, corrective action plan, preventive action plan, verification method, effectiveness check date, and closure criteria. Root cause analysis should explicitly consider: data drift, label noise, pipeline bugs, prompt/template regressions, third-party model updates, UI changes that shift user behavior, and security bypasses.
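A minimal CAPA record sketch, reusing the locale-drift scenario that appears later in this chapter; field names are illustrative, and the small guard encodes the closure rule that effectiveness must be verified before the record is closed.

```python
capa_33 = {
    "problem_statement": "False-negative rate drifted upward for the new language locale",
    "impact_assessment": "Affected segment handled by normal automation for roughly two weeks",
    "containment_action": "Feature flag routed the affected segment to manual review",
    "root_cause": "Training data under-represented the new locale",
    "corrective_action": "Retrain on expanded dataset; re-run fairness evaluation",
    "preventive_action": "Add locale-coverage check to the CI evaluation suite",
    "verification_method": "Test suite V5; FNR within 10% of baseline on the segment",
    "effectiveness_check_date": "2024-07-15",
    "closure_criteria": "Verification passed and documentation (model card, instructions) updated",
    "effectiveness_verified": False,
}

def can_close(capa: dict) -> bool:
    # Guard against the common mistake: closing after a patch without verifying effectiveness.
    return capa["effectiveness_verified"]
```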
Engineering judgment matters when choosing remedies. Example: if bias gaps appear only after a new user segment arrives, retraining may not be the first step—first validate whether the segment is within intended purpose and whether you have lawful, representative data. Sometimes the correct action is to constrain usage (geo-fencing, eligibility checks) rather than to optimize metrics on out-of-scope data.
Corrective actions should be tied to measurable outcomes: “Reduce unsafe output rate from 0.8% to <0.2% on red-team suite V3,” or “Restore calibration error to within 10% of baseline.” Preventive actions often include new tests in CI/CD, stronger monitoring, better reviewer guidance, or updated user instructions.
Common mistakes: closing CAPA after deploying a patch without verifying effectiveness, and failing to update documentation (model card, risk record, user instructions) to reflect the new control. Practical outcome: a CAPA log that auditors can sample, showing traceability from detection → decision → change → verification.
An audit-ready package is an index plus stories. The index tells reviewers where evidence lives; the narrative explains why that evidence proves compliance for your intended purpose. For the milestone “assemble an audit-ready evidence package with an index,” create an “Evidence Register” with: Evidence ID, title, system version, owner, location (URL/path), confidentiality level, and mapped requirement/control.
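An Evidence Register can start as a simple list of records with the fields above; the evidence ID, owner, and URL below are hypothetical. A quick uniqueness check before publishing the pack avoids broken cross-references in narratives.

```python
evidence_register = [
    {
        "evidence_id": "EV-0142",
        "title": "Drift monitoring dashboard and alert rule A-142",
        "system_version": "risk-scorer 1.4.2",
        "owner": "ml-platform team",
        "location": "https://wiki.internal.example/evidence/ev-0142",  # hypothetical link
        "confidentiality": "internal",
        "mapped_requirement": "post-market monitoring: drift triggers and response",
    },
]

# Every Evidence ID must be unique before the pack is published.
ids = [row["evidence_id"] for row in evidence_register]
assert len(ids) == len(set(ids)), "duplicate Evidence IDs in the register"
```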
Then design your sampling strategy. Auditors rarely review everything; they sample. You should propose your own samples to demonstrate coverage and reduce random digging. Use risk-based sampling: pick the highest-impact decisions, highest-risk segments, most recent changes, and a few routine cases. Include at least: one incident end-to-end, one monitoring alert with resolution, one model update with approval trail, and one user complaint investigation.
Write evidence narratives that connect artifacts across the lifecycle. Example narrative: “Monitoring alert A-142 detected FNR drift in segment ‘new language locale’ → triage confirmed data shift → containment applied via feature flag → CAPA-33 retrained on expanded dataset with new fairness evaluation → effectiveness verified by test suite V5 → user instructions updated with new limitations.” Each arrow should point to an Evidence ID.
For the milestone “run a mock audit and produce an improvement backlog,” simulate an auditor’s behavior: ask for a claim (“bias is monitored”), then force the team to show the exact dashboard, the threshold, the alert history, and the decision logs. Anything you cannot produce within minutes becomes a backlog item. Common mistakes: evidence scattered across tools without stable links, and narratives that describe intentions rather than executed controls. Practical outcome: a curated audit pack with a table of contents, traceability map, and a prioritized backlog of gaps.
Post-market work often fails due to unclear responsibilities across provider, deployer, importer, distributor, and product manufacturer roles. Conformity assessment readiness is as much coordination as it is documentation. Make a RACI (Responsible, Accountable, Consulted, Informed) matrix for ongoing tasks: monitoring operation, incident triage, regulator communications, model updates, and user instruction changes.
Coordinate around three recurring events: (1) release approvals, (2) incident governance, and (3) evidence publication. For release approvals, require a “release note for compliance” that states what changed (model weights, prompts, training data, thresholds), the evaluation delta, and whether intended purpose or limitations changed. For incident governance, ensure deployers know how to report anomalies and what logs they must retain. For evidence publication, decide what is shared externally (to customers/partners) versus internally (full technical detail).
When third parties are involved (hosted foundation models, data providers, integrators), define contractual hooks: notification windows for upstream model changes, security incident cooperation, and access to evaluation artifacts. A common mistake is assuming that “the vendor is compliant” substitutes for your own evidence; in practice, you need vendor documentation mapped to your system’s intended purpose and integration risks.
Practical outcome: a stakeholder coordination plan that reduces surprises during conformity assessment or market surveillance—complete with named owners, recurring meeting cadence, and a shared repository structure aligned to your evidence index.
Continuous compliance is a living process: you will update models, fix vulnerabilities, retrain on new data, and sometimes retire a system. Treat each of these as a controlled change with pre-defined gates, not an ad hoc “ML refresh.” End this chapter by creating a 90-day compliance maintenance plan that includes weekly, monthly, and quarterly activities with owners and outputs.
Weekly: review monitoring alerts, review top user complaints, confirm security patches and dependency scans, and check that logging coverage meets your retention policy. Monthly: run drift reports, fairness slice reviews, red-team regressions for known misuse patterns, and a CAPA effectiveness check for recently closed actions. Quarterly: refresh risk classification assumptions, re-validate intended purpose boundaries, sample evidence narratives for audit readiness, and perform a mock audit mini-sprint to generate an improvement backlog.
For updates and retraining, define change categories (minor/major) based on impact to intended purpose and risk controls. Major changes should trigger expanded evaluation, updated user instructions, and an evidence bundle that can be shown as “before/after.” Always preserve the ability to roll back to a known-good version and document rollback criteria.
For decommissioning, include: notifying deployers/users, disabling endpoints, archiving evidence, retaining logs per policy, and documenting final known limitations and any open CAPAs. Common mistakes: retraining without updating baselines (breaks monitoring comparability) and failing to update user instructions when system behavior shifts. Practical outcome: a calendarized maintenance plan plus a change-control checklist that keeps your system compliant as it evolves.
1. In Chapter 6, what does “after launch” primarily represent for EU AI Act compliance?
2. Which pair of habits does Chapter 6 say separates teams that pass audits from teams that scramble?
3. Why does Chapter 6 emphasize designing KPIs and drift triggers for post-market monitoring?
4. How should “audit-ready packaging” be treated according to Chapter 6?
5. What are the three questions your evidence package should help auditors (or internal reviewers) answer quickly?