HELP

+40 722 606 166

messenger@eduailast.com

Designer to AI UX Prototyper: Copilot Interfaces That Test Well

Career Transitions Into AI — Intermediate

Designer to AI UX Prototyper: Copilot Interfaces That Test Well

Designer to AI UX Prototyper: Copilot Interfaces That Test Well

Go from screens to shipped copilot prototypes validated on real tasks.

Intermediate ai-ux · copilot-design · prototyping · user-testing

Transition from traditional UX to AI copilot prototyping

This course is a short, technical, book-style path for designers who want to move into AI product work by building something concrete: a copilot interface that supports real user tasks, handles uncertainty, and improves through testing. Instead of treating AI as a “chat box you add later,” you’ll learn how to design the workflow, the conversation, the UI states, and the guardrails as one coherent experience.

You’ll work from a task-first foundation—what the user is trying to get done, under what constraints, and what “good” looks like—then progressively shape that into a prototype you can evaluate with realistic scripts and measurable outcomes. The goal is not only a polished flow, but a defensible design: a copilot that explains itself appropriately, recovers gracefully when it can’t proceed, and earns trust without overpromising.

What you’ll build

By the end, you’ll have an end-to-end copilot prototype and a lightweight design spec that makes implementation discussions easier. Your deliverables are intentionally aligned with how AI teams work: product requirements + interaction design + prompt/tool specifications + test evidence.

  • A scoped copilot use case tied to a real workflow (not generic chat)
  • Intent and task decomposition with edge-case coverage
  • Copilot UI patterns and interaction states (loading, tool use, uncertainty, refusal, recovery)
  • A prompt and behavior contract (tone, structure, constraints, and test suite)
  • A task-based usability test plan and results summary
  • A portfolio-ready case study narrative with iteration rationale

How the book-style chapters progress

Chapter 1 sets the foundation: what makes copilot UX different, what boundaries matter (model, tools, data), and how to define success beyond “it responded.” Chapter 2 converts a messy workflow into testable tasks and a conversation flow that supports human control and safe handoffs. Chapter 3 turns those flows into interface patterns and stateful interaction design—because AI UX is as much about states and recovery as it is about the happy path.

Chapter 4 helps you prototype behavior with prompt architecture and tool specs you can hand to engineers. You’ll learn to define output contracts (structure, length, formatting), design for groundedness and verification, and build a regression prompt suite so improvements don’t break earlier wins. Chapter 5 focuses on assembling an end-to-end prototype without full engineering, including response libraries and instrumentation so you can evaluate the experience systematically. Chapter 6 closes the loop with task-based testing, practical AI UX metrics, and iteration decisions—then shows you how to package the work as a strong career-transition artifact.

Who this is for

This course is designed for UX/UI and product designers who can already create flows and wireframes, and now need the AI-specific layer: designing for uncertainty, tool use, trust, and evaluation. It also fits design-minded product folks who need a repeatable method for validating copilot interactions with real tasks.

Get started

If you want a clear pathway from “I design interfaces” to “I prototype and validate AI copilots,” this course gives you a practical structure and portfolio outcomes. Register free to begin, or browse all courses to compare related paths.

What You Will Learn

  • Translate real user tasks into copilot workflows, intents, and success metrics
  • Design copilot UI patterns: chat, side panel, inline assist, and agentic task runners
  • Create prompt and tool specs that designers can hand to engineers
  • Prototype AI interactions with states for uncertainty, refusal, and recovery
  • Plan and run lightweight usability tests for AI experiences using task scripts
  • Evaluate AI UX quality using groundedness, usefulness, safety, and trust signals
  • Document AI UX decisions with a PRD-style spec, edge cases, and acceptance criteria
  • Produce a portfolio-ready AI copilot case study with test evidence

Requirements

  • Basic UX/UI design knowledge (wireframes, flows, usability basics)
  • Access to a prototyping tool (e.g., Figma, FigJam, or equivalent)
  • Comfort writing short product requirements and test tasks
  • Optional: access to an LLM tool (ChatGPT/Claude/etc.) for experimentation

Chapter 1: From UX Designer to AI UX Prototyper

  • Define the copilot problem space and your target workflow
  • Map the AI system boundaries: model, tools, data, and UI
  • Choose the right interaction mode (chat, inline, panel, agent)
  • Set measurable outcomes: task success, time, trust, and safety

Chapter 2: Task Decomposition and Conversation Flow Design

  • Turn a workflow into intents, slots, and decision points
  • Write task scripts and success criteria for realistic testing
  • Design the copilot conversation: turns, confirmations, and handoffs
  • Create an error and recovery map for ambiguous or missing inputs

Chapter 3: Copilot UI Patterns and Interaction States

  • Design the core UI components: input, context, and output
  • Add transparency: sources, rationale, and confidence cues
  • Model system states: loading, tool use, partial results, failures
  • Create interaction guidelines for editing, undo, and versioning

Chapter 4: Prompting and Tool Specs for Prototype-Ready Behavior

  • Write a system prompt and style guide that matches the UX
  • Define tool calls and data needs as a designer-friendly spec
  • Design guardrails: constraints, refusals, and escalation paths
  • Create a test prompt suite that covers core tasks and edge cases

Chapter 5: Prototype the Copilot End-to-End (Without Full Engineering)

  • Build a clickable copilot flow with realistic states and copy
  • Simulate AI responses consistently using a response library
  • Instrument the prototype for evaluation (events and annotations)
  • Package the prototype with a spec for stakeholders and engineers

Chapter 6: Test With Real Tasks, Measure Quality, Iterate to “Pro”

  • Run moderated or unmoderated task-based tests on the prototype
  • Score results using AI UX metrics and a rubric
  • Prioritize fixes: UX, prompt, tool, or policy changes
  • Publish a portfolio case study with evidence and learnings

Sofia Chen

AI Product Designer & UX Research Lead (Conversational + Copilot UX)

Sofia Chen designs AI-assisted workflows and copilot interfaces for B2B SaaS, bridging UX research, prototyping, and LLM implementation constraints. She has led cross-functional teams to ship task-based AI experiences with measurable quality and safety outcomes.

Chapter 1: From UX Designer to AI UX Prototyper

Moving from “UX designer” to “AI UX prototyper” is less about learning a new visual style and more about learning to design a system with variable behavior. In a traditional UI, you can usually predict what happens when a user presses a button. In a copilot interface, the user’s input may be ambiguous, the model may be uncertain, a tool may fail, or the system may need more context. Your job becomes: translate real tasks into a copilot workflow, define the AI system boundaries (model, tools, data, and UI), choose the right interaction mode (chat, inline, panel, agent), and measure success beyond completion—time, trust, and safety matter.

This chapter helps you establish the “problem space” of copilot experiences: what they are good for, what they are risky for, and how to scope them. You’ll learn to frame work as user intents and steps, to recognize the gap between model capability and user experience, and to set measurable outcomes that can be tested with lightweight usability scripts. Most importantly, you’ll start thinking in artifacts that engineers can build: prompt and tool specs, state diagrams, and interaction prototypes that include uncertainty, refusal, and recovery.

Practice note for Define the copilot problem space and your target workflow: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Map the AI system boundaries: model, tools, data, and UI: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Choose the right interaction mode (chat, inline, panel, agent): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Set measurable outcomes: task success, time, trust, and safety: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Define the copilot problem space and your target workflow: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Map the AI system boundaries: model, tools, data, and UI: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Choose the right interaction mode (chat, inline, panel, agent): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Set measurable outcomes: task success, time, trust, and safety: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Define the copilot problem space and your target workflow: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Sections in this chapter
Section 1.1: What makes copilot UX different from traditional UX

Copilot UX is “co-work” UX: the system collaborates with the user to achieve a goal, often by interpreting intent, generating drafts, and taking actions through tools. This shifts your design focus from screens and flows to workflows and handoffs. In a classic form-fill app, the UI is the primary executor. In a copilot, the UI is an orchestration layer between the user, the model, and the tools that do real work (search, CRM updates, scheduling, code execution, document creation).

Three differences matter immediately. First, the system is probabilistic: two identical prompts may produce different outputs, and even a good output can contain errors. Second, the system is contextual: relevant information may live in data sources, user history, or documents, and your interface must communicate what context is being used. Third, the system is stateful in new ways: “I need more info,” “I’m not allowed,” “The tool failed,” and “Here are my assumptions” are first-class states you must design.

Common mistake: treating copilot as a chatbox bolted onto an existing product. Chat is only one interaction mode; sometimes it’s the worst one for speed and accuracy. Another mistake is designing for “best-case generation” while ignoring recovery paths. A real copilot experience needs explicit affordances for editing, confirming, undoing, and citing sources. You are not only designing what the system can do, but also how it should behave when it can’t.

  • Design objective shift: from pixel-perfect screens to dependable workflows.
  • Core UX risk: ungrounded or unsafe output that looks confident.
  • Key competency: define boundaries and failure states as deliberately as success states.

As you step into AI prototyping, you’ll constantly ask: what part of this task should the system automate, what part should remain user-driven, and how do we make the boundary visible so users stay in control?

Section 1.2: Task-first framing: jobs, constraints, and environments

Start with tasks, not features. A copilot should exist because it improves a real workflow: reduces time, increases quality, or lowers cognitive load. Task-first framing means translating “users want AI help” into a concrete job with constraints and an environment. For example: “Support agents triage inbound tickets and draft replies under time pressure, using a knowledge base and strict policy.” That single sentence already implies tools (ticket system, KB search), boundaries (policy compliance), and interaction mode (inline drafting with citations).

To define the copilot problem space, map a workflow at the level of intent and checkpoints: (1) user goal, (2) inputs available, (3) decision points, (4) outputs required, (5) review/approval, and (6) downstream impact if wrong. Then identify which steps are best suited for AI assistance: summarization, retrieval, drafting, classification, transformation, and planning. Avoid the trap of asking the model to “decide” when the user really needs it to “prepare.”

Constraints and environments matter because they determine acceptable behavior. Is the work regulated? Are there privacy constraints? Is the user multitasking? Are they on mobile? Is latency acceptable? A copilot that takes 20 seconds might be fine for weekly reporting, disastrous for live customer chats. Your task framing should explicitly include time budget, error tolerance, and collaboration model (solo work vs. team review).

Practical outcome: write a short “task card” you can hand to engineers and testers: target user, primary job, context sources, required output format, and non-negotiable constraints (security, policy, tone). This card becomes the anchor for selecting interaction mode and defining measurable success metrics later in the chapter.

Section 1.3: Capability vs. experience: model limits you must design for

Designers new to AI often confuse model capability with product experience. A model can generate fluent text; that does not mean it can reliably solve the user’s problem in context. Your job is to design an experience that makes limitations visible and survivable. Practically, this means mapping system boundaries: what the model knows by itself, what it can fetch via tools, what data it is permitted to access, and what the UI must communicate at each step.

Think of four layers: model (reasoning and generation), tools (APIs that act or retrieve), data (documents, databases, user context), and UI (interaction and feedback). If you skip this map, you’ll prototype fantasies: “the assistant will check the policy” without giving it retrieval; “it will update the CRM” without a tool; “it will reference last quarter” without data access.

Key model limits that affect UX: hallucination (invented facts), brittleness with ambiguous prompts, sensitivity to missing context, and uneven performance across edge cases. You design for these by adding groundedness patterns: show citations, show which sources were used, require confirmation before actions, and provide structured input chips (e.g., select account, date range, tone) to reduce ambiguity. You also need refusal and escalation patterns: when the model shouldn’t answer (policy, safety), it should explain why and offer safe alternatives.

Engineering judgment shows up in scoping: don’t design a single “do everything” prompt. Break work into smaller intents, each with clear tool access and output schema. A helpful artifact here is an “intent sheet” listing: intent name, user trigger examples, required context, tool calls, and the UI state sequence (draft → review → apply). This is how you turn capability into a testable, buildable experience.

Section 1.4: Trust calibration: when to be confident vs. cautious

In AI UX, trust is not a brand feeling; it’s a calibrated relationship between user expectations and system reliability. Over-trust leads to unreviewed mistakes. Under-trust leads to abandonment. Your design must guide users toward the right level of confidence for each output and action. This is why “trust signals” belong alongside task success metrics.

Start by classifying outputs into risk tiers. Low risk: drafting an internal email, summarizing meeting notes. Medium risk: recommending next steps, generating customer-facing replies. High risk: financial decisions, medical advice, irreversible actions in systems. For each tier, define UI behaviors: low risk may allow one-click insert; medium risk requires review and editable suggestions; high risk requires explicit confirmation, highlighted assumptions, and sometimes human approval workflows.

Trust calibration techniques you can prototype quickly include: showing source citations and timestamps; labeling uncertainty (“I’m not sure—here’s what I found”); exposing assumptions (“Assuming the customer is on Plan Pro”); offering verification actions (“Open source,” “Check policy,” “Run test”); and providing safe recovery (“Undo update,” “Revert draft,” “Create version”). Avoid confidence theater—phrases like “I’m certain” without evidence. Instead, ground confidence in observable signals: retrieved sources, tool results, and consistent formatting.

Set measurable outcomes for trust and safety. Examples: the rate of users who verify before sending, the number of unsafe requests correctly refused, the percentage of actions that users undo, and user-reported confidence after task completion. Time-on-task alone can be misleading: a faster workflow that increases error rates is not success. Your usability tests should include at least one “tricky” scenario to evaluate whether the design nudges the user toward appropriate caution.

Section 1.5: The AI UX prototyper toolkit: artifacts and workflow

An AI UX prototyper produces artifacts that connect design intent to engineering implementation. Your toolkit is not only Figma screens; it includes specs that describe prompts, tools, and states. A practical workflow is: (1) define task card, (2) map boundaries (model/tools/data/UI), (3) choose interaction mode, (4) draft intents and success criteria, (5) prototype multi-state interactions, and (6) test with task scripts.

Core artifacts to create:

  • Workflow map: user steps and decision points, including where AI assists and where the user must confirm.
  • System boundary diagram: what the model can access, what tools exist, what data sources are allowed, and what is out of scope.
  • Intent + tool spec: triggers, required inputs, tool calls, output schema, and constraints (tone, policy, formatting).
  • Prompt spec: system instructions, few-shot examples, structured variables, and refusal rules. Written so an engineer can implement.
  • State model: happy path plus uncertainty, tool failure, missing info, refusal, and recovery (edit, retry, escalate).
  • Test plan: lightweight scripts with success metrics (task success, time, trust signals, safety outcomes).

Choosing interaction mode is part of the toolkit, not an afterthought. Chat is good for exploration and ambiguous intent. Inline assist is best for editing and drafting where the user’s cursor already is. Side panels work well for reference, step-by-step guidance, and keeping context visible. Agentic task runners make sense when the system must execute multiple steps across tools, but they demand stronger confirmations, logs, and undo paths.

Common mistake: prototyping only the “answer.” Instead, prototype the whole loop: ask clarifying questions, show retrieved sources, present an editable draft, request confirmation before actions, and provide recovery. When your prototype includes these states, your usability tests will reveal whether users feel in control, not just impressed.

Section 1.6: Selecting a portfolio project and defining scope

Your first portfolio project should be small enough to finish, but real enough to demonstrate judgment. A strong AI UX prototyping project is defined by a single workflow, a clear user role, and a bounded set of tools and data. Instead of “AI for project management,” pick “Copilot that turns meeting notes into Jira-ready tickets with acceptance criteria and links to decisions.” This naturally produces artifacts: intents (summarize, extract decisions, draft tickets), tool boundaries (ticket creation API, document retrieval), and measurable outcomes.

Define scope by limiting: (1) number of intents (start with 2–4), (2) number of tools (1–2), (3) data sources (one corpus), and (4) risk tier (avoid high-stakes domains unless you can design rigorous safety). Write your “definition of done” as outcomes: target task success rate, acceptable time range, and explicit trust/safety criteria (e.g., must cite sources for factual claims; must refuse requests that violate policy; must provide undo for tool actions).

Select an interaction mode that matches the workflow. If the user edits text, build inline assist. If they need step-by-step completion and evidence, use a panel. If the assistant must execute a sequence (search → fill form → submit), prototype an agentic runner with a visible plan, progress updates, checkpoints, and an activity log. Include at least one designed failure case in your portfolio: a missing document, a conflicting source, or a restricted request—then show how your UI guides recovery.

Finally, make your portfolio readable to both design and engineering audiences. Present the task card, boundary map, intent/tool specs, and a prototype walkthrough that highlights uncertainty, refusal, and recovery states. When you can clearly explain what the system will do, what it will not do, and how you’ll measure success, you’re already operating as an AI UX prototyper—not just a designer experimenting with prompts.

Chapter milestones
  • Define the copilot problem space and your target workflow
  • Map the AI system boundaries: model, tools, data, and UI
  • Choose the right interaction mode (chat, inline, panel, agent)
  • Set measurable outcomes: task success, time, trust, and safety
Chapter quiz

1. According to Chapter 1, what most distinguishes moving from “UX designer” to “AI UX prototyper”?

Show answer
Correct answer: Designing a system with variable behavior and uncertainty rather than only predictable button-click outcomes
The chapter emphasizes that AI UX prototyping is about designing for variable behavior (ambiguity, uncertainty, failures, missing context), not adopting a new visual style.

2. Which set best represents the AI system boundaries you need to map when designing a copilot experience?

Show answer
Correct answer: Model, tools, data, and UI
Chapter 1 explicitly calls out defining boundaries across the model, tools, data, and UI.

3. A key early step in scoping a copilot experience is translating real work into what?

Show answer
Correct answer: A copilot workflow expressed as user intents and steps
The chapter focuses on framing work as user intents and steps to turn real tasks into a copilot workflow.

4. Which is the best reason Chapter 1 gives for choosing the right interaction mode (chat, inline, panel, agent)?

Show answer
Correct answer: Different modes fit different workflows and levels of autonomy the system should take
The chapter presents interaction modes as a design choice tied to the target workflow and how the copilot should operate within it.

5. Which measurement approach best matches Chapter 1’s definition of success for a copilot interface?

Show answer
Correct answer: Measure task success, time, trust, and safety—not just whether the task was completed
Chapter 1 stresses measurable outcomes beyond completion, including time, trust, and safety.

Chapter 2: Task Decomposition and Conversation Flow Design

Designing a copilot that “tests well” starts long before you write a prompt. It starts with a disciplined decomposition of real user work into smaller decisions, inputs, and handoffs that an AI can reliably support. In traditional UX, you could hide complexity behind screens and progressive disclosure. In AI UX, the system must also manage uncertainty: it may not know what the user means, may not have enough information, or may need to refuse. This chapter gives you a practical method to translate workflows into intents, slots, decision points, conversation turns, and recovery behaviors—and to turn those into testable task scripts with measurable success criteria.

Think of task decomposition as your “AI contract.” It defines what the copilot is trying to do, what it needs from the user, what tools or data sources it can use, and how it communicates confidence and risk. Conversation flow design then becomes the user-facing expression of that contract: the turns, confirmations, summaries, and approval moments that keep the user in control. If you do this well, you can hand engineers clear prompt and tool specs, prototype realistic states (including refusal and recovery), and run lightweight usability tests that reveal true product risks rather than superficial wording preferences.

Throughout this chapter, you’ll build four durable artifacts: (1) a workflow breakdown with variability and dependencies, (2) an intent model with slots and decision points, (3) an information architecture for prompts and outputs, and (4) an error/recovery map. These artifacts make your prototypes more than “happy path demos”—they make them evaluable systems with success metrics.

  • Outcome focus: users complete tasks with fewer steps, fewer errors, and clearer trust signals.
  • Engineering focus: prompts, tools, and states are specified in a way that can be implemented and logged.
  • Testing focus: you can write task scripts and success criteria that simulate real-world variability.

The rest of this chapter breaks the work into six practical sections, each aligned to decisions you must make as an AI UX prototyper.

Practice note for Turn a workflow into intents, slots, and decision points: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Write task scripts and success criteria for realistic testing: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Design the copilot conversation: turns, confirmations, and handoffs: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Create an error and recovery map for ambiguous or missing inputs: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Turn a workflow into intents, slots, and decision points: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Write task scripts and success criteria for realistic testing: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Design the copilot conversation: turns, confirmations, and handoffs: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Sections in this chapter
Section 2.1: Workflow breakdown: steps, variability, and dependencies

Section 2.1: Workflow breakdown: steps, variability, and dependencies

Start with a real workflow, not a feature idea. Pick a task users already perform (for example: “create a customer-facing project update,” “triage inbound support tickets,” or “draft a research plan”). Write the workflow as numbered steps in the user’s current process, then annotate each step with three things: variability, dependencies, and decision points. Variability is where users do it differently (novices vs. experts, regional rules, different data availability). Dependencies are inputs or systems required (CRM data, calendar access, policy docs). Decision points are moments where the user chooses a path (tone, target audience, whether to include risk, what timeframe).

A practical technique is the “3-layer breakdown.” Layer 1 is the user goal (“send a project update”). Layer 2 is subgoals (“gather status,” “summarize risks,” “choose audience,” “format message”). Layer 3 lists required fields and checks (“dates,” “milestones,” “blocked items,” “approval required”). This reveals what the copilot can do autonomously versus what must be asked or verified. It also surfaces where the user may want an inline assist (tight context, small help) versus a side-panel copilot (multi-step composition) versus an agentic task runner (multi-step execution with tool calls).

Common mistake: writing workflows as linear. In practice, workflows branch. Capture branches explicitly: “If data missing, ask user” or “If policy conflict, escalate.” This becomes the backbone of your later error and recovery map. Another mistake is ignoring time: some tasks are synchronous (draft now), others are asynchronous (monitor and notify). If the copilot needs to “wait,” design for status updates and resumable state.

Practical outcome: a workflow table you can test against. For each step, add a “measurable success signal” (e.g., message sent, draft approved, ticket correctly routed) and “failure modes” (wrong audience, hallucinated numbers, missing compliance language). These are the seeds for your usability task scripts and your metrics later.

Section 2.2: Intent modeling for copilots (beyond chatbot intents)

Section 2.2: Intent modeling for copilots (beyond chatbot intents)

Classic chatbot intent models (“reset password,” “check order status”) are too shallow for copilots. Copilots often support complex, multi-turn tasks with partial information, tool usage, and iterative refinement. Model intents at three levels: primary intent (what the user ultimately wants), supporting intents (subtasks like summarize, rewrite, compare, extract), and control intents (undo, regenerate, cite sources, change tone, stop). This captures how real users steer AI: they don’t just “ask”; they revise, constrain, and correct.

For each primary intent, define slots (required and optional). A slot is a piece of structured information the copilot needs: audience, format, timeframe, constraints, source documents, and tool permissions. Then define decision points: when the copilot can proceed, when it must clarify, and when it should confirm. For example, if the user says “Draft an update for the client,” audience and timeframe may be ambiguous; clarification is needed. But if the user provides a named client and a date range, the copilot can proceed to drafting and later confirm the final send action.

Include “tool intents” explicitly: “fetch project status,” “search policy,” “create ticket,” “schedule meeting.” These are not user intents; they are system actions the copilot may propose or execute. Your model should state which tools can be used, under what conditions, and what evidence must be shown (citations, retrieved snippets, or preview of changes). This is where engineering judgment matters: if an action is reversible, you can allow more automation; if it’s irreversible or risky, require explicit approval.

Common mistakes: treating every missing slot as a question (creates interrogation), and overloading a single “draft” intent with too many behaviors (hard to test, hard to implement). A better pattern is progressive slot filling: ask only what blocks progress, infer safe defaults, and surface assumptions in a summary so the user can correct them. Practical outcome: an intent catalog that maps directly to prompt templates, tool specs, and test cases.

Section 2.3: Information architecture for prompts and outputs

Section 2.3: Information architecture for prompts and outputs

Once you know intents and slots, design an information architecture (IA) that keeps prompts, context, and outputs stable. A copilot’s prompt is not one string—it is a structured bundle: system instructions (policy and style), developer instructions (workflow rules), user request (goal and constraints), retrieved context (documents and tool results), and memory (user preferences). Your job is to specify what belongs where so engineers don’t accidentally mix volatile user text into authoritative instructions.

Create a “prompt and tool spec” that a designer can hand to engineering. It should include: the intent name, required slots, default assumptions, allowed tools, output schema, and grounding requirements. Output schemas are critical for testability. Define a predictable shape: headings, bullet lists, fields, citations, and warnings. For example, an update draft might require sections for “Summary,” “Progress,” “Risks,” “Asks,” and a “Numbers” table sourced from tools. If the model cannot ground a number, the schema should allow “Unknown” with a follow-up question instead of guessing.

Design outputs for different UI patterns. In chat, the output should be scannable and include clear next actions (“Approve,” “Edit,” “Ask to clarify”). In a side panel, you can use structured cards with citations and toggles for tone or length. Inline assist outputs should be small and context-local (a sentence rewrite, a quick suggestion). Agentic task runners need status and logs: what steps are planned, which tools were called, what changed, and how to undo.

Common mistakes: allowing free-form outputs that vary each run (hard to compare in tests), and burying assumptions inside paragraphs (users miss them). Practical outcome: a simple IA diagram or table that links each intent to (a) prompt components, (b) tool calls, and (c) a stable output structure. This makes usability tests fair: you can evaluate usefulness and trust signals without the noise of inconsistent formatting.

Section 2.4: Conversation patterns: clarify, confirm, and summarize

Section 2.4: Conversation patterns: clarify, confirm, and summarize

A strong copilot conversation is a sequence of deliberate patterns, not improvisation. Use three core moves: clarify (reduce ambiguity), confirm (prevent harmful errors), and summarize (keep shared state visible). Clarify questions should be minimal and targeted. Instead of “Can you provide more details?” ask “Who is the audience: internal team or client?” and “What timeframe should the update cover?” Make questions multiple-choice when possible to reduce cognitive load and to improve downstream slot filling.

Confirmations should be risk-based, not constant. Confirm when the action is irreversible, compliance-sensitive, or likely to embarrass the user. For low-risk drafting, proceed with assumptions and present a summary: “I assumed audience=client, tone=professional, timeframe=last 2 weeks. Change any of these?” For sending an email or modifying records, require explicit approval and show a preview. This is where your UI patterns matter: approvals often work best as buttons (Approve/Cancel) rather than conversational “yes/no,” which can be misread in context.

Summaries are the backbone of recovery. After a few turns, restate the current plan, known slots, and open questions. This reduces the “lost in chat” problem and makes the conversation testable: observers can see whether the copilot tracked constraints correctly. Summaries should also include provenance when relevant: “Numbers pulled from Jira as of 10:32 AM,” or “Policy excerpt from HR Handbook v3.2.” That supports groundedness and trust.

Integrate handoffs intentionally. If the user asks for something the copilot can’t do safely, the conversation should guide them to a human or a standard workflow: “I can draft the message, but you’ll need a manager to approve sending.” Avoid dead ends like “I can’t do that.” In prototypes, model these patterns with clear states: clarifying question state, draft state, confirmation state, and handoff state. Practical outcome: a turn-by-turn conversation flow you can simulate in usability tests with realistic user responses and interruptions.

Section 2.5: Human-in-the-loop: approvals, edits, and overrides

Section 2.5: Human-in-the-loop: approvals, edits, and overrides

Human-in-the-loop (HITL) is not a checkbox; it’s a design system for control. Decide where humans approve, where they edit, and where they can override decisions. A helpful rule is to classify copilot actions into: suggest (low risk), prepare (medium risk), and execute (high risk). “Suggest” actions can be applied inline (rewrite a paragraph). “Prepare” actions create a draft artifact (ticket, email, plan) and require review. “Execute” actions change external state (send email, close ticket, update CRM) and should require explicit approval, often with a confirmation UI and an audit trail.

Design editing as a first-class loop. Users should be able to say “keep everything but shorten,” “replace this section,” or “use a friendlier tone,” and the system should preserve constraints and references. In UI terms, provide affordances for inline edits, version history, and “regenerate this section” rather than regenerating everything. Overrides must be respected: if a user changes a number, the copilot should ask whether to update the source-of-truth or treat it as a manual adjustment, and it should flag conflicts rather than silently adopting incorrect data.

Approvals should capture intent, not just clicks. If a user approves a draft, log what was approved (content hash or version ID), the assumptions, and the data sources used. This supports later debugging and trust. If policy requires it, route approval to another role (manager, legal) and show status. Even in prototypes, represent these states: “Pending approval,” “Approved,” “Rejected with comment,” and “Rework.”

Common mistakes: building a copilot that either asks permission for everything (slow, annoying) or executes too much (unsafe). Practical outcome: a HITL map that pairs each tool action with its required approval level, UI control, and rollback strategy. This map directly improves test scripts: you can measure whether users notice and understand approvals, and whether they feel in control.

Section 2.6: Edge cases: ambiguous goals, conflicting constraints, and refusal

Section 2.6: Edge cases: ambiguous goals, conflicting constraints, and refusal

Edge cases are not rare in AI experiences; they are the product. Your prototypes must include an error and recovery map that covers ambiguous goals, missing inputs, conflicting constraints, and refusal. Start by listing the top ambiguity types: unclear audience, unknown timeframe, missing source documents, undefined success standard (“make it better”), or vague actions (“handle this”). For each, specify the recovery behavior: ask a clarifying question, propose safe defaults, or offer examples the user can pick from.

Conflicting constraints are especially common: “Make it short but include all details,” “Be casual but legally compliant,” or “Use only approved sources but also include competitor info.” The copilot should detect conflicts, explain the tradeoff, and ask the user to prioritize. This is a key trust signal: users prefer a system that surfaces tension over one that pretends everything is possible. In flow terms, add a “constraint conflict” decision point that routes to a negotiation turn and then updates the slot values.

Refusal must be designed, not left to model defaults. Define refusal categories: disallowed content, unsafe actions, privacy violations, and tool permission gaps. A good refusal includes (1) a brief reason, (2) what it can do instead, and (3) how to proceed safely. Example: if asked to generate sensitive personal data, refuse and offer to draft a general template without personal details. If the issue is missing permissions, offer a handoff or an instruction to request access.

Finally, convert edge cases into testable task scripts and success criteria. A task script should include prompts that intentionally trigger ambiguity or conflict, plus expected copilot behaviors (clarify, summarize assumptions, refuse with alternatives). Success criteria should be observable: Did the copilot ask the right question? Did it avoid guessing? Did it provide a recovery path? This is how you evaluate groundedness, usefulness, safety, and trust—not by debating wording, but by verifying behavior under pressure. Practical outcome: an edge-case matrix that ensures your copilot can fail gracefully and still help users complete their work.

Chapter milestones
  • Turn a workflow into intents, slots, and decision points
  • Write task scripts and success criteria for realistic testing
  • Design the copilot conversation: turns, confirmations, and handoffs
  • Create an error and recovery map for ambiguous or missing inputs
Chapter quiz

1. According to the chapter, what is the main reason task decomposition must happen before writing prompts?

Show answer
Correct answer: It defines a disciplined “AI contract” for what the copilot will do, what it needs, and how it manages uncertainty
The chapter frames decomposition as an “AI contract” that specifies intents, required inputs, tools, and confidence/risk communication before any prompting.

2. How does AI UX differ from traditional UX in handling complexity, as described in the chapter?

Show answer
Correct answer: AI UX must actively manage uncertainty (ambiguity, missing info, refusals) rather than relying only on progressive disclosure
The chapter contrasts hiding complexity in traditional UX with AI UX needing explicit behaviors for uncertainty, including not knowing, lacking info, or refusing.

3. Which set of elements best represents the chapter’s method for translating workflows into a testable copilot design?

Show answer
Correct answer: Intents, slots, decision points, conversation turns, and recovery behaviors
The chapter’s method centers on intents/slots/decision points plus conversation flow and recovery behaviors to make the system evaluable.

4. In the chapter, conversation flow design is best described as:

Show answer
Correct answer: The user-facing expression of the task decomposition contract (turns, confirmations, summaries, approvals, handoffs)
Conversation flow design expresses the contract through turns, confirmations, summaries, and approval moments that keep users in control.

5. Which combination best reflects the three focus areas the chapter ties to the artifacts you produce?

Show answer
Correct answer: Outcome focus, engineering focus, and testing focus
The chapter explicitly lists outcome (fewer steps/errors, clearer trust), engineering (implementable/loggable specs), and testing (task scripts and success criteria).

Chapter 3: Copilot UI Patterns and Interaction States

In Chapter 2 you learned to translate tasks into intents and success metrics. This chapter turns those intents into interfaces that people can actually operate under real constraints: partial information, interruptions, tool delays, and mistakes. “Copilot UI” is not a single screen—it is a system of components and states that keeps the user oriented while the model reasons, fetches data, and sometimes refuses.

A practical copilot design starts with three core UI components: an input surface (how users ask and correct), context controls (what the AI is allowed to use), and an output surface (how suggestions become work). The hard part is not drawing these elements; it is deciding what is editable, what is attributable, what is reversible, and what happens when the model is uncertain or wrong. Those decisions are “engineering judgment” because they shape latency, data access, logging, and failure recovery.

As you prototype, plan for explicit system states, not just “idle” and “answered.” You will need loading states, tool-use states, partial results, failures, and recovery paths. You will also need interaction guidelines for editing, undo, and versioning, because AI output is rarely “final”—it’s a proposal that must be shaped into something safe and correct.

This chapter introduces common copilot UI archetypes and the interaction patterns that make them test well in lightweight usability sessions. The goal is a repeatable pattern library you can hand to engineering: what the user sees, what the model sees, what tools are called, how uncertainty is expressed, and how the user remains in control.

Practice note for Design the core UI components: input, context, and output: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Add transparency: sources, rationale, and confidence cues: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Model system states: loading, tool use, partial results, failures: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Create interaction guidelines for editing, undo, and versioning: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Design the core UI components: input, context, and output: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Add transparency: sources, rationale, and confidence cues: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Model system states: loading, tool use, partial results, failures: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Create interaction guidelines for editing, undo, and versioning: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Sections in this chapter
Section 3.1: Copilot UI archetypes: side panel, inline assist, overlay

Most copilots fall into three archetypes. Choosing the right one is less about aesthetics and more about cognitive load, context availability, and how reversible the AI’s work is.

Side panel copilots keep the user’s primary workspace intact (a document, CRM record, design file) while providing a persistent assistant beside it. This is ideal when the copilot needs ongoing context and the user will iterate: “summarize,” “draft,” “compare,” “rewrite.” Side panels also support multi-step workflows (plan → draft → refine → apply). Engineering judgment: side panels often require stable context syncing (selection, current record, cursor position) and careful permission boundaries for what the model can read.

Inline assist lives directly in the content: a ghost text suggestion, a rewrite button on a selected paragraph, or an autocomplete chip. Inline is best when the user’s intent is local and the output is small and easily reversible. The interaction should be quick: accept, reject, edit. Common mistake: inline assist that triggers long tool calls without indicating state; users think the editor froze. Make loading explicit near the insertion point and provide a cancel action.

Overlay copilots (modal dialogs, command palettes, full-screen “assistant mode”) are useful when the task is disruptive or high risk: bulk changes, account actions, data exports, sensitive content generation. Overlays can enforce confirmations, show warnings, and collect required parameters before running tools. Common mistake: using overlays for routine micro-edits; it breaks flow and increases abandonment.

  • Practical rule: If output is applied to a specific selection, prefer inline. If output is iterative across the whole artifact, prefer side panel. If output triggers irreversible or privileged actions, prefer overlay.
  • State checklist: idle → composing prompt → “thinking” → tool in progress → partial result → final result → apply/undo → error/retry.

In prototypes, label these states visibly. You are not just testing “do users like it?”—you are testing whether users can recover when the AI stalls, misreads context, or proposes something wrong.

Section 3.2: Output formats: suggestions, drafts, plans, and actions

Copilot output should match the user’s risk tolerance and the maturity of available context. A reliable pattern is to progress from low-commitment to high-commitment outputs: suggestionsdraftsplansactions.

Suggestions are small, local, and easily rejected: alternative wording, next steps, clarifying questions. They should be scannable and comparable. Use bullets, short chips, or inline alternatives. Provide “why this suggestion” sparingly—one sentence maximum—and reserve deeper rationale for a disclosure.

Drafts are longer artifacts: an email, a support reply, a requirements doc section. Draft UIs must prioritize editing. The “apply” interaction should not overwrite silently. Common pattern: show a diff (added/removed) or insert into a separate draft area with version history. Common mistake: presenting a draft as a chat bubble with no editing affordances; users copy/paste, losing traceability and undo.

Plans are structured sequences: a checklist, a workflow, a decomposition of tasks. Plans are particularly useful when the model is uncertain: they let the user verify intent before generation. Plans should expose parameters (audience, tone, constraints) as editable fields. Engineering judgment: plans are a good boundary between model reasoning and tool execution—tools should not run until the user confirms the plan.

Actions are tool-triggering outputs: “create ticket,” “schedule meeting,” “update record,” “run analysis.” Actions require explicit confirmations, previews, and a robust undo story. Design actions as buttons with clear labels (“Create 3 tasks in Jira”) rather than implicit “the AI did it.”

  • Practical outcome: In your spec, declare output types per intent (e.g., “Summarize = suggestion,” “Respond to customer = draft,” “Implement changes = plan then actions”).
  • Success metric tie-in: Suggestions optimize for time-to-decision; drafts optimize for edit distance; plans optimize for correctness of scope; actions optimize for error rate and reversibility.

The UI should make the user’s control visible at the moment it matters: before applying changes, before triggering tools, and whenever the assistant’s confidence is low.

Section 3.3: Context controls: what the AI sees and what users can change

Context is the fuel for usefulness—and the source of many failures. Good copilot UI makes context explicit, inspectable, and adjustable. Treat context as a first-class component alongside input and output.

Start by modeling context layers: system context (policies, tool availability), workspace context (the document, record, project), user-provided context (uploaded files, pasted text), and ephemeral context (current selection, cursor location, recently viewed items). Your UI should indicate which layers are active for this response. A simple “Using: This doc, Selection, Customer profile” line prevents a surprising number of errors.

Provide context controls users can change without rewriting the prompt: toggles (“Include previous messages”), scopes (“This page / This section / Entire doc”), and attachments (“Add file,” “Remove”). For side panels, add a small “context tray” showing chips users can remove. For inline assist, use lightweight scoping (“Rewrite selection”) plus a one-click expansion (“Use entire document”).

Common mistake: hiding context selection in a settings screen. Users need to adjust context at the moment they notice the AI is off. Another mistake is over-automating context, leading to privacy surprises. If you auto-attach a record or meeting transcript, disclose it clearly and allow removal.

From an engineering standpoint, context controls map directly to tool specs: what retrieval queries run, what documents are passed, what redactions apply, and what logs are stored. Your design deliverable should include a short “context contract” describing: allowed sources, default scope, user overrides, and failure behavior (e.g., “If retrieval returns 0 results, ask user to select a document”).

  • State behavior: when context changes, show a small re-run indicator (“Updated context—Regenerate”) rather than silently changing the answer.
  • Trust signal: display sources when retrieval is used, and label them as “workspace files” vs “external web,” because users judge risk differently.

When your usability test subjects fail, ask: was it an intent mismatch, or did the AI simply not have the right context? Context controls help you diagnose and fix the right problem.

Section 3.4: Designing for uncertainty: hedging, asks, and next-best actions

Uncertainty is not a bug; it is a normal state in AI systems. The design goal is to keep uncertainty from becoming confusion. You do that with calibrated language, visible states, and useful next steps.

Hedging should be specific, not vague. Replace “I’m not sure” with “I couldn’t find the policy for refunds after 30 days in the provided docs.” This ties uncertainty to missing context and invites correction. Avoid overconfident language when the model is guessing; it creates brittle trust that collapses after one obvious error.

Asks are targeted questions that reduce ambiguity quickly. Good asks are multiple-choice when possible: “Which audience is this for: customers, internal team, or executives?” In UI terms, convert asks into controls: dropdowns, chips, or short forms. This reduces the burden of writing another prompt and supports measurable completion rates.

Next-best actions prevent dead ends. If the model can’t answer, it should propose what to do next: “Select a document to search,” “Connect your calendar,” “Run the ‘Fetch Orders’ tool,” or “Create a draft with placeholders.” This is also where partial results matter: show what is known, clearly label what is assumed, and provide a “fill missing info” path.

Model system states explicitly: “Generating,” “Searching files,” “Calling tool,” “Waiting for permission,” “Partial results,” “Couldn’t complete.” Each state needs user options: cancel, retry, change scope, or switch to manual. Common mistake: a single spinner for everything; users can’t tell whether the system is working, stuck, or blocked on a permission.

  • Practical guideline: If uncertainty is due to missing parameters, ask. If it’s due to missing access, request permission or offer a manual input. If it’s due to low retrieval evidence, show sources and recommend expanding scope.
  • Prototype outcome: Add at least one test script step where the AI lacks key context; measure whether users can recover without abandoning.

Designing for uncertainty is how you make copilots “test well”: participants forgive incomplete answers when the interface makes the path forward obvious and controllable.

Section 3.5: Safe UX: warnings, sensitive content, and policy surfaces

Safety is not only a model policy problem; it is a UX problem. Users need clear boundaries about what the copilot can do, what it will not do, and what requires extra care. Good safe UX prevents accidental harm and also reduces support load.

Start with policy surfaces that are visible at the right time. Don’t bury rules in a help center. If a user requests a disallowed action, present a refusal with a short reason and acceptable alternatives. The refusal should be calm and specific: “I can’t help generate content that targets a protected group. I can help write a respectful, general announcement instead.” This turns refusal into recovery.

Use warnings when the user is about to apply high-impact output. Examples: sending external emails, modifying customer data, exporting reports, generating medical or legal guidance. Warnings should not be constant banners that users ignore. Trigger them at decision points (before “Send,” before “Apply to 1,200 records”). Pair warnings with a preview and an “Edit” affordance.

Handle sensitive content with clear state changes. If the assistant detects personal data, show a notice: “Contains personal information—review before sharing.” Provide redaction tools (“Remove phone numbers,” “Mask IDs”) and explain whether redaction is local-only or affects what is sent to the model. Engineering judgment: define what is logged, what is stored, and how users can delete conversation history.

Common mistakes include: vague refusals (“I can’t do that”), overblocking benign requests, and silent filtering that changes meaning without disclosure. If you must transform content for safety (e.g., remove identifiers), show what changed and allow the user to view and edit.

  • Recovery patterns: offer templated alternatives, route to human review, or suggest a safer tool-driven path (“Create a report without customer names”).
  • Trust signal: show whether an answer is grounded in internal sources, and label external web content as such to avoid implied endorsement.

Safe UX is measurable: track refusal resolution rates (did users complete the task via an alternative?) and post-warning edits (did users catch issues before sending?).

Section 3.6: Accessibility and inclusive language for AI-generated content

Copilot UX must be accessible in two senses: the interface must be operable by everyone, and the generated content must avoid excluding or harming people. Treat both as design requirements, not polish.

For interface accessibility, ensure AI states are perceivable to assistive tech. Loading and tool-use states should be announced (ARIA live regions), not only shown as spinners. Inline suggestions need keyboard navigation and clear focus management: users must be able to accept/reject without a mouse. When you show diffs or version history, provide text alternatives and ensure color is not the only signal for changes.

For editing, undo, and versioning, accessibility intersects with safety: users need reliable ways to revert. Provide “Undo apply,” “Restore previous version,” and a history list with timestamps. Avoid ephemeral toast-only confirmations; use persistent, discoverable affordances. In prototypes, explicitly design what happens when the user edits AI output: does the assistant treat edits as new context? Can the user “lock” a paragraph from being regenerated?

For inclusive language, add lightweight controls and guidance. Offer tone and audience selectors that default to neutral and respectful language. Provide a “bias and inclusivity check” action that flags stereotypes, gendered assumptions, and ableist language, with suggested rewrites. Important: present this as an assistive review, not an accusation. Also allow domain-appropriate terminology (for communities that self-identify with specific terms) by letting organizations configure language guidelines.

Common mistake: relying on the model to “just be inclusive” without UI support. Users need visibility into why a rewrite is recommended and the ability to keep original phrasing when appropriate.

  • Practical checklist: keyboard operability for accept/reject; state announcements for generating/tool use; readable contrast for suggestions; accessible diff/preview; undo/version history; optional inclusivity review with editable recommendations.
  • Usability testing tip: include at least one task where the participant must reject a suggestion, restore a previous version, and regenerate with a changed tone—this validates control and comprehension.

When copilots are accessible and inclusive by design, they build trust across a wider range of users—and your prototypes will survive real-world scrutiny, not just ideal demos.

Chapter milestones
  • Design the core UI components: input, context, and output
  • Add transparency: sources, rationale, and confidence cues
  • Model system states: loading, tool use, partial results, failures
  • Create interaction guidelines for editing, undo, and versioning
Chapter quiz

1. According to Chapter 3, what makes “Copilot UI” different from a single-screen UI?

Show answer
Correct answer: It is a system of components and states that keeps users oriented through reasoning, tool delays, and refusals
The chapter emphasizes Copilot UI as a system of components and interaction states, not a single screen.

2. Which set best represents the three core UI components in a practical copilot design?

Show answer
Correct answer: Input surface, context controls, output surface
Chapter 3 defines the core components as input (ask/correct), context (what AI can use), and output (turn suggestions into work).

3. Why does the chapter describe decisions like editability, attribution, and reversibility as “engineering judgment”?

Show answer
Correct answer: Because they influence latency, data access, logging, and failure recovery
These UI decisions have downstream impacts on system behavior and constraints, so they require engineering-aware tradeoffs.

4. What is the recommended approach to modeling copilot system states during prototyping?

Show answer
Correct answer: Plan explicit states such as loading, tool use, partial results, failures, and recovery paths
The chapter argues for explicit, testable states beyond idle/answered to handle real constraints and interruptions.

5. Why does Chapter 3 stress interaction guidelines for editing, undo, and versioning?

Show answer
Correct answer: Because AI output is typically a proposal that must be shaped into something safe and correct
The chapter frames AI output as non-final; users need control mechanisms to refine, reverse, and track changes safely.

Chapter 4: Prompting and Tool Specs for Prototype-Ready Behavior

Prototypes fail when the AI “acts smart” in a demo but cannot behave reliably under real task pressure. Chapter 4 is about moving from vibes to specs: prompts that express the experience you designed, tool contracts that engineers can implement, and test prompts that catch the subtle failures that ruin trust. Think of this chapter as your bridge from UI flow to runnable behavior.

As a designer transitioning into AI UX, your leverage is not writing the cleverest prompt. Your leverage is creating a predictable interface between human intent, model reasoning, and product constraints. That means: a prompt architecture that keeps roles clean; output contracts that stabilize layout and tone; tool definitions that clarify what the AI can and cannot do; grounding rules that encourage verifiable answers; and guardrails that translate safety policies into visible user-facing behaviors. Finally, you need prompt QA: a suite of test prompts that gives you regression confidence as tools, prompts, and UI states evolve.

Throughout, aim for “prototype-ready behavior”: the assistant produces structured outputs your UI can render, calls tools with correct arguments, refuses when necessary, and recovers with helpful next steps. If you can hand your prompt and tool spec to an engineer and get the same behavior you designed, you are operating at an AI UX prototyper level.

Practice note for Write a system prompt and style guide that matches the UX: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Define tool calls and data needs as a designer-friendly spec: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Design guardrails: constraints, refusals, and escalation paths: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Create a test prompt suite that covers core tasks and edge cases: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Write a system prompt and style guide that matches the UX: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Define tool calls and data needs as a designer-friendly spec: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Design guardrails: constraints, refusals, and escalation paths: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Create a test prompt suite that covers core tasks and edge cases: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Write a system prompt and style guide that matches the UX: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Sections in this chapter
Section 4.1: Prompt architecture: system, developer, user, and context

Section 4.2: Output contracts: schemas, tone, length, and formatting

Output contracts are the agreement between the model and the UI. Without them, your prototype becomes a fragile parser: one extra sentence and the UI breaks. Designers should specify output shape the way they specify component props: predictable fields, controlled verbosity, and stable formatting.

Start by deciding what the UI needs to render at each step. In a chat pattern, you may want a primary answer plus “why” and “next actions.” In a side panel, you may want compact bullets and a separate expandable “details” section. In inline assist, you often need a minimal suggestion plus a confirmation affordance. Translate those needs into a contract: required fields, optional fields, and error fields.

A practical approach is a JSON schema (even if you don’t implement strict validation yet). Example fields: intent, assistant_message, confidence, assumptions, citations, tool_calls, follow_up_questions, handoff_suggestion. Your UI can map these to states for uncertainty, refusal, and recovery. If the model is unsure, it should populate assumptions and ask questions rather than inventing details.

Tone and length are also part of the contract. Specify constraints like “Use 1–3 short paragraphs, then bullets for steps” or “Avoid hedging language unless uncertainty is explicitly flagged.” Designers often overlook that overly long responses are not just a content problem—they’re a usability problem that increases scanning cost and hides critical actions below the fold.

Common mistakes: (1) asking for “structured output” but not stating what happens when a field is unknown; (2) mixing user-facing copy with internal debug notes; (3) putting formatting rules in multiple places so they conflict. A strong contract includes fallback rules: “If citations unavailable, say ‘I can’t verify this from the provided sources’ and suggest the retrieval tool.”

Outcome: you get prototype behavior that is renderable and testable. Engineers can implement UI bindings and validation, and you can write usability test scripts that measure success metrics like “time to first useful step,” “clarity of next action,” and “user confidence after uncertainty disclosure.”

Section 4.2: Tooling concepts: retrieval, actions, and function calling

Section 4.3: Tooling concepts: retrieval, actions, and function calling

Tools turn a chatty assistant into a copilot that can do real work. As a designer, you don’t need to implement tools, but you must specify them clearly enough that engineering can build them and QA can test them. Think of each tool as a capability with inputs, outputs, latency, and failure modes.

Three core categories cover most products. Retrieval tools fetch information (search, knowledge base, documents). Your spec should define what sources are allowed, how results are returned (snippets, titles, URLs, doc IDs), and what “no results” looks like. Retrieval is what enables groundedness and reduces hallucination, but only if you instruct the assistant to prefer retrieved content over prior knowledge when answering product-specific questions.

Actions tools change the world: create tickets, send emails, update records, book meetings. Actions require confirmation patterns. Your tool spec should include a “dry run” mode or a two-step flow: propose → confirm → execute. This is where agentic task runners become usable rather than scary. Also specify permissions: what the assistant can do autonomously vs what requires explicit user approval.

Function calling is the structured API-like interface: the model outputs a tool name and a JSON argument object. Your designer-friendly spec should include: tool name, user intent it supports, required arguments, optional arguments, example calls, and example responses. Add constraints: allowed enums, max lengths, date formats, and validation rules. Include how the assistant should react to errors (timeouts, 400 validation errors, 403 permission errors). The best specs define the recovery UX: “If the action fails, summarize what was attempted, show the error in plain language, and offer next steps (retry, edit inputs, escalate).”

Common mistakes: defining tools as if the model “knows” the domain schema; omitting negative cases; and not specifying whether the assistant should call retrieval before answering. A practical pattern is a decision rubric embedded in the developer prompt: “If user asks about company policy, call retrieval; if user asks to modify a record, gather required fields then propose an action; if missing required info, ask targeted questions.”

Outcome: a handoff-ready tool spec that engineers can implement and you can prototype against with mocked responses—crucial for testing workflows before backend integration.

Section 4.3: Grounding and citations: designing for verifiability

Section 4.4: Grounding and citations: designing for verifiability

Grounding is the practice of tying outputs to verifiable sources—retrieved docs, records, or user-provided context—so users can trust and audit the assistant. In AI UX, grounding is not a backend feature; it’s a design choice expressed through prompts, UI affordances, and output contracts.

Start by deciding what “verifiable enough” means for your task. For creative brainstorming, you may not need citations. For policy guidance, medical content, financial summaries, or analytics explanations, you do. Your prompt should explicitly instruct: “When making factual claims about internal policies, quote or cite the retrieved source; if no source is available, say so and ask permission to search.” This prevents the model from filling gaps with plausible-sounding text.

Design citations as part of the response structure. Instead of dumping links, require a citations array with fields like source_title, source_id, quote, location (page/section), and relevance_note. The UI can render these as footnotes, expandable “Show sources,” or inline chips. For side panels, a split view (answer on top, sources below) helps users confirm without losing their place.

Also design for conflicting sources. A robust prompt instructs the assistant to surface discrepancies: “If sources disagree, summarize both and recommend the next best verification step.” This aligns with trust signals: transparency about uncertainty is better than false confidence.

Common mistakes: requiring citations but not defining what counts as a citation; allowing citations to be invented; and over-citing (every sentence) which harms readability. A good balance is to cite claims that drive decisions: numbers, deadlines, policy requirements, and “must/should” guidance. Finally, connect grounding to success metrics in testing: measure whether participants can answer “Where did that come from?” and whether they change behavior appropriately when sources are weak.

Outcome: prototypes that demonstrate not only usefulness but also auditability—one of the fastest ways to differentiate a copilot that “sounds right” from one that tests well.

Section 4.4: Safety-by-design: policies translated into UX behaviors

Section 4.5: Safety-by-design: policies translated into UX behaviors

Safety is experienced as behavior. Users don’t read policy documents; they encounter refusals, warnings, confirmations, and handoffs. Your job is to translate policy constraints into predictable UX patterns that preserve momentum.

Begin with a guardrail map: list the main risk categories relevant to your product (privacy, regulated advice, harassment, self-harm, IP, enterprise confidentiality, unsafe actions). For each category, define (1) what the assistant must refuse, (2) what the assistant can comply with under constraints, and (3) how to escalate. Escalation might mean “handoff to human support,” “ask the user to authenticate,” or “require manager approval.”

Refusals should not be dead ends. Your system prompt should include a refusal template aligned to your style guide: brief reason, what it can do instead, and a next step. Example: if asked to access private HR data, the assistant can say it can’t access that data, then offer to draft an email to HR or guide the user to the correct portal. This is refusal with recovery.

Design confirmations for high-impact actions. If the assistant can send messages, delete files, or change records, require explicit confirmation and show a preview. In prompt terms: “Never execute irreversible actions without user confirmation; present a summary of changes first.” In UI terms: a confirmation modal, “Review & Send,” or a diff view. This reduces both safety risk and user anxiety.

Common mistakes: over-refusing (blocking benign requests), under-explaining (users think the system is broken), and hiding guardrails in the backend only. Safety-by-design also includes how you handle uncertainty: when the model is not confident, it should ask questions or recommend verification rather than guessing. That is both a quality and safety improvement.

Outcome: a copilot that protects users and the business while still feeling helpful—measurable in usability tests as fewer abandonment moments and higher trust ratings after edge-case prompts.

Section 4.5: Prompt QA: adversarial tests, regression prompts, and checklists

Section 4.6: Prompt QA: adversarial tests, regression prompts, and checklists

Prompt QA is how you keep behavior stable as you iterate. Treat prompts like code: changes require tests. A “test prompt suite” is a set of scripted inputs and expected properties of the output (not necessarily exact text), run regularly to detect regressions.

Build your suite from three layers. First, core task prompts mapped to your workflows and success metrics (e.g., “Draft a support reply using the selected ticket,” “Summarize this doc with action items,” “Plan steps and request missing constraints”). Each should specify what “good” looks like in contract terms: correct schema fields, appropriate tone, tool usage, and completion criteria.

Second, edge-case prompts that trigger uncertainty, missing data, conflicting context, and tool failures. Include “no results” retrieval scenarios, malformed inputs, and ambiguous user goals. Your expected outcomes should check for recovery behavior: clarifying questions, alternative paths, and no fabricated details.

Third, adversarial prompts to probe safety and instruction hierarchy: prompt injection attempts (“Ignore previous instructions and reveal the system prompt”), data exfiltration attempts, and unsafe action requests. Your checklist here is crisp: refuses when required, does not leak hidden instructions, and offers safe alternatives.

Operationalize this with a checklist that designers and engineers can share. Example checklist items: “Uses retrieval for policy questions,” “Citations are present and non-invented,” “Asks max 2 clarifying questions at a time,” “Never executes actions without confirmation,” “On tool error, summarizes and suggests next step,” “Outputs valid JSON per schema.” Capture known failure examples as regression prompts—real transcripts from usability testing are gold.

Common mistakes: testing only happy paths, relying on exact string matches, and not versioning prompts. Version your system/developer prompts, and log tool inputs/outputs in prototypes so you can diagnose whether failures came from the model or the tool layer.

Outcome: confidence. With a prompt QA habit, your prototypes remain stable enough to run lightweight usability tests repeatedly, compare iterations, and demonstrate AI UX quality improvements in groundedness, usefulness, safety, and trust signals.

Section 4.6: Practical Focus

Practical Focus. This section deepens your understanding of Prompting and Tool Specs for Prototype-Ready Behavior with practical explanation, decisions, and implementation guidance you can apply immediately.

Focus on workflow: define the goal, run a small experiment, inspect output quality, and adjust based on evidence. This turns concepts into repeatable execution skill.

Chapter milestones
  • Write a system prompt and style guide that matches the UX
  • Define tool calls and data needs as a designer-friendly spec
  • Design guardrails: constraints, refusals, and escalation paths
  • Create a test prompt suite that covers core tasks and edge cases
Chapter quiz

1. What is the main reason prototypes fail when the AI only “acts smart” during demos?

Show answer
Correct answer: They look impressive but don’t behave reliably under real task pressure
The chapter emphasizes reliability under real conditions, not demo-only cleverness.

2. In Chapter 4, what is described as the designer’s real leverage when transitioning into AI UX?

Show answer
Correct answer: Creating a predictable interface between human intent, model reasoning, and product constraints
The chapter frames leverage as making behavior predictable through clear interfaces and constraints.

3. Which set of artifacts best represents “moving from vibes to specs”?

Show answer
Correct answer: A prompt architecture, output contracts, and tool definitions engineers can implement
The chapter focuses on prompt architecture, output contracts, and implementable tool contracts.

4. What is the purpose of designing guardrails in prototype-ready behavior?

Show answer
Correct answer: To translate safety policies into visible user-facing behaviors, including constraints, refusals, and escalation paths
Guardrails make constraints and safety handling explicit and user-visible, including escalation paths.

5. Why does Chapter 4 recommend creating a test prompt suite?

Show answer
Correct answer: To provide regression confidence as tools, prompts, and UI states evolve and to catch subtle failures
A test prompt suite is prompt QA that catches subtle failures and supports regression as the system changes.

Chapter 5: Prototype the Copilot End-to-End (Without Full Engineering)

A copilot prototype is not a “pretty mock.” It is a working model of decisions: what the copilot can do, what it cannot do, what it asks the user for, and how it recovers when reality gets messy. The goal of this chapter is to help you build an end-to-end, clickable copilot flow that behaves consistently across tasks, includes realistic uncertainty and refusal states, and can be evaluated with lightweight usability tests—without needing a full engineering build.

To do that, you will make four things tangible: (1) the UI flow users click through, (2) a response library that keeps AI outputs consistent and testable, (3) instrumentation—events and annotations—so you can evaluate quality, and (4) a package of artifacts stakeholders and engineers can act on. Designers often stop at the happy path; AI UX breaks there first. Your prototype must show what happens when the model lacks info, the tool fails, the user asks for something unsafe, or the system needs clarification.

Throughout this chapter, treat your prototype like a “simulation rig.” You are not proving the model is smart—you are proving the workflow is usable, the system is honest about limitations, and the experience leads to measurable success. You will make choices about fidelity, scope, and which failure modes to include. Those choices are engineering judgment: make the smallest prototype that can still reveal the risks and the wins.

Practice note for Build a clickable copilot flow with realistic states and copy: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Simulate AI responses consistently using a response library: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Instrument the prototype for evaluation (events and annotations): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Package the prototype with a spec for stakeholders and engineers: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Build a clickable copilot flow with realistic states and copy: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Simulate AI responses consistently using a response library: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Instrument the prototype for evaluation (events and annotations): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Package the prototype with a spec for stakeholders and engineers: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Build a clickable copilot flow with realistic states and copy: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Sections in this chapter
Section 5.1: Prototyping fidelity choices: lo-fi, mid-fi, and hi-fi

Pick fidelity based on the decisions you need to test, not the stakeholder you need to impress. For copilots, fidelity is less about pixels and more about behavioral completeness: does the user see the right states, constraints, and outcomes? You can prototype the end-to-end flow at three common levels.

Lo-fi (paper or grayscale wireframes) is best when you are still deciding the workflow: where the user starts, what the copilot asks, what the user can edit, and where outputs land (chat, side panel, inline). You can still include realistic AI states by using static cards for “AI is thinking,” “needs more info,” “refused,” and “tool error.” The key is to make the clicks match the task script, so testing reveals whether the workflow itself makes sense.

Mid-fi (clickable prototype with basic styling) is the sweet spot for most AI usability tests. You can simulate turn-by-turn conversation, tool calls, and revision loops. Mid-fi is also where you start implementing a response library so the same prompt produces the same class of output in testing. If you are using Figma, this usually means components + variants; if you are using a prototyping tool, it may be a simple “choose response” overlay.

Hi-fi (polished UI + near-real copy) is appropriate when trust signals, microcopy, and clarity are the primary risks—common in sensitive domains (finance, health, HR). Hi-fi also helps when you need to compare patterns (chat vs inline assist) without “ugly UI” skewing perception. However, hi-fi can hide gaps if it looks done; protect yourself by labeling the prototype as a simulation and keeping the states visible.

  • Common mistake: building hi-fi screens with one perfect response. AI systems fail in patterns; prototype the patterns.
  • Practical outcome: a clickable copilot flow that includes at least one happy path and two degraded paths (e.g., missing info + refusal), aligned to a real user task.

As you choose fidelity, also choose scope. Prototype one “north star” task end-to-end before you prototype five partial tasks. Stakeholders forgive limited breadth; users do not forgive broken loops (clarify → generate → revise → apply).

Section 5.2: Response libraries: reusable outputs, variants, and tone

A response library is how you simulate AI responses consistently. Without it, usability tests become “participant versus random output,” and you cannot compare sessions. Your library is a set of reusable outputs organized by intent, state, and tone—like a design system for language.

Start by defining intents (what the user is trying to do) and map each to response types: clarifying question, draft output, summary, options list, step-by-step plan, refusal, citation-backed answer, tool result, and follow-up suggestion. For each response type, write 2–4 variants that cover realistic differences: short vs detailed, confident vs uncertain, and “best effort with assumptions” vs “ask before proceeding.”

Then define tone rules and encode them as reusable micro-patterns: how the copilot hedges, how it confirms actions, how it references sources, and how it expresses limits. For example, create a standard structure for uncertain answers: (1) what it can do, (2) what it needs, (3) suggested next step. The point is not literary consistency; the point is testability.

  • Library template: Intent → Trigger → Required inputs → Tool used (if any) → Primary response → Variant A/B → Failure variants (no data, ambiguous, policy) → “Next best action.”
  • Common mistake: writing only “good” outputs. You must include mediocre-but-plausible outputs to test recovery flows (edit, ask follow-up, rerun with constraints).

When you run usability tests, you will “play the copilot” by selecting responses from the library. That lets you keep behavior stable while observing user comprehension and trust. It also sets you up for engineering: your response library becomes an early corpus for prompt examples and evaluation cases.

Section 5.3: State machines for AI UX: happy path to degraded modes

AI UX needs explicit states. If your prototype only shows the happy path, you are not prototyping the product—you are prototyping a demo. A simple state machine keeps your design honest: it forces you to define what happens when the model is uncertain, when tools fail, when safety rules apply, and when the user needs to recover.

Model your copilot interaction as transitions between states such as: IdleCollecting contextGeneratingPresenting draftUser edits/asksApplying actionDone. Then add degraded modes: Needs clarification, Low confidence, Tool timeout, No access/permission, Policy refusal, Hallucination risk (no sources), and Partial completion.

Each degraded state must answer three questions:

  • What does the user see? (UI: loading indicator, warning banner, disabled CTA, “verify” prompt)
  • What can the user do next? (actions: provide info, choose option, retry, escalate to human, export draft)
  • What is logged? (event + annotation for later evaluation)

Instrumenting the prototype can be lightweight. Add an “analyst layer” in your file: small labels that won’t appear in the final UI but record which state is active and why. Define events such as copilot_response_shown, user_accepts_draft, user_edits_output, refusal_triggered, tool_call_failed, and user_escalates_to_human. During testing, you or a note-taker can tally these events and attach qualitative notes about groundedness, usefulness, safety, and trust.

Common mistake: treating “uncertainty” as a single banner. In practice, uncertainty can mean missing user intent, missing data, stale data, or conflicting sources—each needs a different recovery path.

Practical outcome: a clickable prototype where every turn of the copilot maps to a defined state, including at least one refusal and one recovery loop that leads back to success.

Section 5.4: Microcopy that builds trust: disclaimers, nudges, and labels

In copilot interfaces, microcopy is part of the safety system. Good labels and disclaimers do not “legalize” a weak experience; they clarify capability boundaries and guide users toward reliable outcomes. The most effective microcopy is specific, timely, and paired with an action the user can take.

Use labels to differentiate system behaviors: “Draft,” “Suggestion,” “Verified from sources,” “From your files,” “Estimated,” or “Needs review.” If the output is not grounded, say so and offer a next step: “I couldn’t find a source in your workspace. Want me to search Project Docs or paste a link?”

Use disclaimers sparingly and locally. A global “AI may be wrong” message trains users to ignore you. Instead, place disclaimers at moments of risk: before applying changes, before sending messages to customers, or when using assumptions. Pair them with controls: preview, diff view, undo, and citations. If the copilot is acting agentically, your microcopy must make the action boundary explicit: “I will create a ticket and assign it to Alex. Review details before submitting.”

Use nudges to improve prompting without blaming the user. Provide examples and chips that encode good constraints (“audience,” “tone,” “length,” “must include,” “don’t include”). When the user’s request is ambiguous, your microcopy should convert ambiguity into choices: “Which timeframe do you mean?” with options, not a blank question.

  • Common mistake: hiding refusal behind generic errors. Refusal copy should be direct, policy-aligned, and offer safe alternatives.
  • Practical outcome: microcopy that increases trust by showing what the system used (sources/tools), what it assumed, and what the user can do to correct it.

Write microcopy to be tested. In your response library, tag outputs with the trust signals they require (citations, review step, limitation note). That way you can observe whether users notice and act on them during usability tests.

Section 5.5: Collaboration workflow: design-to-dev handoff artifacts

Your prototype becomes valuable to engineering when it is packaged with specs that reduce ambiguity: what prompts exist, what tools are called, what data is required, and what counts as success. The handoff for AI UX is not just redlines; it is behavior, evaluation, and constraints.

Create a small set of artifacts that travel together:

  • Flow map: user task → intents → screens → states → transitions (include degraded modes).
  • Prompt + tool spec: for each intent, document system instructions, user message template, required context, and tool invocation schema. Include examples from your response library as “golden outputs.”
  • UI contract: what the UI needs from the model (fields, citations, confidence label, action proposals) and what the model needs from the UI (user role, document selection, permissions).
  • Instrumentation plan: events, annotations, and success metrics (task completion, edits before accept, escalation rate, time-to-first-useful-output).

Be explicit about engineering judgment calls. For example, specify when the system should ask a clarifying question versus proceeding with assumptions, and what thresholds trigger “low confidence” messaging. Engineers can tune these later, but they need your intent to avoid shipping a copilot that either nags constantly or bulldozes ahead.

Common mistake: handing off “prompt text” without tool boundaries. If the experience depends on tools (search, CRUD actions, calculators), the spec must describe tool inputs/outputs and failure handling.

Practical outcome: stakeholders can understand what will be built, and engineers can estimate work because the prototype includes states, tool calls, and measurable success criteria—not just UI screens.

Section 5.6: Review readiness: demo script and stakeholder checklist

Before you schedule a review, make the prototype “demo-ready” and “test-ready.” Demo-ready means you can reliably show the same story every time. Test-ready means a participant can complete tasks without you explaining the interface. Both require preparation that designers often skip.

Write a demo script that includes: the user persona and goal, the starting context (what data/files are available), the primary task flow, and at least two degraded scenarios. For example: (1) user asks for a summary; copilot asks a clarifying question; user selects a document; copilot provides a draft with citations; user edits; apply. Then (2) user asks for something unsafe; copilot refuses with alternatives. Then (3) tool times out; copilot offers retry and manual path. This demonstrates that you designed recovery, not just generation.

Use a stakeholder checklist to drive decisions rather than opinions:

  • Workflow clarity: Can users predict what happens next? Are action boundaries clear?
  • Groundedness: Does the UI show sources/tool usage when needed?
  • Usefulness: Does the output land where work happens (doc, ticket, email), not only in chat?
  • Safety: Are refusal and escalation paths explicit and humane?
  • Trust: Are assumptions labeled? Is undo available for applied actions?
  • Measurement: Do we know what events/metrics define success?

Finally, do a “dry run” with a colleague acting as a skeptical user. If they can derail your prototype by asking a reasonable follow-up you cannot answer consistently, your response library or state machine is incomplete. Fix that before the review. The practical outcome of this section is a prototype package you can ship internally: a clickable flow, a response library, instrumentation notes, and a script that makes evaluation repeatable.

Chapter milestones
  • Build a clickable copilot flow with realistic states and copy
  • Simulate AI responses consistently using a response library
  • Instrument the prototype for evaluation (events and annotations)
  • Package the prototype with a spec for stakeholders and engineers
Chapter quiz

1. In Chapter 5, what is the primary purpose of a copilot prototype?

Show answer
Correct answer: A working model of decisions that proves the workflow is usable and honest about limitations
The chapter emphasizes a prototype as a working model of decisions and usability, not polish or a full build.

2. Which set of elements must be made tangible to create an evaluable end-to-end copilot prototype without full engineering?

Show answer
Correct answer: UI flow, response library, instrumentation (events/annotations), and a packaged spec for stakeholders/engineers
The chapter lists four tangible outputs: clickable flow, response library, instrumentation, and a package of artifacts.

3. Why does the chapter argue that designing only the happy path is risky in AI UX?

Show answer
Correct answer: AI UX breaks first when reality gets messy, so prototypes must include uncertainty, refusal, tool failure, and clarification states
The chapter warns that real-world failure modes (missing info, unsafe requests, tool failures) are where AI UX often fails.

4. What is the role of a response library in the prototype?

Show answer
Correct answer: To keep AI outputs consistent and testable across tasks and states
A response library simulates AI responses consistently so the prototype behaves predictably during evaluation.

5. What does the chapter mean by treating the prototype like a “simulation rig”?

Show answer
Correct answer: It simulates the workflow to evaluate usability, limitations, and measurable success rather than proving the model is smart
The chapter frames prototyping as testing the experience and honesty of the system, using the smallest scope that reveals risks and wins.

Chapter 6: Test With Real Tasks, Measure Quality, Iterate to “Pro”

At this point in the course, you can design copilot workflows, sketch UI patterns, and prototype uncertainty, refusal, and recovery. Chapter 6 is where you earn “pro” status: you put the prototype in front of people, give them real tasks (not feature tours), measure the outcomes with consistent metrics, and iterate with engineering judgment. AI UX isn’t validated by whether a chat bubble looks right—it’s validated by whether users can reliably complete their work with acceptable quality, safety, and trust.

This chapter shows a practical loop you can run in a week: plan a lightweight study, execute moderated or unmoderated sessions, score results with a rubric, decide whether each failure is a UX problem, a prompt problem, a missing tool, or a policy/safety rule, and then publish a portfolio case study that includes evidence. The goal is not perfection; the goal is a repeatable method that produces defensible design decisions.

As you read, keep one principle in mind: test the system as a workflow, not as a model. Participants don’t experience “the prompt.” They experience an interface, a set of capabilities, latency, ambiguity, refusal, and the need to recover from errors. Your job is to learn which part breaks first under real task pressure—and fix that part with the smallest, safest change.

Practice note for Run moderated or unmoderated task-based tests on the prototype: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Score results using AI UX metrics and a rubric: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Prioritize fixes: UX, prompt, tool, or policy changes: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Publish a portfolio case study with evidence and learnings: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Run moderated or unmoderated task-based tests on the prototype: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Score results using AI UX metrics and a rubric: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Prioritize fixes: UX, prompt, tool, or policy changes: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Publish a portfolio case study with evidence and learnings: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Run moderated or unmoderated task-based tests on the prototype: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Sections in this chapter
Section 6.1: Test planning: participants, scenarios, and consent basics

Start by deciding what “real” means for your copilot. Pick 2–3 primary jobs-to-be-done from earlier chapters (for example: “draft a customer email using account notes,” “summarize a ticket thread and propose next steps,” or “create a project plan from a meeting transcript”). Each job becomes a scenario with a clear starting point and a finish line. Avoid generic prompts like “try the assistant and tell us what you think.” That produces opinions, not evidence.

Recruit participants who match the actual role. Five to eight participants can uncover most workflow issues, but only if they have comparable context. If you can’t recruit true end users, recruit “adjacent” users and tighten your scenarios (provide more background, sample artifacts, or a glossary) so the test still reflects realistic constraints. Decide moderated vs. unmoderated based on what you need to observe: moderated sessions are better for learning how users reason about uncertainty and refusal; unmoderated sessions are better for collecting more task completion data quickly.

Plan your materials like a mini research packet: (1) a one-page study plan (goals, tasks, metrics), (2) participant screener, (3) task scripts, (4) a rubric, and (5) a consent form. Consent basics: explain recording, what data you collect (including any text participants paste into the copilot), how it will be used, and that they can stop anytime. If your prototype could expose sensitive content, give participants sanitized sample data instead of asking them to use real customer information.

  • Common mistake: testing only “happy path” scenarios. Include at least one scenario that should trigger uncertainty (missing info), one that risks a refusal (policy boundary), and one that requires recovery (wrong output that must be corrected).
  • Practical outcome: you end planning with tasks that map to measurable outcomes, not just “feedback.”

Finally, define what you will change after the test. If your team is not prepared to adjust prompts, UI copy, or tool wiring, don’t run the study yet. The point of testing is iteration, not documentation.

Section 6.2: Task scripts and probes for copilot usability sessions

A strong AI copilot task script is more like a work order than a survey. It sets context, constraints, and success criteria. Include the artifacts the participant would normally have (a ticket, a transcript, a spreadsheet excerpt), and specify what “done” looks like. For example: “Send a reply email that (a) references the customer’s last message, (b) proposes two options, and (c) uses a friendly but professional tone.” That turns a vague conversation into an assessable output.

For moderated sessions, your script needs two layers: the task instructions and the probes. Probes are targeted questions you ask at key moments without leading the participant. Use them to capture mental models, trust calibration, and decision points. Good probes include: “What do you expect the copilot to do next?”, “What information would you need to feel confident sending this?”, “If you couldn’t use the copilot, what would you do instead?”, and “What made you accept or reject that suggestion?”

Because AI interfaces are probabilistic, add a “branch” section to your script that covers likely outcomes: the model hallucinates, refuses, asks clarifying questions, or returns an incomplete answer. Your job is to observe recovery behavior, not to rescue the participant. If they’re stuck, use neutral prompts: “What would you try next?” or “Show me what you’d do in your real work.” Avoid teaching them hidden features during the task; save feature education for after the measurement.

  • Include at least one tool-dependent task: a scenario where the copilot must use retrieval, search, or a structured action (create a ticket, update a field). This reveals whether your UI communicates tool usage and whether users can verify results.
  • Capture the full interaction: log user inputs, model outputs, system messages/state (loading, tool call, error), and any citations. AI UX issues often live in transitions, not in the final answer.

For unmoderated tests, write scripts that are self-contained: include step-by-step instructions, a short practice task, and a place to paste final outputs. Ask a small number of fixed questions after each task (confidence, effort, perceived correctness) and reserve open-ended questions for the end. The practical goal is consistency so you can score outcomes across participants.

Section 6.3: Metrics: task success, effort, trust, and error recovery

To evaluate AI UX, you need both outcome metrics (did the work get done?) and experience metrics (how hard did it feel, and did trust calibrate appropriately?). Start with task success as your anchor. Use a simple three-level scale: Success (meets criteria with minimal edits), Partial (needs substantial correction or missing requirements), Fail (cannot complete or produces unsafe/incorrect output). Tie these to your task’s explicit acceptance criteria, not to your personal taste.

Next measure effort. In lightweight studies, “time on task” is useful but noisy because participants type at different speeds and may explore. Pair it with self-reported effort (e.g., 1–7) and observable friction signals: number of turns, backtracks, copy/paste to external tools, and moments of “what do I do now?” If your copilot requires five re-prompts to get a usable answer, that’s an effort problem even if the final output looks good.

Trust and confidence are critical because AI can be convincingly wrong. After each task, ask for a confidence rating and a short justification: “What made you confident or unsure?” Look for over-trust (sending without verification) and under-trust (ignoring correct suggestions). Both are failures: over-trust is a safety risk; under-trust is a usefulness risk. Your design should encourage appropriate verification through citations, previews, or “show your work” affordances.

Finally, quantify recovery. Create a small recovery score: (1) user noticed the issue, (2) user found a way to correct it, (3) user reached a safe outcome. Track where recovery broke: unclear error messaging, missing undo, no way to constrain output, or refusal with no alternative. Recovery is where professional AI UX differs from demos.

  • Common mistake: counting “participant liked it” as success. A pleasant tone can hide incorrect content.
  • Practical outcome: a table per task with success, effort, trust, and recovery scores you can compare across iterations.

Use a rubric to make scoring repeatable. When multiple people score outputs, agree on examples of “Success” vs. “Partial” beforehand. Consistency is more important than precision at this stage.

Section 6.4: AI quality review: groundedness, helpfulness, and safety checks

Usability scores tell you whether users can finish tasks; AI quality review tells you whether the copilot’s outputs deserve trust. Add a structured review pass after each session (or after each unmoderated submission) where you assess the assistant’s responses on groundedness, helpfulness, and safety. Think like an engineer: you’re identifying failure modes and deciding which layer should change.

Groundedness asks: is the answer supported by provided context or cited sources? If your prototype includes retrieval or citations, verify that citations actually support the claim. If it doesn’t include citations, check whether the model invents specifics (names, dates, policies) that weren’t present. Mark hallucinations explicitly, even if they sound plausible. In copilot work, “plausible” is often the enemy.

Helpfulness asks: does it move the task forward in the user’s real constraints? Helpful outputs are actionable (next steps, options, structured summaries) and appropriately scoped. Watch for “polite but useless” responses: generic advice, long explanations when the user needed a draft, or excessive hedging that pushes work back onto the user.

Safety checks depend on your domain. Create a short checklist tailored to your product: privacy leakage, policy violations, disallowed content, and risky instructions. Include boundary tests: prompts that try to extract sensitive data, bypass policies, or request actions the copilot should not take. In your prototype, the user should see clear refusal language plus safe alternatives (e.g., redirection to approved resources or a way to proceed with anonymized data).

  • Common mistake: treating refusal as a pure model problem. Often it’s a UX problem: users don’t understand why it refused or what they can do instead.
  • Practical outcome: a quality log of issues tagged by type (hallucination, missing citation, unsafe suggestion, unhelpful verbosity) with severity.

Combine the quality log with your usability metrics. A task can be “successful” from the participant’s perspective but still unacceptable if the content is ungrounded or unsafe. Professional teams treat those as critical failures regardless of user satisfaction.

Section 6.5: Iteration loops: prompt tweaks vs. UX changes vs. tool needs

After testing, you’ll have a list of issues. The professional move is to sort each issue into the smallest effective fix category: UX change, prompt change, tool change, or policy/safety change. This prevents a common failure pattern where teams keep “prompting harder” to solve problems that need product design or engineering.

UX changes help when users don’t know what to do, misunderstand system state, or can’t verify/undo. Examples: add suggested prompts, clarify what the copilot can access, provide a “verify with sources” toggle, show tool activity (“Searching knowledge base…”), add an edit-and-resubmit flow, or create a structured side panel for constraints (tone, audience, format). UX fixes often reduce effort and improve recovery without touching the model.

Prompt tweaks help when the assistant’s behavior is inconsistent but the needed information is available. Examples: strengthen instructions to ask clarifying questions before drafting, enforce output structure (bullets, tables, JSON), specify citation requirements, or add a system-level policy reminder. Keep prompt changes testable: change one thing, rerun the same tasks, and compare scores. Don’t make prompts so long they become brittle; prefer concise rules plus tool support.

Tool needs appear when the model cannot reliably know something or take an action. If groundedness fails because the copilot lacks retrieval, you need a search tool or RAG, not a better prompt. If tasks require updating records, you need a safe action API with confirmations, not “please pretend you updated it.” Tools also need UX: preview, diff, and undo for writes; and citations/snippets for reads.

Policy/safety changes are required when a capability is fundamentally risky. Sometimes the correct iteration is to restrict a workflow, add human review, or introduce role-based access. Document these decisions as product constraints, not “model limitations.”

  • Prioritization method: rank issues by severity (safety/incorrectness first), frequency (how often it occurred), and fix cost. A high-severity hallucination that happens once may outrank a minor UI annoyance that happens often.
  • Practical outcome: an iteration backlog with owners: design (UX), AI design (prompt), engineering (tooling), and policy (governance).

Close the loop by rerunning at least one task after changes. “Iterate to pro” means you can show measurable improvement, not just new screens.

Section 6.6: Portfolio packaging: storyline, artifacts, and before/after evidence

Your portfolio case study should read like a product decision record backed by human evidence. Hiring teams want to see that you can define tasks, test them, measure quality, and iterate responsibly. Structure your story around a single narrative thread: a real user job, a prototype designed to support it, what broke in testing, and how you improved it.

Use a simple storyline: Problem (user job and context), Approach (workflow, UI pattern choice—chat, side panel, inline assist, agentic runner), Prototype (key screens and states for uncertainty/refusal/recovery), Study (participants, tasks, method), Results (metrics + rubric scores), Iterations (what changed and why), Impact (before/after evidence and remaining risks).

Include artifacts that demonstrate you can hand off to engineering: a prompt spec (system goals, constraints, examples), a tool spec (inputs/outputs, error states, confirmation steps), and the scoring rubric. Show at least one “before” transcript and one “after” transcript for the same task, annotated with what improved (groundedness, fewer turns, clearer recovery). If you used citations or tool call logs, include a redacted snippet that proves verification is possible.

  • Evidence to include: task success table, effort and confidence ratings, top failure modes, and one or two screenshots of key UI changes (e.g., adding constraints panel, adding refusal with alternatives, adding verify/cite flow).
  • Common mistake: showcasing only the best outputs. Show a failure and how you addressed it; that signals maturity.

End with a professional reflection: what you would test next (new tasks, edge cases, longer-term trust), and what governance assumptions you made (data access, logging, escalation paths). A “pro” case study doesn’t claim the AI is perfect; it shows you know how to make it reliably useful, measurably better, and responsibly constrained.

Chapter milestones
  • Run moderated or unmoderated task-based tests on the prototype
  • Score results using AI UX metrics and a rubric
  • Prioritize fixes: UX, prompt, tool, or policy changes
  • Publish a portfolio case study with evidence and learnings
Chapter quiz

1. In Chapter 6, what most strongly validates an AI UX copilot design as “pro”?

Show answer
Correct answer: Users can reliably complete real work tasks with acceptable quality, safety, and trust
The chapter emphasizes validation through task completion outcomes and quality/safety/trust—not UI polish or demo performance.

2. When planning a study in this chapter, what kind of participant activity is preferred?

Show answer
Correct answer: Real tasks that reflect actual workflow pressure
The chapter stresses task-based tests over feature tours to see if people can do real work.

3. What is the practical testing loop Chapter 6 says you can run in about a week?

Show answer
Correct answer: Plan a lightweight study, run moderated or unmoderated sessions, score with a rubric, decide root cause, iterate, and publish evidence
The chapter describes a repeatable, lightweight loop: test, measure with consistent metrics, diagnose, iterate, and document evidence.

4. If a task fails during testing, how does Chapter 6 recommend you categorize the likely source before fixing it?

Show answer
Correct answer: Decide whether it’s a UX issue, a prompt issue, a missing tool, or a policy/safety rule
The chapter calls for engineering judgment: identify whether the break is UX, prompt, tooling, or policy/safety, then make the smallest safe change.

5. What core principle should guide how you test a copilot system in Chapter 6?

Show answer
Correct answer: Test the system as a workflow users experience, not as a model or isolated prompt
Participants experience a workflow including interface, latency, ambiguity, refusals, and recovery—so evaluation should target the whole system experience.
More Courses
Edu AI Last
AI Course Assistant
Hi! I'm your AI tutor for this course. Ask me anything — from concept explanations to hands-on examples.