Designer to AI UX Researcher: LLM Chat Flow Testing & Analytics

Career Transitions Into AI — Intermediate

Turn design skills into AI UX research with tested chat flows and metrics.

Intermediate ai-ux · ux-research · conversational-ai · llm-evaluation

Transition from designer to AI UX researcher—using evidence, not vibes

Conversational AI changes the rules of UX. Instead of static screens, you’re designing and researching a system that generates language, reasons with partial context, and can be helpful one moment and confidently wrong the next. This course is a short technical book that shows you how to move from product design into AI UX research by building testable conversational prototypes and evaluating them with a blend of usability methods and LLM-specific analytics.

You’ll work through a practical progression: define the assistant’s job-to-be-done, prototype realistic conversational flows (including error recovery), design research studies that capture real user behavior, and then quantify what’s happening with instrumentation and transcript analysis. The goal is not to turn you into a data scientist; it’s to give you the frameworks, artifacts, and vocabulary an AI product team expects from a UX researcher working on LLM experiences.

What you’ll build (portfolio-ready artifacts)

  • A conversation UX research brief with hypotheses and success criteria
  • A testable chat flow prototype (scripts/branches/fallbacks) plus an evaluation rubric
  • A moderated or unmoderated test plan with tasks, scenarios, and consent language
  • A scoring approach that blends usability signals with LLM quality dimensions
  • An analytics plan (events, taxonomies, outcomes) and a simple dashboard spec
  • An insights-to-actions backlog and a stakeholder-ready research report

How this course approaches LLM evaluation for UX

LLM experiences require more than classic usability testing. You’ll learn how to assess groundedness, helpfulness, safety, and consistency without getting lost in model internals. We’ll treat prompts and system instructions as part of the interface, and we’ll show how to test variability, capture reproducible traces, and interpret failure modes like hallucinations, refusal mismatches, tool errors, and ambiguous user intent.

Who this is for

This course is designed for product designers, UX designers, service designers, and researchers who want to pivot into AI-focused roles. If you already know UX fundamentals but want a clear, end-to-end method for researching chat and assistant experiences—this is your playbook.

How you’ll learn

Each chapter is structured like a book chapter with milestones and sub-sections. You’ll start with a scoped use case and end with a repeatable research and analytics workflow you can apply to new products. Along the way, you’ll practice turning ambiguous conversation quality into measurable criteria, so your recommendations are easier to prioritize and ship.

Ready to begin? Register for free to access the course, or browse all courses to compare learning paths in AI career transitions.

By the end

You’ll be able to prototype conversational flows that are explicitly testable, run studies that capture both user experience and model behavior, and use LLM analytics to measure outcomes and guide iteration. Most importantly, you’ll have a coherent set of artifacts that demonstrate AI UX research capability—ideal for interviews, internal mobility, or client work.

What You Will Learn

  • Translate UX design experience into AI UX research responsibilities and deliverables
  • Prototype conversational flows for LLM-powered assistants using scripts and state diagrams
  • Design a practical research plan for chat experiences (goals, hypotheses, tasks, success criteria)
  • Run moderated and unmoderated tests for conversational UX and capture high-signal evidence
  • Apply LLM evaluation concepts (groundedness, safety, helpfulness) to UX research
  • Build an analytics instrumentation plan for conversations (events, intents, slots, outcomes)
  • Analyze transcripts with qualitative coding plus quantitative metrics (resolution rate, fallbacks, latency)
  • Create an insights-to-actions backlog and communicate results to product and engineering

Requirements

  • Basic UX fundamentals (personas, journeys, usability testing concepts)
  • Comfort reading simple spreadsheets and charts
  • Access to a prototype tool (Figma, FigJam, Miro) and a spreadsheet (Google Sheets/Excel)
  • No programming required (optional: basic SQL familiarity helps)

Chapter 1: From Product Designer to AI UX Researcher

  • Map your transferable skills and AI UX research gaps
  • Define the assistant’s job-to-be-done and user outcomes
  • Create a conversation UX research brief and hypothesis set
  • Set up your research repository and evidence standards
  • Portfolio plan: what artifacts you’ll produce in this course

Chapter 2: Prototype Conversational Flows That Are Testable

  • Choose a use case and define scope boundaries
  • Draft a conversation script with branching and error paths
  • Design prompts/system instructions as a product interface
  • Build a clickable or runnable prototype for testing
  • Create a rubric: expected vs acceptable model behaviors

Chapter 3: Research Design for Conversational AI

  • Turn product goals into measurable conversation success criteria
  • Design tasks and scenarios that elicit realistic user behavior
  • Plan participant recruitment, screening, and ethics
  • Build a test plan: moderation guide, logging, and consent
  • Pilot your study and refine tasks and rubrics

Chapter 4: Run Tests and Capture High-Signal Evidence

  • Run moderated sessions and probe model + user mental models
  • Collect clean transcripts and structured annotations during sessions
  • Execute unmoderated tests at scale with consistent logging
  • Score conversations with a balanced rubric (UX + LLM quality)
  • Synthesize findings into problem statements and opportunity areas

Chapter 5: LLM Analytics—Measure What the Model and UX Are Doing

  • Design an instrumentation plan: events, properties, and taxonomy
  • Compute core conversation metrics and create a weekly dashboard
  • Perform qualitative coding at scale and connect to metrics
  • Diagnose failure modes and prioritize fixes using impact estimates
  • Propose experiments: prompt tweaks, UX changes, and model policies

Chapter 6: Ship Research—Make Recommendations and Build Your Portfolio

  • Create an insights-to-actions backlog with owners and acceptance criteria
  • Write a concise AI UX research report and present it to stakeholders
  • Build a governance loop: monitoring, audits, and regression testing
  • Package a portfolio case study for AI UX research roles
  • Career transition plan: resume bullets, interview stories, and skill signals

Sofia Chen

Conversational AI UX Research Lead

Sofia Chen leads conversational UX research for AI assistants across fintech and healthcare products. She specializes in LLM evaluation, human-in-the-loop testing, and turning qualitative insights into measurable improvements. Her work focuses on safe, reliable chat experiences and research operations that scale.

Chapter 1: From Product Designer to AI UX Researcher

Transitioning from product design to AI UX research is less about abandoning your design craft and more about aiming it at a new kind of interface: one that responds in language, adapts in real time, and sometimes makes things up. In traditional UI, you ship screens and flows; the system’s behavior is mostly bounded by what you designed. In LLM-powered chat, you ship a behavioral envelope: prompts, policies, tools, guardrails, and evaluation criteria that shape what the assistant can do and how reliably it does it.

Your design background already includes the foundations of this work: goal framing, task analysis, usability heuristics, journey mapping, prototyping, and evidence-based iteration. What changes is the research surface area. You’re no longer only validating “can users find and click the right thing?” You’re validating “can users express the right thing, can the assistant interpret it, and does the response remain helpful, grounded, and safe across varied contexts?”

This chapter sets up your new workflow. You’ll map your transferable skills and identify AI UX research gaps, define the assistant’s job-to-be-done (JTBD) and user outcomes, draft a conversation research brief with hypotheses, set up a research repository and evidence standards, and plan a portfolio of artifacts you’ll produce throughout the course.

  • Mindset shift: from pixel-perfect interactions to probabilistic behavior management.
  • Primary skill shift: from UI usability alone to language understanding, evaluation, and instrumentation.
  • Primary deliverable shift: from wireframes to briefs, test plans, scorecards, and analytics-ready schemas.

Most importantly, you’ll practice engineering judgment: knowing what to test first, what to hold constant, what to instrument, and how to interpret messy conversational evidence without overfitting to anecdotes.

Practice note (apply it to each milestone above): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Sections in this chapter
Section 1.1: What changes in UX when the UI talks back
Section 1.2: Key roles: AI UX, conversation design, evaluation, and research ops
Section 1.3: Core concepts: intent, context, grounding, and ambiguity
Section 1.4: Risks and constraints: hallucinations, safety, privacy, compliance
Section 1.5: Research questions that matter for LLM experiences
Section 1.6: Deliverables: briefs, test plans, scorecards, and insight decks

Section 1.1: What changes in UX when the UI talks back

When the UI talks back, the interaction model becomes turn-based, interpretive, and open-ended. The “interface” is not just layout—it’s phrasing, timing, repair strategies, and the assistant’s ability to keep context. Users don’t browse; they negotiate meaning. That’s why AI UX research puts more emphasis on misunderstandings, recovery, and expectation management than many screen-based studies.

As a designer, you likely already run usability tests with tasks and success criteria. In chat, you still do that, but tasks must include variations in how people ask, what they omit, and what they assume the assistant knows. Your prototypes also change: instead of high-fidelity screens, you’ll prototype conversational flows with scripts, branching logic, and state diagrams. A lightweight way to start is to write a “happy path” transcript, then add at least three common deviations: ambiguous request, missing detail, and user correction.
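One way to make this concrete is to keep the scenario set in a small, reviewable structure. Below is an illustrative Python sketch; the field names (id, opening_message, deviation, expected_behavior) are assumptions for the example, not a required schema.

```python
# Illustrative sketch: a test-scenario set for one use case.
# Field names are assumptions for this example, not a required schema.
scenarios = [
    {"id": "happy-path",
     "opening_message": "Submit my March expenses for the client trip.",
     "deviation": None,
     "expected_behavior": "Collects amount/date/category, then confirms."},
    {"id": "ambiguous-request",
     "opening_message": "Can you sort out my expenses?",
     "deviation": "ambiguous",
     "expected_behavior": "Asks one clarifying question before acting."},
    {"id": "missing-detail",
     "opening_message": "File a $120 expense.",
     "deviation": "missing detail",
     "expected_behavior": "Requests the missing date and category."},
    {"id": "user-correction",
     "opening_message": "Actually, that was $210, not $120.",
     "deviation": "correction",
     "expected_behavior": "Updates the amount without restarting the flow."},
]

# Quick sanity check: every deviation type you care about is covered.
covered = {s["deviation"] for s in scenarios}
assert {"ambiguous", "missing detail", "correction"} <= covered
```

Keeping scenarios in one place like this makes it trivial to confirm that each planned deviation type actually appears in your study.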

  • Turn-taking rules: When does the assistant ask a question versus make an assumption?
  • Repair moves: How does the assistant handle “No, that’s not what I meant”?
  • Confidence signals: How does it express uncertainty without sounding incompetent?

Common mistake: treating chat like a search box with a long answer. The research consequence is you’ll measure satisfaction while missing failure modes like goal drift (assistant changes the task), over-verbosity (user abandons), or hidden errors (assistant sounds confident but is wrong). Practical outcome: you’ll begin mapping your transferable skills (task design, facilitation, synthesis) and your gaps (LLM evaluation, tool use, instrumentation) so your learning plan is targeted rather than overwhelming.

Section 1.2: Key roles: AI UX, conversation design, evaluation, and research ops

AI products blur responsibilities that used to sit neatly between design, research, and engineering. Understanding the roles helps you collaborate—and helps you position your career transition. In many teams, “AI UX Researcher” sits at the intersection of conversation design, evaluation, and research operations.

AI UX (product/experience): Defines user outcomes, sets interaction principles, and decides what the assistant should and should not do. This includes aligning on the assistant’s JTBD: a crisp statement of the job users hire the assistant to do, plus the outcomes that indicate success (time saved, fewer errors, higher confidence, reduced handoffs).

Conversation design: Crafts the assistant’s voice, prompts, clarifying questions, and fallback behaviors. Even if you’re not the primary conversation designer, you’ll research whether these behaviors actually work for real users and real inputs.

Evaluation (LLM quality): Operationalizes “good” with rubrics and scorecards—helpfulness, groundedness, safety, and task completion—then runs systematic testing. UX research becomes more quantitative here: you’ll compare variants, define rating scales, and ensure raters interpret criteria consistently.

Research ops: Builds the system that keeps evidence usable: repositories, tagging, privacy standards, and repeatable templates. Without this, chat studies become piles of transcripts no one can trust.

  • Transferable skill: You already write problem statements and align stakeholders—now you’ll write research briefs that include model constraints and evaluation criteria.
  • Gap to close: Learn how prompts, tools, and retrieval change behavior so you can attribute outcomes correctly.

Practical outcome: you’ll define a working “assistant role” (what it’s allowed to do), identify stakeholder owners (product, engineering, legal, data), and decide early what evidence standards you’ll hold for decisions—especially when the model is persuasive but unreliable.

Section 1.3: Core concepts: intent, context, grounding, and ambiguity

To research chat UX effectively, you need a small set of concepts that explain most failures. Start with intent: what the user is trying to accomplish in this turn. Intent is rarely a single label; it’s often layered (e.g., “draft an email” plus “match my tone” plus “use the attached policy”). Next is context: the relevant information the assistant should use—conversation history, user profile, system policies, and external sources (documents, databases, tools).

Grounding is the discipline of tying responses to reliable sources or known truths. In LLM UX, grounding is not an abstract ML concept; it’s a user trust lever. If the assistant gives a policy answer, where did it come from? If it proposes a next step, is it consistent with the user’s constraints? Your research will often reveal that users don’t mind clarification questions, but they do mind confidently wrong answers.

Ambiguity is the default state of language. People omit details (timeframes, formats, audiences) and use overloaded words (“report,” “optimize,” “safe”). Your job is to test whether the assistant detects ambiguity, asks the right clarifying questions, and allows users to correct it without penalty.

  • Prototype artifact: a state diagram showing key states (collect goal, gather constraints, execute, verify, handoff) and transitions (clarify, error, refusal).
  • Script artifact: transcripts for 8–12 test scenarios, covering the happy path, ambiguous requests, user corrections, and edge cases.
  • Common mistake: evaluating responses without specifying what “grounded” means for the product (citations, retrieved snippets, tool results, policy references).

Practical outcome: you’ll define the assistant’s JTBD and user outcomes in a way that is testable, then create a hypothesis set tied to intent handling, context retention, and grounding behavior—so your study isn’t just “do people like it?”
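The state diagram described above can also be written down as data, which makes it easy to review with engineers and to check transcripts against. This is a minimal Python sketch; the state and trigger names follow the artifact bullet, while the structure itself is just one convenient choice, not a prescribed format.

```python
# Illustrative sketch of the assistant's state diagram as an
# adjacency map: state -> {trigger: next_state}.
FLOW = {
    "collect_goal":       {"goal_stated": "gather_constraints",
                           "ambiguous": "clarify"},
    "clarify":            {"clarified": "gather_constraints",
                           "still_unclear": "handoff"},
    "gather_constraints": {"complete": "execute",
                           "user_correction": "gather_constraints"},
    "execute":            {"done": "verify",
                           "tool_error": "handoff",
                           "unsafe_request": "refuse"},
    "verify":             {"confirmed": "handoff",
                           "user_correction": "gather_constraints"},
    "refuse":             {},
    "handoff":            {},
}

def walk(start, triggers):
    """Replay a sequence of triggers and return the visited states."""
    state, path = start, [start]
    for t in triggers:
        state = FLOW[state][t]
        path.append(state)
    return path

# A corrected-then-completed conversation:
print(walk("collect_goal",
           ["goal_stated", "user_correction", "complete", "done", "confirmed"]))
# -> ['collect_goal', 'gather_constraints', 'gather_constraints',
#     'execute', 'verify', 'handoff']
```

Replaying annotated transcripts through a map like this quickly surfaces transitions your diagram forgot, such as a correction arriving during verification.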

Section 1.4: Risks and constraints: hallucinations, safety, privacy, compliance

LLM experiences fail in ways that look like “good UX” until you check the truth. Hallucinations—fabricated facts, citations, or actions—are the classic risk, but they’re not the only one. Safety issues include harmful instructions, biased outputs, harassment, or inappropriate content. Privacy issues include leaking personal data, retaining sensitive information, or encouraging users to paste confidential content. Compliance can include regulated advice (medical, legal, financial), accessibility, record retention, and enterprise data handling rules.

AI UX research must treat these as first-class constraints, not edge cases. That changes how you write tasks and how you store data. For example, if your participants paste real customer data into a prototype, you’ve created a compliance problem. Your research repository needs redaction practices, storage permissions, and clear rules for what can appear in transcripts and screenshots.

  • Safety testing mindset: test for what users will try, not what you wish they would do.
  • Evidence standard: log the full prompt context (system instructions, tool outputs, retrieval snippets) when capturing failures, otherwise you can’t reproduce them.
  • Common mistake: fixing the UI copy when the root cause is missing guardrails, insufficient grounding, or tool limitations.

Practical outcome: you’ll establish evidence standards in your repository (what must be captured for each finding: conversation transcript, model version, settings, sources used, and severity). You’ll also begin defining refusal and escalation expectations: when the assistant should say “I can’t help with that,” when it should offer safer alternatives, and when it should hand off to a human.
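One lightweight way to hold yourself to this evidence standard is a required-fields check before a finding enters the repository. The sketch below is illustrative; the field names mirror the list above but are an example schema, not a prescribed one.

```python
from dataclasses import dataclass, field

# Illustrative sketch: one finding in the research repository.
# Field names mirror the evidence standard above; example schema only.
@dataclass
class Finding:
    transcript: str          # full conversation transcript
    model_version: str       # model/version string in use at capture time
    settings: dict           # temperature, tools enabled, etc.
    sources_used: list       # retrieval snippets / tool outputs
    severity: str            # e.g. "low" | "medium" | "high"
    tags: list = field(default_factory=list)

def is_reproducible(f: Finding) -> bool:
    """A finding is only admissible if it can be reproduced later."""
    return all([f.transcript.strip(), f.model_version,
                f.settings is not None, f.severity])

f = Finding(transcript="User: ... Assistant: ...",
            model_version="assistant-v2 (example)",
            settings={"temperature": 0.2},
            sources_used=[], severity="high")
print(is_reproducible(f))  # True
```

A check like this turns "we should log context" from a good intention into a gate every finding passes before anyone relies on it.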

Section 1.5: Research questions that matter for LLM experiences

Great chat research starts with questions that isolate behavior. Instead of a generic usability goal, define a practical research plan with explicit goals, hypotheses, tasks, and success criteria. Your hypotheses should be falsifiable and tied to user outcomes, not model mystique. For example: “If the assistant asks one clarifying question before drafting, users will report higher confidence and require fewer edits.”

In moderated sessions, you’ll watch how people formulate requests, what they reveal about intent, and whether they notice uncertainty. In unmoderated tests, you’ll prioritize scalability and consistency: tight tasks, clear stopping rules, and structured post-task questions. In both, you’ll capture high-signal evidence—moments that change a product decision—rather than collecting endless transcripts.

  • Understanding: Does the assistant correctly infer intent and constraints from varied phrasing?
  • Context use: Does it remember critical details and avoid re-asking?
  • Groundedness: Are factual claims supported by provided sources or tool outputs?
  • Helpfulness: Does it move the task forward with appropriate depth and next steps?
  • Safety: Does it refuse unsafe requests and provide safe alternatives?
  • Recovery: Can users correct it quickly, and does it adapt?

Common mistake: using satisfaction as the primary success metric. LLMs can be highly satisfying while being wrong. Practical outcome: you’ll draft a conversation UX research brief that includes (1) assistant JTBD and target users, (2) top risks, (3) hypotheses mapped to measurable criteria (completion, correction rate, grounding rate), and (4) a task set that intentionally includes ambiguity and adversarial-but-realistic inputs.
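Measurable criteria like these reduce to simple arithmetic once sessions are annotated. A minimal sketch, assuming a hypothetical per-session annotation format (the keys completed, corrections, claims, and grounded_claims are assumptions for this example):

```python
# Illustrative sketch: compute core metrics from annotated sessions.
# Annotation keys are assumptions for this example.
sessions = [
    {"completed": True,  "corrections": 0, "claims": 4, "grounded_claims": 4},
    {"completed": True,  "corrections": 2, "claims": 5, "grounded_claims": 3},
    {"completed": False, "corrections": 1, "claims": 2, "grounded_claims": 1},
]

n = len(sessions)
completion_rate = sum(s["completed"] for s in sessions) / n
correction_rate = sum(s["corrections"] > 0 for s in sessions) / n
grounding_rate = (sum(s["grounded_claims"] for s in sessions)
                  / sum(s["claims"] for s in sessions))

print(f"completion {completion_rate:.0%}, "
      f"corrections {correction_rate:.0%}, "
      f"grounding {grounding_rate:.0%}")
# -> completion 67%, corrections 67%, grounding 73%
```

Even at spreadsheet scale, computing these three numbers per study variant gives you something falsifiable to attach to each hypothesis.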

Section 1.6: Deliverables: briefs, test plans, scorecards, and insight decks

Your portfolio as an AI UX Researcher is built from artifacts that make chat behavior testable and improvable. The first is a conversation UX research brief: a one- to two-page document that states the assistant’s JTBD, user outcomes, in-scope capabilities, constraints (safety/privacy/compliance), and the hypotheses you’ll test. This is where you translate design instincts into research responsibilities.

Next is a test plan tailored to conversational UX. It specifies study type (moderated vs. unmoderated), participant criteria, tasks (including ambiguous and correction scenarios), what you’ll log, and how you’ll judge success. For LLMs, pair the plan with a scorecard that operationalizes evaluation concepts: groundedness, safety, and helpfulness. Define rating anchors (e.g., 1–5) with concrete examples so different reviewers agree.

To avoid “insights theater,” set up your research repository early with evidence standards: naming conventions, tags (intent, failure mode, severity), and required context (model version, prompt, tools, retrieval sources). This is research ops applied to conversational systems—without it, you can’t track regressions or improvements.

  • Brief: JTBD, outcomes, scope, hypotheses, risks, and decision points.
  • Flow prototype: scripts + state diagram for key paths and recovery.
  • Test plan: tasks, criteria, moderation guide, logging template.
  • Scorecard: groundedness/safety/helpfulness rubrics with anchors.
  • Insight deck: findings framed as decisions, backed by reproducible evidence.
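Rating anchors work best when they are written out verbatim, so two reviewers can be checked for agreement. The sketch below is illustrative: the anchor wording is an example, and the within-one-point agreement check is a cheap proxy, not a formal reliability statistic.

```python
# Illustrative sketch: a 1-5 groundedness anchor set plus an
# agreement check between two raters. Anchor texts are examples.
GROUNDEDNESS_ANCHORS = {
    1: "Claims contradict or ignore the provided sources.",
    2: "Mostly unsupported; sources mentioned but not used.",
    3: "Key claims supported; some unsupported additions.",
    4: "All substantive claims supported; minor paraphrase drift.",
    5: "Every claim traceable to a cited source or tool output.",
}

def within_one(rater_a, rater_b):
    """Fraction of items where two raters agree within one point -
    a cheap proxy for whether the anchors are interpretable."""
    pairs = list(zip(rater_a, rater_b))
    return sum(abs(a - b) <= 1 for a, b in pairs) / len(pairs)

a = [4, 3, 5, 2, 4]
b = [4, 4, 3, 2, 5]
print(within_one(a, b))  # 0.8
```

If agreement is low, rewrite the anchors with concrete examples before blaming the raters; vague anchors are the usual culprit.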

Practical outcome: by the end of this course, your portfolio plan should show a coherent narrative—how you identified skill gaps, defined an assistant role, designed research to validate conversational behavior, and connected findings to both product decisions and analytics instrumentation (events, intents, slots, outcomes) so improvements can be measured after launch.

Chapter milestones
  • Map your transferable skills and AI UX research gaps
  • Define the assistant’s job-to-be-done and user outcomes
  • Create a conversation UX research brief and hypothesis set
  • Set up your research repository and evidence standards
  • Portfolio plan: what artifacts you’ll produce in this course
Chapter quiz

1. In Chapter 1, what is the key mindset shift when moving from traditional UI design to LLM-powered chat experiences?

Correct answer: From pixel-perfect interactions to probabilistic behavior management
LLM chat interfaces behave probabilistically, so the focus shifts to managing and evaluating behavior rather than perfecting static screens.

2. What does Chapter 1 mean by saying you ship a “behavioral envelope” in LLM chat?

Correct answer: A set of prompts, policies, tools, guardrails, and evaluation criteria that shape assistant behavior
In chat, you define constraints and supports (prompts, tools, guardrails, eval criteria) that guide behavior across many contexts.

3. Compared to traditional UI validation, what additional core question does AI UX research need to validate in LLM chat?

Correct answer: Whether users can express the right thing and whether the assistant remains helpful, grounded, and safe across contexts
Chapter 1 emphasizes validating language expression/interpretation and response quality, grounding, and safety—not just navigation and clicking.

4. Which set best matches the chapter’s described primary deliverable shift for this role transition?

Correct answer: From wireframes to briefs, test plans, scorecards, and analytics-ready schemas
The chapter frames AI UX research deliverables as research briefs, hypotheses, test plans, scorecards, and instrumentation/analytics schemas.

5. What is the chapter’s description of “engineering judgment” in AI UX research?

Correct answer: Knowing what to test first, what to hold constant, what to instrument, and how to interpret messy conversational evidence without overfitting to anecdotes
Engineering judgment is framed as prioritizing tests, controlling variables, instrumenting appropriately, and interpreting noisy conversational evidence carefully.

Chapter 2: Prototype Conversational Flows That Are Testable

Designers transitioning into AI UX research often underestimate one thing: you can’t “just test the chat” unless you’ve first made the chat testable. Traditional UX prototypes tend to be stable—screens don’t change their meaning between sessions. LLM experiences are different: wording, context, and subtle prompt changes can produce materially different outcomes. Your job in this chapter is to learn how to prototype conversational flows with enough structure that they can be evaluated, compared, and iterated—without accidentally testing a moving target.

A testable conversational prototype has three characteristics. First, it has explicit scope boundaries (what it will and won’t do). Second, it has a flow model (even if the UI is a simple chat window) that anticipates branching and failure paths. Third, it has an evaluation frame: a rubric describing expected versus acceptable model behavior, including how the assistant should respond when it’s uncertain, unsafe, or out of scope.

As you work through the sections, keep the end deliverable in mind: something you can put in front of participants (or run unmoderated), capture evidence from, and analyze. You are not aiming for a perfect product; you’re aiming for a prototype that produces high-signal learning about user goals, assistant behavior, and interaction breakdowns.

Practice note (apply it to each milestone above): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Sections in this chapter
Section 2.1: Selecting the right use case and defining “done”
Section 2.2: Flow types: guided, open-ended, and mixed-initiative

Section 2.1: Selecting the right use case and defining “done”

The fastest way to build momentum is to pick a use case that is real, repeatable, and bounded. “Help users with HR questions” is too broad. “Help a new hire understand how to submit expenses” is testable because it has a clear goal, known constraints, and observable outcomes. As a former designer, you already know how to define a scenario; the AI twist is that you must define boundaries in language, not just screens.

Start by writing a one-paragraph use-case brief: who the user is, what they’re trying to accomplish, what inputs they can provide, and what the assistant can access (documents, policies, tools, or nothing). Then define scope boundaries explicitly: topics that are out of scope, actions the assistant cannot take, and assumptions about data freshness. This becomes the basis for your system instructions and your test plan.

Define “done” in terms of user outcomes and model behavior. User outcomes are observable: “User correctly identifies reimbursable categories and submits a complete expense report.” Model behavior outcomes are evaluative: “Assistant cites the policy source, asks for missing fields, and avoids inventing reimbursement rules.” If you can’t describe success criteria without using vague words like ‘helpful,’ your prototype isn’t ready to test.

  • Practical checklist: (1) One primary user goal, (2) 3–5 common questions, (3) known reference source or stated lack of one, (4) explicit non-goals, (5) measurable success criteria.
  • Common mistake: picking a use case that requires hidden integrations (account lookups, form submission) before you’ve decided how the assistant should behave without them.

The output of this step is not a full spec; it’s a boundary box that makes later prompt-writing and rubric design possible.
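As a sketch, the boundary box and a "vague words" readiness check can be captured in a few lines of Python; all field names, example values, and the vague-word list are illustrative assumptions, not a standard template.

```python
# A use-case brief as a structured record (illustrative field names/values).
USE_CASE_BRIEF = {
    "user": "new hire, first month, unfamiliar with expense policy",
    "goal": "submit a complete, policy-compliant expense report",
    "inputs": ["receipt amounts", "dates", "expense categories"],
    "assistant_access": ["expense policy document"],  # no account lookups yet
    "non_goals": ["tax advice", "approving reports", "editing submitted reports"],
    "success_criteria": [
        "user identifies reimbursable categories correctly",
        "assistant asks for any missing required field",
        "assistant cites the policy section it relied on",
    ],
}

def is_testable(brief: dict) -> bool:
    """A brief is ready to test when every boundary field is filled in
    and success criteria avoid vague words like 'helpful'."""
    vague = {"helpful", "good", "nice", "intuitive"}
    required = ["user", "goal", "non_goals", "success_criteria"]
    if any(not brief.get(k) for k in required):
        return False
    return not any(
        any(word in criterion.lower().split() for word in vague)
        for criterion in brief["success_criteria"]
    )

print(is_testable(USE_CASE_BRIEF))  # → True
```

The point is not automation; it is that forcing the brief into fields makes missing boundaries visible before you write a single prompt.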

Section 2.2: Flow types: guided, open-ended, and mixed-initiative

Conversational flows are not all the same. If you test an open-ended experience with a guided script (or vice versa), you will misdiagnose failures. Choose a flow type intentionally based on risk, complexity, and the cost of errors.

Guided flows resemble forms: the assistant asks targeted questions to fill required slots (date, amount, category). They are easier to test because paths are predictable and “done” is clear. They are ideal for high-stakes domains or when you need consistent data capture.

Open-ended flows prioritize exploration: users ask anything, the assistant responds and pivots. This is closer to “search plus synthesis,” but it’s harder to test because user paths vary widely. Here, your prototype must emphasize guardrails, source attribution, and graceful uncertainty handling.

Mixed-initiative flows combine both: the user starts with an open question, and the assistant decides when to guide (“To answer that, I need your country and employment type—full-time or contractor?”). Mixed-initiative is often the best product experience, but it requires clear rules for when the assistant should take control.

Draft your conversation script with branching and error paths aligned to your chosen flow type. For guided flows, sketch the required slots and order, then add branches for missing info and user corrections. For open-ended flows, script representative “intents” and add boundaries: what happens when the user requests something the assistant shouldn’t do. For mixed-initiative, define triggers that switch modes (e.g., “unclear request,” “multi-step task,” “policy-sensitive topic”).

  • Engineering judgment tip: the more variable the user input, the more you need stable internal structure—intent categories, slots, or decision rules—to make testing results comparable.
  • Common mistake: treating every failure as a prompt problem when the real issue is the wrong flow type (e.g., forcing open-ended questions into a slot-filling script).

By the end of this section, you should have a script outline with at least three branches: happy path, missing-information path, and out-of-scope or error path.
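The three-branch outline can be sketched as a small node graph you can replay by hand or in code; node names, assistant wording, and routing conditions below are illustrative assumptions for a guided expense flow.

```python
# A branching conversation script as a node graph (illustrative content).
SCRIPT = {
    "start": {
        "assistant": "Hi! I can help you submit an expense report. What would you like to do?",
        "branches": {
            "has_all_info": "happy_path",
            "missing_info": "ask_missing_slot",
            "out_of_scope": "decline_and_redirect",
        },
    },
    "happy_path": {
        "assistant": "Great — here are the steps to submit your report.",
        "branches": {},
    },
    "ask_missing_slot": {
        "assistant": "I need the receipt date before I can continue. What date is on it?",
        "branches": {"has_all_info": "happy_path"},
    },
    "decline_and_redirect": {
        "assistant": "That's outside what I can help with. For tax questions, contact HR.",
        "branches": {},
    },
}

def walk(script, start, conditions):
    """Replay a path through the script given a sequence of observed conditions."""
    path, node = [start], start
    for cond in conditions:
        node = script[node]["branches"][cond]
        path.append(node)
    return path

# The missing-information path recovers into the happy path:
print(walk(SCRIPT, "start", ["missing_info", "has_all_info"]))
# → ['start', 'ask_missing_slot', 'happy_path']
```

Keeping the script as data also makes it easy to confirm every node either branches somewhere or is an intentional endpoint.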

Section 2.3: State, memory, and context windows in UX terms

In screen-based UX, “state” is what the system currently knows: logged-in status, items in a cart, selected filters. In LLM UX, state still exists, but it’s split across different layers that affect testability: conversation history (what’s in the context window), explicit memory (saved preferences), and external state (tools, databases, files).

For prototyping, treat state as a diagram, not a vibe. Create a lightweight state model that lists: (1) what must be captured from the user (slots), (2) what can be inferred but should be confirmed, and (3) what must never be assumed. Then decide where each piece of state lives. If it’s only in the chat history, it may be forgotten when the conversation gets long or when you change the prompt. If it’s in explicit memory, you must design consent and editability (“Forget my location”). If it’s in external state, you must design latency and failure handling (“I couldn’t access your policy portal right now”).

Context windows introduce a practical testing constraint: identical user input can produce different outputs depending on what the model “sees” above it. This is why you should define a test harness conversation prefix: a fixed system message, fixed few-shot examples (if used), and a consistent starting message. When running studies, reset sessions between tasks unless you are explicitly testing carryover memory.

  • Practical deliverable: a state table with columns for “Field,” “Source,” “Persistence,” “How collected,” and “Failure behavior.”
  • Common mistake: letting long, meandering warm-up conversation contaminate later tasks; you end up testing context artifacts instead of the flow.

Thinking in state also helps analytics later: intents and slots are essentially tracked state transitions. The prototype you build now should reflect what you’ll eventually instrument.
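The state table itself can double as a machine-checkable artifact. A minimal sketch, assuming the column set from the deliverable above; field names and failure behaviors are illustrative.

```python
# The state table as records, one per field the assistant tracks.
STATE_TABLE = [
    {"field": "receipt_date", "source": "user", "persistence": "session",
     "how_collected": "asked directly",
     "failure_behavior": "ask again, block calculation"},
    {"field": "expense_category", "source": "user", "persistence": "session",
     "how_collected": "offered as options",
     "failure_behavior": "list valid categories"},
    {"field": "policy_text", "source": "external", "persistence": "none",
     "how_collected": "retrieval",
     "failure_behavior": "say portal unavailable, offer HR link"},
]

def missing_failure_behaviors(table):
    """Fields without a defined failure behavior are gaps in the prototype."""
    return [row["field"] for row in table if not row.get("failure_behavior")]

print(missing_failure_behaviors(STATE_TABLE))  # → []
```

Running a check like this before a study catches the most common prototype gap: a field the assistant needs but has no plan for losing.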

Section 2.4: Writing prompts as UX: tone, constraints, and fallback design

In LLM products, prompts and system instructions are part of the interface. They decide what the assistant prioritizes, how it speaks, and—most importantly—what it refuses to do. A strong prompt is not a clever incantation; it’s a UX specification written for a language model.

Start with a system instruction that defines role, audience, and boundaries: what sources to use, what not to do, and how to respond under uncertainty. Keep it structured: short sections with headings like “Purpose,” “Allowed inputs,” “Safety constraints,” “Output format,” and “When to ask clarifying questions.” Then define tone intentionally. Tone is not just brand voice; it affects perceived competence. For policy and troubleshooting, concise and procedural often tests better than playful.
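A structured system instruction of this shape can be assembled from headed sections so each boundary stays short and prioritized; the headings follow the list above, while all wording is an illustrative assumption.

```python
# Section bodies are illustrative; the headings mirror the structure above.
SECTIONS = {
    "Purpose": "Help new hires submit complete, policy-compliant expense reports.",
    "Allowed inputs": "Receipt amounts, dates, and expense categories.",
    "Safety constraints": "Do not provide tax or legal advice; refer those questions to HR.",
    "Output format": "Short, procedural steps. Cite the policy section title you used.",
    "When to ask clarifying questions": (
        "If a required field is missing or the request is ambiguous, "
        "ask one targeted question before answering."
    ),
}

def build_system_prompt(sections: dict) -> str:
    """Join headed sections into one system instruction."""
    return "\n\n".join(f"## {heading}\n{body}" for heading, body in sections.items())

prompt = build_system_prompt(SECTIONS)
print(prompt.splitlines()[0])  # → ## Purpose
```

Storing the instruction as sections also makes prompt diffs reviewable: a change to "Safety constraints" is visible as such, not buried in a paragraph.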

Constraints must be observable. Instead of “be accurate,” specify behaviors you can evaluate: “If you are not sure, say you’re not sure and ask for X,” “Cite the policy section title,” “Do not provide legal advice; offer to connect the user to HR.” These become rubric criteria later.

Design fallbacks as first-class flow steps. Your assistant should have a planned response for: missing information, ambiguous intent, tool failure, and out-of-scope requests. Write these fallback patterns as reusable snippets so your prototype behaves consistently across branches.

  • Practical tip: include a “clarification ladder”: first ask one targeted question; if still unclear, present 2–3 options; if still unclear, offer escalation or a resource link.
  • Common mistake: adding too many instructions in one long paragraph; models follow structured, prioritized instructions more reliably than dense prose.

When you treat prompts as UX, you also get cleaner research: participants experience a stable interaction policy, and your findings map to specific prompt decisions you can iterate.

Section 2.5: Prototyping options: scripts, simulators, and low-code chat UIs

You have three practical prototype levels, and choosing the right one is an exercise in research judgment: match fidelity to the questions you need answered.

1) Script-only prototypes are fastest. You write the conversation as a branching script (like a play), including user utterances, assistant responses, and annotations for intent/slot/state changes. This is ideal for early-stage concept validation and for aligning stakeholders on scope boundaries. It’s also the easiest way to ensure you’ve covered error paths and escalation before any tool is built.

2) Simulators add controlled variability. A simulator can be as simple as a spreadsheet with “if user says X, respond with Y,” or a lightweight tool that lets a researcher select assistant responses from predefined options. Simulators support moderated testing because you can keep the experience consistent while observing user language and expectations.

3) Low-code or runnable chat UIs (a basic web chat hooked to an LLM, or a prototyping platform with an LLM connector) are best when you must observe real model behavior: hallucinations, instruction-following gaps, and sensitivity to phrasing. If you go runnable, lock down versions: save the prompt, model name, temperature, and any retrieval sources so results are reproducible.

Regardless of prototype type, create a rubric of expected vs acceptable behaviors before you test. “Expected” means ideal product behavior; “acceptable” means still usable without harming trust. For example: expected = “asks for missing receipt date before calculating,” acceptable = “gives steps but flags missing date as required.” The rubric keeps you from overreacting to minor phrasing issues and helps you focus on user impact.
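A rubric of this shape is easy to keep as data next to the prototype. A minimal sketch, with the criterion names and the expected/acceptable/failure scale as illustrative assumptions; a researcher codes each session against it.

```python
# Expected vs acceptable behaviors per criterion (illustrative content).
RUBRIC = {
    "asks_for_missing_receipt_date": {
        "expected": "asks for the missing receipt date before calculating",
        "acceptable": "gives steps but flags the missing date as required",
    },
    "cites_policy_source": {
        "expected": "cites the policy section title",
        "acceptable": "says the answer comes from policy without naming the section",
    },
}

def score_session(observations: dict) -> dict:
    """Map each rubric criterion to 'expected', 'acceptable', or 'failure'.
    `observations` holds one of those labels per criterion, as coded by a researcher."""
    return {criterion: observations.get(criterion, "failure") for criterion in RUBRIC}

coded = {"asks_for_missing_receipt_date": "acceptable"}
print(score_session(coded))
# → {'asks_for_missing_receipt_date': 'acceptable', 'cites_policy_source': 'failure'}
```

Because uncoded criteria default to "failure", the scorer also flags anything the session never exercised, which is itself a finding about task coverage.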

  • Common mistake: jumping to a runnable prototype without a scripted baseline; you end up debugging randomness instead of evaluating the flow.
  • Practical outcome: a prototype package: script + state table + prompt + rubric + test starting context.

The goal is not to impress with polish—it is to generate stable, comparable evidence across participants and iterations.

Section 2.6: Designing edge cases: refusal, uncertainty, and escalation

Edge cases are not rare in conversational UX; they are the moments users remember. In AI UX research, you must prototype them deliberately because they shape trust, safety, and perceived competence. Three categories matter most: refusal (the assistant should not comply), uncertainty (the assistant cannot be sure), and escalation (a human or alternative channel is needed).

Refusal design should be specific and helpful. A good refusal states the boundary, gives a brief reason in user terms, and offers safe alternatives. For example, if asked for medical diagnosis, refuse and suggest speaking to a clinician, plus provide general information disclaimers. Avoid scolding, long policy quotes, or vague “I can’t help with that” responses that force the user to guess the boundary.

Uncertainty handling is where groundedness meets UX. Prototype how the assistant signals confidence: cite sources, show assumptions, or ask a clarifying question instead of guessing. Build rubric criteria that distinguish “transparent uncertainty” (acceptable) from “confident fabrication” (failure). If your use case involves documents, prototype what happens when sources conflict or are missing.

Escalation is a flow, not a dead end. Define triggers (user frustration, repeated failure, high-risk topics), the handoff content (summary of what’s been collected), and the user’s next step (link, contact method, ticket creation). Even in a prototype, write the escalation message and capture what data would be passed along.

  • Practical edge-case set for testing: (1) ambiguous request, (2) missing key slot, (3) out-of-scope topic, (4) unsafe request, (5) conflicting information, (6) user correction, (7) “model is wrong” challenge, (8) tool/data unavailable.
  • Common mistake: treating refusal and escalation as compliance failures; they are often the correct behavior and should be evaluated against clear success criteria.

When you design edge cases into the prototype, your research sessions stop being improvisational. You can measure whether the assistant preserves trust under stress—exactly the capability stakeholders will care about when deciding whether an AI experience is ready to ship.
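The eight-item edge-case set above can be kept as a reusable test matrix that travels with the prototype; the prompts and expected-behavior labels below are illustrative assumptions.

```python
# (category, example prompt, expected assistant behavior) — illustrative.
EDGE_CASES = [
    ("ambiguous_request", "Can you fix my expenses?", "ask_clarifying_question"),
    ("missing_key_slot", "How much will I get back?", "ask_for_missing_field"),
    ("out_of_scope", "What stocks should I buy?", "refuse_with_alternative"),
    ("unsafe_request", "Help me fake a receipt.", "refuse_with_boundary"),
    ("conflicting_info", "Policy A says 30 days, policy B says 60.", "surface_conflict"),
    ("user_correction", "Actually, the date was March 3rd.", "update_state_and_confirm"),
    ("model_is_wrong", "That limit is outdated.", "acknowledge_and_verify"),
    ("tool_unavailable", "Check my open reports.", "explain_failure_and_fallback"),
]

def coverage(cases) -> set:
    """Which of the eight edge-case categories does a test plan cover?"""
    return {category for category, _, _ in cases}

print(len(coverage(EDGE_CASES)))  # → 8
```

A coverage check like this keeps later study iterations honest: dropping a scenario for time shows up as a missing category, not a silent gap.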

Chapter milestones
  • Choose a use case and define scope boundaries
  • Draft a conversation script with branching and error paths
  • Design prompts/system instructions as a product interface
  • Build a clickable or runnable prototype for testing
  • Create a rubric: expected vs acceptable model behaviors
Chapter quiz

1. Why does Chapter 2 argue you can’t “just test the chat” in LLM experiences?

Show answer
Correct answer: Because small wording or context changes can materially change outcomes, making the experience a moving target without structure
LLM behavior can shift with subtle prompt/context changes, so you need a testable prototype to avoid evaluating an unstable target.

2. Which set best describes the three characteristics of a testable conversational prototype in this chapter?

Show answer
Correct answer: Explicit scope boundaries, a flow model that anticipates branches/failures, and an evaluation rubric for expected vs acceptable behavior
The chapter defines testability as clear scope, modeled branching/failure paths, and a rubric to evaluate behavior (including uncertainty/unsafe/out-of-scope responses).

3. What is the main purpose of defining explicit scope boundaries before testing a conversational flow?

Show answer
Correct answer: To clarify what the assistant will and won’t do so tests don’t unintentionally evaluate out-of-scope behavior
Scope boundaries prevent accidental testing of capabilities the prototype is not intended to support.

4. What does the chapter mean by having a “flow model” for a chat prototype?

Show answer
Correct answer: A map of conversation paths that anticipates branching and failure/error paths, even if the UI is just a chat window
A flow model structures the conversation into expected branches and breakdown points so behavior can be evaluated consistently.

5. Which rubric element is emphasized as necessary for evaluating the assistant during testing?

Show answer
Correct answer: A definition of expected vs acceptable behavior, including responses when the assistant is uncertain, unsafe, or out of scope
The rubric frames evaluation and includes how the assistant should respond in uncertain, unsafe, and out-of-scope situations.

Chapter 3: Research Design for Conversational AI

Conversational AI research is not “regular usability testing with a chat box.” You are studying a system that can generate novel responses, follow (or ignore) instructions, and produce failures that look confident. That changes how you define success, how you write tasks, and what evidence counts as “high-signal.” Your job is to turn product intent into measurable conversation outcomes, then choose methods that expose breakdowns in understanding, reasoning, safety, and user trust.

In practice, research design for chat experiences is a workflow: (1) align on product goals and the decisions your study must inform, (2) translate those goals into success criteria and hypotheses, (3) craft scenarios that elicit realistic and risky behavior, (4) recruit and screen the right participants, (5) build a test plan with a moderator guide, logging, and consent, and (6) pilot to refine tasks and rubrics before spending your full sample.

Unlike many UI studies, conversational research benefits from blending qualitative evidence (transcripts, probes, error taxonomies) with quantitative signals (resolution rate, time-to-answer, escalation, deflection). You will also find that engineering judgment matters: instrumentation, data retention, and privacy constraints can shape what you can measure, which in turn shapes your study design.

  • Deliverable mindset: leave each study with a clear decision, a defensible method, and traceable evidence (transcripts + coded outcomes + metrics).
  • LLM-specific lens: evaluate helpfulness, groundedness (is it supported by a source/system?), and safety (does it avoid harm?) alongside classic usability.
  • Pilot early: pilots reveal prompt ambiguity, task impossibility, and missing logs—issues that otherwise waste an entire study.

The sections below walk through concrete choices and templates you can apply immediately.

Practice note for "Turn product goals into measurable conversation success criteria": document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for "Design tasks and scenarios that elicit realistic user behavior": document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for "Plan participant recruitment, screening, and ethics": document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for "Build a test plan: moderation guide, logging, and consent": document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for "Pilot your study and refine tasks and rubrics": document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Sections in this chapter
Section 3.1: Choosing methods: usability test, diary study, log review, benchmark

Method choice starts with the question you must answer and the maturity of the chat experience. For early concepts (scripted flows, state diagrams, prompt prototypes), a moderated usability test is usually the fastest path to actionable insight because you can probe intent, clarify user mental models, and observe how users recover from unexpected responses. For later-stage products with real traffic, log review and benchmarking become essential to quantify performance at scale and to detect long-tail failures you will never see in a lab.

Use this practical rule: if you’re uncertain why failures happen, run moderated sessions; if you’re uncertain how often they happen, analyze logs; if you’re uncertain whether you’re improving, run a benchmark. Diary studies are best when the assistant’s value emerges over time (e.g., workflow coaching, health habit support) or when context matters (mobile, on-the-go, intermittent use). Diaries capture the “messy middle” between sessions: repeated prompts, trust shifts, and when users stop using the tool.
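That rule can be written down literally as a lookup from research question to method; the function form and question phrasings are illustrative assumptions, not part of any framework.

```python
# The "why / how often / whether improving" decision rule as code (illustrative).
def choose_method(uncertain_about: str) -> str:
    return {
        "why failures happen": "moderated usability test",
        "how often failures happen": "log review",
        "whether we are improving": "benchmark",
    }.get(uncertain_about, "clarify the research question first")

print(choose_method("how often failures happen"))  # → log review
```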

  • Usability test: best for task success, breakdown diagnosis, prompt comprehension, and recovery behaviors; requires a moderation guide and consistent rubrics.
  • Diary study: best for longitudinal adoption, habit formation, and real-world contexts; requires lightweight prompts and a clear check-in cadence.
  • Log review: best for intent distribution, drop-off points, escalation/deflection patterns, and safety incident monitoring; requires instrumentation and privacy constraints.
  • Benchmark: best for comparing versions/models/prompts; requires fixed task sets and scoring definitions (including groundedness/safety).

A common mistake is choosing a method based on convenience rather than decision impact. For example, testing only “happy path” prompts in moderated sessions can make an assistant look excellent while hiding that real users ask messy, multi-part questions. Another mistake is running a benchmark without stable scoring criteria; if you can’t define what “resolved” means, you can’t claim improvement. Tie each method to a deliverable: a prioritized breakdown list, a metrics dashboard spec, or a go/no-go decision for launch.

Section 3.2: Defining metrics: resolution, deflection, satisfaction, time-to-answer

Conversation success criteria must be measurable, aligned to product goals, and robust to ambiguity. Start by translating a goal into a user outcome and then into a metric definition. Example: “reduce support load” becomes “users solve issues without human help,” which becomes deflection rate—but only if you also measure whether users were actually helped (otherwise you incentivize premature closures).

Define metrics with operational rules you can apply consistently across transcripts and logs. Recommended core set:

  • Resolution rate: % of conversations where the user’s primary intent is achieved (with clear criteria per task/intent). Include an “unknown” category to avoid forced judgments.
  • Deflection rate: % of cases that avoid human escalation and meet minimum helpfulness (e.g., user confirms, or objective evidence like successful troubleshooting steps completed).
  • Satisfaction: post-task rating (e.g., 1–7) plus a short open-ended “why.” For AI, separate “satisfaction with answer” from “trust in correctness.”
  • Time-to-answer: time from user question to a usable answer (not merely the first token). Track both system latency and user effort (turns-to-resolution).

Engineering judgment matters: instrument what you can reliably capture. “Time-to-answer” is meaningless if your timestamping is inconsistent across clients or if streaming responses complicate the endpoint. Decide whether you measure time to first response, time to final response, or time to user-confirmed resolution—each answers a different question.

Common mistakes: (1) using satisfaction alone (users can be satisfied with a wrong answer), (2) defining “resolution” too loosely (“the model responded”) or too strictly (“perfect answer”), and (3) mixing product outcomes across intents. A practical fix is to write a metric rubric per intent: what counts as resolved, partially resolved, unresolved, and unsafe. In moderated tests, apply the same rubric while coding transcripts; in logs, map rubric outcomes to observable signals (e.g., “clicked source,” “asked follow-up,” “escalated,” “abandoned”). This alignment makes your qualitative findings comparable to analytics later.
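Once sessions are coded with a per-intent rubric, the core metrics become mechanical to compute. A sketch, with field names, example values, and the coding scheme as illustrative assumptions.

```python
from statistics import mean

# Coded session records (illustrative fields and values).
sessions = [
    {"outcome": "resolved", "escalated": False, "helped": True,
     "satisfaction": 6, "seconds_to_usable_answer": 42, "turns": 3},
    {"outcome": "partially_resolved", "escalated": False, "helped": True,
     "satisfaction": 5, "seconds_to_usable_answer": 95, "turns": 6},
    {"outcome": "unresolved", "escalated": True, "helped": False,
     "satisfaction": 2, "seconds_to_usable_answer": 130, "turns": 8},
    {"outcome": "unknown", "escalated": False, "helped": False,
     "satisfaction": 4, "seconds_to_usable_answer": 60, "turns": 4},
]

def metrics(rows):
    # Exclude "unknown" outcomes rather than forcing a judgment.
    known = [r for r in rows if r["outcome"] != "unknown"]
    return {
        "resolution_rate": sum(r["outcome"] == "resolved" for r in known) / len(known),
        # Deflection counts only with no escalation AND evidence of actual help.
        "deflection_rate": sum((not r["escalated"]) and r["helped"] for r in known) / len(known),
        "mean_satisfaction": mean(r["satisfaction"] for r in rows),
        "mean_time_to_usable_answer_s": mean(r["seconds_to_usable_answer"] for r in rows),
        "mean_turns": mean(r["turns"] for r in rows),
    }

results = metrics(sessions)
print(round(results["resolution_rate"], 2))  # → 0.33
print(round(results["deflection_rate"], 2))  # → 0.67
```

Note how the "unknown" category and the helped-AND-not-escalated definition of deflection are encoded directly, so two analysts computing the numbers cannot drift apart.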

Section 3.3: Scenario writing for LLMs: realism, variability, and adversarial cases

Tasks for conversational AI must elicit realistic language, not “researcher-speak.” Start by grounding each scenario in a believable situation, a user role, and a constraint. Then add variability so the assistant is tested against the same underlying intent expressed in different ways. For example, “Change my flight” should appear as: direct request, multi-part request, vague request, and request with missing details.

Write scenarios to test not just success, but failure handling. LLM systems will fail differently than deterministic UIs: they may hallucinate, over-commit, or answer confidently without enough information. Include tasks that force the assistant to ask clarifying questions, cite sources, refuse unsafe requests, or hand off appropriately.

  • Realism: include context users actually have (order number unknown, policy unclear, emotional tone, interruptions).
  • Variability: provide alternate phrasings, slang, and multi-intent prompts; plan how you’ll rotate them across participants to avoid learning effects.
  • Adversarial cases: include prompt injection attempts, policy-violating requests, and misleading context (while staying ethical and safe). These reveal guardrail robustness and user trust impact.

Connect scenarios to hypotheses and success criteria. Example hypothesis: “If the assistant shows a short plan before acting, users will trust it more and require fewer follow-ups.” Your task must create a moment where planning matters, and your measures must capture trust and turns-to-resolution.

A frequent mistake is over-scripting participant language (“Ask the bot: ‘Please assist me with…’”). Instead, give the goal and let participants speak naturally, then capture their original phrasing as data. Another mistake is ignoring edge cases until after launch; for LLM chat, edge cases are often the brand-damaging ones. Build a task set that includes 60–70% common intents, 20–30% messy/ambiguous cases, and 10% adversarial/safety cases, then pilot to confirm the difficulty is realistic.
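One common way to rotate phrasings across participants without learning effects piling up on a single variant is a Latin square: each phrasing appears once in each position. A sketch, with the flight-change phrasings as illustrative assumptions.

```python
# Four phrasings of one underlying intent (illustrative).
PHRASINGS = [
    "I need to change my flight to Thursday.",                        # direct
    "Change my flight and add a bag — can you also check seat 12A?",  # multi-part
    "Something came up with my trip, can you help?",                  # vague
    "Move my flight.",                                                # missing details
]

def latin_square(items):
    """Row i is the item list rotated by i, so each item appears once per position."""
    n = len(items)
    return [[items[(i + j) % n] for j in range(n)] for i in range(n)]

rotation = latin_square(PHRASINGS)
# Participant 0 starts with the direct phrasing, participant 1 with the
# multi-part one, and so on; assign rows to participants in order.
print(rotation[1][0].startswith("Change my flight"))  # → True
```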

Section 3.4: Sampling and screening for domain knowledge and risk

Sampling for conversational AI is about matching intent distribution and risk exposure. Start with your primary user segments, then map which segments produce high-stakes conversations (financial decisions, medical topics, vulnerable populations, regulated workflows). Your recruitment plan should reflect both frequency (common users) and consequence (high-risk users), even if high-risk users are a smaller share.

Create a screener that captures: domain familiarity, frequency of the target task, comfort with chat tools, and constraints that affect language (non-native speakers, accessibility needs). For workplace assistants, screen for role-specific vocabulary and workflow ownership; novices and experts fail differently. Novices reveal onboarding and expectation gaps; experts reveal precision requirements and tolerance for ambiguity.

  • Domain knowledge bands: novice / intermediate / expert; set quotas per band based on product strategy.
  • Risk flags: participants who may disclose sensitive data, rely on the system for critical decisions, or are subject to compliance constraints.
  • Context fit: device, environment, and time pressure; conversational behavior changes when users are multitasking.

Common mistake: recruiting “generic users” for a domain-specific assistant. This produces misleading findings—participants will ask basic questions or behave unlike real users. Another mistake is screening only for demographics and not for task reality (do they actually do the thing?). A practical approach is to require a recent example: “In the last 30 days, how did you handle X?” and ask for the steps they took. This both validates eligibility and gives you language to seed realistic scenarios.

Finally, plan how you will handle participants encountering model errors. In high-stakes domains, you may need guardrails in the prototype, a moderator intervention rule, or a study disclaimer that the assistant is not authoritative. These choices affect who you can ethically recruit and what claims you can make from the data.

Section 3.5: Study materials: moderator script, probes, and debrief prompts

A strong test plan makes chat studies repeatable and debuggable. Your materials should include: consent language, a standardized intro, tasks/scenarios, probes, a logging plan, and a debrief. For moderated sessions, the moderator script is your “control system”—it reduces variation introduced by different facilitators and keeps you from rescuing the product with leading hints.

Structure the moderator guide around moments that matter in chat:

  • Expectation setting: “Treat this like a real assistant; use your own words; you can stop anytime.”
  • Task delivery: provide goal + constraints, not wording; capture the participant’s natural first prompt.
  • Probes: ask about confidence, perceived source, and what they would do next (copy/paste, follow advice, escalate).
  • Recovery prompts: when stuck, ask “What would you try now?” before offering help; this reveals real recovery strategies.

Plan your logging and evidence capture before you run anyone. At minimum, capture: full transcript, timestamps, model/system version, retrieval sources shown, tool calls/actions taken, safety filters triggered, and any user feedback events. Without versioning, you cannot interpret changes across sessions—LLM behavior can drift with prompt tweaks or model updates.
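The minimum evidence list can be enforced with a completeness check before analysis begins; the field names and version strings below are illustrative assumptions, not a standard schema.

```python
# Minimum evidence to capture per session (illustrative field names).
REQUIRED_FIELDS = [
    "transcript", "timestamps", "model_version", "prompt_version",
    "retrieval_sources", "tool_calls", "safety_filter_events", "user_feedback",
]

def missing_evidence(session: dict) -> list:
    """Return the evidence fields a session record failed to capture."""
    return [f for f in REQUIRED_FIELDS if f not in session]

session = {
    "transcript": [],                    # full turn-by-turn record goes here
    "timestamps": [],
    "model_version": "model-x-2024-06",  # hypothetical version string
    "prompt_version": "expenses-v3",     # hypothetical prompt label
    "retrieval_sources": [],
    "tool_calls": [],
    "safety_filter_events": [],
}

print(missing_evidence(session))  # → ['user_feedback']
```

Running the check right after each pilot session, rather than after the full study, is what makes missing logs a fixable problem instead of lost data.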

Debrief prompts should separate usability from trust and safety. Examples: “Which answers did you trust least, and why?” “Where did it feel like it made assumptions?” “Did you notice anything you’d consider risky or inappropriate?” A common mistake is ending with “Any other feedback?” and missing the chance to extract decision-quality insights. End with a structured comparison: what worked, what failed, and what the assistant should do when uncertain (ask, cite, refuse, or escalate). This supports a clear set of design and research deliverables: updated conversation flows, revised prompts, and a scored breakdown list.

Section 3.6: Ethics and privacy: PII handling, sensitive topics, and consent

Conversational data is uniquely sensitive because users naturally paste personal details into chat—often without realizing it. Ethics and privacy are not a legal afterthought; they shape your study design, tooling, and what you can store. Start by classifying what counts as PII in your context (names, emails, IDs, addresses) and what counts as sensitive data (health, finances, minors, employment disputes). Then design your tasks to avoid unnecessary collection and build safeguards for accidental disclosure.

  • PII minimization: write scenarios that use synthetic identifiers; explicitly instruct participants not to enter real account numbers or addresses.
  • Redaction workflow: define who redacts transcripts, how quickly, and how redaction is verified. Automate where possible, but spot-check—LLM-based redaction can miss edge cases.
  • Retention and access: set retention windows and restrict access to raw transcripts; store coded data when feasible.
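A first-pass automated redaction can be as simple as a few substitution patterns, which is exactly why the spot-check step above matters. A sketch; the patterns are illustrative assumptions, not a complete PII list, and the account-ID format is a hypothetical in-house convention.

```python
import re

# Illustrative redaction patterns; a real workflow needs review and spot-checks.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
    "ACCOUNT_ID": re.compile(r"\b[A-Z]{2}\d{6,}\b"),  # assumed in-house ID format
}

def redact(text: str) -> str:
    """Replace each matched span with a labeled placeholder."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(redact("My email is jane.doe@example.com and my ID is AB1234567."))
# → My email is [EMAIL] and my ID is [ACCOUNT_ID].
```

Placeholders that name the redacted type keep transcripts analyzable: you can still code "user pasted an account ID" without retaining the value.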

Consent must be specific to conversational capture: participants should understand that free-form text may be recorded, that the system may generate unexpected content, and that they can skip any prompt. For studies involving safety or adversarial scenarios, avoid exposing participants to harmful content; keep adversarial tests bounded (e.g., testing refusal behavior with mild policy-violating requests) and pre-approve scripts with stakeholders.

Common mistakes include collecting real customer data in prototypes without proper safeguards, sharing raw transcripts broadly in slide decks, and forgetting that “model training” implications differ by vendor and configuration. Coordinate with engineering and legal early: verify whether chats are sent to third-party model providers, whether data is retained, and how to disable training on user inputs. Build these constraints into your research plan and instrumentation so your evidence is both actionable and ethically obtained.

Finally, pilot your consent and privacy instructions the same way you pilot tasks. If multiple participants still paste real data, your instructions or scenario design is failing—and you should fix that before scaling the study.

Chapter milestones
  • Turn product goals into measurable conversation success criteria
  • Design tasks and scenarios that elicit realistic user behavior
  • Plan participant recruitment, screening, and ethics
  • Build a test plan: moderation guide, logging, and consent
  • Pilot your study and refine tasks and rubrics
Chapter quiz

1. Why does conversational AI research require a different approach than “regular usability testing with a chat box”?

Show answer
Correct answer: Because the system can generate novel responses and confident-looking failures, changing what counts as success and evidence
LLMs can follow or ignore instructions and produce plausible but wrong outputs, so success criteria and evidence must account for understanding, reasoning, safety, and trust.

2. Which sequence best reflects the chapter’s recommended workflow for research design in chat experiences?

Show answer
Correct answer: Align on product goals and decisions, translate into success criteria/hypotheses, craft scenarios, recruit/screen, build a test plan, then pilot to refine
The chapter lays out an ordered workflow that starts with product alignment and ends with piloting to refine tasks and rubrics before running the full study.

3. What is the primary purpose of turning product intent into measurable conversation outcomes?

Show answer
Correct answer: To ensure the study produces traceable evidence that can support a clear decision
Measurable outcomes connect product goals to defensible methods and evidence (coded outcomes + metrics + transcripts) that inform decisions.

4. Which approach best matches the chapter’s guidance on what evidence should be collected in conversational AI studies?

Show answer
Correct answer: Blend qualitative evidence (transcripts, probes, error taxonomies) with quantitative signals (resolution rate, time-to-answer, escalation/deflection)
The chapter emphasizes combining high-signal qualitative data with quantitative indicators to capture both breakdowns and performance.

5. Why does the chapter emphasize piloting early in conversational AI research?

Show answer
Correct answer: Pilots can reveal prompt ambiguity, impossible tasks, and missing logs before you waste the full sample
Early pilots surface issues in tasks, prompts, and instrumentation that can otherwise invalidate or waste an entire study.

Chapter 4: Run Tests and Capture High-Signal Evidence

As a designer moving into AI UX research, your biggest mindset shift is that “the interface” is partially a probabilistic system. A chat flow can look perfect in a script and still fail when the model improvises, misreads context, or confidently answers beyond the product’s real capabilities. This chapter shows how to run moderated and unmoderated tests for LLM chat experiences and, more importantly, how to capture evidence that engineering and product teams can act on.

High-signal evidence has three qualities: it is reproducible (or clearly bounded by conditions), it is attributable (you can point to the prompt, model settings, and user action that caused it), and it is decision-ready (it maps to a fix: prompt change, policy, UI affordance, retrieval update, or guardrail). You will learn how to probe both the user’s mental model and the model’s behavior, collect clean transcripts with structured annotations, score conversations with a balanced rubric, and synthesize findings into problem statements and opportunity areas.

Throughout, treat your session outputs like a “trace” rather than a narrative. A good AI UX researcher can answer: What did the user ask? What did the system see? What did it retrieve? What did it respond? What did the user believe happened? And what changed after the repair? When you can connect those dots, your research shifts from anecdote to evidence.

Practice note for Run moderated sessions and probe model + user mental models: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Collect clean transcripts and structured annotations during sessions: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Execute unmoderated tests at scale with consistent logging: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Score conversations with a balanced rubric (UX + LLM quality): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Synthesize findings into problem statements and opportunity areas: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Sections in this chapter
Section 4.1: Moderation techniques for conversational interfaces

Moderating LLM chat tests differs from classic usability testing because the participant is co-creating the interface in real time. Your job is to keep the session structured while leaving room for natural language exploration. Start with a pre-brief that sets expectations: what the assistant can/can’t do, whether it can be wrong, and what “good” looks like (e.g., accurate steps, safe boundaries, transparent uncertainty). Then run tasks that mirror real intents, not UI steps. A task prompt like “Plan my first week using this tool” produces richer behavior than “Click the planner.”

Use a two-layer script: (1) a user-facing task list and (2) an internal moderator checklist. The checklist should include: model configuration (model version, system prompt, tools enabled), required logging, and probes you will use consistently. Example probes: “What do you think the assistant knows about you right now?” (mental model), “Why did you trust that answer?” (overtrust), “What would you do next if you couldn’t ask again?” (decision impact), and “If you rewrote your question, what would you change?” (repair behavior).

In moderated sessions, avoid “teaching the model” on behalf of the participant. If the participant asks, “What should I type?”, redirect: “What would you naturally ask?” Likewise, don’t rescue the conversation too early. Let the breakdown occur, then observe repair. Your goal is not to showcase the best-case chat, but to expose the edges where the product needs better prompts, UI affordances, or guardrails.

  • Practical setup: record screen, capture raw chat logs, and time-stamp key moments.
  • Consistency tip: keep a standard opening and closing so comparisons across sessions are meaningful.
  • Common mistake: changing task framing mid-session, which makes transcripts hard to compare and score.
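The configuration items on the internal moderator checklist can be captured as a structured record saved alongside each recording, so every transcript stays attributable. A minimal sketch; all field names and values here are illustrative, not a prescribed schema:

```python
from dataclasses import asdict, dataclass, field
import json

@dataclass
class SessionConfig:
    """Illustrative record of the conditions behind one moderated session."""
    session_id: str
    model_version: str           # the deployed model build under test
    system_prompt_id: str        # reference to the prompt version, not the raw text
    temperature: float
    tools_enabled: list = field(default_factory=list)
    probes_used: list = field(default_factory=list)

config = SessionConfig(
    session_id="S07",
    model_version="assistant-v3",
    system_prompt_id="prompt_2024_06_a",
    temperature=0.7,
    tools_enabled=["search", "calendar"],
    probes_used=["mental_model", "overtrust", "repair"],
)

# Persist next to the screen recording and chat log for traceability.
print(json.dumps(asdict(config), indent=2))
```

Saving this once per session makes cross-session comparisons (and later bug reports to engineering) far cheaper.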
Section 4.2: Observing breakdowns: confusion, overtrust, and repair strategies

Breakdowns in conversational UX often look polite: the participant continues the chat even when they’re confused, or they accept a wrong answer because it sounds confident. Watch for subtle signals: long pauses before sending, repeated rephrasing, “Okay…” without action, copying text elsewhere, or switching from goal-seeking to system-debugging (“Why are you saying that?”). These behaviors indicate a mismatch between the user’s mental model and the system’s actual capabilities.

Classify breakdowns into at least three buckets during observation: confusion (user doesn’t know what to ask or what happened), overtrust (user believes incorrect or unsafe output), and repair (user or system attempts to recover). Repair strategies are a core research target because they are designable. Examples include: providing a clarification question, offering structured options, exposing sources, admitting uncertainty, or suggesting a safer alternative action.

Probe the model’s “mental model” indirectly by asking what inputs it had and what assumptions it made. In tool-using assistants, ask engineering for traces (tool calls, retrieval results). Then, during sessions, note when the assistant behaves as if it has memory, permissions, or authority it does not actually have. A classic overtrust scenario: the model gives confident policy advice without citations, and the participant says, “Great, I’ll do that.” Your evidence should capture both the unsafe answer and the user’s intention to act.

Engineering judgment matters when deciding whether the fix is conversational (prompting and guardrails), informational (better retrieval/grounding), or interactional (UI that sets expectations). If users repeatedly ask the same clarifying question, that’s often an interaction design gap: the system should proactively surface constraints or ask for missing slots rather than forcing the user to guess the “right prompt.”
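The three observation buckets can be kept distinct with a lightweight coding scheme during note-taking. A sketch assuming notes are logged per turn; the example evidence strings are illustrative:

```python
from enum import Enum

class Breakdown(Enum):
    CONFUSION = "confusion"   # user unsure what to ask or what just happened
    OVERTRUST = "overtrust"   # user accepts incorrect or unsafe output
    REPAIR = "repair"         # user or system attempts to recover

# Illustrative observation log: (turn number, code, evidence note).
observations = [
    (4, Breakdown.CONFUSION, "long pause, then rephrased the question twice"),
    (6, Breakdown.OVERTRUST, "accepted uncited policy advice: 'Great, I'll do that'"),
    (7, Breakdown.REPAIR, "user rewrote the question with an explicit account type"),
]

# Per-bucket counts make breakdown frequency comparable across sessions.
counts = {code: 0 for code in Breakdown}
for _, code, _ in observations:
    counts[code] += 1
```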

Section 4.3: Annotation frameworks: intent, outcome, issue type, severity

Clean transcripts are necessary but not sufficient. To make sessions comparable and analyzable, annotate them with a lightweight framework that can scale from moderated notes to unmoderated logs. A practical minimum schema includes: intent (what the user is trying to do), outcome (what actually happened), issue type (why it failed or succeeded), and severity (how much it matters).

Define intents at the level your product roadmap cares about, not at the level of linguistic phrasing. For example, “request refund status,” “choose plan,” “troubleshoot setup,” “summarize policy,” “draft message.” Outcomes should be categorical and mutually exclusive where possible: success, partial success, failure, unsafe, needs human handoff. Then add issue types that bridge UX and LLM quality, such as: missing context request, incorrect factual claim, hallucinated capability, refusal when allowed, allowed when should refuse, tone mismatch, or broken tool invocation.

Severity is where you apply judgment. Use a 0–3 or 1–4 scale with explicit anchors. Example: 1 (cosmetic) tone is slightly off but user proceeds; 2 (friction) requires rephrase or extra turn; 3 (critical) blocks task or causes wrong action; 4 (harm) safety, legal, privacy, or medical/financial risk. Capture severity per issue, not per conversation, and then compute rollups (e.g., “critical issues per 10 conversations”).

  • Practical workflow: annotate in a table with columns for turn range, intent, issue type, severity, and evidence link.
  • Common mistake: mixing “what happened” with “why” in one label (e.g., “confusing answer”)—separate outcome from cause.
  • Practical outcome: annotations become your bridge to analytics instrumentation (events, intents, slots, outcomes).
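The schema and severity rollups described above can be sketched as a flat annotation table; the records and the "critical" threshold below are illustrative:

```python
# Minimal annotation records: one row per issue, not per conversation.
annotations = [
    {"conv": "C01", "turns": "3-5", "intent": "request refund status",
     "issue": "missing context request", "severity": 2},
    {"conv": "C01", "turns": "6-6", "intent": "request refund status",
     "issue": "incorrect factual claim", "severity": 3},
    {"conv": "C02", "turns": "2-4", "intent": "troubleshoot setup",
     "issue": "hallucinated capability", "severity": 3},
]

n_conversations = len({a["conv"] for a in annotations})
# Treat severity >= 3 (critical/harm) as the reporting threshold.
critical = sum(1 for a in annotations if a["severity"] >= 3)

# Rollup such as "critical issues per 10 conversations".
critical_per_10 = 10 * critical / n_conversations
```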
Section 4.4: Rubrics and scorecards: helpfulness, correctness, safety, tone

A balanced rubric prevents teams from optimizing for “pleasant chat” while missing correctness and safety. Use a scorecard that combines UX quality with LLM evaluation concepts. At minimum, score: helpfulness (does it move the user forward), correctness (is it accurate and grounded), safety (does it avoid harmful guidance and respect policy), and tone (is it appropriate for context and brand).

Define each dimension with concrete anchors and examples. For correctness, specify what “grounded” means for your system: citations to trusted docs, tool results, or explicit uncertainty when data is missing. For safety, include both content safety (harmful instructions) and product safety (privacy, data leakage, unauthorized actions). Tone should not be purely aesthetic; it affects trust calibration. Overconfident tone with low grounding is a measurable risk.

Score at the conversation level and at the turn level when needed. Conversation-level scores help you compare variants; turn-level scores tell you where the failure starts (often the first missed clarification question). A practical approach is a 1–5 scale per dimension plus a binary “would you ship?” gate. Encourage evaluators to cite evidence: “Correctness = 2 because it claimed feature X exists; no tool call; user attempted action and failed.”

Common mistakes include: (1) treating helpfulness as a proxy for correctness (“it gave steps, therefore it’s good”), (2) ignoring refusal quality (a safe refusal can still be unhelpful if it lacks alternatives), and (3) using a rubric without calibration. Run a short calibration session where multiple raters score the same transcripts and reconcile differences. This is the fastest way to raise signal and reduce debate later.
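A scorecard along these lines can be enforced in code so raters score every dimension and cite evidence for low scores. A sketch; the helper and its validation rules are illustrative, not a standard tool:

```python
DIMENSIONS = ("helpfulness", "correctness", "safety", "tone")

def score_conversation(scores: dict, would_ship: bool, evidence: dict) -> dict:
    """Validate a 1-5 score per dimension plus the binary ship gate."""
    assert set(scores) == set(DIMENSIONS), "score every dimension"
    assert all(1 <= v <= 5 for v in scores.values()), "scores are 1-5"
    # Require cited evidence for any low score so debates stay grounded.
    for dim, v in scores.items():
        if v <= 2 and dim not in evidence:
            raise ValueError(f"low {dim} score needs evidence")
    return {"scores": scores, "would_ship": would_ship, "evidence": evidence}

card = score_conversation(
    scores={"helpfulness": 4, "correctness": 2, "safety": 5, "tone": 4},
    would_ship=False,
    evidence={"correctness": "claimed feature X exists; no tool call; user action failed"},
)
```

Forcing the evidence requirement at entry time is cheaper than reconciling unexplained scores during calibration.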

Section 4.5: Handling variability: temperature, randomness, and reproducibility

LLMs are variable by design. If you don’t control variability, your findings can be dismissed as “one weird run.” Treat model settings as experimental conditions. Record model version, system prompt, temperature, top-p, tool configuration, and any retrieval parameters. When possible, fix a random seed (some platforms support it) or run multiple replicates per task to estimate variability.

For moderated studies, variability can be a feature: it reveals how robust the experience is across natural phrasing. But you still need reproducibility for debugging. When a critical failure occurs, immediately capture the exact conversation state: the full transcript, any hidden system messages, the tool outputs, and the retrieved documents. If the system uses memory, log what memory entries were available. Without this, engineering cannot reproduce, and your evidence loses power.

For unmoderated tests at scale, standardize everything except the variable you’re testing. Use consistent task prompts, consistent success criteria, and consistent logging. Consider an A/B approach where only one dimension changes (e.g., a new system prompt or a new safety policy). If temperature is high, require more runs per task and report distributions (median score, worst-case, and failure rate). Worst-case behavior is often what harms users, even if averages look fine.

  • Practical rule: if a behavior is safety-critical, evaluate at low temperature and also stress-test at realistic temperatures.
  • Common mistake: mixing multiple changes (prompt + UI + retrieval) and then being unable to attribute improvements.
  • Practical outcome: your research report should include “test conditions” like a lab notebook.
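Reporting distributions rather than averages, as recommended above, can be sketched like this; the replicate scores and failure threshold are illustrative:

```python
from statistics import median

# Rubric scores (1-5) for one task replicated 8 times at realistic temperature.
replicates = [5, 4, 5, 2, 5, 4, 1, 5]

report = {
    "n_runs": len(replicates),
    "median_score": median(replicates),
    "worst_case": min(replicates),
    # Failure = score at or below a pre-agreed threshold, here 2.
    "failure_rate": sum(1 for s in replicates if s <= 2) / len(replicates),
}
```

Here the median looks healthy (4.5) while the worst case (1) and failure rate (25%) tell the story that matters for users.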
Section 4.6: Turning sessions into evidence: clips, quotes, and traceability

Stakeholders act on what they can see. Your synthesis should convert raw sessions into traceable evidence: short clips, crisp quotes, annotated transcripts, and a clear path from observation to recommendation. Create an evidence bundle per finding: (1) a one-sentence problem statement, (2) 1–3 artifacts (clip, transcript excerpt, screenshot), (3) annotation tags (intent, issue type, severity), (4) rubric scores, and (5) a proposed opportunity area with likely fix owners (prompting, retrieval, UI, policy, tooling).

Write problem statements that separate user goal from system failure. Example: “Users seeking setup guidance cannot complete task because the assistant assumes permissions it doesn’t have and does not offer a handoff.” Then quantify where possible: “5/8 participants encountered this; 3 attempted the suggested action and failed.” This is how you translate UX design experience into AI UX research deliverables that product teams trust.

Traceability is your safeguard against opinion wars. Every recommendation should link back to a specific turn range and condition. If you suggest changing the system prompt, cite the exact moment where the assistant failed to ask for missing slots. If you propose a UI change (e.g., suggested prompts or a ‘What I can do’ panel), link it to observed confusion and rephrasing loops. If you propose safety guardrails, include the overtrust evidence: the user’s stated intention to follow harmful advice.

Finally, synthesize into opportunity areas, not just issues. Opportunity areas combine frequency, severity, and fix leverage. For example: “Improve clarification strategy for multi-slot intents” (reduces friction broadly) or “Add grounding citations for policy answers” (reduces overtrust risk). Your chapter deliverable is a set of findings that are reproducible, annotated, scored, and easy to route into engineering backlogs.
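One evidence bundle per finding can be templated as a simple record; every identifier, count, and filename below is illustrative:

```python
finding = {
    "problem_statement": (
        "Users seeking setup guidance cannot complete the task because the "
        "assistant assumes permissions it doesn't have and offers no handoff."
    ),
    "artifacts": ["clip_S03_02m14s.mp4", "transcript_S03_turns_5-9.txt"],
    "tags": {"intent": "troubleshoot setup",
             "issue": "hallucinated capability", "severity": 3},
    "rubric": {"helpfulness": 2, "correctness": 2, "safety": 4, "tone": 4},
    "frequency": "5/8 participants; 3 attempted the suggested action and failed",
    "opportunity": "Proactively state permission limits and offer a human handoff",
    "fix_owners": ["prompting", "ui"],
}
```

A fixed template like this keeps findings routable: each field maps to a backlog column or report section.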

Chapter milestones
  • Run moderated sessions and probe model + user mental models
  • Collect clean transcripts and structured annotations during sessions
  • Execute unmoderated tests at scale with consistent logging
  • Score conversations with a balanced rubric (UX + LLM quality)
  • Synthesize findings into problem statements and opportunity areas
Chapter quiz

1. Why can a chat flow that looks perfect in a script still fail in real use?

Show answer
Correct answer: Because the interface includes a probabilistic system that may improvise, misread context, or overclaim capabilities
The chapter emphasizes that the interface is partly probabilistic, so model behavior can diverge from scripted expectations.

2. Which combination best describes the three qualities of high-signal evidence?

Show answer
Correct answer: Reproducible (or bounded), attributable, and decision-ready
High-signal evidence is defined as reproducible/bounded, attributable to specific causes, and mapped to decisions/fixes.

3. What does it mean to treat session outputs like a “trace” rather than a narrative?

Show answer
Correct answer: Focus on a chronological chain of observable inputs/outputs (user ask, system context, retrieval, response) and what changed after repair
A trace connects user actions, what the system saw/did (including retrieval), the response, user belief, and post-repair changes.

4. In a moderated session for an LLM chat experience, what is a key goal beyond observing task success?

Show answer
Correct answer: Probe both the user’s mental model and the model’s behavior
The chapter highlights probing user mental models and the model’s behavior during moderated sessions.

5. What is the purpose of scoring conversations with a balanced rubric in this chapter’s approach?

Show answer
Correct answer: To evaluate both UX quality and LLM quality so findings translate into actionable fixes
A balanced rubric (UX + LLM quality) supports decision-ready evidence that maps to concrete changes.

Chapter 5: LLM Analytics—Measure What the Model and UX Are Doing

Chat UX research becomes credible (and repeatable) when you can show evidence at scale: what users attempted, what the assistant did, where it failed, and what improved after a change. In traditional product UX, analytics often answers “what happened.” In LLM-powered experiences, you also need to answer “why it happened” because the system’s behavior emerges from prompts, tools, retrieval, policies, and UI constraints.

This chapter gives you a practical workflow to move from raw transcripts to decisions. You’ll design an instrumentation plan (events + properties), define a taxonomy for intent and outcomes, compute core conversation metrics for a weekly dashboard, and connect qualitative coding to quantitative monitoring. Then you’ll diagnose failure modes (hallucination, refusal mismatch, tool errors) and propose experiments with guardrails and success thresholds.

The key mindset shift for a designer transitioning into AI UX research: you are no longer just measuring interface interactions. You are measuring a socio-technical loop—user intent, conversation design, model behavior, and external systems—all within the same “session.”

Practice note for Design an instrumentation plan: events, properties, and taxonomy: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Compute core conversation metrics and create a weekly dashboard: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Perform qualitative coding at scale and connect to metrics: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Diagnose failure modes and prioritize fixes using impact estimates: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Propose experiments: prompt tweaks, UX changes, and model policies: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Sections in this chapter
Section 5.1: Analytics foundations for chat: sessionization and identifiers

If you can’t reliably reconstruct “a conversation,” every downstream metric becomes noisy. Start with sessionization: define what counts as a session, how long a session can be idle, and when a new session should begin. A typical rule is a new session after 30 minutes of inactivity, but you should align to your product context (e.g., enterprise workflows may have longer gaps; consumer chat may need shorter).

Next, establish identifiers. You need at least: user_id (or hashed), session_id, conversation_id (if your system supports threaded conversations), turn_id (monotonic within a session), and message_id. If your assistant uses tools or retrieval, also include tool_call_id and retrieval_request_id so you can trace errors to their source.

  • Event granularity: log both UI events (send message, click suggestion) and model pipeline events (prompt built, tool called, tool returned, response streamed, response shown).
  • Canonical timestamps: store server-side timestamps for sequencing and latency; client timestamps can supplement them but are often skewed by clock drift.
  • Privacy-aware text handling: decide where raw text is stored, how it’s redacted, and which analytics tables contain only derived features (intent labels, embeddings, safety flags).

Common mistake: logging only “message sent” and “assistant responded.” That hides whether the issue is UX (user confusion), orchestration (wrong tool), retrieval (no results), or generation (hallucination). A practical outcome of this section is a one-page instrumentation spec listing events, required properties, and ownership (who implements, who validates).
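The inactivity rule can be implemented as a small sessionizer over time-ordered events for one user; the 30-minute threshold and field names are illustrative and should follow your product context:

```python
from datetime import datetime, timedelta

IDLE_LIMIT = timedelta(minutes=30)  # new session after 30 min of inactivity

def sessionize(events):
    """Assign a session_id to time-ordered events for one user."""
    session_n, last_ts = 0, None
    out = []
    for ev in sorted(events, key=lambda e: e["ts"]):
        if last_ts is None or ev["ts"] - last_ts > IDLE_LIMIT:
            session_n += 1
        last_ts = ev["ts"]
        out.append({**ev, "session_id": f"s{session_n}"})
    return out

events = [
    {"ts": datetime(2024, 6, 1, 9, 0), "event": "message_sent"},
    {"ts": datetime(2024, 6, 1, 9, 10), "event": "message_sent"},
    {"ts": datetime(2024, 6, 1, 10, 0), "event": "message_sent"},  # 50 min gap
]
labeled = sessionize(events)
# The first two events share one session; the 50-minute gap starts a new one.
```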

Section 5.2: Taxonomy design: intents, topics, outcomes, and escalation types

A taxonomy makes conversations measurable. Without it, you can count turns but not progress. Design your taxonomy like an interface: it must be usable by analysts, labelers, and dashboards. Start simple and evolve. You typically need four layers: intent (what the user is trying to do), topic/domain (what it’s about), outcome (what happened), and escalation type (how/why it left automation).

Intents should map to user goals and system capabilities (e.g., “reset password,” “compare plans,” “draft email,” “find policy”). Avoid mixing intent with sentiment (“angry user”) or with channel (“mobile”). Topics help route to owners (billing vs. onboarding) and enable trend monitoring. Outcomes should be mutually exclusive at the session or task level: completed, abandoned, escalated, blocked by policy, tool failed, or unresolved.

  • Escalation types: user-requested human, system-initiated (low confidence), compliance-required, repeated failure, or out-of-scope.
  • Slots/entities: capture key parameters (product name, date range, account type). These become analytic dimensions for “where it breaks.”
  • Taxonomy governance: define a change process, versioning, and “other/unknown” handling so your labels remain stable across quarters.

Engineering judgment matters: don’t overfit the taxonomy to the model. Your labels should reflect user intent even when the model misinterprets it; otherwise, analytics will mask model errors by “agreeing” with the assistant. The practical deliverable is a taxonomy document with definitions, examples, edge cases, and a mapping to events/properties so each conversation can be tagged consistently.

Section 5.3: Key metrics: containment, fallback rate, retries, latency, CSAT proxies

Once you can reconstruct sessions and label outcomes, compute a small set of core metrics and review them weekly. Start with metrics that expose both UX friction and model reliability. A good dashboard answers: Are users getting what they need? Where are they getting stuck? Did performance change after releases?

  • Containment rate: % of sessions resolved without escalation. Pair it with quality checks; high containment can be bad if users give up.
  • Fallback rate: % of turns where the assistant admits uncertainty (“I can’t help”) or triggers a fallback state. Track by intent/topic.
  • Retry rate: % of sessions with rephrases, repeated questions, or “No, that’s not what I meant.” This is a strong UX+LLM signal.
  • Turn count to resolution: median turns for successful outcomes; rising trends often indicate prompt drift or confusing UI affordances.
  • Latency: p50/p95 end-to-end response time and tool latency separately. Users experience the slowest component, not your average.
  • CSAT proxies: thumbs up/down, “thanks” signals, rage clicks, immediate abandonment, or post-chat survey completion. Use multiple proxies to reduce bias.

Common mistakes: (1) reporting a single blended containment number that hides failing intents; (2) ignoring denominator definitions (per turn vs. per session); (3) treating thumbs-down as ground truth without reading transcripts. Practical outcome: a weekly dashboard with a stable set of tiles, each sliced by intent/topic, plus release annotations so you can correlate changes with deployments.

Also add “health” counters for analytics itself: % of sessions missing session_id, % of turns missing taxonomy labels, and % of tool calls without status codes. If your instrumentation degrades, your conclusions will too.
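Several of the core metrics can be computed from per-session rollups like so; the sample sessions and the per-session denominators are illustrative:

```python
from statistics import median, quantiles

sessions = [  # per-session rollups produced by the instrumentation pipeline
    {"outcome": "completed", "retries": 0, "turns": 3, "latency_ms": [800, 950, 700]},
    {"outcome": "completed", "retries": 2, "turns": 6, "latency_ms": [1200, 3100, 900]},
    {"outcome": "escalated", "retries": 1, "turns": 5, "latency_ms": [700, 4200]},
    {"outcome": "abandoned", "retries": 3, "turns": 4, "latency_ms": [2500]},
]

n = len(sessions)
containment = sum(s["outcome"] != "escalated" for s in sessions) / n
retry_rate = sum(s["retries"] > 0 for s in sessions) / n  # per session, not per turn
turns_to_resolution = median(s["turns"] for s in sessions if s["outcome"] == "completed")

all_latencies = sorted(l for s in sessions for l in s["latency_ms"])
p50 = median(all_latencies)
p95 = quantiles(all_latencies, n=20, method="inclusive")[-1]
```

Note that containment counts abandonment as "contained" here, which is exactly the trap the section warns about; pair it with quality checks before reporting.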

Section 5.4: Transcript analysis: clustering, themes, and pattern libraries

Quant metrics tell you where problems concentrate; transcript analysis tells you what to fix. The goal is qualitative coding at scale without losing rigor. Use a two-pass approach: (1) broad clustering to find recurring patterns, then (2) focused coding to build a durable pattern library.

For clustering, you can embed user turns (or whole sessions) and group them by similarity. Combine embedding clusters with taxonomy slices (“high retry + billing intent”) to avoid clusters that are semantically neat but product-irrelevant. Then sample within each cluster to read transcripts and name themes in plain language.

  • Theme examples: “User lacks required account identifier,” “Assistant answers without checking policy date,” “Tool returns empty but assistant fills in details,” “UI suggestion chips lead to dead ends.”
  • Pattern library fields: pattern name, trigger signals (events/phrases), example transcripts, suspected cause (prompt/tool/UX), recommended fix, and metric to validate improvement.
  • Coder calibration: run small overlap sets weekly to keep labels consistent, and track inter-rater agreement for critical categories.

Connect themes back to metrics by creating “coded flags” as analytic properties (e.g., has_retry, hallucination_suspected, tool_empty_then_confident_answer). This is how qualitative insight becomes measurable. Practical outcome: a living pattern library that PMs and engineers can act on, plus a set of coded indicators that appear in your dashboard alongside core metrics.
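The clustering pass can be sketched with any embedding model plus k-means. To keep this sketch dependency-free, `embed` below is a hashed bag-of-words placeholder you would replace with a real embedding API, and the k-means is a toy implementation:

```python
import hashlib
import random

def embed(text, dim=16):
    """Placeholder embedding: hashed bag-of-words. Replace with a real model."""
    vec = [0.0] * dim
    for tok in text.lower().split():
        h = int(hashlib.md5(tok.encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    return vec

def kmeans(vectors, k, iters=10, seed=0):
    """Toy k-means: assign to nearest centroid, then recompute centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centroids[c])))
            clusters[i].append(v)
        centroids = [
            [sum(col) / len(cl) for col in zip(*cl)] if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return centroids, clusters

turns = [
    "how do I get a refund for last month",
    "refund status for my order",
    "the setup wizard keeps failing",
    "setup fails at step two",
]
centroids, clusters = kmeans([embed(t) for t in turns], k=2)
# Next step: sample transcripts within each cluster and name the theme.
```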

Section 5.5: Failure mode analysis: hallucination, refusal mismatch, tool errors

LLM failures are not all the same, and fixing the wrong layer wastes weeks. Build a failure mode checklist and apply it to the highest-impact clusters from your dashboard. Three common categories deserve special attention: hallucination, refusal mismatch, and tool errors.

Hallucination is not just “made-up facts.” In product terms, it’s a breach of groundedness: the assistant claims certainty without verified sources, or invents steps that don’t exist in your system. Instrument signals such as “no retrieval sources returned,” “tool status=error,” and “assistant still produced confident instructions.” Add a property like grounding_evidence_count and track hallucination-coded sessions per intent.

Refusal mismatch happens when the assistant refuses allowed requests or answers disallowed ones. This is where UX research meets policy. You need to log which policy rule triggered (if any), whether the refusal included a helpful alternative, and whether the user escalated or abandoned. A good refusal is still an outcome you can optimize: clarity, tone, and next steps.

Tool errors include timeouts, malformed parameters, empty results, and partial failures. The UX symptom is often “assistant loops” or “tries again” without explaining. Log tool name, latency, status code, and retry count; store sanitized request/response metadata for debugging. Then estimate impact: sessions affected × severity × frequency. This impact estimate is your prioritization engine; it helps you argue for fixes that reduce user harm, not just those that are technically interesting.

  • Common mistake: blaming the base model for orchestration bugs. Separate “model said the wrong thing” from “system provided the wrong context.”
  • Practical outcome: a prioritized failure backlog where each item includes evidence (transcripts + metrics), suspected root cause layer, and a proposed validation metric.
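The impact estimate (sessions affected × severity × frequency) is easy to make concrete. The sketch below is a minimal version, assuming a simple low/medium/high severity rubric; the weights and example failures are illustrative and should be calibrated to your own risk rubric.

```python
# Sketch: rank failure backlog items by estimated impact
# (sessions affected x severity x frequency). Severity weights
# and example data are illustrative assumptions.
SEVERITY = {"low": 1, "medium": 3, "high": 9}

failures = [
    {"name": "tool timeout on order lookup", "sessions": 420,
     "severity": "medium", "weekly_freq": 0.8},
    {"name": "hallucinated refund policy", "sessions": 60,
     "severity": "high", "weekly_freq": 0.5},
    {"name": "verbose greeting", "sessions": 900,
     "severity": "low", "weekly_freq": 1.0},
]

for f in failures:
    f["impact"] = f["sessions"] * SEVERITY[f["severity"]] * f["weekly_freq"]

for f in sorted(failures, key=lambda f: f["impact"], reverse=True):
    print(f"{f['impact']:>7.1f}  {f['name']}")
```

Note how the severity weight keeps a rare-but-harmful failure competitive with a frequent nuisance; tuning those weights is itself a research decision.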
Section 5.6: Experiment design: A/B tests, guardrails, and success thresholds

Analytics pays off when you can prove improvement. Treat changes as experiments: prompt tweaks, UX changes (buttons, clarifying questions), and model policies (stricter grounding, different refusal templates). The key is to define success thresholds and guardrails before you ship, not after you see the graph.

Start by writing a crisp hypothesis tied to a failure mode and a metric: “If we add a clarifying question for ambiguous billing intents, retry rate will drop by 10% relative, without increasing abandonment.” Choose the smallest change that tests the idea. For many teams, prompt updates are fastest, but UX changes can be more reliable when ambiguity is the root cause.

  • A/B unit: randomize by user_id when possible to avoid learning effects; randomizing by session is acceptable but can bias frequent users.
  • Primary metric: one metric that reflects the goal (containment with quality checks, or task success). Keep it stable across experiments.
  • Guardrails: safety rate, hallucination-coded sessions, escalation-to-human rate, and p95 latency. An “improvement” that harms safety or speed is not a win.
  • Success thresholds: define minimum detectable effect, sample size expectations, and a decision rule (ship, iterate, rollback).
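To make "sample size expectations" concrete, here is a rough per-arm estimate for the retry-rate hypothesis above, using the standard normal-approximation two-proportion formula (alpha = 0.05 two-sided, power = 0.8). The 30% baseline is an assumed illustration; a stats library or your experimentation platform should own the real calculation.

```python
# Sketch: normal-approximation sample size per arm for a two-proportion
# test. Baseline rate and effect are illustrative; use a stats library
# (e.g., statsmodels power utilities) for production planning.
import math

def sample_size_per_arm(p_control, relative_drop, z_alpha=1.96, z_power=0.84):
    p_treat = p_control * (1 - relative_drop)
    p_bar = (p_control + p_treat) / 2
    num = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
           + z_power * math.sqrt(p_control * (1 - p_control)
                                 + p_treat * (1 - p_treat))) ** 2
    return math.ceil(num / (p_control - p_treat) ** 2)

# "Retry rate drops by 10% relative" from an assumed 30% baseline:
n = sample_size_per_arm(p_control=0.30, relative_drop=0.10)
print(f"~{n} users per arm")
```

The result (a few thousand users per arm) is why small relative effects on moderate baselines often need longer runs than teams expect.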

Common mistake: running many prompt variants without isolating variables, then being unable to explain why metrics moved. Another mistake is optimizing for containment while hidden dissatisfaction rises; mitigate this by pairing containment with proxies like retries and post-chat feedback. Practical outcome: an experiment brief template that includes hypothesis, variant description, instrumentation changes (if any), metrics, thresholds, and a rollout plan with monitoring windows.

Chapter milestones
  • Design an instrumentation plan: events, properties, and taxonomy
  • Compute core conversation metrics and create a weekly dashboard
  • Perform qualitative coding at scale and connect to metrics
  • Diagnose failure modes and prioritize fixes using impact estimates
  • Propose experiments: prompt tweaks, UX changes, and model policies
Chapter quiz

1. Why does analytics for LLM-powered experiences need to answer both “what happened” and “why it happened”?

Correct answer: Because behavior emerges from prompts, tools, retrieval, policies, and UI constraints, not just clicks
LLM system outcomes are shaped by multiple interacting components, so you must diagnose causes, not only outcomes.

2. Which set best describes what an instrumentation plan should include in this chapter’s workflow?

Correct answer: Events, properties, and a taxonomy for intent and outcomes
The chapter emphasizes designing instrumentation around events + properties and defining a taxonomy to interpret conversations.

3. What is the purpose of computing core conversation metrics and creating a weekly dashboard?

Correct answer: To monitor performance at scale and make improvements repeatable over time
A weekly dashboard supports ongoing monitoring and evidence-based iteration, not just one-off analysis.

4. How does the chapter recommend connecting qualitative coding to quantitative monitoring?

Correct answer: Use qualitative coding at scale to label patterns, then link those labels to metrics for monitoring and decision-making
The workflow ties coded themes/failures to measurable trends so you can track and prioritize issues.

5. When diagnosing failure modes and prioritizing fixes, what approach does the chapter emphasize?

Correct answer: Identify failure modes (e.g., hallucination, refusal mismatch, tool errors) and prioritize fixes using impact estimates, then propose experiments with guardrails and success thresholds
The chapter stresses diagnosing specific failure modes, estimating impact to prioritize, and running guarded experiments with clear success criteria.

Chapter 6: Ship Research—Make Recommendations and Build Your Portfolio

Research only matters if it changes what ships. In AI UX research, “shipping” includes product decisions (copy, flows, error states), model decisions (prompting, retrieval, guardrails), and operational decisions (monitoring, audits, and regression testing). This chapter shows how to convert test evidence into an insights-to-actions backlog with owners and acceptance criteria, write a report stakeholders will actually use, and set up a governance loop so quality doesn’t decay after launch. You’ll also learn how to package the work into portfolio artifacts and interview stories that prove you can operate as an AI UX researcher—not just a designer who ran a study once.

The mindset shift: treat your research deliverables like product interfaces. They should be skimmable, actionable, and designed for the people who must act. A strong Chapter 6 output is a “ship-ready” bundle: (1) prioritized recommendations with implementation details, (2) a concise narrative report with uncertainty explained, (3) a set of quality gates and regression sets for ongoing evaluation, and (4) portfolio-ready artifacts showing before/after impact.

Practice note for this chapter’s deliverables (the insights-to-actions backlog; the concise stakeholder report; the governance loop of monitoring, audits, and regression testing; the portfolio case study; and the career transition plan): for each one, document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Sections in this chapter
Section 6.1: Recommendation formats: prompt diffs, UX copy, flow changes, policies

AI UX research recommendations fail when they are too abstract (“make it more helpful”) or too one-size-fits-all (“improve the model”). Convert insights into implementation-ready formats. Your default output should be an insights-to-actions backlog: each item has an owner (PM, designer, ML, backend), priority, evidence link, and acceptance criteria that can be verified in a follow-up eval.

Prompt diffs are the fastest path to impact for LLM apps. Write them like code reviews: show the “before” and “after” system/developer prompt snippet, and include rationale tied to evidence. Example acceptance criteria: “On the regression set ‘Returns Policy—Ambiguous’, assistant asks one clarifying question before stating policy; groundedness score ≥ 0.8 in 20/25 cases.”

UX copy recommendations should specify exact strings and where they appear: empty state, error state, safety refusal, and follow-up nudges. Common mistake: changing assistant tone without considering legal/safety constraints. Add a policy note when copy changes the boundary (“I can’t do that, but I can help you…”).

Flow changes are often higher leverage than prompt tweaks. Use a state diagram or step list: entry conditions, assistant action, user choices, and exit criteria. If your study found users repeatedly “loop” after a vague answer, propose a new branch: “Offer three next steps + ‘ask a different question’ affordance.” Acceptance criteria should include completion rate, turns-to-resolution, and reduction in fallback intents.

Policies turn recurring edge cases into consistent behavior. Document: allowed/blocked content, required disclaimers, escalation triggers (handoff to human), and data boundaries. Engineers can implement policies in guardrail logic; researchers can validate via targeted red-team prompts. The practical outcome is a backlog item that can be scheduled and tested, not a slide that gets applauded and forgotten.

Section 6.2: Communicating uncertainty and model variability to stakeholders

Stakeholders are used to deterministic UI behavior; LLMs are probabilistic. Your job is to communicate uncertainty without sounding like you’re hedging. Replace vague caveats (“the model is inconsistent”) with structured variability: what varies, why it varies, and how you propose to control it.

Use three tools in your report and presentation. First, ranges not single numbers: “Task success 62–74% across 3 seeds” or “Groundedness dropped from 0.85 on curated queries to 0.61 on long-tail queries.” Second, confidence labels tied to evidence volume and representativeness: High (≥30 sessions + regression set), Medium (10–29 sessions or biased sample), Low (exploratory, anecdotal). Third, repro steps: provide the exact prompt, context, retrieval settings, and model version so others can see the behavior.
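The confidence-label rule above is mechanical enough to encode, which keeps labels consistent across reports. The sketch below uses the thresholds from the text (≥30 sessions plus a regression set → High; 10–29 sessions or a biased sample → Medium; otherwise Low); the function name and inputs are illustrative.

```python
# Sketch of the confidence-label rule described in the text. Thresholds
# mirror the text; the exact inputs are illustrative assumptions.
def confidence_label(sessions: int, has_regression_set: bool, biased_sample: bool) -> str:
    if sessions >= 30 and has_regression_set and not biased_sample:
        return "High"
    if sessions >= 10:
        return "Medium"
    return "Low"

print(confidence_label(42, True, False))   # High
print(confidence_label(18, False, False))  # Medium
print(confidence_label(5, False, False))   # Low
```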

When presenting, separate model behavior from product behavior. If the issue is that the assistant “hallucinates policies,” show whether it’s a retrieval failure (no source returned), a prompt failure (didn’t require citations), or a UI failure (sources hidden, users can’t verify). This framing prevents the common mistake of blaming the model for what is actually an instrumentation or UX design gap.

Finally, make uncertainty actionable by proposing risk-based rollouts: feature flags, limited domains, or “answer with citations only” modes. Tie recommendations to governance: monitoring dashboards, weekly audits of high-risk intents, and regression testing before each release. You’re not asking for perfection; you’re defining how the team will detect drift and respond.

Section 6.3: Quality gates: pre-release evals, red-teaming inputs, regression sets

Shipping AI research means defining quality gates that decide whether a change can launch. A gate is a measurable check with a threshold, an owner, and a clear “stop-ship” condition. Without gates, teams rely on anecdotal demos and end up surprised by failures in production.

Pre-release evals should include both UX and LLM evaluation concepts: helpfulness (did it solve the task), groundedness (is it supported by sources), and safety (does it refuse and redirect appropriately). Build a small evaluation suite from your study: the top tasks, the top failure modes, and a handful of long-tail prompts. Set thresholds that match risk: for low-risk features you might accept “helpfulness ≥ 0.75,” while for compliance-heavy domains you may require “citation present in 95% of policy answers.”
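A quality gate reduces to data plus one check function: a metric, a threshold, and a stop-ship condition. The sketch below uses the thresholds from the paragraph above; the metric names and the exact gate list are illustrative assumptions.

```python
# Sketch: a pre-release quality gate as a declarative list plus one
# check function. Thresholds mirror the text; metric names are
# illustrative assumptions.
GATES = [
    {"metric": "helpfulness", "threshold": 0.75, "stop_ship": True},
    {"metric": "groundedness", "threshold": 0.80, "stop_ship": True},
    {"metric": "citation_rate_policy_answers", "threshold": 0.95, "stop_ship": True},
]

def evaluate_gates(scores: dict):
    failures = [g for g in GATES if scores.get(g["metric"], 0.0) < g["threshold"]]
    ship_ok = not any(f["stop_ship"] for f in failures)
    return ship_ok, [f["metric"] for f in failures]

ok, failed = evaluate_gates({"helpfulness": 0.81, "groundedness": 0.78,
                             "citation_rate_policy_answers": 0.97})
print("ship" if ok else f"stop-ship: {failed}")
```

Keeping the gate list declarative means the thresholds can be reviewed by PMs and legal without reading code.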

Red-teaming inputs are not only for security teams. Researchers contribute by translating observed user workarounds into adversarial prompts: users who try to “trick” the assistant, bypass refusals, or request private data. Include “benign misuse” too—people asking medical questions in a shopping assistant. A common mistake is writing unrealistic jailbreak prompts; prioritize prompts that real users actually attempted in sessions or that are likely in your domain.

Regression sets are the backbone of the governance loop. Every time you fix a failure, add the prompt + expected behavior to a regression set tagged by intent and risk level. Store the expected response as criteria, not exact text: “Must ask one clarifying question,” “Must cite source,” “Must refuse and offer alternative.” Run regressions on every prompt change, model upgrade, or retrieval index update. The practical outcome is a repeatable release process: research findings become tests, and tests prevent backsliding.
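Storing expected behavior as criteria rather than exact text might look like the sketch below: each criterion is a named predicate over the assistant's reply, and each regression case lists which criteria must hold. The criterion heuristics, tags, and the stub assistant are all illustrative assumptions.

```python
# Sketch: regression entries store behavioral criteria, not exact text.
# Criteria are predicates over the reply; names, heuristics, and the
# stub assistant are illustrative assumptions.
CRITERIA = {
    "asks_clarifying_question": lambda reply: "?" in reply,
    "cites_source": lambda reply: "[source:" in reply.lower(),
    "refuses_with_alternative": lambda reply: "can't" in reply.lower()
                                              and "instead" in reply.lower(),
}

REGRESSION_SET = [
    {"prompt": "What is your returns policy?",
     "intent": "returns_policy", "risk": "high",
     "must": ["asks_clarifying_question"]},
    {"prompt": "Delete my account data permanently",
     "intent": "account_deletion", "risk": "high",
     "must": ["refuses_with_alternative"]},
]

def run_regressions(assistant, cases):
    results = []
    for case in cases:
        reply = assistant(case["prompt"])
        failed = [c for c in case["must"] if not CRITERIA[c](reply)]
        results.append({"prompt": case["prompt"], "passed": not failed, "failed": failed})
    return results

# Stub standing in for the real system under test.
def stub_assistant(prompt):
    if "returns" in prompt.lower():
        return "Which product is this about?"
    return "I can't delete data here, but you can request it instead via Settings."

for r in run_regressions(stub_assistant, REGRESSION_SET):
    print(r["passed"], r["prompt"])
```

Because the criteria are behavioral, the same regression set survives prompt rewrites and model upgrades that change surface wording.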

Section 6.4: Research ops for AI: repositories, tagging, and reusability

AI UX research generates more artifacts than traditional usability testing: transcripts, prompt variants, model versions, retrieval configs, eval results, and analytics events. Research ops turns that complexity into a reusable system so each study compounds value.

Start with a central repository (Notion, Confluence, or a Git-backed docs folder). Standardize folders: Briefs, Scripts, Stimuli (prompt/flow versions), Sessions, Findings, Backlog, and Regression Sets. Every artifact should include “run metadata”: date, model/version, system prompt hash, tools enabled, and data sources. Without metadata, you can’t compare results across time.

Tagging makes retrieval possible. Tag findings by: intent (e.g., Order Status), failure mode (hallucination, refusal mismatch, over-verbosity, tool misuse), risk (low/med/high), and lifecycle stage (pre-launch, beta, post-launch). Pair tags with a consistent naming scheme for conversation snippets so teams can search and reuse them.
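A findings record that carries both run metadata and tags makes queries like "high-risk hallucination findings" a one-liner. The sketch below shows one illustrative record shape and a tiny query helper; all field names and values are assumptions, not a prescribed schema.

```python
# Sketch: findings records carrying run metadata plus tags, and a small
# query helper for tag-based retrieval. Field names are illustrative.
findings = [
    {"id": "F-101", "intent": "order_status", "failure_mode": "hallucination",
     "risk": "high", "stage": "beta",
     "run": {"model": "model-x-2024-05", "prompt_hash": "a1b2c3",
             "tools": ["orders_api"]}},
    {"id": "F-102", "intent": "billing", "failure_mode": "refusal_mismatch",
     "risk": "med", "stage": "post-launch",
     "run": {"model": "model-x-2024-05", "prompt_hash": "a1b2c3", "tools": []}},
]

def query(items, **tags):
    return [f for f in items if all(f.get(k) == v for k, v in tags.items())]

for f in query(findings, risk="high", failure_mode="hallucination"):
    print(f["id"], f["run"]["model"])
```

The run metadata block is what makes results comparable over time: without the model version and prompt hash, two findings with the same tag may describe different systems.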

Make reusability a deliverable. Create a rubric template for chat quality (helpfulness, groundedness, safety, tone, interaction cost) and reuse it across studies so your metrics are comparable. Create an instrumentation map linking user goals to events, intents, slots, and outcomes; this becomes the bridge between qualitative findings and analytics monitoring. Common mistake: treating research ops as “nice to have.” In AI, ops is how you maintain a governance loop: monitoring, audits, and regression testing become routine rather than heroic.

Section 6.5: Portfolio artifacts: briefs, rubrics, dashboards, and before/after examples

Your portfolio should prove you can do the AI UX research job end-to-end: define the problem, run tests, interpret evidence, and drive changes with measurable impact. Hiring teams want artifacts, not just claims. Package a case study as a compact narrative with attachments.

Include a research brief (one page) showing goals, hypotheses, tasks, success criteria, and constraints (model limits, safety policies, data boundaries). Add a conversation test script and a flow/state diagram snippet so reviewers can see you understand chat mechanics. Show a rubric that evaluates helpfulness, groundedness, and safety—ideally with example scores and what “good” looks like.

Add a before/after section with concrete changes: prompt diffs, updated UX copy, and flow changes. Pair each change with evidence: a transcript snippet, a metric shift (task success, turns-to-resolution), or a reduction in safety incidents. If you used analytics, include a simple dashboard mock: funnels for key intents, fallback rate, handoff rate, and CSAT (if available). Show how instrumentation ties to outcomes—events and slots are not the goal; resolution is.

Finally, show the insights-to-actions backlog with owners and acceptance criteria. This signals that you can operate in a cross-functional environment and that your research is shippable. Common mistake: over-indexing on “AI buzzwords” and under-showing decision-making. Your portfolio should read like a product story where research drove a specific release and improved reliability.

Section 6.6: Interview readiness: case prompts, take-home tests, and negotiation

To complete the career transition, translate your work into skill signals: resume bullets, interview stories, and a plan for take-home exercises. Start by rewriting your experience in outcomes + methods + AI-specific rigor. Example structure: “Ran 12 moderated chat-flow sessions for LLM assistant; identified top 5 failure modes; shipped prompt + flow changes; improved task success from X to Y; established regression set of 60 prompts and release quality gates.”

Prepare for case prompts by practicing a repeatable walkthrough: clarify domain and risk, define user goals, propose hypotheses, outline tasks and success metrics, describe instrumentation (events/intents/slots/outcomes), and explain how you’d evaluate helpfulness/groundedness/safety. Interviewers often test whether you can balance UX quality with model variability; explicitly mention seeds, reruns, and confidence levels.

For take-home tests, optimize for clarity and actionability. Deliver a concise report (1–2 pages) plus an appendix: transcripts, rubric, and prioritized backlog with acceptance criteria. Include at least one example of communicating uncertainty and one example of a governance loop (monitoring + audit cadence + regression plan). Common mistake: producing a beautiful deck with no implementable next steps.

In negotiation, anchor on scope and risk. Roles that own evaluation, safety collaboration, and release gates are higher leverage than “UX writing for chat.” Ask what model changes you can influence, whether there is an existing eval pipeline, and who owns monitoring. Those answers tell you whether you’ll be set up to succeed—and give you concrete levers to justify level and compensation.

Chapter milestones
  • Create an insights-to-actions backlog with owners and acceptance criteria
  • Write a concise AI UX research report and present it to stakeholders
  • Build a governance loop: monitoring, audits, and regression testing
  • Package a portfolio case study for AI UX research roles
  • Career transition plan: resume bullets, interview stories, and skill signals
Chapter quiz

1. In Chapter 6, what best describes “shipping” in AI UX research?

Correct answer: Changes to product, model, and operations decisions such as copy/flows, prompting/guardrails, and monitoring/audits/regression testing
The chapter defines shipping broadly as product, model, and operational decisions influenced by research.

2. What is the primary purpose of converting test evidence into an insights-to-actions backlog?

Correct answer: To turn findings into prioritized, assignable work with owners and acceptance criteria
The backlog format ensures recommendations are actionable, owned, and verifiable via acceptance criteria.

3. Which deliverable best reflects the mindset shift to treat research deliverables like product interfaces?

Correct answer: A skimmable, actionable set of outputs designed for the people who must act
Chapter 6 emphasizes deliverables that are skimmable and designed for action by their audiences.

4. Why does Chapter 6 recommend building a governance loop after launch?

Correct answer: To prevent quality from decaying by using monitoring, audits, and regression testing
A governance loop keeps quality high over time through ongoing monitoring, audits, and regression tests.

5. Which set of elements matches the chapter’s definition of a “ship-ready” bundle?

Correct answer: Prioritized recommendations with implementation details; concise narrative report with uncertainty explained; quality gates and regression sets; portfolio-ready artifacts showing before/after impact
The chapter explicitly lists these four components as the ship-ready bundle.