Career Transitions Into AI — Intermediate
Turn design skills into AI UX research with tested chat flows and metrics.
Conversational AI changes the rules of UX. Instead of static screens, you’re designing and researching a system that generates language, reasons with partial context, and can be helpful one moment and confidently wrong the next. This course is a short technical book that shows you how to move from product design into AI UX research by building testable conversational prototypes and evaluating them with a blend of usability methods and LLM-specific analytics.
You’ll work through a practical progression: define the assistant’s job-to-be-done, prototype realistic conversational flows (including error recovery), design research studies that capture real user behavior, and then quantify what’s happening with instrumentation and transcript analysis. The goal is not to turn you into a data scientist; it’s to give you the frameworks, artifacts, and vocabulary an AI product team expects from a UX researcher working on LLM experiences.
LLM experiences require more than classic usability testing. You’ll learn how to assess groundedness, helpfulness, safety, and consistency without getting lost in model internals. We’ll treat prompts and system instructions as part of the interface, and we’ll show how to test variability, capture reproducible traces, and interpret failure modes like hallucinations, refusal mismatches, tool errors, and ambiguous user intent.
This course is designed for product designers, UX designers, service designers, and researchers who want to pivot into AI-focused roles. If you already know UX fundamentals but want a clear, end-to-end method for researching chat and assistant experiences—this is your playbook.
Each chapter is structured like a book chapter with milestones and sub-sections. You’ll start with a scoped use case and end with a repeatable research and analytics workflow you can apply to new products. Along the way, you’ll practice turning ambiguous conversation quality into measurable criteria, so your recommendations are easier to prioritize and ship.
Ready to begin? Register free to access the course, or browse all courses to compare learning paths in AI career transitions.
You’ll be able to prototype conversational flows that are explicitly testable, run studies that capture both user experience and model behavior, and use LLM analytics to measure outcomes and guide iteration. Most importantly, you’ll have a coherent set of artifacts that demonstrate AI UX research capability—ideal for interviews, internal mobility, or client work.
Conversational AI UX Research Lead
Sofia Chen leads conversational UX research for AI assistants across fintech and healthcare products. She specializes in LLM evaluation, human-in-the-loop testing, and turning qualitative insights into measurable improvements. Her work focuses on safe, reliable chat experiences and research operations that scale.
Transitioning from product design to AI UX research is less about abandoning your design craft and more about aiming it at a new kind of interface: one that responds in language, adapts in real time, and sometimes makes things up. In traditional UI, you ship screens and flows; the system’s behavior is mostly bounded by what you designed. In LLM-powered chat, you ship a behavioral envelope: prompts, policies, tools, guardrails, and evaluation criteria that shape what the assistant can do and how reliably it does it.
Your design background already includes the foundations of this work: goal framing, task analysis, usability heuristics, journey mapping, prototyping, and evidence-based iteration. What changes is the research surface area. You’re no longer only validating “can users find and click the right thing?” You’re validating “can users express the right thing, can the assistant interpret it, and does the response remain helpful, grounded, and safe across varied contexts?”
This chapter sets up your new workflow. You’ll map your transferable skills and identify AI UX research gaps, define the assistant’s job-to-be-done (JTBD) and user outcomes, draft a conversation research brief with hypotheses, set up a research repository and evidence standards, and plan a portfolio of artifacts you’ll produce throughout the course.
Most importantly, you’ll practice engineering judgment: knowing what to test first, what to hold constant, what to instrument, and how to interpret messy conversational evidence without overfitting to anecdotes.
Practice note for Map your transferable skills and AI UX research gaps: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Define the assistant’s job-to-be-done and user outcomes: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Create a conversation UX research brief and hypothesis set: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Set up your research repository and evidence standards: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Portfolio plan: what artifacts you’ll produce in this course: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
When the UI talks back, the interaction model becomes turn-based, interpretive, and open-ended. The “interface” is not just layout—it’s phrasing, timing, repair strategies, and the assistant’s ability to keep context. Users don’t browse; they negotiate meaning. That’s why AI UX research puts more emphasis on misunderstandings, recovery, and expectation management than many screen-based studies.
As a designer, you likely already run usability tests with tasks and success criteria. In chat, you still do that, but tasks must include variations in how people ask, what they omit, and what they assume the assistant knows. Your prototypes also change: instead of high-fidelity screens, you’ll prototype conversational flows with scripts, branching logic, and state diagrams. A lightweight way to start is to write a “happy path” transcript, then add at least three common deviations: ambiguous request, missing detail, and user correction.
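If it helps to make those deviations concrete, here is a minimal sketch of a branching script captured as plain data. The Python structure, the expense scenario, and the branch names are illustrative only, not a required format.

```python
# Minimal sketch of a branching conversation script; the scenario and
# field names are illustrative, not a prescribed schema.
script = {
    "use_case": "expense_report",
    "happy_path": [
        {"user": "How do I expense my client dinner from Tuesday?"},
        {"assistant": "I can help. What was the total, and do you have the receipt?"},
        {"user": "$84, and yes I have the receipt."},
        {"assistant": "Submit it under 'Meals & Entertainment' with the receipt and client name attached."},
    ],
    "deviations": {
        "ambiguous_request": [
            {"user": "Can I expense this?"},
            {"assistant": "Possibly. What kind of purchase is it, and was it for travel or a client meeting?"},
        ],
        "missing_detail": [
            {"user": "Expense my dinner."},
            {"assistant": "I need the date and amount before I can confirm the right category."},
        ],
        "user_correction": [
            {"user": "Actually it was Wednesday, not Tuesday."},
            {"assistant": "Thanks, updated to Wednesday. The category and limit stay the same."},
        ],
    },
}
# Each turn can later carry annotations (intent, slots filled, expected behavior)
# so the same script doubles as a test fixture and a rubric source.
```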
Common mistake: treating chat like a search box that returns a long answer. The research consequence is that you’ll measure satisfaction while missing failure modes like goal drift (the assistant changes the task), over-verbosity (the user abandons), or hidden errors (the assistant sounds confident but is wrong). Practical outcome: you’ll begin mapping your transferable skills (task design, facilitation, synthesis) and your gaps (LLM evaluation, tool use, instrumentation) so your learning plan is targeted rather than overwhelming.
AI products blur responsibilities that used to sit neatly between design, research, and engineering. Understanding the roles helps you collaborate—and helps you position your career transition. In many teams, “AI UX Researcher” sits at the intersection of conversation design, evaluation, and research operations.
AI UX (product/experience): Defines user outcomes, sets interaction principles, and decides what the assistant should and should not do. This includes aligning on the assistant’s JTBD: a crisp statement of the job users hire the assistant to do, plus the outcomes that indicate success (time saved, fewer errors, higher confidence, reduced handoffs).
Conversation design: Crafts the assistant’s voice, prompts, clarifying questions, and fallback behaviors. Even if you’re not the primary conversation designer, you’ll research whether these behaviors actually work for real users and real inputs.
Evaluation (LLM quality): Operationalizes “good” with rubrics and scorecards—helpfulness, groundedness, safety, and task completion—then runs systematic testing. UX research becomes more quantitative here: you’ll compare variants, define rating scales, and ensure raters interpret criteria consistently.
Research ops: Builds the system that keeps evidence usable: repositories, tagging, privacy standards, and repeatable templates. Without this, chat studies become piles of transcripts no one can trust.
Practical outcome: you’ll define a working “assistant role” (what it’s allowed to do), identify stakeholder owners (product, engineering, legal, data), and decide early what evidence standards you’ll hold for decisions—especially when the model is persuasive but unreliable.
To research chat UX effectively, you need a small set of concepts that explain most failures. Start with intent: what the user is trying to accomplish in this turn. Intent is rarely a single label; it’s often layered (e.g., “draft an email” plus “match my tone” plus “use the attached policy”). Next is context: the relevant information the assistant should use—conversation history, user profile, system policies, and external sources (documents, databases, tools).
Grounding is the discipline of tying responses to reliable sources or known truths. In LLM UX, grounding is not an abstract ML concept; it’s a user trust lever. If the assistant gives a policy answer, where did it come from? If it proposes a next step, is it consistent with the user’s constraints? Your research will often reveal that users don’t mind clarification questions, but they do mind confidently wrong answers.
Ambiguity is the default state of language. People omit details (timeframes, formats, audiences) and use overloaded words (“report,” “optimize,” “safe”). Your job is to test whether the assistant detects ambiguity, asks the right clarifying questions, and allows users to correct it without penalty.
Practical outcome: you’ll define the assistant’s JTBD and user outcomes in a way that is testable, then create a hypothesis set tied to intent handling, context retention, and grounding behavior—so your study isn’t just “do people like it?”
LLM experiences fail in ways that look like “good UX” until you check the truth. Hallucinations—fabricated facts, citations, or actions—are the classic risk, but they’re not the only one. Safety issues include harmful instructions, biased outputs, harassment, or inappropriate content. Privacy issues include leaking personal data, retaining sensitive information, or encouraging users to paste confidential content. Compliance can include regulated advice (medical, legal, financial), accessibility, record retention, and enterprise data handling rules.
AI UX research must treat these as first-class constraints, not edge cases. That changes how you write tasks and how you store data. For example, if your participants paste real customer data into a prototype, you’ve created a compliance problem. Your research repository needs redaction practices, storage permissions, and clear rules for what can appear in transcripts and screenshots.
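As one example of a redaction practice, the sketch below strips emails and long digit runs from transcripts before they enter the repository. The patterns are illustrative assumptions; a real pipeline still needs human review and your own PII definitions.

```python
# Minimal redaction sketch for transcripts; patterns are illustrative and
# should be extended (names, addresses, IDs) to match your PII policy.
import re

def redact(text: str) -> str:
    text = re.sub(r"[\w.+-]+@[\w-]+\.[\w.-]+", "[EMAIL]", text)
    text = re.sub(r"\b\d{6,}\b", "[NUMBER]", text)  # account numbers, phone fragments
    return text
```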
Practical outcome: you’ll establish evidence standards in your repository (what must be captured for each finding: conversation transcript, model version, settings, sources used, and severity). You’ll also begin defining refusal and escalation expectations: when the assistant should say “I can’t help with that,” when it should offer safer alternatives, and when it should hand off to a human.
Great chat research starts with questions that isolate behavior. Instead of a generic usability goal, define a practical research plan with explicit goals, hypotheses, tasks, and success criteria. Your hypotheses should be falsifiable and tied to user outcomes, not model mystique. For example: “If the assistant asks one clarifying question before drafting, users will report higher confidence and require fewer edits.”
In moderated sessions, you’ll watch how people formulate requests, what they reveal about intent, and whether they notice uncertainty. In unmoderated tests, you’ll prioritize scalability and consistency: tight tasks, clear stopping rules, and structured post-task questions. In both, you’ll capture high-signal evidence—moments that change a product decision—rather than collecting endless transcripts.
Common mistake: using satisfaction as the primary success metric. LLMs can be highly satisfying while being wrong. Practical outcome: you’ll draft a conversation UX research brief that includes (1) assistant JTBD and target users, (2) top risks, (3) hypotheses mapped to measurable criteria (completion, correction rate, grounding rate), and (4) a task set that intentionally includes ambiguity and adversarial-but-realistic inputs.
Your portfolio as an AI UX Researcher is built from artifacts that make chat behavior testable and improvable. The first is a conversation UX research brief: a one- to two-page document that states the assistant’s JTBD, user outcomes, in-scope capabilities, constraints (safety/privacy/compliance), and the hypotheses you’ll test. This is where you translate design instincts into research responsibilities.
Next is a test plan tailored to conversational UX. It specifies study type (moderated vs. unmoderated), participant criteria, tasks (including ambiguous and correction scenarios), what you’ll log, and how you’ll judge success. For LLMs, pair the plan with a scorecard that operationalizes evaluation concepts: groundedness, safety, and helpfulness. Define rating anchors (e.g., 1–5) with concrete examples so different reviewers agree.
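To make rating anchors concrete, here is a minimal scorecard sketch assuming a 1–5 scale. The dimension names follow the ones above; the anchor wording is an example you would tune with your raters.

```python
# Sketch of a scorecard with explicit rating anchors (1-5 scale assumed);
# anchor text is illustrative and should be calibrated per product.
SCORECARD = {
    "groundedness": {
        1: "States facts with no source and at least one is wrong.",
        3: "Mostly consistent with sources but mixes in unverified detail.",
        5: "Every factual claim is tied to a cited source or tool result.",
    },
    "safety": {
        1: "Provides harmful, out-of-policy, or regulated advice.",
        3: "Stays in policy but misses a needed disclaimer or handoff.",
        5: "Respects boundaries and offers a safe alternative or escalation.",
    },
    "helpfulness": {
        1: "Does not move the user toward their goal.",
        3: "Partially useful; the user must rephrase or fill gaps themselves.",
        5: "The user can complete the task from this response alone.",
    },
}

def score_turn(ratings: dict) -> float:
    """Average the dimension scores for one assistant turn."""
    return sum(ratings.values()) / len(ratings)
```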
To avoid “insights theater,” set up your research repository early with evidence standards: naming conventions, tags (intent, failure mode, severity), and required context (model version, prompt, tools, retrieval sources). This is research ops applied to conversational systems—without it, you can’t track regressions or improvements.
Practical outcome: by the end of this course, your portfolio plan should show a coherent narrative—how you identified skill gaps, defined an assistant role, designed research to validate conversational behavior, and connected findings to both product decisions and analytics instrumentation (events, intents, slots, outcomes) so improvements can be measured after launch.
1. In Chapter 1, what is the key mindset shift when moving from traditional UI design to LLM-powered chat experiences?
2. What does Chapter 1 mean by saying you ship a “behavioral envelope” in LLM chat?
3. Compared to traditional UI validation, what additional core question does AI UX research need to validate in LLM chat?
4. Which set best matches the chapter’s described primary deliverable shift for this role transition?
5. What is the chapter’s description of “engineering judgment” in AI UX research?
Designers transitioning into AI UX research often underestimate one thing: you can’t “just test the chat” unless you’ve first made the chat testable. Traditional UX prototypes tend to be stable—screens don’t change their meaning between sessions. LLM experiences are different: wording, context, and subtle prompt changes can produce materially different outcomes. Your job in this chapter is to learn how to prototype conversational flows with enough structure that they can be evaluated, compared, and iterated—without accidentally testing a moving target.
A testable conversational prototype has three characteristics. First, it has explicit scope boundaries (what it will and won’t do). Second, it has a flow model (even if the UI is a simple chat window) that anticipates branching and failure paths. Third, it has an evaluation frame: a rubric describing expected versus acceptable model behavior, including how the assistant should respond when it’s uncertain, unsafe, or out of scope.
As you work through the sections, keep the end deliverable in mind: something you can put in front of participants (or run unmoderated), capture evidence from, and analyze. You are not aiming for a perfect product; you’re aiming for a prototype that produces high-signal learning about user goals, assistant behavior, and interaction breakdowns.
Practice note for Choose a use case and define scope boundaries: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Draft a conversation script with branching and error paths: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Design prompts/system instructions as a product interface: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Build a clickable or runnable prototype for testing: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Create a rubric: expected vs acceptable model behaviors: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
The fastest way to build momentum is to pick a use case that is real, repeatable, and bounded. “Help users with HR questions” is too broad. “Help a new hire understand how to submit expenses” is testable because it has a clear goal, known constraints, and observable outcomes. As a former designer, you already know how to define a scenario; the AI twist is that you must define boundaries in language, not just screens.
Start by writing a one-paragraph use-case brief: who the user is, what they’re trying to accomplish, what inputs they can provide, and what the assistant can access (documents, policies, tools, or nothing). Then define scope boundaries explicitly: topics that are out of scope, actions the assistant cannot take, and assumptions about data freshness. This becomes the basis for your system instructions and your test plan.
Define “done” in terms of user outcomes and model behavior. User outcomes are observable: “User correctly identifies reimbursable categories and submits a complete expense report.” Model behavior outcomes are evaluative: “Assistant cites the policy source, asks for missing fields, and avoids inventing reimbursement rules.” If you can’t describe success criteria without using vague words like ‘helpful,’ your prototype isn’t ready to test.
The output of this step is not a full spec; it’s a boundary box that makes later prompt-writing and rubric design possible.
Conversational flows are not all the same. If you test an open-ended experience with a guided script (or vice versa), you will misdiagnose failures. Choose a flow type intentionally based on risk, complexity, and the cost of errors.
Guided flows resemble forms: the assistant asks targeted questions to fill required slots (date, amount, category). They are easier to test because paths are predictable and “done” is clear. They are ideal for high-stakes domains or when you need consistent data capture.
Open-ended flows prioritize exploration: users ask anything, the assistant responds and pivots. This is closer to “search plus synthesis,” but it’s harder to test because user paths vary widely. Here, your prototype must emphasize guardrails, source attribution, and graceful uncertainty handling.
Mixed-initiative flows combine both: the user starts with an open question, and the assistant decides when to guide (“To answer that, I need your country and employment type—full-time or contractor?”). Mixed-initiative is often the best product experience, but it requires clear rules for when the assistant should take control.
Draft your conversation script with branching and error paths aligned to your chosen flow type. For guided flows, sketch the required slots and order, then add branches for missing info and user corrections. For open-ended flows, script representative “intents” and add boundaries: what happens when the user requests something the assistant shouldn’t do. For mixed-initiative, define triggers that switch modes (e.g., “unclear request,” “multi-step task,” “policy-sensitive topic”).
By the end of this section, you should have a script outline with at least three branches: happy path, missing-information path, and out-of-scope or error path.
In screen-based UX, “state” is what the system currently knows: logged-in status, items in a cart, selected filters. In LLM UX, state still exists, but it’s split across different layers that affect testability: conversation history (what’s in the context window), explicit memory (saved preferences), and external state (tools, databases, files).
For prototyping, treat state as a diagram, not a vibe. Create a lightweight state model that lists: (1) what must be captured from the user (slots), (2) what can be inferred but should be confirmed, and (3) what must never be assumed. Then decide where each piece of state lives. If it’s only in the chat history, it may be forgotten when the conversation gets long or when you change the prompt. If it’s in explicit memory, you must design consent and editability (“Forget my location”). If it’s in external state, you must design latency and failure handling (“I couldn’t access your policy portal right now”).
Context windows introduce a practical testing constraint: identical user input can produce different outputs depending on what the model “sees” above it. This is why you should define a test harness conversation prefix: a fixed system message, fixed few-shot examples (if used), and a consistent starting message. When running studies, reset sessions between tasks unless you are explicitly testing carryover memory.
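A minimal harness sketch, assuming a generic chat-completion interface: the `generate` callable, model name, prompt text, and few-shot examples below are placeholders for whatever client and configuration you actually use.

```python
# Sketch of a fixed test-harness prefix with session resets between tasks.
# generate() stands in for your real model client; all values are illustrative.
SYSTEM_PROMPT = "You are the expense assistant. Use only the attached policy. Ask before assuming."
FEW_SHOT = [
    {"role": "user", "content": "Can I expense a gym membership?"},
    {"role": "assistant", "content": "The policy does not list gym memberships as reimbursable. Want me to check wellness benefits instead?"},
]
SETTINGS = {"model": "example-model-v1", "temperature": 0.2}

def new_session() -> list:
    """Reset state between tasks: fixed system message plus fixed few-shot examples."""
    return [{"role": "system", "content": SYSTEM_PROMPT}, *FEW_SHOT]

def run_task(first_user_message: str, generate) -> list:
    """Run one task from a clean prefix so carryover memory isn't accidentally tested."""
    messages = new_session()
    messages.append({"role": "user", "content": first_user_message})
    reply = generate(messages=messages, **SETTINGS)  # placeholder call to your client
    messages.append({"role": "assistant", "content": reply})
    return messages
```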
Thinking in state also helps analytics later: intents and slots are essentially tracked state transitions. The prototype you build now should reflect what you’ll eventually instrument.
In LLM products, prompts and system instructions are part of the interface. They decide what the assistant prioritizes, how it speaks, and—most importantly—what it refuses to do. A strong prompt is not a clever incantation; it’s a UX specification written for a language model.
Start with a system instruction that defines role, audience, and boundaries: what sources to use, what not to do, and how to respond under uncertainty. Keep it structured: short sections with headings like “Purpose,” “Allowed inputs,” “Safety constraints,” “Output format,” and “When to ask clarifying questions.” Then define tone intentionally. Tone is not just brand voice; it affects perceived competence. For policy and troubleshooting, concise and procedural often tests better than playful.
Constraints must be observable. Instead of “be accurate,” specify behaviors you can evaluate: “If you are not sure, say you’re not sure and ask for X,” “Cite the policy section title,” “Do not provide legal advice; offer to connect the user to HR.” These become rubric criteria later.
Design fallbacks as first-class flow steps. Your assistant should have a planned response for: missing information, ambiguous intent, tool failure, and out-of-scope requests. Write these fallback patterns as reusable snippets so your prototype behaves consistently across branches.
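One way to keep those fallbacks consistent is to store them as reusable templates that every branch of the prototype draws from. The snippet below is a sketch; the wording and slot names are illustrative.

```python
# Sketch of reusable fallback snippets selected by the prototype or injected
# into the prompt; wording and slots are examples, not required copy.
FALLBACKS = {
    "missing_information": "I can help with that, but I'm missing {field}. Could you share it?",
    "ambiguous_intent": "To make sure I solve the right problem: did you mean {option_a} or {option_b}?",
    "tool_failure": "I couldn't reach {system} just now. You can retry in a moment, or I can note this for follow-up.",
    "out_of_scope": "That's outside what I can do here. For {topic}, the fastest route is {alternative}.",
}

def fallback(kind: str, **slots) -> str:
    """Fill a fallback template so every branch reacts the same way."""
    return FALLBACKS[kind].format(**slots)
```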
When you treat prompts as UX, you also get cleaner research: participants experience a stable interaction policy, and your findings map to specific prompt decisions you can iterate.
You have three practical prototype levels, and choosing the right one is an exercise in research judgment: match fidelity to the questions you need answered.
1) Script-only prototypes are fastest. You write the conversation as a branching script (like a play), including user utterances, assistant responses, and annotations for intent/slot/state changes. This is ideal for early-stage concept validation and for aligning stakeholders on scope boundaries. It’s also the easiest way to ensure you’ve covered error paths and escalation before any tool is built.
2) Simulators add controlled variability. A simulator can be as simple as a spreadsheet with “if user says X, respond with Y,” or a lightweight tool that lets a researcher select assistant responses from predefined options. Simulators support moderated testing because you can keep the experience consistent while observing user language and expectations.
3) Low-code or runnable chat UIs (a basic web chat hooked to an LLM, or a prototyping platform with an LLM connector) are best when you must observe real model behavior: hallucinations, instruction-following gaps, and sensitivity to phrasing. If you go runnable, lock down versions: save the prompt, model name, temperature, and any retrieval sources so results are reproducible.
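A small run manifest saved alongside each session makes that version lock-down routine. The field names below are assumptions, not a required schema.

```python
# Sketch of a run manifest written per study session so results are reproducible.
import datetime
import json

def save_run_manifest(path, prompt, model, temperature, retrieval_sources, seed=None):
    manifest = {
        "saved_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "model": model,
        "temperature": temperature,
        "seed": seed,
        "system_prompt": prompt,
        "retrieval_sources": retrieval_sources,
    }
    with open(path, "w", encoding="utf-8") as f:
        json.dump(manifest, f, indent=2)
```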
Regardless of prototype type, create a rubric of expected vs acceptable behaviors before you test. “Expected” means ideal product behavior; “acceptable” means still usable without harming trust. For example: expected = “asks for missing receipt date before calculating,” acceptable = “gives steps but flags missing date as required.” The rubric keeps you from overreacting to minor phrasing issues and helps you focus on user impact.
The goal is not to impress with polish—it is to generate stable, comparable evidence across participants and iterations.
Edge cases are not rare in conversational UX; they are the moments users remember. In AI UX research, you must prototype them deliberately because they shape trust, safety, and perceived competence. Three categories matter most: refusal (the assistant should not comply), uncertainty (the assistant cannot be sure), and escalation (a human or alternative channel is needed).
Refusal design should be specific and helpful. A good refusal states the boundary, gives a brief reason in user terms, and offers safe alternatives. For example, if asked for medical diagnosis, refuse and suggest speaking to a clinician, plus provide general information disclaimers. Avoid scolding, long policy quotes, or vague “I can’t help with that” responses that force the user to guess the boundary.
Uncertainty handling is where groundedness meets UX. Prototype how the assistant signals confidence: cite sources, show assumptions, or ask a clarifying question instead of guessing. Build rubric criteria that distinguish “transparent uncertainty” (acceptable) from “confident fabrication” (failure). If your use case involves documents, prototype what happens when sources conflict or are missing.
Escalation is a flow, not a dead end. Define triggers (user frustration, repeated failure, high-risk topics), the handoff content (summary of what’s been collected), and the user’s next step (link, contact method, ticket creation). Even in a prototype, write the escalation message and capture what data would be passed along.
When you design edge cases into the prototype, your research sessions stop being improvisational. You can measure whether the assistant preserves trust under stress—exactly the capability stakeholders will care about when deciding whether an AI experience is ready to ship.
1. Why does Chapter 2 argue you can’t “just test the chat” in LLM experiences?
2. Which set best describes the three characteristics of a testable conversational prototype in this chapter?
3. What is the main purpose of defining explicit scope boundaries before testing a conversational flow?
4. What does the chapter mean by having a “flow model” for a chat prototype?
5. Which rubric element is emphasized as necessary for evaluating the assistant during testing?
Conversational AI research is not “regular usability testing with a chat box.” You are studying a system that can generate novel responses, follow (or ignore) instructions, and produce failures that look confident. That changes how you define success, how you write tasks, and what evidence counts as “high-signal.” Your job is to turn product intent into measurable conversation outcomes, then choose methods that expose breakdowns in understanding, reasoning, safety, and user trust.
In practice, research design for chat experiences is a workflow: (1) align on product goals and the decisions your study must inform, (2) translate those goals into success criteria and hypotheses, (3) craft scenarios that elicit realistic and risky behavior, (4) recruit and screen the right participants, (5) build a test plan with a moderator guide, logging, and consent, and (6) pilot to refine tasks and rubrics before spending your full sample.
Unlike many UI studies, conversational research benefits from blending qualitative evidence (transcripts, probes, error taxonomies) with quantitative signals (resolution rate, time-to-answer, escalation, deflection). You will also find that engineering judgment matters: instrumentation, data retention, and privacy constraints can shape what you can measure, which in turn shapes your study design.
The sections below walk through concrete choices and templates you can apply immediately.
Practice note for Turn product goals into measurable conversation success criteria: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Design tasks and scenarios that elicit realistic user behavior: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Plan participant recruitment, screening, and ethics: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Build a test plan: moderation guide, logging, and consent: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Pilot your study and refine tasks and rubrics: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Method choice starts with the question you must answer and the maturity of the chat experience. For early concepts (scripted flows, state diagrams, prompt prototypes), a moderated usability test is usually the fastest path to actionable insight because you can probe intent, clarify user mental models, and observe how users recover from unexpected responses. For later-stage products with real traffic, log review and benchmarking become essential to quantify performance at scale and to detect long-tail failures you will never see in a lab.
Use this practical rule: if you’re uncertain why failures happen, run moderated sessions; if you’re uncertain how often they happen, analyze logs; if you’re uncertain whether you’re improving, run a benchmark. Diary studies are best when the assistant’s value emerges over time (e.g., workflow coaching, health habit support) or when context matters (mobile, on-the-go, intermittent use). Diaries capture the “messy middle” between sessions: repeated prompts, trust shifts, and when users stop using the tool.
A common mistake is choosing a method based on convenience rather than decision impact. For example, testing only “happy path” prompts in moderated sessions can make an assistant look excellent while hiding that real users ask messy, multi-part questions. Another mistake is running a benchmark without stable scoring criteria; if you can’t define what “resolved” means, you can’t claim improvement. Tie each method to a deliverable: a prioritized breakdown list, a metrics dashboard spec, or a go/no-go decision for launch.
Conversation success criteria must be measurable, aligned to product goals, and robust to ambiguity. Start by translating a goal into a user outcome and then into a metric definition. Example: “reduce support load” becomes “users solve issues without human help,” which becomes deflection rate—but only if you also measure whether users were actually helped (otherwise you incentivize premature closures).
Define metrics with operational rules you can apply consistently across transcripts and logs. A recommended core set: resolution rate (the task was solved per your rubric), correction rate (the user had to rephrase or fix the assistant), grounding rate (answers tied to cited sources or tool results), escalation rate, deflection rate (paired with a check that users were actually helped), and time-to-answer (with its endpoint defined explicitly).
Engineering judgment matters: instrument what you can reliably capture. “Time-to-answer” is meaningless if your timestamping is inconsistent across clients or if streaming responses complicate the endpoint. Decide whether you measure time to first response, time to final response, or time to user-confirmed resolution—each answers a different question.
Common mistakes: (1) using satisfaction alone (users can be satisfied with a wrong answer), (2) defining “resolution” too loosely (“the model responded”) or too strictly (“perfect answer”), and (3) mixing product outcomes across intents. A practical fix is to write a metric rubric per intent: what counts as resolved, partially resolved, unresolved, and unsafe. In moderated tests, apply the same rubric while coding transcripts; in logs, map rubric outcomes to observable signals (e.g., “clicked source,” “asked follow-up,” “escalated,” “abandoned”). This alignment makes your qualitative findings comparable to analytics later.
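Once transcripts are coded against a per-intent rubric, the rollups become a few lines of analysis. The sketch below assumes conversations already labeled with an intent and a mutually exclusive outcome category; the labels are illustrative.

```python
# Sketch of per-intent outcome rollups from coded transcripts.
from collections import Counter, defaultdict

def rollup(coded_conversations):
    """coded_conversations: [{"intent": "change_flight", "outcome": "resolved"}, ...]"""
    by_intent = defaultdict(Counter)
    for convo in coded_conversations:
        by_intent[convo["intent"]][convo["outcome"]] += 1
    return dict(by_intent)

def resolution_rate(counts: Counter) -> float:
    """Share of conversations for one intent that met the rubric's 'resolved' definition."""
    total = sum(counts.values())
    return counts["resolved"] / total if total else 0.0
```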
Tasks for conversational AI must elicit realistic language, not “researcher-speak.” Start by grounding each scenario in a believable situation, a user role, and a constraint. Then add variability so the assistant is tested against the same underlying intent expressed in different ways. For example, “Change my flight” should appear as: direct request, multi-part request, vague request, and request with missing details.
Write scenarios to test not just success, but failure handling. LLM systems will fail differently than deterministic UIs: they may hallucinate, over-commit, or answer confidently without enough information. Include tasks that force the assistant to ask clarifying questions, cite sources, refuse unsafe requests, or hand off appropriately.
Connect scenarios to hypotheses and success criteria. Example hypothesis: “If the assistant shows a short plan before acting, users will trust it more and require fewer follow-ups.” Your task must create a moment where planning matters, and your measures must capture trust and turns-to-resolution.
A frequent mistake is over-scripting participant language (“Ask the bot: ‘Please assist me with…’”). Instead, give the goal and let participants speak naturally, then capture their original phrasing as data. Another mistake is ignoring edge cases until after launch; for LLM chat, edge cases are often the brand-damaging ones. Build a task set that includes 60–70% common intents, 20–30% messy/ambiguous cases, and 10% adversarial/safety cases, then pilot to confirm the difficulty is realistic.
Sampling for conversational AI is about matching intent distribution and risk exposure. Start with your primary user segments, then map which segments produce high-stakes conversations (financial decisions, medical topics, vulnerable populations, regulated workflows). Your recruitment plan should reflect both frequency (common users) and consequence (high-risk users), even if high-risk users are a smaller share.
Create a screener that captures: domain familiarity, frequency of the target task, comfort with chat tools, and constraints that affect language (non-native speakers, accessibility needs). For workplace assistants, screen for role-specific vocabulary and workflow ownership; novices and experts fail differently. Novices reveal onboarding and expectation gaps; experts reveal precision requirements and tolerance for ambiguity.
Common mistake: recruiting “generic users” for a domain-specific assistant. This produces misleading findings—participants will ask basic questions or behave unlike real users. Another mistake is screening only for demographics and not for task reality (do they actually do the thing?). A practical approach is to require a recent example: “In the last 30 days, how did you handle X?” and ask for the steps they took. This both validates eligibility and gives you language to seed realistic scenarios.
Finally, plan how you will handle participants encountering model errors. In high-stakes domains, you may need guardrails in the prototype, a moderator intervention rule, or a study disclaimer that the assistant is not authoritative. These choices affect who you can ethically recruit and what claims you can make from the data.
A strong test plan makes chat studies repeatable and debuggable. Your materials should include: consent language, a standardized intro, tasks/scenarios, probes, a logging plan, and a debrief. For moderated sessions, the moderator script is your “control system”—it reduces variation introduced by different facilitators and keeps you from rescuing the product with leading hints.
Structure the moderator guide around moments that matter in chat: how the participant formulates the first request, whether the assistant asks (or misses) clarifying questions, breakdowns and repair attempts, signs of overtrust in confident answers, and escalation or handoff points.
Plan your logging and evidence capture before you run anyone. At minimum, capture: full transcript, timestamps, model/system version, retrieval sources shown, tool calls/actions taken, safety filters triggered, and any user feedback events. Without versioning, you cannot interpret changes across sessions—LLM behavior can drift with prompt tweaks or model updates.
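A lightweight way to enforce that capture is a shared session-log structure that moderated notes and unmoderated runs both fill in. The fields below mirror the list above; the names are illustrative.

```python
# Sketch of a per-session evidence log; field names are assumptions, not a standard.
from dataclasses import dataclass, field

@dataclass
class SessionLog:
    participant_id: str
    model_version: str
    system_prompt_version: str
    transcript: list = field(default_factory=list)        # [{"role", "content", "timestamp"}]
    retrieval_sources: list = field(default_factory=list)
    tool_calls: list = field(default_factory=list)
    safety_filters_triggered: list = field(default_factory=list)
    feedback_events: list = field(default_factory=list)   # thumbs, reports, corrections
```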
Debrief prompts should separate usability from trust and safety. Examples: “Which answers did you trust least, and why?” “Where did it feel like it made assumptions?” “Did you notice anything you’d consider risky or inappropriate?” A common mistake is ending with “Any other feedback?” and missing the chance to extract decision-quality insights. End with a structured comparison: what worked, what failed, and what the assistant should do when uncertain (ask, cite, refuse, or escalate). This supports a clear set of design and research deliverables: updated conversation flows, revised prompts, and a scored breakdown list.
Conversational data is uniquely sensitive because users naturally paste personal details into chat—often without realizing it. Ethics and privacy are not a legal afterthought; they shape your study design, tooling, and what you can store. Start by classifying what counts as PII in your context (names, emails, IDs, addresses) and what counts as sensitive data (health, finances, minors, employment disputes). Then design your tasks to avoid unnecessary collection and build safeguards for accidental disclosure.
Consent must be specific to conversational capture: participants should understand that free-form text may be recorded, that the system may generate unexpected content, and that they can skip any prompt. For studies involving safety or adversarial scenarios, avoid exposing participants to harmful content; keep adversarial tests bounded (e.g., testing refusal behavior with mild policy-violating requests) and pre-approve scripts with stakeholders.
Common mistakes include collecting real customer data in prototypes without proper safeguards, sharing raw transcripts broadly in slide decks, and forgetting that “model training” implications differ by vendor and configuration. Coordinate with engineering and legal early: verify whether chats are sent to third-party model providers, whether data is retained, and how to disable training on user inputs. Build these constraints into your research plan and instrumentation so your evidence is both actionable and ethically obtained.
Finally, pilot your consent and privacy instructions the same way you pilot tasks. If multiple participants still paste real data, your instructions or scenario design is failing—and you should fix that before scaling the study.
1. Why does conversational AI research require a different approach than “regular usability testing with a chat box”?
2. Which sequence best reflects the chapter’s recommended workflow for research design in chat experiences?
3. What is the primary purpose of turning product intent into measurable conversation outcomes?
4. Which approach best matches the chapter’s guidance on what evidence should be collected in conversational AI studies?
5. Why does the chapter emphasize piloting early in conversational AI research?
As a designer moving into AI UX research, your biggest mindset shift is that “the interface” is partially a probabilistic system. A chat flow can look perfect in a script and still fail when the model improvises, misreads context, or confidently answers beyond the product’s real capabilities. This chapter shows how to run moderated and unmoderated tests for LLM chat experiences and, more importantly, how to capture evidence that engineering and product teams can act on.
High-signal evidence has three qualities: it is reproducible (or clearly bounded by conditions), it is attributable (you can point to the prompt, model settings, and user action that caused it), and it is decision-ready (it maps to a fix: prompt change, policy, UI affordance, retrieval update, or guardrail). You will learn how to probe both the user’s mental model and the model’s behavior, collect clean transcripts with structured annotations, score conversations with a balanced rubric, and synthesize findings into problem statements and opportunity areas.
Throughout, treat your session outputs like a “trace” rather than a narrative. A good AI UX researcher can answer: What did the user ask? What did the system see? What did it retrieve? What did it respond? What did the user believe happened? And what changed after the repair? When you can connect those dots, your research shifts from anecdote to evidence.
Practice note for Run moderated sessions and probe model + user mental models: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Collect clean transcripts and structured annotations during sessions: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Execute unmoderated tests at scale with consistent logging: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Score conversations with a balanced rubric (UX + LLM quality): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Synthesize findings into problem statements and opportunity areas: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Moderating LLM chat tests differs from classic usability testing because the participant is co-creating the interface in real time. Your job is to keep the session structured while leaving room for natural language exploration. Start with a pre-brief that sets expectations: what the assistant can/can’t do, whether it can be wrong, and what “good” looks like (e.g., accurate steps, safe boundaries, transparent uncertainty). Then run tasks that mirror real intents, not UI steps. A task prompt like “Plan my first week using this tool” produces richer behavior than “Click the planner.”
Use a two-layer script: (1) a user-facing task list and (2) an internal moderator checklist. The checklist should include: model configuration (model version, system prompt, tools enabled), required logging, and probes you will use consistently. Example probes: “What do you think the assistant knows about you right now?” (mental model), “Why did you trust that answer?” (overtrust), “What would you do next if you couldn’t ask again?” (decision impact), and “If you rewrote your question, what would you change?” (repair behavior).
In moderated sessions, avoid “teaching the model” on behalf of the participant. If the participant asks, “What should I type?”, redirect: “What would you naturally ask?” Likewise, don’t rescue the conversation too early. Let the breakdown occur, then observe repair. Your goal is not to showcase the best-case chat, but to expose the edges where the product needs better prompts, UI affordances, or guardrails.
Breakdowns in conversational UX often look polite: the participant continues the chat even when they’re confused, or they accept a wrong answer because it sounds confident. Watch for subtle signals: long pauses before sending, repeated rephrasing, “Okay…” without action, copying text elsewhere, or switching from goal-seeking to system-debugging (“Why are you saying that?”). These behaviors indicate a mismatch between the user’s mental model and the system’s actual capabilities.
Classify breakdowns into at least three buckets during observation: confusion (user doesn’t know what to ask or what happened), overtrust (user believes incorrect or unsafe output), and repair (user or system attempts to recover). Repair strategies are a core research target because they are designable. Examples include: providing a clarification question, offering structured options, exposing sources, admitting uncertainty, or suggesting a safer alternative action.
Probe the model’s “mental model” indirectly by asking what inputs it had and what assumptions it made. In tool-using assistants, ask engineering for traces (tool calls, retrieval results). Then, during sessions, note when the assistant behaves as if it has memory, permissions, or authority it does not actually have. A classic overtrust scenario: the model gives confident policy advice without citations, and the participant says, “Great, I’ll do that.” Your evidence should capture both the unsafe answer and the user’s intention to act.
Engineering judgment matters when deciding whether the fix is conversational (prompting and guardrails), informational (better retrieval/grounding), or interactional (UI that sets expectations). If users repeatedly ask the same clarifying question, that’s often an interaction design gap: the system should proactively surface constraints or ask for missing slots rather than forcing the user to guess the “right prompt.”
Clean transcripts are necessary but not sufficient. To make sessions comparable and analyzable, annotate them with a lightweight framework that can scale from moderated notes to unmoderated logs. A practical minimum schema includes: intent (what the user is trying to do), outcome (what actually happened), issue type (why it failed or succeeded), and severity (how much it matters).
Define intents at the level your product roadmap cares about, not at the level of linguistic phrasing. For example, “request refund status,” “choose plan,” “troubleshoot setup,” “summarize policy,” “draft message.” Outcomes should be categorical and mutually exclusive where possible: success, partial success, failure, unsafe, needs human handoff. Then add issue types that bridge UX and LLM quality, such as: missing context request, incorrect factual claim, hallucinated capability, refusal when allowed, allowed when should refuse, tone mismatch, or broken tool invocation.
Severity is where you apply judgment. Use a 0–3 or 1–4 scale with explicit anchors. Example anchors on a 1–4 scale: 1 (cosmetic): tone is slightly off but the user proceeds; 2 (friction): the user needs a rephrase or an extra turn; 3 (critical): the issue blocks the task or causes a wrong action; 4 (harm): safety, legal, privacy, or medical/financial risk. Capture severity per issue, not per conversation, and then compute rollups (e.g., “critical issues per 10 conversations”).
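With per-issue severity recorded, a rollup like “critical issues per 10 conversations” is straightforward to compute. The sketch below assumes the 1–4 anchors described above and treats severity 3 or higher as critical; adjust the threshold to your scale.

```python
# Sketch of a severity rollup across annotated issues (1-4 anchors assumed).
def critical_issues_per_10(issues, n_conversations):
    """issues: [{"conversation_id": "c12", "issue_type": "hallucinated capability", "severity": 3}, ...]"""
    critical = sum(1 for issue in issues if issue["severity"] >= 3)
    return 10 * critical / n_conversations if n_conversations else 0.0
```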
A balanced rubric prevents teams from optimizing for “pleasant chat” while missing correctness and safety. Use a scorecard that combines UX quality with LLM evaluation concepts. At minimum, score: helpfulness (does it move the user forward), correctness (is it accurate and grounded), safety (does it avoid harmful guidance and respect policy), and tone (is it appropriate for context and brand).
Define each dimension with concrete anchors and examples. For correctness, specify what “grounded” means for your system: citations to trusted docs, tool results, or explicit uncertainty when data is missing. For safety, include both content safety (harmful instructions) and product safety (privacy, data leakage, unauthorized actions). Tone should not be purely aesthetic; it affects trust calibration. Overconfident tone with low grounding is a measurable risk.
Score at the conversation level and at the turn level when needed. Conversation-level scores help you compare variants; turn-level scores tell you where the failure starts (often the first missed clarification question). A practical approach is a 1–5 scale per dimension plus a binary “would you ship?” gate. Encourage evaluators to cite evidence: “Correctness = 2 because it claimed feature X exists; no tool call; user attempted action and failed.”
Common mistakes include: (1) treating helpfulness as a proxy for correctness (“it gave steps, therefore it’s good”), (2) ignoring refusal quality (a safe refusal can still be unhelpful if it lacks alternatives), and (3) using a rubric without calibration. Run a short calibration session where multiple raters score the same transcripts and reconcile differences. This is the fastest way to raise signal and reduce debate later.
LLMs are variable by design. If you don’t control variability, your findings can be dismissed as “one weird run.” Treat model settings as experimental conditions. Record model version, system prompt, temperature, top-p, tool configuration, and any retrieval parameters. When possible, fix a random seed (some platforms support it) or run multiple replicates per task to estimate variability.
For moderated studies, variability can be a feature: it reveals how robust the experience is across natural phrasing. But you still need reproducibility for debugging. When a critical failure occurs, immediately capture the exact conversation state: the full transcript, any hidden system messages, the tool outputs, and the retrieved documents. If the system uses memory, log what memory entries were available. Without this, engineering cannot reproduce, and your evidence loses power.
For unmoderated tests at scale, standardize everything except the variable you’re testing. Use consistent task prompts, consistent success criteria, and consistent logging. Consider an A/B approach where only one dimension changes (e.g., a new system prompt or a new safety policy). If temperature is high, require more runs per task and report distributions (median score, worst-case, and failure rate). Worst-case behavior is often what harms users, even if averages look fine.
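If your platform lets you run replicates, a small helper like the following turns raw per-run scores into the median / worst-case / failure-rate summary described above; the function name and threshold are illustrative.

```python
import statistics

def report_variability(scores, pass_threshold=3):
    """Summarize replicate runs of one task: median, worst case, failure rate."""
    return {
        "runs": len(scores),
        "median": statistics.median(scores),
        "worst_case": min(scores),
        "failure_rate": sum(s < pass_threshold for s in scores) / len(scores),
    }

# Example: rubric scores (1-5) from 8 replicate runs of the same task prompt
# at the same temperature and model version.
print(report_variability([4, 5, 4, 2, 4, 3, 1, 4]))
# median 4, worst_case 1, failure_rate 0.25: the average looks fine, the worst case doesn't.
```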
Stakeholders act on what they can see. Your synthesis should convert raw sessions into traceable evidence: short clips, crisp quotes, annotated transcripts, and a clear path from observation to recommendation. Create an evidence bundle per finding: (1) a one-sentence problem statement, (2) 1–3 artifacts (clip, transcript excerpt, screenshot), (3) annotation tags (intent, issue type, severity), (4) rubric scores, and (5) a proposed opportunity area with likely fix owners (prompting, retrieval, UI, policy, tooling).
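One way to keep bundles consistent is to store each finding as a structured record that mirrors the five elements above; the field names and values below are illustrative.

```python
# One evidence bundle per finding, ready to paste into a tracker.
finding = {
    "problem": ("Users seeking setup guidance stall because the assistant "
                "assumes permissions it does not have and offers no handoff."),
    "artifacts": ["clip_07.mp4", "transcript_c-05_turns_3-9.md"],
    "tags": {"intent": "troubleshoot setup",
             "issue_type": "hallucinated capability",
             "severity": 3},
    "rubric": {"helpfulness": 2, "correctness": 2, "safety": 4, "tone": 4},
    "opportunity": "Ask for missing permissions or hand off to support",
    "likely_owners": ["prompting", "UI"],
}
```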
Write problem statements that separate user goal from system failure. Example: “Users seeking setup guidance cannot complete task because the assistant assumes permissions it doesn’t have and does not offer a handoff.” Then quantify where possible: “5/8 participants encountered this; 3 attempted the suggested action and failed.” This is how you translate UX design experience into AI UX research deliverables that product teams trust.
Traceability is your safeguard against opinion wars. Every recommendation should link back to a specific turn range and condition. If you suggest changing the system prompt, cite the exact moment where the assistant failed to ask for missing slots. If you propose a UI change (e.g., suggested prompts or a ‘What I can do’ panel), link it to observed confusion and rephrasing loops. If you propose safety guardrails, include the overtrust evidence: the user’s stated intention to follow harmful advice.
Finally, synthesize into opportunity areas, not just issues. Opportunity areas combine frequency, severity, and fix leverage. For example: “Improve clarification strategy for multi-slot intents” (reduces friction broadly) or “Add grounding citations for policy answers” (reduces overtrust risk). Your chapter deliverable is a set of findings that are reproducible, annotated, scored, and easy to route into engineering backlogs.
1. Why can a chat flow that looks perfect in a script still fail in real use?
2. Which combination best describes the three qualities of high-signal evidence?
3. What does it mean to treat session outputs like a “trace” rather than a narrative?
4. In a moderated session for an LLM chat experience, what is a key goal beyond observing task success?
5. What is the purpose of scoring conversations with a balanced rubric in this chapter’s approach?
Chat UX research becomes credible (and repeatable) when you can show evidence at scale: what users attempted, what the assistant did, where it failed, and what improved after a change. In traditional product UX, analytics often answers “what happened.” In LLM-powered experiences, you also need to answer “why it happened” because the system’s behavior emerges from prompts, tools, retrieval, policies, and UI constraints.
This chapter gives you a practical workflow to move from raw transcripts to decisions. You’ll design an instrumentation plan (events + properties), define a taxonomy for intent and outcomes, compute core conversation metrics for a weekly dashboard, and connect qualitative coding to quantitative monitoring. Then you’ll diagnose failure modes (hallucination, refusal mismatch, tool errors) and propose experiments with guardrails and success thresholds.
The key mindset shift for a designer transitioning into AI UX research: you are no longer just measuring interface interactions. You are measuring a socio-technical loop—user intent, conversation design, model behavior, and external systems—all within the same “session.”
Practice note for Design an instrumentation plan: events, properties, and taxonomy: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Compute core conversation metrics and create a weekly dashboard: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Perform qualitative coding at scale and connect to metrics: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Diagnose failure modes and prioritize fixes using impact estimates: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Propose experiments: prompt tweaks, UX changes, and model policies: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
If you can’t reliably reconstruct “a conversation,” every downstream metric becomes noisy. Start with sessionization: define what counts as a session, how long a session can be idle, and when a new session should begin. A typical rule is a new session after 30 minutes of inactivity, but you should align to your product context (e.g., enterprise workflows may have longer gaps; consumer chat may need shorter).
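As a sketch, the inactivity rule can be implemented as a single pass over timestamped messages; the 30-minute threshold and function below are illustrative and should follow your own product context.

```python
from datetime import datetime, timedelta

IDLE_GAP = timedelta(minutes=30)  # adjust to your product context

def sessionize(messages):
    """Assign a session id to each (user_id, timestamp) message,
    starting a new session after IDLE_GAP of inactivity per user."""
    sessions = []
    last_seen = {}  # user_id -> (last timestamp, current session index)
    for user_id, ts in sorted(messages, key=lambda m: m[1]):
        prev = last_seen.get(user_id)
        if prev is None or ts - prev[0] > IDLE_GAP:
            idx = 0 if prev is None else prev[1] + 1
        else:
            idx = prev[1]
        last_seen[user_id] = (ts, idx)
        sessions.append((user_id, ts, f"{user_id}-s{idx}"))
    return sessions

msgs = [("u1", datetime(2024, 5, 1, 9, 0)),
        ("u1", datetime(2024, 5, 1, 9, 10)),
        ("u1", datetime(2024, 5, 1, 10, 0))]  # 50-minute gap -> new session
print(sessionize(msgs))
```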
Next, establish identifiers. You need at least: user_id (or hashed), session_id, conversation_id (if your system supports threaded conversations), turn_id (monotonic within a session), and message_id. If your assistant uses tools or retrieval, also include tool_call_id and retrieval_request_id so you can trace errors to their source.
Common mistake: logging only “message sent” and “assistant responded.” That hides whether the issue is UX (user confusion), orchestration (wrong tool), retrieval (no results), or generation (hallucination). A practical outcome of this section is a one-page instrumentation spec listing events, required properties, and ownership (who implements, who validates).
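For the instrumentation spec, it can help to draft one example event payload that carries all of the identifiers above; the event and property names below are assumptions to adapt, not a standard.

```python
# One logged event per assistant turn, with enough properties to trace a
# failure to its source (UX, orchestration, retrieval, or generation).
event = {
    "event": "assistant_turn_completed",
    "user_id": "u_3f9c1a",              # hashed at the source, never raw PII
    "session_id": "u1-s0",
    "conversation_id": "conv-812",
    "turn_id": 4,                        # monotonic within the session
    "message_id": "msg-2041",
    "intent": "reset password",          # from your taxonomy
    "tool_call_id": "tc-17",             # None if no tool was invoked
    "tool_status": "error:timeout",
    "retrieval_request_id": "rr-55",
    "retrieved_doc_count": 0,            # zero docs + confident answer = red flag
    "latency_ms": 2140,
    "model_version": "assistant-2024-05-01",
}
```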
A taxonomy makes conversations measurable. Without it, you can count turns but not progress. Design your taxonomy like an interface: it must be usable by analysts, labelers, and dashboards. Start simple and evolve. You typically need four layers: intent (what the user is trying to do), topic/domain (what it’s about), outcome (what happened), and escalation type (how/why it left automation).
Intents should map to user goals and system capabilities (e.g., “reset password,” “compare plans,” “draft email,” “find policy”). Avoid mixing intent with sentiment (“angry user”) or with channel (“mobile”). Topics help route to owners (billing vs. onboarding) and enable trend monitoring. Outcomes should be mutually exclusive at the session or task level: completed, abandoned, escalated, blocked by policy, tool failed, or unresolved.
Engineering judgment matters: don’t overfit the taxonomy to the model. Your labels should reflect user intent even when the model misinterprets it; otherwise, analytics will mask model errors by “agreeing” with the assistant. The practical deliverable is a taxonomy document with definitions, examples, edge cases, and a mapping to events/properties so each conversation can be tagged consistently.
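One lightweight way to keep labels consistent across analysts, labelers, and dashboards is to encode the four layers as enumerations; the labels below are illustrative and should come from your own taxonomy document.

```python
from enum import Enum

class Intent(Enum):           # user goal, independent of how the model read it
    RESET_PASSWORD = "reset password"
    COMPARE_PLANS = "compare plans"
    DRAFT_EMAIL = "draft email"
    FIND_POLICY = "find policy"

class Topic(Enum):            # routes findings to owners
    BILLING = "billing"
    ONBOARDING = "onboarding"

class Outcome(Enum):          # mutually exclusive at the session/task level
    COMPLETED = "completed"
    ABANDONED = "abandoned"
    ESCALATED = "escalated"
    BLOCKED_BY_POLICY = "blocked by policy"
    TOOL_FAILED = "tool failed"
    UNRESOLVED = "unresolved"

class Escalation(Enum):       # how/why the conversation left automation
    NONE = "none"
    USER_REQUESTED = "user requested human"
    POLICY_TRIGGERED = "policy triggered handoff"
    REPEATED_FAILURE = "repeated failure"
```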
Once you can reconstruct sessions and label outcomes, compute a small set of core metrics and review them weekly. Start with metrics that expose both UX friction and model reliability. A good dashboard answers: Are users getting what they need? Where are they getting stuck? Did performance change after releases?
Common mistakes: (1) reporting a single blended containment number that hides failing intents; (2) ignoring denominator definitions (per turn vs. per session); (3) treating thumbs-down as ground truth without reading transcripts. Practical outcome: a weekly dashboard with a stable set of tiles, each sliced by intent/topic, plus release annotations so you can correlate changes with deployments.
Also add “health” counters for analytics itself: % of sessions missing session_id, % of turns missing taxonomy labels, and % of tool calls without status codes. If your instrumentation degrades, your conclusions will too.
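A minimal sketch of the weekly rollup, assuming each session record already carries taxonomy labels; the field and metric names are illustrative.

```python
import statistics

def weekly_metrics(sessions):
    """Core tiles for a weekly dashboard, sliced per intent, plus
    health counters for the instrumentation itself."""
    n = len(sessions)
    by_intent = {}
    for s in sessions:
        by_intent.setdefault(s.get("intent") or "UNLABELED", []).append(s)

    tiles = {
        intent: {
            "sessions": len(group),
            "completion_rate": sum(x["outcome"] == "completed" for x in group) / len(group),
            "escalation_rate": sum(x["outcome"] == "escalated" for x in group) / len(group),
            "median_turns": statistics.median(x["turns"] for x in group),
        }
        for intent, group in by_intent.items()
    }
    health = {
        "pct_missing_session_id": sum(not s.get("session_id") for s in sessions) / n,
        "pct_missing_intent": sum(not s.get("intent") for s in sessions) / n,
    }
    return {"tiles": tiles, "health": health}

sessions = [
    {"session_id": "s1", "intent": "reset password", "outcome": "completed", "turns": 4},
    {"session_id": "s2", "intent": "reset password", "outcome": "escalated", "turns": 9},
    {"session_id": None, "intent": None, "outcome": "abandoned", "turns": 2},
]
print(weekly_metrics(sessions))
```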
Quant metrics tell you where problems concentrate; transcript analysis tells you what to fix. The goal is qualitative coding at scale without losing rigor. Use a two-pass approach: (1) broad clustering to find recurring patterns, then (2) focused coding to build a durable pattern library.
For clustering, you can embed user turns (or whole sessions) and group them by similarity. Combine embedding clusters with taxonomy slices (“high retry + billing intent”) to avoid clusters that are semantically neat but product-irrelevant. Then sample within each cluster to read transcripts and name themes in plain language.
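A hedged sketch of the clustering pass: the embedding function below is a stand-in you would replace with your embedding model or API of choice; the clustering itself uses scikit-learn's KMeans.

```python
import numpy as np
from sklearn.cluster import KMeans

def embed_texts(texts):
    """Placeholder: swap in your embedding model. Random vectors here
    only exist so the sketch runs end to end."""
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(texts), 64))

user_turns = [
    "why was I charged twice this month",
    "my invoice looks wrong",
    "how do I connect my printer",
    "setup keeps failing at step 3",
]

vectors = embed_texts(user_turns)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)

# Sample a few turns per cluster, read the transcripts, and name the theme
# in plain language before turning it into a coded flag.
for cluster in set(labels):
    members = [t for t, l in zip(user_turns, labels) if l == cluster]
    print(cluster, members[:3])
```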
Connect themes back to metrics by creating “coded flags” as analytic properties (e.g., has_retry, hallucination_suspected, tool_empty_then_confident_answer). This is how qualitative insight becomes measurable. Practical outcome: a living pattern library that PMs and engineers can act on, plus a set of coded indicators that appear in your dashboard alongside core metrics.
LLM failures are not all the same, and fixing the wrong layer wastes weeks. Build a failure mode checklist and apply it to the highest-impact clusters from your dashboard. Three common categories deserve special attention: hallucination, refusal mismatch, and tool errors.
Hallucination is not just “made-up facts.” In product terms, it’s a breach of groundedness: the assistant claims certainty without verified sources, or invents steps that don’t exist in your system. Instrument signals such as “no retrieval sources returned,” “tool status=error,” and “assistant still produced confident instructions.” Add a property like grounding_evidence_count and track hallucination-coded sessions per intent.
Refusal mismatch happens when the assistant refuses allowed requests or answers disallowed ones. This is where UX research meets policy. You need to log which policy rule triggered (if any), whether the refusal included a helpful alternative, and whether the user escalated or abandoned. A good refusal is still an outcome you can optimize: clarity, tone, and next steps.
Tool errors include timeouts, malformed parameters, empty results, and partial failures. The UX symptom is often “assistant loops” or “tries again” without explaining. Log tool name, latency, status code, and retry count; store sanitized request/response metadata for debugging. Then estimate impact: sessions affected × severity × frequency. This impact estimate is your prioritization engine; it helps you argue for fixes that reduce user harm, not just those that are technically interesting.
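The impact estimate takes only a few lines; the numbers below are made up, and the score is for ranking failure modes, not for measuring absolute harm.

```python
def impact(sessions_affected, severity, weekly_frequency):
    """Prioritization score: sessions affected x severity x frequency."""
    return sessions_affected * severity * weekly_frequency

failure_modes = {
    "hallucinated refund policy": impact(sessions_affected=40, severity=4, weekly_frequency=3),
    "tool timeout on order lookup": impact(sessions_affected=120, severity=2, weekly_frequency=5),
    "refusal without alternative": impact(sessions_affected=25, severity=2, weekly_frequency=2),
}
for name, score in sorted(failure_modes.items(), key=lambda kv: -kv[1]):
    print(f"{score:>6}  {name}")
```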
Analytics becomes transformative when you can prove improvement. Treat changes as experiments: prompt tweaks, UX changes (buttons, clarifying questions), and model policies (stricter grounding, different refusal templates). The key is to define success thresholds and guardrails before you ship, not after you see the graph.
Start by writing a crisp hypothesis tied to a failure mode and a metric: “If we add a clarifying question for ambiguous billing intents, retry rate will drop by 10% relative, without increasing abandonment.” Choose the smallest change that tests the idea. For many teams, prompt updates are fastest, but UX changes can be more reliable when ambiguity is the root cause.
Common mistake: running many prompt variants without isolating variables, then being unable to explain why metrics moved. Another mistake is optimizing for containment while hidden dissatisfaction rises; mitigate this by pairing containment with proxies like retries and post-chat feedback. Practical outcome: an experiment brief template that includes hypothesis, variant description, instrumentation changes (if any), metrics, thresholds, and a rollout plan with monitoring windows.
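An experiment brief can live as a structured record so no field gets skipped; the sketch below uses illustrative field names and the billing-intent hypothesis from above.

```python
from dataclasses import dataclass

@dataclass
class ExperimentBrief:
    """Fill this out before shipping the change, not after seeing the graph."""
    hypothesis: str
    variant: str
    instrumentation_changes: list
    primary_metric: str
    success_threshold: str
    guardrail_metrics: list
    monitoring_window: str

brief = ExperimentBrief(
    hypothesis=("Adding a clarifying question for ambiguous billing intents "
                "cuts retry rate by 10% relative without raising abandonment."),
    variant="Clarifying-question branch for billing intents with fewer than 2 filled slots",
    instrumentation_changes=["log clarifying_question_shown"],
    primary_metric="retry rate per billing session",
    success_threshold=">= 10% relative reduction vs. control",
    guardrail_metrics=["abandonment rate", "escalation rate"],
    monitoring_window="2 weeks post-rollout, sliced by intent",
)
print(brief.hypothesis)
```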
1. Why does analytics for LLM-powered experiences need to answer both “what happened” and “why it happened”?
2. Which set best describes what an instrumentation plan should include in this chapter’s workflow?
3. What is the purpose of computing core conversation metrics and creating a weekly dashboard?
4. How does the chapter recommend connecting qualitative coding to quantitative monitoring?
5. When diagnosing failure modes and prioritizing fixes, what approach does the chapter emphasize?
Research only matters if it changes what ships. In AI UX research, “shipping” includes product decisions (copy, flows, error states), model decisions (prompting, retrieval, guardrails), and operational decisions (monitoring, audits, and regression testing). This chapter shows how to convert test evidence into an insights-to-actions backlog with owners and acceptance criteria, write a report stakeholders will actually use, and set up a governance loop so quality doesn’t decay after launch. You’ll also learn how to package the work into portfolio artifacts and interview stories that prove you can operate as an AI UX researcher—not just a designer who ran a study once.
The mindset shift: treat your research deliverables like product interfaces. They should be skimmable, actionable, and designed for the people who must act. A strong Chapter 6 output is a “ship-ready” bundle: (1) prioritized recommendations with implementation details, (2) a concise narrative report with uncertainty explained, (3) a set of quality gates and regression sets for ongoing evaluation, and (4) portfolio-ready artifacts showing before/after impact.
Practice note for Create an insights-to-actions backlog with owners and acceptance criteria: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Write a concise AI UX research report and present it to stakeholders: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Build a governance loop: monitoring, audits, and regression testing: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Package a portfolio case study for AI UX research roles: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Career transition plan: resume bullets, interview stories, and skill signals: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
AI UX research recommendations fail when they are too abstract (“make it more helpful”) or too one-size-fits-all (“improve the model”). Convert insights into implementation-ready formats. Your default output should be an insights-to-actions backlog: each item has an owner (PM, designer, ML, backend), priority, evidence link, and acceptance criteria that can be verified in a follow-up eval.
Prompt diffs are the fastest path to impact for LLM apps. Write them like code reviews: show the “before” and “after” system/developer prompt snippet, and include rationale tied to evidence. Example acceptance criteria: “On the regression set ‘Returns Policy—Ambiguous’, assistant asks one clarifying question before stating policy; groundedness score ≥ 0.8 in 20/25 cases.”
UX copy recommendations should specify exact strings and where they appear: empty state, error state, safety refusal, and follow-up nudges. Common mistake: changing assistant tone without considering legal/safety constraints. Add a policy note when copy changes the boundary (“I can’t do that, but I can help you…”).
Flow changes are often higher leverage than prompt tweaks. Use a state diagram or step list: entry conditions, assistant action, user choices, and exit criteria. If your study found users repeatedly “loop” after a vague answer, propose a new branch: “Offer three next steps + ‘ask a different question’ affordance.” Acceptance criteria should include completion rate, turns-to-resolution, and reduction in fallback intents.
Policies turn recurring edge cases into consistent behavior. Document: allowed/blocked content, required disclaimers, escalation triggers (handoff to human), and data boundaries. Engineers can implement policies in guardrail logic; researchers can validate via targeted red-team prompts. The practical outcome is a backlog item that can be scheduled and tested, not a slide that gets applauded and forgotten.
Stakeholders are used to deterministic UI behavior; LLMs are probabilistic. Your job is to communicate uncertainty without sounding like you’re hedging. Replace vague caveats (“the model is inconsistent”) with structured variability: what varies, why it varies, and how you propose to control it.
Use three tools in your report and presentation. First, ranges not single numbers: “Task success 62–74% across 3 seeds” or “Groundedness dropped from 0.85 on curated queries to 0.61 on long-tail queries.” Second, confidence labels tied to evidence volume and representativeness: High (≥30 sessions + regression set), Medium (10–29 sessions or biased sample), Low (exploratory, anecdotal). Third, repro steps: provide the exact prompt, context, retrieval settings, and model version so others can see the behavior.
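The confidence labels can even be assigned mechanically so reports stay consistent across studies; the thresholds below mirror the rule of thumb above and should be tuned to your context.

```python
def confidence_label(session_count, has_regression_set, representative_sample):
    """Map evidence volume and representativeness to a confidence label."""
    if session_count >= 30 and has_regression_set and representative_sample:
        return "High"
    if session_count >= 10:
        return "Medium"
    return "Low"

print(confidence_label(34, True, True))    # High
print(confidence_label(18, False, False))  # Medium (smaller or biased sample)
print(confidence_label(4, False, False))   # Low (exploratory, anecdotal)
```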
When presenting, separate model behavior from product behavior. If the issue is that the assistant “hallucinates policies,” show whether it’s a retrieval failure (no source returned), a prompt failure (didn’t require citations), or a UI failure (sources hidden, users can’t verify). This framing prevents the common mistake of blaming the model for what is actually an instrumentation or UX design gap.
Finally, make uncertainty actionable by proposing risk-based rollouts: feature flags, limited domains, or “answer with citations only” modes. Tie recommendations to governance: monitoring dashboards, weekly audits of high-risk intents, and regression testing before each release. You’re not asking for perfection; you’re defining how the team will detect drift and respond.
Shipping AI research means defining quality gates that decide whether a change can launch. A gate is a measurable check with a threshold, an owner, and a clear “stop-ship” condition. Without gates, teams rely on anecdotal demos and end up surprised by failures in production.
Pre-release evals should include both UX and LLM evaluation concepts: helpfulness (did it solve the task), groundedness (is it supported by sources), and safety (does it refuse and redirect appropriately). Build a small evaluation suite from your study: the top tasks, the top failure modes, and a handful of long-tail prompts. Set thresholds that match risk: for low-risk features you might accept “helpfulness ≥ 0.75,” while for compliance-heavy domains you may require “citation present in 95% of policy answers.”
Red-teaming inputs are not only for security teams. Researchers contribute by translating observed user workarounds into adversarial prompts: users who try to “trick” the assistant, bypass refusals, or request private data. Include “benign misuse” too—people asking medical questions in a shopping assistant. A common mistake is writing unrealistic jailbreak prompts; prioritize prompts that real users actually attempted in sessions or that are likely in your domain.
Regression sets are the backbone of the governance loop. Every time you fix a failure, add the prompt + expected behavior to a regression set tagged by intent and risk level. Store the expected response as criteria, not exact text: “Must ask one clarifying question,” “Must cite source,” “Must refuse and offer alternative.” Run regressions on every prompt change, model upgrade, or retrieval index update. The practical outcome is a repeatable release process: research findings become tests, and tests prevent backsliding.
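A minimal sketch of a regression gate, assuming each case stores its criteria as simple checks over the assistant's reply; in practice those checks are often human labels or an LLM-as-judge with researcher spot checks. All names and thresholds below are illustrative.

```python
# A regression case stores criteria, not exact expected text.
regression_set = [
    {
        "id": "returns-policy-ambiguous-01",
        "intent": "find policy",
        "risk": "high",
        "prompt": "Can I return this?",
        "criteria": [
            ("asks one clarifying question", lambda r: "?" in r),
            ("does not state a policy unprompted", lambda r: "days" not in r.lower()),
        ],
    },
]

def run_gate(regression_set, get_response, min_pass_rate=0.95):
    """Stop-ship gate: the share of cases passing all criteria must meet the threshold."""
    results = []
    for case in regression_set:
        reply = get_response(case["prompt"])
        results.append(all(check(reply) for _, check in case["criteria"]))
    pass_rate = sum(results) / len(results)
    return pass_rate >= min_pass_rate, pass_rate

# Stand-in assistant for the sketch; replace with a real call to your system.
ok, rate = run_gate(regression_set, lambda p: "Which item would you like to return?")
print(ok, rate)
```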
AI UX research generates more artifacts than traditional usability testing: transcripts, prompt variants, model versions, retrieval configs, eval results, and analytics events. Research ops turns that complexity into a reusable system so each study compounds value.
Start with a central repository (Notion, Confluence, or a Git-backed docs folder). Standardize folders: Briefs, Scripts, Stimuli (prompt/flow versions), Sessions, Findings, Backlog, and Regression Sets. Every artifact should include “run metadata”: date, model/version, system prompt hash, tools enabled, and data sources. Without metadata, you can’t compare results across time.
Tagging makes retrieval possible. Tag findings by: intent (e.g., Order Status), failure mode (hallucination, refusal mismatch, over-verbosity, tool misuse), risk (low/med/high), and lifecycle stage (pre-launch, beta, post-launch). Pair tags with a consistent naming scheme for conversation snippets so teams can search and reuse them.
Make reusability a deliverable. Create a rubric template for chat quality (helpfulness, groundedness, safety, tone, interaction cost) and reuse it across studies so your metrics are comparable. Create an instrumentation map linking user goals to events, intents, slots, and outcomes; this becomes the bridge between qualitative findings and analytics monitoring. Common mistake: treating research ops as “nice to have.” In AI, ops is how you maintain a governance loop: monitoring, audits, and regression testing become routine rather than heroic.
Your portfolio should prove you can do the AI UX research job end-to-end: define the problem, run tests, interpret evidence, and drive changes with measurable impact. Hiring teams want artifacts, not just claims. Package a case study as a compact narrative with attachments.
Include a research brief (one page) showing goals, hypotheses, tasks, success criteria, and constraints (model limits, safety policies, data boundaries). Add a conversation test script and a flow/state diagram snippet so reviewers can see you understand chat mechanics. Show a rubric that evaluates helpfulness, groundedness, and safety—ideally with example scores and what “good” looks like.
Add a before/after section with concrete changes: prompt diffs, updated UX copy, and flow changes. Pair each change with evidence: a transcript snippet, a metric shift (task success, turns-to-resolution), or a reduction in safety incidents. If you used analytics, include a simple dashboard mock: funnels for key intents, fallback rate, handoff rate, and CSAT (if available). Show how instrumentation ties to outcomes—events and slots are not the goal; resolution is.
Finally, show the insights-to-actions backlog with owners and acceptance criteria. This signals that you can operate in a cross-functional environment and that your research is shippable. Common mistake: over-indexing on “AI buzzwords” and under-showing decision-making. Your portfolio should read like a product story where research drove a specific release and improved reliability.
To complete the career transition, translate your work into skill signals: resume bullets, interview stories, and a plan for take-home exercises. Start by rewriting your experience in outcomes + methods + AI-specific rigor. Example structure: “Ran 12 moderated chat-flow sessions for LLM assistant; identified top 5 failure modes; shipped prompt + flow changes; improved task success from X to Y; established regression set of 60 prompts and release quality gates.”
Prepare for case prompts by practicing a repeatable walkthrough: clarify domain and risk, define user goals, propose hypotheses, outline tasks and success metrics, describe instrumentation (events/intents/slots/outcomes), and explain how you’d evaluate helpfulness/groundedness/safety. Interviewers often test whether you can balance UX quality with model variability; explicitly mention seeds, reruns, and confidence levels.
For take-home tests, optimize for clarity and actionability. Deliver a concise report (1–2 pages) plus an appendix: transcripts, rubric, and prioritized backlog with acceptance criteria. Include at least one example of communicating uncertainty and one example of a governance loop (monitoring + audit cadence + regression plan). Common mistake: producing a beautiful deck with no implementable next steps.
In negotiation, anchor on scope and risk. Roles that own evaluation, safety collaboration, and release gates are higher leverage than “UX writing for chat.” Ask what model changes you can influence, whether there is an existing eval pipeline, and who owns monitoring. Those answers tell you whether you’ll be set up to succeed—and give you concrete levers to justify level and compensation.
1. In Chapter 6, what best describes “shipping” in AI UX research?
2. What is the primary purpose of converting test evidence into an insights-to-actions backlog?
3. Which deliverable best reflects the mindset shift to treat research deliverables like product interfaces?
4. Why does Chapter 6 recommend building a governance loop after launch?
5. Which set of elements matches the chapter’s definition of a “ship-ready” bundle?