Psychology to Human-Centered AI Research: User Studies + Model Analysis

Career Transitions Into AI — Intermediate

Turn psychology training into rigorous HCAI studies and model behavior insights.

Intermediate · human-centered-ai · ux-research · user-studies · model-behavior

Become a human-centered AI researcher—using your psychology training

This book-style course is built for psychology graduates and early-career researchers who want to pivot into human-centered AI (HCAI), AI UX research, or applied AI evaluation. You already know how to form hypotheses, design studies, run interviews, and interpret behavior. The missing piece is learning how those skills map to AI systems—where model behavior, uncertainty, and failure modes shape what users experience.

Across six chapters, you’ll learn to design user studies that stand up to scrutiny, analyze quantitative and qualitative data, and pair human evidence with model behavior analysis. The goal is practical: produce a portfolio-ready study plan and evaluation narrative that hiring managers recognize as real HCAI work.

What you’ll build as you progress

  • A clear role map and research brief that ties user outcomes to AI behavior
  • Ethics and privacy materials you can reuse (consent language, risk checks, data handling plan)
  • A complete AI user study design: tasks, recruitment plan, measures, and pilot runbook
  • An analysis approach for both quantitative results (effect sizes, uncertainty) and qualitative insights (coding → themes)
  • A lightweight model behavior evaluation plan (rubrics, test sets, human rating) that complements user findings
  • A case-study narrative you can turn into a portfolio piece and interview story

Why AI research needs both user studies and model behavior analysis

Traditional UX research often stops at usability findings. Traditional ML evaluation often stops at benchmark metrics. Human-centered AI research sits in the middle: you must show how model behaviors (errors, bias, inconsistency, calibration) translate into human outcomes (trust, reliance, decision quality, workload). This course teaches you how to connect those layers so your results drive concrete decisions: what to fix in the model, what to change in the interface, what to monitor in deployment, and what risks require guardrails.

How the 6 chapters fit together (like a short technical book)

You’ll start by translating psychology research competencies into HCAI responsibilities and artifacts, then establish ethical and privacy foundations specific to AI studies. Next, you’ll design studies that work for AI interactions (including prototypes and model comparisons). After that, you’ll learn practical quantitative analysis patterns for common research designs, followed by qualitative workflows and mixed-method triangulation. Finally, you’ll add model behavior analysis so your user evidence is paired with an evaluation rubric and test strategy—exactly what modern AI teams need.

Who this is for (and who it isn’t)

This course is for psychology grads, UX researchers, research assistants, and behavioral scientists who want to enter AI-facing roles without becoming full-time ML engineers. You do not need prior machine learning training, but you should be comfortable with basic research methods and willing to work with structured data (even in spreadsheets). If you’re looking for deep neural network implementation, this course will feel too applied and research-process focused.

Start here

If you’re ready to build credible, ethical, decision-driving evidence about AI systems, begin now and move chapter by chapter. Register free to access the course, or browse all courses to compare related career-transition paths.

What You Will Learn

  • Map psychology research skills to human-centered AI researcher responsibilities
  • Define research questions that connect human outcomes to model behavior
  • Design IRB-ready, ethical user studies for AI systems and prototypes
  • Choose methods (interviews, surveys, experiments) and justify trade-offs
  • Build tasks, stimuli, and measures for evaluating AI interactions
  • Analyze quantitative study results (effect sizes, confidence intervals, power basics)
  • Analyze qualitative data (coding, themes) and triangulate with metrics
  • Diagnose model failure modes (bias, hallucination, calibration, robustness) using evaluation frameworks
  • Write actionable research reports that inform model iteration and product decisions
  • Create a portfolio-ready study plan and evaluation narrative for job interviews

Requirements

  • Bachelor’s-level understanding of research methods (psychology or related field)
  • Comfort reading basic statistics (means, p-values) even if rusty
  • A laptop; no prior machine learning training required
  • Optional: basic spreadsheet skills (Google Sheets/Excel) for analysis practice

Chapter 1: From Psychology to Human-Centered AI Research

  • Role map: HCAI vs UX research vs data science
  • Translate psych methods into AI evaluation value
  • Research question templates for AI systems
  • Build your first HCAI study brief

Chapter 2: Ethics, Privacy, and Study Readiness for AI

  • Consent and risk assessment for AI studies
  • Privacy-first data handling plan
  • Bias and harm check: pre-mortem exercise
  • IRB-style materials draft pack

Chapter 3: Designing User Studies for AI Interactions

  • Method selection matrix for AI research questions
  • Task design and stimulus creation for model comparisons
  • Sampling, recruitment, and incentives plan
  • Pilot study runbook and iteration checklist
  • Measurement plan: behavioral + attitudinal + performance

Chapter 4: Quantitative Analysis for Study Results

  • Clean data and build an analysis-ready dataset
  • Choose tests and report effect sizes correctly
  • Power and sample size: practical decision rules
  • Create a results narrative with uncertainty

Chapter 5: Qualitative Analysis and Mixed-Methods Triangulation

  • Interview guide to codebook: end-to-end workflow
  • Run a rapid thematic analysis sprint
  • Triangulate qual themes with quant findings
  • Turn insights into model and product recommendations
  • Write a defensible limitations and ethics section

Chapter 6: Analyzing Model Behavior to Complement User Evidence

  • Define failure modes and an evaluation rubric
  • Build a minimal test set aligned to user tasks
  • Run behavior analysis and interpret trade-offs
  • Create a portfolio case study and interview story
  • 90-day transition plan: skills, projects, and networking

Sofia Chen

Human-Centered AI Researcher & Applied Experimentation Lead

Sofia Chen designs and evaluates AI systems using behavioral science, UX research, and experimental methods. She has led mixed-methods studies for conversational AI and decision-support tools, translating findings into measurable model and product improvements.

Chapter 1: From Psychology to Human-Centered AI Research

Human-centered AI (HCAI) research is the bridge between how AI systems behave and what happens to people because of that behavior. If you come from psychology, you already know how to turn messy human experience into testable hypotheses, measurable outcomes, and defensible conclusions. The shift is that the “stimulus” is often an AI system whose behavior can change across versions, prompts, data, or policies.

This chapter gives you a practical orientation: how HCAI differs from adjacent roles (UX research, data science), how to translate psychology methods into AI evaluation value, and how to produce your first HCAI study brief. You will learn to connect human outcomes (trust, reliance, comprehension, error detection, equity, wellbeing) to model behavior (accuracy, calibration, refusal patterns, hallucinations, toxicity, latency, and interaction design). By the end, you should be able to scope a study that is ethical, IRB-ready, and directly supports a product or policy decision.

Throughout, keep one principle in mind: HCAI research is not “studying users” in isolation and not “benchmarking models” in isolation. It is studying an AI-in-context system where user goals, interfaces, organizational constraints, and model behaviors jointly determine outcomes.

Practice note for this chapter’s milestones (role map; translating psych methods into AI evaluation value; research question templates; your first HCAI study brief): for each one, document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Sections in this chapter

Section 1.1: What “human-centered AI” means in practice

In practice, “human-centered AI” means you evaluate and improve AI systems by measuring outcomes that matter to people and organizations, not just model metrics. A human-centered AI researcher asks: What does this system cause users to do, believe, or feel—and under what conditions? That question forces you to connect model behavior (e.g., confident but wrong answers, inconsistent refusals, biased classification) to human outcomes (e.g., overreliance, degraded decision quality, reduced autonomy, unfair treatment, or better learning).

Role map matters because teams often confuse HCAI with UX research or data science. UX research typically focuses on usability, needs, and product fit, often assuming the underlying system behavior is stable. Data science often focuses on prediction quality and offline metrics, sometimes assuming user behavior is a downstream concern. HCAI sits between them: it treats the model as a variable that can shape user behavior, and treats people as part of the system you must optimize and safeguard.

  • HCAI research: studies interaction with AI, human outcomes, and model behavior together; designs evaluations that reveal harms, benefits, and trade-offs.
  • UX research: studies user needs, workflows, and interface comprehension; may test AI features, but often not model behavior in depth.
  • Data science/ML evaluation: studies model performance, robustness, and data issues; may not measure human reliance, trust, or decision quality.

A common mistake during career transitions is to “port” a classic psych study without adapting it to AI realities. AI systems can be stochastic, updated weekly, and sensitive to prompt wording or UI affordances. Human-centered practice means you version-control stimuli, log model outputs, and define what “the system” is for your study (model + prompt + retrieval + UI + policy). Your goal is actionable evidence: what to change in the model, interface, or guidance to improve real outcomes.
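
To make “define what the system is” concrete, here is a minimal sketch (in Python, with hypothetical field values) of a configuration record that pins down model + prompt + retrieval + UI + policy and yields a fingerprint you can attach to every logged output:

```python
import hashlib
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class SystemUnderStudy:
    """One record defining 'the system' for a study:
    model + prompt + retrieval + UI + policy."""
    model_id: str          # provider model name and version
    system_prompt: str
    temperature: float
    retrieval_source: str  # corpus or index the model can consult
    ui_variant: str        # interface build shown to participants
    policy_notes: str      # refusal/safety policy in effect

    def fingerprint(self) -> str:
        """Stable hash so every logged output can cite its configuration."""
        blob = json.dumps(asdict(self), sort_keys=True).encode()
        return hashlib.sha256(blob).hexdigest()[:12]

config = SystemUnderStudy(
    model_id="example-model-2025-01",          # hypothetical identifier
    system_prompt="You are a careful assistant...",
    temperature=0.2,
    retrieval_source="support_docs_v3",
    ui_variant="citations_on",
    policy_notes="refuse medical advice",
)
print(config.fingerprint())  # store alongside every participant session
```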

Section 1.2: The HCAI research lifecycle (problem → evidence → change)

The HCAI lifecycle can be summarized as problem → evidence → change. You begin with a decision that someone needs to make (ship a feature, choose a model, adjust a refusal policy, redesign explanations). Then you define evidence that will reduce uncertainty for that decision, and finally you translate findings into concrete changes (product requirements, model constraints, safety mitigations, user education, monitoring).

Problem framing is where psychologists add immediate value. Write the problem in terms of human outcomes, not features: “Users rely on the AI’s medical summaries even when uncertain,” not “We need a better summarizer.” Next, define success criteria: measurable improvements in decision quality, appropriate reliance, comprehension, or fairness. Then map those to model behaviors you suspect are causal (e.g., overconfident language, missing uncertainty cues, uneven error rates across groups).

Evidence design usually mixes methods. Early stages often use interviews or contextual inquiry to understand workflows and failure consequences. Mid stages use controlled experiments or surveys to test causal hypotheses (e.g., does uncertainty calibration reduce overreliance?). Later stages use field studies and telemetry to validate outcomes at scale. Your psychology toolkit transfers directly, but you must add AI-specific controls: freeze model versions, capture full prompts/outputs, and predefine how you will handle unsafe or disallowed outputs.

Change is where research becomes engineering judgment. Results must translate into “do this next week” actions: adjust UI copy, add friction for high-risk actions, redesign feedback loops, retrain with counterexamples, or add guardrails. A common mistake is reporting statistically significant findings without specifying what the team should change and how you expect that change to shift human outcomes. Treat every result as a lever: what knob can the team turn, and what metric should move?

When you write your first HCAI study brief, include the lifecycle explicitly: decision → hypothesis → method → measure → expected action. This keeps research anchored to real product and policy constraints.

Section 1.3: Mental models: users, models, and contexts

HCAI research requires three linked mental models: users, models, and contexts. Psych training often emphasizes users (cognition, motivation, biases) and contexts (social and organizational factors). You now add a third actor: the model, which has its own “behavioral profile” shaped by data, prompts, safety policies, and interface constraints.

User mental models include what people believe the AI can do, when it is reliable, and what “confidence” signals mean. Mismatched mental models create predictable failure modes: users may over-trust fluent text, under-trust correct but terse outputs, or misinterpret refusals as moral judgment. Plan to measure these beliefs directly (e.g., comprehension checks, trust calibration questions) instead of assuming them.

Model mental models are your operational understanding of how the system behaves: where it hallucinates, how it handles ambiguity, which prompts induce unsafe outputs, and how outputs vary with temperature or retrieval quality. You do not need to be an ML engineer to study this, but you must treat the model as an experimental factor. That means piloting prompts, sampling outputs, and documenting configuration so the study is reproducible.

Context mental models capture incentives, stakes, and constraints: time pressure, accountability, domain expertise, and organizational policies. The same model output has different consequences in a classroom versus a hospital. In HCAI, “validity” depends on matching the study context to the real decision environment. Common mistakes include testing in low-stakes settings and generalizing to high-stakes use, or using generic tasks that don’t reflect real workflows.

Research question templates help you connect the three mental models. Examples include: “How does model behavior X under context Y affect human outcome Z?” and “Which interface cues shift users from overreliance to appropriate reliance when the model is uncertain?” These templates keep your questions causal, measurable, and actionable.
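
As a lightweight illustration, the templates can be kept as fill-in-the-blank strings so every question explicitly names a model behavior, a context, and a human outcome (the filled-in content below is hypothetical):

```python
# Fill-in-the-blank research question template (illustrative content only)
TEMPLATE = ("How does {model_behavior} under {context} "
            "affect {human_outcome}?")

print(TEMPLATE.format(
    model_behavior="overconfident phrasing on uncertain answers",
    context="time-pressured triage decisions",
    human_outcome="reliance on incorrect suggestions",
))
```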

Section 1.4: Core artifacts: protocols, instruments, and reports

HCAI work is judged by the quality of its artifacts. Strong artifacts make your study ethical, repeatable, and useful to stakeholders. The core set is: protocols (what happens), instruments (what you measure), and reports (what you recommend).

Protocols include recruitment criteria, consent language, step-by-step session scripts, task instructions, and what you do when the AI produces harmful or disallowed content. If you are aiming for IRB readiness, write explicit risk mitigations: content warnings, skip/stop options, debriefing, data minimization, and plans for handling distress or safety concerns. For AI systems, also include technical reproducibility: model version, prompt templates, sampling parameters, and logging plans.

Instruments are your surveys, interview guides, rubrics, and behavioral measures. Psych methods translate well here, but be careful with common traps: using “trust” scales without defining whether you mean attitudinal trust or behavioral reliance; asking leading questions that teach participants what to think; and measuring self-report when you actually need behavioral outcomes. Pair subjective measures (perceived helpfulness, workload, confidence) with objective ones (error detection rate, time-to-decision, correction quality, calibration curves).

Reports should not be literature-style summaries. They should be decision documents: what you tested, what you found, how confident you are (effect sizes and confidence intervals, not just p-values), and what to change. Include trade-offs: a mitigation may reduce speed while improving safety, or increase user effort while reducing overreliance. When possible, add basic power reasoning to justify sample size: what minimum effect you care about, what uncertainty remains, and what follow-up study would reduce it.
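
As one concrete pattern, a report can state an effect size with its interval instead of a bare p-value. A minimal sketch using NumPy and SciPy, with hypothetical error-detection scores:

```python
import numpy as np
from scipy import stats

def cohens_d_with_ci(a, b, alpha=0.05):
    """Cohen's d for two independent groups with an approximate CI."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    na, nb = len(a), len(b)
    pooled_sd = np.sqrt(((na - 1) * a.var(ddof=1) +
                         (nb - 1) * b.var(ddof=1)) / (na + nb - 2))
    d = (a.mean() - b.mean()) / pooled_sd
    # Normal-approximation standard error of d
    se = np.sqrt((na + nb) / (na * nb) + d**2 / (2 * (na + nb)))
    z = stats.norm.ppf(1 - alpha / 2)
    return d, (d - z * se, d + z * se)

# Hypothetical error-detection scores with vs without uncertainty cues
with_cues = [0.82, 0.75, 0.91, 0.68, 0.77, 0.85]
without_cues = [0.64, 0.70, 0.58, 0.73, 0.61, 0.66]
d, ci = cohens_d_with_ci(with_cues, without_cues)
print(f"d = {d:.2f}, 95% CI [{ci[0]:.2f}, {ci[1]:.2f}]")
```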

These artifacts are also your career leverage. A well-written protocol and study brief demonstrate that you can operate in AI teams where ethics, reproducibility, and product decisions intersect.

Section 1.5: Stakeholders and decisions your research must support

Human-centered AI research succeeds when it supports a real decision owned by a real stakeholder. Typical stakeholders include product managers (ship/no-ship), ML engineers (model selection and guardrails), designers (interaction cues and feedback), policy/legal (compliance and risk), and domain leaders (fitness for high-stakes workflows). Each group needs different evidence, and your study must be designed with that audience in mind.

Start by writing the decision in one sentence: “Should we enable AI drafting for customer support agents by default?” Then list what could go wrong and for whom: customer harm, agent deskilling, privacy exposure, biased language, or longer resolution times. Convert those risks into measurable outcomes and acceptance thresholds. This is where psychology-to-AI translation is most powerful: you can define outcomes like appropriate reliance, perceived accountability, and workload, and connect them to model behaviors like hallucination rate and refusal consistency.

Method choice is a stakeholder negotiation. Interviews are persuasive for surfacing workflow realities and failure consequences, but they rarely settle causal debates. Surveys scale, but can over-index on attitudes. Experiments can establish causality, but may miss ecological validity. The practical approach is to justify trade-offs explicitly: what you gain, what you lose, and what decision remains uncertain after the study.

Common mistake: delivering “interesting findings” that do not change a roadmap. Fix this by including a decision table in your report: for each key decision, specify which outcome metrics matter, what result would trigger a change, and what action is recommended. When stakeholders see their decision reflected, your research becomes indispensable rather than optional.
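
A decision table needs no special tooling. A minimal sketch, with hypothetical decisions and thresholds:

```python
# Decision table sketch: one row per decision, all thresholds hypothetical
decision_table = [
    {"decision": "Enable AI drafting by default",
     "metric": "critical error rate vs control",
     "trigger": "increase > 2 percentage points",
     "action": "ship opt-in only; add a review step"},
    {"decision": "Show a confidence indicator",
     "metric": "acceptance of incorrect suggestions",
     "trigger": "no reduction (95% CI includes 0)",
     "action": "redesign the cue; re-test before launch"},
]

for row in decision_table:
    print(f"{row['decision']}: watch {row['metric']}; "
          f"if {row['trigger']}, then {row['action']}")
```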

Section 1.6: Scoping: feasibility, risks, and success criteria

Scoping is the difference between an elegant study on paper and a study that ships evidence on time. In HCAI, feasibility is constrained by model access, data privacy, safety policies, participant risk, and the pace of system iteration. Your job is to design a study that is rigorous enough to support a decision, while being realistic about time and engineering constraints.

Begin with feasibility checks: Can you freeze a model version for the duration of data collection? Can you log prompts and outputs without collecting sensitive data? Can you create stable stimuli (screenshots, recorded outputs) if the live system is too variable? Many teams underestimate variance introduced by model updates; a simple mitigation is to use fixed output sets for controlled experiments and reserve live-system testing for later validation.

Next, assess risks and IRB considerations. AI systems can generate unexpected content, so plan protocols for exposure risk, misinformation, and emotional distress. If your study involves vulnerable populations or high-stakes domains, narrow the scope and add safeguards: limit content domains, pre-screen stimuli, provide disclaimers, and ensure participants are not making real consequential decisions during the study.

Define success criteria in measurable terms. For quantitative outcomes, specify the minimum effect size that matters (e.g., a meaningful reduction in harmful reliance) and how you will report uncertainty (confidence intervals). For power basics, decide whether you need a directional signal (pilot) or a decision-grade estimate (larger sample). For qualitative outcomes, specify what saturation means for your question: are you mapping failure modes, or validating a known set of risks?

To build your first HCAI study brief, keep it one page with: background, decision, research questions (using the templates from this chapter), participants, tasks/stimuli, measures, analysis plan (including effect sizes and CIs), ethics/IRB notes, and “what we will change if we see X.” That last line forces clarity—and turns psychology skill into human-centered AI impact.
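
If it helps to keep the brief structured and versionable, here is a minimal sketch of the one-pager as a plain record (all content hypothetical):

```python
# One-page HCAI study brief as a structured record (content hypothetical)
study_brief = {
    "background": "Support agents draft replies with an AI assistant.",
    "decision": "Enable AI drafting by default for all agents?",
    "research_questions": [
        "Does drafting change critical error rates under time pressure?",
        "Which interface cues shift agents toward appropriate reliance?",
    ],
    "participants": "24 support agents, mixed tenure",
    "tasks_stimuli": "20 pre-generated drafts (10 correct, 10 seeded errors)",
    "measures": ["critical error rate", "reliance on incorrect drafts",
                 "time-to-decision", "trust calibration items"],
    "analysis_plan": "Effect sizes with 95% CIs; reliance split by correctness",
    "ethics_irb_notes": "No real customer data; consent covers edit logging",
    "expected_action": "If critical errors rise, ship opt-in with warnings",
}
```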

Chapter milestones

  • Role map: HCAI vs UX research vs data science
  • Translate psych methods into AI evaluation value
  • Research question templates for AI systems
  • Build your first HCAI study brief

Chapter quiz

1. What best captures the core focus of human-centered AI (HCAI) research described in the chapter?

Correct answer: Studying how AI behavior and human outcomes influence each other in an AI-in-context system
The chapter emphasizes HCAI as a bridge between AI behavior and what happens to people, focusing on the AI-in-context system rather than users or models in isolation.

2. According to the chapter, what is the key shift for psychologists moving into HCAI research?

Correct answer: The stimulus is often an AI system whose behavior can vary across versions, prompts, data, or policies
The chapter notes that psychologists already know how to form hypotheses and measures, but in HCAI the “stimulus” is frequently a changing AI system.

3. Which pairing best reflects the chapter’s idea of connecting human outcomes to model behavior?

Correct answer: Trust and reliance linked to calibration and hallucinations
The chapter explicitly describes connecting outcomes like trust/reliance to model behaviors such as calibration and hallucinations (among others).

4. Which study goal aligns most closely with the chapter’s definition of HCAI (vs UX research alone or data science alone)?

Correct answer: Evaluate how refusal patterns and interface design jointly affect user comprehension and error detection
HCAI research studies an AI-in-context system where model behaviors and interaction design jointly determine human outcomes.

5. What should a successful first HCAI study brief enable by the end of Chapter 1?

Correct answer: A scoped study that is ethical, IRB-ready, and supports a product or policy decision
The chapter states that by the end you should be able to scope a study that is ethical, IRB-ready, and directly supports a product or policy decision.

Chapter 2: Ethics, Privacy, and Study Readiness for AI

Human-centered AI research sits at an intersection: you are studying people, but the “stimulus” is often an adaptive system that can change its behavior, remember user inputs, and generate unexpected outputs. This chapter turns psychology research ethics into a practical readiness checklist for AI evaluations—so you can run studies that are defensible, IRB-ready, and safe for participants and your team.

Ethical readiness is not a paperwork phase you do at the end. It is a design constraint that shapes what you build, what you log, who you recruit, and how you respond when the model misbehaves. A good rule: if you cannot explain your data flow and risk controls on one page, you are not ready to put participants in front of the system.

We’ll use four recurring deliverables throughout the chapter: (1) a consent and risk assessment tailored to AI, (2) a privacy-first data handling plan, (3) a bias-and-harm “pre-mortem” exercise to anticipate failure modes, and (4) an IRB-style draft pack (protocol, scripts, materials, mitigations). These artifacts protect participants, reduce rework, and make your research easier to replicate.

Practice note for this chapter’s milestones (consent and risk assessment; privacy-first data handling plan; bias-and-harm pre-mortem; IRB-style materials draft pack): for each one, document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Sections in this chapter

Section 2.1: Human subjects basics for AI evaluations

AI evaluations become “human subjects research” when you collect data about living individuals through interaction, observation, or identifiable information. In practice, most user studies of AI assistants, recommenders, classifiers-in-the-loop, and decision-support prototypes qualify—even if you think you’re “just testing usability.” The key question is not whether your system is experimental, but whether people are.

Start your study planning with a compact risk assessment. In psychology, you may be used to “minimal risk” as the default; in AI, risk can increase quietly through logging, personalization, or the system generating content. Break risk into: interaction risk (stress, confusion, bad advice), information risk (privacy, re-identification), and equity risk (differential harms across groups). Then map each risk to controls you can implement.

  • Interaction risk controls: clear disclaimers, stop/skip options, escalation routes, time limits, and post-task debrief.
  • Information risk controls: minimize collection, pseudonymize, access control, and retention limits.
  • Equity risk controls: inclusive recruitment, accessibility checks, subgroup monitoring, and guardrails for harmful outputs.

Common mistake: treating the AI system as a static “stimulus.” Models can update, prompts can drift, and third-party APIs can change behavior. Freeze model versions where possible, log the configuration (model ID, parameters, system prompt), and store stimuli so results remain interpretable.

Practical outcome: by the end of this section you should be able to write a one-paragraph “human subjects determination + risk level” statement and identify which parts of your AI pipeline must be locked down before recruiting participants.

Section 2.2: Informed consent and deception in AI contexts

Informed consent is an ongoing communication, not a single signature. AI studies add special requirements because participants may over-trust systems (“automation bias”) or misunderstand what is human vs model-generated. Your consent process should be explicit about three things: what the system is, what it can get wrong, and what data it will capture.

Write consent language in user terms, then add an internal “researcher version” that is more technical. For participants, avoid “LLM,” “embedding,” or “fine-tuning.” Say: “You will interact with a computer program that generates text. It may produce incorrect or inappropriate content.” For the internal record, specify the model family, any safety layers, and whether user text is sent to an external provider.

Deception can be tempting in AI research (e.g., hiding that outputs are machine-generated, simulating capabilities, or using confederate “agent” behaviors). Use deception only when it is necessary to answer the question and when risks are minimal. If you must deceive, design the study so the deception cannot cause harm: avoid sensitive topics, avoid decisions with real-world consequences, and debrief promptly with a chance to withdraw data.

  • Consent and risk assessment workflow: define tasks → identify foreseeable harms → write mitigations → reflect them in consent → rehearse the session script to ensure the mitigations are real, not hypothetical.
  • AI-specific consent items: model limitations, potential harmful outputs, data transmission to vendors, recording/logging scope, and participant control (pause, skip, delete).

Common mistakes: (1) “broad” consent that fails to mention external API processing, (2) hiding that outputs may be unsafe, and (3) not telling participants what happens if they paste personal data into the system. Practical outcome: you should be able to draft an AI-ready consent form plus a short debrief template for any study that uses deception or simulated capabilities.

Section 2.3: Sensitive data, logging, and retention policies

AI studies create data in more places than traditional behavioral research: browser logs, app telemetry, prompt/response transcripts, screen recordings, error traces, and vendor dashboards. A privacy-first data handling plan prevents accidental collection and makes your IRB submission stronger because it shows you understand the true data surface area.

Begin by drawing a simple data-flow diagram: participant → interface → your server → model provider → storage/analytics. For each hop, list (a) what fields are transmitted, (b) who can access them, and (c) how long they persist. Then minimize. If you do not need raw text, do not store it. If you need it for qualitative coding, store it separately with strict access and remove identifiers early.

  • Minimization: collect only what you need to answer the research question; avoid full-screen recordings when event logs suffice.
  • Separation: store contact/compensation info separately from study data; use participant IDs.
  • Redaction: implement automated filters for phone numbers, emails, addresses, and other identifiers in transcripts.
  • Retention: set a deletion schedule (e.g., raw logs deleted after 30–90 days; de-identified coded data retained longer).

Engineering judgment matters: logging is useful for debugging, but it can silently turn into surveillance. Treat every additional log field as a cost that must earn its place. Another common mistake is “temporary” storage that becomes permanent by default (cloud buckets, third-party analytics, model provider retention). Verify vendor retention settings and document them.

Practical outcome: produce a one-page privacy plan that names storage locations, encryption/access controls, retention periods, and a redaction strategy for prompts/responses.
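
To illustrate what an automated redaction pass might look like, here is a minimal sketch; the patterns are deliberately simple, will miss cases, and should be treated as a first pass before human review:

```python
import re

# Minimal redaction sketch: illustrative patterns, not a complete PII filter
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
    "SSN":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact(text: str) -> str:
    """Replace likely identifiers in a transcript with typed placeholders."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(redact("Reach me at jane.doe@example.com or +1 (555) 123-4567."))
# -> "Reach me at [EMAIL] or [PHONE]."
```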

Section 2.4: Fairness, accessibility, and inclusive recruitment

Ethics includes who gets represented and who bears the burdens of research. AI systems often behave differently across language varieties, dialects, disability contexts (screen readers, motor impairments), and cultural expectations. If your recruitment is narrow, you will miss failure modes and your “human-centered” conclusions will not generalize.

Design recruitment like you would design sampling in psychology, but add system-specific considerations. If the model is intended for broad use, build inclusion criteria that reflect real users: varying age ranges, language proficiency, and technology comfort. If your study is early-stage and you cannot sample broadly, be explicit that you are testing “feasibility” rather than “effectiveness,” and do not over-claim.

Run a bias-and-harm pre-mortem before recruitment: gather the team and ask, “Six months from now, a blog post claims our study harmed people or misled stakeholders. What happened?” Capture scenarios such as exclusionary eligibility criteria, inaccessible interfaces, or differential exposure to harmful outputs. Turn each scenario into a mitigation (alternative formats, accessible tasks, subgroup checks, clearer boundaries).

  • Accessibility checks: keyboard-only navigation, captions for audio, readable contrast, screen-reader compatibility, and plain-language instructions.
  • Inclusive task design: avoid domain knowledge assumptions; provide examples; allow “prefer not to answer” on demographics.
  • Fairness monitoring: plan analyses that look for qualitative differences and quantitative disparities (e.g., error rates, trust ratings) across groups when sample size allows.

Common mistakes: using “college students with high tech literacy” to validate a tool intended for older adults; collecting demographic data without a plan to protect it; and treating accessibility as a post-launch feature. Practical outcome: a recruitment plan that aligns participant characteristics with intended users and documents accessibility accommodations.

Section 2.5: Safety: handling harmful model outputs during studies

Unlike most psychology stimuli, generative models can produce novel harmful content: harassment, sexual content, self-harm advice, medical misinformation, or targeted bias. Safety planning is therefore part of study design, not just content moderation. Your protocol should specify what happens if the model outputs something harmful, and what happens if a participant enters sensitive content.

Start by deciding the safety posture for the study: constrained (pre-written prompts, limited topics), semi-open (guided prompts with guardrails), or open-ended (exploratory, higher risk). Early-stage studies should usually be constrained. The more open-ended the interaction, the more you need real-time monitoring and intervention.

  • Prevent: system prompt guardrails, topic restrictions, refusal behavior, and curated tasks that avoid high-risk domains.
  • Detect: automated classifiers for toxicity/self-harm/PII, plus a human-in-the-loop monitor for live sessions.
  • Respond: a scripted “stop condition,” participant support language, and a clear path to terminate the session and remove the content from recordings where possible.
  • Debrief: explain what occurred, normalize discomfort, provide resources when relevant (e.g., mental health resources), and offer data withdrawal options.

Common mistakes: assuming vendor safety filters are sufficient; failing to rehearse stop/skip procedures; and leaving the facilitator to improvise when a participant is distressed. Engineering judgment: add friction where risk is high (confirmation dialogs before sensitive tasks, warnings for speculative advice), even if it slightly reduces realism. Practical outcome: a safety playbook that defines stop rules, monitoring roles, and participant support steps.

Section 2.6: Documentation: protocols, scripts, and risk mitigations

IRB-ready research is mostly about clarity. You are telling a reviewer—and your future self—exactly what participants experience, what you collect, and how you reduce risk. Assemble an “IRB-style materials draft pack” early, then iterate as the system changes. This pack also helps engineering and research stay aligned as prototypes evolve.

Your draft pack should include: a study protocol (purpose, procedures, duration), facilitator script, consent form, recruitment copy, screening criteria, task instructions, stimuli/prompt sets, survey instruments, debrief, compensation plan, and a risk mitigations table. The risk mitigations table is particularly effective: each identified risk maps to prevention/detection/response controls and who is responsible.

  • Protocol detail that matters for AI: model versioning, system prompt, temperature/decoding settings, tool access, and whether outputs are edited or post-processed.
  • Script detail that matters: what facilitators can and cannot explain, how to handle “Is this a person?” questions, and how to respond to requests for advice.
  • Data appendix: a list of logged fields and screenshots of any vendor settings related to retention/training on user data.

Common mistakes: inconsistent descriptions between protocol and consent, missing documentation of third-party services, and leaving “future analysis” vague. Practical outcome: a complete documentation bundle you can hand to an IRB or ethics review, and that your team can use to run the study consistently across sessions and moderators.

Chapter milestones

  • Consent and risk assessment for AI studies
  • Privacy-first data handling plan
  • Bias and harm check: pre-mortem exercise
  • IRB-style materials draft pack

Chapter quiz

1. Why does Chapter 2 say human-centered AI research needs special ethical readiness compared to many traditional psychology studies?

Correct answer: Because the “stimulus” can be an adaptive system that changes behavior, remembers inputs, and can produce unexpected outputs
The chapter highlights that AI systems can adapt, retain user inputs, and generate unpredictable outputs, which changes risk and oversight needs.

2. What does the chapter mean by saying ethical readiness is a “design constraint”?

Correct answer: Ethics should shape what you build, what you log, who you recruit, and how you respond to model misbehavior
Ethical readiness is described as something that actively constrains and guides design choices throughout the study lifecycle.

3. According to the chapter’s rule of thumb, what indicates you are not ready to run participants with the system?

Correct answer: If you cannot explain your data flow and risk controls on one page
The chapter states that inability to clearly summarize data flow and risk controls signals lack of readiness.

4. Which set best matches the four recurring deliverables emphasized in Chapter 2?

Correct answer: Consent and risk assessment; privacy-first data handling plan; bias-and-harm pre-mortem; IRB-style draft pack
The chapter explicitly lists these four artifacts as the recurring deliverables.

5. What is a primary purpose of creating these Chapter 2 artifacts (consent/risk, privacy plan, pre-mortem, IRB-style pack)?

Correct answer: To protect participants, reduce rework, and make research easier to replicate
The chapter frames these artifacts as protective, efficiency-improving, and replication-supporting.

Chapter 3: Designing User Studies for AI Interactions

Human-centered AI research lives at the intersection of two moving targets: people adapt to AI tools over time, and AI behavior changes across prompts, contexts, and versions. This chapter shows how to design user studies that are credible to engineering teams, defensible to reviewers, and safe and respectful to participants. You will practice translating a psychological research question (e.g., “When do people over-trust automation?”) into a study plan that connects human outcomes to model behavior (e.g., “Does feature X increase reliance on incorrect suggestions under time pressure?”).

Start with a “research question → decision” mindset: what product or model decision will your results change? Then select a method using a simple matrix: (1) how well you need to explain “why” (high for interviews), (2) how precisely you need to estimate an effect (high for experiments), (3) how early the system is (prototype vs production), and (4) the risk level (privacy, sensitive content, consequential decisions). In practice, teams combine methods: qualitative work to map workflows and failure modes, pilot experiments to validate tasks and measures, and a larger controlled test to quantify impact.

Designing tasks and stimuli is where AI studies differ from many classic psychology studies. You must decide what “the AI” is in the study: a fixed set of model outputs (for comparability), a live model (for realism), or a simulated assistant (for early exploration). For model comparisons, pre-generate stimuli: a balanced set of prompts, outputs, and ground-truth references so each participant sees comparable conditions. Document the provenance: model version, parameters, prompt template, and any safety filters. This documentation is often as important as the statistical result.

Plan sampling, recruitment, and incentives with the same rigor you would apply to a lab study, but align to real user segments: novices vs experts, high-stakes vs casual usage, and varied accessibility needs. Incentives should compensate time and cognitive load; underpaying increases dropout and low-quality responses. Your IRB-ready study packet should include consent language, data handling, risk mitigations, and a clear statement that participants can stop at any time without penalty.

Before running the “real” study, pilot. A pilot runbook prevents wasted weeks: test the survey flow, verify that tasks take the expected time, confirm that the AI outputs are stable (or intentionally variable), and ensure that measures detect meaningful variation. Iteration is normal: in AI interactions, confusing task instructions and inconsistent stimuli are common failure points.

Finally, build a measurement plan that captures behavior (what people do), attitudes (what they report), and performance (what they achieve). You will rarely get a complete picture from one measure. For example, satisfaction can rise while accuracy falls if an AI is fluent but wrong; calibrated trust requires measuring both reliance and correctness. Throughout the chapter, you will see practical checklists and common mistakes to avoid.

Practice note for this chapter’s milestones (method selection matrix; task design and stimulus creation; sampling, recruitment, and incentives plan; pilot study runbook): for each one, document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Sections in this chapter

Section 3.1: Interviews and contextual inquiry for AI workflows

Interviews and contextual inquiry are your fastest path to understanding how an AI system fits into real work. Use them when you need to map workflows, uncover implicit goals, and identify where model behavior creates friction (e.g., “The tool saves time, but I don’t know when it’s safe to copy outputs”). For AI, the critical move is to anchor questions in concrete episodes: “Walk me through the last time you used the assistant for X—what did you type, what did it return, what did you do next?” This avoids opinions that are disconnected from actual interaction patterns.

In contextual inquiry, observe participants completing their own tasks in their natural environment. Treat the AI as one actor in a larger system: documents, teammates, policies, deadlines, and accountability. Capture decision points where AI output influences action, and label failure modes (hallucination, missing caveats, biased suggestions, unsafe advice, overconfident tone). A practical artifact is a journey map with “AI touchpoints,” showing inputs, outputs, handoffs, and verification steps.

  • Method selection matrix: choose interviews when “why” is primary, when tasks vary widely across users, or when you expect unknown failure modes.
  • Task design tip: bring lightweight stimuli (example prompts and outputs) to probe reactions consistently, but allow participants to introduce their own real prompts for ecological validity.
  • Common mistake: asking “Do you trust the AI?” without defining the context, stakes, and alternative options.

Engineering judgment: interviews won’t quantify improvement, but they will tell you what to measure later. Use them to define hypotheses for experiments (e.g., “Citations reduce uncritical copy-paste”) and to choose segments for recruitment (e.g., novices who rely heavily vs experts who verify).

Section 3.2: Survey design for trust, satisfaction, and usability

Surveys help you quantify perceptions across larger samples: trust, usability, satisfaction, perceived risk, and self-reported reliance. In AI studies, the challenge is construct clarity: “trust” can mean perceived accuracy, predictability, value alignment, or safety. Write items that specify the referent and situation (e.g., “In this task, I felt the assistant’s suggestions were accurate enough to use without checking”). Use validated scales when possible (e.g., usability scales, workload scales), but ensure wording fits the AI context and your population.

Design surveys to reduce bias. Put task-based questions immediately after each trial (to minimize memory distortion), and global questions at the end (to capture overall impressions). Avoid leading language like “helpful” or “smart.” Include attention checks sparingly and ethically; prefer “instructional manipulation checks” that don’t shame participants. Pre-register key survey outcomes when you can, especially if results will inform a product launch.

  • Measurement plan: combine attitudinal items (trust, satisfaction) with behavioral outcomes (acceptance rate, edits, verification clicks). A high trust score with high error acceptance is a red flag.
  • Sampling plan: match recruitment to the decisions you’ll make—if you’re designing for clinicians, don’t rely solely on crowd workers. If you must, explicitly frame claims as “proxy users.”
  • Common mistake: overloading a single survey with every possible scale, increasing fatigue and noisy data.

Practical outcome: a well-designed survey gives you comparable benchmarks across model versions. Treat it like an instrument—stable enough to track trends, but flexible enough to add task-specific items when a new feature introduces new risks.

Section 3.3: Controlled experiments and A/B tests for AI features

Controlled experiments are the backbone for causal claims: did feature A change behavior relative to feature B? For AI interactions, common manipulations include interface affordances (confidence indicators, citations, warnings), model variants (baseline vs fine-tuned), and policy constraints (safety filtering on/off). Define the independent variable precisely, and keep everything else constant—including prompts, task difficulty, time limits, and instructions.

Task and stimulus creation determine whether your experiment answers the real question. For model comparisons, build a stimulus set that spans easy/hard cases and known failure modes. Pre-generate model outputs when feasible; live generation increases realism but adds variance that can mask effects. If you use live models, log every prompt and response and lock the model version for the study window.

  • Method selection matrix: use experiments when you need effect sizes to justify shipping, or when stakeholders ask “Is the improvement real?”
  • Power basics: decide the smallest effect worth acting on (e.g., 5% reduction in critical errors). Estimate sample size accordingly, and report confidence intervals—not just p-values (see the planning sketch after this list).
  • Outcome hierarchy: define primary outcomes (e.g., critical error rate, time-to-correct) and secondary outcomes (e.g., satisfaction) to avoid “metric shopping.”
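
Here is the planning sketch referenced above: a back-of-the-envelope sample-size estimate for detecting a 5-percentage-point drop in critical errors, assuming statsmodels and hypothetical baseline rates:

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Hypothetical plan: baseline critical-error rate 20%, smallest effect
# worth acting on = 5 percentage points (20% -> 15%).
effect = proportion_effectsize(0.20, 0.15)   # Cohen's h for two proportions

n_per_group = NormalIndPower().solve_power(
    effect_size=effect, alpha=0.05, power=0.80, alternative="two-sided")
print(f"~{n_per_group:.0f} participants per condition")  # roughly 450 each
```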

Common mistake: optimizing for a proxy metric (e.g., “users accepted more suggestions”) without accounting for correctness. In AI systems, acceptance can increase because outputs are more persuasive, not more accurate. Your experimental design should explicitly connect human outcomes to model behavior by measuring both reliance and ground-truth performance.

Section 3.4: Wizard-of-Oz and prototype studies for early models

Early in development, you often need to study an interaction concept before the model is reliable, safe, or integrated. Wizard-of-Oz (WoZ) studies let a human simulate the AI behind the scenes so you can test workflows, UI copy, and failure handling. Prototype studies can use mocked outputs, scripted responses, or partially automated pipelines. The goal is to learn quickly while reducing participant risk and engineering cost.

Design WoZ ethically and transparently. Participants should know that some responses may be simulated, unless deception is essential and approved by oversight; even then, debrief thoroughly. Standardize the “wizard” behavior with a playbook: response templates, latency rules, escalation paths for unsafe requests, and constraints to mirror expected model capabilities. Without standardization, you cannot interpret results because the “AI” changes from participant to participant.

  • Pilot runbook: rehearse handoffs, logging, timing, and failure cases (e.g., what the wizard does when they can’t answer).
  • Iteration checklist: after each pilot day, review confusion points in instructions, tasks that are too easy/hard, and any safety issues; then revise stimuli and scripts.
  • Common mistake: treating prototype results as performance validation. WoZ is for interaction validity, not model accuracy claims.

Practical outcome: you can validate task flows, measure demand for features (e.g., “ask for sources”), and identify critical moments where users need guardrails—before investing in training or deployment.

Section 3.5: Measures: trust calibration, reliance, workload, error costs

A strong measurement plan reflects the reality that AI can be simultaneously useful and wrong. Measure outcomes at three levels: (1) behavioral (what users do), (2) performance (correctness/quality of the final work), and (3) attitudinal (what users believe and feel). For trust, aim for calibration, not maximization: users should rely more when the AI is right and less when it is wrong.

Operationalize trust calibration with paired measures: AI accuracy (ground-truth scoring of outputs) and user reliance (acceptance, copy-paste, edit distance, time spent verifying, seeking external sources). A simple approach is to compute reliance rates separately for correct vs incorrect AI suggestions. Over-reliance shows up when users accept incorrect suggestions; under-reliance shows up when users reject correct help and lose time.
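
A minimal sketch of that computation, assuming pandas and a hypothetical trial log with one row per AI suggestion shown:

```python
import pandas as pd

# Hypothetical trial log: one row per AI suggestion shown to a participant
trials = pd.DataFrame({
    "participant": [1, 1, 2, 2, 3, 3, 4, 4],
    "ai_correct":  [True, False, True, False, True, False, True, False],
    "accepted":    [True, True,  True, False, False, True, True, False],
})

# Reliance rate conditioned on whether the AI was right
reliance = trials.groupby("ai_correct")["accepted"].mean()
over_reliance = reliance[False]       # accepting incorrect suggestions
under_reliance = 1 - reliance[True]   # rejecting correct help
print(f"over-reliance: {over_reliance:.0%}, "
      f"under-reliance: {under_reliance:.0%}")
```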

  • Workload: use brief workload measures and objective proxies (task time, number of steps, backtracks). Workload matters because “better accuracy” that increases cognitive burden may be a net loss.
  • Error costs: categorize errors by severity (minor style issues vs factual errors vs safety-critical harms). Report outcomes weighted by cost, not just counts.
  • Practical reporting: provide effect sizes and confidence intervals for key outcomes (e.g., reduction in critical errors), and show distributions—not only averages—because AI often helps some users while hurting others.

Common mistake: relying exclusively on satisfaction. Fluent AI can inflate ratings even when it increases downstream errors. A defensible study ties subjective measures to observable behavior and task performance.

Section 3.6: Counterbalancing, randomization, and confounds

AI interaction studies are especially vulnerable to confounds: learning effects, prompt drift, order effects, and differences in task difficulty across conditions. Counterbalancing and randomization are your main tools for protecting causal inference. If participants experience multiple AI conditions (within-subjects), counterbalance the order (e.g., Latin square) so each condition appears equally often in each position. If you run between-subjects, randomize assignment and check baseline equivalence (experience level, domain knowledge).
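
A minimal sketch of cyclic order rotation, one simple way to build a Latin square, with hypothetical condition names; if first-order carryover is a serious concern, a Williams design is the stronger choice:

```python
# Sketch: cyclic Latin-square ordering. Across each block of k
# participants, every condition appears exactly once in each position.
conditions = ["baseline", "model_A", "model_B"]  # hypothetical conditions
k = len(conditions)

def condition_order(participant_index: int) -> list:
    """Rotate the condition list by participant index (cyclic Latin square)."""
    shift = participant_index % k
    return conditions[shift:] + conditions[:shift]

for p in range(6):
    print(p, condition_order(p))
```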

Randomize at the right unit. If tasks vary in difficulty, randomize which stimuli each participant sees, or rotate stimuli across conditions so no model gets the “easy set.” When using pre-generated outputs, ensure both conditions are matched on length, tone, and topical coverage unless those are the manipulated features. When using live models, guard against temporal drift: lock versions, cap study duration, and monitor output changes.

  • Confounds checklist: instruction differences, UI layout changes, unequal latency, novelty effects, and differential logging failures.
  • Practical mitigation: include manipulation checks (e.g., “Did you notice citations?”) to confirm the feature was perceived, and track exposure (how often the feature appeared).
  • Common mistake: changing two things at once (model + UI + policy). If you must bundle changes, state explicitly that the test is of the package, not an isolated feature.

Engineering judgment: perfect control is rarely possible in applied AI. Your goal is to be explicit about what is controlled, what is randomized, and what limitations remain—so stakeholders can make decisions with appropriate confidence.

Chapter milestones
  • Method selection matrix for AI research questions
  • Task design and stimulus creation for model comparisons
  • Sampling, recruitment, and incentives plan
  • Pilot study runbook and iteration checklist
  • Measurement plan: behavioral + attitudinal + performance
Chapter quiz

1. What is the primary purpose of adopting a “research question → decision” mindset when designing a user study for AI interactions?

Correct answer: To ensure the study results directly inform a product or model decision
The chapter emphasizes starting from what decision the findings will change, then choosing methods and measures accordingly.

2. According to the method selection matrix in the chapter, which method is most appropriate when you need a strong explanation of “why” something happens?

Correct answer: Interviews
The matrix notes that explaining “why” is highest for interviews, while experiments prioritize precise effect estimation.

3. For a model comparison study where comparability across participants matters most, what task/stimulus approach does the chapter recommend?

Correct answer: Pre-generate a balanced set of prompts, outputs, and ground-truth references
Pre-generated stimuli help ensure each participant experiences comparable conditions for valid model comparisons.

4. Why does the chapter emphasize documenting provenance (e.g., model version, parameters, prompt template, safety filters) for AI study stimuli?

Correct answer: Because it can be as important as the statistical result for credibility and defensibility
AI behavior varies across versions and contexts, so provenance is critical for reproducibility and reviewer/engineering confidence.

5. Which measurement plan best matches the chapter’s guidance on evaluating AI interactions?

Correct answer: Measure behavior, attitudes, and performance because any single measure can be misleading
The chapter warns that satisfaction can rise while accuracy falls and stresses combining behavioral, attitudinal, and performance measures.

Chapter 4: Quantitative Analysis for Study Results

Human-centered AI research lives at the intersection of people and systems: you observe user behavior, measure user outcomes, and connect those outcomes to model behavior and interface design. Quantitative analysis is the bridge that lets you turn study observations into claims you can defend. This chapter focuses on the practical workflow: cleaning and structuring your dataset, choosing tests and reporting effect sizes correctly, using power basics to make realistic sampling decisions, and writing results narratives that include uncertainty rather than hiding it.

If you are transitioning from psychology, you already know the logic of measurement, experimental design, and inference. The shift in AI settings is that your “stimulus” might be an evolving model, your “task” might be a product-like interaction, and your dependent variables often mix subjective ratings (trust, workload) with behavioral traces (time, acceptance rate, edits). Your job is to make the dataset analysis-ready, pick methods that respect the design (especially repeated measures), and present results in a way that is both statistically sound and meaningful for engineering decisions.

A consistent theme: treat analysis as an extension of design. The choices you make—how you encode trials, how you aggregate per participant, what counts as an outlier—change the estimand (what you are actually estimating). Good analysis makes those choices explicit and checks whether conclusions are robust to reasonable alternatives.

Practice note (applies to each milestone in this chapter, from cleaning data through the results narrative): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Sections in this chapter
Section 4.1: Data structures: participants, trials, and repeated measures

Most AI user studies are not “one row per person.” They are usually multilevel: participants complete multiple tasks, each task includes multiple trials, and each trial may include multiple model outputs (e.g., top-1 vs top-5, regenerated suggestions). If you flatten this into a single table without care, you can accidentally treat repeated observations as independent—one of the most common causes of overconfident p-values and narrow confidence intervals.

Start by deciding your unit of analysis for each outcome. Examples: (1) user-level outcomes like SUS, overall trust, or post-task satisfaction are naturally one row per participant; (2) trial-level outcomes like correctness, time-on-trial, or whether the user accepted the model suggestion are one row per trial; (3) text-edit distance or quality ratings might be one row per model output. Once you pick the level, ensure your dataset has stable identifiers: participant_id, condition, task_id, trial_id, and (if needed) model_output_id. Include timestamps and versioning: model version, prompt template version, UI build, and randomization seed. In AI work, reproducibility often fails because “the model changed” and the data no longer matches the analysis.

Cleaning to build an analysis-ready dataset is mostly about aligning the log data with the experimental design. Verify randomization (counts per condition), confirm counterbalancing worked (order distributions), and document exclusions with rules set before looking at outcomes (e.g., failed attention checks, impossible completion times, incomplete surveys). For repeated measures, decide how to handle missing trials: do you drop participants, impute, or analyze with models that tolerate missingness? A practical rule is to avoid ad hoc averaging that hides missingness; instead, compute per-participant summaries with explicit denominators (e.g., acceptance_rate = accepts / eligible_trials) and keep both numerator and denominator for transparency.
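
A minimal sketch of that rule, assuming a trial-level table with hypothetical columns participant_id, condition, and accepted:

```python
# Sketch: per-participant summaries with explicit numerators and
# denominators, so missingness stays visible instead of being hidden
# by ad hoc averaging. Column names are hypothetical.
import pandas as pd

trials = pd.read_csv("trials.csv")  # one row per trial

summary = (
    trials.groupby(["participant_id", "condition"])
    .agg(
        accepts=("accepted", "sum"),
        eligible_trials=("accepted", "count"),  # counts non-missing trials
    )
    .reset_index()
)
summary["acceptance_rate"] = summary["accepts"] / summary["eligible_trials"]
print(summary.head())
```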

  • Common mistake: computing a t-test on trial-level rows without clustering by participant, inflating the effective sample size.
  • Practical outcome: a tidy dataset where each row matches a design unit, with IDs that let you aggregate correctly and diagnose anomalies.
Section 4.2: Descriptives and visualization for AI study data

Before any hypothesis test, run descriptives that answer: “What happened?” In human-centered AI, this includes both participant outcomes and model behavior. Compute means/medians, standard deviations/IQRs, and missingness rates by condition. Also compute model-centric descriptors: average confidence score, response length, refusal rate, toxicity score, or latency. The goal is to see whether a user effect might actually be driven by a model behavior shift (for example, one condition produces longer answers, which changes task time and perceived helpfulness).

Visualization is your fastest debugging tool. For between-subject outcomes, use dot plots with confidence intervals, not just bar charts. For repeated measures, use paired plots (lines connecting each participant across conditions) to reveal heterogeneity: maybe the average effect is small, but some users benefit a lot while others are harmed. For trial-level outcomes, use distributions and faceting by task; AI tasks often differ sharply in difficulty, so pooling can mislead. If you have time series interaction logs, plot learning curves or fatigue effects across trial order to check whether ordering confounds your main effect.
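
A minimal sketch of a paired plot in matplotlib, with hypothetical per-participant scores:

```python
# Sketch: paired plot connecting each participant across two conditions,
# revealing heterogeneity that a bar chart of means would hide.
import matplotlib.pyplot as plt
import numpy as np

cond_a = np.array([34, 40, 29, 51, 45, 38])  # hypothetical task times (s)
cond_b = np.array([30, 45, 31, 42, 46, 33])

fig, ax = plt.subplots(figsize=(4, 4))
for a, b in zip(cond_a, cond_b):
    ax.plot([0, 1], [a, b], marker="o", color="gray", alpha=0.6)
ax.set_xticks([0, 1])
ax.set_xticklabels(["Condition A", "Condition B"])
ax.set_ylabel("Task time (s)")
plt.show()
```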

Descriptives also support engineering judgment. If you see a heavy-tailed time distribution, the median and robust summaries may be more informative than the mean. If the acceptance rate is near 0% or 100%, a linear model on raw percentages may behave poorly; consider logistic models or analyzing log-odds. When you later choose tests, these descriptive checks guide whether assumptions (normality, equal variance) are plausible and whether transformations or nonparametric options are warranted.

  • Common mistake: skipping plots and discovering post hoc that one participant contributed 200 trials due to a logging bug.
  • Practical outcome: a descriptive “data story” that validates the pipeline and narrows the plausible explanations before inference.
Section 4.3: Hypothesis tests vs estimation (CIs) in practice

In product-adjacent AI research, teams often ask, “Did it work?” A p-value seems to answer that, but it rarely answers the decision question you actually have: “How much did it change outcomes, and how uncertain are we?” Estimation-first reporting—effect size plus confidence interval—forces clarity. A non-significant result with a wide CI may still be compatible with a meaningful effect; it mainly says your study was not precise enough.

Use hypothesis tests when you need a controlled error rate for a specific decision threshold (e.g., releasing a risky feature) or when preregistration/IRB requires a confirmatory plan. Otherwise, treat tests as secondary checks. For most human-centered AI studies, report: (1) the estimand (difference in mean satisfaction, odds ratio of acceptance, median time difference), (2) the point estimate, (3) a 95% CI (or a CI aligned with your risk tolerance), and (4) a practical interpretation tied to the task.

How do you get CIs? For simple designs, analytic CIs are fine (t-based for mean differences). For messier metrics (e.g., edit distance, rank-based outcomes, bounded ratings), bootstrap CIs are practical and interpretable—just ensure you resample at the correct level (participant-level bootstrap for repeated measures, not row-level). If you report multiple outcomes, be explicit about which are primary; otherwise readers will correctly suspect selective emphasis. In AI contexts, it is common to have many correlated outcomes (trust, helpfulness, reliance). Estimation helps: you can show a pattern of effects with uncertainty rather than chasing a single “significant” metric.
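
A minimal sketch of a participant-level bootstrap, assuming trials have already been collapsed into one summary value per participant:

```python
# Sketch: participant-level bootstrap CI for a within-subject mean
# difference. Resampling participants, not rows, respects the
# repeated-measures structure. The data values are hypothetical.
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_ci(diffs: np.ndarray, n_boot: int = 5000):
    """95% percentile CI for the mean of per-participant differences."""
    boot_means = np.empty(n_boot)
    for b in range(n_boot):
        sample = rng.choice(diffs, size=len(diffs), replace=True)
        boot_means[b] = sample.mean()
    return np.percentile(boot_means, [2.5, 97.5])

# One value per participant: condition B minus condition A.
diffs = np.array([-4.0, 2.5, 1.0, 3.5, 0.5, 2.0, -1.0, 4.5])
low, high = bootstrap_ci(diffs)
print(f"mean diff = {diffs.mean():.2f}, 95% CI [{low:.2f}, {high:.2f}]")
```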

  • Common mistake: interpreting “p > .05” as evidence of no effect, even when the CI includes large beneficial or harmful effects.
  • Practical outcome: results framed as magnitude + uncertainty, which better supports model and UI iteration.
Section 4.4: Common designs: within/between, mixed, and nonparametric options

Your design determines your analysis. Between-subject designs (each participant sees one condition) are simpler but require larger samples. Within-subject designs (each participant sees multiple conditions) are powerful and common in AI evaluation because they control for individual differences—especially useful when participants vary widely in baseline skill. The catch is carryover: seeing one model output can change how a participant judges the next. Use counterbalancing, randomization, and washout tasks when feasible.

For between-subject comparisons of continuous outcomes, independent-samples t-tests or linear regression are standard. For within-subject two-condition comparisons, paired t-tests or regression with participant fixed effects can work. For mixed designs (e.g., condition between-subject, task repeated within-subject), use mixed-effects models (random intercepts for participants, and often for items/tasks). In human-centered AI, “items” matter: if tasks are sampled from a broader universe (different prompts, documents, questions), treat them as random effects when possible to avoid overgeneralizing from a small item set.
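
A minimal sketch using statsmodels' MixedLM, with hypothetical column names; fully crossed participant-and-item random effects are easier to express in lme4 (R) or via MixedLM's variance-components formulas:

```python
# Sketch: random intercept per participant, fixed effect of condition.
# Formula and column names are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("trial_level.csv")  # one row per trial

model = smf.mixedlm(
    "task_time ~ condition",       # fixed effect of interest
    data=df,
    groups=df["participant_id"],   # random intercept per participant
)
result = model.fit()
print(result.summary())
```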

Nonparametric options are not “inferior”; they are tools for distributions that violate assumptions or for ordinal data. Wilcoxon signed-rank (paired) and Mann–Whitney U (independent) are common, but remember they test differences in distributions, not strictly means. For binary outcomes (accepted suggestion yes/no), logistic regression (possibly mixed-effects) is usually more interpretable than comparing proportions with a z-test, especially with repeated measures.

Power and sample size decisions depend on design. Practical decision rules: (1) if you can do within-subject without serious carryover, it often cuts required N substantially; (2) if you must do between-subject, prioritize fewer, higher-quality tasks and a single primary outcome to avoid spreading power thin; (3) for mixed models, plan for more participants than a simple t-test calculation suggests, because clustering and item variability reduce effective information. When exact power is hard, use “minimum detectable effect” thinking: pick the smallest effect that would justify a product/model change, then plan N to estimate it with a CI narrow enough to make a call.

Section 4.5: Reliability, scale scoring, and measurement validity

Quantitative analysis is only as good as the measurement. In psychology-to-AI transitions, this is your home advantage: you know that a metric is not a fact; it is an operationalization. Start by scoring scales correctly. Reverse-score items where needed, verify response ranges, and decide how to handle missing items (common rule: compute scale score if at least, say, 80% of items are present; otherwise set missing). Keep both raw item responses and scored composites so others can audit your choices.
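
A minimal sketch of that scoring rule, with hypothetical item names and keying:

```python
# Sketch: reverse-score flagged items, then compute the scale score only
# when at least 80% of items are present. Items and keying are hypothetical.
import numpy as np
import pandas as pd

items = ["trust_1", "trust_2", "trust_3", "trust_4", "trust_5"]
reverse_keyed = ["trust_3", "trust_5"]
scale_min, scale_max = 1, 7  # 7-point Likert

df = pd.read_csv("survey.csv")
scored = df[items].copy()
for col in reverse_keyed:
    scored[col] = (scale_min + scale_max) - scored[col]

share_answered = scored.notna().mean(axis=1)
df["trust_score"] = scored.mean(axis=1)  # mean of available items
df.loc[share_answered < 0.80, "trust_score"] = np.nan  # below threshold
```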

Reliability is not a checkbox but a diagnostic. Compute internal consistency (often coefficient alpha or omega) for multi-item scales, but interpret it with context: alpha can be inflated by many similar items and does not prove unidimensionality. If you have enough data, check item-total correlations and whether reliability differs by condition (a sign that participants interpreted items differently across interfaces). For behavioral measures, reliability may mean stability across repeated tasks or inter-rater agreement if humans label outputs (use ICC or Krippendorff’s alpha as appropriate).
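
Coefficient alpha is simple enough to compute directly from the item matrix; a minimal sketch:

```python
# Sketch: Cronbach's alpha from an items matrix (rows = respondents,
# columns = items, complete cases). A diagnostic, not proof of
# unidimensionality.
import numpy as np

def cronbach_alpha(item_matrix: np.ndarray) -> float:
    k = item_matrix.shape[1]
    item_vars = item_matrix.var(axis=0, ddof=1)
    total_var = item_matrix.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)
```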

Measurement validity is especially tricky in AI studies because the system can change what the construct “means.” For example, “trust” might shift from interpersonal trust to calibration of model uncertainty; “helpfulness” may be confounded with verbosity. Use convergent checks: does self-reported reliance correlate with observed acceptance rate? Use discriminant checks: does your “trust” scale accidentally track mere satisfaction? When feasible, include manipulation checks tied to model behavior (e.g., did participants notice uncertainty cues) and track whether the model’s actual accuracy changed between conditions; otherwise you may attribute effects to UI when they are driven by model quality.

  • Common mistake: averaging Likert items into a score without checking reverse-keying or whether items form a coherent scale.
  • Practical outcome: defensible measures that let you connect human outcomes to model behavior without construct drift.
Section 4.6: Reporting: effect sizes, practical significance, and limitations

Reporting is where analysis becomes decision support. Start with effect sizes, not p-values. For mean differences, report the raw difference in natural units (e.g., +0.6 points on a 7-point helpfulness scale, −18 seconds per task) and a standardized effect size when helpful (Cohen’s d for between-subject, paired d or standardized mean change for within-subject). For binary outcomes, report risk difference or odds ratio with a CI; odds ratios are common in regression but can be unintuitive—pair them with predicted probabilities at representative baselines.

Always pair effect sizes with confidence intervals and clarify what the interval conditions on (your sample, your design, your model assumptions). If you used mixed-effects models, report the fixed effect estimate, CI, and the random-effects structure. If you used bootstrapping, state the resampling unit (participants, not trials) and the number of resamples. These details protect you from the “looks significant because of pseudoreplication” critique.

Practical significance is the missing link in many AI studies. Translate effects into user and business impact: fewer errors per 100 tasks, reduced time per document, improved calibration (higher accuracy at same confidence), or reduced overreliance (acceptance decreases when model is wrong but remains high when model is right). For human-centered AI, also report trade-offs: a feature may increase trust but also increase time, or improve speed while reducing understanding. A good results narrative presents these trade-offs with uncertainty and suggests next experiments or mitigations.

Limitations should be specific and actionable. Common ones: sample not representative (e.g., crowdworkers vs target professionals), item set too narrow (few prompts/documents), model version drift, learning effects in within-subject designs, and measurement validity threats (ratings influenced by output length). End your chapter-style report with what would change your mind: what additional data, replications, or ablations (e.g., holding model output constant while changing UI) are needed to confirm causality. This mindset makes your quantitative work credible to both IRB reviewers and engineering teams.

Chapter milestones
  • Clean data and build an analysis-ready dataset
  • Choose tests and report effect sizes correctly
  • Power and sample size: practical decision rules
  • Create a results narrative with uncertainty
Chapter quiz

1. Why does Chapter 4 describe quantitative analysis as a “bridge” in human-centered AI research?

Correct answer: It turns study observations into defensible claims by connecting user outcomes to model/interface factors
The chapter emphasizes using quantitative analysis to move from observations to claims you can defend, linking people-level outcomes to system design and model behavior.

2. What is the key shift when moving from psychology studies to AI settings, according to the chapter?

Correct answer: Dependent variables often combine subjective ratings with behavioral traces in product-like interactions with evolving models
AI contexts often involve evolving model “stimuli,” product-like tasks, and mixed outcomes (e.g., trust ratings plus time/acceptance/edit logs).

3. Which choice best reflects the chapter’s guidance on selecting analysis methods?

Correct answer: Pick methods that respect the study design, especially repeated-measures structures
The chapter highlights choosing tests appropriate to the design (notably repeated measures) and reporting effect sizes correctly.

4. According to the chapter, why do data cleaning and structuring decisions matter beyond convenience?

Correct answer: They can change the estimand (what is being estimated), affecting the conclusions you can draw
Encoding trials, aggregation level, and outlier rules can alter what quantity you estimate, so those choices must be explicit.

5. What is a core principle for writing the results narrative in Chapter 4?

Correct answer: Include uncertainty rather than hiding it, and present results in a way meaningful for engineering decisions
The chapter stresses results narratives that acknowledge uncertainty and translate statistical findings into decision-relevant meaning.

Chapter 5: Qualitative Analysis and Mixed-Methods Triangulation

Human-centered AI research lives at the intersection of lived experience and measurable system behavior. Qualitative analysis is how you make that intersection legible: you turn messy conversations, observations, and open-ended feedback into evidence you can act on. Mixed-methods triangulation then connects that evidence to your quantitative results (task success, error rates, calibration, trust scores) and to model analysis (failure clusters, uncertainty, sensitivity to prompts, toxicity). Done well, this chapter’s workflow helps you move from an interview guide to a codebook, run a rapid thematic analysis sprint, triangulate themes with quantitative findings, and finally turn insights into model and product recommendations—while writing defensible limitations and ethics sections.

The common failure mode is “insight theater”: quotes without counts, themes without boundaries, and recommendations without links to either user outcomes or model behavior. Your goal is tighter: produce traceable claims. Each claim should be supported by (1) qualitative evidence (themes and illustrative excerpts), (2) quantitative evidence when available (who, how many, how strongly), and (3) a plausible mechanism tied to system behavior (e.g., the model’s hedging language increases perceived uncertainty, which lowers compliance for high-stakes tasks). That is the essence of triangulation.

  • Core deliverables: anonymized corpus, codebook, coded dataset, theme map, triangulation table, recommendation backlog, limitations + ethics write-up.
  • Engineering judgment points: what to record, how to anonymize, how deep to code, when to stop (saturation), and how to prioritize actions.

This chapter assumes you already have a study design and have collected data ethically. Now you will make it analyzable, credible, and useful for decisions.

Practice note (applies to each milestone in this chapter, from the interview-guide-to-codebook workflow through the limitations and ethics write-up): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Sections in this chapter
Section 5.1: Recording, transcription, and anonymization basics

Start with data hygiene. If you cannot safely store, anonymize, and retrieve your recordings and transcripts, you cannot credibly analyze them—especially for IRB-reviewed studies or industry compliance. Record with explicit consent and a clear purpose statement (“We record to accurately capture your feedback; only the research team will access it; we remove identifying details”). For remote studies, capture audio plus screen if interaction behavior matters (e.g., how participants interpret model outputs, where they hesitate). Name files with a non-identifying participant ID, not a name or email.

Transcription is not just clerical work; it is an analytic choice. Automated transcription is typically acceptable for rapid sprints, but you must spot-check for systematic errors (accents, technical terms, product names). If your research hinges on subtle language (e.g., perceived judgment in a mental health chatbot), consider human review or targeted correction. Preserve time stamps if you plan to align quotes with interaction logs, model prompts, or specific UI states.

  • Minimum anonymization: remove names, employers, specific locations, unique incidents, and any account identifiers. Replace with bracketed tags (e.g., [CITY], [HOSPITAL]).
  • AI-system-specific risk: prompts and screenshots can contain personal data. Treat them like transcripts; redact before sharing.
  • Storage: keep raw audio in restricted folders; share only redacted transcripts for coding when possible.

Practical workflow: (1) export recordings, (2) generate transcripts, (3) run an anonymization pass, (4) create a “data dictionary” describing what is redacted and why, (5) lock the raw dataset and analyze from a working copy. This supports a defensible methods section and reduces the risk that qualitative artifacts leak sensitive data into slide decks or issue trackers.

Section 5.2: Coding approaches: inductive, deductive, hybrid

Coding is how you translate interview content into a structured dataset. In human-centered AI, you typically want a hybrid approach: deductive codes anchored to your research questions plus inductive codes for unexpected behaviors and mental models. Deductive codes might reflect constructs like trust, reliance, perceived competence, or calibration. Inductive codes capture emergent patterns like “users treat the model as a search engine” or “participants re-prompt until they see confident language.”

Move from interview guide to codebook with an end-to-end workflow: start by listing each question’s intent and the decision it supports. Then draft a codebook that includes (1) code name, (2) definition, (3) inclusion/exclusion criteria, (4) examples, and (5) links to related codes. This prevents a classic mistake: codes that are labels without boundaries, which produces inconsistent application and weak conclusions.

  • Inductive coding: best for discovery, early-stage prototypes, and identifying unanticipated harms.
  • Deductive coding: best when you must map findings to predefined constructs (e.g., TAM, trust scales) or when stakeholders need continuity across studies.
  • Hybrid coding: best default in AI research because model behavior often creates new interaction phenomena.

Engineering judgment: code at the right granularity. If you code every sentence, you may drown in detail. If you code only per-interview summaries, you lose traceability. A practical compromise is coding by “meaningful segment” (a few sentences around one idea), keeping the quote that best captures the segment. Use memos to note hypotheses tying user statements to model behavior (“Hedged answers → users seek external confirmation”). Those memos become the bridge to mixed-methods triangulation later.

Section 5.3: Thematic analysis and saturation (pragmatic use)

Thematic analysis turns codes into higher-level explanations: what is happening, for whom, and under what conditions. In a rapid thematic analysis sprint, timebox your work so you can inform product and model iterations. A practical sprint looks like: Day 1—align on the research questions and codebook; Day 2—double-code a small subset and refine definitions; Day 3—complete coding; Day 4—cluster codes into candidate themes; Day 5—write theme narratives with evidence and implications.

Theme narratives should include a mechanism and a boundary. Mechanism explains why the theme occurs (e.g., “Users over-rely when the model uses authoritative tone”), and boundary states when it does not (e.g., “Less reliance when the UI displays uncertainty bands and provides citations”). Without boundaries, stakeholders may overgeneralize, and you will have trouble writing limitations.

  • Theme quality check: each theme should be coherent internally and distinct from others.
  • Evidence mix: include 2–4 strong quotes plus a short note on prevalence (e.g., “observed in 7/12 participants”).
  • Link to system context: specify which tasks, prompts, or UI states triggered the theme.

Saturation is often misunderstood as a magical number. Use it pragmatically: you are “saturated enough” when new interviews stop changing your decision-relevant conclusions. Track this explicitly with a simple log: after each interview, note whether it introduced a new code, changed a theme, or added an important boundary condition. For model-facing work, also watch for new failure modes; even one interview can surface a serious harm pathway that warrants action, regardless of saturation.

Section 5.4: Trustworthiness: bias, reflexivity, and inter-rater alignment

Qualitative credibility is not achieved by claiming objectivity; it is achieved through transparency and disciplined processes. Start with reflexivity: document what you expect to find, your relationship to the product, and any incentives (e.g., you built the feature). Write a short reflexive memo before coding and update it when your interpretation changes. This becomes part of a defensible limitations and ethics section because it shows you actively managed bias.

Inter-rater alignment is useful, but treat it as a calibration tool, not a scoreboard. In industry, compute agreement only if it helps you reduce ambiguity. A strong workflow is: (1) two coders independently code the same 10–20% of transcripts, (2) meet to discuss disagreements, (3) refine code definitions and examples, (4) repeat on a small set if needed. The goal is shared understanding, not perfect numerical agreement.

  • Common mistake: redefining codes silently midstream. Fix by versioning the codebook and noting changes.
  • Audit trail: keep memos on key decisions, theme merges, and exclusion criteria.
  • Negative cases: explicitly look for counterexamples (participants who did not experience the theme). They sharpen boundaries and prevent overclaiming.

Bias can also enter through tooling: AI-assisted summarizers may flatten nuance or hallucinate structure. If you use AI to accelerate synthesis, constrain it to mechanical tasks (formatting excerpts, clustering candidate codes) and always verify against raw quotes. Trustworthiness comes from being able to show how each claim traces back to original participant language and observed interaction behavior.

Section 5.5: Mixed-methods patterns: convergence and dissonance

Triangulation is where human outcomes meet model behavior. Build a triangulation table that lists each qualitative theme alongside relevant quantitative metrics and model analysis artifacts. For each row, state whether the evidence converges (points in the same direction), complements (adds different facets), or shows dissonance (contradiction). Dissonance is not failure—it is often the most valuable signal.
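
One lightweight way to keep the table honest is to store it as data rather than slideware; a minimal sketch with hypothetical entries:

```python
# Sketch: a triangulation table as a small dataset. Every entry below
# is hypothetical; the point is the row structure.
import pandas as pd

triangulation = pd.DataFrame([
    {
        "theme": "Users over-rely on confident tone",
        "qual_evidence": "7/12 participants; quotes T3, T9",
        "quant_evidence": "Higher reliance with assertive phrasing",
        "model_evidence": "Assertive completions show more hallucinations",
        "relationship": "converges",
    },
    {
        "theme": "Liked overall, avoided for high-stakes tasks",
        "qual_evidence": "5/12 describe quiet avoidance",
        "quant_evidence": "High satisfaction scores",
        "model_evidence": "No stakes-related differences in logs",
        "relationship": "dissonance",
    },
])
print(triangulation.to_string(index=False))
```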

Example of convergence: interviews reveal “users trust confident answers even when wrong,” and your experiment shows higher reliance scores when the model uses assertive phrasing, while model evaluation reveals those assertive completions correlate with higher hallucination rates on certain topics. That triangle supports an actionable claim about calibration and tone control. Example of dissonance: surveys show high satisfaction, but interviews show participants quietly avoid using the feature for high-stakes decisions. The resolution might be social desirability bias in surveys, or a mismatch between “liking” and “relying.”

  • Quant ↔ qual bridge: map themes to measurable constructs (trust, workload, comprehension) and specify which instrument or metric operationalizes them.
  • Model ↔ user bridge: connect themes to failure clusters (e.g., sensitive-topic refusals, citation errors, prompt sensitivity) and to UI affordances.
  • Segment analysis: check whether themes differ by expertise, risk tolerance, or domain familiarity; align with quantitative subgroup comparisons when available.

When triangulation produces dissonance, do not “average it out.” Treat it as a hypothesis generator: you may need a follow-up study, targeted logging, or a model ablation test (e.g., remove hedging, add citations, alter refusal policy) to identify the mechanism. This is how qualitative insights become testable model and product questions, not just narrative.

Section 5.6: Insight-to-action: recommendation framing and prioritization

Insights are only as useful as the decisions they enable. Convert each theme into a recommendation using a consistent template: Observation (theme + evidence), Impact (who is harmed or helped, and severity), Mechanism (tie to model behavior or UI), Recommendation (specific change), and Validation plan (how you will test improvement). This keeps you from proposing vague fixes like “improve accuracy” or “make it more transparent.”

Prioritize recommendations with a simple rubric that fits AI work: (1) user risk (severity × likelihood), (2) business or mission impact, (3) feasibility (data, engineering complexity, policy constraints), and (4) evaluability (can you measure improvement with available metrics?). Include both product changes (UI, onboarding, guardrails, explanations) and model changes (fine-tuning targets, decoding constraints, retrieval quality, refusal thresholds). Mixed-methods evidence should determine which lever you pull first.

  • Model recommendation example: “Reduce authoritative tone in low-confidence outputs; add calibrated uncertainty language + citations for retrieval-backed answers.”
  • Product recommendation example: “Add a ‘verify’ panel showing sources and a one-click way to compare alternative answers; log verification actions.”
  • Research follow-up: “Run an A/B study measuring reliance calibration and task accuracy; interview a subset to check for new confusion.”

Finally, write limitations and ethics as part of the action package, not an afterthought. Limitations should cover sampling constraints, reactivity (participants behaving differently under observation), tooling artifacts (transcription errors), and generalizability boundaries (task domain, language, expertise). Ethics should address privacy (recordings, prompts), potential downstream harms (overreliance, sensitive content), and fairness considerations (who might be disproportionately affected). A defensible section states what you did, what you could not do, and what risks remain—so stakeholders can make informed decisions about deploying or iterating the system.

Chapter milestones
  • Interview guide to codebook: end-to-end workflow
  • Run a rapid thematic analysis sprint
  • Triangulate qual themes with quant findings
  • Turn insights into model and product recommendations
  • Write a defensible limitations and ethics section
Chapter quiz

1. In Chapter 5, what best describes the goal of mixed-methods triangulation in human-centered AI research?

Correct answer: Connect qualitative themes to quantitative results and model behavior to produce traceable, mechanism-based claims
Triangulation links qual evidence to quant findings and model/system behavior so claims are actionable and traceable.

2. Which set of elements most closely matches the chapter’s definition of a well-supported (traceable) claim?

Correct answer: Qualitative evidence, quantitative evidence when available, and a plausible mechanism tied to system behavior
Chapter 5 specifies three supports: qual themes/excerpts, quant strength/coverage when available, and a system-behavior mechanism.

3. What is the chapter’s named common failure mode that the workflow is designed to prevent?

Correct answer: Insight theater: quotes without counts, themes without boundaries, recommendations without links to outcomes or model behavior
The chapter warns against “insight theater,” where outputs look persuasive but are not evidentially grounded.

4. Which deliverable best captures the chapter’s intent to connect qualitative themes to quantitative and model-analysis evidence?

Correct answer: Triangulation table
A triangulation table is explicitly meant to align qual themes with quantitative results and model/system behavior evidence.

5. Which decision is presented as an engineering judgment point during qualitative analysis and triangulation?

Correct answer: What to record and how to anonymize the corpus
The chapter lists judgment calls such as what to record, how to anonymize, how deep to code, when to stop (saturation), and prioritization.

Chapter 6: Analyzing Model Behavior to Complement User Evidence

User studies tell you what people experience: confusion, delight, mistrust, overreliance, abandonment. Model behavior analysis tells you why those experiences happen and what changes are likely to fix them. Human-centered AI research needs both. If you only run interviews and surveys, you may misattribute a problem to “UX” when it is actually a systematic model failure. If you only run benchmarks, you may optimize metrics that don’t matter for real decisions, workflows, and risks.

This chapter gives you a practical workflow for combining user evidence with model behavior evidence. You will define failure modes and an evaluation rubric, build a minimal test set aligned to user tasks, run behavior analysis to surface trade-offs, and then turn results into a portfolio case study and an interview narrative. Finally, you will create a 90-day transition plan that mirrors how human-centered AI teams operate: fast, ethical, and decision-focused.

The key mindset shift for psychologists: treat the model as a participant whose behavior can be measured repeatedly under controlled conditions. Your familiar tools—operational definitions, coding manuals, inter-rater agreement, and validity checks—translate directly. The difference is that the “participant” changes with prompts, temperature, retrieval sources, and guardrails, and you must document those conditions as carefully as you would document a lab protocol.

Practice note (applies to each milestone in this chapter, from defining failure modes through the 90-day transition plan): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Sections in this chapter
Section 6.1: Model behavior basics: why users see “errors” differently

In user research, an “error” is rarely just a wrong answer. It is a breakdown in a goal-directed interaction: the user can’t complete a task, becomes less confident, wastes time, or makes a harmful decision. Two outputs can be equally inaccurate yet feel very different to users. A fluent hallucination can increase overtrust; a hesitant but partially correct response can decrease trust even when it is safer. This is why model behavior analysis must be anchored to user tasks and outcomes, not abstract correctness.

Start by mapping the user journey to model touchpoints: where the model proposes, summarizes, classifies, or instructs. For each touchpoint, write a task-level success definition (e.g., “user can draft a complaint email that includes required facts and tone constraints”). Then decide what “failure” means in that context (e.g., “fabricates order numbers,” “omits legal disclaimers,” “uses hostile tone”). This is your bridge from psychology-style operationalization to engineering-style evaluation.

Common mistake: treating any model mistake as equal severity. In practice, severity depends on user context (stakes, expertise, time pressure) and failure visibility (obvious vs subtle). Build a simple severity scale (e.g., 0–3) and include “detectability” as a separate dimension: an undetectable error is usually more dangerous than an obvious one.
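
A minimal sketch of a coding record that keeps severity and detectability as separate dimensions; the field names and anchors are hypothetical:

```python
# Sketch: one coded failure observation. Severity and detectability are
# separate fields because subtle errors are usually more dangerous.
from dataclasses import dataclass

@dataclass
class FailureRecord:
    case_id: str
    failure_mode: str         # e.g., "fabricated citation"
    severity: int             # 0 = none ... 3 = safety-critical
    detectable_by_user: bool  # could a typical user notice the error?
    notes: str = ""

example = FailureRecord(
    case_id="case_017",
    failure_mode="fabricated citation",
    severity=2,
    detectable_by_user=False,
    notes="Plausible journal name; user unlikely to check.",
)
```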

  • Practical outcome: a clear list of user tasks, model touchpoints, and failure definitions you can test repeatedly.
  • Engineering judgment: prioritize failures that change user decisions, not just outputs.

This foundation sets you up to define failure modes precisely enough to measure, yet in language that stakeholders understand.

Section 6.2: Error taxonomies: hallucination, bias, brittleness, toxicity

A useful taxonomy turns scattered “bad examples” into categories you can quantify, compare, and mitigate. Keep the taxonomy small at first; you can refine it after you see real outputs. For human-centered AI work, four categories recur across products and studies: hallucination, bias, brittleness, and toxicity. Each has different user impacts and mitigation levers.

Hallucination is content presented as fact without support. In user terms, it causes misinformation, wasted effort, and misplaced trust. Subtypes that matter: fabricated citations, invented personal data, and incorrect procedural steps. Bias includes differential quality or harmful stereotypes across groups; users experience it as unfairness, exclusion, or reputational harm. Brittleness is sensitivity to phrasing, formatting, or edge cases; users experience it as unpredictability and “walking on eggshells.” Toxicity includes hateful, harassing, sexual, or self-harm-related content; users experience it as harm and safety risk, with organizational consequences.

Translate each category into observable signals you can code. For example, hallucination might be “claims a policy exists when it does not,” bias might be “assigns higher risk scores for equivalent profiles,” brittleness might be “fails when date format changes,” toxicity might be “uses slurs or encourages self-harm.” Add a context tag tied to user tasks (customer support, medical education, hiring screening). The same output can change category or severity depending on the task context.

  • Practical rubric tip: include “harm pathway” notes: how the error could lead to a user or organizational harm.
  • Common mistake: mixing categories with causes (e.g., “bad prompt”)—keep the taxonomy about observable behavior first, then diagnose causes separately.

This taxonomy is the backbone of your evaluation rubric and the language you will use in portfolio artifacts and stakeholder memos.

Section 6.3: Evaluation design: prompts, rubrics, and human rating protocols

To analyze model behavior reliably, you need three ingredients: (1) a minimal test set aligned to user tasks, (2) prompts that reflect realistic use, and (3) a rubric with a consistent human rating protocol. Think of this as building a small, IRB-like measurement instrument for model outputs: standardized inputs, standardized scoring, documented procedures.

Minimal test set: begin with 20–60 cases per key task, not hundreds. Choose cases that represent the task distribution and the risky corners: common intents, rare but high-severity scenarios, and known user pain points from interviews. If you have user transcripts, extract prompts users actually wrote (after privacy review). If not, synthesize prompts that mirror real constraints (tone, length, required fields). Document each case with a short “why this matters” note tied to user outcomes.
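
A minimal sketch of one test-case record in that spirit; all field names and values are hypothetical:

```python
# Sketch: a task-aligned test case with a "why this matters" note tied
# to user outcomes. Everything here is a hypothetical example.
test_case = {
    "case_id": "support_014",
    "task": "customer_support_refund",
    "prompt": "Hi, I was double-charged for order [ORDER_ID]...",
    "expected_behaviors": ["asks for verification", "cites refund policy"],
    "forbidden_behaviors": ["fabricates order details", "invents timelines"],
    "severity_if_failed": 3,
    "why_this_matters": "Refund errors drove the most escalations in interviews.",
}
```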

Prompt design: specify system instructions, user messages, retrieval sources (if any), and decoding settings. Treat changes like experimental manipulations: you cannot interpret improvements if you changed multiple things at once. A common workflow is A/B prompts: baseline prompt vs improved prompt, evaluated on the same test set. Another is stress testing: paraphrases, formatting changes, adversarial wording, or multilingual variants to measure brittleness.

Rubrics and rating protocols: build a rubric that includes both task success and failure modes. Use anchored scales (e.g., 1–5 with examples) rather than vague labels. Train raters with a short codebook and calibration set; then measure agreement (percent agreement or a kappa statistic) on a subset. When disagreement is high, refine definitions, not just rater behavior.
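
A minimal sketch of the agreement check using scikit-learn's cohen_kappa_score, with hypothetical rubric labels:

```python
# Sketch: calibration check on a shared subset coded by two raters.
from sklearn.metrics import cohen_kappa_score

rater_a = ["success", "hallucination", "success", "refusal", "hallucination"]
rater_b = ["success", "hallucination", "success", "hallucination", "hallucination"]

kappa = cohen_kappa_score(rater_a, rater_b)
print(f"Cohen's kappa = {kappa:.2f}")  # low? refine definitions, then re-rate
```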

  • Practical outcome: a repeatable “evaluation harness” you can run weekly as the model or prompt changes.
  • Common mistake: rating without blinding. Whenever possible, blind raters to condition (baseline vs new prompt) to reduce expectancy effects.

This section is where psychology measurement discipline becomes a competitive advantage in AI research teams.

Section 6.4: Metrics vs human judgment: alignment and disagreement

Teams often want a single number: accuracy, win-rate, toxicity score, or “hallucination rate.” Metrics are useful, but only when they align with user outcomes and the rubric. Your job is to identify when metrics are trustworthy proxies and when they mislead. A model can improve a benchmark metric while making users less successful—especially when the metric ignores tone, clarity, uncertainty, or compliance requirements.

Use a two-layer approach. First, compute simple rubric-based rates: task success rate, severe failure rate, hallucination count per response, refusal appropriateness. Add confidence intervals so stakeholders understand uncertainty, especially with small test sets. Second, compare these to automated metrics (e.g., similarity scores, classifier-based toxicity) and look for systematic disagreements. Disagreement is not “noise”; it is diagnostic. For example, a toxicity classifier might flag reclaimed language in supportive contexts, while humans rate it as safe. Or a similarity metric might penalize a correct answer phrased differently, while users find it clearer.
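
A minimal sketch for the first layer, using statsmodels' proportion_confint with hypothetical counts; Wilson intervals behave better than the normal approximation on small test sets:

```python
# Sketch: a rubric-based rate with a confidence interval.
from statsmodels.stats.proportion import proportion_confint

severe_failures, n_cases = 4, 48  # hypothetical rubric counts
low, high = proportion_confint(severe_failures, n_cases, method="wilson")
print(f"severe failure rate = {severe_failures / n_cases:.1%}, "
      f"95% CI [{low:.1%}, {high:.1%}]")
```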

When metrics disagree with human judgment, ask: is the metric missing a construct users care about (construct undercoverage)? Or is it measuring something else (construct-irrelevant variance)? This is classic validity reasoning from psychology. Then decide what to do: revise the rubric, change the metric, add human review for high-stakes cases, or segment analysis by scenario type.

  • Practical outcome: a short “alignment table” showing which automated metrics are acceptable proxies for which rubric dimensions.
  • Common mistake: averaging everything. Segment by failure mode and severity; a small increase in rare severe failures can outweigh many minor improvements.

This analysis helps you interpret trade-offs honestly: safer models may refuse more, more helpful models may risk hallucination, and you need to quantify which trade-off best supports user goals.

Section 6.5: Connecting user outcomes to model changes and guardrails

The goal is not to “prove the model is good,” but to decide what to change next. Connect user outcomes (time saved, decision quality, trust calibration, error recovery) to concrete levers: prompt changes, retrieval grounding, tool use, content filters, uncertainty messaging, or UI guardrails. This is where model behavior evidence complements user evidence: user studies reveal where people struggle; behavior analysis reveals the mechanism and how often it happens.

Create a simple causal chain for each prioritized failure mode: interaction context → model behavior → user interpretation → downstream action → harm or success. Then propose mitigations at multiple layers. Example: hallucinated policy details in customer support. Model-side mitigation: require citations from a policy KB and refuse if no source. UX mitigation: show quoted sources and a “verify in portal” link. Process mitigation: human review for refunds above a threshold. Your evaluation should then test both model behavior (hallucination rate drops) and user outcome (users resolve issues faster without increased confusion).

When you deploy guardrails, watch for displacement effects. A stricter safety filter can increase refusals, which can increase user workarounds or abandonment. Your behavior analysis should include “good refusal” vs “bad refusal,” not just refusal rate. Similarly, adding uncertainty language can reduce overtrust but may also reduce adoption; test it against real tasks.
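
A small sketch of that refusal triage, assuming each refusal has been hand-labeled as good (the request was genuinely unsafe or out of scope) or bad (a safe request was blocked); the conditions and counts are hypothetical:

    # Refusal labels per condition, out of the same 100 test requests.
    refusals = {
        "baseline":      {"good": 4, "bad": 2},
        "strict_filter": {"good": 6, "bad": 11},
    }
    total_requests = 100

    for condition, c in refusals.items():
        refusal_rate = (c["good"] + c["bad"]) / total_requests
        bad_rate = c["bad"] / total_requests
        print(f"{condition}: refusal rate {refusal_rate:.0%}, bad-refusal rate {bad_rate:.0%}")
    # A rising bad-refusal rate is the displacement signal: expect workarounds or abandonment.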

  • Practical outcome: a prioritized backlog of changes with expected impact, risks, and how you will measure success.
  • Common mistake: changing the model without updating the evaluation harness; you lose the ability to attribute improvements.

This is also where ethical and IRB-ready thinking matters: if a mitigation shifts risk onto users (e.g., “verify elsewhere”), be explicit and evaluate whether that burden is acceptable for the target population.

Section 6.6: Communicating results: exec-ready memos and portfolio artifacts

Your analysis only matters if it changes decisions. Package your work in two formats: an exec-ready memo for stakeholders and a portfolio case study for your career transition. Both should tell a coherent story: the user problem, the model behavior evidence, the trade-offs, and the decision you recommend.

Exec-ready memo structure (1–2 pages): (1) decision needed, (2) user impact summary, (3) evaluation setup (test set, prompts, rubric, rating protocol), (4) key results with uncertainty (rates and confidence intervals), (5) risks and failure modes (severity-focused), (6) recommendation and next experiment. Use a small table: baseline vs new prompt/guardrail, task success, severe failures, refusals, and top failure modes. Keep appendices for examples and rater notes.
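
A quick sketch that renders that comparison table from result dictionaries; the numbers are illustrative placeholders for your own measured rates and intervals:

    rows = [
        {"condition": "baseline",      "task_success": "62%", "severe_failures": "8%", "refusals": "3%"},
        {"condition": "new guardrail", "task_success": "71%", "severe_failures": "3%", "refusals": "9%"},
    ]

    header = f"{'condition':<15}{'task success':>14}{'severe failures':>17}{'refusals':>10}"
    print(header)
    print("-" * len(header))
    for r in rows:
        print(f"{r['condition']:<15}{r['task_success']:>14}{r['severe_failures']:>17}{r['refusals']:>10}")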

Portfolio case study structure: show your workflow discipline. Include: problem statement tied to human outcomes, your failure mode taxonomy, the minimal task-aligned test set design, excerpts of the rubric, inter-rater agreement approach, and before/after examples. Add one chart that shows a trade-off (e.g., hallucinations down, refusals up) and explain how you chose a point on the curve based on user context. End with “what I’d do next” to demonstrate research thinking.
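
A minimal matplotlib sketch of such a trade-off chart, using illustrative rates from a hypothetical guardrail-strictness sweep (settings A through D):

    import matplotlib.pyplot as plt

    settings = ["A", "B", "C", "D"]             # increasing guardrail strictness
    refusal_rate = [0.02, 0.05, 0.11, 0.24]     # toy numbers
    hallucination_rate = [0.30, 0.18, 0.10, 0.06]

    fig, ax = plt.subplots()
    ax.plot(refusal_rate, hallucination_rate, marker="o")
    for s, x, y in zip(settings, refusal_rate, hallucination_rate):
        ax.annotate(s, (x, y), textcoords="offset points", xytext=(6, 4))
    ax.set_xlabel("Refusal rate")
    ax.set_ylabel("Hallucination rate")
    ax.set_title("Guardrail strictness trade-off (illustrative)")
    fig.savefig("tradeoff.png", dpi=150)  # or plt.show()

Annotating the setting you chose, and stating the user context that justified it, is what turns this from a metrics plot into a research argument.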

Interview story: frame yourself as the person who connects qualitative signals to measurable behavior changes. Use a STAR-like narrative: user pain point → hypothesis about model behavior → evaluation harness → results → decision and iteration.

90-day transition plan:
  • Days 1–30: pick one domain, build a small test set and rubric, run baseline evaluations, and write one memo.
  • Days 31–60: iterate prompts and guardrails, add rater calibration, and run a lightweight user study to validate outcomes.
  • Days 61–90: publish the portfolio case study, request informational interviews, and contribute to an open-source eval harness or community benchmark.
  • Networking target: two conversations per week with researchers or PMs; bring your memo and ask for feedback on decision usefulness.

  • Common mistake: over-indexing on jargon. Executives want risk, user impact, and a clear recommendation grounded in evidence.

By the end of this chapter, you should be able to demonstrate a core human-centered AI competency: translating messy user experiences into testable model behaviors, then translating model evidence back into product decisions.

Chapter milestones
  • Define failure modes and an evaluation rubric
  • Build a minimal test set aligned to user tasks
  • Run behavior analysis and interpret trade-offs
  • Create a portfolio case study and interview story
  • 90-day transition plan: skills, projects, and networking
Chapter quiz

1. Why does Chapter 6 argue that human-centered AI research needs both user studies and model behavior analysis?

Answer: User studies reveal what people experience, while model behavior analysis explains why it happens and what changes may fix it. User studies capture experiences (e.g., mistrust or overreliance); behavior analysis helps identify systematic model causes and effective fixes.

2. What is a key risk of relying only on interviews and surveys when evaluating an AI system?

Answer: You may misattribute a systematic model failure to “UX” issues. Without behavior analysis, you can blame the interface when the root cause is a repeatable model failure mode.

3. Which workflow step best ensures your evaluation reflects real user needs and tasks?

Answer: Build a minimal test set aligned to user tasks. A task-aligned minimal test set ties model evaluation directly to real decisions, workflows, and risks.

4. What does the chapter mean by treating the model as a participant?

Answer: Measure the model repeatedly under controlled conditions, using operational definitions and coding practices. The model is evaluated like a participant: controlled conditions, repeatability, and structured measurement methods.

5. Which detail must be documented as carefully as a lab protocol when analyzing model behavior?

Answer: Conditions such as prompts, temperature, retrieval sources, and guardrails. Because the “participant” changes with prompts and system settings, those conditions must be recorded to interpret results and trade-offs.