PM to LLM Solutions Architect: Requirements to Reference Builds

Career Transitions Into AI — Intermediate

Translate PM clarity into LLM architectures that ship and scale.

Intermediate · llm · solutions-architecture · rag · prompt-engineering

Course Overview

This course is a short technical book for project managers who want to transition into an LLM Solutions Architect role. You’ll learn how to move from clear, testable requirements to a production-ready architecture and a reference implementation plan—without getting lost in research rabbit holes. The focus is practical: patterns you can reuse, artifacts you can ship, and decision-making you can defend in design reviews.

Who It’s For

If you already know how to run discovery, manage stakeholders, and translate needs into outcomes, you’re halfway there. The missing half is learning how LLM systems behave in real deployments: where they fail, what they cost, how to evaluate them, and how to put guardrails around them. This course bridges that gap with an architect’s workflow that still feels familiar to a strong PM.

What You’ll Build (Conceptually)

Across six chapters, you’ll produce a coherent set of architecture and delivery artifacts for a single end-to-end LLM feature—starting with a use-case brief and finishing with a reference implementation blueprint. You’ll be able to explain:

  • Why a specific LLM pattern (prompting, tool use, RAG, fine-tuning) fits your use case
  • How data moves through ingestion, indexing, retrieval, orchestration, and APIs
  • How you’ll measure quality and safety before and after launch
  • How security, privacy, and compliance constraints shape the architecture
  • How a reference implementation should be structured so teams can extend it

How the Chapters Progress

Chapter 1 reframes PM skills into LLM-architecture inputs: constraints, non-goals, and traceable acceptance criteria. Chapter 2 gives you a pattern-selection toolkit to avoid overbuilding—choosing between plain prompting, tool/function calling, retrieval-augmented generation (RAG), and fine-tuning with explicit tradeoffs. Chapter 3 turns those decisions into an end-to-end system design: retrieval pipelines, orchestration flows, guardrails, and integration contracts.

Chapter 4 makes the system measurable by defining datasets, metrics, and release gates so you can detect regressions and control risk. Chapter 5 adds “production reality”: threat modeling, injection defenses, privacy handling, auditability, and reliability practices. Chapter 6 ties it all together into a reference implementation plan and portfolio package so you can present your work credibly in interviews and internal role transitions.

Outcomes You Can Use Immediately

By the end, you’ll have a repeatable method for scoping and designing LLM solutions. You’ll know what to document, how to justify decisions, and how to collaborate with engineering, security, and data teams. The deliverables are intentionally aligned with what hiring managers and architecture review boards expect: clear diagrams, crisp tradeoffs, and an evaluation plan that proves the system works.

Get Started

Ready to pivot from managing projects to designing AI systems? Register for free to begin, or browse all courses to compare learning paths.

What You Will Learn

  • Convert business goals into LLM-ready requirements, constraints, and acceptance criteria
  • Choose the right pattern (prompting, tools, RAG, fine-tuning) with clear tradeoffs
  • Design end-to-end LLM architectures: data, retrieval, orchestration, and APIs
  • Create an evaluation strategy with gold sets, metrics, and release gates
  • Build a reference implementation plan: repo structure, interfaces, and test scaffolds
  • Apply security, privacy, and compliance controls for enterprise LLM deployments
  • Estimate cost and latency, and design observability for production operations
  • Communicate architecture decisions with artifacts stakeholders understand

Requirements

  • Comfort writing product requirements or user stories
  • Basic understanding of APIs and web applications
  • Familiarity with spreadsheets and simple metrics
  • No prior machine learning experience required

Chapter 1: From PM Thinking to LLM Architecture Thinking

  • Map stakeholders, users, and job-to-be-done into an AI use case brief
  • Define success metrics and non-goals for an LLM feature
  • Identify data readiness and operational constraints early
  • Draft an architecture one-pager that aligns business and engineering
  • Create a delivery plan with risks, dependencies, and milestones

Chapter 2: Model & Pattern Selection (Prompting, Tools, RAG, Fine-Tune)

  • Choose an interaction pattern for the use case and justify it
  • Select a baseline model and deployment option with constraints
  • Design prompts and tool schemas for controllable behavior
  • Decide whether retrieval or fine-tuning is needed using evidence
  • Document tradeoffs and get stakeholder sign-off

Chapter 3: Designing the LLM System End-to-End

  • Design the data and retrieval pipeline with governance controls
  • Create the orchestration flow for multi-step tasks and fallbacks
  • Define API contracts, error handling, and versioning strategy
  • Plan caching, batching, and token budgets to hit cost/latency targets
  • Produce a reference architecture diagram and integration checklist

Chapter 4: Evaluation, Testing, and Release Gates

  • Build a gold dataset and define what “good” means per scenario
  • Select metrics and judge methods for quality and safety
  • Design offline evaluation and online A/B or canary strategies
  • Create test plans for prompts, tools, retrieval, and regressions
  • Set release gates and escalation procedures for failures

Chapter 5: Security, Privacy, Compliance, and Reliability by Design

  • Threat-model the system and identify highest-risk paths
  • Design access controls, secrets handling, and data minimization
  • Implement defenses for injection and untrusted content
  • Plan incident response, auditability, and compliance evidence
  • Define SLOs, monitoring, and reliability practices for production

Chapter 6: From Blueprint to Reference Implementation (and Your Portfolio)

  • Define the minimal reference implementation scope and interfaces
  • Design a repo layout and scaffolding for rapid team adoption
  • Create runbooks for deployment, monitoring, and on-call readiness
  • Estimate delivery: milestones, staffing, and cost of ownership
  • Package your work into a portfolio-ready case study

Sofia Chen

LLM Solutions Architect & AI Product Delivery Lead

Sofia Chen designs and delivers production LLM systems across support, knowledge, and workflow automation. She specializes in translating product requirements into secure architectures, evaluation plans, and reference implementations teams can extend. Her background spans program management, cloud integration, and applied NLP.

Chapter 1: From PM Thinking to LLM Architecture Thinking

Moving from Product Management to LLM Solutions Architecture is not a switch from “business” to “technical.” It is a shift in how you translate business intent into systems that behave consistently under real operational constraints. A PM is often rewarded for clarity of prioritization and narrative alignment; an LLM solutions architect is rewarded for measurable behavior, controlled failure modes, and designs that can be implemented, evaluated, and operated safely.

In traditional software, requirements tend to map cleanly to deterministic logic: if inputs match conditions, outputs follow. In LLM systems, outputs are probabilistic, context-sensitive, and heavily influenced by data availability and prompt construction. That changes the work: you still start with stakeholders and a job-to-be-done, but you must convert that into LLM-ready requirements, constraints, and acceptance criteria that include evaluation plans and safety guardrails.

This chapter gives you a practical workflow for turning an AI use case idea into an implementable architecture direction. You will map stakeholders and users into an AI use case brief, define success metrics and non-goals, identify data readiness and operational constraints early, draft an architecture one-pager that aligns business and engineering, and create a delivery plan that acknowledges risks, dependencies, and milestones. Treat these artifacts as a traceable chain: each decision must be explainable from business goal to technical pattern, and each technical constraint must be reflected back into scope, expectations, and release gates.

  • Outcome mindset: define “good behavior,” not just “feature shipped.”
  • Constraint mindset: budget, latency, privacy, and reliability are design inputs, not afterthoughts.
  • Evidence mindset: evaluation datasets and metrics are part of requirements, not post-launch analytics.

By the end of the chapter, you should be able to take a vague request like “add AI to support” and produce a bounded, testable architecture direction with clear tradeoffs among prompting, tool use, retrieval (RAG), and fine-tuning.

Practice note for this chapter's milestones (mapping stakeholders, users, and the job-to-be-done into an AI use case brief; defining success metrics and non-goals; identifying data readiness and operational constraints; drafting the architecture one-pager; creating the delivery plan): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Section 1.1: The LLM solutions architect role vs PM responsibilities

A PM typically owns “what” and “why”: customer problems, success metrics, prioritization, and stakeholder alignment. An LLM solutions architect owns “how it will work end-to-end” and “how we will know it works safely.” The overlap is real—both roles require crisp problem framing—but the architect must make early commitments about system boundaries, interfaces, and operational behavior.

The biggest mindset change is that you design a system, not a model. An LLM feature is rarely “call the model and show the answer.” It is orchestration (prompting, tool calls, retrieval), data flows (indexing, permissions, freshness), and controls (policy enforcement, redaction, logging, evaluation gates). A PM may accept ambiguity while exploring; the architect must turn ambiguity into explicit assumptions, then test and revise them.

Practically, start each initiative with an AI use case brief that maps stakeholders and users to jobs-to-be-done. Name at least: (1) the primary user, (2) the user’s workflow step where AI is inserted, (3) the decision the user will make based on output, and (4) the cost of being wrong. Include stakeholders beyond the obvious: Security, Legal/Privacy, Support Ops, and Data Engineering. Common mistake: writing a “feature description” that ignores who will trust, operate, and audit the system. If nobody can own the failure modes, the feature is not ready.
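The four elements of the brief can be captured as a small record so none is skipped during intake. This is a minimal sketch; the field names and example values are illustrative, not a prescribed template.

```python
from dataclasses import dataclass, field

@dataclass
class UseCaseBrief:
    """One-page AI use case brief: who uses it, where, and what failure costs."""
    primary_user: str
    workflow_step: str          # where in the workflow the AI output is inserted
    decision_supported: str     # what the user decides based on the output
    cost_of_wrong: str          # consequence of an incorrect output
    stakeholders: list[str] = field(default_factory=list)

brief = UseCaseBrief(
    primary_user="Tier-1 support agent",
    workflow_step="drafting a reply to a password-reset ticket",
    decision_supported="send, edit, or escalate the drafted reply",
    cost_of_wrong="wrong instructions sent to a customer; CSAT drop",
    stakeholders=["Security", "Legal/Privacy", "Support Ops", "Data Engineering"],
)
```

If any field is hard to fill in, that is usually a sign the failure modes have no owner yet.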

Section 1.2: Use-case qualification: value, feasibility, and risk

Not every “AI idea” deserves an LLM. Qualification is your first architectural decision: decide whether the use case has enough value, is feasible with available data and constraints, and has acceptable risk. A simple triage rubric helps you avoid building impressive demos that collapse in production.

Value: quantify the lever. Is it time saved per task, deflection rate, conversion uplift, or reduced error? Tie the AI output to a measurable business outcome and define success metrics early. For example: “Reduce average handle time by 20% on password reset tickets while maintaining CSAT ≥ 4.5.” Then define non-goals to prevent scope creep: “Not replacing human agents; not generating policy decisions; not handling payment disputes.” Non-goals are architectural constraints because they limit what data you need, what tools you expose, and what safety controls you must implement.

Feasibility: ask what must be true for the system to work. Does the user have stable, well-defined inputs? Do you have a reliable source of truth? Can the model access what it needs within policy? If the “answer” depends on scattered tribal knowledge, you likely need retrieval and knowledge curation, not just prompt engineering.

Risk: treat “cost of wrong” as a first-class requirement. High-stakes domains (finance, healthcare, HR, legal) require tighter controls: explicit citations, constrained actions, human review, and strong audit logs. Common mistake: evaluating feasibility only on demo quality, ignoring distribution shift, adversarial prompts, and permission boundaries.

Section 1.3: Functional vs non-functional requirements for LLM systems

PMs often write functional requirements (features and behaviors). LLM systems demand equally explicit non-functional requirements (NFRs): latency, cost, reliability, privacy, safety, and observability. Because LLM behavior is probabilistic, acceptance criteria must describe both expected outputs and acceptable error handling.

Functional requirements should specify the interaction contract. Examples: supported input types (text, files), response format (JSON schema vs prose), and tool boundaries (“may create a draft ticket but cannot close tickets”). Include grounding requirements: “Answers must cite at least one internal source for policy questions” or “If confidence is low, ask a clarifying question.” These are not UX niceties; they shape whether you choose prompting-only, RAG, tool use, or fine-tuning.

NFRs should be measurable. Define targets such as p95 latency, maximum cost per request, uptime objectives, and limits on data retention. Define safety constraints: prohibited content, PII handling, and rules for escalation. Convert them into acceptance criteria and release gates, not aspirational statements. For instance: “On the gold set, hallucination rate (unsupported claims) ≤ 3%” or “PII leakage incidents = 0 in red-team suite.”
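One way to turn such thresholds into an enforceable gate is a simple check over measured metrics. The 3% hallucination and zero-PII limits mirror the examples above; the metric names and the latency limit are illustrative assumptions.

```python
def passes_release_gates(results: dict[str, float], gates: dict[str, float]) -> bool:
    """Return True only if every measured metric is within its gate limit.

    `results` holds values measured on the gold set / red-team suite;
    `gates` holds the maximum acceptable value per metric.
    """
    return all(results[m] <= limit for m, limit in gates.items())

gates = {
    "hallucination_rate": 0.03,  # unsupported claims on the gold set
    "pii_leak_count": 0,         # incidents in the red-team suite
    "p95_latency_s": 4.0,        # illustrative latency ceiling
}
measured = {"hallucination_rate": 0.021, "pii_leak_count": 0, "p95_latency_s": 3.2}
```

The point is that "working" becomes a computable predicate, not a judgment call made at launch time.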

Common mistake: treating evaluation as a later phase. For LLM features, the evaluation strategy is part of requirements because it defines what “working” means. If you cannot describe how you will measure quality and safety, you cannot claim the requirement is complete.

Section 1.4: Data inventory: sources, ownership, freshness, and access

Data readiness is the hidden determinant of whether an LLM project succeeds. Your job is to inventory the data that will ground the system and identify constraints before anyone writes orchestration code. Start with a data inventory table: source name, system of record, owner, access method, update frequency, sensitivity classification, and permission model.
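The inventory can start as plain records, with a quick check that flags sources not yet production-ready. The rows and the readiness rule below are illustrative assumptions, not a fixed schema.

```python
# Illustrative data inventory rows; values are placeholders.
inventory = [
    {"source": "Policy wiki", "system_of_record": "Confluence", "owner": "Legal",
     "access": "REST connector", "update_frequency": "weekly",
     "sensitivity": "internal", "permission_model": "space-level ACLs"},
    {"source": "Ticket history", "system_of_record": "Zendesk", "owner": "",
     "access": "export API", "update_frequency": "hourly",
     "sensitivity": "confidential", "permission_model": "role-based"},
]

def unready_sources(rows):
    """Sources missing an owner or a permission model are not production-ready."""
    return [r["source"] for r in rows
            if not r.get("owner") or not r.get("permission_model")]
```

Running the check on the rows above flags "Ticket history" because no owner is recorded, which is exactly the conversation you want before orchestration code exists.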

For RAG-oriented features, you need more than “we have documents.” You need to know which documents are authoritative, how often they change, and whether users have different entitlements. Freshness matters: if policies change weekly, your indexing pipeline and cache invalidation strategy are part of the architecture. Ownership matters: if Legal owns the policy docs, you need a content lifecycle (approval, versioning, deprecation) or your outputs will drift from truth.

Access matters at two levels: system access (APIs, connectors, rate limits) and authorization (what each user is allowed to see). A common enterprise failure mode is retrieving correct information that the current user should not have access to. Plan for permission-aware retrieval (e.g., document-level ACL filters) and for logging that supports audit requirements.
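A minimal sketch of permission-aware filtering, assuming each retrieved chunk carries an `allowed_groups` set derived from its source system's ACL. The filter runs after retrieval but before prompt assembly; the field names are hypothetical.

```python
def filter_by_acl(retrieved_docs, user_groups):
    """Drop retrieved chunks the current user is not entitled to see."""
    return [d for d in retrieved_docs
            if d["allowed_groups"] & set(user_groups)]

docs = [
    {"id": "policy-42", "allowed_groups": {"support", "legal"}},
    {"id": "hr-7", "allowed_groups": {"hr"}},
]
visible = filter_by_acl(docs, ["support"])
```

In production you would push the ACL filter into the retrieval query itself where the store supports it, but an explicit post-filter like this is a useful defense-in-depth layer and a natural place to log for audits.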

Common mistake: assuming a single “knowledge base” exists. In reality, the sources of truth are fragmented: ticket histories, CRM notes, wikis, policy PDFs, and product telemetry. Your inventory should identify what is in scope now versus later, aligned with your non-goals and delivery milestones.

Section 1.5: Latency, cost, and reliability constraints as design inputs

LLM architecture choices are often decided by three constraints: latency (how fast), cost (how much per interaction), and reliability (how consistently it works). Treat these as inputs during requirements and use-case qualification, not as optimization work after the prototype.

Latency: identify the user’s tolerance in the real workflow. A support agent tool can often tolerate a few seconds; an autocomplete feature cannot. Latency budgets should be broken down: retrieval time, model inference time, tool call time, and post-processing. This breakdown influences whether you can afford multi-step reasoning, multiple tool calls, or large-context prompts.
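One way to make the breakdown concrete is a per-component budget that must sum within the end-to-end target. The numbers below are illustrative assumptions for a 4-second agent-assist target, kept in integer milliseconds.

```python
# Illustrative p95 latency budget; every component must fit under the target.
TARGET_MS = 4000
budget_ms = {
    "retrieval": 400,
    "model_inference": 2500,
    "tool_calls": 800,
    "post_processing": 300,
}
assert sum(budget_ms.values()) <= TARGET_MS
```

The budget makes tradeoffs explicit: if a multi-step flow needs a second model call, the inference line roughly doubles and some other component must shrink or the target must move.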

Cost: estimate cost per request and cost per successful outcome. LLM calls are variable, and retrieval adds infrastructure costs. Your architect’s job is to match the pattern to the economics: prompting-only for low-risk, low-context tasks; tool use for deterministic actions; RAG when correctness depends on dynamic internal knowledge; fine-tuning only when you have stable, high-volume needs where the model must learn consistent style or classification boundaries. Common mistake: defaulting to fine-tuning to “make it smarter,” when retrieval and better prompts would provide better controllability and lower maintenance.
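A back-of-envelope estimator for the distinction between cost per request and cost per successful outcome. Token counts, prices, and the success rate below are placeholders, not any vendor's actual rates.

```python
def cost_per_success(input_tokens, output_tokens,
                     in_price_per_1k, out_price_per_1k,
                     requests, successes):
    """Cost per *successful* outcome, not just per request."""
    per_request = (input_tokens / 1000) * in_price_per_1k \
                + (output_tokens / 1000) * out_price_per_1k
    return per_request * requests / successes

# 10k requests at a 70% task-success rate (illustrative figures)
c = cost_per_success(1500, 400, 0.0005, 0.0015, 10_000, 7_000)
```

Dividing by successes rather than requests is what surfaces the real economics: a cheaper pattern with a lower success rate can cost more per outcome than a pricier one.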

Reliability: define what happens when components fail (model timeout, vector store outage, tool API errors). Build graceful degradation into requirements: cached answers, human fallback, or “draft-only” mode. Reliability targets should include not just uptime but behavioral reliability: consistency, reduced hallucinations, and safe refusals in prohibited cases.

Section 1.6: Core artifacts: PRD-to-architecture traceability

The fastest way to align business and engineering is to maintain traceability from the PRD to architecture and delivery. Your goal is not more documentation; it is a short set of artifacts that make decisions testable and reversible.

Artifact 1: the AI use case brief. This is the one-page distillation: stakeholder map, primary user and job-to-be-done, value hypothesis, success metrics, non-goals, and risk rating. This brief prevents the team from optimizing for impressive outputs that do not move the business metric.

Artifact 2: the architecture one-pager. Keep it concrete: a diagram of components (UI, API, orchestrator, retrieval/indexing, tools, model provider), data flows, trust boundaries, and key constraints (latency budget, privacy rules, retention). Add “pattern selection” with tradeoffs: why you chose prompting vs tools vs RAG vs fine-tuning, and what would cause you to change course. Common mistake: drawing boxes without listing the decisions those boxes imply (permissions, caching, evaluation hooks, and observability).

Artifact 3: an evaluation and release gate plan. Define gold sets (representative tasks), metrics (task success, groundedness, safety, latency, cost), and thresholds for moving from prototype to pilot to GA. This is how you convert probabilistic behavior into engineering acceptance criteria.
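A sketch of staged gates encoded as data, assuming a build promotes only when it meets the next stage's minimums on the gold set. The stage names and thresholds are illustrative.

```python
# Illustrative minimum thresholds per release stage.
STAGE_GATES = {
    "pilot": {"task_success": 0.80, "groundedness": 0.90},
    "ga":    {"task_success": 0.90, "groundedness": 0.97},
}

def eligible_stage(metrics: dict[str, float]) -> str:
    """Highest stage whose minimum thresholds the measured metrics meet."""
    stage = "prototype"
    for name in ("pilot", "ga"):
        gate = STAGE_GATES[name]
        if all(metrics.get(m, 0.0) >= v for m, v in gate.items()):
            stage = name
        else:
            break
    return stage
```

For example, metrics of 0.85 task success and 0.92 groundedness clear the pilot gate but not GA, so the build stays in pilot until the GA thresholds are met.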

Artifact 4: a delivery plan with risks, dependencies, and milestones. Dependencies usually include data access approvals, connector buildout, policy reviews, and platform decisions (hosting, model provider). Risks should be explicit: missing entitlements data, low-quality documents, prompt injection exposure, or unclear human-in-the-loop ownership. Milestones should map to de-risking events (e.g., “permission-aware retrieval working on 50-document corpus”) rather than vague dates.

When these artifacts stay linked—metrics tied to requirements, requirements tied to architecture, architecture tied to evaluation—you make the transition from PM thinking to LLM architecture thinking: you are no longer shipping a feature, you are operating a system with measurable behavior.

Chapter milestones
  • Map stakeholders, users, and job-to-be-done into an AI use case brief
  • Define success metrics and non-goals for an LLM feature
  • Identify data readiness and operational constraints early
  • Draft an architecture one-pager that aligns business and engineering
  • Create a delivery plan with risks, dependencies, and milestones
Chapter quiz

1. What is the key shift when moving from PM thinking to LLM solutions architecture thinking?

Correct answer: Translating business intent into systems with measurable behavior under operational constraints
The chapter emphasizes a shift toward measurable behavior, controlled failure modes, and operating safely within real constraints.

2. Why do LLM requirements need to include evaluation plans and safety guardrails?

Correct answer: Because LLM outputs are probabilistic and context-sensitive, so acceptance criteria must define and test behavior
Since LLM behavior varies with context, data, and prompts, requirements must include how behavior will be evaluated and kept safe.

3. Which set best represents the chapter’s three mindsets for LLM architecture work?

Correct answer: Outcome mindset, Constraint mindset, Evidence mindset
The chapter explicitly names outcome, constraint, and evidence mindsets as guiding principles.

4. In the chapter’s workflow, what does it mean to treat the artifacts as a “traceable chain”?

Correct answer: Each decision should be explainable from business goal to technical pattern, and constraints must feed back into scope and release gates
Traceability links business intent to architecture choices and ensures constraints are reflected in scope, expectations, and gates.

5. What is the intended outcome by the end of Chapter 1 when given a vague request like “add AI to support”?

Correct answer: A bounded, testable architecture direction with clear tradeoffs among prompting, tool use, retrieval (RAG), and fine-tuning
The chapter aims to turn vague requests into implementable, testable architecture direction with explicit tradeoffs and constraints.

Chapter 2: Model & Pattern Selection (Prompting, Tools, RAG, Fine-Tune)

In Chapter 1 you turned a business problem into an LLM-ready requirement: a target user, a job-to-be-done, constraints (latency, cost, privacy), and acceptance criteria. This chapter is where you make the first “architecture” decisions that materially affect delivery: interaction pattern, baseline model and deployment, prompt and tool design, and whether you need retrieval (RAG), fine-tuning, or a hybrid. These decisions are not about chasing the most advanced model—they are about selecting the smallest set of capabilities that reliably meets requirements under real constraints.

As a PM transitioning into an LLM Solutions Architect role, you should treat pattern and model selection as a controlled experiment. Start with a baseline that is cheap and easy to ship, prove it meets acceptance criteria with evidence, then add complexity only when you can show why it is needed. Stakeholders usually ask for “the best model”; your job is to translate that into measurable outcomes: accuracy on a gold set, reduced escalation rate, acceptable hallucination risk, and predictable cost per request.

Throughout this chapter, keep one goal in mind: produce decisions that you can justify, document, and get signed off. That means articulating tradeoffs, showing test results, and making the system behavior controllable with structured outputs, validation, and safety guards.

Practice note for this chapter's milestones (choosing an interaction pattern and justifying it; selecting a baseline model and deployment option with constraints; designing prompts and tool schemas for controllable behavior; deciding whether retrieval or fine-tuning is needed using evidence; documenting tradeoffs and getting stakeholder sign-off): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Section 2.1: Pattern catalog: chat, extract, classify, generate, agentic

Start by choosing an interaction pattern. Patterns are reusable “shapes” of LLM work that map cleanly to requirements and evaluation. Picking the wrong pattern is a common cause of overbuilding (agent frameworks for simple extraction) or underbuilding (pure chat for regulated workflows that need deterministic outputs).

  • Chat: best for open-ended support and exploration where user intent may evolve; acceptance criteria often include user satisfaction, deflection rate, and safe refusal behavior.
  • Extract: fits when you have source text and need structured fields (e.g., invoice number, policy effective date); it demands strong schema adherence and low variance.
  • Classify: fits routing decisions (priority, topic, risk label) where the output space is small and measurable with precision/recall.
  • Generate: for producing new content (email drafts, code scaffolds, summaries) where style and constraints matter.
  • Agentic: multi-step planning plus tool use for workflows that require action across systems (create a ticket, query a database, schedule a meeting) and must manage state, retries, and permissions.

A practical selection workflow: (1) Write the “unit of work” the system must do in one request, (2) identify whether the output must be deterministic and auditable, (3) decide whether the system must act (tools) or only speak (text), and (4) map to the simplest pattern. For example: “Read a contract and extract termination clause dates” is extract, not chat. “Route inbound requests to the right team” is classify, not agentic. “Resolve a customer issue by checking order status and issuing a refund” is agentic with strict tool permissions.

  • Common mistake: treating everything as chat and hoping prompt instructions will make it reliable. If stakeholders need consistent fields, use extract with a schema and validators.
  • Practical outcome: a one-page pattern justification tied to acceptance criteria (what success looks like and how you will measure it).

Choose the pattern first; it will constrain model choice, prompt format, tool design, and whether RAG or fine-tuning is worth the effort.
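The "schema and validators" point for the extract pattern can be made concrete with a small check on the model's raw output. The field names are hypothetical; real schemas would typically use a validation library, but the principle is the same.

```python
import json

# Hypothetical extraction schema for a contract-review task.
REQUIRED_FIELDS = {"termination_date": str, "notice_period_days": int}

def validate_extraction(raw: str) -> dict:
    """Parse and validate model output against the extraction schema.

    Raises ValueError on violations so the orchestrator can retry or
    escalate instead of passing unchecked fields downstream.
    """
    data = json.loads(raw)
    for field_name, field_type in REQUIRED_FIELDS.items():
        if not isinstance(data.get(field_name), field_type):
            raise ValueError(f"bad or missing field: {field_name}")
    return data

ok = validate_extraction('{"termination_date": "2025-03-31", "notice_period_days": 60}')
```

The validator is what turns "extract" from hopeful prompting into a contract: outputs that fail the schema never reach the consumer.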

Section 2.2: Model choice: capability, context window, and modality

Model selection is an exercise in matching capability to constraints. Treat the baseline model as a replaceable component behind an interface; you should be able to swap vendors or sizes without rewriting the product. In practice, model choice depends on three main factors: capability, context window, and modality (text, vision, audio), plus deployment constraints like data residency and cost.

Capability maps to reasoning depth, instruction following, multilingual needs, and robustness to messy inputs. If your task is classification with a fixed label set, a smaller/cheaper model may outperform a larger one once you add examples and strict output formats. For complex agentic workflows, tool planning, and multi-step transformations, you often need a stronger model to reduce orchestration complexity and retries.

Context window is not just “how many tokens can I stuff in.” Long context helps with summarization and multi-document review, but it also increases cost and can reduce attention to critical details. If your solution depends on large documents, consider chunking + retrieval first, and reserve long-context models for cases where retrieval harms fidelity (e.g., needing cross-document reasoning with consistent references).

Modality matters when inputs are PDFs with images, screenshots, scanned forms, or call recordings. Vision-capable models may simplify pipelines by reducing OCR/ETL steps, but they can be harder to evaluate and may introduce new privacy concerns (e.g., sensitive images). Audio introduces latency and streaming requirements, plus separate evaluation for transcription quality.

  • Deployment option checklist: API vs self-hosted; regional availability; logging controls; encryption; rate limits; cost predictability; model versioning and deprecation policy.
  • Common mistake: selecting a model before defining evaluation. If you cannot explain what “better” means (metric + dataset), you cannot justify the spend.

End this step with a baseline model decision and constraints written down (max latency, max cost per request, required regions, and any prohibited data categories). That clarity makes later tradeoffs objective and stakeholder-friendly.

Section 2.3: Prompt design fundamentals and structured outputs

Prompts are executable requirements. A strong prompt does three things: defines the role and goal, constrains the output format, and provides task-specific context. You are not “writing clever text”; you are specifying behavior in a way that is testable and maintainable.

Use a consistent structure: System (non-negotiable policies), Developer (task instructions, format), and User (instance data). Include acceptance criteria as explicit constraints: “If the answer cannot be supported by the provided sources, respond with ‘INSUFFICIENT_EVIDENCE’.” For controlled behavior, prefer structured outputs (JSON) and explicit field definitions over prose.

For extract/classify patterns, define a schema with types, allowed values, and nullability. Add a rule for uncertainty: require the model to leave fields null rather than invent. For generate patterns, define tone, length, banned content, and required sections. Provide a small set of high-quality examples only when needed; too many examples can anchor the model to irrelevant patterns.

  • Practical technique: write prompts alongside tests. Every prompt change should run against a gold set so you can detect regressions.
  • Common mistake: mixing instructions and data. Always delimit user-provided content (e.g., with triple quotes) to reduce prompt injection and parsing errors.

Finally, design prompts for debuggability: request intermediate “rationale” only if your policy allows storing it, and prefer short, structured “decision traces” (e.g., which rule matched) over full chain-of-thought. Your goal is controllability and auditability, not verbosity.

Section 2.4: Tool/function calling: schemas, validation, and safety

Tool (function) calling is how you move from “LLM answers” to “LLM does.” When the model can call tools—search, database queries, ticket creation, price calculations—you can keep sensitive logic and permissions outside the model and reduce hallucinations. But tool calling must be treated like an API design problem, not a prompt trick.

Start with tool schemas that are minimal and explicit. Every argument needs a type, constraints (min/max, regex), and clear semantics. Prefer multiple small tools over one “doEverything” endpoint; it improves safety and evaluation because each tool has a narrow blast radius. Implement validation at the orchestrator: reject invalid arguments, enforce allowlists (e.g., which tables can be queried), and add idempotency keys for actions that should not repeat.
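Orchestrator-side validation can be sketched as follows. The `lookup_order` tool, its ID pattern, and the validator name are hypothetical; the point is that allowlisting and argument constraints live outside the model.

```python
import re

# Hypothetical tool contract: one narrow tool with explicit argument constraints.
LOOKUP_ORDER_SCHEMA = {
    "name": "lookup_order",
    "description": "Fetch one order's status by ID.",
    "parameters": {
        "order_id": {"type": "string", "pattern": r"^ORD-\d{6}$"},
    },
}

ALLOWED_TOOLS = {"lookup_order"}  # enforce an allowlist at the orchestrator

def validate_tool_call(name: str, args: dict) -> tuple[bool, str]:
    """Reject calls to unknown tools or with malformed arguments."""
    if name not in ALLOWED_TOOLS:
        return False, f"tool '{name}' not allowlisted"
    order_id = args.get("order_id", "")
    pattern = LOOKUP_ORDER_SCHEMA["parameters"]["order_id"]["pattern"]
    if not re.match(pattern, order_id):
        return False, "order_id does not match ORD-NNNNNN"
    return True, "ok"

print(validate_tool_call("lookup_order", {"order_id": "ORD-001234"}))  # (True, 'ok')
print(validate_tool_call("drop_table", {}))  # rejected before execution
```

Because each tool has a narrow schema, a failed validation tells you exactly which contract was violated — which is also what you log for the audit trail.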

Safety controls should be layered: (1) authorization checks before execution, (2) input sanitization, (3) output verification (did the tool result match expectations), and (4) rate limits and audit logging. For high-risk actions (refunds, account changes), require a confirmation step or human approval—this is a product decision and an architecture decision.

  • Common mistake: letting the model “compose SQL” directly. Use parameterized queries or a constrained query tool that maps user intent to safe query templates.
  • Practical outcome: a tool contract document: name, purpose, schema, validation rules, error modes, and test cases.

Well-designed tools reduce the model’s cognitive load. The more deterministic the surrounding system is, the less you need to rely on an expensive model to keep the workflow on track.

Section 2.5: RAG vs fine-tuning vs hybrid: decision criteria

Deciding between retrieval-augmented generation (RAG), fine-tuning, or a hybrid is where many teams waste months. Use evidence-based criteria tied to failure modes you observed in evaluation—not intuition.

Choose RAG when the model’s failures come from missing or changing knowledge: product policies, internal docs, account-specific data, or rapidly updated information. RAG works best when sources are well-structured, chunkable, and attributable. Your acceptance criteria can then include citation correctness and “answer must be supported by sources.” Common pitfalls: poor chunking, noisy documents, weak queries, and missing permission filters (returning data the user shouldn’t see).

Choose fine-tuning when the model knows the domain but fails at consistent behavior: formatting, tone, classification boundaries, or specialized extraction style. Fine-tuning can reduce prompt length and improve reliability, but it does not magically inject up-to-date facts. You need curated training data and a plan for model/version governance. A frequent mistake is fine-tuning to “learn the handbook”; that content will drift and cause confident wrong answers.

Choose hybrid when you need both: RAG for factual grounding plus fine-tuning for consistent structure or tool-use discipline. Hybrid is powerful but increases operational complexity: more components to monitor, more ways to regress, and more stakeholder education required.

  • Decision test: If you remove all external documents, does the model still fail in the same way? If yes, consider fine-tuning/prompting. If no, prioritize RAG.
  • Practical outcome: an experiment plan comparing (A) prompt-only, (B) RAG, (C) fine-tune, using the same gold set and reporting cost/latency/quality.

Make the decision with a small prototype and a measurable delta. Stakeholders sign off faster when you can show, for example, “RAG reduced unsupported answers from 18% to 3% at +120ms latency.”

Section 2.6: Architecture decision records (ADRs) for AI systems

Once you’ve selected a pattern, model, prompt strategy, and RAG/fine-tune approach, document the decisions in Architecture Decision Records (ADRs). ADRs are lightweight, durable artifacts that make tradeoffs explicit and prevent “decision churn” when a new stakeholder or model release appears.

An AI ADR should include: Context (use case, constraints, stakeholders), Decision (what you chose), Options considered (at least two), Rationale (evidence from evaluation), Consequences (cost/latency/risks), and Acceptance criteria (release gates and metrics). For LLM systems, add two extra sections: Safety & compliance (data handling, logging, PII, retention, red-teaming scope) and Operational plan (monitoring signals, fallback behavior, model/version pinning, rollback steps).

Use ADRs as the mechanism to get stakeholder sign-off. Your goal is not unanimous enthusiasm; it is documented agreement on tradeoffs. Example: “We will ship prompt+RAG with citations, not fine-tuning, because policy content changes weekly; revisit fine-tuning after 8 weeks of production logs show stable intent taxonomy.”

  • Common mistake: documenting only the final choice. Without options and evidence, ADRs become hindsight narratives instead of decision tools.
  • Practical outcome: a signed ADR set that unblocks implementation planning: interfaces, evaluation gates, and a reference build roadmap.

With ADRs in place, Chapter 3 can focus on end-to-end architecture: data pipelines, retrieval, orchestration, and APIs—built on decisions you can defend.

Chapter milestones
  • Choose an interaction pattern for the use case and justify it
  • Select a baseline model and deployment option with constraints
  • Design prompts and tool schemas for controllable behavior
  • Decide whether retrieval or fine-tuning is needed using evidence
  • Document tradeoffs and get stakeholder sign-off
Chapter quiz

1. What is the guiding principle for choosing models and patterns in Chapter 2?

Show answer
Correct answer: Select the smallest set of capabilities that reliably meets requirements under real constraints
The chapter emphasizes meeting requirements reliably under constraints, not chasing the most advanced model.

2. How should a PM-turned-LLM Solutions Architect approach pattern and model selection?

Show answer
Correct answer: As a controlled experiment: start with a cheap baseline, prove against acceptance criteria, then add complexity with evidence
The chapter frames selection as iterative and evidence-based, starting with a baseline and increasing complexity only when justified.

3. When stakeholders ask for “the best model,” what does Chapter 2 recommend translating that request into?

Show answer
Correct answer: Measurable outcomes like gold-set accuracy, escalation reduction, hallucination risk, and cost per request
The chapter advises converting vague preferences into measurable success metrics tied to requirements.

4. What does Chapter 2 say you should do before adding retrieval (RAG), fine-tuning, or a hybrid approach?

Show answer
Correct answer: Show evidence that the baseline does not meet acceptance criteria and justify the added complexity
The chapter stresses proving the baseline first and adding complexity only with evidence of need.

5. What is the purpose of making system behavior 'controllable' as described in Chapter 2?

Show answer
Correct answer: To enable predictable behavior via structured outputs, validation, and safety guards that can be documented and signed off
The chapter links controllability to predictability, safeguards, and documentation for stakeholder sign-off.

Chapter 3: Designing the LLM System End-to-End

In Chapter 2 you translated business intent into LLM-ready requirements. Now you will design the system that can actually meet those requirements reliably: data flows, retrieval, orchestration, APIs, and performance. This is where many PM-to-architect transitions succeed or fail. A product requirement like “answer accurately using internal docs” is not an implementation; you must decide how knowledge is governed, retrieved, validated, and served under real latency, cost, and compliance constraints.

Think end-to-end. Your users see a chat box, but your system is a pipeline: content ingestion and governance controls, indexing and retrieval, an orchestration layer for multi-step tasks, and one or more APIs that integrate with enterprise systems. Every decision needs a tradeoff statement you can defend (accuracy vs. latency, openness vs. safety, freshness vs. cost). You also need a release strategy: gold sets and acceptance criteria become release gates, and the architecture must be testable from day one with reference builds and scaffolds.

A practical way to start is to sketch a “happy path” sequence diagram (user request → retrieval → model call → post-processing → response), then add failure paths (retrieval empty, tool call fails, policy violation, rate-limited upstream). When you can explain how the system behaves in each failure mode, you are doing solutions architecture, not just “prompt design.”

  • Data and retrieval pipeline: ingestion, chunking, embeddings, indexes, filters, governance controls, and update cadence.
  • Orchestration flow: multi-step plans, deterministic tool steps, state/memory, retries, and fallbacks.
  • API design: contracts, versioning, error semantics, observability, and backward compatibility.
  • Performance plan: token budgets, caching, batching, concurrency limits, and cost guardrails.
  • Reference artifacts: a baseline diagram plus an integration checklist so teams can implement consistently.

The rest of this chapter walks through those pieces in the order most teams build them: retrieval foundations, retrieval strategy, orchestration, guardrails, integration, and performance. Your goal is not a perfect diagram; it is a design you can implement, evaluate, and operate.


Practice note for Design the data and retrieval pipeline with governance controls: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Create the orchestration flow for multi-step tasks and fallbacks: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Define API contracts, error handling, and versioning strategy: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Plan caching, batching, and token budgets to hit cost/latency targets: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Produce a reference architecture diagram and integration checklist: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.


Section 3.1: RAG foundations: chunking, embeddings, and indexing

Retrieval-Augmented Generation (RAG) is usually the fastest path to enterprise usefulness because it keeps domain knowledge outside the model and under governance. The architecture starts with an ingestion pipeline that converts source systems (wiki pages, PDFs, tickets, policies) into a normalized, permission-aware corpus. The common mistake is treating ingestion as a one-time export. In production, ingestion is a product: it needs schedules, change detection, backfills, and audit logs.

Chunking is the first key design choice. Chunks that are too large waste tokens and reduce recall; chunks that are too small lose context and create “semantic confetti.” A practical default is 300–800 tokens per chunk with 10–20% overlap, then adjust based on document type (procedures chunk well by steps; policies chunk by sections; tables may require custom parsing). Include metadata with each chunk: source URL, title, section heading, updated_at, author, system_of_record, and—critically—ACL information for authorization filtering.
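A minimal chunker following the defaults above might look like this. Word count stands in for real tokenization, and the function name is an assumption; the 400-word window with roughly 15% overlap sits inside the 300–800 token / 10–20% range the text recommends.

```python
# Minimal sketch: overlapping word-window chunking (word count approximates tokens).
def chunk_text(text: str, chunk_size: int = 400, overlap: int = 60) -> list[str]:
    """Split text into overlapping chunks so context survives chunk boundaries."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunk = words[start:start + chunk_size]
        if chunk:
            chunks.append(" ".join(chunk))
        if start + chunk_size >= len(words):
            break
    return chunks

doc = " ".join(f"w{i}" for i in range(1000))
chunks = chunk_text(doc)
print(len(chunks))  # 3
# Consecutive chunks share the last/first `overlap` words:
print(chunks[0].split()[-60:] == chunks[1].split()[:60])  # True
```

A production pipeline would swap the word window for a real tokenizer and add the per-document-type logic (steps, sections, tables) described above, but the overlap invariant is what a CI regression test should pin down.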

Embeddings turn chunks into vectors. Your job is to choose an embedding model that matches your language, domain, and cost constraints, then standardize on a single embedding dimension and normalization strategy so you can reindex safely. A common failure mode: mixing embeddings from different models in the same index. Treat embeddings like a schema: version them. Store embedding_model_id and embedding_version on each record and plan a re-embed migration path.
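Treating embeddings like a versioned schema can be as simple as the record sketch below. The field names follow the metadata the text recommends (`embedding_model_id`, `embedding_version`, ACL information); the exact record shape and the hypothetical model ID are assumptions.

```python
from dataclasses import dataclass

# Sketch of a versioned, permission-aware index record.
@dataclass
class ChunkRecord:
    chunk_id: str
    text: str
    source_url: str
    updated_at: str
    acl_groups: list[str]           # for authorization filtering at query time
    embedding: list[float]
    embedding_model_id: str = "embed-model-x"   # hypothetical model ID
    embedding_version: int = 1

def needs_reembed(record: ChunkRecord, current_model: str, current_version: int) -> bool:
    """A record is stale if either the embedding model or its version changed."""
    return (record.embedding_model_id != current_model
            or record.embedding_version != current_version)

rec = ChunkRecord("c1", "...", "https://wiki/doc", "2024-01-01", ["support"], [0.1, 0.2])
print(needs_reembed(rec, "embed-model-x", 2))  # True: version bumped, migrate
```

A re-embed migration then becomes a query for stale records rather than a guess about which index entries came from which model.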

Indexing is not only “put vectors in a database.” You must choose the index type (HNSW, IVF, PQ), the refresh approach (batch vs. streaming), and the durability model (rebuild vs. incremental). If your business requires freshness (e.g., support runbooks updated daily), you need a predictable indexing SLA: “docs visible to retrieval within 30 minutes of publish.” That SLA becomes an acceptance criterion and an operational alert.

  • Governance control: define which sources are allowed, who approves them, and how removals propagate (right-to-be-forgotten and retention).
  • Quality control: run deduplication, boilerplate stripping, language detection, and PII scanning before indexing.
  • Test scaffolds: keep a small “gold corpus” and deterministic chunking config in your repo so retrieval regressions are catchable in CI.

Outcome: a repeatable, versioned data-to-index pipeline with metadata and ACLs built in, not bolted on.

Section 3.2: Retrieval strategies: hybrid search, reranking, filters

Once you have an index, the next question is: “How do we reliably fetch the right context?” Pure vector search is rarely enough in enterprise settings because users ask with acronyms, ticket IDs, and exact phrases. A practical baseline is hybrid search: combine lexical search (BM25/keyword) with vector similarity, then merge results. This improves recall for exact matches while keeping semantic flexibility.

Design retrieval as a sequence of explicit steps you can tune: query rewrite → candidate retrieval → filtering → reranking → context assembly. Query rewrite (sometimes called “search query generation”) is a low-risk LLM call that can expand acronyms and add synonyms, but it should be tightly constrained and logged because it affects traceability. Candidate retrieval typically pulls 20–100 chunks. Filters then enforce business constraints: tenant, department, doc type, time range, and authorization. Skipping filters is a classic security bug: you might “retrieve the best answer” from a document the user is not allowed to see.

Reranking is the biggest quality lever after good chunking. Use a cross-encoder reranker (or a reranking-capable model) to score top candidates against the full query. Reranking adds latency, so treat it as a configurable tier: enable it for high-stakes tasks, disable it for low-risk Q&A or when the system is under load. A good pattern is adaptive retrieval: if top-1 confidence is high, skip reranking; otherwise rerank.
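One simple, widely used way to merge lexical and vector result lists is reciprocal rank fusion (RRF). The text does not prescribe a specific fusion method, so treat this as one illustrative option; the constant `k=60` is a common default, not a tuned value.

```python
# Reciprocal rank fusion: merge two ranked lists of chunk IDs into one ranking.
def rrf_merge(lexical: list[str], vector: list[str], k: int = 60) -> list[str]:
    """Score each document by the sum of 1/(k + rank) across both rankings."""
    scores: dict[str, float] = {}
    for ranking in (lexical, vector):
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

lexical_hits = ["ticket-123", "policy-7", "runbook-2"]   # BM25/keyword results
vector_hits = ["policy-7", "faq-9", "ticket-123"]        # embedding results
print(rrf_merge(lexical_hits, vector_hits))
# 'policy-7' ranks first because it appears near the top of both lists
```

RRF is attractive as a baseline precisely because it has one knob and no score normalization to debug; a cross-encoder reranker can then re-score the fused top candidates as described above.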

  • Filters as governance: enforce ACLs, data residency, and “approved sources only.” Make filters part of the retrieval API contract, not an optional parameter.
  • Context assembly: don’t just concatenate chunks. Order by relevance, group by document, cap per-document contribution, and include citation metadata.
  • Empty retrieval behavior: define what “no evidence” means. You may respond with clarifying questions, suggest sources, or refuse to answer rather than hallucinate.

Outcome: a retrieval strategy that is tunable, explainable, and safe, with explicit knobs for recall, precision, latency, and compliance.

Section 3.3: Orchestration: state, memory, and deterministic steps

Orchestration is the “control plane” of your LLM application: it decides what to do next and how to recover when something fails. Many teams start with a single prompt and then keep adding instructions until it becomes brittle. A better approach is to build a multi-step workflow where only the reasoning and language generation are delegated to the model; everything else is deterministic.

Start by identifying the steps that must be deterministic for reliability or compliance: permission checks, document retrieval, tool invocation, calculations, formatting, and logging. Then decide where the LLM adds value: query rewrite, summarization, extraction, and final response composition. For multi-step tasks (e.g., “draft a customer response and open a ticket”), define a state machine: RECEIVED → RETRIEVE → PLAN → EXECUTE_TOOLS → COMPOSE → VALIDATE → DELIVER. Store state in a durable store (Redis/Postgres) so you can retry without losing context and so you can trace incidents.
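The state machine above can be encoded explicitly rather than left implicit in prompt logic. This sketch uses the step names from the text; the transition table and helper are illustrative, and in production the current state would be persisted (Redis/Postgres) between calls so retries resume where they left off.

```python
from enum import Enum, auto

# The workflow steps from the text, as an explicit state machine.
class Step(Enum):
    RECEIVED = auto()
    RETRIEVE = auto()
    PLAN = auto()
    EXECUTE_TOOLS = auto()
    COMPOSE = auto()
    VALIDATE = auto()
    DELIVER = auto()

TRANSITIONS = {
    Step.RECEIVED: Step.RETRIEVE,
    Step.RETRIEVE: Step.PLAN,
    Step.PLAN: Step.EXECUTE_TOOLS,
    Step.EXECUTE_TOOLS: Step.COMPOSE,
    Step.COMPOSE: Step.VALIDATE,
    Step.VALIDATE: Step.DELIVER,
}

def advance(state: Step) -> Step:
    """Move to the next step; persisting `state` externally enables safe retries."""
    return TRANSITIONS[state]

state = Step.RECEIVED
while state is not Step.DELIVER:
    state = advance(state)
print(state.name)  # DELIVER
```

An explicit table like this also gives incident traces a vocabulary: "the request failed in EXECUTE_TOOLS after two retries" is actionable in a way that "the prompt misbehaved" is not.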

Memory is often misunderstood. Conversation history is not the same as long-term memory. Treat chat history as short-lived context with a token budget; summarize it deterministically or via a summarizer model with strict formatting. For long-term personalization or case context, store structured facts (preferences, account IDs, prior outcomes) in a database and retrieve them like any other data source, with governance and consent. Avoid “let the model remember” as an architecture.

  • Fallbacks: define what happens when retrieval is empty, a tool times out, or the model hits rate limits. Fallbacks should be product decisions (e.g., show sources, ask for clarification, queue the task).
  • Idempotency: tool calls that create side effects (tickets, emails) must use idempotency keys to prevent duplicate actions on retries.
  • Deterministic outputs: for extraction, prefer JSON schemas and validators so downstream services don’t parse free text.

Outcome: an orchestration layer that makes the system testable and operable, with clear state, retries, and deterministic integration points.

Section 3.4: Guardrails: validation, refusal handling, and policy checks

Guardrails are not a single filter; they are multiple checks across the pipeline. The purpose is to reduce harm (security, privacy, compliance) and reduce business risk (wrong actions, wrong claims). Treat guardrails as requirements with acceptance criteria: “No cross-tenant data leakage,” “No medical advice,” “All responses cite evidence,” or “Tool calls must be approved by policy.”

Implement guardrails in three layers. Input validation checks for prompt injection patterns, disallowed content, and suspicious requests (e.g., “ignore policies”). It can also enforce structural constraints: required fields, max lengths, and user authentication state. Policy checks occur before retrieval (authorization) and before tool calls (capabilities). For example, even if a user asks to “delete all records,” the orchestrator should block that tool call unless the user role and ticket context permit it.

Output validation ensures the model response is safe and usable. Validate JSON against a schema, enforce citation requirements, and check for sensitive data leakage (PII, secrets). If validation fails, do not “hope the model is fine.” Trigger a repair step (ask the model to reformat), a safe fallback message, or a refusal. Refusal handling should be user-friendly and consistent: explain what you can do, ask for a safer alternative, and log the policy category for audit.
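A bare-bones version of that output-validation layer is sketched below. The required fields, the fallback message, and the `safety_flag` name are assumptions; the repair step is reduced to a comment since it would involve another model call.

```python
import json

# Hypothetical output contract and safe fallback.
REQUIRED_FIELDS = {"answer", "citations"}
FALLBACK = {
    "answer": "I can't verify this right now. Please try rephrasing.",
    "citations": [],
    "safety_flag": "VALIDATION_FAILED",
}

def validate_output(raw: str) -> dict:
    """Parse and check the model response; return a safe fallback on failure."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return FALLBACK  # in production: trigger a repair/reformat step first
    if not REQUIRED_FIELDS <= data.keys():
        return FALLBACK
    if not data["citations"]:  # enforce the "all responses cite evidence" gate
        return FALLBACK
    return data

ok = validate_output('{"answer": "See policy 4.2", "citations": ["doc-7#s2"]}')
bad = validate_output("The answer is probably yes")
print(ok["citations"], bad["safety_flag"])
```

The key property is that the happy path and every failure path both return the same response shape, so downstream clients never have to parse free text.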

  • Common mistake: relying only on the model’s built-in safety. Enterprise deployments need explicit controls and auditable decisions.
  • Release gates: add guardrail test cases to your gold set (prompt injection attempts, cross-tenant queries, PII in outputs).
  • Observability: log which guardrail fired, on which step, and with which policy version to support compliance reviews.

Outcome: a system that can say “no” safely, prove why it said “no,” and avoid silent failures that become security incidents.

Section 3.5: Integration architecture: services, queues, and connectors

Now connect the LLM capability to the enterprise. A maintainable integration architecture separates concerns into services with clear contracts. A common reference split is: API Gateway (auth, rate limits), LLM Orchestrator (workflow/state), Retrieval Service (search/rerank/context), Policy Service (authorization and compliance rules), and Connectors (CRM, ticketing, data lake). This separation enables independent scaling, clearer ownership, and safer changes.

Define API contracts early. Your orchestrator API should specify: request schema, supported features (citations, tool use), and response schema with machine-readable fields (answer text, citations array, safety flags, trace_id). Error handling is part of the product. Standardize error types such as VALIDATION_ERROR, UNAUTHORIZED, POLICY_BLOCKED, RETRIEVAL_EMPTY, UPSTREAM_TIMEOUT, and RATE_LIMITED, and document how clients should react.
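The error taxonomy above can be made machine-readable so clients know how to react. The error names come from the text; which errors are worth retrying is an illustrative assumption that your contract would document explicitly.

```python
from enum import Enum

# The standardized error types named in the text.
class ErrorType(Enum):
    VALIDATION_ERROR = "VALIDATION_ERROR"
    UNAUTHORIZED = "UNAUTHORIZED"
    POLICY_BLOCKED = "POLICY_BLOCKED"
    RETRIEVAL_EMPTY = "RETRIEVAL_EMPTY"
    UPSTREAM_TIMEOUT = "UPSTREAM_TIMEOUT"
    RATE_LIMITED = "RATE_LIMITED"

# Assumed client guidance: only transient failures are worth retrying.
RETRYABLE = {ErrorType.UPSTREAM_TIMEOUT, ErrorType.RATE_LIMITED}

def client_should_retry(error: ErrorType) -> bool:
    """Part of the API contract: tell clients which errors to retry with backoff."""
    return error in RETRYABLE

print(client_should_retry(ErrorType.RATE_LIMITED))   # True
print(client_should_retry(ErrorType.POLICY_BLOCKED)) # False: retrying won't help
```

Encoding "how clients should react" alongside the error types keeps the contract testable: a new error type without a retry decision fails review, not production.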

Use versioning intentionally. Prompt changes, retrieval changes, and policy changes can break behavior. Version APIs (e.g., /v1), and also version internal artifacts (prompt templates, embedding model IDs, policy rules). This makes rollbacks possible. For long-running tasks (report generation, bulk summarization), introduce queues and async workflows. Put the orchestrator behind a job model: submit → process → callback/webhook → fetch result. This prevents user-facing timeouts and makes retries safe.

  • Integration checklist: auth method (OIDC/SAML), tenant isolation, logging/PII redaction, rate limits, idempotency keys, and audit trails.
  • Connectors: treat each tool/data source as an adapter with a stable interface; mock them in tests.
  • Reference build plan: create a repo structure with /services, /contracts, /prompts, /eval, /infra, and test scaffolds for retrieval and guardrails.

Outcome: an architecture that integrates cleanly with enterprise systems and can evolve without breaking clients or compliance commitments.

Section 3.6: Performance design: token limits, caching, and concurrency

Performance is where “cool demo” becomes “usable product.” You need explicit cost and latency targets, then design token budgets and concurrency to meet them. Start by allocating a token budget per request: system prompt + conversation history + retrieved context + tool outputs + model response. If you do not budget tokens, you will accidentally exceed limits, truncate key context, or incur unpredictable costs.

Practical tactics: cap retrieved context by tokens, not by number of chunks; summarize older conversation turns; and enforce maximum response length. Build a “context assembler” that can drop low-value text first (boilerplate, repeated headers) and preserve high-value fields (definitions, steps, constraints). When tool outputs are large (tickets, logs), extract only relevant fields rather than pasting entire payloads into the prompt.
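A context assembler that caps retrieved context by tokens (not chunk count) can be sketched as below. Word count stands in for real tokenization, and the skip-rather-than-truncate policy is one reasonable choice, not the only one.

```python
# Sketch: token-budgeted context assembly; word count approximates tokens.
def assemble_context(chunks: list[str], budget_tokens: int) -> list[str]:
    """Keep the highest-ranked chunks that fit; never truncate mid-chunk."""
    kept, used = [], 0
    for chunk in chunks:  # chunks arrive ordered by relevance
        cost = len(chunk.split())
        if used + cost > budget_tokens:
            continue      # skip a chunk that would overflow; try smaller ones
        kept.append(chunk)
        used += cost
    return kept

ranked = ["a " * 50, "b " * 80, "c " * 40]  # 50-, 80-, 40-word chunks
result = assemble_context(ranked, budget_tokens=100)
print(len(result))  # 2: the 80-word chunk is skipped, the 50- and 40-word fit
```

Skipping instead of truncating preserves chunk integrity (and citations); the "drop low-value text first" refinement from the text would run before this budgeting pass.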

Caching is a major lever. Use multi-layer caching: (1) embedding cache for repeated chunks, (2) retrieval result cache keyed by normalized query + filter set + index version, and (3) response cache for safe, non-personalized queries (e.g., public policy answers). Cache invalidation must respect governance: if a document is removed or access changes, cached results must expire or be revalidated. Include index version and policy version in cache keys to avoid serving stale or non-compliant outputs.
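A governance-aware cache key might look like this. Including the index and policy versions in the key (as the text recommends) guarantees any reindex or policy change misses the cache rather than serving stale output; the normalization step shown is deliberately simplistic.

```python
import hashlib

# Sketch: cache key that bakes in index and policy versions for safe invalidation.
def cache_key(query: str, filters: dict,
              index_version: str, policy_version: str) -> str:
    normalized = " ".join(query.lower().split())   # naive query normalization
    parts = [normalized, str(sorted(filters.items())),
             index_version, policy_version]
    return hashlib.sha256("|".join(parts).encode()).hexdigest()

k1 = cache_key("Refund policy?", {"tenant": "acme"}, "idx-42", "pol-7")
k2 = cache_key("refund   POLICY?", {"tenant": "acme"}, "idx-42", "pol-7")
k3 = cache_key("Refund policy?", {"tenant": "acme"}, "idx-43", "pol-7")
print(k1 == k2, k1 == k3)  # True False: reindexing invalidates the cache
```

Bumping `index_version` or `policy_version` thus acts as an implicit, zero-cost cache flush, which is easier to audit than explicit invalidation logic.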

Batching and concurrency matter when load spikes. Batch embedding generation during ingestion, and consider batching rerank requests if your provider supports it. On the serving path, set concurrency limits per tenant and per user, and define backpressure behavior: queue requests, return 429 with retry hints, or degrade features (disable reranking) to preserve core functionality. Also plan for provider failures with circuit breakers and multi-region or multi-provider routing when required by SLAs.

  • Cost controls: per-request token caps, per-tenant budgets, and alerts when spend deviates from forecast.
  • Latency controls: timeouts per step (retrieval, rerank, model), plus a “fast fail” path when deadlines are exceeded.
  • Release gates: performance regression tests (p95 latency, average tokens, cache hit rate) alongside quality metrics.

Outcome: a system that can hit cost/latency targets predictably, degrade gracefully under load, and scale without surprising bills.

Chapter milestones
  • Design the data and retrieval pipeline with governance controls
  • Create the orchestration flow for multi-step tasks and fallbacks
  • Define API contracts, error handling, and versioning strategy
  • Plan caching, batching, and token budgets to hit cost/latency targets
  • Produce a reference architecture diagram and integration checklist
Chapter quiz

1. According to Chapter 3, what turns the requirement “answer accurately using internal docs” into an implementable design?

Show answer
Correct answer: Specifying how knowledge is governed, retrieved, validated, and served under latency, cost, and compliance constraints
The chapter emphasizes end-to-end system decisions (governance, retrieval, validation, serving) rather than treating requirements as implementation.

2. Which sequence best represents the chapter’s suggested “happy path” for a chat request?

Show answer
Correct answer: User request → retrieval → model call → post-processing → response
The chapter explicitly proposes starting with this happy-path sequence diagram before adding failure paths.

3. What is the main purpose of adding explicit failure paths (e.g., retrieval empty, tool call fails, policy violation, rate-limited upstream) to the design?

Show answer
Correct answer: To demonstrate and define system behavior under each failure mode, which is core to solutions architecture
Chapter 3 frames solutions architecture as being able to explain system behavior across failure modes, not just prompt design.

4. Which grouping correctly matches the chapter’s end-to-end pipeline components users don’t see behind the chat box?

Show answer
Correct answer: Content ingestion and governance controls, indexing and retrieval, orchestration for multi-step tasks, and APIs integrating with enterprise systems
The chapter describes the system as a pipeline spanning ingestion/governance, retrieval, orchestration, and integration via APIs.

5. Which set of design levers does Chapter 3 highlight for meeting cost and latency targets?

Show answer
Correct answer: Token budgets, caching, batching, concurrency limits, and cost guardrails
The performance plan in the chapter calls out token budgets, caching, batching, concurrency limits, and guardrails to control cost/latency.

Chapter 4: Evaluation, Testing, and Release Gates

Shipping an LLM feature without a defensible evaluation plan is like shipping a new pricing model without finance controls: you may get early wins, but you will eventually lose trust. In LLM systems, “works on my prompt” is not evidence. Your job as a solutions architect is to turn vague notions of “helpful” into measurable acceptance criteria, then enforce those criteria through offline evaluation, online monitoring, and release gates.

This chapter gives you a practical workflow: define what “good” means per scenario; build a gold dataset; select metrics and judge methods for quality and safety; design offline and online strategies (A/B tests, canaries); create test plans for prompts, tools, retrieval, and regressions; and finally set release gates with escalation procedures when something fails. Think of evaluation not as a single dashboard, but as a chain of controls that aligns engineering decisions to business risk.

A common mistake is optimizing one metric in isolation. For example, raising answer length can increase “helpfulness” ratings while simultaneously increasing hallucinations and compliance risk. Another mistake is treating evaluation as a one-time project. LLMs change (model versions, context windows, safety filters), your data changes (fresh documents, new policies), and user behavior changes (more edge cases). Your evaluation strategy must be designed to run continuously, with clear baselines, thresholds, and rollback plans.

  • Offline evaluation answers: “Is this candidate build better than baseline, and why?”
  • Online evaluation answers: “Does this improve real user outcomes without increasing incidents?”
  • Release gates answer: “What must be true to ship, and what happens if it isn’t?”

By the end of this chapter, you should be able to propose an evaluation plan that a skeptical engineering lead, a risk officer, and a product stakeholder can all accept—because it is specific, repeatable, and aligned to the system’s intended use.

Practice note for this chapter's milestones: for each one (building the gold dataset and defining what "good" means per scenario; selecting metrics and judge methods for quality and safety; designing offline evaluation and online A/B or canary strategies; creating test plans for prompts, tools, retrieval, and regressions; setting release gates and escalation procedures), document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Sections in this chapter
Section 4.1: Evaluation taxonomy: quality, truthfulness, and usefulness

Start by separating what you are evaluating. LLM outputs can be “good” in one sense and unacceptable in another. A useful taxonomy prevents teams from arguing past each other and helps you choose the right metrics and test methods.

Quality covers writing and interaction properties: clarity, tone, completeness, formatting, and whether the answer follows instructions (e.g., “return JSON” or “cite sources”). Quality is often what stakeholders notice first, but it is not the same as correctness.

Truthfulness (or factuality) asks whether claims are supported by allowed sources. In enterprise settings, truthfulness is frequently “truth relative to the provided context,” especially in RAG systems where the model must not invent policy details. Truthfulness evaluation should explicitly state the permitted knowledge: retrieved documents, tool outputs, and approved system metadata.

Usefulness measures whether the output helps the user complete the job-to-be-done. A truthful answer can still be unhelpful if it’s too generic, too cautious, or misses the user’s constraints. Usefulness is best defined per scenario: “reduces handle time,” “increases first-contact resolution,” “creates a correct draft that needs minimal edits,” or “avoids escalation.”

Engineering judgment: treat these as separate acceptance criteria with different owners. For example, product may own usefulness targets, engineering may own truthfulness and reliability, and security may own safety constraints. When you build your gold dataset later, label each example with which dimensions matter most. That prevents over-fitting to stylistic preferences when the real risk is factual error.

Common mistake: using one blended “overall score” too early. Keep dimension-level scores visible so you can diagnose regressions (e.g., higher helpfulness but lower grounding). Practical outcome: your requirements document should include a table mapping scenarios to required levels of quality, truthfulness, and usefulness, plus the failure impact if each dimension is missed.

Section 4.2: Dataset design: scenario coverage and hard cases

An evaluation dataset is not a random sample of conversations. It is a deliberately constructed “gold set” that encodes what the business cares about, including the uncomfortable edge cases that tend to break LLM systems. Build it as a product artifact, not as an afterthought.

Start with scenario coverage. Take your top use cases (from requirements) and create a scenario matrix: user intent, user role, domain area, and output type (answer, summary, email draft, SQL query, tool action). For each cell, add 5–20 examples depending on risk. High-risk flows (customer communication, compliance, refunds, medical/legal) deserve more density.

Then add hard cases: ambiguous questions, missing context, conflicting sources, out-of-policy requests, long inputs, and adversarial prompts. Hard cases are where release gates earn their value. Include “should refuse” examples and “should ask a clarifying question” examples. For RAG, include stale docs, near-duplicate passages, and documents with similar titles but different applicability (common in HR and security policies).

Define what “good” means per scenario by attaching a reference output or a grading rubric. Do not over-index on a single “perfect answer” for open-ended tasks; use rubric criteria such as “mentions required policy clause,” “includes next steps,” “uses customer-safe tone,” and “does not claim certainty when evidence is missing.”

Workflow tip: version your gold set like code. Each item should store: prompt, system instructions version, tool availability, expected citations (if relevant), and labels (risk level, scenario, language). Treat additions and changes as pull requests reviewed by domain experts. Practical outcome: you can explain to leadership exactly what you tested, and you can reproduce results when a model upgrade changes behavior.

Section 4.3: Metrics: exact match, rubric scoring, and LLM-as-judge

Once you have a gold set, choose metrics that match the task type and risk. You rarely need a single metric; you need a small set that covers correctness, formatting, and policy compliance.

Exact match works for constrained outputs: classification labels, tool arguments, JSON fields, SQL templates (sometimes with normalization), and routing decisions. Use exact match when you can, because it is deterministic and fast. Pair it with schema validation for structured outputs and with “no extra keys” checks to enforce contracts.
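These contract checks are easy to make deterministic. A minimal Python sketch, with the field names and schema format chosen for illustration rather than prescribed by the chapter:

```python
import json

def check_structured_output(raw: str, required: dict, expected: dict) -> dict:
    """Validate a model's JSON output against a contract, then exact-match fields.

    required: field name -> expected Python type (the schema)
    expected: field name -> gold value for exact match
    """
    result = {"parses": False, "schema_ok": False,
              "no_extra_keys": False, "exact_match": False}
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return result
    result["parses"] = True
    # Schema validation: every required key present with the right type.
    result["schema_ok"] = all(
        k in data and isinstance(data[k], t) for k, t in required.items()
    )
    # "No extra keys" enforces the contract strictly.
    result["no_extra_keys"] = set(data) <= set(required)
    # Exact match on the graded fields: deterministic and fast.
    result["exact_match"] = all(data.get(k) == v for k, v in expected.items())
    return result
```

For a routing decision, for example, you would pass the raw model output, a schema like `{"route": str, "priority": int}`, and the gold routing label.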

Rubric scoring is better for open-ended generation. Define 3–8 criteria with clear anchors (0/1/2 or 1–5) and examples of what each score means. Keep rubrics short and decision-oriented, such as “contains required disclaimer,” “answers the question directly,” “uses only supported sources,” and “actionable next steps.” Rubrics also enable targeted improvements: you can see which criterion causes failures.
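Keeping dimension-level scores visible is what makes rubrics diagnostic. A small sketch, assuming per-answer criterion scores have already been collected; the criterion names echo the examples above, and the aggregation is illustrative:

```python
from collections import defaultdict

# A short, decision-oriented rubric: criterion -> max anchor score.
RUBRIC = {
    "contains_required_disclaimer": 1,   # binary 0/1
    "answers_directly": 2,               # 0/1/2 anchors
    "uses_only_supported_sources": 2,
    "actionable_next_steps": 1,
}

def criterion_failure_rates(graded: list, rubric: dict = RUBRIC) -> dict:
    """graded: list of {criterion: score} dicts, one per evaluated answer.

    Returns the share of answers scoring below the max on each criterion,
    so you can see which criterion drives failures instead of staring at
    one blended overall score."""
    fails = defaultdict(int)
    for g in graded:
        for crit, max_score in rubric.items():
            if g.get(crit, 0) < max_score:
                fails[crit] += 1
    n = len(graded) or 1
    return {crit: fails[crit] / n for crit in rubric}
```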

LLM-as-judge can scale rubric scoring, but treat it as a measurement instrument that needs calibration. Use a fixed judge model/version, freeze the judge prompt, and periodically validate judge agreement against a small human-labeled subset. Add judge guardrails: require quoted evidence from the candidate answer and, for RAG tasks, require references to retrieved passages. If the judge cannot cite evidence, the score should be lower or “uncertain,” not a confident pass.
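Calibration can be as simple as periodically tracking agreement between the frozen judge and a human-labeled subset. A hypothetical helper:

```python
def judge_agreement(judge_scores: list, human_scores: list, tolerance: int = 0) -> float:
    """Share of items where the frozen LLM judge matches human labels
    within the given tolerance. Re-run this whenever the judge model,
    judge prompt, or rubric version changes."""
    assert len(judge_scores) == len(human_scores), "paired labels required"
    hits = sum(abs(j - h) <= tolerance
               for j, h in zip(judge_scores, human_scores))
    return hits / len(judge_scores)
```

If agreement drops below a threshold you chose in advance, treat judge scores as unreliable until the judge is re-validated.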

Common mistakes: (1) letting the candidate model judge itself; (2) changing judge prompts frequently, which breaks comparability; (3) accepting high average scores while ignoring tail failures. Practical outcome: your evaluation report should include distribution metrics (p50/p90), not just averages, and it should explicitly track “critical failure rate” for high-severity mistakes (unsafe tool calls, policy violations, fabricated citations).
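Distribution metrics and a critical-failure rate need no special tooling. A dependency-free sketch; the nearest-index percentile method is one reasonable choice, not mandated by the chapter:

```python
def percentile(values: list, p: float) -> float:
    """Simple nearest-index percentile (p in [0, 100]), no external deps."""
    s = sorted(values)
    k = max(0, min(len(s) - 1, round(p / 100 * (len(s) - 1))))
    return s[k]

def eval_report(scores: list, critical_flags: list) -> dict:
    """scores: rubric scores per example.
    critical_flags: True where an example contained a high-severity failure
    (unsafe tool call, policy violation, fabricated citation)."""
    return {
        "mean": sum(scores) / len(scores),
        "p50": percentile(scores, 50),
        "p90": percentile(scores, 90),  # report the distribution, not just the mean
        "critical_failure_rate": sum(critical_flags) / len(critical_flags),
    }
```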

Section 4.4: RAG evals: retrieval recall, grounding, and citation checks

RAG systems fail in two distinct places: retrieval returns the wrong (or incomplete) context, or generation ignores/overrides the context. Your evaluation must isolate these layers so you can fix the right component.

Retrieval recall asks: “Did we fetch the passages needed to answer?” Build a set of questions with known supporting documents and measure recall@k (and sometimes MRR) against those targets. If you don’t have ground-truth doc IDs, create them: during gold set creation, record which documents justify the answer. Also test query rewriting and metadata filters (business unit, region, effective date) because these often cause silent recall drops.
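Once gold doc IDs exist, recall@k is a few lines. A minimal sketch:

```python
def recall_at_k(retrieved_ids: list, gold_ids: list, k: int = 5) -> float:
    """Share of ground-truth supporting documents found in the top-k results."""
    if not gold_ids:
        return 1.0  # nothing required; vacuously satisfied
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(gold_ids)) / len(gold_ids)

def mean_recall_at_k(runs: list, k: int = 5) -> float:
    """runs: list of (retrieved_ids, gold_ids) pairs from the gold set."""
    return sum(recall_at_k(r, g, k) for r, g in runs) / len(runs)
```

Run it once per candidate retrieval configuration (query rewriting on/off, different metadata filters) to catch silent recall drops.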

Grounding asks: “Are claims supported by retrieved text or tool outputs?” Measure “grounded claim rate” by extracting claims (manual or automated) and checking whether each claim is entailed by the context. For practical teams, a proxy metric is: require the model to quote supporting snippets for key statements, then check whether those snippets appear in the retrieved context.

Citation checks prevent “citation theater.” If your UX shows citations, verify that each citation points to a retrieved chunk, that the cited chunk actually supports the statement, and that the answer does not include uncited factual claims above a severity threshold. Implement automated checks: (1) citations must map to document IDs returned by retrieval; (2) quoted spans must exist verbatim; (3) for policy answers, at least one citation must come from an “authoritative” source class.
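The automated checks above can be sketched as follows; the data shapes (citation dicts, a doc-id-to-chunk map) are assumptions for illustration:

```python
def check_citations(citations: list, retrieved: dict,
                    authoritative_ids=frozenset()) -> dict:
    """citations: list of {"doc_id": ..., "quote": ...} emitted with the answer.
    retrieved: dict doc_id -> chunk text returned by retrieval this turn."""
    # (1) Every citation must map to a document actually retrieved.
    ids_ok = all(c["doc_id"] in retrieved for c in citations)
    # (2) Quoted spans must exist verbatim in the cited chunk.
    quotes_ok = all(
        c["doc_id"] in retrieved and c["quote"] in retrieved[c["doc_id"]]
        for c in citations
    )
    # (3) For policy answers, at least one citation from an authoritative source.
    has_authoritative = any(c["doc_id"] in authoritative_ids for c in citations)
    return {"ids_ok": ids_ok, "quotes_ok": quotes_ok,
            "has_authoritative": has_authoritative}
```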

Design tip: run ablations. Evaluate retrieval-only (can we find it?), then generation with perfect context (can we use it?), then end-to-end. Practical outcome: when stakeholders complain about hallucinations, you can quantify whether the issue is missing documents, poor chunking, embedding drift, or a prompt that fails to enforce grounding.

Section 4.5: Safety evals: prompt injection, jailbreaks, and data leakage

Safety is not a generic “be careful” checkbox; it is a set of targeted adversarial tests matched to your architecture. Safety evaluation should be part of your normal test plan, not an occasional red-team event.

Prompt injection tests whether untrusted content (web pages, user files, tickets) can override system instructions or cause tool misuse. Build a suite where retrieved documents contain malicious instructions like “ignore previous directions” or “exfiltrate secrets.” Your expected behavior should be explicit: the model must treat retrieved text as data, not instructions; it must refuse to reveal system prompts; and it must follow a tool allowlist with parameter validation.

Jailbreaks test whether users can coax forbidden outputs (harassment, self-harm guidance, illegal instructions) or bypass enterprise policy. Include common jailbreak patterns: role-play, “for research,” translation attacks, and multi-turn gradual escalation. Track not only refusal rate, but safe completion quality: a refusal that still provides actionable harmful detail is a failure.

Data leakage tests whether the system reveals sensitive data: secrets in prompts, PII in context windows, proprietary documents outside the user’s authorization, or training data memorization. Evaluate access control by simulating users with different permissions and ensuring retrieval respects ACLs. Add tests for “echoing” user inputs, logs, and tool outputs. If you store conversations, verify redaction and retention controls.

Escalation procedures matter here. Define severity levels (e.g., S0: credential leak; S1: PII exposure; S2: policy-violating advice) and map them to actions: block release, roll back, rotate keys, notify security, and open an incident. Practical outcome: you can demonstrate to compliance that safety is measurable and enforced through gates, not dependent on individual judgment at ship time.

Section 4.6: Release management: baselines, thresholds, and rollbacks

Release gates convert evaluation into operational control. A gate is a decision rule: if metrics cross thresholds, you do not ship (or you ship only to a limited canary). Gates should be defined before you run the experiment, not negotiated after seeing results.

Begin with a baseline: a prior model/prompt/retrieval configuration that is “known good.” Every candidate build must be evaluated against the same gold set and the same judge configuration. Track changes in prompts, chunking, embeddings, rerankers, tool schemas, and model versions as first-class release artifacts.

Define thresholds by scenario risk. For low-risk drafting tasks, you might accept small regressions in style if factuality improves. For high-risk tasks, require “no regression” on critical failure rate and enforce minimum groundedness and safety scores. Use both average improvements and tail constraints (e.g., p90 rubric score, maximum hallucination rate, zero critical leaks). Include non-quality gates too: latency, cost per request, tool error rate, and availability.
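A gate is ultimately a small decision function written before the experiment runs. An illustrative sketch; the metric names and threshold values are made up for the example:

```python
def release_gate(candidate: dict, baseline: dict, *,
                 min_p90_score: float = 3.0,
                 max_latency_p95_ms: float = 2000) -> dict:
    """Pre-registered ship/no-ship rule for one candidate build."""
    reasons = []
    # Hard "no regression" rule on high-severity mistakes.
    if candidate["critical_failure_rate"] > baseline["critical_failure_rate"]:
        reasons.append("critical failure rate regressed vs baseline")
    # Tail constraint, not just the average.
    if candidate["p90_score"] < min_p90_score:
        reasons.append("p90 rubric score below floor")
    # Non-quality gate.
    if candidate["latency_p95_ms"] > max_latency_p95_ms:
        reasons.append("p95 latency over budget")
    return {"ship": not reasons, "reasons": reasons}
```

Because the rule is code, it can run in CI against every candidate build with no post-hoc negotiation.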

Design your online strategy as layered rollout: internal users → limited canary → A/B test → full rollout. Canary rules should include automated rollback triggers (spike in complaints, refusal rate changes, citation mismatch rate, safety filter hits). Keep a kill switch that can disable tool use or fall back to a safe baseline response.

Finally, write the regression plan. Each release should add tests for any incident you had (“never again” cases). Prompt tests, tool contract tests, retrieval evals, and safety suites should run in CI. Practical outcome: releases become routine rather than heroic—because you can prove what changed, why it’s better, and how you will revert if reality disagrees with your offline results.

Chapter milestones
  • Build a gold dataset and define what “good” means per scenario
  • Select metrics and judge methods for quality and safety
  • Design offline evaluation and online A/B or canary strategies
  • Create test plans for prompts, tools, retrieval, and regressions
  • Set release gates and escalation procedures for failures
Chapter quiz

1. Why does the chapter argue that “works on my prompt” is not sufficient evidence to ship an LLM feature?

Show answer
Correct answer: Because evaluation must translate “helpful” into measurable acceptance criteria enforced through offline/online evaluation and release gates
The chapter emphasizes defensible, repeatable evaluation tied to acceptance criteria, not anecdotal success on a single prompt.

2. What is the primary purpose of building a gold dataset in the workflow described?

Show answer
Correct answer: To provide a consistent reference set that defines what “good” looks like per scenario and enables comparable evaluations
A gold dataset anchors scenario-specific definitions of “good” and supports repeatable comparisons across builds.

3. Which pairing best matches the chapter’s framing of offline vs. online evaluation?

Show answer
Correct answer: Offline: determine if a candidate build is better than baseline and why; Online: verify real user outcomes improve without increasing incidents
The chapter distinguishes offline evaluation (better than baseline and why) from online evaluation (real-world impact and incident rates).

4. Which scenario reflects the chapter’s warning about optimizing one metric in isolation?

Show answer
Correct answer: Increasing answer length boosts helpfulness ratings but also raises hallucination and compliance risk
The chapter’s example shows that pushing one metric (length/helpfulness) can worsen others (hallucinations/compliance).

5. What best describes the role of release gates and escalation procedures in this chapter’s evaluation chain?

Show answer
Correct answer: They define what must be true to ship and what actions occur (e.g., thresholds, rollback) when criteria aren’t met
Release gates formalize ship/no-ship conditions and specify escalation/rollback when failures occur.

Chapter 5: Security, Privacy, Compliance, and Reliability by Design

When you move from a demo to an enterprise LLM deployment, security and reliability stop being “non-functional” and become primary product requirements. An LLM system is not just a model call; it is a pipeline that ingests user input, queries tools and data sources, and returns content that users may act on. Each hop introduces risk: sensitive data exposure, cross-tenant leakage, malicious content influencing behavior, compliance gaps, and brittle performance under load.

This chapter teaches you to design the “guardrails” as first-class architecture components. You will threat-model the end-to-end system to identify highest-risk paths, then design access controls, secrets handling, and data minimization. You will implement defenses for injection and untrusted content, and you will plan incident response, auditability, and compliance evidence. Finally, you will define SLOs and monitoring practices so the system behaves predictably in production, even when the model or dependencies fail.

The practical mindset shift is simple: treat the LLM as an untrusted component that can be manipulated by untrusted inputs, and treat your data and tools as high-value assets that must be protected by explicit boundaries. The best reference builds bake these controls into the architecture from day one, rather than retrofitting them after a security review blocks launch.

Practice note for this chapter's milestones: for each one (threat-modeling the system and identifying highest-risk paths; designing access controls, secrets handling, and data minimization; implementing defenses for injection and untrusted content; planning incident response, auditability, and compliance evidence; defining SLOs, monitoring, and reliability practices), document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Sections in this chapter
Section 5.1: Threat modeling for LLM apps: assets, actors, attack paths

Threat modeling is how you translate abstract security concerns into concrete engineering work. Start with a data-flow diagram of your LLM application: user → API gateway → orchestrator → retrieval (vector DB/search) → tools (ticketing, CRM, code repo) → model provider → response post-processing. Annotate where data is stored, where it crosses trust boundaries, and where it can be modified.

Then identify assets. Typical high-value assets include: proprietary documents in RAG indexes, credentials used for tool calls, tenant-specific metadata, conversation transcripts, and any downstream system the LLM can mutate (e.g., “create invoice,” “approve refund”). Next, list actors: legitimate users, malicious external users, compromised internal accounts, and supply-chain actors (model provider, plugins, web content you fetch).

Finally, enumerate attack paths. In LLM systems, the highest-risk paths often involve: prompt injection causing the model to reveal secrets or take unintended actions; cross-tenant retrieval due to filtering mistakes; data exfiltration through logs or analytics; and tool misuse where the LLM triggers destructive operations. A practical technique is to rank each path by impact and likelihood and choose “top 3” mitigations to implement first (e.g., strict tool allowlists, retrieval filters enforced server-side, and secrets isolation).

Common mistake: threat-modeling only the model call. Your biggest risks usually occur in orchestration and tool execution, where the system has real permissions. Your outcome for this section should be a short threat model document (one page is enough) with prioritized threats and mapped controls, usable as an implementation checklist and as evidence for security review.

Section 5.2: Privacy patterns: PII handling, retention, and redaction

Privacy-by-design begins with data minimization: do not send sensitive data to the model unless it is necessary for the user task and justified by requirements. For each field in the prompt or retrieved context, ask: “Is this needed to produce the correct answer or action?” If not, remove it. When it is needed, prefer transformations: partial masking (last four digits), tokenization, or replacing identifiers with stable pseudonyms that your server can reverse later.
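Partial masking and stable pseudonyms take only a few lines. A sketch; the key handling shown is illustrative, and a real deployment would load keys from a secrets manager:

```python
import hashlib
import hmac

SERVER_KEY = b"example-key"  # illustrative only; never hard-code in production

def mask_card(number: str) -> str:
    """Partial masking: keep only the last four digits."""
    digits = "".join(ch for ch in number if ch.isdigit())
    return "*" * (len(digits) - 4) + digits[-4:]

def pseudonymize(identifier: str) -> str:
    """Stable pseudonym via keyed hash; the server keeps a reverse map
    if re-identification by authorized staff is required."""
    digest = hmac.new(SERVER_KEY, identifier.encode(), hashlib.sha256).hexdigest()
    return "user_" + digest[:12]
```

Because the pseudonym is deterministic for a given key, the same user maps to the same token across prompts without the raw identifier ever reaching the model.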

Implement a clear PII handling pipeline: (1) classify inputs (PII, secrets, regulated data), (2) redact or tokenize before persistence, (3) constrain what is sent to the LLM provider, and (4) restrict what is returned to the user. Redaction should be deterministic and testable; treat it like an API contract. Keep separate policies for “display” vs “storage” redaction: you may need to show PII to an authorized agent, but you should not store it in long-lived logs.

Retention is where teams get surprised. Decide what you store (prompts, retrieved snippets, tool outputs, model responses), for how long, and for what purpose (debugging, evaluation, audit). Implement tiered retention: short-lived full-fidelity traces in a restricted environment, plus longer-lived minimal metadata (request IDs, latency, token counts, outcome codes). Provide deletion mechanisms aligned to your compliance obligations and product promises.

Common mistakes include training or fine-tuning on production conversations without explicit consent and governance, and leaking PII into vector databases because chunking and embedding happened before redaction. Practical outcome: a data inventory and privacy spec that lists data elements, lawful basis/consent assumptions, storage locations, retention windows, and redaction tests that run in CI.

Section 5.3: Prompt injection defenses and content trust boundaries

Prompt injection is not a “prompting issue”; it is a trust-boundary issue. Any untrusted content—user input, web pages, emails, tickets, retrieved documents—can contain instructions that conflict with your system intent. Your job is to ensure that only the system’s policy and the user’s authorized request can influence actions, and that untrusted content is treated as data, not instructions.

Start by defining content trust boundaries in your orchestrator. Separate fields like: system policy, developer instructions, user request, and retrieved content. Use structured prompting (JSON/XML-like sections) and explicitly label retrieved content as “reference material” that must not override policy. Then enforce behavior outside the model: never allow the model to directly execute arbitrary tool calls. Instead, require the model to emit a tool request that your server validates against an allowlist, schema, and policy checks (resource ownership, tenant, risk level).
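Server-side validation of a model-emitted tool request might look like the sketch below; the tool names, argument schemas, and request shape are assumptions for illustration:

```python
# Allowlist of tools the orchestrator will execute, with an argument
# schema for each. The model can only *request*; the server decides.
TOOL_ALLOWLIST = {
    "create_draft_ticket": {"project": str, "title": str},
    "search_kb": {"query": str},
}

def validate_tool_request(request: dict, user_tenant: str):
    """Returns (ok, reason). Nothing executes unless ok is True."""
    name = request.get("tool")
    if name not in TOOL_ALLOWLIST:
        return False, "tool not in allowlist"
    schema = TOOL_ALLOWLIST[name]
    args = request.get("args", {})
    if set(args) != set(schema):
        return False, "arguments do not match schema"
    if not all(isinstance(args[k], t) for k, t in schema.items()):
        return False, "argument type mismatch"
    # Policy check: the request must stay inside the caller's tenant.
    if request.get("tenant") != user_tenant:
        return False, "tenant mismatch"
    return True, "ok"
```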

Add technical defenses: content sanitization (strip HTML/JS, remove hidden text), retrieval filtering by metadata and tenant, and safety classifiers to detect attempts to exfiltrate secrets (“print your system prompt,” “show credentials,” “ignore previous instructions”). For high-risk workflows, use a two-model or two-pass approach: one pass generates a plan; a second pass (or a deterministic validator) checks that the plan complies with rules before any tool execution.

Common mistake: relying on a single “do not follow instructions in documents” sentence. Injection resilience comes from layered controls: policy separation, server-side validation, least-privilege tool scopes, and safe defaults. Practical outcome: a documented trust model plus a set of injection test cases (your “red team” prompts) that must pass before release.

Section 5.4: AuthZ/AuthN and tenant isolation in AI-assisted workflows

Enterprise LLM apps usually fail security review not because the model is “unsafe,” but because authorization is ambiguous. You must make it provable that the LLM cannot access data or perform actions beyond the authenticated user’s permissions. This requires standard AuthN/AuthZ discipline plus AI-specific enforcement points.

Authenticate users with your existing identity provider (OIDC/SAML) and propagate identity through the entire request chain. Then implement authorization server-side at every data access: retrieval queries, document fetches, and tool calls. Do not depend on the model to “choose the right tenant” or “only use allowed docs.” Enforce tenant isolation with hard filters: each vector DB record must carry a tenant ID and, ideally, ACL metadata; retrieval must apply tenant and ACL constraints before similarity search results are returned. If your vector DB can’t enforce this reliably, wrap it with a service that can.
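When the vector DB cannot enforce filters reliably, the wrapping service is a simple backstop. A sketch, assuming each record carries tenant and ACL metadata; the field names are illustrative, and pushing these constraints into the DB query as pre-filters is preferable where supported:

```python
def enforce_retrieval_acls(results: list, user: dict) -> list:
    """Server-side tenant/ACL filter over retrieval results. Never depend
    on the model to 'choose the right tenant'; filter before the model
    ever sees the context."""
    return [
        r for r in results
        if r["tenant_id"] == user["tenant_id"]
        and (not r.get("acl") or user["role"] in r["acl"])
    ]
```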

Secrets handling is part of authorization. Use short-lived credentials for tool calls (scoped OAuth tokens, per-user delegated access) rather than shared API keys. Store secrets in a dedicated secrets manager, rotate them, and never place them in prompts, logs, or client-visible traces. A useful pattern is “capability tokens”: the orchestrator mints a one-time token representing a narrow action (e.g., create a draft ticket in project X) and the tool service validates it.
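The capability-token pattern can be sketched with a keyed HMAC. This is illustrative only: a production version would add a server-side nonce store to enforce one-time use and proper key rotation:

```python
import base64
import hashlib
import hmac
import json
import time

SIGNING_KEY = b"orchestrator-key"  # illustrative; rotate via a secrets manager

def mint_capability(action: str, resource: str, ttl_s: int = 60) -> str:
    """Signed, short-lived claim for one narrow action."""
    claim = json.dumps({"action": action, "resource": resource,
                        "exp": time.time() + ttl_s})
    sig = hmac.new(SIGNING_KEY, claim.encode(), hashlib.sha256).hexdigest()
    return base64.urlsafe_b64encode(claim.encode()).decode() + "." + sig

def verify_capability(token: str, action: str, resource: str) -> bool:
    """The tool service validates scope, signature, and expiry."""
    try:
        claim_b64, sig = token.rsplit(".", 1)
        claim = base64.urlsafe_b64decode(claim_b64.encode()).decode()
    except Exception:
        return False
    expected = hmac.new(SIGNING_KEY, claim.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, sig):
        return False
    data = json.loads(claim)
    return (data["action"] == action and data["resource"] == resource
            and data["exp"] > time.time())
```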

Common mistakes include reusing a single service account for all tool operations, and building retrieval indexes that mix tenants. Practical outcome: an authorization matrix that maps user roles to tool scopes and data domains, plus integration tests proving cross-tenant requests return empty results and tool execution is denied without explicit scope.

Section 5.5: Compliance readiness: logs, audits, and vendor risk

Compliance is easiest when you design for evidence. Assume you will need to answer: who accessed what, when, under which policy, and what the system did as a result. Build auditability into the architecture: assign a request ID and trace ID to every interaction, record the user identity, tenant, data sources accessed, tools invoked, and key policy decisions (e.g., “redaction applied,” “tool call blocked by policy”).
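An audit entry might be structured like the sketch below; the field names are assumptions, and the point is that every access and policy decision is reconstructible from IDs:

```python
import json
import time
import uuid

def audit_record(user_id: str, tenant: str, sources: list,
                 tools: list, decisions: list) -> dict:
    """One audit entry per interaction."""
    return {
        "request_id": str(uuid.uuid4()),
        "trace_id": str(uuid.uuid4()),
        "ts": time.time(),
        "user_id": user_id,
        "tenant": tenant,
        "data_sources_accessed": sources,
        "tools_invoked": tools,
        # e.g. ["redaction_applied", "tool_call_blocked_by_policy"]
        "policy_decisions": decisions,
    }

def append_audit(log_file, record: dict) -> None:
    """Append-only JSON Lines; the store itself should be immutable and ACLed."""
    log_file.write(json.dumps(record) + "\n")
```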

Be careful about what you log. You often need enough detail for incident response and debugging, but storing full prompts and retrieved documents can create compliance and privacy liabilities. Use tiered logging: default to minimal logs, and allow privileged, time-bounded “debug capture” with approvals. Ensure logs are immutable (append-only), access-controlled, and retained per policy.

Vendor risk is also part of compliance readiness. Document your model provider and third-party services: data usage terms (training opt-out), data residency, encryption, sub-processors, and breach notification commitments. If you use hosted embeddings or rerankers, they are data processors too. Create a vendor checklist and keep artifacts (DPAs, SOC 2 reports, pen test summaries) linked to the system design.

Common mistake: treating evaluation datasets and incident transcripts as “not production,” then storing sensitive content without controls. Practical outcome: an audit log schema, a compliance evidence folder in your repo (policies, diagrams, vendor docs), and a runbook showing how to reconstruct an event from trace IDs during an investigation.

Section 5.6: Reliability: rate limits, graceful degradation, and fallbacks

Reliability is security’s cousin: a system that fails unpredictably becomes a safety risk and a business risk. Define explicit SLOs for your LLM app, such as availability, p95 latency, and “successful task completion rate.” Tie these to user journeys (answering a question, drafting an email, submitting a tool action) rather than only infrastructure metrics.

Design for rate limits and quota volatility. Implement client- and server-side throttling, backoff, and queueing for bursty workloads. Prefer asynchronous patterns for heavy tasks (document ingestion, large RAG queries, long tool chains) and provide user feedback (job status) rather than blocking until timeouts. Cache where safe: embeddings, retrieval results, and even model outputs for deterministic prompts, but ensure caches are tenant-scoped and respect authorization.
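One way to keep caches tenant-scoped, sketched here as an assumption rather than a prescribed design, is to fold the tenant into the cache key itself so a lookup can never cross tenants:

```python
import hashlib

def cache_key(tenant_id: str, model: str, prompt: str) -> str:
    # Including the tenant in the key means one tenant's cached output
    # can never be served to another, even if prompts are identical.
    payload = "\x1f".join([tenant_id, model, prompt])  # unit separator avoids ambiguity
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Authorization checks still apply on read; the key scoping is defense in depth, not a substitute for access control.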

Graceful degradation is your main production lever. Decide what happens when the model is down, the vector DB is slow, or a tool API fails. Options include: return a best-effort answer without tool actions; fall back from RAG to a “no-citations” mode with a clear disclaimer; switch to a smaller/cheaper model; or route to a human review queue for high-risk actions. Build circuit breakers so failing dependencies do not cascade.
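A degradation chain like the one described can be sketched as an ordered list of strategies tried in preference order (the strategy names below are hypothetical):

```python
from typing import Callable, List

def answer_with_fallbacks(question: str,
                          strategies: List[Callable[[str], str]]) -> str:
    # Try strategies in order of preference, e.g. full RAG, then a
    # "no-citations" mode, then a smaller model, then a canned handoff.
    last_error = None
    for strategy in strategies:
        try:
            return strategy(question)
        except Exception as exc:
            last_error = exc  # record and degrade to the next option
    # Final fallback: fail closed with a clear, user-visible message.
    return f"Sorry, I can't answer right now. ({type(last_error).__name__})"
```

In a real system each strategy would also be wrapped in a circuit breaker so a consistently failing dependency is skipped outright instead of being retried on every request.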

Common mistake: treating “model error” as a single retry loop. Retries can amplify outages and cost. Use bounded retries with jitter, and emit structured error codes that your monitoring can alert on. Practical outcome: a reliability plan with SLOs, dashboards (latency, token usage, tool failure rates), release gates informed by evaluation results, and a fallback matrix documenting exactly what users experience under each failure mode.
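A bounded retry with full jitter can be sketched as follows (the injectable `sleep` parameter is a testing convenience, not a required design):

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def call_with_retry(fn: Callable[[], T],
                    max_attempts: int = 3,
                    base_delay: float = 0.5,
                    sleep: Callable[[float], None] = time.sleep) -> T:
    # Bounded attempts prevent retry storms from amplifying an outage;
    # full jitter spreads load so synchronized clients do not hammer a
    # recovering dependency in lockstep.
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # surface a structured error to monitoring
            delay = random.uniform(0, base_delay * (2 ** (attempt - 1)))
            sleep(delay)
```

Pair this with per-dependency budgets (total time, total attempts) so retries at multiple layers of the stack do not multiply.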

Chapter milestones
  • Threat-model the system and identify highest-risk paths
  • Design access controls, secrets handling, and data minimization
  • Implement defenses for injection and untrusted content
  • Plan incident response, auditability, and compliance evidence
  • Define SLOs, monitoring, and reliability practices for production
Chapter quiz

1. Why does Chapter 5 argue that security and reliability become primary product requirements when moving from a demo to an enterprise LLM deployment?

Show answer
Correct answer: Because an LLM system is an end-to-end pipeline with multiple hops where each hop introduces risk and failure modes
The chapter emphasizes that the system includes inputs, tool/data access, and outputs—each step can introduce exposure, leakage, manipulation, compliance gaps, and brittleness under load.

2. What is the main goal of threat-modeling the end-to-end LLM system in this chapter?

Show answer
Correct answer: To identify the highest-risk paths so guardrails can be designed into the architecture
Threat-modeling is used to find where risk concentrates across the pipeline and to prioritize controls early.

3. Which design choice best aligns with the chapter’s recommended mindset shift about trust boundaries?

Show answer
Correct answer: Treat the LLM as untrusted and protect data/tools with explicit boundaries and controls
The chapter states to treat the LLM as manipulable by untrusted inputs and treat data/tools as high-value assets requiring explicit protection.

4. According to the chapter, which set of controls most directly supports privacy protection as part of the system guardrails?

Show answer
Correct answer: Access controls, secrets handling, and data minimization
The chapter highlights these controls to reduce exposure of sensitive information and limit what is collected and stored.

5. What is the purpose of defining SLOs and monitoring practices for an enterprise LLM system?

Show answer
Correct answer: To ensure predictable production behavior and detect/handle failures in the model or dependencies
SLOs and monitoring are framed as reliability practices to keep behavior predictable even when components fail.

Chapter 6: From Blueprint to Reference Implementation (and Your Portfolio)

A requirements doc and architecture diagram are necessary, but they are not yet “buildable.” Teams ship when they can clone a repo, run one command, see a working happy-path, and understand where to plug in real services, data, and policies. This chapter turns your blueprint into a minimal reference implementation: a small, opinionated build that demonstrates the end-to-end flow (data → retrieval/tools → orchestration → API → evaluation → deployment), while remaining simple enough for new engineers to adopt quickly.

As a PM transitioning into an LLM Solutions Architect role, this is also where your work becomes portfolio-grade. A reference build forces clarity: interfaces, typed outputs, environment configuration, and release gates. It also forces operational thinking: logging, dashboards, runbooks, and on-call readiness. You will practice engineering judgment—what to implement now vs what to stub—and you will document tradeoffs so future teams can evolve the system without repeating analysis.

Keep a consistent north star: the reference implementation is not the product. It is a learning and acceleration asset that proves feasibility, de-risks integration, and sets standards for security, privacy, and compliance in an enterprise LLM deployment.

Practice note: for each chapter milestone—defining the minimal reference implementation scope and interfaces, designing a repo layout and scaffolding for rapid team adoption, creating runbooks for deployment, monitoring, and on-call readiness, estimating delivery (milestones, staffing, and cost of ownership), and packaging your work into a portfolio-ready case study—document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Sections in this chapter
Section 6.1: Reference implementation goals: learnability vs completeness

Define “minimal” by asking: what must be true for another team to extend this safely? A good reference implementation teaches by example. It includes the narrowest slice that exercises the core patterns you chose earlier—prompting, tools, RAG, or fine-tuning—plus the constraints that matter (latency budgets, data residency, PII handling, audit logging). The trap is trying to be complete: implementing every feature, integrating every system, and polishing UX. That turns the reference build into a parallel product and usually delays learning.

Balance learnability vs completeness with three scope rules. First, implement one primary user journey end-to-end (the “golden path”), including failure handling for the top two error modes (e.g., retrieval empty, tool timeout). Second, implement only one variant per architectural decision: one vector store, one model provider, one orchestration approach, one evaluation harness. Third, document extension points rather than building them: clearly mark TODOs, interfaces, and configuration hooks.

  • Include: working API, minimal UI or CLI, retrieval/tool integration, structured outputs, evaluation script, basic deployment config.
  • Stub: secondary workflows, multi-tenant billing, advanced RBAC, full data pipelines (use a small sample dataset).
  • Prove: acceptance criteria that map to business goals (accuracy, refusal behavior, latency, cost ceilings).

Common mistake: calling something a reference build when it cannot be run locally or in a sandbox environment. If setup takes more than 30–45 minutes for a new engineer, adoption drops. Your “definition of done” for this section is a scoped build that is runnable, understandable, and explicitly non-exhaustive.

Section 6.2: Contracts first: APIs, schemas, and typed outputs

Reference implementations succeed when they lock down contracts early. LLMs are probabilistic; contracts create determinism at system boundaries. Start with the external API: define endpoints, request/response schemas, authentication mechanism, and error model. Then define internal contracts: the retrieval interface, tool calling interface, and the “LLM output” interface (ideally typed). This lets teams swap models or retrieval backends without rewriting the whole application.

Practically, use a schema language and code generation where possible. For HTTP APIs, OpenAPI is common; for eventing, use JSON Schema or Protobuf. For LLM responses, require structured outputs: JSON with strict schema, or function/tool calling with typed arguments. Enforce this at runtime with validation (reject or repair invalid outputs) and at test time with fixtures.

  • API contract: POST /v1/answer with question, context_mode, user_id; response includes answer, citations[], confidence, trace_id.
  • Retrieval contract: retrieve(query, filters) -> [DocumentChunk] where each chunk has source, uri, chunk_id, text, score.
  • Tool contract: each tool has name, description, input schema, and explicit timeouts/retry policy.

Engineering judgment shows up in error handling. Decide which errors are user-visible (e.g., “I couldn’t find supporting sources”) vs operational (e.g., vector store unavailable). Define consistent error codes and include trace_id for observability. Common mistake: leaving contracts “implicit” in prompt text or scattered in notebooks. Your goal is that contracts are discoverable, versioned, and testable—making the system maintainable under change.
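As a hedged sketch of runtime enforcement for the response contract above (field names follow the illustrative API bullet, not a fixed standard), a validator can reject or repair model output at the boundary:

```python
def validate_answer(payload: dict) -> dict:
    # Enforce the /v1/answer response shape at runtime: reject what cannot
    # be trusted, repair what is unambiguously fixable.
    errors = []
    if not isinstance(payload.get("answer"), str):
        errors.append("answer must be a string")
    citations = payload.get("citations")
    if isinstance(citations, str):
        # Repair a common model slip: a single citation not wrapped in a list.
        payload["citations"] = [citations]
    elif not isinstance(citations, list):
        errors.append("citations must be a list")
    conf = payload.get("confidence")
    if not (isinstance(conf, (int, float)) and 0.0 <= conf <= 1.0):
        errors.append("confidence must be a number in [0, 1]")
    if errors:
        raise ValueError("; ".join(errors))
    return payload
```

In production you would likely generate this from a schema (JSON Schema, Pydantic, or similar) rather than hand-write it; the hand-written version is shown only to make the reject-or-repair decision explicit.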

Section 6.3: Implementation plan: components, stubs, and test harnesses

Turn the architecture into a build plan that a team can execute. Start by listing components and their ownership boundaries: API service, orchestration layer, retrieval adapter, tool adapters, prompt/template package, evaluation package, and infrastructure definitions. For each, define: inputs/outputs, dependencies, and what can be stubbed in the reference build.

Design a repo layout that encourages rapid adoption. A practical pattern is a mono-repo with clear top-level folders and a single “happy path” entry point. Keep notebooks out of the critical path; if you include them, treat them as demos, not production scaffolding.

  • /apps/api: web server, auth middleware, request validation, routing.
  • /packages/orchestrator: prompt assembly, tool routing, retry/backoff, guardrails.
  • /packages/retrieval: vector store client, chunking utilities, query rewriting (optional).
  • /packages/eval: gold set loader, metrics, report generator, release gates.
  • /infra: IaC templates, environment manifests, secrets placeholders.
  • /docs: setup, architecture, runbooks, decision records (ADRs).

Write stubs intentionally. For example, if enterprise SSO is out of scope, implement an auth interface and a “dev token” provider with the same claims shape you’ll expect in production. If the final system needs a data catalog, stub it with a local YAML registry but keep the same lookup API. The payoff is that your reference build remains runnable while still mapping to real enterprise constraints.
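The auth stub idea can be sketched like this, assuming a hypothetical claims shape (`sub`, `tenant`, `scopes`) that production SSO tokens would also carry:

```python
from typing import Protocol

class TokenProvider(Protocol):
    # The interface production code depends on; SSO and dev providers
    # are interchangeable behind it.
    def claims(self, token: str) -> dict: ...

class DevTokenProvider:
    # Development stand-in for enterprise SSO. It returns claims with the
    # same shape production tokens will carry, so downstream authorization
    # code does not change when the real identity provider is wired in.
    def claims(self, token: str) -> dict:
        if not token.startswith("dev-"):
            raise PermissionError("unknown token")
        return {"sub": token.removeprefix("dev-"),
                "tenant": "sandbox",
                "scopes": ["answer:read"]}
```

Because the interface is explicit, swapping in the real provider is a dependency-injection change, not a refactor.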

Invest early in a test harness. Include: unit tests for schema validation and prompt rendering; integration tests that run against a mocked LLM and a small local vector index; and a “contract test” for tool schemas. Common mistake: relying only on manual prompting. Your reference implementation plan should include a repeatable test command that produces an artifact (logs + metrics report) suitable for code review and CI.
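A mocked-LLM integration test can be sketched as below; `FakeLLM` and the `answer` helper are hypothetical names for illustration:

```python
import json

class FakeLLM:
    # Deterministic stand-in for the model provider: the test exercises
    # orchestration, parsing, and contract checks without network or cost.
    def complete(self, prompt: str) -> str:
        return '{"answer": "42", "citations": ["doc-1"]}'

def answer(llm, question: str) -> dict:
    raw = llm.complete(f"Answer with JSON only: {question}")
    parsed = json.loads(raw)
    # Contract check: fail loudly in CI if the expected shape drifts.
    missing = {"answer", "citations"} - parsed.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return parsed
```

The same `answer` function runs against the real provider in staging; only the injected client changes, which is exactly what makes the harness repeatable.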

Section 6.4: Deployment patterns: environments, configuration, and CI/CD

A reference implementation becomes truly valuable when it demonstrates how it will be deployed. Document a clear environment strategy: local (developer laptop), dev/sandbox (shared), staging (prod-like with test data), and production. For each environment, specify model endpoints, data sources, logging destinations, and access controls. Keep the same code across environments; change behavior via configuration.

Use a configuration system that supports typed settings and secret separation. Non-secret config (feature flags, model name, top-k, timeouts) can live in environment files or a parameter store. Secrets (API keys, DB credentials) must come from a secrets manager. Show a pattern for rotation and least-privilege IAM. In enterprise contexts, include guidance on data boundaries: which environments may access real documents, and where PII must be masked or blocked.
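One minimal pattern for typed settings with secret separation, sketched under the assumption of environment-variable configuration (the variable names are illustrative):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Settings:
    # Non-secret config: safe to keep in per-environment files or a
    # parameter store, and to print in debug output.
    model_name: str
    top_k: int
    timeout_s: float
    # Secret: injected from a secrets manager; never committed or logged.
    api_key: str

def load_settings(env: dict) -> Settings:
    return Settings(
        model_name=env.get("MODEL_NAME", "small-model"),
        top_k=int(env.get("TOP_K", "5")),
        timeout_s=float(env.get("TIMEOUT_S", "30")),
        api_key=env["LLM_API_KEY"],  # fail fast at startup if the secret is missing
    )
```

Parsing to typed values at startup means misconfiguration fails once at boot, not intermittently per request.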

  • CI checks: lint, unit tests, schema validation, build, integration tests with mocked LLM.
  • CD steps: deploy to dev, run smoke tests, generate evaluation report, manual gate, deploy to staging/prod.
  • Release gates: block deployment if latency/cost thresholds or gold-set metrics regress beyond agreed tolerances.
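A release gate like the one in the last bullet can be sketched as a pure function over an evaluation report; the report and threshold shapes here are assumptions for illustration:

```python
def release_gate(report: dict, thresholds: dict) -> list:
    # Returns a list of violations; an empty list means the deploy may proceed.
    # thresholds maps metric name -> {"direction": "min"|"max", "value": limit}.
    violations = []
    for metric, limit in thresholds.items():
        value = report.get(metric)
        if value is None:
            violations.append(f"{metric}: missing from report")
        elif limit["direction"] == "min" and value < limit["value"]:
            violations.append(f"{metric}: {value} below minimum {limit['value']}")
        elif limit["direction"] == "max" and value > limit["value"]:
            violations.append(f"{metric}: {value} above maximum {limit['value']}")
    return violations
```

Treating a missing metric as a violation is deliberate: a report that silently drops a metric should block the deploy just as a regression would.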

Common mistake: treating LLM configuration as “just prompt tweaks” and bypassing CI/CD. In reality, prompts and retrieval settings are production code. Version prompts, pin model versions where possible, and store prompt templates in the repo with change review. Demonstrate safe rollout patterns such as canary deployments, feature flags for model/provider switching, and fallback behavior when the LLM or retrieval service degrades.

Finish this section with a runnable deployment example (even if simplified): a container build, a minimal service manifest, and a clear command sequence. The goal is confidence: a new team can follow your documentation and reproduce the deployment without tribal knowledge.

Section 6.5: Operational readiness: dashboards, alerts, and runbooks

Operational readiness is where many LLM projects fail in enterprise settings. Your reference implementation should make operations visible. Start with structured logging and tracing: every request gets a trace_id; logs include model name, token counts, retrieval hits, tool calls, and safety outcomes (e.g., “refused,” “redacted”). Redact sensitive payloads by default and provide a controlled debug path with audited access.
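Redact-by-default structured logging can be sketched as follows (the sensitive field list is an illustrative assumption; real systems would drive it from policy):

```python
import json

SENSITIVE_FIELDS = {"prompt", "retrieved_text", "user_message"}

def log_record(trace_id: str, **fields) -> str:
    # Payload-bearing fields are redacted by default; a separate, audited
    # debug-capture path would opt in to full content with approvals.
    record = {"trace_id": trace_id}
    for key, value in fields.items():
        record[key] = "[REDACTED]" if key in SENSITIVE_FIELDS else value
    return json.dumps(record)

print(log_record("t-1", model="small-model", prompt="secret question", tokens=128))
```

Keeping the redaction decision in one function makes it testable and auditable, instead of being re-implemented at every call site.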

Define dashboards that answer operational questions. For availability: error rate by component (API, retrieval, LLM provider, tools). For quality: citation rate, refusal rate, top intents, and evaluation scores from scheduled runs. For performance/cost: latency percentiles, tokens per request, cost per successful answer, cache hit rate (if used). For data: document freshness, ingestion lag (even if ingestion is stubbed, show where it would be measured).

  • Alerts: sustained 5xx rate, retrieval returning empty above baseline, tool timeout spikes, cost anomaly, safety filter failures.
  • On-call runbooks: “LLM provider degraded,” “vector store down,” “bad deploy regression,” “quality drop after data update.”
  • Playbooks: rollback steps, feature-flag disablement, switch to fallback model, disable a tool, reduce top-k.

Write runbooks like you expect a tired on-call engineer to use them. Each should include: symptoms, immediate mitigations, deeper diagnosis queries, and escalation paths. Include “known good” baselines: typical latency, typical token counts, typical retrieval hit rate. Common mistake: only monitoring uptime. LLM systems degrade silently—quality drift, retrieval mismatches, prompt regressions—so bake evaluation runs into operations and treat them as first-class health signals.

Section 6.6: Portfolio artifacts: case study narrative and interview talk track

Your reference implementation is also your career asset—if you package it correctly. Convert the work into a portfolio-ready case study that demonstrates the full arc: business goal → requirements → architecture → evaluation → reference build → operational plan. Hiring panels want to see judgment, not just code. Your narrative should highlight constraints (security, privacy, compliance), tradeoffs (RAG vs fine-tuning; tool calling vs pure prompting), and measurable outcomes (gold-set improvements, latency/cost budgets met).

Produce a small set of artifacts that are easy to review. First, a short README with a one-command local run, plus an architecture diagram and a “how it works” walkthrough. Second, an ADR folder that records key decisions (model/provider choice, retrieval strategy, structured output approach, release gates). Third, an evaluation report template showing metrics, failure taxonomy, and go/no-go criteria. Fourth, runbooks and dashboard screenshots (or sample queries) demonstrating on-call readiness.

  • Case study structure: Problem → Users & workflows → Constraints → Approach → Reference build → Evaluation & results → Ops plan → Next steps.
  • Interview talk track: 2-minute overview, 10-minute deep dive on one decision, and a “what I’d do next” roadmap.
  • What to redact: proprietary data, customer identifiers, internal endpoints—replace with synthetic datasets and generic interfaces.

Common mistake: presenting a notebook of prompts without interfaces, tests, or deployment thinking. Your differentiator is end-to-end credibility: contracts, scaffolding, runbooks, and realistic delivery estimates (milestones, staffing, and cost of ownership). When you can explain how the system is built, evaluated, deployed, and operated—plus what you intentionally did not build—you demonstrate the exact mindset of an LLM Solutions Architect.

Chapter milestones
  • Define the minimal reference implementation scope and interfaces
  • Design a repo layout and scaffolding for rapid team adoption
  • Create runbooks for deployment, monitoring, and on-call readiness
  • Estimate delivery: milestones, staffing, and cost of ownership
  • Package your work into a portfolio-ready case study
Chapter quiz

1. What best describes the purpose of a minimal reference implementation compared to a requirements doc and architecture diagram?

Show answer
Correct answer: A small, opinionated end-to-end build that teams can clone and run to see a working happy-path and integration points
The chapter emphasizes “buildable” assets: clone a repo, run one command, observe a happy-path, and know where to plug in real services.

2. Which end-to-end flow should the reference implementation demonstrate according to the chapter?

Show answer
Correct answer: Data → retrieval/tools → orchestration → API → evaluation → deployment
The reference build should show the full LLM solution pipeline from data through deployment, including evaluation.

3. Why does creating a reference build help make your work “portfolio-grade” as a PM moving into an LLM Solutions Architect role?

Show answer
Correct answer: It forces clarity on interfaces, typed outputs, environment configuration, and release gates, turning the blueprint into something demonstrably buildable
The chapter links portfolio value to forcing implementation-level clarity and demonstrable standards beyond diagrams.

4. What operational thinking is explicitly called out as part of turning a blueprint into a reference implementation?

Show answer
Correct answer: Logging, dashboards, runbooks, and on-call readiness
The chapter highlights operations as essential: monitoring artifacts and readiness practices, not just core functionality.

5. Which statement best captures the chapter’s “north star” for what a reference implementation is (and is not)?

Show answer
Correct answer: It is a learning and acceleration asset that proves feasibility, de-risks integration, and sets standards; it is not the product
The chapter stresses that the reference build accelerates teams and sets enterprise standards, but should not be treated as the product itself.