AI Certifications & Exam Prep — Intermediate
Ship a production-grade RAG app with tracing, evals, and spend control.
This capstone course is structured like a short technical book: each chapter adds a production layer—ingestion, retrieval, observability, evaluation, and cost governance—until you have a complete Retrieval-Augmented Generation (RAG) application you can defend in a certification review. You will produce tangible artifacts: a running API, a versioned index, traceable request flows, evaluation reports, and a budget-enforced deployment plan.
The focus is not just “make it work,” but “make it operable.” You’ll learn how to detect when retrieval fails, how to distinguish hallucinations from missing context, how to set service-level objectives (SLOs), and how to control spend with enforceable budgets. By the end, your project looks like something a real team could monitor, iterate on, and ship.
You will implement a production-style RAG application that answers questions using your chosen corpus (documentation, knowledge base articles, policies, or internal notes). The system will include a structured ingestion pipeline, a vector index with metadata and versioning, a retrieval + generation chain that returns grounded answers with citations, and a web-ready API that supports streaming responses.
Chapter 1 locks in scope and success criteria so you don’t build a demo that can’t be graded. Chapters 2 and 3 deliver the functional core: ingestion, indexing, retrieval, and a clean API surface. Chapter 4 adds tracing and debugging workflows so every answer is explainable and every failure is actionable. Chapter 5 formalizes quality with an evaluation harness and regression tests—critical for certification scoring and real-world maintenance. Chapter 6 hardens the system with budget enforcement, security basics, and deployment packaging, then guides you through a polished final submission.
This course is designed for learners preparing for AI/LLM certifications, technical interviews, or portfolio reviews where reviewers expect evidence: architecture decisions, measurable quality, and operational readiness. If you already know basic Python and APIs but haven’t shipped an observable, testable LLM app, this capstone fills that gap.
To begin building your capstone and track your progress on Edu AI, register for free. Want to compare learning paths first? You can also browse all courses and return to this capstone when you’re ready to ship.
Senior Machine Learning Engineer, LLM Systems & Observability
Sofia Chen builds retrieval-augmented generation systems for customer support and internal knowledge search. She specializes in LLM observability, evaluation harnesses, and cost governance for production AI. She has mentored teams through capstone-style deliveries and certification readiness sprints.
This capstone is not about building a “cool demo.” It is about producing a Retrieval-Augmented Generation (RAG) system you can defend under certification-style scoring: clear scope, repeatable builds, measurable quality, and explicit cost controls. In production, vague requirements turn into brittle systems and expensive surprises. This chapter helps you convert a problem idea into a deliverable plan with architecture, acceptance tests, and service-level objectives (SLOs) that can be traced, evaluated, and budgeted.
We will move through the milestones in sequence: define your problem statement and user journeys (Milestone 1), choose data sources and acceptance tests (Milestone 2), draft target architecture and deployment approach (Milestone 3), set quality/latency/cost SLOs (Milestone 4), and create a delivery plan and repo structure (Milestone 5). By the end, you should have a capstone that is “engineering-complete” on paper before you write significant code.
Engineering judgment matters most at the boundaries: where data enters, where retrieval can fail, where the model can hallucinate, and where cost can spike. Your success criteria should explicitly address those boundaries. Expect to iterate, but do not allow scope creep: every new feature must map to rubric points, acceptance tests, and a measurable outcome.
Practice note for Milestone 1 (define the capstone problem statement and user journeys): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone 2 (choose data sources, constraints, and acceptance tests): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone 3 (draft the target architecture and deployment approach): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone 4 (set quality, latency, and cost SLOs for certification scoring): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone 5 (create the delivery plan and repo structure): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Your first job is to translate the course outcomes into a scoring rubric you can execute. Treat each outcome as a contract: you either demonstrate it with evidence (code + artifacts), or you don’t. This is Milestone 1 framed as assessment engineering: define the problem statement, then define what “done” means for each capability.
Start with a one-page problem statement: target users, top 3 user journeys, and “non-goals.” Example journeys: (1) ask a policy question and receive an answer with citations; (2) request a summary of a document section; (3) ask a question outside the corpus and get a safe fallback that explains limitations. Each journey should have an acceptance test and a trace you can show.
Common mistake: writing acceptance tests that are subjective (“answer seems good”). Replace them with observable assertions: presence of citations, maximum latency, retrieval hit rate, and cost per request. Your rubric mapping becomes your delivery checklist and your project’s definition of success.
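To make “observable assertions” concrete, here is a minimal sketch of one acceptance test. The `ask` function and its response fields are stand-ins for your own API, and the thresholds are illustrative, not prescribed.

```python
import time

def ask(question):
    # Stand-in for a call to your RAG endpoint; replace with a real HTTP client.
    return {
        "answer": "Refunds are processed within 14 days [1].",
        "citations": [{"doc_id": "policy-42", "page": 3}],
        "cost_usd": 0.004,
    }

def test_policy_question_meets_rubric():
    start = time.monotonic()
    resp = ask("What is the refund window?")
    latency = time.monotonic() - start

    # Observable assertions instead of "answer seems good":
    assert resp["citations"], "every grounded answer must carry a citation"
    assert latency < 6.0, "full response must meet the latency SLO"
    assert resp["cost_usd"] < 0.01, "per-request spend must stay within budget"
```

Each user journey from your problem statement gets one such test; the same assertions then double as your delivery checklist.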
Selecting a RAG pattern is an architectural decision with cost and reliability consequences. Milestone 3 is not “choose the fanciest approach”; it is “choose the simplest pattern that meets the user journeys and SLOs.” You should document your choice and the conditions under which you would upgrade.
Single-pass RAG is the baseline: one query → retrieve top-k → generate with citations. It is easiest to trace and cheapest to run. It fails when the user query is ambiguous, vocabulary mismatched, or when the corpus requires multi-step reasoning across sources.
Multi-query RAG expands the user query into variants (synonyms, sub-questions) and merges results. This often improves recall but increases retrieval cost and latency. Use it when user questions are short, domain-specific, or when you observe low retrieval hit rates in traces.
Re-ranking adds a second stage that scores candidate chunks (cross-encoder or lightweight LLM). This improves precision and citation quality, especially when embeddings return thematically similar but non-answering chunks. The trade-off is extra compute; re-rank only when you can afford it under your cost budget.
Agentic RAG introduces iterative planning (decide to search, refine query, read more) and tool calls. It can handle complex workflows but is the hardest to make predictable for certification scoring: more tokens, more failure modes, and more evaluation complexity. If you use it, constrain it: max tool calls, strict timeouts, and deterministic fallbacks.
Practical recommendation: implement single-pass first with excellent observability, then add one enhancement (multi-query or re-rank) only if your evaluation harness proves it improves relevance without violating latency/cost SLOs. A mature capstone shows restraint and evidence-driven iteration.
Milestone 2 (choose data sources, constraints, and acceptance tests) is where many RAG projects quietly fail. The fastest way to derail a capstone is to pick a dataset you cannot legally store, cannot share, or cannot evaluate consistently. Treat data governance as part of production readiness, not paperwork.
Start by listing candidate sources (PDF manuals, internal docs, public web pages, ticket exports) and annotate each with: licensing terms, permitted uses (commercial vs educational), redistribution rules, and whether derived embeddings are allowed. If terms are unclear, choose a different dataset. For certification-style work, public datasets with explicit licenses (e.g., CC BY) reduce risk and simplify repo sharing.
Privacy constraints determine architecture. If documents contain personal data or confidential material, you must define redaction or access control. Practical options include: (1) pre-ingestion redaction (strip emails, IDs), (2) per-user authorization filters on retrieval (metadata ACLs), (3) separate indexes per tenant. Your acceptance tests should include “unauthorized user cannot retrieve restricted chunks,” which is more realistic than “we promise not to.”
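Option (2) can be sketched in a few lines: the access filter is evaluated inside retrieval, so restricted chunks are never candidates. The `access_scope` field and filter shape here are assumptions; real vector databases express this as a metadata filter in their query API.

```python
def search(index, query, user, top_k=5):
    """Apply an access-scope filter before ranking, so unauthorized chunks
    are excluded at retrieval time rather than stripped afterwards."""
    # Real backends accept this as a metadata filter,
    # e.g. {"access_scope": {"$in": user["scopes"]}}.
    # (This toy ignores `query`; a real index would rank by similarity.)
    allowed = [c for c in index if c["access_scope"] in user["scopes"]]
    return allowed[:top_k]

index = [
    {"text": "public pricing tiers", "access_scope": "public"},
    {"text": "internal margin targets", "access_scope": "finance"},
]
user = {"scopes": ["public"]}
results = search(index, "pricing", user)
assert all(c["access_scope"] == "public" for c in results)
```

The acceptance test “unauthorized user cannot retrieve restricted chunks” is then a direct assertion on this function’s output.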
Also decide what you log. Tracing can accidentally capture sensitive prompts or retrieved passages. A production-minded strategy logs identifiers and metrics by default, and logs content only in a gated debug mode with retention limits. Document your retention policy (e.g., 7 days for traces, 30 days for aggregates) and show where it is configured.
Common mistake: using scraped web content with unstable URLs. Your evaluation set then drifts as pages change. Prefer versioned snapshots or stable datasets so your regression tests remain meaningful across time.
Before implementing pipelines and chains, establish an environment strategy that keeps the capstone reproducible and safe. This supports Milestone 5 (delivery plan and repo structure) and prevents a common production failure: “it works on my laptop, but not in CI or staging.”
Use a layered configuration approach: defaults in code, environment-specific overrides in config files, and secrets exclusively in a secret manager or environment variables. Separate what changes often (model name, top-k, chunk size, thresholds) from what must never be committed (API keys). A practical pattern is: config/default.yaml, config/dev.yaml, config/prod.yaml, plus .env for local development (ignored by Git).
Define a minimal set of required secrets: LLM provider key, embeddings key (if separate), vector DB credentials, and tracing/metrics backend keys. Add a startup check that fails fast if required secrets are missing. In production, silent fallbacks create partial outages that are difficult to debug.
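A fail-fast startup check can be as small as the following sketch; the secret names are assumptions for illustration, not a required set.

```python
import os

# Illustrative names; list whatever your stack actually requires.
REQUIRED_SECRETS = ["LLM_API_KEY", "VECTOR_DB_URL", "TRACING_API_KEY"]

def check_secrets(env=None):
    """Raise at startup if any required secret is missing, instead of
    letting a silent fallback cause a partial outage mid-request."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_SECRETS if not env.get(name)]
    if missing:
        raise SystemExit(f"missing required secrets: {', '.join(missing)}")
```

Call `check_secrets()` once before the API starts serving traffic, so a misconfigured deployment dies loudly instead of degrading quietly.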
Pin versions. Your ingestion outputs and retrieval quality depend on libraries (PDF parsers, tokenizers) and model revisions. Record model IDs and embedding dimensions in your index metadata. If you change embeddings models, treat it as a breaking change requiring a new index version and a migration plan.
Common mistake: mixing configuration with prompt text. Keep prompts versioned and testable (e.g., prompts/answer_with_citations.md) and include prompt version in traces so you can correlate regressions to changes.
Milestone 4 is where you turn ambition into measurable commitments. You need baseline metrics before you can enforce budgets or evaluate improvements. Establish a “day 0” baseline with the simplest working system (often single-pass RAG), then set SLOs that are strict enough to guide engineering but realistic for your infrastructure.
Quality should be split into at least two dimensions: relevance (did we retrieve the right evidence?) and faithfulness (did the answer stay grounded in citations?). Baseline metrics might include: retrieval hit rate on a small labeled set, mean reciprocal rank (MRR) for retrieval, and a faithfulness score using an automated checker that verifies claims against retrieved chunks. Define a hard rule: if there are no strong retrieval results, the system must abstain or ask a clarifying question rather than fabricate.
Latency should be measured end-to-end and by stage: ingestion is batch (minutes/hours), but query-time must be interactive. Track p50 and p95 for: embedding/query time, vector search time, re-rank time (if any), and generation time. A practical SLO example: p95 under 2.5 seconds for retrieval + first token, and under 6 seconds for full response, with timeouts and fallbacks if exceeded.
Cost should be expressed as a budget per request and per day. Measure prompt tokens, completion tokens, number of retrieval calls, and re-rank tokens if you use an LLM re-ranker. Then enforce: max context tokens, max output tokens, caching (prompt+retrieval cache keyed by normalized query and index version), and rate limits. Add alerts when daily spend or p95 tokens exceed thresholds.
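Two of those controls fit in a few lines: a cache key that includes the index version (so promoting a new index never serves stale cached answers) and hard token caps. The budget numbers below are placeholders to tune against your own SLOs.

```python
import hashlib

MAX_CONTEXT_TOKENS = 4000   # placeholder budgets; tune against your SLOs
MAX_OUTPUT_TOKENS = 512

def cache_key(query, index_version):
    """Key the prompt+retrieval cache by normalized query and index version."""
    normalized = " ".join(query.lower().split())
    return hashlib.sha256(f"{index_version}:{normalized}".encode()).hexdigest()

def enforce_budget(prompt_tokens, output_tokens):
    """Reject requests that would blow the per-request token budget."""
    if prompt_tokens > MAX_CONTEXT_TOKENS:
        raise ValueError(f"context budget exceeded: {prompt_tokens}")
    if output_tokens > MAX_OUTPUT_TOKENS:
        raise ValueError(f"output budget exceeded: {output_tokens}")
```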
Common mistake: optimizing quality without budgeting tokens. Multi-query and agentic loops can double or triple costs. Your SLOs should explicitly cap tool calls, retrieved chunk count, and context size.
A production-ready capstone benefits from a clear scaffolding that mirrors real deployments. This is Milestone 5 in concrete form: the structure should make it obvious where ingestion lives, where the online API lives, where evaluations run, and how changes are tested.
A practical monorepo layout looks like this: /apps/api (query endpoint, auth, rate limiting), /apps/worker (ingestion jobs, re-indexing), /packages/rag (retrieval + prompting library), /packages/observability (tracing wrappers, metric helpers), /eval (datasets, harness, reports), and /infra (Docker, IaC, deployment manifests). Keep shared schemas (metadata, trace payloads) in a package to avoid drift between services.
Define service boundaries early. In many teams, ingestion is a separate deployable because it needs different scaling and permissions than the online API. Even if you run both locally, model them as separate entry points. This makes it easier to enforce least privilege (the API should not need write access to raw documents, for example).
Your CI outline should reflect the rubric: lint/format, unit tests for chunking and metadata, integration tests for retrieval (smoke test against a small index), evaluation run on a fixed regression set, and a cost check that fails if token usage exceeds a configured budget on the test suite. Publish artifacts: evaluation report, latency histogram, and example traces.
Common mistake: leaving evaluation as a manual notebook. For certification readiness, evaluations must be runnable with one command in CI and produce comparable results across commits. Treat the harness as a first-class product feature, not an afterthought.
1. What is the primary goal of this capstone according to Chapter 1?
2. Which set of deliverables best matches what Chapter 1 says you should have before writing significant code?
3. Why does Chapter 1 stress defining acceptance tests and measurable outcomes early?
4. Where does Chapter 1 say engineering judgment matters most when designing the capstone?
5. How should you handle new feature ideas to avoid scope creep, per Chapter 1?
A production RAG system is only as trustworthy as its ingestion pipeline. If your loaders silently skip pages, if your parser drops tables, or if duplicates flood the index, the “retrieval” part of RAG becomes random. This chapter turns ingestion into an engineered subsystem: deterministic, observable, versioned, and repeatable.
We will move from raw sources (files, web pages, internal docs) through normalization and cleaning, then apply chunking strategies that preserve meaning, and finally produce embeddings into a versioned vector index. Along the way, you’ll build the “boring” but critical workflows: incremental updates, backfills, and data-quality sampling reports. These are the Milestones that separate a demo from an on-call-ready service.
Keep one idea front and center: ingestion is a build step, not a side effect. If you can’t rerun it, compare outputs across versions, and explain exactly why a given chunk exists in the index, you will struggle with evaluations, tracing, cost control, and regressions later in the course.
Practice note for Milestone 1 (build document loaders and the normalization pipeline): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone 2 (implement chunking strategies and metadata schemas): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone 3 (generate embeddings and create a versioned index): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone 4 (add incremental updates and backfill workflows): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone 5 (validate data quality with sampling reports): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Milestone 1 starts with document loaders: small, testable connectors that retrieve raw content and emit a normalized “document” object. In production you typically have three classes of sources: file drops (PDF, DOCX, HTML, Markdown), web content (public docs, knowledge bases), and internal systems (wikis, ticketing, shared drives, databases). Treat each connector as untrusted I/O and isolate it behind a consistent interface: fetch → parse → normalize.
Parsing is where most hidden failures occur. PDFs may reorder text; scanned PDFs require OCR; DOCX may contain headers/footers that look like content; HTML can include navigation noise. Your goal is to preserve semantic structure when possible (headings, lists, tables) while producing plain text that downstream chunking can work with. Prefer parsers that can provide layout hints (page numbers, section titles) because they become valuable metadata for citations later.
Engineering judgment: decide early what “ground truth text” means for your organization. If your users care about tables (pricing, limits, compatibility matrices), you need a strategy: convert tables to a stable textual representation (e.g., Markdown tables) or store them as structured JSON and retrieve them separately. Many teams ship an MVP that discards tables and later discover the model hallucinates the missing values.
Common mistakes: silently skipping unreadable files, mixing multiple encodings, and allowing nondeterministic scraping (content changes mid-run). Log per-document outcomes (loaded, parsed, failed) and persist a “raw snapshot” or hash so you can reproduce a run. This sets you up for incremental updates and backfills in later milestones.
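The fetch → parse → normalize interface with per-document outcome logging might look like the sketch below. The record shape and function names are illustrative assumptions; the point is that every document leaves a trace, and the raw hash lets you prove a later run saw the same bytes.

```python
import hashlib

def load_document(source_id, raw_bytes, outcomes):
    """Parse and normalize one document, logging the outcome either way."""
    try:
        text = raw_bytes.decode("utf-8")      # stand-in for a real parser
        text = " ".join(text.split())         # minimal whitespace normalization
        outcomes.append({
            "source_id": source_id,
            "status": "loaded",
            "raw_sha256": hashlib.sha256(raw_bytes).hexdigest(),
        })
        return {"source_id": source_id, "text": text}
    except UnicodeDecodeError:
        # Never skip silently: failures are recorded and reviewable.
        outcomes.append({"source_id": source_id, "status": "failed"})
        return None
```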
After parsing, you need a normalization pipeline that makes downstream retrieval predictable. Think of this as “text hygiene”: remove content that hurts retrieval and generation quality, while keeping the wording users expect to see in citations. Milestone 1 continues here: normalize whitespace, collapse repeated headers, remove navigation menus from web pages, and standardize punctuation where it helps (for example, turning fancy quotes into plain quotes).
Deduplication is not optional. In RAG, duplicates cause wasted embedding cost and diluted retrieval (the same fact appears many times, ranking becomes unstable). Use multiple layers: (1) exact match on normalized text hash, (2) near-duplicate detection with MinHash/SimHash or embedding similarity, and (3) source-aware rules (e.g., “latest version wins” for internal policies). Decide what to do with duplicates: drop, merge metadata, or keep but downweight.
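Layer (1), exact-match deduplication on a normalized text hash, is only a few lines; near-duplicate detection then runs on whatever survives it. This is a minimal sketch, assuming chunks are dicts with a `text` field.

```python
import hashlib

def dedup_exact(chunks):
    """Drop chunks whose normalized text hash has already been seen.
    First dedup layer; MinHash/embedding similarity comes afterwards."""
    seen, kept = set(), []
    for chunk in chunks:
        normalized = " ".join(chunk["text"].lower().split())
        digest = hashlib.sha256(normalized.encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(chunk)
    return kept
```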
PII redaction basics must happen before embeddings. If you embed raw emails, phone numbers, or customer identifiers, they can be retrieved later and surfaced in responses or logs. A practical baseline is rule-based redaction (regex for emails, phone numbers, SSNs, API keys) plus allow/deny lists for known internal patterns. If you need higher recall, add a lightweight PII classifier, but keep it deterministic and auditable.
Common mistakes: over-redaction (removing product IDs that are not PII), under-redaction (missing API keys), and inconsistent cleaning that changes between runs. Treat cleaning rules as versioned code, and record the “pipeline version” into metadata so you can explain differences across index versions.
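A rule-based redaction baseline, applied before embedding, can be sketched as below. The patterns are deliberately simple starting points, not production-grade detectors; extend and version them like any other pipeline code.

```python
import re

# Simple illustrative patterns; real corpora need broader, tested rules.
PII_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "[EMAIL]"),
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),
    (re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"), "[PHONE]"),
]

def redact(text):
    """Deterministic, auditable PII redaction: same input, same output."""
    for pattern, replacement in PII_PATTERNS:
        text = pattern.sub(replacement, text)
    return text
```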
Milestone 2 is chunking: the single most influential design choice for retrieval quality. Chunk too large and you’ll miss relevant passages because the embedding averages multiple topics. Chunk too small and you lose context, forcing the generator to guess. A practical starting point for general documentation is 300–800 tokens per chunk with 10–20% overlap, then adjust based on evaluation and observed failure modes.
Overlap helps when answers span boundaries, but it increases cost (more chunks, more embeddings, more index size). Use overlap intentionally: large overlap for narrative docs, smaller overlap for reference docs with clear headings. Always measure: compute average chunk count per document and estimate embedding spend before committing.
Structure-aware splitting beats naive fixed windows. If you can keep headings with their paragraphs, retrieval becomes more semantically aligned with user questions. Split by document hierarchy first (H1/H2/H3, Markdown headings), then by paragraphs, then by sentences as a last resort. For PDFs without headings, consider page-based splitting combined with heuristics (font size, bold text) if your parser provides it.
Common mistakes: chunking that produces empty or near-empty chunks (often due to cleaning), splitting across sentences so citations look broken, and forgetting to cap maximum chunk length (some documents have huge paragraphs). Your practical outcome for this milestone is a configurable chunking module with clear parameters, plus metrics: distribution of chunk sizes, overlap rate, and “structure preservation” rate (how often headings remain attached).
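As a baseline before structure-aware splitting, a fixed-window chunker with configurable overlap can look like this sketch. Sizes here are in words for simplicity; swap in a tokenizer if you budget by tokens.

```python
def chunk_words(words, size=300, overlap=50):
    """Fixed windows with overlap: caps maximum chunk length by construction
    and never emits empty chunks."""
    assert 0 < overlap < size, "overlap must be positive and smaller than size"
    chunks, step = [], size - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break
    return chunks
```

A configurable module like this also makes the milestone’s metrics easy to compute: chunk-size distribution is just the lengths of the returned list.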
Milestone 2 also requires a metadata schema. In production, metadata is not decoration; it is how you control retrieval, produce credible citations, and debug lineage. Design metadata to answer three questions: (1) Where did this chunk come from? (2) How should it be retrieved? (3) How do we reproduce it?
For citations, store source URL/path, document title, section heading path, page number (for PDFs), and an excerpt boundary (start/end offsets if available). For retrieval filters, include doc type, product/team, language, publish date, and access scope. For lineage, include document_id, chunk_id, source revision (e.g., Git commit, CMS version, last-modified timestamp), and pipeline versions (parser version, cleaning version, chunker version, embedding model).
Engineering judgment: don’t overload metadata with everything you can think of. Every extra field increases index size and sometimes query latency. Instead, choose metadata that directly supports your retrieval strategy and operational debugging. A common mistake is forgetting lineage fields; later, when an evaluation regresses, you cannot tell whether the cause was new source content, a changed chunker, or a different embedding model.
The practical outcome is a documented schema (JSON schema or typed struct) used consistently by loaders, chunkers, and index writers, with defaults and validation.
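One way to type that schema is a frozen dataclass shared by loaders, chunkers, and index writers. The field names below are illustrative, not a standard; the grouping mirrors the three questions above.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class ChunkMetadata:
    # Citation fields: where did this chunk come from?
    source_path: str
    title: str
    heading_path: str            # e.g. "Pricing > Limits"
    page: Optional[int]
    # Retrieval filters: how should it be retrieved?
    doc_type: str
    language: str
    access_scope: str
    # Lineage: how do we reproduce it?
    document_id: str
    chunk_id: str
    source_revision: str         # Git commit, CMS version, or timestamp
    pipeline_version: str        # parser + cleaner + chunker versions
    embedding_model: str

    def __post_init__(self):
        if not self.document_id or not self.chunk_id:
            raise ValueError("lineage requires document_id and chunk_id")
```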
Milestone 3 is embedding generation and building the vector index. Choose an embedding model that matches your domain and latency/cost constraints, and then pick a storage backend: a managed vector database, a library embedded in your service, or a search engine with vector support. Your production decision should account for: operational burden, indexing speed, metadata filtering support, multi-tenancy, durability, and cost predictability.
Indexing strategy matters as much as the database choice. Use a versioned index: write embeddings to an index named by semantic version or timestamp (e.g., kb_v2026_03_25), then atomically switch your retriever to the new version. This enables safe rollbacks and makes evaluations meaningful. You can keep the previous index for a fixed retention window to support incident response.
Embedding generation should be batched, retried with idempotency, and rate-limited. Persist an “embedding job record” containing: chunk_id, text hash, embedding model name/version, and embedding timestamp. That record lets you avoid recomputing embeddings when text is unchanged, reducing cost and keeping your builds faster.
Common mistakes: overwriting an index in place (no rollback), mixing embeddings from different models in the same collection, and skipping metadata filters so retrieval returns out-of-scope content. The practical outcome of Milestone 3 is a reproducible indexing job that produces a new, fully-populated, versioned index with a promotion step and a clear “current index” pointer.
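With versioned indexes, promotion reduces to a single pointer swap. This toy in-memory store stands in for whatever your backend offers (collection aliases, or a "current index" record in your own config).

```python
def write_index_version(store, name, records):
    """Build the new index fully under a versioned name before anyone reads it."""
    store[name] = list(records)

def promote(store, name):
    """Atomically switch the 'current' pointer; rollback is the same call."""
    if name not in store:
        raise ValueError(f"cannot promote missing index: {name}")
    store["__current__"] = name

def current_index(store):
    return store[store["__current__"]]
```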
Milestones 4 and 5 are about operating ingestion over time: incremental updates, backfills, and data quality validation. Start by making ingestion testable. Create fixtures: a small, representative corpus that includes tricky PDFs, pages with tables, duplicated docs, and content containing fake PII patterns. Your CI pipeline should run the full ingestion flow on fixtures and assert deterministic outputs: same number of documents, stable chunk counts within expected ranges, and consistent metadata fields.
Incremental updates require change detection. Use source revision signals (ETag/Last-Modified for web, file hashes for files, version IDs for internal systems). If a document hasn’t changed, skip re-embedding. If it changed, re-chunk and re-embed only the affected chunks, then upsert them. For deletions, implement tombstones: mark chunks as inactive so they stop retrieving, and periodically compact the index if your backend needs it.
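Change detection plus tombstones can be sketched as a planning pass over the embedding job records from Milestone 3. The record shape here (source_id mapped to the last text hash) is an assumption for illustration.

```python
import hashlib

def plan_ingestion(sources, job_records):
    """Classify each source as skip (unchanged), upsert (new or changed),
    or tombstone (deleted), updating job_records along the way."""
    plan = {"skip": [], "upsert": [], "tombstone": []}
    seen = set()
    for source_id, text in sources.items():
        seen.add(source_id)
        digest = hashlib.sha256(text.encode()).hexdigest()
        if job_records.get(source_id) == digest:
            plan["skip"].append(source_id)       # no re-embedding cost
        else:
            plan["upsert"].append(source_id)
            job_records[source_id] = digest
    for source_id in list(job_records):
        if source_id not in seen:
            plan["tombstone"].append(source_id)  # deactivate now, compact later
            del job_records[source_id]
    return plan
```

Because the plan is derived from hashes, rerunning it on unchanged sources is idempotent: everything lands in "skip" and no duplicates are created.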
Backfills are controlled reprocessing runs: for example, “re-embed everything with a new model” or “re-chunk policies with a new structure-aware splitter.” This is where versioned indexes pay off. Run backfills into a fresh index version, generate sampling reports, then promote when quality gates pass.
Common mistakes: only testing “happy path” documents, running incremental updates without idempotency (creating duplicates), and shipping without a sampling report. The practical outcome is a repeatable build that can be run locally and in CI, plus a lightweight data-quality dashboard or report artifact that makes ingestion changes reviewable before they impact production retrieval.
1. Why does the chapter emphasize making ingestion "deterministic, observable, versioned, and repeatable"?
2. What end-to-end flow best matches the chapter’s ingestion pipeline description?
3. What is the core meaning of "ingestion is a build step, not a side effect" in this chapter?
4. Which set of workflows does the chapter call "boring" but critical for production readiness?
5. Which ingestion failure most directly supports the claim that RAG retrieval can become "random" without a trustworthy pipeline?
In Chapter 2 you built the ingredients: chunked content, metadata, and a searchable index. Chapter 3 turns those ingredients into a production-ready retrieval + generation chain that behaves predictably under real traffic. The work here is less about “making it work” and more about engineering judgment: how you tune top-k without flooding the model, how you handle empty or low-confidence retrievals, how you enforce citation discipline, and how you expose the system through an API that can stream and degrade safely.
This chapter is organized as a set of milestones that mirror what teams actually ship. First, you implement a retrieval pipeline with filters and top-k tuning. Next, you add re-ranking and context window management so the model sees the most useful evidence. Then you design prompts for grounded answers with citations and refusals. Finally, you assemble a FastAPI service with streaming responses, caching, and resilient fallbacks for degraded modes.
Keep a single principle in mind: you are building a chain of probabilistic components. That means every stage needs guardrails and measurable quality signals. A clean interface between stages (query → retrieval → re-rank → context build → generation → post-process) is the easiest way to trace failures, evaluate changes, and enforce cost budgets later in the course.
Practice note for Milestone 1: Implement retrieval pipeline with filters and top-k tuning: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone 2: Add re-ranking and context window management: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone 3: Design prompts for grounded answers with citations: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone 4: Build a FastAPI service with streaming responses: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone 5: Add caching and resilient fallbacks for degraded modes: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Your retrieval quality ceiling is set by how well the system understands the user’s intent. In production RAG, the first “retrieval” is often a lightweight query understanding step: normalize the question, rewrite it to match the document language, and optionally expand it into multiple focused queries. This is where you fix vague pronouns (“it”, “that policy”), missing nouns (“How do I reset?”), and user-specific context (“for enterprise plan”).
A practical pattern is a query rewrite prompt that outputs: (1) a canonical query string, (2) optional filters inferred from the request (product, region, time range), and (3) 2–4 sub-queries that target different facets (definition, procedure, exceptions). Multi-query retrieval improves recall but can explode cost; control it with strict caps and only enable it when the initial retrieval score distribution is weak (e.g., no scores above a threshold).
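The rewrite output described above works best as structured data rather than free text. A minimal sketch, assuming the three fields from the pattern (canonical query, inferred filters, capped sub-queries); the class name and cap helper are illustrative, not a canonical schema:

```python
from dataclasses import dataclass, field

@dataclass
class QueryRewrite:
    """Structured output of the query-understanding step,
    kept as data you can trace, log, and test."""
    canonical_query: str
    filters: dict = field(default_factory=dict)      # e.g. {"product": "enterprise"}
    sub_queries: list = field(default_factory=list)  # facet queries (definition, procedure, ...)

    def capped(self, max_sub_queries: int = 4) -> "QueryRewrite":
        # Enforce a strict cap so multi-query retrieval cannot explode cost.
        return QueryRewrite(self.canonical_query, self.filters,
                            self.sub_queries[:max_sub_queries])
```

Gating on retrieval score distributions then becomes a simple check before you fan out the capped sub-queries.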
Hybrid search is usually the default in production because pure vector similarity can miss keyword-heavy queries (error codes, legal clauses, SKUs). Combine lexical (BM25) and vector results, then deduplicate by document id + chunk span. The common mistake is mixing scores naïvely; instead, rank within each channel, merge by reciprocal rank fusion (RRF), then pass the merged candidates downstream.
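The merge step above can be sketched with reciprocal rank fusion. This is a standard RRF formulation (score contribution 1/(k + rank) per channel, with the conventional k = 60); the function signature is an assumption for illustration:

```python
def rrf_merge(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: merge best-first ranked candidate lists
    from lexical (BM25) and vector channels without mixing raw scores."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            # rank is 0-based here, so the contribution is 1 / (k + rank + 1).
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```

A document ranked well by both channels outranks one ranked highly by only one, which is exactly the behavior you want before re-ranking.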
Milestone alignment: this is where you begin implementing the retrieval pipeline with filters, but you also set up the inputs for later re-ranking and context assembly. Treat query understanding outputs as structured data you can trace and test.
Once you can reliably form a good query, the next step is tuning the levers that control recall, precision, and cost. The three you will use constantly are top-k, diversification (often MMR), and score thresholds. Each has tradeoffs, and production systems usually expose them as configuration with safe defaults rather than hard-coded constants.
Top-k tuning is not “higher is better.” Larger k increases recall but also increases context length, which increases token costs and can degrade answer quality by diluting the evidence. Start with k=5–10 for short factual queries and k=15–30 for broad questions, then measure retrieval hit rate and downstream faithfulness. A practical technique is adaptive k: start at 8, and only expand if the aggregated evidence is insufficient (e.g., total distinct documents < 2 or average score below target).
MMR (Maximal Marginal Relevance) helps you avoid returning ten near-duplicate chunks from the same section. This is especially important when your chunking strategy produces overlapping windows. MMR adds diversity by penalizing candidates that are too similar to already-selected chunks. The common mistake is setting the diversity parameter too high, which can pull in irrelevant chunks just to be “different.” A useful heuristic: tune MMR on a small set of representative questions and inspect which documents are being excluded; your goal is diversity across sources, not randomness.
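The MMR selection described above can be sketched as a greedy loop. The similarity callables are assumptions (in practice they would be cosine similarities over embeddings), and `lam` is the relevance/diversity trade-off parameter the paragraph warns about tuning:

```python
def mmr_select(candidates, query_sim, pair_sim, top_n=5, lam=0.7):
    """Greedy Maximal Marginal Relevance: lam weights relevance to the
    query against redundancy with already-selected candidates."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < top_n:
        def mmr_score(c):
            # Redundancy = similarity to the closest already-selected item.
            redundancy = max((pair_sim(c, s) for s in selected), default=0.0)
            return lam * query_sim(c) - (1 - lam) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

Setting `lam` close to 1.0 reduces MMR to plain relevance ranking; pushing it too low is the failure mode the paragraph describes, where irrelevant chunks are pulled in just to be different.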
Score thresholds are your first safety guard. If the best similarity score is below a minimum, you should not pretend you found evidence. Instead, you either (a) ask a clarification question, (b) run a broader retrieval mode (hybrid + multi-query), or (c) enter a refusal/degraded mode with a transparent message. Thresholds must be calibrated per index and embedding model; do not copy numbers from blog posts.
Milestone alignment: this section completes the core retrieval pipeline and sets you up for Milestone 2, where re-ranking and context window management turn raw recall into model-ready evidence.
A production RAG prompt is a contract. It tells the model what it may use (the retrieved context), what it must produce (answer + citations), and what it must do when evidence is missing (refuse or ask to clarify). Without this contract, your “citations” become decoration rather than an enforceable grounding mechanism.
Use a structured prompt with clearly separated blocks: system (rules), developer (format requirements), context (chunks with IDs), and user (question). In the rules, state that the model must only use facts found in the provided context and must cite the chunk IDs for each factual claim. Then specify a citation format that your post-processor can parse, such as [doc_id#chunk_id] or [S1] with a mapping table.
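The prompt assembly can be sketched as a function that stamps each chunk with its parseable [doc_id#chunk_id] marker. The rule wording below is illustrative, not a canonical template; the key property is that the citation format in the rules matches what the post-processor will parse:

```python
def build_grounded_prompt(question: str, chunks: list[dict]) -> str:
    """Assemble a citation-disciplined prompt from retrieved chunks.
    Each chunk dict is assumed to carry 'doc_id', 'chunk_id', 'text'."""
    rules = (
        "Answer ONLY from the context below. Cite every factual claim "
        "with its source marker, e.g. [policies#12]. If the context does "
        "not contain the answer, say so briefly and ask one clarifying question."
    )
    context = "\n\n".join(
        f"[{c['doc_id']}#{c['chunk_id']}]\n{c['text']}" for c in chunks
    )
    return f"{rules}\n\n# Context\n{context}\n\n# Question\n{question}"
```

In a real deployment the rules belong in the system message and the question in the user message; the single string here keeps the sketch self-contained.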
Refusal is not a failure; it is correct behavior when retrieval confidence is low. Include an explicit policy: if the context does not contain the answer, respond with a brief refusal and one of: (a) a clarification question, or (b) guidance on what information is needed. A common mistake is “soft refusal,” where the model hedges but still invents steps. Make refusal deterministic by defining a threshold signal from retrieval (e.g., max_score or evidence_count) and passing it to the prompt as a variable the model must respect.
Milestone alignment: this section corresponds to Milestone 3. You are designing prompts that can be evaluated later for faithfulness and that work well with streaming and post-processing in the API layer.
Even with a strong prompt, you should not ship the raw model output directly to users. Post-processing is where you enforce formatting, validate citations, and apply lightweight safety and quality checks. Think of it as a “linting” step for natural language.
Citation validation is the most important. Parse the model output for citation tokens and verify that each one corresponds to a retrieved chunk actually present in the context window. If the model cites nonexistent IDs, you have a few options: (1) drop invalid citations and mark the answer as low-confidence, (2) re-run generation with a stricter prompt, or (3) fall back to extractive mode (return top passages with minimal synthesis). The common mistake is silently accepting invalid citations, which trains users to distrust the system.
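The validation step can be sketched with a regex over the [doc_id#chunk_id] format and a set comparison against the chunks actually placed in the context window. The flag names echo the structured flags mentioned later in this chapter; the exact return shape is an assumption:

```python
import re

CITATION_RE = re.compile(r"\[([\w-]+)#([\w-]+)\]")

def validate_citations(answer: str, context_chunks: set[tuple[str, str]]) -> dict:
    """Split cited (doc_id, chunk_id) pairs into those present in the
    context window and those the model invented. Invalid citations
    should lower confidence or trigger a re-run, never pass silently."""
    cited = set(CITATION_RE.findall(answer))
    valid = cited & context_chunks
    invalid = cited - context_chunks
    return {"citations_valid": not invalid, "valid": valid, "invalid": invalid}
```

The returned flags feed directly into the low-confidence, regenerate, or extractive-fallback branches described above.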
Formatting normalization improves consistency for downstream clients. Convert markdown to a safe subset if needed, enforce a maximum length, and ensure lists are well-formed. For enterprise settings, you may also need to remove PII patterns or secrets (API keys) detected in either the context or the generated answer.
Safety checks in RAG are often about policy compliance rather than toxicity. Examples: do not provide medical/legal advice beyond the sourced text; do not output internal-only documents to unauthorized users; do not reveal system prompts. These checks typically rely on metadata (document access level) plus simple classifiers or rules. In later chapters, you will trace and evaluate these outcomes, so emit structured flags like citations_valid, evidence_strength, and safety_blocked.
This section connects Milestone 3 to Milestone 4: once post-processing is structured, it becomes easy to return consistent API responses and stream partial output while still validating citations at the end.
Your RAG chain becomes a product when it is accessible through a stable API. A practical FastAPI design starts with two endpoints: POST /v1/answers for standard responses and POST /v1/answers:stream (or a query param) for streaming via SSE. Keep the request schema explicit: question, tenant_id, optional filters, and an optional conversation_id. Include knobs you are willing to support long-term, like max_output_tokens and mode (standard vs. degraded), but avoid exposing raw k/MMR unless you can maintain them as public contracts.
Streaming responses improve perceived latency and reduce timeout risk. Stream tokens as they are generated, but also stream structured “events” when possible: retrieval completed, rerank completed, generation started, generation finished. Clients can render partial text while still receiving final metadata (citations, latency, cost estimates) at the end. One common mistake is returning citations only after streaming completes with no way for the client to reconcile; solve this by buffering citation blocks or streaming a final “sources” event.
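The event-stream shape described above can be sketched in plain Python. This generator shows only the SSE framing and event ordering (token deltas first, a final sources event last); a FastAPI endpoint would wrap it in a `StreamingResponse` with `media_type="text/event-stream"`. Event names are illustrative:

```python
import json

def sse_event(event: str, data: dict) -> str:
    """Format one Server-Sent Events frame: an event name line,
    a JSON data line, and a blank-line terminator."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"

def stream_answer(tokens, citations):
    """Yield SSE frames for a streamed answer. Clients render the token
    deltas immediately and reconcile citations from the final 'sources'
    event, avoiding the citations-lost-after-streaming problem."""
    yield sse_event("generation_started", {})
    for t in tokens:
        yield sse_event("token", {"delta": t})
    yield sse_event("sources", {"citations": citations})
```

Because citations arrive as a dedicated terminal event, the client never has to guess whether sources were omitted or simply not yet sent.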
Idempotency matters for retries and client errors. Accept an Idempotency-Key header and store the final response for a short TTL keyed by (tenant_id, idempotency_key). If the client retries due to a network issue, you can return the same answer without re-spending tokens. This also helps enforce cost budgets and prevents duplicate charges in metered systems.
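A minimal sketch of the idempotency store, assuming an in-process dict for illustration; production systems would back this with Redis or similar shared storage so retries hit any replica:

```python
import time

class IdempotencyCache:
    """Short-TTL store keyed by (tenant_id, idempotency_key) so a client
    retry returns the cached final answer instead of re-spending tokens."""

    def __init__(self, ttl_seconds: float = 300.0):
        self.ttl = ttl_seconds
        self._store: dict[tuple[str, str], tuple[float, dict]] = {}

    def get(self, tenant_id: str, key: str):
        entry = self._store.get((tenant_id, key))
        if entry and time.monotonic() - entry[0] < self.ttl:
            return entry[1]
        self._store.pop((tenant_id, key), None)  # drop expired or missing entries
        return None

    def put(self, tenant_id: str, key: str, response: dict) -> None:
        self._store[(tenant_id, key)] = (time.monotonic(), response)
```

Keying by tenant as well as the client-supplied key prevents one tenant's retries from ever reading another tenant's cached answers.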
Keep the response schema stable and explicit: answer, citations[] (with doc_id, chunk_id, title, url), usage (prompt/completion tokens), and trace_id. Version the API path so that breaking schema changes move clients from /v1 to /v2 rather than silently changing the contract. Milestone alignment: this is Milestone 4. You are assembling the full chain into a service boundary that supports observability, evaluation hooks, and later cost controls.
Production RAG is a distributed system: vector store, re-ranker, LLM API, cache, and your own service. Reliability comes from assuming each dependency will sometimes be slow, return errors, or degrade in quality. Your job is to make those failures predictable and safe.
Timeouts should be set per stage, not just as a global request timeout. For example: 300–800ms for retrieval, 500–1500ms for re-ranking, and a generation budget that depends on streaming and max tokens. If retrieval times out, you can still attempt a degraded response: ask a clarifying question or provide generic guidance without citations (only if your product policy allows), clearly labeled as not sourced.
Retries must be selective. Retry only on transient errors (429, 503, connection resets) and use exponential backoff with jitter. Never blindly retry long generations; you will multiply cost. For streaming, design so a partial stream can be abandoned safely and the client can retry with an idempotency key to resume a cached final answer when available.
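The selective-retry policy can be sketched as a small wrapper. The transient status set and the `call()` returning a (status, body) pair are simplifying assumptions; real clients would catch connection-reset exceptions as well:

```python
import random
import time

TRANSIENT = {429, 503}  # rate-limited / temporarily unavailable

def retry_transient(call, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Retry only transient errors, with exponential backoff and full
    jitter. Non-transient responses surface immediately, and long
    generations are never multiplied by blind retries."""
    for attempt in range(max_attempts):
        status, body = call()
        if status not in TRANSIENT:
            return status, body
        if attempt < max_attempts - 1:
            # Full jitter: random delay in [0, base * 2^attempt).
            sleep(base_delay * (2 ** attempt) * random.random())
    return status, body
```

Injecting `sleep` as a parameter keeps the backoff schedule testable without real waiting.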
Circuit breakers prevent cascading failures when a provider is down. If the LLM API error rate crosses a threshold, open the circuit and immediately route requests to a fallback model, a smaller context mode, or an extractive-only endpoint. Similarly, if your vector store is unhealthy, skip generation and return a clear message rather than generating from nothing. The common mistake is allowing the system to “hallucinate through outages,” which looks like the system is working while it silently becomes ungrounded.
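A minimal sketch of the breaker's core state machine, under the assumption of a simple consecutive-failure counter. Real breakers add half-open probing and sliding time windows; this shows only the open/closed decision the routing logic needs:

```python
class CircuitBreaker:
    """Count-based circuit breaker: after `threshold` consecutive
    failures the circuit opens and callers should route to a fallback
    (smaller model, extractive-only mode) instead of the unhealthy
    dependency. Any success closes the circuit again."""

    def __init__(self, threshold: int = 5):
        self.threshold = threshold
        self.failures = 0

    @property
    def open(self) -> bool:
        return self.failures >= self.threshold

    def record(self, success: bool) -> None:
        self.failures = 0 if success else self.failures + 1
```

The caller checks `breaker.open` before each LLM call and records the outcome after it; the fallback path is what prevents "hallucinating through outages."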
Milestone alignment: this is Milestone 5. Caching and resilient fallbacks are not optional polish; they are how you keep user trust and cost predictable when dependencies misbehave. With these patterns in place, Chapter 4 can focus on tracing and metrics with clean, stage-level signals.
1. What is the main engineering focus of Chapter 3 compared to Chapter 2?
2. Why does the chapter emphasize tuning top-k and adding filters in retrieval?
3. What is the purpose of adding re-ranking and context window management after initial retrieval?
4. How should prompts be designed in this chapter’s RAG chain to support reliable outputs?
5. Which end-to-end structure best reflects the chapter’s recommended clean interfaces for tracing failures and enforcing budgets?
In a production RAG system, “it worked in staging” is not a success criterion. Your real success criteria are repeatability, diagnosability, and controlled cost. When a user reports “the assistant is wrong,” you need to answer three questions quickly: what happened, why it happened, and what you will change to prevent it. This chapter turns the RAG pipeline into an observable system by instrumenting end-to-end traces across retrieval and generation (Milestone 1), capturing token usage and latency breakdowns with a meaningful error taxonomy (Milestone 2), logging retrieval artifacts safely (Milestone 3), building dashboards that track SLOs and anomalies (Milestone 4), and running structured debugging playbooks on real failures (Milestone 5).
Observability is not just “more logging.” It is a disciplined data model and workflow: trace a single user request across services, correlate each step, summarize key artifacts, and preserve enough evidence to reproduce failures. You also need engineering judgment: log too little and you cannot debug; log too much and you create privacy risk and runaway storage costs. The goal is a minimum sufficient dataset that supports fast triage, accurate root cause analysis, and regression-proof fixes.
Throughout this chapter, assume a standard production RAG flow: request intake → query rewriting (optional) → retrieval (vector + keyword + rerank) → context assembly → generation → post-processing (citations, guardrails) → response. Each step becomes observable through traces, logs, and metrics with consistent IDs, standard attributes, and safe payload handling.
Practice note for Milestone 1: Instrument end-to-end traces across retrieval and generation: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone 2: Capture token usage, latency breakdowns, and error taxonomy: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone 3: Log retrieval artifacts (queries, docs, scores) safely: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone 4: Build dashboards for SLOs and anomaly detection: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone 5: Run structured debugging playbooks on real failures: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Tracing in RAG is about reconstructing a single request’s journey through retrieval and generation with enough detail to explain outcomes. Start with a single trace per user request (or per tool-call workflow), then break it into spans that map to pipeline stages. A practical span set is: request (gateway), auth, query_normalization, retrieval.vector_search, retrieval.keyword_search (if used), retrieval.rerank, context_assembly, llm.generate, citations.build, and response. This gives you end-to-end traces across retrieval and generation (Milestone 1) while preserving stage boundaries for debugging and cost accounting.
Every trace should carry stable correlation IDs: trace_id (end-to-end), request_id (from edge), user_id_hash (pseudonymous), session_id, and doc_index_version. In addition, include a rag_pipeline_version (git SHA or semantic version) so you can compare behaviors across releases. A common mistake is to log only a request ID at the API gateway but not propagate it into retrieval services and LLM calls. Fix this by enforcing context propagation in your HTTP clients, queue messages, and background workers.
Spans are only useful if they have attributes that explain quality and cost. For retrieval spans, capture: query length, embedding model/version, topK, filters used, latency, and the distribution of scores (min/median/max). For reranking, capture reranker model, input doc count, and topN. For generation spans, capture model name, temperature, max_tokens, stop sequences, and a “context bytes/tokens” estimate. Keep attributes structured and typed; avoid stuffing raw JSON blobs into a single string field because it breaks filtering and aggregation later.
Finally, decide on sampling strategy. Full tracing of all requests can be expensive. A typical approach is 100% traces for errors and slow requests, plus 1–10% sampling for baseline performance, plus targeted sampling for specific tenants or experiments.
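The sampling policy above can be sketched as a single decision function. The thresholds and baseline rate are illustrative defaults, and the injectable `rng` is there purely to make the decision testable:

```python
import random

def should_trace(is_error: bool, latency_ms: float,
                 baseline_rate: float = 0.05,
                 slow_threshold_ms: float = 3000.0,
                 rng=random.random) -> bool:
    """Tail-biased sampling: always keep traces for errors and slow
    requests, sample everything else at a small baseline rate."""
    if is_error or latency_ms >= slow_threshold_ms:
        return True
    return rng() < baseline_rate
```

Targeted sampling for specific tenants or experiments would add one more condition keyed on request attributes.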
LLM applications need all three pillars—traces, metrics, and logs—but you must be explicit about which questions each pillar answers. Traces answer “where did time go and which step failed?” Metrics answer “is it getting worse, and how often?” Logs answer “what exactly happened for this request?” Build a consistent data model that ties them together via trace_id and request_id.
Start by defining a canonical schema for LLM/RAG events. At minimum: timestamp, env, service_name, rag_pipeline_version, trace_id, request_id, tenant_id, user_id_hash, and error_class. Then add domain fields: retrieval_topk, rerank_topn, index_name, index_version, chunking_version, model, prompt_template_version, and cache_key_hash. This is where Milestone 2 becomes real: token usage and latency breakdowns must be first-class fields, not free-form text. Track: input_tokens, output_tokens, total_tokens, and cost_estimate (if you can compute it deterministically). Track latency as both end-to-end and per span stage (retrieval_ms, rerank_ms, generation_ms).
Define an error taxonomy that is meaningful for RAG. Avoid a single “500” bucket. Use categories such as: RetrievalTimeout, VectorStoreUnavailable, RerankerFailed, ContextTooLarge, LLMRateLimited, LLMTimeout, GuardrailBlocked, CitationMissing, ParserError, and UpstreamAuthError. Attach a “stage” field so you can see whether failures are concentrated in retrieval or generation.
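The taxonomy and its stage field can be sketched as an enum plus a lookup table. The error class names follow the list above; the stage mapping is an assumption for illustration:

```python
from enum import Enum

class RagErrorClass(str, Enum):
    """RAG-specific error classes, so dashboards aggregate by failure
    mode instead of one opaque '500' bucket."""
    RETRIEVAL_TIMEOUT = "RetrievalTimeout"
    VECTOR_STORE_UNAVAILABLE = "VectorStoreUnavailable"
    RERANKER_FAILED = "RerankerFailed"
    CONTEXT_TOO_LARGE = "ContextTooLarge"
    LLM_RATE_LIMITED = "LLMRateLimited"
    LLM_TIMEOUT = "LLMTimeout"
    GUARDRAIL_BLOCKED = "GuardrailBlocked"
    CITATION_MISSING = "CitationMissing"
    PARSER_ERROR = "ParserError"
    UPSTREAM_AUTH_ERROR = "UpstreamAuthError"

# Stage attribution: lets you see whether failures concentrate
# in retrieval, generation, or post-processing.
STAGE = {
    RagErrorClass.RETRIEVAL_TIMEOUT: "retrieval",
    RagErrorClass.VECTOR_STORE_UNAVAILABLE: "retrieval",
    RagErrorClass.RERANKER_FAILED: "rerank",
    RagErrorClass.CONTEXT_TOO_LARGE: "context_assembly",
    RagErrorClass.LLM_RATE_LIMITED: "generation",
    RagErrorClass.LLM_TIMEOUT: "generation",
    RagErrorClass.GUARDRAIL_BLOCKED: "post_processing",
    RagErrorClass.CITATION_MISSING: "post_processing",
    RagErrorClass.PARSER_ERROR: "post_processing",
    RagErrorClass.UPSTREAM_AUTH_ERROR: "request",
}
```

Deriving the enum from `str` keeps the values directly serializable into log and metric fields.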
Metrics should be designed for dashboards and alerts. Good metrics are low-cardinality and aggregated: p50/p95/p99 latency per stage, error rate by error_class, token distributions by model and tenant, cache hit rate, retrieval success rate (defined explicitly), and “empty context” rate. A common mistake is to create metrics with high-cardinality labels (e.g., full user IDs, full queries), which can overload time-series systems and become unusable.
Connect the pillars: traces carry span-level timing, logs carry structured artifacts (sanitized), and metrics provide long-term trends. When your on-call opens an incident, they should pivot from an alerting metric to a set of exemplar traces, and from those traces to the specific logs that show what the model saw.
Logging retrieval artifacts (queries, documents, scores) is essential for debugging retrieval quality, but it is also where production RAG systems fail compliance reviews. Milestone 3 is achieved when you can investigate relevance issues without leaking sensitive content. Treat prompts and documents as potentially sensitive by default, even in internal systems.
Use a tiered logging strategy. Tier 1 (always on) logs only metadata: document IDs, chunk IDs, source system, index_version, scores, and filters. Tier 2 (sampled or gated) logs partial text previews with aggressive redaction. Tier 3 (break-glass) is enabled only for authorized incident response and stores encrypted payloads with short retention and audited access. A common mistake is to log full prompts to application logs “temporarily,” then discover months later that they were shipped to third-party log storage.
Redaction should be deterministic and testable. Apply it both on the client side (before transmission) and server side (before persistence). Techniques include: regex-based masking for obvious PII (emails, phone numbers), entity detection for names/addresses, and allowlists for safe fields. Prefer hashing for stable identifiers (user_id_hash) and tokenization for high-risk substrings. When you need to reproduce an issue, store references rather than raw text: content hashes, doc IDs, and versioned index pointers. If the underlying corpus can change, the index_version + chunk_id becomes your reproducibility anchor.
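The deterministic masking and hashing steps can be sketched as below. The regexes are deliberately simple examples, the salt value is a placeholder that must be managed as a secret, and entity detection for names and addresses would layer on top of this:

```python
import hashlib
import re

EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
PHONE_RE = re.compile(r"\b(?:\+?\d[\d\s().-]{7,}\d)\b")

def redact(text: str) -> str:
    """Deterministic regex masking for obvious PII before a log line
    is persisted; the same input always redacts the same way, so the
    behavior is unit-testable."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)

def user_id_hash(user_id: str, salt: str = "rotate-me") -> str:
    """Stable pseudonymous identifier for logs. Salt is a placeholder:
    store it as a secret and rotate per policy."""
    return hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()[:16]
```

Running the same functions client-side before transmission and server-side before persistence gives the double-redaction layering the paragraph calls for.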
Secure logging also includes retention and access control. Set short retention for higher-risk tiers, encrypt at rest, and audit access. Separate “debug payload storage” from general log aggregation, and ensure data processors (vendors) are approved for the data class. Practical outcome: you can answer “what docs were retrieved and why?” without ever storing a user’s raw sensitive text in standard logs.
In production, users report “hallucinations,” but the fix depends on whether the system failed to retrieve the right evidence or failed to use the evidence correctly. Root cause analysis begins by separating two common modes: retrieval miss (the right information wasn’t in the context) and grounding failure (the right information was present but the model ignored or distorted it). Your tracing and artifact logging should make this distinction measurable.
Start with the trace. If retrieval spans show low scores, empty results, or filters that excluded relevant documents, suspect a retrieval miss. Check: query rewriting output (did it drift?), embedding model/version mismatch, index_version (stale index), metadata filters (overly strict), topK too small, or reranker errors. An overlooked cause is silent fallback behavior: for example, reranker failure leading to un-reranked results with worse relevance. Without spans for rerank and explicit error_class, this looks like “random hallucination.”
If retrieval looks healthy (high scores, relevant doc IDs, reasonable excerpts), but the answer is still wrong, suspect grounding failure. Common causes: context truncation (the crucial chunk was dropped due to token limits), poor context ordering (relevant chunk buried), prompt template regression (instructions changed), or citation builder selecting incorrect snippets. Another frequent issue is “citation drift,” where the model answers from general knowledge but still attaches citations; you detect this by comparing answer claims to retrieved chunk content in offline evaluation, and by logging which chunk IDs were cited.
Operationally, create a checklist you can execute in minutes: open the trace and inspect retrieval scores, result counts, and the filters that were applied; verify the embedding model/version and index_version match the current pipeline; confirm the reranker ran and did not silently fall back to un-reranked results; if retrieval looks healthy, check for context truncation and poor chunk ordering; finally, compare the cited chunk IDs against the answer's claims to detect citation drift.
Practical outcome: instead of generic “hallucination” tickets, you produce actionable bug reports like “retrieval filter excluded policy docs for tenant X” or “context assembly dropped the reranked top-1 chunk due to token budgeting regression,” each tied to a trace_id and reproducible configuration.
Dashboards are where observability becomes operational. Milestone 4 is not “a pretty chart,” it is an on-call tool that answers: are we meeting SLOs, what changed, and where is the anomaly? Start with a small set of KPIs that reflect both user experience and RAG-specific quality signals.
Latency should be broken down by stage: end-to-end, retrieval (vector + keyword), rerank, generation, and post-processing. Track p50/p95/p99, not just averages. LLM generation often dominates p99; retrieval often dominates p50. Add saturation signals: queue depth, concurrent requests, and rate-limits encountered. A common mistake is to alert on end-to-end latency without stage breakdown; you end up paging the wrong team.
Cost and efficiency KPIs include: input/output tokens per request, tokens per successful answer, and cache hit rate (prompt cache, retrieval cache, embedding cache). Cache hit rate is especially important in RAG: a drop might indicate a query rewrite change that prevents normalization, or a missing cache key dimension (e.g., index_version not included). Also monitor “context tokens” and “truncation rate” to detect when your context budget is being exceeded after a corpus growth or chunking change.
RAG-specific quality KPIs need precise definitions. “Retrieval success rate” might mean: at least one document returned above a score threshold, or at least one document from an approved source, or at least N tokens of context assembled. “Citation coverage” might mean: percentage of sentences with citations, or percentage of answers with at least one citation. Track “empty context rate,” “low-score retrieval rate,” and “reranker failure rate.” These metrics give early warning before human feedback arrives.
For anomaly detection, segment by tenant, model, index_version, and pipeline_version. A small regression may be invisible globally but severe for one tenant with distinct documents. Practical outcome: your dashboards support fast triage (what broke), scoped rollback decisions (which version), and capacity planning (where latency is growing).
Runbooks turn observability into consistent action. Milestone 5 is achieved when a real failure can be handled by following a structured playbook: identify, mitigate, diagnose, fix, and prevent regression. Your runbooks should be written for the “2 a.m. operator,” not the system designer. They should include concrete commands, dashboards to open, and decision points.
Create separate runbooks for the most common incident classes: elevated latency, high error rate, degraded retrieval quality, and cost spikes. Each runbook starts with: (1) confirm impact (tenant? region? model?), (2) stop the bleeding (rate limit, degrade features, switch model, disable query rewrite), and (3) preserve evidence (increase sampling, capture exemplar traces). Then it proceeds to diagnosis using the trace breakdown and error taxonomy.
Regression reproduction is the bridge from incident to permanent fix. For each incident, store a “repro bundle” that avoids sensitive data: request metadata, normalized query (or hashed query with deterministic replay in a secure environment), pipeline_version, prompt_template_version, model, index_version, retrieval parameters, and the list of retrieved doc IDs/chunk IDs with scores. With this bundle, you can rerun the pipeline against the same index snapshot and compare outputs across candidate fixes. A common mistake is to attempt reproduction against a live index that has changed; without versioned indexes and logged index_version, you cannot know if you fixed the bug or the data changed.
Practical outcome: incidents become learning loops. You not only restore service quickly, you also add the missing trace attributes, metrics, or redaction-safe artifacts that would have made the diagnosis faster—making the next incident less likely and easier to resolve.
1. In Chapter 4, what are the three questions you should be able to answer quickly when a user reports “the assistant is wrong”?
2. Which practice best represents observability (as defined in the chapter) rather than “more logging”?
3. What is the main trade-off the chapter highlights when deciding what to log in a production RAG system?
4. Which set of telemetry aligns with Milestone 2 in the chapter?
5. In the standard production RAG flow described, where do citations and guardrails belong?
A production RAG system is never “done” when it answers correctly once. It is done when you can prove it stays correct as your corpus grows, your chunking strategy evolves, models change, and costs are constrained. This chapter builds the evaluation harness you will submit in a certification context: a gold dataset, automatic metrics, retrieval-specific evals, judge scoring with guardrails, and CI gates with report artifacts.
Two engineering realities shape everything here. First, a RAG pipeline is a chain: ingestion → indexing → retrieval → reranking (optional) → prompt assembly → generation → citation formatting. When quality drops, you need to localize the fault quickly, not argue about “the model got worse.” Second, quality and cost are linked: higher k increases recall but can inflate tokens and latency; heavier judges improve measurement but raise evaluation spend. A good harness measures both quality and operational constraints.
We will follow five milestones: (1) create a gold dataset and protocol, (2) implement automatic and LLM-judge scoring, (3) add retrieval evals such as recall@k/MRR/citation accuracy, (4) build CI-friendly regression tests with thresholds, and (5) generate a report artifact that documents methodology, results, and known limitations.
By the end of this chapter, you should have a harness that answers: “Did we get better?” “Did we break anything?” and “What is the cost of measuring this?”—all with enough rigor for production sign-off.
Practice note for Milestone 1 (create a gold dataset and evaluation protocol): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone 2 (implement automatic metrics and LLM-judge scoring): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone 3 (add retrieval evals: recall@k, MRR, citation accuracy): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone 4 (build CI-friendly regression tests and thresholds): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone 5 (produce an evaluation report for certification submission): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Your evaluation harness is only as trustworthy as the dataset behind it. For RAG, the gold dataset must include more than a question and an ideal answer. It must encode where the answer comes from so you can test retrieval and citation behavior. Treat this as Milestone 1: create a gold dataset and an evaluation protocol that someone else could run and get the same results.
Start by defining your use cases and failure modes. Collect questions that represent: (1) straightforward fact lookup, (2) multi-sentence synthesis across two sources, (3) “not in corpus” questions that should trigger refusal or escalation, and (4) ambiguous questions where the best response is to ask a clarifying question. Ensure coverage across document types (policies, manuals, tickets) and across time (old vs. newly ingested content).
For each example, store a structured record: id, question, gold_answer (short, checkable), and gold_citations. Citations should be stable identifiers, not raw URLs alone: include document_id, version, and a span locator such as chunk_id or character offsets. If your pipeline supports versioned indexes, record the index version used to author the gold label; otherwise, future re-chunking will invalidate your references.
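A gold record along these lines could look like the following. The identifiers and the citation shape are illustrative; adapt them to your own index:

```python
# One gold example with stable, versioned citations (illustrative values).
gold_example = {
    "id": "gold-0042",
    "question": "How long is the refund window?",
    "gold_answer": "Refunds are accepted within 30 days of purchase.",
    "gold_citations": [
        {
            "document_id": "policy-refunds",
            "version": "2024-06-01",
            "chunk_id": "policy-refunds#c3",  # or character offsets
        }
    ],
    "index_version": "idx-v7",  # index used when authoring the label
}
```

Because the citation carries `document_id`, `version`, and a span locator, a later re-chunking is detectable rather than silently invalidating the label.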
Common mistakes: letting authors label citations without verifying spans; mixing multiple answers that are “all acceptable” without encoding alternatives; and leaking the gold answer into prompts during evaluation. Your protocol should specify: how examples are created, how they’re reviewed (two-person check is ideal), and how you handle updates (e.g., “freeze a quarterly gold set; add new cases as regressions are found”).
Milestone 2 is implementing scoring that reflects what “good” means for your product. For RAG, you want a small, interpretable set of metrics rather than a single opaque score. At minimum, measure: relevance (did we answer the question?), faithfulness (are claims supported by retrieved sources?), completeness (did we cover required points?), and refusal quality (did we decline appropriately when the corpus lacks evidence?).
Automatic metrics are useful but must be chosen carefully. Exact match is often too strict, while generic semantic similarity can reward fluent hallucinations. Practical approach: use a claim checklist for gold answers. Represent the gold answer as 2–6 atomic claims and score completeness as the fraction of claims present. If you need automation, use an LLM or NLI model to detect whether each claim is entailed by the system answer, but keep the unit of evaluation small and auditable.
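The claim-checklist idea can be sketched with a pluggable entailment check. The string-containment stub below is a deliberately naive placeholder; in practice you would swap in an NLI model or LLM call:

```python
def completeness_score(claims, answer, is_entailed):
    """Fraction of gold claims judged present in the system answer.

    `is_entailed(claim, answer)` is the pluggable unit of evaluation:
    keep it small and auditable, whatever backs it.
    """
    if not claims:
        return 0.0
    hits = sum(1 for claim in claims if is_entailed(claim, answer))
    return hits / len(claims)

def naive_entailed(claim, answer):
    """Stub check: claim counts as present if its phrase appears verbatim."""
    return claim.lower() in answer.lower()
```

Scoring 2–6 atomic claims this way keeps completeness interpretable: a 0.5 means a specific claim is missing, which you can read off directly.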
Faithfulness in RAG should be tied to citations. Define “supported” as: for each claim in the answer, at least one cited chunk contains sufficient evidence. This becomes a measurable contract: missing citations, irrelevant citations, and fabricated citations are distinct errors. Refusal quality should also be explicit: if the question is not answerable from the corpus, the system should (1) say it cannot verify, (2) avoid making up specifics, and (3) optionally suggest next steps (ask for document, contact owner). Score refusals separately from standard QA so they do not tank “accuracy” unfairly.
Common mistakes: grading answers without considering whether they were grounded; treating “helpful” but unsupported content as a win; and ignoring refusal behavior until production incidents occur. Your harness should make these failure modes visible in separate columns so engineering can act on them.
Milestone 3 is adding retrieval evaluations so you can localize regressions. End-to-end metrics can look stable even when retrieval degrades—because the generator compensates with prior knowledge or lucky phrasing. In production RAG, you want retrieval to be measurably good on its own.
For each gold example, you already have gold citations. Use them to compute recall@k: whether any retrieved chunk among the top-k matches a gold citation (or belongs to the same document/span range). Compute this at multiple k values (e.g., 3, 5, 10) because engineering decisions depend on it: a higher k increases cost and context length, but may be necessary for recall. Add MRR (mean reciprocal rank) to capture ranking quality: if the first relevant chunk appears at rank 1, MRR is high; if it appears at rank 10, MRR drops even though recall@10 may be fine.
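Both metrics are a few lines each. This sketch matches chunks by ID; matching by document or span range is a straightforward extension:

```python
def recall_at_k(retrieved_ids, gold_ids, k):
    """1.0 if any of the top-k retrieved chunks is a gold citation, else 0.0."""
    return 1.0 if any(cid in gold_ids for cid in retrieved_ids[:k]) else 0.0

def reciprocal_rank(retrieved_ids, gold_ids):
    """1/rank of the first relevant chunk; 0.0 if none is retrieved."""
    for rank, cid in enumerate(retrieved_ids, start=1):
        if cid in gold_ids:
            return 1.0 / rank
    return 0.0

def mrr(runs):
    """Mean reciprocal rank over (retrieved_ids, gold_ids) pairs."""
    return sum(reciprocal_rank(r, g) for r, g in runs) / len(runs)
```

Computing recall at several k values from the same ranked lists costs nothing extra, so there is no reason not to report k = 3, 5, and 10 side by side.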
Next, measure citation accuracy in the final response: do the cited chunks actually appear in the retrieved set, and do they support the claims? A practical proxy is "citation-in-retrieved-set rate" (are we citing something we did not retrieve?) plus "citation relevance rate" (are cited chunks among those judged relevant to the question?). If you use a reranker, evaluate both pre- and post-rerank retrieval lists to see where mistakes are introduced.
Separate retrieval evaluation also helps you tune cost budgets: you can justify k=5 instead of k=10 when recall@5 is already high, or you can invest in reranking rather than expanding context.
LLM-as-judge is powerful for nuanced metrics like faithfulness and refusal quality, but it introduces its own failure modes. A judge can be biased toward eloquent answers, can miss subtle citation mismatches, and can be non-deterministic across runs. This section turns Milestone 2 into production-grade practice: use judges, but calibrate them and bound their influence.
Start with a clear judge prompt contract. Provide: the question, the system answer, and the retrieved evidence chunks (or cited chunks). Instruct the judge to score specific criteria and to quote the evidence used to justify a score. Avoid asking, “Is this correct?” in a vague way; instead ask, “For each claim, is it supported by the provided evidence?” If you require refusal behavior, include explicit judge rules for when refusal is correct versus when it is evasive.
Calibration is non-negotiable. Create a small “judge calibration set” with obvious good and bad examples, including: correct answer with correct citations, correct answer with wrong citations, fluent hallucination, and proper refusal. Run the judge multiple times and inspect variance. If your judge is unstable, reduce temperature, constrain output schema (JSON), and simplify the rubric. Consider using two judges (different models or prompt variants) and taking a conservative aggregate (e.g., minimum faithfulness score) when you care about risk.
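Constraining the judge to a JSON schema and aggregating conservatively might look like this sketch. The schema keys and the per-claim verdict shape are assumptions about your judge prompt, not a standard:

```python
import json

JUDGE_SCHEMA_KEYS = {"claim", "supported", "evidence_quote"}

def parse_judge_output(raw_json):
    """Parse a judge response constrained to a JSON list of claim verdicts.

    Rejecting records that miss required keys makes malformed judge output
    fail loudly instead of silently skewing scores.
    """
    records = json.loads(raw_json)
    for rec in records:
        if not JUDGE_SCHEMA_KEYS <= rec.keys():
            raise ValueError(f"judge record missing keys: {rec}")
    return records

def faithfulness(records):
    """Fraction of claims the judge marked as supported by the evidence."""
    return sum(rec["supported"] for rec in records) / len(records)

def conservative(scores):
    """Aggregate across judges by taking the minimum (risk-averse)."""
    return min(scores)
```

Requiring `evidence_quote` per verdict is what makes judge decisions auditable: a "supported" claim with an empty quote is itself a red flag during calibration.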
Finally, treat the judge as a measuring device with a cost. Track judge token usage and keep a “fast mode” (automatic metrics only) for PR checks, with a “full mode” nightly run that includes judges and deeper analyses.
Regression testing fails when teams expect one run to be definitive. RAG quality is noisy: retrieval depends on approximate nearest neighbors, generation depends on sampling, and even judges add variance. Statistical thinking helps you set thresholds that catch real regressions without blocking releases for random fluctuation.
First, quantify variance. For a subset of examples, run the system multiple times (or with fixed seeds where possible) and measure standard deviation for key scores. If you see high variance, tighten determinism: set temperature low for evaluation, fix prompt templates, and pin model versions. For retrieval, ensure index versions are immutable and that evaluation queries do not mix corpora states.
Second, use confidence intervals rather than single-point comparisons. If your gold set has 200 examples and faithfulness improves from 1.62 to 1.65 on a 0–2 scale, that might not be meaningful. Conversely, a drop in recall@5 from 0.82 to 0.76 may be highly meaningful. Bootstrap resampling is a practical technique: resample examples with replacement and compute a distribution of the metric difference. Use that to decide whether the change is likely real.
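A paired bootstrap over per-example scores is only a few lines. This sketch uses the standard library's `random` for portability; a NumPy version would be faster:

```python
import random

def bootstrap_diff_ci(baseline, candidate, iters=2000, alpha=0.05, seed=0):
    """Approximate (1 - alpha) CI for the mean difference (candidate - baseline)
    via paired bootstrap resampling over examples."""
    rng = random.Random(seed)  # seeded for reproducible CI bounds
    n = len(baseline)
    diffs = []
    for _ in range(iters):
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(candidate[i] - baseline[i] for i in idx) / n)
    diffs.sort()
    lo = diffs[int(alpha / 2 * iters)]
    hi = diffs[int((1 - alpha / 2) * iters) - 1]
    return lo, hi
```

If the interval excludes zero, the change is likely real; if it straddles zero, treat the comparison as inconclusive rather than blocking the release.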
Third, watch for drift signals. In production, the corpus changes; your test set should evolve. Add “canary” subsets: recent documents, high-traffic intents, and known fragile areas (policies that change often). Track metric trends over time and alert on slope changes, not just threshold breaks. Pair quality drift with operational drift: token usage rising may indicate longer contexts (maybe due to retrieval returning longer chunks), which can foreshadow latency and cost issues.
Statistical discipline turns evaluation from “opinions about answers” into an engineering signal that can drive safe iteration.
Milestone 4 and Milestone 5 come together in CI: you need regression tests that run reliably on every change and produce artifacts that reviewers can trust. The goal is not to run the most expensive eval every time; it is to run the right eval at the right cadence and to preserve evidence.
Define tiers. In pull requests, run a small “smoke eval” (e.g., 20–50 examples) with deterministic settings and retrieval metrics (recall@k, MRR) plus basic formatting/citation checks. Nightly, run the full suite with judge scoring, completeness, refusal quality, and deeper breakdowns by category. In release pipelines, run the full suite against the exact model and index versions you will deploy.
Baselines and gates must be explicit. Store a baseline results file (or a metric snapshot in your experiment tracker) tied to a specific index version, prompt version, and model version. In CI, compare current metrics to baseline with thresholds such as: recall@5 must not drop by more than 0.02; citation accuracy must be ≥ 0.90; p95 latency must be ≤ a target; cost per query must be within budget. Make thresholds asymmetric when appropriate: allow improvements freely, but require review for degradations.
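Asymmetric gates of this kind reduce to a small comparison function. The threshold values below are the illustrative ones from the text, not recommendations:

```python
def check_gates(baseline, current, gates):
    """Compare current metrics to a stored baseline under asymmetric gates.

    gates maps metric name -> rule dict with "max_drop" (regression budget
    vs. baseline) and/or "min" (absolute floor). Improvements always pass.
    Returns a list of human-readable failures; empty list means the gate holds.
    """
    failures = []
    for metric, rule in gates.items():
        value = current[metric]
        if "min" in rule and value < rule["min"]:
            failures.append(f"{metric}={value} below floor {rule['min']}")
        if "max_drop" in rule and baseline[metric] - value > rule["max_drop"]:
            failures.append(
                f"{metric} dropped {baseline[metric] - value:.3f} "
                f"(allowed {rule['max_drop']})"
            )
    return failures
```

Wiring this into CI is then just `sys.exit(1)` when the failure list is non-empty, with the list printed into the build log as evidence.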
For certification submission, produce an evaluation report artifact: methodology, dataset description, metric definitions, baseline versions, thresholds, and a short analysis of failures and planned fixes. The credibility of your RAG system is not the best answer it can produce—it is the repeatability and transparency of how you measure it.
1. Why does Chapter 5 argue a production RAG system is not “done” after it answers correctly once?
2. What is the main engineering reason to treat the RAG pipeline as a chain (ingestion → indexing → retrieval → …)?
3. Which approach best reflects the chapter’s key principle for evaluation design?
4. What trade-off does the chapter highlight when increasing retrieval parameter k?
5. What is the intended CI-friendly deliverable of the evaluation harness described in Chapter 5?
By this point in the capstone, you likely have a working RAG flow: ingestion produces chunked, metadata-rich documents; retrieval returns relevant context; generation produces cited answers; tracing and evaluations catch regressions. Chapter 6 turns a “working demo” into a production-ready service by enforcing cost budgets, tuning performance with explicit trade-offs, hardening security, and packaging the project so it can be assessed (and trusted) by reviewers.
The theme is engineering judgment under constraints. In production, the best architecture is the one that stays within spend limits, fails safely, and is operable by others. You will implement budgets and optimizations (Milestones 1–2), then apply auth, rate limits, and secrets management (Milestone 3). Finally, you’ll containerize and deploy with environment-based configuration (Milestone 4) and produce capstone-quality artifacts: README, diagrams, and a demo script that prove the system meets a rubric (Milestone 5).
As you work through the sections, keep one principle front and center: every control must be measurable. A budget without telemetry is a suggestion; telemetry without enforcement is a dashboard. Your goal is a closed loop: measure → decide → enforce → verify.
Practice note for Milestone 1 (implement token and request budgets with enforcement): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone 2 (optimize spend via caching, batching, and model routing): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone 3 (add auth, rate limiting, and secrets management): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone 4 (containerize and deploy with environment-based configs): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone 5 (final capstone presentation: README, diagrams, and demo script): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Cost in a production RAG system is multi-dimensional. Tokens are the most visible line item, but embeddings, vector operations, and network egress often decide whether you can scale. Start by modeling cost per request as a sum of components: (1) prompt tokens + completion tokens for the LLM, (2) embedding tokens for ingestion and for query-time embedding (if you embed queries), (3) vector database reads and compute (similarity search, filtering, reranking), and (4) data transfer or egress when moving documents across networks or regions.
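The four-component model above can be made config-driven in a few lines. All rates below are illustrative placeholders, not real vendor prices:

```python
def cost_per_request(usage, rates):
    """Sum the per-request cost components: LLM tokens, embedding tokens,
    vector-store reads, and egress. `rates` carries illustrative $/unit values
    that you would replace with your providers' actual pricing."""
    return (
        usage["prompt_tokens"] * rates["prompt_token"]
        + usage["completion_tokens"] * rates["completion_token"]
        + usage["embedding_tokens"] * rates["embedding_token"]
        + usage["vector_reads"] * rates["vector_read"]
        + usage["egress_gb"] * rates["egress_gb"]
    )
```

Feeding measured averages (tokens in/out, chunks retrieved, embedding calls) through this function is what turns "cost per request" into a first-class trace attribute.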
Tokens: In RAG, tokens come from the user message, your system prompt, the retrieved context, tool/function call wrappers, and the model’s output. The common mistake is budgeting only for “user input + answer,” then being surprised by the context window cost. Track and attribute tokens by source: system, user, retrieval context, and completion. This lets you target reductions (e.g., compress context, reduce top-k, or shorten templates) without harming answer quality.
Embeddings: Ingestion embedding cost depends on chunk size and volume. Chunking too small increases embedding calls and index size; too large can reduce retrieval quality. Query-time embedding is usually cheaper per request but adds up under high QPS. Prefer caching query embeddings for repeated questions and ensure you don’t re-embed identical text because of whitespace differences—normalize inputs.
Practical outcome: build a spreadsheet or config-driven model with per-unit costs and measured averages (tokens in/out, retrieved chunks, embedding calls). Feed those numbers back into your tracing so your “cost per request” is a first-class metric, not an afterthought.
Milestone 1 is enforcement: implement token and request budgets with clear guardrails. The goal is to prevent runaway spend from bugs, abuse, or unexpected traffic. Start with three layers of control: (1) hard limits that block or degrade requests, (2) soft limits that warn and alert, and (3) per-tenant/per-user quotas to protect fairness.
Hard limits: Cap maximum prompt tokens, retrieved context tokens, and maximum completion tokens. Enforce a maximum number of retrieval results (top-k) and a maximum document length per chunk included in context. When a request exceeds limits, do not “just truncate silently.” Return a controlled response: reduce top-k, switch to a cheaper model, or ask the user to narrow the question. Also enforce max tool calls and max retries to avoid loops.
Quotas and per-user caps: Add per-minute request limits and daily token budgets keyed by API key, user ID, or tenant ID. Store counters in a low-latency backing store (Redis is typical) with time windows. A common mistake is counting only successful calls; count attempted calls too, or attackers can burn your budget through repeated failures.
Practical outcome: a budget module that returns a decision object (allow / degrade / deny) and logs its reasoning. This makes spend predictable and reviewable, which is essential for both production and certification assessment.
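A minimal version of that decision object might look like the sketch below. The caps are illustrative placeholders, and a real module would also consult the quota store:

```python
from dataclasses import dataclass

@dataclass
class BudgetDecision:
    action: str    # "allow" | "degrade" | "deny"
    reason: str    # logged so spend decisions are reviewable

def check_budget(prompt_tokens, daily_tokens_used, *,
                 hard_prompt_cap=8000, soft_prompt_cap=4000,
                 daily_token_budget=1_000_000):
    """Three-layer check: exhausted daily budget or hard cap -> deny;
    soft cap -> degrade (reduce top-k, switch to a cheaper model);
    otherwise allow. All thresholds here are placeholders."""
    if daily_tokens_used >= daily_token_budget:
        return BudgetDecision("deny", "daily token budget exhausted")
    if prompt_tokens > hard_prompt_cap:
        return BudgetDecision("deny", f"prompt exceeds hard cap {hard_prompt_cap}")
    if prompt_tokens > soft_prompt_cap:
        return BudgetDecision("degrade", "soft cap exceeded; reduce top-k")
    return BudgetDecision("allow", "within budget")
```

Because the decision carries its reason, the API layer can both act on it and log it, which is exactly the reviewability the milestone asks for.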
Milestone 2 is optimization: reduce spend without breaking quality. Performance tuning in RAG is always a trade-off triangle: latency, cost, and answer quality. Your job is to move the frontier through caching, batching, and model routing, then verify improvements using the evaluation harness from earlier chapters.
Caching: Cache at multiple points: query embeddings, retrieval results for repeated questions, and final responses when the question+policy context is identical. Use a cache key that includes the index version and retrieval parameters (top-k, filters) so you don’t serve stale context after re-indexing. The common mistake is caching only the final answer; caching retrieval results often yields bigger wins because it reduces both vector ops and context token usage.
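A cache key that bakes in the index version and retrieval parameters can be built like this sketch, using only the standard library:

```python
import hashlib
import json

def retrieval_cache_key(query, index_version, top_k, filters):
    """Cache key for retrieval results. Including index_version and the
    retrieval parameters guarantees stale context is never served after
    re-indexing; normalizing the query lets whitespace and case variants
    share one entry."""
    normalized = " ".join(query.lower().split())
    payload = json.dumps(
        {"q": normalized, "idx": index_version, "k": top_k, "f": filters},
        sort_keys=True,  # stable serialization -> stable hash
    )
    return hashlib.sha256(payload.encode()).hexdigest()
```

The same pattern works for query-embedding caches; only the payload fields change.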
Batching: During ingestion, batch embedding calls for throughput and cost efficiency. For online requests, batch only when you can tolerate small delays (e.g., internal tools). Never batch across tenants in a way that leaks data; keep isolation boundaries explicit.
Model routing: Route by task complexity. Use a cheaper model for classification, query rewriting, or “answerability” checks, and reserve the expensive model for final generation when the system predicts high value. Another effective pattern is progressive generation: start with a small model to draft and escalate to a larger model only if evaluations (or heuristics) detect low confidence, missing citations, or high-risk domains.
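A routing policy of this shape can be captured as a single function. The model names, task labels, and thresholds below are all placeholders for your own policy:

```python
def route_model(task, prompt_tokens, confidence=None):
    """Illustrative routing policy: cheap model for auxiliary tasks,
    escalation to the large model on low confidence or long context,
    mid-tier model otherwise. Names and thresholds are assumptions."""
    if task in {"classify", "rewrite_query", "answerability"}:
        return "small-model"
    if confidence is not None and confidence < 0.6:
        return "large-model"   # escalate low-confidence drafts
    if prompt_tokens > 6000:
        return "large-model"   # long contexts justify the bigger model
    return "mid-model"
```

Keeping the policy in one pure function makes it trivially unit-testable, which is what the regression test in the next paragraph needs.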
Practical outcome: a documented routing policy with measurable thresholds (token counts, confidence scores, or latency budgets) and a regression test that confirms you didn’t trade away faithfulness for savings.
Milestone 3 is security hardening. Production RAG systems are attractive targets because they combine data access with generative capabilities. Treat security as a checklist plus continuous verification via logs and tests.
Authentication and authorization: Require auth for all non-public endpoints. Use scoped API keys or OAuth tokens, and implement authorization checks for document access (row-level or metadata-based). A frequent mistake is applying auth to the API but not to retrieval filters—ensure the retriever filters by tenant/user permissions so the model never sees unauthorized chunks.
Rate limiting: Apply rate limits per user and per IP, and separate “cheap” and “expensive” routes. Rate limiting complements budgets: budgets protect money over time; rate limits protect availability in the moment. Log rate-limit decisions to support incident response.
Secrets management: Never bake secrets into images or repos. Load keys from environment variables or a secrets manager, rotate regularly, and ensure traces do not capture sensitive headers or tokens. Redact prompts if they may contain PII; at minimum, mask known patterns and provide a safe logging mode for production.
Prompt injection defenses: Assume retrieved documents can be hostile. Apply a “retrieval firewall”: strip or annotate instructions from documents, enforce system-message precedence, and restrict tool/function calling to allowlisted operations. Validate tool arguments and refuse to execute actions derived solely from retrieved text. Add a policy that the model must cite sources and decline if citations are missing or retrieval is low confidence.
Practical outcome: a security checklist in your README, plus automated checks (linting for secrets, integration tests for authorization filters, and a few adversarial prompt-injection regression cases).
Milestone 4 is deployment readiness. A capstone that runs only on your laptop is not production. Containerize the app and make it configurable per environment (dev/staging/prod) without code changes.
Docker: Use a multi-stage build: one stage for dependency installation and build artifacts, another minimal runtime stage. Pin versions, run as a non-root user, and include health checks. Expose only the needed port and keep the image small to reduce cold-start and vulnerability surface.
Environment-based configuration: Use explicit configuration objects: model names, token caps, top-k, reranker flags, cache TTLs, and budget thresholds should be config, not constants. Separate “safe defaults” for development from strict production settings. The common mistake is letting debug logging or permissive CORS slip into production—tie those toggles to environment.
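An explicit configuration object along these lines keeps tunables out of the code. The environment variable names are illustrative:

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class AppConfig:
    model: str
    top_k: int
    max_context_tokens: int
    debug_logging: bool

def load_config(env=None):
    """Build config from environment variables with safe development
    defaults; production deployments override every value explicitly.
    Variable names here are placeholders."""
    env = os.environ if env is None else env
    return AppConfig(
        model=env.get("RAG_MODEL", "small-model"),
        top_k=int(env.get("RAG_TOP_K", "5")),
        max_context_tokens=int(env.get("RAG_MAX_CONTEXT_TOKENS", "4000")),
        debug_logging=env.get("RAG_DEBUG", "false").lower() == "true",
    )
```

Because `debug_logging` defaults to false unless explicitly enabled, the "debug logging slips into production" failure mode requires an affirmative mistake rather than an omission.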
Migrations and index versioning: Treat your vector index like a database. When you change chunking, embedding models, or metadata schema, create a new index version. Provide a migration plan: backfill embeddings, validate retrieval quality via your eval harness, then cut over. Keep a rollback plan: ability to switch back to the previous index and model routing policy if production metrics degrade.
Practical outcome: a reproducible deployment that can be launched with a single command plus environment variables, and a written rollback procedure that explains exactly what toggles to flip when something goes wrong.
Milestone 5 is delivery: package your work so a reviewer can verify outcomes quickly. Your capstone should read like a professional project handoff—clear artifacts, explicit evidence, and a repeatable demo.
README as the control center: Include architecture overview, setup steps, and a “Why these choices?” section that explains cost, quality, and security trade-offs. Provide a table that maps course outcomes to evidence (links to code modules, dashboards, and evaluation reports). Common mistake: a README that lists features but does not show proof. Add screenshots or exported metrics from tracing (token usage, latency, error rates, retrieval stats) and a sample evaluation run showing relevance/faithfulness metrics and regression gates.
Diagrams: Include at least two: (1) system architecture (client → API → retriever → vector DB → LLM) with trust boundaries, and (2) ingestion/indexing pipeline with versioning. Label where budgets are enforced, where caching occurs, and where secrets live.
Demo script: Write a step-by-step script that exercises: a normal query with citations, a low-retrieval scenario that triggers a safe fallback, a budget-exceeding request that gets degraded/denied gracefully, and a prompt-injection attempt that is neutralized. The demo should also show how to find the trace for a request and how to read token attribution and cost per request.
Practical outcome: a reviewer can clone, configure, run, and validate the entire system in under 30 minutes, and your artifacts make it obvious that budgets, security, and deployment readiness were implemented intentionally—not incidentally.
1. What is the primary shift Chapter 6 targets when moving from a “working demo” RAG system to a production-ready service?
2. Which statement best reflects the chapter’s principle about budgets and telemetry?
3. Which milestone combination is specifically aimed at reducing spend through performance techniques rather than access control?
4. How does Chapter 6 characterize the role of engineering judgment in production constraints?
5. What deliverable set best demonstrates capstone readiness to reviewers according to Milestone 5?