Career Transitions Into AI — Beginner
Build a job-ready retrieval pipeline: chunk, embed, rerank, and evaluate.
Vector search is the backbone of modern Retrieval-Augmented Generation (RAG) systems—used in AI assistants for documentation, support tickets, policy search, and internal knowledge bases. This course is a short technical book disguised as a hands-on build: you’ll start from a simple baseline and progressively construct a complete retrieval pipeline with chunking, embeddings, reranking, and evaluation.
It’s designed for career changers who want a credible, end-to-end project that demonstrates applied AI skills without requiring a deep ML background. By the end, you’ll have a portfolio-ready retrieval system with measurable improvements and a clear story you can explain in interviews.
Many beginners stop at “I created embeddings and queried a vector database.” In real work, the hard parts are upstream and downstream: data ingestion, chunking, metadata design, ranking quality, and proving impact with evaluation. This course emphasizes those practical steps.
You’ll move in a straight line from fundamentals to production readiness. Chapter 1 defines what “good retrieval” means and sets a baseline. Chapter 2 builds the ingestion and chunking layer—often the biggest determinant of quality. Chapter 3 handles embeddings and indexing, including how to tune retrieval parameters. Chapter 4 adds reranking and query improvements, transforming “kind of relevant” into “consistently useful.” Chapter 5 shows you how to evaluate changes with discipline, so you can prove improvements. Chapter 6 packages everything into a portfolio artifact with practical deployment and career framing.
If you can write basic Python and run scripts, you can follow along. You do not need to know transformers, neural networks, or information retrieval theory in advance—those ideas are introduced only as needed to make good engineering decisions.
When you finish, you’ll have more than a demo—you’ll have a pipeline with clear tradeoffs, metrics that support your choices, and a repeatable process for improving retrieval. That’s the difference between “I tried vector search” and “I can ship retrieval systems.”
Ready to build? Register free to start, or browse all courses to compare learning paths.
Machine Learning Engineer, Search & Retrieval Systems
Sofia Chen is a machine learning engineer focused on production search, ranking, and retrieval-augmented generation. She has built embedding and reranking pipelines for customer support, documentation, and enterprise knowledge bases, and mentors career changers moving into applied AI roles.
This course is about building retrieval that you can trust: not just “it seems to work,” but a pipeline with clear success criteria, a baseline, and an evaluation target you can improve. Retrieval-Augmented Generation (RAG) is often described as “LLM + search,” but in production it’s a chain of engineering decisions: how you ingest documents, how you chunk them, what metadata you keep, how you rank results, and how you measure whether you’re getting better.
In this chapter you will define the retrieval problem in concrete terms and map the full RAG pipeline—from raw documents to final answers. You will also set up your project repository and dataset, then run a baseline keyword search to establish a benchmark. That baseline gives you a checkpoint: a working end-to-end system and a clear evaluation target for the vector and reranking improvements that follow in later chapters.
Before writing code, adopt the right mental model: generation is not a substitute for retrieval. The LLM is a reasoning and language component, but it cannot reliably “guess” missing facts. The retrieval system owns coverage (getting the right evidence) and ranking (getting the best evidence first). Your job is to build retrieval that returns the right information for the question with minimal noise, then design the generation step so it uses that evidence correctly.
By the end of the chapter, you should be able to explain when vector search beats keyword search, what “top-k similarity search” actually does, and what you’re going to build: an ingestion pipeline with IDs and metadata, a baseline search benchmark, and a roadmap for adding embeddings, indexing, reranking, and evaluation.
Practice note for “Define the retrieval problem and success criteria”: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for “Map the RAG pipeline from documents to answers”: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for “Set up the project repo, environment, and dataset”: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for “Run a baseline keyword search to establish a benchmark”: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for “Checkpoint: a working baseline and a clear evaluation target”: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
RAG splits responsibility between two components with different failure patterns. Retrieval is responsible for finding the right evidence; generation is responsible for composing an answer grounded in that evidence. When teams say “the model hallucinated,” the root cause is often retrieval: the right content was never retrieved, or it was retrieved but buried below irrelevant chunks, or it lacked the needed surrounding sentences.
Define the retrieval problem as a contract: given a user query, return a ranked list of text chunks (and metadata) such that the answer can be derived from the top results. This contract is testable. For example, a success criterion might be “for 80% of queries, at least one chunk in the top 5 contains the exact policy clause required to answer.” Notice that this is a retrieval metric, not a “nice-sounding” answer metric.
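The contract above can be expressed as a testable check. Here is a minimal sketch, assuming your retriever returns ranked chunk IDs and you keep a small gold mapping from queries to acceptable chunk IDs (the query and IDs below are made up):

```python
def hit_rate_at_k(search, gold, k=5):
    """Fraction of queries with at least one gold chunk in the top-k.

    search: callable(query) -> ranked list of chunk IDs
    gold:   dict mapping query -> set of acceptable chunk IDs
    """
    hits = 0
    for query, expected in gold.items():
        top_k = search(query)[:k]
        if any(chunk_id in expected for chunk_id in top_k):
            hits += 1
    return hits / len(gold)

# Toy example: one query, and a retriever that happens to rank the
# gold chunk second -- still a "hit" at k=5.
gold = {"what is the refund window?": {"policy-refunds-003"}}
search = lambda query: ["faq-012", "policy-refunds-003", "faq-044"]
print(hit_rate_at_k(search, gold, k=5))  # 1.0
```

A criterion like “hit rate at 5 of at least 0.8” then becomes a single number you can recompute after every pipeline change.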
Generation should be designed to respect the contract. Practical patterns include: instructing the LLM to cite retrieved sources, refusing to answer if sources are insufficient, and keeping prompts short so evidence fits within context limits. A common mistake is to push too much responsibility into the LLM prompt (“be accurate”), while skipping the hard work of ingestion, chunking, and ranking. Another mistake is evaluating only final answers without inspecting retrieved chunks; you need both, because you can’t fix what you don’t observe.
In this course you’ll start by writing down success criteria for retrieval, then implement a baseline keyword search. That baseline becomes your initial benchmark: it sets a floor and clarifies what “better” means when you later add embeddings and reranking.
Vector search represents text as a numeric vector (an embedding). Intuitively, embeddings place semantically similar pieces of text near each other in a high-dimensional space. Retrieval becomes a geometry problem: embed the query, compute distances (or similarities) to candidate vectors, and return the top-k closest chunks.
The most common similarity measure is cosine similarity (the angle between vectors), though some systems use dot product or Euclidean distance. The important engineering judgment is consistency: your index, similarity metric, and embedding model must align. For example, if your index scores by raw dot product but your embeddings are not normalized, vector magnitude rather than semantic similarity can dominate the ranking, silently degrading results.
Top-k means you’re retrieving the k most similar chunks. Choosing k is not arbitrary: larger k increases recall (you’re more likely to include the relevant chunk) but also increases noise, latency, and the chance the LLM is distracted by irrelevant context. A practical starting point for many corpora is k=5 to k=20, then adjust based on evaluation and chunk size.
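To make “top-k similarity search” concrete, here is what the index computes, sketched in plain Python over an in-memory list. Real systems delegate this to an ANN index; the chunk IDs and two-dimensional vectors are made up for illustration:

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity: dot product divided by both vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def top_k(query_vec, chunks, k=5):
    """chunks: list of (chunk_id, vector); returns the k most similar IDs."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c[1]), reverse=True)
    return [chunk_id for chunk_id, _ in ranked[:k]]

chunks = [("refund-policy", [1.0, 0.0]),
          ("shipping-faq",  [0.7, 0.7]),
          ("api-docs",      [0.0, 1.0])]
print(top_k([1.0, 0.1], chunks, k=2))  # ['refund-policy', 'shipping-faq']
```

The dials named above map directly onto this sketch: k bounds the result list, the similarity function is swappable, and chunking determines what each vector represents.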
Vector search typically relies on approximate nearest neighbor (ANN) indexes such as HNSW or IVF to be fast at scale. ANN trades a tiny amount of accuracy for large speed gains, which is usually acceptable. A common mistake is blaming embeddings when the index parameters are too aggressive (low recall due to ANN settings). Another mistake is forgetting that vectors only represent what you chunked: if you split documents poorly, even perfect vector search can’t retrieve coherent evidence.
In later chapters you’ll generate embeddings, build an index, and run top-k similarity search. For now, you should be able to articulate what’s happening under the hood and what dials you can tune: chunk size, k, similarity metric, and index parameters.
Real retrieval systems rarely have “clean text files.” You’ll ingest mixed data types, and each one affects cleaning, metadata, and chunking choices. Start by inventorying your sources and deciding what fields you must preserve for traceability and filtering.
FAQs are structured and often map well to retrieval units (one Q/A per chunk). They benefit from metadata like product, version, and locale. Product docs and policies tend to be long, hierarchical, and full of headings; preserving section titles as metadata (or prefixing them into chunk text) improves retrieval. Support tickets can be noisy and conversational; you often need to remove signatures, templated greetings, and personally identifiable information (PII), then store ticket status, category, and timestamps. PDFs and web pages introduce extraction issues: reading order errors, duplicated nav text, and broken paragraphs.
Design your ingestion pipeline with three practical outputs per chunk: (1) a stable chunk_id that won’t change when you re-run ingestion, (2) cleaned text used for embeddings and ranking, and (3) metadata for filtering and debugging (source URL, doc title, section path, created_at, product, access level). A common mistake is using array indices as IDs; the moment you insert a new document, all IDs shift and you can’t track regressions.
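One way to make chunk IDs stable is to derive them from content rather than position. A sketch of the three per-chunk outputs, assuming chunks are serialized as JSON records (the field names and URL are illustrative, not a required schema):

```python
import hashlib
import json

def make_chunk(doc_id, text, metadata):
    # Hash of document ID + chunk text: re-running ingestion yields the
    # same ID unless the text itself changed, so inserting a new
    # document upstream never shifts existing IDs.
    digest = hashlib.sha256(f"{doc_id}::{text}".encode("utf-8")).hexdigest()
    return {"chunk_id": digest[:16], "text": text, "metadata": metadata}

chunk = make_chunk(
    doc_id="policy/refunds",
    text="Refunds are available within 30 days of purchase.",
    metadata={"source_url": "https://example.com/refunds",
              "doc_title": "Refund Policy",
              "section_path": "Refunds > Eligibility",
              "created_at": "2024-01-15"},
)
print(json.dumps(chunk, indent=2))
```

Because the ID is deterministic, you can diff two ingestion runs and track regressions chunk by chunk.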
Chunking strategy depends on data type. FAQs can often be “one entry per chunk.” Long docs usually require splitting by headings first, then by token length with a small overlap. Tickets may require grouping messages into a coherent narrative. In this chapter’s project setup, you’ll pick a dataset and define what “a document” means in your repo so every later step—indexing, reranking, evaluation—operates on consistent units.
To improve RAG, you must classify failures precisely. Three recurring failure modes are hallucinations, missing context, and drift. Each points to different fixes, and mixing them up wastes time.
Hallucinations are often downstream symptoms: the model answers confidently when retrieved evidence is irrelevant or absent. Fixes usually belong to retrieval and prompting: raise retrieval recall, add reranking, require citations, and add refusal behavior when confidence is low. Missing context happens when retrieval finds the right area but the chunk lacks the key sentence or definition. This is frequently a chunking issue (chunks too small, no overlap) or an ingestion issue (PDF extraction dropped a paragraph). Fixes include increasing chunk size, adding overlap, chunking by semantic boundaries (headings), or storing adjacent chunks for “context expansion.”
Drift is when the world changes: policies update, product behavior changes, URLs move, or terminology shifts. Drift breaks systems that rely on stale indexes or hardcoded assumptions. Fixes include re-embedding schedules, incremental indexing, and monitoring based on query logs and evaluation sets refreshed over time.
In this chapter you will establish an evaluation target before adding vector search. That target should include a small set of representative queries with expected source documents (a “gold” mapping). The goal is not perfect labeling; it’s a disciplined way to reproduce failures. A common mistake is only testing with easy queries or queries the builder already knows. Another mistake is changing chunking and embeddings simultaneously; when results change, you won’t know why.
When you later add reranking, these failure categories will help you decide whether to invest in better retrieval recall, better ranking precision, better chunking, or better refusal policies.
The RAG toolchain has three layers: embeddings, indexing/search, and orchestration. You can mix and match, but you should understand the tradeoffs so you don’t over-engineer early.
Embeddings can come from hosted APIs or local models. Hosted APIs reduce ops work and usually offer strong quality, but introduce cost, latency, and data-sharing considerations. Local models offer control and privacy but require model selection, batching, and hardware planning. Regardless of source, treat embedding generation as a reproducible step: log model name, version, and dimensionality so you can compare runs.
Vector databases (or vector-enabled search engines) store embeddings and run top-k similarity search. Options include dedicated vector DBs and general databases with vector extensions. Key selection criteria: indexing algorithm support (HNSW/IVF), filtering performance (metadata filters), hybrid search (keyword + vector), durability, and operational complexity.
Libraries and orchestration tools help you wire ingestion, chunking, retrieval, and evaluation. They speed up experimentation, but can hide critical details like tokenization, normalization, and prompt construction. A common mistake is adopting a framework that makes it easy to demo but hard to debug. In this course, your repo will keep the core pipeline logic explicit: how IDs are formed, how chunks are produced, and how retrieval is evaluated.
Reranking tools come in two main types: bi-encoders (fast, embed query and chunk separately) and cross-encoders (slower, score query+chunk together, usually higher precision). Later you will add a reranking stage and measure the latency/quality tradeoff, but it helps to know now that “vector search” is often a recall-first stage and reranking is a precision stage.
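The recall-then-precision shape can be sketched independently of any specific model. In this sketch, `vector_search` and `cross_encoder_score` are placeholder functions (assumptions, not a real API) standing in for your retriever and reranker:

```python
def retrieve_and_rerank(query, vector_search, cross_encoder_score,
                        recall_k=50, final_k=5):
    # Stage 1 (recall): cast a wide net with cheap vector search.
    candidates = vector_search(query, k=recall_k)
    # Stage 2 (precision): re-score each candidate with a slower,
    # more accurate query+chunk model, then keep the best few.
    reranked = sorted(candidates,
                      key=lambda c: cross_encoder_score(query, c["text"]),
                      reverse=True)
    return reranked[:final_k]

# Toy stand-ins: retrieval returns three candidates; the "reranker"
# scores by words shared with the query.
def toy_search(query, k):
    return [{"id": "a", "text": "reset your password"},
            {"id": "b", "text": "billing and refunds"},
            {"id": "c", "text": "password rotation policy"}][:k]

def toy_score(query, text):
    return len(set(query.split()) & set(text.split()))

print([c["id"] for c in
       retrieve_and_rerank("reset password", toy_search, toy_score,
                           final_k=2)])  # ['a', 'c']
```

Note the asymmetry: recall_k is large because stage 1 only has to surface the answer somewhere, while final_k is small because stage 2 pays per-pair scoring cost.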
This course works best when your project has a crisp spec. Your system’s inputs are a document corpus (mixed formats) and a set of evaluation queries. Your outputs are (1) a ranked list of retrieved chunks with metadata, and (2) optionally, an answer generated from those chunks. Chapter 1 focuses on the retrieval half, because if retrieval is unreliable, generation quality will be unstable.
Set up your repo so each stage is runnable and testable: ingest (load raw data), clean (normalize text, remove boilerplate), chunk (produce chunk text + metadata + IDs), index (store searchable representations), search (query to top-k results), and evaluate (metrics + saved artifacts). Keep artifacts versioned (or at least reproducible) so you can compare changes. A practical workflow is to save chunk JSONL with stable IDs, then build both keyword and vector indexes from that same file.
In this chapter you will run a baseline keyword search to establish a benchmark. Keyword search is not “bad”; it is strong for exact matches, rare terms, error codes, IDs, and names. It also gives you a sanity check that your ingestion and cleaning are not broken. If keyword search can’t find obvious terms from your documents, your pipeline is likely discarding text or producing malformed chunks.
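A baseline does not require a search engine. This sketch scores chunks by query-term overlap with a rough inverse-document-frequency weight; it is not full BM25, and the corpus is made up:

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def keyword_search(query, chunks, k=5):
    """chunks: list of (chunk_id, text); returns top-k chunk IDs by score."""
    n = len(chunks)
    df = Counter()  # in how many chunks does each term appear?
    for _, text in chunks:
        df.update(set(tokenize(text)))

    def score(text):
        term_counts = Counter(tokenize(text))
        # Rare terms (low document frequency) contribute more.
        return sum(math.log(n / df[term]) * term_counts[term]
                   for term in set(tokenize(query)) if term in term_counts)

    ranked = sorted(chunks, key=lambda c: score(c[1]), reverse=True)
    return [chunk_id for chunk_id, _ in ranked[:k]]

chunks = [("faq-1", "How to reset your password in the admin console"),
          ("faq-2", "Billing and refunds for annual plans"),
          ("doc-1", "Password policy: rotation every 90 days")]
print(keyword_search("reset password", chunks, k=2))  # ['faq-1', 'doc-1']
```

If a query containing an exact error code or product name fails here, suspect ingestion before suspecting the ranking.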
Define constraints early: target latency (e.g., under 500ms for retrieval), cost per query, maximum context tokens, update cadence (daily/weekly), and privacy requirements. These constraints will shape chunk size, k, and whether you can afford cross-encoder reranking. Your checkpoint for Chapter 1 is simple and concrete: a working repo, a dataset you can re-ingest deterministically, a keyword baseline that returns reasonable results, and a written evaluation target you will use to judge improvements in the next chapters.
1. Which statement best captures the chapter’s recommended mental model for RAG?
2. In the chapter’s definition of retrieval success, what is the key criterion?
3. Why does Chapter 1 have you run a baseline keyword search early?
4. Which set of items reflects the chapter’s view of key engineering decisions in a production RAG system?
5. What does the chapter describe as an appropriate evaluation target for retrieval quality?
Retrieval quality is rarely limited by your embedding model. In practice, it’s limited by what you feed the model: messy parsing, inconsistent metadata, unstable chunk IDs, and chunks that break meaning. This chapter is about engineering judgment—how to turn raw documents into a chunked corpus that retrieves reliably, is debuggable when it fails, and can be re-ingested without “mysterious” changes in search behavior.
A good ingestion pipeline has four properties: (1) deterministic outputs (same input produces the same chunks and IDs), (2) traceability (every chunk can be traced back to a source and location), (3) retrievability (chunks contain coherent answers, not fragments), and (4) maintainability (you can re-index incrementally as content changes). You’ll apply normalization and cleaning, design metadata for filtering and audits, chunk with overlap using stable IDs, validate chunk quality with quick spot checks, and finish with a reproducible checkpoint: a clean, chunked corpus ready for embeddings and indexing.
As you build, keep the “retrieval contract” in mind: your retriever will return chunks, not documents. If a chunk does not contain a self-contained, citable answer to a likely question, it will not help your RAG system—no matter how good your LLM is.
Practice note for “Normalize and clean raw documents for indexing”: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for “Design metadata (source, section, timestamps, permissions)”: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for “Implement chunking with overlap and stable IDs”: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for “Validate chunk quality with quick manual spot checks”: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for “Checkpoint: reproducible ingestion + chunked corpus”: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Start by treating parsing as a data conversion problem, not a text problem. Your goal is to turn heterogeneous inputs (PDF, HTML, DOCX, Markdown, wiki pages, exported tickets) into a single normalized intermediate representation. A practical target is a sequence of “blocks” (title, heading, paragraph, table row, list item, code block) with positional hints (page number, heading path) and raw text.
Normalization should be deterministic and conservative. Common steps include: standardizing Unicode (NFKC), normalizing whitespace, converting smart quotes, removing repeated line breaks, and joining hyphenated line wraps from PDFs. Preserve signal: don’t blindly lowercase everything (it can break IDs and acronyms), and don’t strip punctuation that matters (e.g., version numbers, CLI flags, legal clauses). If you handle HTML, remove navigation chrome and scripts at the parser stage rather than later in chunking.
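A minimal sketch of these steps; the exact substitutions should be tuned to your corpus, and note that case and meaningful punctuation are deliberately left alone:

```python
import re
import unicodedata

def normalize(text):
    text = unicodedata.normalize("NFKC", text)       # standardize Unicode
    text = text.replace("\u201c", '"').replace("\u201d", '"')  # smart quotes
    text = text.replace("\u2018", "'").replace("\u2019", "'")
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)     # join hyphenated wraps
    text = re.sub(r"\n{3,}", "\n\n", text)           # collapse blank-line runs
    text = re.sub(r"[ \t]+", " ", text)              # normalize spaces/tabs
    return text.strip()

# A PDF-style fragment: non-breaking space, hyphenated line wrap,
# and curly quotes, all normalized deterministically.
print(normalize("Rate\u00a0limits are config-\nurable.   See \u201cLimits\u201d."))
```

Because every step is a pure string transformation, running the pipeline twice on the same input always yields the same output, which is the property you need for stable chunk IDs later.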
Tables and code are special. Tables often become unreadable if flattened; instead, serialize them into a consistent text format like “Header: Value” pairs per row. Code blocks should stay intact; splitting code mid-function creates chunks that never retrieve well. If your documents contain headings, capture a heading hierarchy (e.g., H1 > H2 > H3) because it becomes a powerful chunking primitive later.
Engineering judgment: optimize for reproducibility and debuggability. Save parsed outputs (e.g., JSONL) before chunking so you can inspect failures without re-running brittle parsers. A frequent mistake is letting parsing depend on external state (browser rendering, changing OCR settings) without pinning versions, which makes retrieval drift impossible to explain.
Metadata is what makes vector search operational in real products. Without it, you can’t filter by permissions, limit to a time window, or explain where an answer came from. Design metadata early, because changing it later often forces a full re-index.
A practical schema balances retrieval needs and governance needs. At minimum, every chunk should carry: source (system + URL/path), document_id (stable across runs), chunk_id (stable and unique), section (heading path or logical section), timestamp (published/updated time), and permissions (ACL, tenant, group IDs). If you have multiple corpora, include corpus or domain for routing queries.
Traceability metadata should support “show your work.” Store offsets where possible: page number for PDFs, byte/character offsets for text, or DOM selectors for HTML. This enables precise citations and lets you re-render the exact snippet. When debugging relevance, also store derived fields like language, document_type, and parser_version.
Common mistakes: putting large text into metadata (it bloats indexes), using unstable IDs (like array indexes that change when you re-parse), and forgetting permissions. If you ever plan to serve enterprise users, permissions are not optional; filtering after retrieval can leak sensitive content via embeddings similarity. Prefer pre-filtering (metadata filters in the vector DB) or at least “filter-then-rerank” strategies that guarantee access control.
Chunking is the core lever for retrieval. The “right” chunk is one that matches how users ask questions and how the content is authored. You typically combine patterns rather than picking one.
Heading-based chunking works well for manuals, policies, and docs with clear structure. You group paragraphs under a heading path (e.g., “Security > Key Rotation > Frequency”) and chunk within that scope. This preserves topical cohesion and gives you a natural section metadata field.
Token-window chunking (e.g., 250–500 tokens) is a robust default when structure is weak. It is simple, but it can cut across topic boundaries. Use it as a fallback or within headings. Avoid choosing sizes based on LLM context length alone; chunk size should be chosen for retrieval precision. Smaller chunks increase precision but risk missing context; larger chunks increase recall but can retrieve irrelevant text.
Sentence-based chunking is useful for narrative text where sentence boundaries capture meaning. Group N sentences per chunk while respecting maximum token constraints. This reduces mid-sentence cuts, which often produce low-quality embeddings.
Semantic chunking uses embeddings or similarity to detect topic shifts. It can help in messy documents, but it is harder to make deterministic and can hide bugs. If you use it, pin model versions and thresholds, and record them in metadata so results are explainable.
Stable IDs matter here. A good pattern is document_id + heading_path_hash + chunk_index, where chunk_index increments within a heading scope. If you expect insertions that shift indexes, consider using content-based hashes for chunk IDs (hash of normalized chunk text + document_id). This supports incremental re-indexing and deduplication later.
Overlap is insurance against boundary errors: questions often depend on a sentence that sits right at a cut point. But overlap is also a cost multiplier (more chunks, more embeddings, more storage) and can amplify duplicates in top-k results. Use overlap intentionally, not by habit.
A pragmatic recipe: pick a target chunk size (e.g., ~350 tokens) and add 10–20% overlap (e.g., 50–80 tokens). For sentence-based chunks, overlap by 1–3 sentences. For heading-based chunks, overlap is usually smaller; headings already provide context, so you mainly need protection around short subsections or definitions.
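The recipe can be sketched with a sliding window, using whitespace-split tokens as a stand-in for real tokenizer tokens (sizes are shrunk for the demo; in practice you would count tokens with your embedding model’s tokenizer):

```python
def chunk_with_overlap(tokens, chunk_size=350, overlap=60):
    """Slide a window of chunk_size tokens, stepping by chunk_size - overlap."""
    step = chunk_size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the last window already covers the tail
    return chunks

tokens = [f"w{i}" for i in range(10)]
for chunk in chunk_with_overlap(tokens, chunk_size=4, overlap=1):
    print(chunk)
# w0 w1 w2 w3
# w3 w4 w5 w6
# w6 w7 w8 w9
```

Each window repeats the tail of the previous one, so text sitting exactly at a cut point still appears inside at least one unbroken window.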
Boundary rules prevent context breakage. Never split inside: code blocks, tables, numbered procedures, or legal clauses. Preserve list integrity: splitting a 7-step procedure into separate chunks often makes retrieval fail because the user asks “What is step 5?” and the chunk lacks the surrounding steps. Instead, keep the whole procedure together if it fits, or chunk by sublists while repeating the step title in each chunk.
Validate chunk quality with quick manual spot checks. Sample 20–50 chunks across document types and verify: (1) the chunk reads as a coherent unit, (2) citations/offsets point to the right place, (3) the heading path matches the text, and (4) there’s no obvious truncation. This is the fastest way to catch broken parsing (e.g., missing spaces, scrambled columns) before you waste time embedding garbage.
Common mistakes include: too-large chunks that become “mini-documents” (retrieval becomes noisy), overlap so large that top-k returns near-duplicates, and cutting paragraphs mid-thought because you used character counts instead of tokens or sentences.
Duplicate and boilerplate content quietly degrades retrieval. If every page repeats the same header, footer, cookie banner, or navigation links, those tokens become disproportionately represented in embeddings. The retriever starts matching on “Contact us” instead of the actual policy you need.
Handle boilerplate as early as possible. For HTML, remove repeated DOM regions (nav, footer, sidebar) using selectors or readability extraction. For PDFs, identify repeated lines across pages (e.g., a report title in the header) and drop them. For wiki exports, strip “edit” links, breadcrumbs, and template text.
Deduplication should operate at multiple levels. Document-level dedupe removes identical files or mirrored URLs (canonicalize URLs, strip tracking params). Chunk-level dedupe removes repeated paragraphs across documents (common in policies copied across teams). A practical approach is to compute a hash of normalized text (whitespace collapsed, dates optionally masked) and drop exact duplicates. For near-duplicates, use similarity (MinHash or embedding cosine) with conservative thresholds and log decisions for review.
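The exact-duplicate pass can be sketched as a hash over normalized text; near-duplicate detection (MinHash, embedding cosine) is a separate, more conservative step. The chunk IDs and texts below are made up:

```python
import hashlib
import re

def content_fingerprint(text):
    # Collapse whitespace and lowercase so trivial variants hash the same.
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def dedupe(chunks):
    """chunks: list of (chunk_id, text); keeps the first occurrence."""
    seen, kept, dropped = {}, [], []
    for chunk_id, text in chunks:
        fp = content_fingerprint(text)
        if fp in seen:
            dropped.append((chunk_id, seen[fp]))  # (duplicate, duplicate_of)
        else:
            seen[fp] = chunk_id
            kept.append(chunk_id)
    return kept, dropped

chunks = [("a1", "Refunds within 30 days."),
          ("b7", "Refunds  within 30 days. "),   # whitespace variant of a1
          ("c2", "Shipping takes 5 days.")]
print(dedupe(chunks))  # (['a1', 'c2'], [('b7', 'a1')])
```

Keeping the `duplicate_of` pairs is what preserves the audit trail described above: you can always explain why a chunk disappeared from the index.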
Be careful: aggressive dedupe can erase legitimate repeated definitions that users ask about. The safe strategy is to dedupe only when you can preserve at least one authoritative source and keep metadata that points to it. When you remove duplicates, record duplicate_of references so traceability is maintained and audits can explain why content disappeared from the index.
Once your ingestion works, the next failure mode is drift: content changes, parsers change, chunking changes, and suddenly retrieval results differ between environments. Treat your corpus like a dataset with versions, not like a folder of files.
Implement a reproducible ingestion checkpoint: parsed documents (normalized blocks), chunked outputs (text + metadata), and a manifest that records input sources, timestamps, and pipeline versions (parser_version, chunker_version, normalization_version). Store these artifacts in object storage or a dataset registry so you can re-run embeddings and indexing deterministically.
Incremental re-indexing keeps the system fast and stable. Maintain a stable document_id and compute a document_content_hash from normalized content. On each run: (1) detect added/changed/deleted documents, (2) re-chunk only changed documents, (3) upsert changed chunk IDs, and (4) delete chunks for removed documents. If chunk IDs are content-hash-based, updates naturally create new chunks while old ones can be retired safely.
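The added/changed/deleted detection reduces to a diff of two manifests mapping document_id to content hash (the hash values below are placeholders):

```python
def diff_manifests(previous, current):
    """Each manifest maps document_id -> document_content_hash."""
    added   = [d for d in current if d not in previous]
    changed = [d for d in current if d in previous and current[d] != previous[d]]
    deleted = [d for d in previous if d not in current]
    return added, changed, deleted

prev = {"doc-a": "h1", "doc-b": "h2", "doc-c": "h3"}
curr = {"doc-a": "h1", "doc-b": "h9", "doc-d": "h4"}
print(diff_manifests(prev, curr))  # (['doc-d'], ['doc-b'], ['doc-c'])
```

Only `added` and `changed` documents need re-chunking and upserting; `deleted` documents map to chunk deletions, and everything else is untouched, which keeps re-index runs fast and stable.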
Plan for schema evolution. If you add a new metadata field or adjust chunk sizes, you may need a full rebuild. Minimize full rebuilds by isolating “index-affecting” changes (text normalization, chunking, embedding model) from “non-index-affecting” changes (additional metadata used only for display). Log every run with counts: number of documents parsed, chunks produced, average tokens per chunk, dedupe rate, and errors. These metrics become your early warning system when something breaks.
With this in place, you’ve reached an important milestone: a reproducible ingestion pipeline and a stable, chunked corpus ready for embeddings, vector indexing, and the top-k + rerank workflow you’ll build next.
1. Which situation best illustrates the chapter’s claim that retrieval quality is often limited by ingestion rather than the embedding model?
2. Which set of properties matches the four requirements of a good ingestion pipeline in this chapter?
3. Why does the chapter emphasize designing metadata such as source, section, timestamps, and permissions?
4. What is the main reason to use chunk overlap along with stable chunk IDs?
5. According to the chapter’s “retrieval contract,” what is the most important quality test for a chunk?
In Chapter 2 you built an ingestion pipeline that produces clean text chunks with stable IDs and metadata. Chapter 3 turns that corpus into something you can search in milliseconds. The core idea is simple: represent each chunk as a numeric vector (an embedding), store those vectors in an index optimized for nearest-neighbor lookup, and query the index with an embedded user question to get the top‑k most similar chunks.
The practical challenge is not “how to embed text” but “how to embed reliably at scale and retrieve consistently under latency and cost constraints.” You will make tradeoffs between quality and budget, between recall and speed, and between simplicity and operational complexity. This chapter is where vector search becomes an engineered system rather than a demo.
We’ll proceed in the same order you should implement: build intuition for what embeddings capture (and what they miss), choose an embedding model aligned to your domain and budget, generate embeddings at scale with batching and caching, pick an index type and similarity metric, tune retrieval parameters for fast, stable top‑k, and finally add filters and hybrid signals (BM25 + vectors) when vectors alone are not enough. The chapter ends with a checkpoint: can you run fast retrieval that returns consistent top‑k results as your data grows?
Keep one guiding principle in mind: retrieval errors are usually upstream errors. If you embed the wrong text, choose a mismatched model, use unstable IDs, or tune the index for speed at the cost of recall, your generator will be forced to hallucinate. Retrieval is the foundation of RAG, so treat it like production infrastructure.
Practice note for Choose an embedding model aligned to your domain and budget: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Generate embeddings at scale with batching and caching: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Build a vector index and run similarity queries: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Add filters and hybrid signals to improve relevance: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Checkpoint: fast retrieval returning consistent top‑k results: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Embeddings map text into a vector space where “closeness” approximates semantic similarity. Think of each chunk embedding as a point in a high-dimensional geometry (often 384–3072+ dimensions). When a user asks a question, you embed the query into the same space and look for nearby points; those chunks become your candidate context for RAG.
What makes this work is not magic but training: embedding models learn to place texts that should “match” near each other. Depending on the model, “match” might mean paraphrase similarity, question–answer compatibility, or general topical proximity. This is why some models are better for search than for clustering, and why you should test on your own queries.
Common mistakes: embedding raw, uncleaned text that includes navigation headers, boilerplate, or duplicated footers; embedding chunks that are too large (topic dilution) or too small (missing context); and assuming that embeddings are deterministic across model versions. You should treat your embedding model + preprocessing as a versioned contract. If either changes, your vectors and retrieval behavior change, which can break evaluation comparability.
Practical takeaway: embeddings are a powerful candidate generator. In most RAG systems you will later add reranking (often with a cross-encoder) because “nearby in embedding space” is not the same as “best evidence for this exact question.”
Choosing an embedding model is your first major budget/quality decision. In production, the best model is rarely the one with the best benchmark score; it’s the one that meets latency, cost, and privacy constraints while performing well on your domain queries.
Start by answering four questions: Does the model handle your domain's vocabulary and query style well? Does it meet your query-time latency budget? What will it cost to embed the corpus, and to re-embed when content or models change? And can your data leave your environment, or do privacy constraints require a self-hosted model?
Engineering judgment: do not optimize embeddings in isolation. A common practical pattern is bi-encoder embeddings for recall (fast, precomputable) followed by a cross-encoder reranker for precision (slower, run only on top‑k). This division lets you pick an embedding model that is efficient and stable, then recover ranking quality with reranking.
Model selection workflow: sample 50–200 real queries, define what “good” looks like (which chunk is relevant), then compare models by retrieval metrics (recall@k, MRR). Track not only accuracy but also embedding time, vector dimensionality (affects memory), and consistency across updates. Version your model name, dimension, normalization choice, and preprocessing steps so you can reproduce results later.
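One way to structure the comparison, assuming you have already run each candidate model over the labeled queries and saved its ranked chunk IDs (the `runs` and `labels` shapes here are illustrative):

```python
def compare_models(runs, labels, k=10):
    """runs: {model_name: {query: ranked_chunk_ids}}; labels: {query: set_of_relevant_ids}.
    Returns recall@k and MRR per model so you can compare on the same query set."""
    report = {}
    for model, results in runs.items():
        recall = sum(
            1.0 if set(results[q][:k]) & labels[q] else 0.0 for q in labels
        ) / len(labels)
        reciprocal_ranks = []
        for q in labels:
            # Rank of the first relevant chunk, or None if it never appears.
            rank = next((i for i, d in enumerate(results[q], 1) if d in labels[q]), None)
            reciprocal_ranks.append(1.0 / rank if rank else 0.0)
        report[model] = {"recall@k": recall, "mrr": sum(reciprocal_ranks) / len(reciprocal_ranks)}
    return report
```

Alongside these numbers, record embedding time, vector dimension, and the versioned configuration so the winner is reproducible.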
Once you have vectors, you need an index to search them quickly. The simplest index is flat: store all vectors and compute similarity to every vector at query time. Flat search is exact and easy to implement, but it becomes slow as your corpus grows (tens of thousands might be fine; millions usually are not).
Approximate nearest neighbor (ANN) indices trade a small amount of recall for big speedups. Two common families: graph-based indices such as HNSW, which walk a layered graph of neighbors toward the query, and clustering-based indices such as IVF, which partition vectors into clusters and search only the closest clusters at query time.
Practical defaults: if you’re building your first production-grade RAG index and you can afford the memory overhead, start with HNSW. It is straightforward to tune (increase efSearch for higher recall) and typically delivers strong latency/recall tradeoffs. Use flat search early in development for correctness checks and evaluation baselines; it helps you isolate whether poor results come from embeddings/chunking or from ANN approximation.
Operational considerations: build the index with stable document IDs and persist them with metadata. You will need to delete and update chunks over time; ensure your store supports upserts or tombstones. Also plan for batching and caching during embedding: batch requests to maximize throughput, cache embeddings keyed by (chunk_id, model_version, preprocessing_version) to avoid recomputing when you re-index or experiment.
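The batching-plus-caching pattern can be sketched as follows; `embed_batch_fn` is a stand-in for whatever embedding API you call, and the chunk shape is illustrative:

```python
def cache_key(chunk_id: str, model_version: str, preprocessing_version: str) -> str:
    """Cache key ties a vector to the exact text-and-model contract that produced it."""
    return f"{chunk_id}:{model_version}:{preprocessing_version}"

def embed_with_cache(chunks, embed_batch_fn, cache, model_version,
                     preprocessing_version, batch_size=64):
    """Embed only cache misses, in batches; embed_batch_fn(texts) -> vectors is assumed."""
    missing = [
        c for c in chunks
        if cache_key(c["chunk_id"], model_version, preprocessing_version) not in cache
    ]
    for i in range(0, len(missing), batch_size):
        batch = missing[i:i + batch_size]
        vectors = embed_batch_fn([c["text"] for c in batch])
        for c, v in zip(batch, vectors):
            cache[cache_key(c["chunk_id"], model_version, preprocessing_version)] = v
    return [
        cache[cache_key(c["chunk_id"], model_version, preprocessing_version)]
        for c in chunks
    ]
```

Because the key includes model and preprocessing versions, changing either one naturally invalidates the cache instead of silently reusing stale vectors.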
Common mistake: changing chunking rules or cleaning logic without re-embedding everything. If chunk text changes, the embedding changes; stale vectors silently degrade retrieval. Treat re-indexing as a first-class pipeline step with checkpoints and logs.
Similarity metric choice determines what “nearby” means in your embedding space. The three most common metrics are cosine similarity, dot product, and L2 (Euclidean) distance. Many vector databases expose these as configuration options, and some embedding models are trained with an implicit assumption about which metric you’ll use.
Practical guidance: check the embedding model documentation first. If it recommends normalization, normalize both corpus and query vectors and use cosine (or dot product with normalized vectors). Consistency matters: mixing normalized and unnormalized vectors will cause unstable rankings. Also be careful when comparing scores across queries—similarity scores are usually not calibrated probabilities, so don’t treat “0.82” as universally “good.” Instead, use rank-based metrics and human evaluation.
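The normalization point is easy to verify directly: once both vectors are unit length, the dot product equals cosine similarity, so either metric yields the same ranking. A toy sketch with plain Python lists:

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity = dot product of the normalized vectors."""
    return dot(normalize(a), normalize(b))
```

The failure mode described above is exactly this equivalence breaking: if only one side is normalized, dot-product scores no longer match cosine and rankings become unstable.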
Common mistake: using cosine in the database but forgetting to normalize embeddings when the model expects it. Another mistake is switching metrics midstream and comparing retrieval quality across experiments without re-indexing. The metric is part of the retrieval contract; version it alongside model and preprocessing.
Practical outcome: once your metric is correct, top‑k results become more interpretable. When retrieval is “off,” you can diagnose whether it’s because the query embedding points to the wrong region (model mismatch) or because the metric/index is misconfigured.
Retrieval has two sets of knobs: how many candidates you fetch (top‑k) and how hard the index searches (efSearch for HNSW, probes/nprobe for IVF). These parameters control the speed–recall tradeoff. Your goal is not maximum recall at any cost; it is to hit a latency budget while returning enough good candidates for reranking and generation.
top‑k: In RAG, top‑k is the number of chunks you retrieve before reranking (or directly sending to the LLM). If you plan to rerank, set top‑k higher (e.g., 20–100) to give the reranker room to improve precision. If you skip reranking, top‑k should be smaller and you’ll rely more heavily on embedding ranking quality.
efSearch (HNSW): Higher efSearch increases recall but increases latency. A practical tuning loop is: pick a target latency, then increase efSearch until recall@k plateaus. Measure on your real query set; synthetic queries can hide failure modes. For IVF, increase probes to search more clusters. Again, higher probes yields better recall at higher cost.
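The tuning loop above can be sketched as follows. `search_fn` is a stand-in for your index's query call (its signature is an assumption), and the 0.5-point plateau threshold is an arbitrary example value:

```python
def tune_ef(search_fn, queries, ground_truth, k, latency_budget_ms, ef_values):
    """Pick the smallest efSearch whose recall@k has plateaued, within a latency budget.
    search_fn(query, k, ef) -> (ranked_ids, latency_ms) is assumed."""
    best = None
    for ef in ef_values:
        hits, total_ms = 0, 0.0
        for q in queries:
            ids, ms = search_fn(q, k, ef)
            total_ms += ms
            if ground_truth[q] & set(ids[:k]):
                hits += 1
        recall = hits / len(queries)
        avg_ms = total_ms / len(queries)
        if avg_ms > latency_budget_ms:
            break                      # over budget: stop and keep the previous setting
        if best is not None and recall - best[1] < 0.005:
            break                      # recall plateaued: keep the cheaper previous ef
        best = (ef, recall, avg_ms)
    return best
```

The same loop works for IVF by substituting `probes` for `ef`. Run it on real queries; synthetic queries can hide the failure modes you most need to see.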
Checkpoint behavior: “fast retrieval returning consistent top‑k results” means (1) your query latency is within budget, (2) the top results do not change unexpectedly across runs, and (3) small corpus changes (adding a document) do not catastrophically reorder unrelated queries. If your rankings are brittle, consider increasing efSearch/probes, using a higher-quality embedding model, or adding hybrid signals.
Pure vector search is rarely sufficient on its own. You will often need metadata filters and hybrid retrieval (BM25 + vectors) to handle exact matches, structured constraints, and user intent that embeddings blur.
Metadata filters narrow the candidate set before similarity search or after it, depending on your store. Typical filters: product/version, document type, tenant/customer, language, access control labels, date ranges, and “source=handbook vs tickets.” Filtering improves both relevance and security. The engineering rule is: if a constraint is hard (must be enforced), filter before ranking so disallowed chunks never enter the candidate set.
Hybrid retrieval combines lexical scoring (BM25) with vector scoring. BM25 excels at exact token overlap: error messages, identifiers, names, and rare terms. Vectors excel at paraphrase and semantic similarity. A practical hybrid approach is to run both retrievers, then merge the two ranked lists with reciprocal rank fusion or a weighted score combination, so that a chunk ranked highly by either signal survives into the candidate set.
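Reciprocal rank fusion (RRF) is a common way to merge lexical and vector rankings without comparing their raw scores: each chunk's fused score sums 1/(k + rank) across the lists, with k = 60 a frequently used default. A minimal sketch:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """rankings: list of ranked id lists (e.g., one from BM25, one from vectors).
    Returns ids ordered by fused score; k dampens the impact of any single list."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Because RRF uses only ranks, it sidesteps the problem that BM25 scores and cosine similarities live on incompatible scales.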
Common mistakes: applying filters inconsistently between indexing and querying (e.g., storing language metadata but never filtering by it), or over-filtering so aggressively that recall collapses. Another common pitfall is using hybrid retrieval without evaluation; you can easily inflate irrelevant lexical matches if your query contains common tokens.
Practical outcome: hybrid + filters is where relevance often “snaps into place.” When users complain that the system ignores an exact phrase or keeps citing the wrong version of a policy, hybrid signals and metadata constraints are usually the fix. Once you have this working, you have a retrieval layer that is robust enough to support reranking and offline evaluation in the next chapter.
1. What is the main engineering challenge emphasized in Chapter 3 when moving from text chunks to searchable vectors?
2. Which sequence best matches the recommended implementation order in Chapter 3?
3. In the chapter’s core retrieval loop, what produces the top‑k results?
4. Why does Chapter 3 recommend batching and caching when generating embeddings at scale?
5. Which statement best captures the chapter’s guiding principle about retrieval failures in RAG systems?
By Chapter 3, you can generate embeddings, build an index, and run a top-k similarity search. In practice, that first list of “most similar” chunks often looks plausible while still being wrong for the user’s intent. The gap is rarely because vector search is “bad.” It’s because you asked a bi-encoder embedding model to do two jobs at once: (1) find candidates broadly related to a query and (2) perfectly order them by usefulness for the user’s specific question. Reranking is the engineering pattern that separates those responsibilities.
This chapter focuses on diagnosing why top‑k results miss intent or come back in the wrong order, implementing a reranker, and verifying a measurable relevance lift versus pure vector search. You’ll also learn to make the system practical: controlling latency and cost, boosting recall with query rewriting or multi-query, and adding production safeguards like timeouts, caching, and batching.
When you get this separation right, RAG gets noticeably better with no change to your documents, chunking, or embeddings—because you’re correcting the ranking step where most user-visible failure happens.
Practice note for Diagnose why top‑k results miss intent or ranking order: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Implement a reranker and rerank the candidate set: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Compare latency/cost: cross-encoder vs lightweight rerankers: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Add query rewriting or multi-query to boost recall: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Checkpoint: measurable relevance lift vs pure vector search: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Vector search (bi-encoder retrieval) is a fast way to produce a candidate set: chunks that are semantically related to the query. But “related” is not the same as “answers the question.” Many ranking failures look like this: the top results share vocabulary or topic with the query, yet they miss a constraint (time range, product version, jurisdiction, exception) that the user cares about. This is why top‑k results miss intent even when they are not “off topic.”
Candidate generation should be judged primarily on recall: does the correct chunk appear somewhere in the top N? Reranking should be judged on ordering: is the best chunk near rank 1–3 so the generator can use it? If you try to fix ordering by “just improving embeddings,” you often pay for bigger models and still fail on subtle constraints, because embeddings compress meaning into a single vector and score with a simple similarity function.
A practical workflow is: retrieve top-k (or top-N) candidates using your vector index, then rerank those candidates with a stronger relevance function that reads the query and the chunk text together. This division of labor is also how you diagnose failures: if the right chunk never appears in the candidate pool, that’s a recall problem (chunking, metadata filters, query rewriting). If it appears but ranks low, that’s a reranking problem.
Rerankers come in three common families, each with a different accuracy/latency/cost profile.
Cross-encoders are the classic reranker for retrieval. They take [query] and [document chunk] together and output a relevance score. Because the model attends across both texts, it can learn fine-grained matching: required conditions, negations, and “this is the exception” phrasing that embeddings often blur. The tradeoff is latency: you must run one forward pass per query–chunk pair. If you rerank 50 candidates, that’s 50 model calls (or one batched call with 50 pairs).
LLM-as-reranker uses a general LLM to score or sort candidates. This can work well when the relevance notion is nuanced (e.g., “best procedural answer for a beginner” or “most up-to-date policy”). But it is usually more expensive, more variable, and harder to control deterministically. In regulated settings, you also need to consider whether sending chunks to an external LLM is acceptable.
Heuristics and lightweight rerankers include simple signals: keyword overlap, BM25 score, recency boosts, document authority, section headings, or metadata matches (product version, locale). These are fast and predictable. They are often underrated: a small heuristic boost can correct systematic misorderings (e.g., always prefer “Troubleshooting” sections for error-code queries). The best systems frequently combine a learned reranker with a few guardrail heuristics.
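The combination of a relevance scorer with guardrail heuristics can be sketched behind one interface. The keyword-overlap scorer below is a self-contained stand-in for a learned scorer (in practice you might plug in a cross-encoder there); the candidate shape and `boost_fn` idea are illustrative:

```python
def keyword_overlap_score(query: str, text: str) -> float:
    """Lightweight lexical scorer: fraction of query tokens present in the chunk."""
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t) / max(len(q), 1)

def rerank(query, candidates, score_fn, boost_fn=None, top_k=5):
    """candidates: list of {'chunk_id', 'text', ...} dicts.
    score_fn reads query and chunk text together; boost_fn adds guardrail heuristics."""
    scored = []
    for c in candidates:
        s = score_fn(query, c["text"])
        if boost_fn:
            s += boost_fn(c)   # e.g., prefer Troubleshooting sections for error queries
        scored.append((s, c))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:top_k]]
```

Swapping `score_fn` lets you A/B a heuristic against a cross-encoder with no other code changes, which makes the latency/quality comparison in this section easy to run.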
Reranking only helps if the candidate pool contains the right answer. The key knob is the size of that pool: retrieve N candidates (often 20–200), rerank them, then pass the top k (often 3–10) into your RAG prompt. This creates two “k” values: N for retrieval and k for generation. Confusing them is a frequent source of poor results.
Choose N based on recall needs and corpus ambiguity. For a small, well-structured internal knowledge base, N=30 may be enough. For messy corpora (tickets, forum posts, mixed document types), N=100 is often safer. You can estimate this with offline evaluation: measure “answer chunk found in top-N” across a labeled set, then pick the smallest N that hits your recall target (e.g., 95%).
Choose k based on how much context the generator can handle without dilution. More context is not always better: if you feed 12 chunks, you increase token cost and can confuse the model with near-duplicates or contradictory passages. Many RAG stacks perform best with k=4–6 high-quality chunks. If you need broader coverage, prefer increasing N and improving reranking rather than dumping more chunks into the prompt.
When retrieval recall is the bottleneck—meaning the right chunk is not in your candidate pool—you need to change what you search for. Query rewriting is the simplest lever because it can be implemented before retrieval and does not require re-indexing.
Expansion adds clarifying terms the user implied but didn’t say. For example, “reset MFA” might expand to “reset multi-factor authentication, authenticator app, backup codes.” Expansion can be heuristic (synonym lists) or model-based (an LLM generates 2–5 enriched queries). The risk is drift: expansion terms can accidentally bias retrieval toward a related but incorrect subtopic. Mitigate drift by keeping expansions short and by preserving the original query as one of the candidates.
Decomposition splits multi-part questions into subqueries. “Why did my build fail after upgrading, and how do I roll back?” is two retrieval tasks: find the failure cause (release notes, breaking changes) and find rollback instructions (deployment docs). Retrieve for each subquery, merge candidates, then rerank globally.
Synonyms and normalization are high ROI in enterprise corpora: internal acronyms, product code names, and versioned terminology. Build a small dictionary from your own documents: extract frequent abbreviations and map them to canonical forms. Even with embeddings, this helps because your chunk text may use only one form consistently, and the embedding model may not align rare acronyms well.
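A tiny expansion layer along these lines can sit in front of retrieval. The acronym table is an illustrative example of the dictionary you would build from your own corpus; note the original query is always kept as one of the variants, per the drift mitigation above:

```python
# Illustrative dictionary; in practice, mined from your own documents.
ACRONYMS = {"mfa": "multi-factor authentication", "sso": "single sign-on"}

def expand_query(query: str, max_variants: int = 3):
    """Return the original query plus short expansions; the original always survives."""
    variants = [query]
    lowered = query.lower()
    for abbr, full in ACRONYMS.items():
        if abbr in lowered.split():
            variants.append(lowered.replace(abbr, full))
    return variants[:max_variants]
```

Each variant is retrieved independently; the candidates are then merged and reranked globally, exactly as with decomposed subqueries.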
Reranking improves ordering, but you can also improve the candidate pool quality by representing content in more than one way. Multi-vector strategies are especially helpful when a single chunk embedding cannot capture both local detail and global context.
Document + chunk retrieval is a practical pattern: index embeddings for each chunk, and separately index a “document summary” vector (or the document title + headings). At query time, retrieve top documents using the document vectors, then retrieve top chunks within those documents. This reduces false positives from unrelated documents whose chunks coincidentally look similar. It also reduces the odds that you pick an isolated chunk missing crucial context (definitions, prerequisites, scope).
Multiple chunk views can also help: one embedding for the raw chunk text, another for a “chunk with section header,” and optionally another for a short auto-summary. Some domains benefit from adding a structured representation (e.g., error codes, API names) into a dedicated field and embedding that field separately. You then combine retrieval scores (max, weighted sum, or reciprocal rank fusion) to produce a stronger candidate list before reranking.
Near-duplicate handling matters: multi-vector strategies can increase redundancy (same content retrieved via multiple views). Deduplicate candidates using stable chunk IDs and similarity thresholds, then rerank. Otherwise the reranker wastes budget scoring the same passage repeatedly, and your generator receives repetitive context.
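A minimal dedupe step before reranking keeps the best-scoring copy of each chunk ID (the candidate shape is illustrative; near-duplicate merging by similarity threshold would layer on top):

```python
def dedupe_candidates(candidates):
    """Merge candidates retrieved via multiple views; keep the best score per chunk_id."""
    best = {}
    for c in candidates:
        cid = c["chunk_id"]
        if cid not in best or c["score"] > best[cid]["score"]:
            best[cid] = c
    return sorted(best.values(), key=lambda c: c["score"], reverse=True)
```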
A reranker is only valuable if it is reliable under real traffic. Production systems fail in mundane ways: slow model calls, bursty workloads, or network hiccups. Treat reranking as an optional improvement layer with clear fallbacks.
Timeouts and fallbacks: set a strict budget (for example, 150–300ms for reranking in an interactive app). If the reranker times out, return the original vector-ranked list. Users prefer a slightly worse answer over a spinner. Also log timeouts separately: frequent timeouts usually indicate you chose N too large, you are not batching, or the model host is under-provisioned.
Batching: cross-encoders are much faster when you score many query–chunk pairs in one batch. Implement a rerank endpoint that accepts a list of candidate texts and returns a list of scores. In your app, retrieve N candidates, build pairs, and send one batched request. Batching also makes GPU utilization efficient and cost predictable.
Caching: cache rerank results for repeated queries (exact match and optionally normalized forms). Also cache embeddings retrieval results if your query rewriting produces identical subqueries across users. In enterprise search, “how to reset password” will repeat constantly; caching can cut reranking spend dramatically. Be careful with personalization: include user role, locale, and permission filters in cache keys so you don’t leak restricted content.
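The fallback-plus-caching pattern can be sketched as below. This is deliberately simplified: a real system would enforce the latency budget with a request deadline or async timeout rather than a bare try/except, and `rerank_fn` is a stand-in for your batched rerank endpoint. Note the cache key includes role and locale so restricted content never leaks across users:

```python
import hashlib

def cached_rerank(query, candidates, rerank_fn, cache, user_role="any", locale="en"):
    """Rerank with graceful degradation: on any failure, return the vector-ranked list."""
    key = hashlib.sha256(f"{query}|{user_role}|{locale}".encode("utf-8")).hexdigest()
    if key in cache:
        return cache[key]
    try:
        reranked = rerank_fn(query, candidates)   # one batched call; enforce its own timeout
    except Exception:
        return candidates                          # fallback; log the failure in real code
    cache[key] = reranked
    return reranked
```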
Checkpoint and evaluation: after implementing reranking (and any rewriting), rerun your offline evaluation and do targeted error analysis. Look for cases where reranking helps (correct chunk promoted) versus harms (overconfidently promoting a plausible but wrong chunk). This is the measurable “relevance lift vs pure vector search” moment: you should be able to quantify improved nDCG@k, MRR, or precision@k and explain remaining failure modes with concrete examples.
1. Why do top-k vector search results often look plausible but still fail the user’s intent?
2. In the chapter’s framing, what is the primary goal of candidate generation versus reranking?
3. Which situation is reranking most directly designed to fix?
4. What tradeoff does the chapter highlight when comparing cross-encoder rerankers to lightweight rerankers?
5. How do query rewriting or multi-query techniques help the overall retrieval pipeline described in the chapter?
You can build a retrieval system that “looks right” in a demo and still fails in production. Vector search adds new degrees of freedom—chunking, embedding models, distance metrics, filtering, rerankers—and each choice can quietly degrade results. This chapter gives you a practical evaluation workflow: create a labeled query set with ground-truth references, compute offline retrieval metrics, run ablation studies to isolate what matters, and perform error analysis that turns failures into fixes. The goal is not academic perfection; it is confidence. When you can show an evaluation report with metrics, ablations, and a debugging log, you can defend your system choices in interviews and on the job.
Start with a mindset shift: you are not “testing the model,” you are testing the pipeline. Your index might be missing documents. Your metadata filters might hide the best chunks. Your embedding model might be tuned for short queries but you are embedding long questions. Evaluation is how you catch these issues early and make engineering tradeoffs deliberately.
Throughout this chapter, you will build an evaluation harness that: (1) loads a labeled query set, (2) runs retrieval (and optionally reranking), (3) computes metrics, (4) slices results by tags (topic, doc type, time range), and (5) produces a report you can iterate on. Treat this harness as a first-class artifact—versioned, repeatable, and runnable on every change.
Practice note for Create a labeled query set and ground-truth references: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Compute offline retrieval metrics and interpret them: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Run ablation studies on chunking, embeddings, and reranking: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Perform error analysis and turn failures into fixes: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Checkpoint: an evaluation report you can show in interviews: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Offline retrieval metrics let you quantify whether your system is finding the right evidence before generation. In RAG, the generator can only be as good as what you retrieve, so measure retrieval directly. The most common mistake is reporting a single number without understanding what it rewards. Use a small set of complementary metrics that map to user experience.
Recall@k answers: “Did we retrieve at least one relevant chunk in the top-k?” This is often the most important first metric for RAG because a single good chunk can enable a correct answer. If recall@10 is low, reranking will not save you—rerankers only reorder what you already retrieved.
Precision@k answers: “How many of the top-k are relevant?” Precision matters when your context window is tight or you pass many chunks to the LLM. Low precision can cause distraction, contradictions, and higher token costs. Watch precision@5 when you typically feed 3–5 chunks to the model.
MRR (Mean Reciprocal Rank) rewards placing the first relevant chunk as high as possible. In practice, MRR improves when reranking works, when chunk titles are informative, or when your embedding model better matches your query style. If recall is already high but answers still feel slow or inconsistent, MRR is a strong signal.
nDCG is useful when relevance is graded (e.g., “highly relevant,” “somewhat relevant,” “not relevant”) or when multiple relevant chunks exist and order matters. nDCG helps you distinguish “found something” from “found the best thing first.” It is especially helpful for long documents with many partially relevant sections.
Finally, define what “relevant” means. In retrieval, relevance usually means: “This chunk contains the information needed to answer the query.” It does not mean “same keywords.” Clear definitions make labeling consistent and metrics meaningful.
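To make these definitions concrete, here is a minimal sketch of the four metrics for a single query, using plain Python and no external libraries. The IDs and gain values are illustrative; in practice you would average each metric over your whole gold set.

```python
import math

def recall_at_k(retrieved, relevant, k):
    """1.0 if at least one relevant ID appears in the top-k, else 0.0."""
    return 1.0 if set(retrieved[:k]) & set(relevant) else 0.0

def precision_at_k(retrieved, relevant, k):
    """Share of the top-k results that are relevant."""
    return len(set(retrieved[:k]) & set(relevant)) / k

def reciprocal_rank(retrieved, relevant):
    """1/rank of the first relevant result; 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(retrieved, gains, k):
    """gains maps doc_id -> graded relevance (0 = not relevant)."""
    dcg = sum(gains.get(d, 0) / math.log2(i + 2) for i, d in enumerate(retrieved[:k]))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0
```

Averaging `reciprocal_rank` over all queries gives MRR; averaging `ndcg_at_k` gives the nDCG@k you would report.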
A gold set is a labeled query set paired with ground-truth references (document IDs or chunk IDs) that should satisfy the query. This is the foundation of your offline evaluation. The fastest route to a usable gold set is to start small but representative: 50–200 queries spanning the major user intents, document types, and difficulty levels.
Workflow: (1) collect real queries (search logs, support tickets, internal Slack questions) or write realistic ones if you lack logs; (2) for each query, identify one or more “answer sources” in your corpus; (3) store ground truth as stable identifiers (doc_id + chunk_id) plus optional graded relevance; (4) tag each query with facets like topic, freshness sensitivity, and required permissions.
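One convenient storage format for the workflow above is JSON Lines: one record per query, combining the ground-truth identifiers, optional graded relevance, and facet tags. The field names below are illustrative, not a required schema.

```python
import json

# One gold-set record per query; field names and IDs here are illustrative.
record = {
    "query_id": "q-0042",
    "query": "What is the refund window for annual plans?",
    "expected": [  # ground truth as stable identifiers, with optional grades
        {"doc_id": "policy-billing-v3", "chunk_id": "policy-billing-v3#012", "grade": 2},
        {"doc_id": "faq-refunds", "chunk_id": "faq-refunds#003", "grade": 1},
    ],
    "facets": {"topic": "billing", "freshness_sensitive": True, "requires_role": None},
    "gold_set_version": "v1",
}

# Append one line per query to gold_set_v1.jsonl
line = json.dumps(record)
```

Keeping grades and facets on each record lets you later slice metrics by topic or difficulty without relabeling.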
Labeling can be done manually with a simple review UI or even a spreadsheet that includes: query, expected doc(s), notes, and constraints (must come from policy docs; must be after 2024; must be region-specific). If you use annotators, provide a short rubric with examples of relevant vs. non-relevant chunks. Measure agreement on a small overlap set to catch rubric ambiguity early.
Keep the gold set versioned (v1, v2). When your corpus changes, decide whether to freeze the corpus snapshot for evaluation or update labels intentionally. Treat label updates like code changes: reviewed, documented, and reproducible.
Once you can compute metrics on a gold set, your next job is to run experiments that isolate cause and effect. The discipline here is what turns “I tried a bigger model” into an engineering story: you have a baseline, you change one variable, you measure, and you record.
Choose baselines that are credible. At minimum: (1) keyword baseline (BM25 or hybrid without rerank), (2) vector baseline (bi-encoder embeddings + top-k), and (3) vector + rerank. Even if keyword search is not your final system, it is a valuable sanity check: if BM25 beats your vectors on many queries, your embeddings, chunking, or preprocessing may be misaligned.
Controlled changes: Change one of these at a time: chunk size/overlap, embedding model, query preprocessing, index parameters, metadata filters, top-k, reranker model, rerank depth (e.g., rerank top 50), or prompt/context length. If you change chunking and embeddings simultaneously, you will not know what caused the gain or regression.
Reproducibility: Fix random seeds where applicable and log all versions: code commit, embedding model name, reranker name, corpus snapshot hash, chunker config, and index build parameters. Deterministic runs matter because small metric differences can be noise.
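A lightweight way to enforce this logging discipline is to derive a deterministic run ID from the full config and persist config plus metrics together. This is a sketch, not a framework; the file layout and field names are assumptions you can adapt.

```python
import hashlib
import json
import time
from pathlib import Path

def run_id(config: dict) -> str:
    """Deterministic ID from the full config: identical configs map to the same run."""
    blob = json.dumps(config, sort_keys=True)
    return hashlib.sha256(blob.encode()).hexdigest()[:12]

def log_run(config: dict, metrics: dict, out_dir: str = "runs") -> str:
    """Persist one experiment: config + metrics in a single reviewable JSON file."""
    rid = run_id(config)
    record = {"run_id": rid, "timestamp": time.time(),
              "config": config, "metrics": metrics}
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    (out / f"{rid}.json").write_text(json.dumps(record, indent=2))
    return rid
```

Because the ID is a hash of the config, accidentally rerunning the same setup overwrites the same file instead of silently creating a "new" result.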
When you publish results, include both metrics and operational costs: index size, ingestion time, retrieval latency, and reranking latency. In interviews, this shows you can balance quality with real constraints.
When metrics drop or users complain, resist the urge to immediately swap models. Debugging retrieval is usually about finding a concrete failure mode, then applying a targeted fix. Build a habit of inspecting the top results for failed queries and annotating what went wrong.
Bad chunks are the most common root cause. Symptoms: retrieved text is incomplete, lacks the key sentence, or includes unrelated boilerplate. Fixes include: adjusting chunk boundaries to align with headings, increasing overlap, adding a “section title + paragraph” format, and stripping repeated nav/footer text. Also watch for chunks that are too long: embeddings can blur multiple topics, hurting similarity search.
Wrong or missing metadata breaks filtering and relevance. Symptoms: correct content exists but is never retrieved when filters are applied (region, product, permission, date). Fixes: validate metadata at ingestion with schema checks, ensure doc_id consistency, and create unit tests that assert known documents appear in the index with expected fields. If you support ACLs, test that authorized queries retrieve sources while unauthorized ones do not.
Model mismatch happens when the embedding model is not suited to your content or query style. Symptoms: semantically relevant chunks rank low while keyword-ish chunks rank higher, or short queries perform poorly. Fixes: normalize query templates (remove ticket IDs, excessive prefixes), try a domain-tuned embedding model, embed structured fields (title, headings) alongside body, and ensure the same text normalization is applied to both documents and queries.
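The shared-normalization fix can be as small as one function applied to both documents and queries before embedding. The ticket-ID pattern below is an illustrative example; adapt it to whatever boilerplate your queries actually carry.

```python
import re

def normalize(text: str) -> str:
    """Apply identical normalization to documents and queries before embedding."""
    text = text.lower()
    # Strip ticket/case IDs that add noise to similarity (illustrative pattern).
    text = re.sub(r"\b(?:ticket|case)\s*#?\d+\b", " ", text)
    # Collapse runs of whitespace left behind by the substitutions.
    text = re.sub(r"\s+", " ", text).strip()
    return text
```

The point is not this specific regex but the invariant: if a transformation runs at indexing time, the same transformation must run at query time.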
The key outcome is a prioritized fix list. You are converting qualitative failures into categories you can measure and resolve, then re-running the offline harness to verify improvements.
Offline metrics are necessary, but online signals tell you whether the system helps real users under real constraints. Prepare your RAG retrieval system for online evaluation by instrumenting it like a product: log what was retrieved, what was shown to the LLM, and what the user did next.
Feedback signals: Capture lightweight explicit feedback (thumbs up/down, “was this helpful?”) and richer implicit feedback (click-through on citations, follow-up rate, time to resolution, copy/share events). For retrieval specifically, log: query, top-k doc_ids/chunk_ids, similarity scores, reranker scores, filters applied, and latency. This enables you to replay real queries through new retrieval versions and compare outcomes.
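A per-request event like the one described above can be built as a single JSON line; hashing the query keeps raw user text out of your logs while still letting you group repeated queries. Field names are illustrative.

```python
import hashlib
import json
import time

def log_retrieval_event(query, results, filters, latency_ms, index_version):
    """One structured event per request; the raw query is hashed, not stored."""
    event = {
        "ts": time.time(),
        "query_hash": hashlib.sha256(query.encode()).hexdigest()[:16],
        "query_len": len(query),
        "filters": filters,
        "top_k": [{"chunk_id": r["chunk_id"], "score": r["score"]} for r in results],
        "latency_ms": latency_ms,
        "index_version": index_version,
    }
    return json.dumps(event)  # ship as one JSON line to your log pipeline
```

Because chunk IDs and index version are logged, you can later replay the same query hashes against a new index build and diff the retrieved sets.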
A/B readiness: Before running A/B tests, define guardrails: maximum latency increase, maximum cost per query, and minimum acceptable safety/compliance outcomes. Ensure your system can route a stable fraction of traffic to variant B and that you can attribute outcomes to the retrieval change rather than unrelated UI or prompt changes. If the generator prompt changes during the test, your results will be confounded.
Use online data to expand and refresh your gold set. The best evaluation sets evolve with user behavior—new jargon, new policies, new edge cases—so treat online feedback as the pipeline that keeps your offline evaluation honest.
Evaluation is also where you prove the system is safe to deploy. Retrieval can leak sensitive content even if your generator prompt is “careful,” because the model may receive private text in context. Build compliance checks into ingestion, indexing, retrieval, and reporting.
PII handling: Decide whether to redact, encrypt, or exclude PII at ingestion. Run PII detectors on documents and store flags in metadata so you can filter or restrict retrieval. In evaluation, include tests that ensure PII-tagged chunks are not returned for general queries. Also ensure logs do not store raw sensitive text; log IDs and short hashes where possible.
Permissions and ACLs: If your corpus has per-user or per-group access controls, retrieval must enforce them before ranking results. A common mistake is applying ACL filters after top-k retrieval, which can cause empty results or biased ranking. Instead, filter candidates by permissions as early as possible (or use per-tenant indexes). Include evaluation queries that simulate different roles and confirm that unauthorized documents never appear in the retrieved set.
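The "filter before ranking" rule can be sketched with a toy in-memory index; real systems push the ACL filter into the vector database, but the ordering of operations is the same. Everything here (field names, the brute-force scoring) is illustrative.

```python
def dot(a, b):
    """Brute-force similarity stand-in for ANN scoring."""
    return sum(x * y for x, y in zip(a, b))

def retrieve_with_acl(query_vec, user_groups, index, top_k=10):
    """Filter candidates by permissions BEFORE ranking, so top-k is drawn
    only from chunks the user is allowed to see."""
    allowed = [c for c in index if c["acl"] & set(user_groups)]
    ranked = sorted(allowed, key=lambda c: dot(query_vec, c["vec"]), reverse=True)
    return ranked[:top_k]
```

An evaluation query that simulates a low-privilege role should assert, as in the test harness, that restricted chunks never appear in the returned set.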
Leakage checks: Guard against cross-document contamination and “training data” assumptions. Verify that citations actually come from retrieved chunks, that chunk IDs map to the correct document version, and that you are not accidentally indexing private staging content. Add canary documents (clearly labeled secrets) in a secure test environment and assert they never appear unless explicitly permitted.
Your interview-ready checkpoint is an evaluation report that includes: the gold set definition, offline metrics with slices, ablation results (chunking/embeddings/reranking), a table of top failure modes with fixes, and a short safety/compliance section. This demonstrates you can build RAG retrieval that is not only effective, but verifiably reliable and deployable.
1. Why can a vector search system that demos well still fail in production, according to the chapter?
2. What is the first evaluation artifact you need to create to enable offline retrieval metrics?
3. What is the purpose of running ablation studies in the chapter’s workflow?
4. What mindset shift does the chapter recommend when evaluating a RAG retrieval system?
5. Which set of steps best matches the chapter’s evaluation harness?
By now you can ingest documents, chunk them, embed them, retrieve top-k candidates, and rerank for relevance. The career-transition leap happens when you can ship that capability as a dependable product artifact: an API (or CLI) that someone else can run, monitor, and trust. This chapter turns your notebook-grade retrieval pipeline into a portfolio-ready service, with the engineering judgment interviewers look for: clear boundaries, safe configs, real monitoring, cost controls, and a deployment path that matches constraints.
Productionizing retrieval is not “adding FastAPI and calling it done.” Retrieval has failure modes that only appear under load or changing corpora: rising latency when the index grows, relevance regressions when chunking changes, silent cost spikes when reranking is unconstrained, and confusing results when metadata filters are inconsistent. You will wrap retrieval + reranking behind a stable interface, add observability, tune performance, and finish with a capstone demo that includes metrics plus a deployment checklist.
The outcome is a project a hiring manager can run in 10 minutes and evaluate in 10 more: “Here’s the endpoint. Here’s a sample query. Here are the offline metrics. Here’s a dashboard screenshot. Here’s how to scale it.” That is the difference between “I tried RAG” and “I can own retrieval.”
Practice note for Wrap retrieval + reranking into a simple API or CLI: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Add monitoring for latency, cost, and relevance regressions: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Optimize performance with caching, batching, and index tuning: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Prepare a portfolio README, diagrams, and interview talking points: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Capstone: end-to-end demo with metrics and a deployment checklist: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Start with a simple, explicit architecture diagram you can explain on a whiteboard. A practical reference architecture for a retrieval system has four boundaries: ingestion, indexing, serving, and evaluation. Keep these loosely coupled so you can swap components (vector DB, embedding model, reranker) without rewriting everything.
Ingestion service takes raw documents, cleans text, extracts metadata, assigns stable document IDs, and writes “normalized documents” to object storage (S3/GCS/local) plus metadata to a DB. Avoid embedding inside ingestion at first; keep it reproducible and auditable.
Index builder reads normalized docs, chunks them using your chosen strategy, generates embeddings in batches, and writes to a vector index (FAISS, pgvector, Pinecone, etc.). Store chunk IDs that deterministically map back to (doc_id, chunk_index, byte offsets) so you can debug retrieval results later.
Retrieval API exposes a small surface area: /search or a CLI command that accepts a query, optional filters, and returns ranked passages with citations. The API should do: query embedding → ANN top-k → lightweight filtering/metadata constraints → optional cross-encoder rerank → return. Resist adding generation in this service if your course focus is retrieval; you can integrate later, but keeping retrieval isolated makes evaluation cleaner.
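The stage sequence above can be sketched as one pure function with injected callables for the embedder, ANN search, and reranker; this keeps the pipeline testable without a running server or index. The parameter names and result fields are assumptions, not a required interface.

```python
def search(query, embed, ann_search, rerank, filters=None,
           top_k_retrieve=50, top_k_rerank=20, top_k_return=5):
    """Endpoint logic: embed -> ANN top-k -> filter -> rerank -> return.

    embed / ann_search / rerank are injected callables, so the orchestration
    can be unit-tested with stubs before wiring in real services.
    """
    qvec = embed(query)
    candidates = ann_search(qvec, top_k_retrieve, filters or {})
    reranked = rerank(query, candidates[:top_k_rerank])
    return [
        {"chunk_id": c["chunk_id"], "score": c["score"], "citation": c["doc_id"]}
        for c in reranked[:top_k_return]
    ]
```

Wrapping this function in a FastAPI route or a CLI command is then a thin layer, which is exactly the separation the architecture calls for.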
Evaluation job runs offline metrics (Recall@k, nDCG@k, MRR) on a labeled query set and stores results per build. This becomes your regression gate: new chunking, new embeddings, or new reranker must not silently degrade quality.
For your portfolio, include a one-page diagram that labels these components and shows data flow. Interviews love clear boundaries because they signal you understand operational ownership, not just model calls.
Packaging is where prototypes usually fall apart. Your goal is a repo that runs with one command and has no hidden state. Treat configuration as code: explicit, validated, and environment-specific.
Config structure: keep a single typed config (YAML/TOML + validation via Pydantic or dataclasses). Split what changes per environment (dev/staging/prod) from what is constant. Examples of config knobs: embedding model name, chunk size/overlap, ANN index parameters (HNSW efSearch, IVF nlist), top-k, rerank-k, timeouts, and metadata filter defaults.
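A minimal version of a typed, validated config can be done with stdlib dataclasses (Pydantic adds richer validation if you prefer a dependency). The knob names below mirror the examples in this section but are otherwise illustrative.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class RetrievalConfig:
    embedding_model: str
    chunk_size: int = 512
    chunk_overlap: int = 64
    top_k: int = 50
    rerank_k: int = 20
    timeout_s: float = 2.0
    filter_defaults: dict = field(default_factory=dict)

    def __post_init__(self):
        # Fail fast on obviously invalid knobs instead of failing at query time.
        if self.chunk_overlap >= self.chunk_size:
            raise ValueError("chunk_overlap must be smaller than chunk_size")
        if self.rerank_k > self.top_k:
            raise ValueError("rerank_k cannot exceed top_k")
```

`frozen=True` makes the config immutable after load, which prevents a request handler from mutating shared settings mid-flight.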
Secrets: API keys and database credentials must come from environment variables or a secret manager. Never commit them; add a .env.example and a clear README section: “copy to .env and fill values.” In code, fail fast with a helpful error if a required secret is missing.
Reproducible environments: choose one packaging path and do it well: pyproject.toml with uv/poetry, or requirements.txt with pinned versions. Add a make target (or just scripts) for make ingest, make index, make serve, make eval. This matches the lesson “wrap retrieval + reranking into a simple API or CLI”: provide both a CLI for batch usage and an API for integration.
A portfolio-ready project is not “works on my laptop.” It is “works for you, on your laptop, with documented knobs and safe secrets.”
Retrieval quality and system health drift over time. Observability is how you catch regressions before users do. Implement three layers: structured logs, request traces, and metrics dashboards with alerts.
Structured logs: log one event per request with fields you can aggregate: request_id, query length, filter keys, embedding latency, ANN latency, rerank latency, total latency, top-k, rerank-k, and which index build/version served the request. Also log the returned chunk IDs and scores (at least for sampled traffic) to support later error analysis. Avoid logging raw user queries if privacy matters; hash or redact.
Tracing: use OpenTelemetry to instrument spans: embed_query, vector_search, rerank, postprocess. This helps you explain where time goes and is especially useful when switching providers or deploying serverless. A trace screenshot is an excellent portfolio artifact because it shows production literacy.
Metrics & dashboards: export Prometheus-style counters and histograms: QPS, p50/p95 latency, error rate, cache hit rate, rerank usage rate, and cost estimates per request. Add “relevance regression” signals by periodically running a small golden set through the live endpoint and tracking Recall@k/nDCG@k over time. This integrates the lesson about monitoring relevance regressions, not just uptime.
In interviews, be ready to describe one incident scenario: “index grew 5×, p95 increased, we tuned HNSW efSearch and added caching; relevance stayed stable per golden set.” This is credible operational storytelling.
Cost is a first-class retrieval metric. Rerankers and embedding calls can dominate spend, and the easiest way to blow a budget is to rerank too many candidates or to rerank unnecessarily. Implement guardrails that make cost predictable.
Token budgets: if your reranker is a cross-encoder or LLM-based scorer, define a maximum rerank input size (characters/tokens) per passage and per request. Truncate passages consistently (e.g., title + first N tokens + highlighted matches). Keep a counter for “tokens sent to reranker” and expose it as a metric.
Rerank limits: set top_k_retrieve and top_k_rerank separately. A common pattern: retrieve 50–200 via ANN (cheap), rerank 10–30 (expensive), return top 5–10. Add a dynamic policy: if ANN scores show a strong gap after the top 8, skip reranking beyond 8. This is engineering judgment: pay for reranking when it is likely to change the ordering.
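The dynamic skip policy can be expressed as a small decision function over the sorted ANN scores. The gap position and threshold below are illustrative defaults; tune them against your own score distributions.

```python
def choose_rerank_depth(ann_scores, default_depth=30, gap_position=8,
                        gap_threshold=0.15):
    """Pick how many ANN candidates to send to the expensive reranker.

    If scores drop sharply after `gap_position`, the tail is unlikely to
    change the final ordering, so rerank only the confident head.
    ann_scores must be sorted in descending order.
    """
    if len(ann_scores) > gap_position:
        gap = ann_scores[gap_position - 1] - ann_scores[gap_position]
        if gap >= gap_threshold:
            return gap_position  # strong gap: pay for the head only
    return min(default_depth, len(ann_scores))
```

Exposing the chosen depth as a logged field lets you later measure how often the skip fires and whether it ever hurts nDCG.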
Cache strategy: caching is your primary cost lever and performance lever. Cache query embeddings keyed by normalized query text + embedding model version. Cache ANN results keyed by (query_embedding_hash, filters, index_version, top_k). Optionally cache rerank outputs for popular queries, but ensure the cache key includes reranker version and the exact candidate set.
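The cache keys described above can be built by hashing every input that changes the result. This is a sketch of the key construction only, with illustrative field abbreviations; the storage backend (dict, Redis, etc.) is up to you.

```python
import hashlib
import json

def ann_cache_key(query_embedding_hash, filters, index_version, top_k):
    """Key for cached ANN results: anything that changes the result set must
    be in the key, or you will serve stale hits after reindexing."""
    payload = json.dumps(
        {"q": query_embedding_hash, "f": filters, "iv": index_version, "k": top_k},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()

def embedding_cache_key(query, model_version):
    """Key for cached query embeddings: normalized text + model version."""
    normalized = " ".join(query.lower().split())
    return hashlib.sha256(f"{model_version}|{normalized}".encode()).hexdigest()
```

Note that bumping `index_version` or `model_version` implicitly invalidates the relevant cache entries, which is usually safer than trying to purge them explicitly.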
In your capstone demo, report “cost per 1,000 searches” under a realistic workload, and show how rerank-k and caching change that number without harming nDCG on your eval set.
Your portfolio should demonstrate at least two deployment modes: local for reviewers and one “real” deployment (container or serverless). Choose based on your index size and latency requirements.
Local (developer mode): run everything on a laptop: a local vector index (FAISS) and a FastAPI server. Provide a make demo that builds a tiny sample index and starts the API. This is the fastest way for a hiring manager to validate your work.
Containerized (recommended): package the API into a Docker image with deterministic builds. Mount the index as a volume or download it at startup from object storage. For larger indexes, prefer a separate vector database service and keep the API stateless. Container deployment (Fly.io, Render, ECS, Kubernetes) makes your system legible to infrastructure teams.
Serverless: good for spiky traffic, but be careful with cold starts and large index files. If the index cannot fit in memory or must be loaded from disk each invocation, latency will suffer. Serverless works better when the vector index is managed externally (Pinecone/Weaviate/pgvector on managed Postgres) and your function only orchestrates retrieval and reranking.
Managed services: simplest operationally: managed vector DB + managed metrics + managed secrets. This is a valid portfolio choice if you explain the tradeoff: higher cost, lower ops burden, faster time-to-ship.
Close this section with a deployment checklist you actually used: build image, run health check, validate /search, run golden-set eval, verify dashboards, then promote.
This course sits at the intersection of ML engineering, backend engineering, and search relevance. To make it count for a career transition, map your work to real roles: “ML Engineer (RAG/retrieval),” “Search/Ranking Engineer,” “Backend Engineer (AI platform),” or “Data Engineer (ingestion + indexing).” The same project reads differently depending on which outcomes you highlight.
Portfolio README: make it skimmable. Include: problem statement, dataset/corpus description, architecture diagram, quickstart commands, API/CLI examples, and an evaluation section with offline metrics plus error analysis examples (good vs bad queries and what you changed). Add a “Design decisions” section: chunk size rationale, why bi-encoder retrieval + cross-encoder rerank, and what you would do next (hybrid search, query rewriting, better labels).
Interview talking points: be ready to explain tradeoffs: why vector search beat keyword for your corpus, when hybrid would be better, why reranking improved nDCG, and how you prevented regressions (golden set + alerts). Bring one story about performance optimization (caching, batching, index tuning) and one about cost control (rerank-k limits, token budgets).
When reviewers can run your system, see your metrics, and understand your engineering choices, you are no longer presenting a tutorial project—you are presenting evidence of job-ready ownership.
1. What makes a retrieval pipeline “portfolio-ready” according to the chapter?
2. Which situation best illustrates why “adding FastAPI and calling it done” is insufficient for production retrieval?
3. What monitoring focus does the chapter call out to prevent regressions and surprises in production retrieval?
4. Which set of techniques is presented as the main way to optimize performance for the shipped retrieval service?
5. What is the capstone outcome that demonstrates you can “own retrieval” in an interview setting?