
RAG Data Pipeline for Course Catalogs: Clean, Chunk, Index

AI In EdTech & Career Growth — Intermediate


Turn messy catalogs into a weekly-refreshed RAG index that works.

Intermediate · rag · edtech · data-pipeline · embeddings

Why this course exists

Course catalogs are messy by nature: multiple sources, frequent updates, inconsistent formatting, and critical edge cases like prerequisites and program rules. If you want accurate AI answers for learners, advisors, or sales teams, you need more than a chatbot—you need a reliable RAG (Retrieval-Augmented Generation) data pipeline that keeps your index clean, structured, and fresh.

This book-style course walks you through building a practical, production-ready pipeline for course catalogs: ingest, normalize, clean, chunk, embed, index, and refresh on a weekly cadence. The goal is not just “it works on a demo,” but “it keeps working after the catalog changes.”

What you will build

By the end, you will have a blueprint you can implement in your stack (Python + your choice of vector database) to power:

  • Accurate course and program Q&A with citations
  • Catalog search with filters (campus, modality, term, level)
  • Advising and eligibility checks grounded in the latest rules
  • Operational workflows for weekly refresh and rollback

How the chapters fit together

Chapter 1 establishes the data contract: what counts as a “document,” how IDs work, what freshness means, and what success looks like. Chapter 2 brings sources into a canonical format while preserving lineage. Chapter 3 makes the data trustworthy through cleaning, deduplication, enrichment, and audits.

Once your catalog is consistent, Chapter 4 focuses on chunking—where most RAG systems succeed or fail. You will design chunk boundaries that match real user questions and support precise retrieval. Chapter 5 turns those chunks into an index: embeddings, hybrid retrieval, reranking, idempotent upserts, and citation-ready storage. Finally, Chapter 6 operationalizes everything with weekly refresh, change detection, monitoring, regression tests, and governance.

Who this is for

This course is designed for EdTech and education-adjacent teams—product engineers, data engineers, ML engineers, and technical product managers—who need a dependable approach to catalog RAG. If you are responsible for search, advising, enrollment funnels, or learner support, this pipeline will directly improve answer quality and user trust.

Key skills you will gain

  • Designing a catalog-specific schema and metadata strategy that improves retrieval
  • Implementing deterministic cleaning and deduplication that survives weekly updates
  • Building chunking policies for courses, programs, and policies (not just generic text)
  • Indexing with filters, citations, and safe incremental updates
  • Evaluating retrieval quality and monitoring freshness and drift

Get started

If you want to ship a RAG system that stays accurate as your catalog evolves, this course gives you the structure, milestones, and operational thinking to do it right. Register free to begin, or browse all courses to compare related learning paths.

What You Will Learn

  • Model a course catalog schema with metadata that improves RAG retrieval
  • Build a repeatable cleaning and normalization workflow for catalog data
  • Design chunking strategies for course pages, syllabi, and program rules
  • Create embeddings and index them in a vector database with filters
  • Implement weekly refresh with incremental updates and safe re-indexing
  • Evaluate retrieval quality (recall, MRR) and reduce hallucinations with guardrails
  • Add monitoring for drift, freshness, and broken sources
  • Ship a production-ready pipeline blueprint with docs and runbooks

Requirements

  • Basic Python and JSON/CSV familiarity
  • Comfort with REST APIs and command-line tools
  • General understanding of LLMs and embeddings (helpful but not required)
  • Access to a sample course catalog dataset (or ability to scrape/export one)

Chapter 1: Define the Catalog RAG Use Case and Data Contract

  • Select high-value user journeys (search, advising, prerequisites)
  • Draft the catalog data contract (fields, IDs, freshness rules)
  • Choose retrieval unit types (course, program, policy, FAQ)
  • Set acceptance criteria and success metrics for answers
  • Plan privacy, licensing, and source-of-truth governance

Chapter 2: Ingest and Normalize Catalog Sources

  • Build connectors for HTML, PDFs, and structured exports
  • Normalize text and structure into a canonical document format
  • Handle redirects, pagination, and multi-term versions
  • Validate and store raw vs processed artifacts
  • Create lineage logs for traceability

Chapter 3: Clean, Deduplicate, and Enrich the Catalog

  • Create deterministic cleaning rules and unit tests
  • Deduplicate near-identical course entries across terms
  • Extract entities (credits, prerequisites, outcomes) into metadata
  • Resolve cross-links (course codes, program requirements)
  • Produce audit reports for missing and conflicting fields

Chapter 4: Chunk Strategy for High-Recall Retrieval

  • Choose chunk boundaries aligned to how users ask questions
  • Implement section-aware chunking for structured pages
  • Add overlap and context windows without bloating the index
  • Attach metadata filters to chunks for precise retrieval
  • Run chunking experiments and pick a default policy

Chapter 5: Embed, Index, and Retrieve with Controls

  • Select an embedding model and define versioning rules
  • Create a vector index with hybrid search and metadata filters
  • Implement upserts, deletes, and idempotent indexing
  • Add retrieval controls: top-k, reranking, and citations
  • Benchmark latency and cost for catalog-scale traffic

Chapter 6: Weekly Refresh, Monitoring, and Operations

  • Design the weekly refresh workflow and incremental detection
  • Implement safe re-indexing with backfills and rollbacks
  • Add monitoring for freshness, coverage, and retrieval quality
  • Create runbooks for failures, source changes, and schema updates
  • Ship a production checklist and handoff documentation

Sofia Chen

Senior Machine Learning Engineer, Retrieval & Data Platforms

Sofia Chen designs retrieval systems and data pipelines for education and marketplace platforms. She has led production RAG deployments covering ingestion, indexing, evaluation, and monitoring. Her focus is building pragmatic systems that stay accurate as catalogs change.

Chapter 1: Define the Catalog RAG Use Case and Data Contract

A course catalog looks simple until you try to answer real student and advisor questions reliably. “Can I take CS 302 next term?” is not just a search problem; it mixes prerequisites, term availability, campus rules, transfer credit policies, and sometimes exceptions. Retrieval-Augmented Generation (RAG) can help, but only if the pipeline is designed around the catalog’s true user journeys and a strict data contract that keeps documents stable, fresh, and governable.

This chapter sets the foundation for the rest of the course: you will pick the highest-value journeys, define what a “retrieval unit” is in your system (course vs. program rule vs. policy vs. FAQ), and draft a contract for fields, identifiers, and freshness rules. You will also decide what “good” looks like—acceptance criteria and success metrics for answers—so your team can evaluate changes without subjective arguments. Finally, you’ll establish privacy, licensing, and source-of-truth governance so the system stays compliant and trustworthy.

Engineering judgment matters most at this stage. Many catalog RAG efforts fail not because embeddings are “bad,” but because the upstream data contract is fuzzy: duplicate courses, ambiguous IDs, mismatched campuses, or PDFs that drift out of date. The work in Chapter 1 is deliberately operational: you are designing the boundaries of your system so it can be cleaned, chunked, indexed, refreshed, and evaluated repeatably.

  • Practical outcome: a written use-case definition and data contract that the pipeline can enforce (not just documentation).
  • Practical outcome: a decision on retrieval unit types and the minimum metadata needed for filtering and ranking.
  • Practical outcome: acceptance criteria and metrics (e.g., recall, MRR, citation coverage) tied to user journeys.
  • Practical outcome: governance notes: which sources are authoritative, what’s licensed, and what must never be indexed.

In the sections that follow, you’ll translate “catalog” into an explicit set of documents, IDs, and rules. This becomes the contract your ingestion and indexing pipeline must satisfy every run—especially during weekly refreshes and incremental updates later in the course.

Practice note for Select high-value user journeys (search, advising, prerequisites): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Draft the catalog data contract (fields, IDs, freshness rules): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Choose retrieval unit types (course, program, policy, FAQ): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Set acceptance criteria and success metrics for answers: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Plan privacy, licensing, and source-of-truth governance: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Sections in this chapter
Section 1.1: RAG patterns for course catalogs
Section 1.2: Source inventory (web, SIS/LMS, PDFs, CMS)
Section 1.3: Canonical identifiers and dedup keys
Section 1.4: Metadata strategy (campus, modality, term, level)
Section 1.5: Freshness and SLA definition
Section 1.6: Threat model (PII, policy errors, outdated info)

Section 1.1: RAG patterns for course catalogs

Start by selecting high-value user journeys, because they determine your RAG pattern. In education, three journeys tend to dominate: (1) search and discovery (“Find evening data analytics courses under $X”), (2) advising and planning (“What should I take next to finish the minor?”), and (3) prerequisites and eligibility (“Can I enroll if I have MATH 101 from a community college?”). Each journey expects different evidence and different tolerance for ambiguity.

For search/discovery, RAG often behaves like semantic search with filtering: retrieve courses using embeddings, then apply metadata filters (campus, modality, level, term) and re-rank by structured signals (credit hours, department). For advising, you need multi-document retrieval where answers cite a course page plus a program rule plus a policy snippet. For prerequisites, treat it as a high-precision compliance query: the assistant should be conservative, cite the prerequisite rule verbatim, and fall back to “I can’t confirm” when sources conflict.

Choose retrieval unit types explicitly. A common catalog set is: Course (title, description, credits, prereq text, learning outcomes), Program (degree requirements, electives, residency rules), Policy (grading, repeats, transfer credit, academic standing), and FAQ (registrar explanations that students actually understand). Don’t overload “course documents” with program rules; mixing them harms retrieval because embeddings learn broad topical similarity instead of answering the specific question.

Define acceptance criteria early. For example: “Every answer must cite at least one authoritative source; prerequisite answers must include the exact prerequisite statement; if the requested term is unknown, the assistant must explicitly state that availability is not guaranteed.” Tie these to measurable success metrics like recall@k for known queries, MRR for search ranking quality, and “citation coverage” (percentage of responses with citations to the correct retrieval units). This is how you reduce hallucinations: you constrain the system with retrieval patterns and measurable requirements instead of hoping prompting will compensate.
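Metrics like recall@k and MRR are straightforward to compute once you have a labeled query set. A minimal sketch (function and variable names are illustrative, not part of any specific evaluation library):

```python
def recall_at_k(relevant_ids, retrieved_ids, k):
    """Fraction of known-relevant documents that appear in the top-k results."""
    hits = len(set(relevant_ids) & set(retrieved_ids[:k]))
    return hits / len(relevant_ids) if relevant_ids else 0.0

def mean_reciprocal_rank(labeled_queries):
    """labeled_queries: list of (relevant_ids, retrieved_ids) pairs.
    For each query, credit 1/rank of the first relevant hit (0 if none)."""
    total = 0.0
    for relevant, retrieved in labeled_queries:
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(labeled_queries) if labeled_queries else 0.0
```

Run these over a small set of known queries ("What is the prerequisite for CS 302?") with hand-labeled correct retrieval units, and track the numbers across pipeline changes.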

Section 1.2: Source inventory (web, SIS/LMS, PDFs, CMS)

Before drafting a data contract, inventory your sources and decide which is the source of truth for each field. Course catalogs are often split across systems: a public website (marketing-friendly text), a SIS (official codes, credits, prereqs), an LMS (syllabi and schedules), a CMS (policy pages), and a long tail of PDFs (handbooks, archived catalogs, articulation agreements). RAG will faithfully reproduce whatever you index—so governance starts with knowing what you are ingesting.

Make a simple table for each source: system, data type, update cadence, export method (API, database view, scraping, SFTP), license/permissions, PII risk, and authoritativeness. For example, the SIS may be authoritative for credits and requisites, while the web CMS is authoritative for program narrative descriptions. PDFs are the most dangerous: they are often outdated, hard to parse, and duplicated across directories. If you include PDFs, record versioning signals (publication date, effective term) and plan to suppress older versions at retrieval time.

Practical workflow: start with web + SIS + CMS policy pages; add syllabi later only if you can control privacy and versioning. Syllabi can improve answers about workload and topics, but they also introduce instructor names, email addresses, office hours, and assessment details that may change frequently. If you cannot guarantee safe redaction and term scoping, keep syllabi out of the index and instead link to them as external references.

Common mistake: treating “the catalog” as a single dataset. In reality, you have competing truth claims. Resolve conflicts with explicit precedence rules (e.g., “If SIS credits differ from web credits, use SIS”) and embed that rule in your pipeline so it is consistently applied. This is a core part of planning licensing and source-of-truth governance: your RAG system must be able to explain where information came from and why that source was chosen.
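A precedence rule like “prefer SIS for credits” is easy to encode so the pipeline applies it identically on every run. A sketch, where the precedence table and source names are hypothetical examples:

```python
# Hypothetical precedence map: SIS is authoritative for credits,
# the web CMS for narrative descriptions.
FIELD_PRECEDENCE = {
    "credits": ["sis", "web"],
    "description": ["web", "sis"],
}

def resolve_field(field, values_by_source):
    """Pick the value from the highest-precedence source that has one.
    Returns (value, winning_source) so the choice can be cited and audited."""
    for source in FIELD_PRECEDENCE.get(field, []):
        if source in values_by_source and values_by_source[source] is not None:
            return values_by_source[source], source
    return None, None
```

Returning the winning source alongside the value is what lets the assistant later explain why a particular number was used.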

Section 1.3: Canonical identifiers and dedup keys

A RAG index is only as stable as its identifiers. If the same course appears with three slightly different titles across systems, your embeddings will cluster them, but retrieval will produce inconsistent citations and your weekly refresh may create duplicates. Your data contract must define canonical identifiers and deterministic deduplication keys for each retrieval unit type.

For courses, define a canonical course_id that is stable across years and systems. Many institutions have a SIS internal ID plus a human-readable code (e.g., “CS-302”). Use both: store the SIS ID as the primary key when available, and store the display code as a searchable attribute. If your institution reuses course codes over time, include an “effective_start_term” and “effective_end_term” in the identity or at least in the versioning fields so old and new definitions do not collide.

For programs and policies, IDs are often missing. Create them deterministically using normalized paths: for example, program:{campus}:{slug} or policy:{department}:{url_path}. Your dedup key should be reproducible from raw inputs, not assigned manually during ingestion. That way, incremental updates can upsert correctly and safe re-indexing becomes possible.

Dedup strategy should combine structural keys (IDs, URLs, SIS keys) with content fingerprints (hash of normalized text) to detect when two sources represent the same entity. A practical approach is: (1) choose the best canonical record by precedence rules, (2) attach aliases for other identifiers (old codes, cross-listed codes), and (3) keep a source_map field listing all contributing sources with timestamps. This supports auditing, troubleshooting, and citations (“This prerequisite comes from SIS rule text, updated on…”).
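The deterministic-ID and content-fingerprint ideas can be sketched in a few lines; the `policy:{department}:{url_path}` format follows the pattern above, while the `slugify` helper is an illustrative assumption:

```python
import hashlib
import re

def slugify(text):
    """Normalize free text into a stable, URL-safe slug."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")

def policy_id(department, url_path):
    """Deterministic ID reproducible from raw inputs, e.g. policy:registrar:grading-policy."""
    return f"policy:{slugify(department)}:{slugify(url_path)}"

def content_fingerprint(text):
    """Hash of whitespace-normalized, lowercased text.
    Detects when two sources carry the same content under different formatting."""
    normalized = " ".join(text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()
```

Because both functions are pure transformations of raw inputs, re-running ingestion always reproduces the same keys, which is what makes idempotent upserts possible later.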

Common mistake: using the URL as the only ID. URLs change during website redesigns; your index will fragment, and metrics will degrade silently. Treat URLs as attributes, not identity. Your contract should specify which IDs are immutable, which are versioned, and how to handle mergers (e.g., two departments cross-listing a course). This design choice pays off later when you implement weekly refresh with incremental updates.

Section 1.4: Metadata strategy (campus, modality, term, level)

Metadata is not optional in catalog RAG; it is how you prevent plausible but wrong answers. Your retrieval should not only be semantic—it should be constrained by filters that reflect real academic structure. The minimum metadata set typically includes: campus, modality (in-person, online, hybrid), term/effective period, and level (undergraduate, graduate, continuing education). Add department/school, credit range, and language if relevant.

Draft these fields as part of the catalog data contract with explicit allowed values. For example, campus should be an enum (“main”, “downtown”, “virtual”), not free text (“Main Campus”, “main campus”, “MAIN”). Modality should be derived consistently: if the SIS uses codes, map them to your canonical set. Term is tricky: distinguish between effective_term (the rules that govern the course description and prerequisites) and offered_terms (when sections are typically scheduled). Many catalogs conflate them; your pipeline should not.
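Mapping free-text values and SIS codes onto canonical enums might look like the sketch below; the alias and code tables are hypothetical stand-ins for your real SIS codebook:

```python
# Hypothetical mapping tables; real values come from your SIS codebook.
CAMPUS_ALIASES = {
    "main campus": "main", "main": "main",
    "downtown center": "downtown", "downtown": "downtown",
}
MODALITY_CODES = {"P": "in-person", "OL": "online", "HY": "hybrid"}

def normalize_campus(raw):
    """Collapse free-text campus labels onto the canonical enum (None if unknown)."""
    return CAMPUS_ALIASES.get(raw.strip().lower())

def normalize_modality(sis_code):
    """Map SIS modality codes onto the canonical set (None if unknown)."""
    return MODALITY_CODES.get(sis_code.strip().upper())
```

Returning `None` for unknown values (rather than passing raw text through) forces unmapped inputs into your audit reports instead of silently polluting the enum.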

Use metadata to support the user journeys from Section 1.1. For search, metadata enables faceting and fast narrowing. For advising, it allows you to exclude irrelevant campuses or levels. For prerequisites, it prevents mixing rules across terms (“In 2022, the prerequisite was…”) and reduces hallucinations by ensuring retrieved chunks come from the correct effective period.

Common mistake: stuffing everything into the embedding text and hoping similarity search will “figure it out.” Embeddings are good at topical similarity but weak at strict constraints like “graduate-only” or “available this fall.” Put constraints into metadata fields and enforce them at query time (vector search + filters). Also define metadata for retrieval unit type so your index can retrieve a policy chunk when the question is about grading, not the course description.
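Enforcing constraints as metadata filters can be as simple as a strict post-filter over retrieved candidates; in practice your vector database would apply the equivalent filter server-side, so this is only a sketch of the logic:

```python
def filter_candidates(candidates, filters):
    """Keep only candidates whose metadata matches every filter exactly.
    candidates: list of (chunk_text, metadata_dict) pairs."""
    def matches(meta):
        return all(meta.get(key) == value for key, value in filters.items())
    return [(chunk, meta) for chunk, meta in candidates if matches(meta)]
```

The point is that “graduate-only” becomes an equality check on a field, not something the embedding is trusted to encode.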

Practical outcome: by the end of this section you should have a metadata dictionary in your contract: field name, type, allowed values, derivation rule, and whether it is filterable in the vector database. This becomes the bridge between raw catalog sources and reliable retrieval.

Section 1.5: Freshness and SLA definition

Freshness is where catalog RAG becomes operational. Students will ask questions that depend on the current term, newly approved curriculum changes, or updated policies. Your pipeline needs explicit freshness rules and a service-level agreement (SLA) so everyone knows when the assistant is allowed to answer confidently.

Start by defining “freshness” per retrieval unit type. Course descriptions might change once or twice a year; section schedules may change daily during registration; policies can change mid-year; FAQs might change whenever staff update the website. If your assistant is aimed at catalog rules (not real-time scheduling), you may intentionally exclude schedule data and set the SLA accordingly (“rules updated weekly; schedules not included”). Be clear in the contract about scope: it is better to be reliably incomplete than inconsistently current.

In the data contract, include fields like source_updated_at, ingested_at, effective_start_term, and expires_at. Then define thresholds: e.g., “Policies must be ingested within 7 days of source update,” “Course pages within 14 days,” “If source_updated_at is unknown for a PDF, treat it as low-trust and require an effective term in the text to be retrievable.”
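These threshold fields translate directly into a staleness check. A sketch using the example SLA numbers from this section, with unknown provenance treated as low-trust by default:

```python
from datetime import timedelta

# Per-unit-type SLAs from the (example) data contract.
FRESHNESS_SLA = {
    "policy": timedelta(days=7),
    "course": timedelta(days=14),
}

def is_stale(unit_type, source_updated_at, ingested_at):
    """True when ingestion lagged the source update beyond the unit's SLA.
    Missing SLA or unknown source timestamp is treated as stale (low-trust)."""
    sla = FRESHNESS_SLA.get(unit_type)
    if sla is None or source_updated_at is None:
        return True
    return ingested_at - source_updated_at > sla
```

A check like this can run at index time (to flag documents) and at answer time (to trigger the “may be outdated” guardrail).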

Acceptance criteria should include freshness behaviors. For example: “If the retrieved sources are older than the SLA, the answer must warn that information may be outdated and provide a link to the authoritative source.” This is a guardrail that reduces hallucinations by making staleness visible instead of silently generating a confident response.

Common mistake: a single global refresh schedule (“we reindex weekly”) without unit-level SLAs. That hides risk: a daily-changing policy page and a yearly course description do not deserve the same monitoring. Plan now for weekly refresh with incremental updates and safe re-indexing later: you will need stable IDs (Section 1.3) plus a freshness policy (this section) to decide what to upsert, what to delete, and what to keep as historical versions.

Section 1.6: Threat model (PII, policy errors, outdated info)

A catalog RAG system is an information system, so treat it like one: define a threat model. The main threats are not adversarial hackers; they are routine institutional risks that produce harmful answers. Three high-impact threats are PII leakage, policy errors, and outdated information.

PII leakage often enters through syllabi, LMS exports, advising notes, or “helpful” PDFs that include names, emails, student examples, or accommodation details. Your governance rule should be simple: if a source can contain student data or instructor contact details not intended for public distribution, do not index it unless you have a redaction pipeline and clear permission. Enforce this with allowlists (approved domains, repositories) and automated scanning for common PII patterns. In the contract, include a pii_risk classification per source and block high-risk inputs by default.
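A first-pass automated scan for common PII patterns might look like the sketch below. The regexes are deliberately simplistic illustrations; a real pipeline needs broader patterns (names, student IDs, institution-specific formats) and human review:

```python
import re

# Minimal illustrative patterns for a first-pass PII scan.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "us_phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def pii_findings(text):
    """Return the names of patterns that matched, for blocking or manual review."""
    return sorted(name for name, pattern in PII_PATTERNS.items() if pattern.search(text))
```

Wired into ingestion, a non-empty result on a high-`pii_risk` source would quarantine the document instead of indexing it.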

Policy errors are more subtle. If two policy pages conflict (e.g., repeat policy updated, old PDF still online), naive retrieval may surface both and the model may synthesize an incorrect hybrid. Mitigations: (1) strict source precedence and deprecation rules, (2) effective-term metadata and filtering, and (3) answer acceptance criteria requiring direct quotation for high-stakes rules (prerequisites, graduation requirements, financial obligations). When in doubt, the assistant should refuse to generalize and instead direct the user to an official office.

Outdated info is inevitable unless you plan for it. Your freshness SLAs (Section 1.5) should drive runtime guardrails: if retrieved documents are stale, the assistant must label the answer as potentially outdated. Also consider “outdated by design” risks: archived catalogs are valuable for alumni, but harmful for current students if mixed into the same index without term filters. Separate indexes or strict term-based filtering prevents accidental retrieval of old rules.

Finally, licensing and governance matter even for “public” catalogs. Some institutions restrict reuse or have accessibility obligations. Document who owns each source, what can be stored, and how citations should be displayed. A practical outcome of this threat model is a short, enforceable checklist embedded into ingestion: approved sources only, PII scan pass, effective term present (or low-trust flag), and de-duplication complete. This is how you keep the assistant trustworthy as you scale the pipeline.

Chapter milestones
  • Select high-value user journeys (search, advising, prerequisites)
  • Draft the catalog data contract (fields, IDs, freshness rules)
  • Choose retrieval unit types (course, program, policy, FAQ)
  • Set acceptance criteria and success metrics for answers
  • Plan privacy, licensing, and source-of-truth governance
Chapter quiz

1. Why does the chapter argue that questions like “Can I take CS 302 next term?” are not just a simple search problem?

Correct answer: Because they require combining multiple catalog constraints like prerequisites, term availability, campus rules, and policies
The chapter emphasizes these queries mix prerequisites, availability, campus rules, transfer credit policies, and exceptions, so the pipeline must support more than keyword search.

2. What is the main purpose of defining a strict catalog data contract in a RAG pipeline?

Correct answer: To keep documents stable, fresh, and governable through clear fields, identifiers, and freshness rules
A strict contract specifies fields, IDs, and freshness rules so ingestion/indexing can run repeatably and avoid drift and ambiguity.

3. How does the chapter suggest you should define the system’s 'retrieval unit'?

Correct answer: Choose unit types that match user journeys, such as course, program rule, policy, or FAQ
The chapter calls for selecting retrieval unit types (course/program/policy/FAQ) aligned to what users ask so retrieval and chunking are purposeful.

4. What is the benefit of setting acceptance criteria and success metrics early in the project?

Correct answer: It allows the team to evaluate changes objectively using defined measures like recall, MRR, and citation coverage tied to user journeys
The chapter stresses defining “good” up front so pipeline changes can be assessed without subjective arguments, using metrics tied to journeys.

5. According to the chapter, why do many catalog RAG efforts fail even when embeddings are not the problem?

Correct answer: Upstream data contracts are fuzzy, causing issues like duplicate courses, ambiguous IDs, mismatched campuses, or out-of-date PDFs
The chapter attributes failure to operational issues upstream—unclear IDs, duplicates, mismatches, and freshness drift—rather than embedding quality.

Chapter 2: Ingest and Normalize Catalog Sources

Your RAG system is only as reliable as the documents you feed it. Course catalogs look simple—pages with titles, credits, prerequisites—but in practice they come from a messy mix of HTML pages, PDF handbooks, and structured exports from SIS/CMS tools. This chapter focuses on the first half of the pipeline: building connectors, capturing raw artifacts, and normalizing everything into a canonical document model that downstream chunking and indexing can trust.

The engineering goal is repeatability. You should be able to run ingestion weekly, detect what changed, and re-index safely without losing traceability. That means designing connectors that handle redirects and pagination, dealing with “multi-term” versions (e.g., 2024–2025 vs 2025–2026), and storing both raw and processed artifacts. It also means keeping lineage logs that answer: Where did this text come from? When did we fetch it? Which extractor and cleaning rules were applied?

In course catalogs, common mistakes start early: scraping only rendered text without URLs and timestamps; stripping too aggressively and losing tables that contain requirements; ignoring PDF layout and gluing together unrelated columns; and failing to preserve term/version metadata so retrieval returns outdated program rules. The practical outcome of this chapter is a robust ingest-and-normalize layer that produces consistent, well-labeled JSON documents, ready for chunking and embedding in later chapters.

  • Deliverable: connectors for HTML, PDF, and structured exports with deterministic output.
  • Deliverable: a canonical JSON document format capturing content, metadata, and provenance.
  • Deliverable: validation + quarantine workflow to keep bad documents out of your index.
  • Deliverable: lineage logs enabling auditability and safe incremental refresh.

Throughout the chapter, treat ingestion as a data product: raw artifacts are immutable, processed artifacts are reproducible, and every record has a stable identity. That mindset sets you up for incremental updates and evaluation later, because you can compare retrieval performance across catalog terms and extraction versions without guessing what changed.
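To make the canonical document model concrete, here is one possible JSON shape. Every field name is an illustrative assumption, not a fixed standard; the point is that content, metadata, and provenance travel together in a single record:

```python
import json

# One possible canonical document shape (field names are illustrative).
doc = {
    "id": "course:sis:12345",           # stable identity (see Chapter 1)
    "unit_type": "course",              # course | program | policy | faq
    "content": "CS 302 Algorithms. Prerequisite: CS 201 with a grade of C or better.",
    "metadata": {
        "campus": "main",
        "modality": "in-person",
        "level": "undergraduate",
        "effective_term": "2025-fall",
    },
    "provenance": {
        "source": "sis_export",         # which connector produced it
        "source_url": None,
        "ingested_at": "2025-01-20T00:00:00+00:00",
        "extractor_version": "v3",      # which parsing rules were applied
    },
}

serialized = json.dumps(doc, indent=2)  # what gets written to processed storage
```

Because provenance is embedded in every record, lineage questions (“where did this text come from, and when?”) are answerable without a separate lookup.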

Practice note for Build connectors for HTML, PDFs, and structured exports: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Normalize text and structure into a canonical document format: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Handle redirects, pagination, and multi-term versions: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Validate and store raw vs processed artifacts: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Create lineage logs for traceability: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Sections in this chapter
Section 2.1: Crawling vs API extraction tradeoffs
Section 2.2: HTML parsing and boilerplate removal

Section 2.1: Crawling vs API extraction tradeoffs

Most catalogs offer at least two extraction paths: crawl the public website (HTML) or pull structured data via an export/API (SIS report, CMS feed, or vendor endpoint). Prefer APIs when they are stable, authenticated appropriately, and contain the fields you need (course code, title, credits, prerequisites, term, campus). APIs reduce boilerplate noise and usually provide explicit identifiers—critical for incremental refresh.

Crawling is often unavoidable because program rules, narrative descriptions, and policy pages may exist only as web content or PDFs. Crawlers must handle redirects (HTTP 301/302), canonical URLs, pagination (A–Z indexes, “load more” lists), and robots constraints. A practical pattern is a two-stage crawl: first discover URLs from index pages and sitemaps, then fetch and archive each page with response headers and a content hash.

Engineering judgment: combine sources rather than choosing one. If an API provides course records but not the narrative around prerequisite exceptions or residency requirements, ingest both and reconcile them in your canonical model using stable keys (e.g., institution_id + course_code + term_id). Track the source priority so downstream retrieval can prefer the authoritative field (API credits) while still indexing helpful explanations (HTML).

  • Connector contracts: each connector outputs raw artifact bytes, normalized text, extracted metadata, and a fetch log entry.
  • Incremental refresh: use ETag/Last-Modified when available; otherwise store a SHA-256 hash of raw content and skip unchanged items.
  • Multi-term versions: include catalog_year or effective_term at discovery time, not after parsing.
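The hash-based skip in the second bullet can be sketched in a few lines. This is a minimal illustration, not a full connector: the `seen_hashes` mapping stands in for whatever persistent store (database table, JSON file) you keep between runs, and the function names are hypothetical.

```python
import hashlib

def content_hash(raw_bytes: bytes) -> str:
    """Stable fingerprint of the raw artifact, used to skip unchanged items."""
    return hashlib.sha256(raw_bytes).hexdigest()

def needs_refresh(raw_bytes: bytes, seen_hashes: dict, url: str) -> bool:
    """Return True if this URL's content changed since the last run.

    `seen_hashes` is a url -> hash mapping persisted between runs
    (a database table or JSON file in practice).
    """
    new_hash = content_hash(raw_bytes)
    if seen_hashes.get(url) == new_hash:
        return False  # unchanged: skip re-parsing and re-embedding
    seen_hashes[url] = new_hash
    return True
```

When ETag or Last-Modified headers are available, check them first and fall back to the hash; the hash remains useful because some servers change headers without changing content.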

A common mistake is discovering URLs without recording the discovery context. If a course appears under multiple pages (department listing, search results, archived version), you need to know which path produced it so you can de-duplicate and keep the correct version. Log both the discovered URL and the referrer/index page that contained it; this becomes part of your lineage story.

Section 2.2: HTML parsing and boilerplate removal

HTML catalog pages mix valuable text with navigation menus, footers, cookie banners, and “related links.” If you embed all of it, retrieval quality drops because the vector index learns irrelevant patterns (“Apply Now”, “Contact Us”) and pushes down the real course details. The goal is to extract the main content consistently while preserving structure like headings, lists, and tables that carry meaning.

Start with DOM parsing (e.g., Cheerio/BeautifulSoup) rather than regex. Identify the main content container using site-specific selectors when possible (main, article, vendor-specific classes). When site structure varies, add heuristics: choose the node with the highest text density, exclude nodes with repeated link lists, and remove elements by role/class (nav, footer, aside). Preserve heading levels (h1-h4) because they help later chunking maintain semantic boundaries.
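As a rough sketch of the skip-by-role idea, here is a stdlib-only extractor; in practice you would reach for BeautifulSoup or Cheerio with site-specific selectors, but the shape is the same: skip boilerplate containers, keep main content, and preserve heading levels for later chunking. The tag list and markdown-style heading prefixes are illustrative choices.

```python
from html.parser import HTMLParser

SKIP_TAGS = {"nav", "footer", "aside", "script", "style"}

class MainContentExtractor(HTMLParser):
    """Collect visible text while skipping boilerplate containers."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # >0 while inside a skipped container
        self.parts = []
        self.heading = None   # markdown-style prefix while inside h1-h4

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in {"h1", "h2", "h3", "h4"} and self.skip_depth == 0:
            self.heading = "#" * int(tag[1]) + " "

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag in {"h1", "h2", "h3", "h4"}:
            self.heading = None

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.parts.append((self.heading or "") + data.strip())

def extract_main_text(html: str) -> str:
    parser = MainContentExtractor()
    parser.feed(html)
    return "\n".join(parser.parts)
```

A text-density heuristic (pick the subtree with the most non-link text) layers naturally on top of this when class names vary across catalog templates.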

Boilerplate removal is where over-cleaning can hurt. Course pages often include tables for credit breakdowns or prerequisite logic. Convert tables into readable text (row-wise “Key: Value” lines) rather than dropping them. Keep link targets when they are meaningful (e.g., “MATH 101” links) by converting anchors into “text (URL)” or capturing resolved identifiers in metadata.

  • Redirect handling: resolve to a canonical URL and store both requested_url and final_url.
  • Pagination: detect “next” links and page parameters; store page number and total pages in fetch metadata.
  • De-duplication: compute normalized-text hashes after boilerplate removal to avoid indexing identical pages served under different URLs.

Common mistakes include stripping all lists (losing learning outcomes), flattening headings (making chunks incoherent), and not capturing the page title reliably (some catalogs render it via JavaScript; you may need server-side rendering or a fallback to OpenGraph metadata). Treat HTML extraction as a versioned component: when selectors change, bump an extractor version and record it in lineage logs so you can explain differences in downstream retrieval.

Section 2.3: PDF extraction and layout pitfalls

PDF catalogs are frequently the “source of truth” for policy and program rules, but they are the most error-prone to extract. Unlike HTML, PDFs encode layout, not reading order. Two-column pages, headers/footers, hyphenated line breaks, and scanned images can produce text that looks plausible but is semantically scrambled—exactly the kind of input that causes RAG hallucinations because the model tries to reconcile contradictory fragments.

Use a layered approach. First, attempt text-based extraction (pdfminer, PyMuPDF, or similar) and capture per-page text with coordinates when possible. Then apply layout-aware reconstruction: detect columns by clustering x-coordinates, remove repeating headers/footers by matching near-identical lines across pages, and join hyphenated words only when the hyphen occurs at line end and the next line begins with a lowercase continuation.
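Two of those reconstruction steps are simple enough to sketch directly: dropping lines that repeat across most pages (headers/footers) and cautiously re-joining hyphenated line breaks. The 60% repetition threshold is an assumption to tune against your own PDFs.

```python
from collections import Counter

def remove_repeating_lines(pages: list[list[str]], min_fraction: float = 0.6) -> list[list[str]]:
    """Drop lines (headers/footers) that repeat on most pages."""
    counts = Counter(line for page in pages for line in set(page))
    threshold = max(2, int(len(pages) * min_fraction))
    return [[ln for ln in page if counts[ln] < threshold] for page in pages]

def join_hyphenation(lines: list[str]) -> str:
    """Join words broken across lines only when the continuation is lowercase."""
    text = ""
    for line in lines:
        if text.endswith("-") and line[:1].islower():
            text = text[:-1] + line  # "pre-\nrequisite" -> "prerequisite"
        elif text:
            text += " " + line
        else:
            text = line
    return text
```

The lowercase-continuation guard matters: it keeps legitimate hyphens like "MATH-101" or "2024-2025" intact while repairing words that the PDF layout split.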

For scanned PDFs, integrate OCR (Tesseract or managed OCR) but do it selectively: OCR only pages with low text yield to control cost and error rates. Store confidence scores and mark OCR-derived text so you can quarantine or down-rank it later if it proves noisy.

  • Program rules: preserve section headings and numbering (e.g., “3.2 Residency Requirements”) to maintain citeable structure.
  • Tables: if table extraction is unreliable, capture the table as both text and an image snippet reference in raw artifacts for audit.
  • Page lineage: store page_number, bbox (if available), and extraction method (text vs ocr).

A practical check is to compute “reading coherence” metrics: average line length, percentage of repeated header lines, and frequency of isolated single words. Spikes often indicate column mixing or footer pollution. Instead of forcing every PDF through the same path, route problematic files to a quarantine or “manual rules required” bucket, and continue indexing the clean majority.
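A coherence check of this kind fits in a small function. The specific thresholds below (average line length under 15, more than 30% single-word lines) are illustrative assumptions; calibrate them on known-good and known-bad extractions from your own catalogs.

```python
def coherence_metrics(lines: list[str]) -> dict:
    """Cheap heuristics flagging column mixing or footer pollution."""
    non_empty = [ln for ln in lines if ln.strip()]
    if not non_empty:
        return {"avg_line_len": 0.0, "single_word_frac": 0.0, "suspect": True}
    avg_len = sum(len(ln) for ln in non_empty) / len(non_empty)
    single_words = sum(1 for ln in non_empty if len(ln.split()) == 1)
    frac = single_words / len(non_empty)
    return {
        "avg_line_len": avg_len,
        "single_word_frac": frac,
        # Very short lines or many isolated words often mean scrambled columns.
        "suspect": avg_len < 15 or frac > 0.3,
    }
```

Documents flagged as suspect go to the quarantine bucket rather than the index.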

Section 2.4: Canonical JSON document model

Normalization means every source—HTML, PDF, structured export—becomes the same kind of object downstream. A canonical JSON model is the contract between ingestion and chunking/indexing. Design it to support retrieval filtering (term, campus, credential), provenance (citations), and safe reprocessing (raw artifact references). Avoid embedding-specific assumptions here; focus on representing content and metadata cleanly.

A practical document model for catalogs typically includes: a stable doc_id, doc_type (course, program, policy, syllabus), source (html/pdf/api), and effective metadata (catalog year, term range). Add title, body_text, and structure elements (headings, lists, tables) either as a simplified markdown-like array or as an ordered block list. Store urls (requested/final/canonical), plus raw_artifact_uri pointing to immutable storage (S3/GCS/blob).

For RAG, citations matter. Include spans or content_blocks with offsets and source anchors: HTML CSS selectors/XPaths, or PDF page numbers. This enables later “answer with citations” behavior and makes debugging straightforward when a retrieved chunk seems wrong.

  • Identity: doc_id should be deterministic (e.g., hash of institution + canonical_url + effective_term + doc_type).
  • Versioning: include extractor_version and normalized_at timestamps.
  • Multi-term: model effective_from/effective_to or catalog_year explicitly; do not bury it in free text.
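The deterministic doc_id from the first bullet might look like this minimal sketch. The field order and separator are arbitrary but must be frozen: changing them silently changes every id and forces a full re-index.

```python
import hashlib

def make_doc_id(institution: str, canonical_url: str, effective_term: str, doc_type: str) -> str:
    """Deterministic doc_id: the same inputs always yield the same id.

    The join order and separator are part of the contract; changing them
    is a breaking change that invalidates all existing ids.
    """
    key = "|".join([institution, canonical_url, effective_term, doc_type])
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]
```

Determinism is what makes idempotent upserts possible later: re-ingesting an unchanged page produces the same id and overwrites in place instead of duplicating.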

Common mistakes are treating each web page as a unique “document” without consolidating variants (same course page across terms) or losing key fields by flattening everything into a single text blob. Your canonical model should be rich enough to support filters later (e.g., retrieve only undergraduate policies for 2025–2026) while still being simple enough that every connector can populate it.

Section 2.5: Language normalization and encoding

Once you have extracted text, normalize it so your embeddings reflect meaning rather than encoding quirks. Catalogs often contain smart quotes, non-breaking spaces, weird bullet characters, and department abbreviations that vary by source. Normalize too little and you get duplicate near-identical chunks; normalize too aggressively and you erase distinctions (e.g., “C++” becomes “C”). The goal is consistent, reversible-enough text cleaning.

Start with Unicode normalization (typically NFKC) to standardize look-alike characters, then normalize whitespace: convert non-breaking spaces to spaces, collapse repeated spaces, and preserve intentional newlines around headings and lists. Standardize bullets and numbering into a consistent representation. Keep case as-is for course codes (e.g., “CS 101”) and preserve punctuation that changes meaning. When you remove artifacts like page headers (“2024–2025 Catalog”), do it via anchored patterns and record the rule name in your processing log.
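A minimal normalizer implementing those first steps, written so that applying it twice gives the same output (the idempotency property discussed at the end of this section):

```python
import re
import unicodedata

def normalize_text(text: str) -> str:
    """NFKC + whitespace normalization; idempotent by construction."""
    text = unicodedata.normalize("NFKC", text)
    text = text.replace("\u00a0", " ")   # non-breaking space -> space (NFKC already maps it; kept for clarity)
    text = re.sub(r"[ \t]+", " ", text)  # collapse runs of spaces/tabs
    text = re.sub(r" ?\n ?", "\n", text) # trim stray spaces around newlines
    return text.strip()
```

Note what is deliberately absent: no lowercasing (course codes like "CS 101" must survive) and no punctuation stripping ("C++" must not become "C").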

Language handling matters for multilingual institutions. Detect language at the document or block level and store language metadata. If you plan to embed with multilingual models, keep the original text; if you translate, store both original and translated fields and mark which one is indexed. Never silently translate without lineage—retrieval and citations become hard to trust.

  • Encoding pitfalls: mis-decoded PDFs can turn "•" into "â€¢" (UTF-8 bytes read as Latin-1); treat this as a signal to re-extract with a different library or encoding hint.
  • Abbreviation expansions: consider a controlled mapping (“hrs” → “hours”) only if consistent and scoped; keep the original in metadata for audit.
  • PII/notes: remove staff emails/phone numbers if not needed for your product; log the removal policy.

A practical outcome is fewer duplicates and cleaner chunks later. You should be able to run normalization twice and get the same output (idempotency). If normalization is not idempotent, it will be hard to trust incremental refresh because documents will appear to “change” every run.

Section 2.6: Data validation and quarantine flows

Validation is the gate that protects your vector index from malformed or misleading documents. Treat ingestion as producing two artifact tiers: raw (immutable bytes + fetch metadata) and processed (canonical JSON + normalized text). Validation runs on processed artifacts and decides: accept, reject, or quarantine for review/retry. This is where you prevent downstream hallucinations caused by partial pages, mixed columns, or missing term metadata.

Define schema validation (required fields, types, allowed enums) and content validation (minimum text length, presence of key fields like title, detection of “Access Denied” or cookie-wall pages). Add structural checks by doc type: course docs should contain a course code pattern; program rules should include headings or section numbering; syllabus docs should include instructor/date only if you intend to keep those fields. For HTML, validate that the final URL matches the expected domain and that redirects did not land on a generic landing page.

Quarantine is not failure; it’s controlled uncertainty. Store quarantined items with an error code, sample text snippet, and pointers to raw artifacts so you can debug quickly. Many issues are transient (timeouts, anti-bot interstitials) and should be retried with backoff. Others require extractor updates (new CSS selectors, changed PDF layout). Your lineage logs should record every decision: fetched → parsed → normalized → validated → accepted/quarantined, with timestamps and versions.

  • Raw vs processed storage: keep raw for audit and reprocessing; keep processed for indexing and evaluation.
  • Lineage logs: include connector name, extractor version, normalization ruleset version, and content hashes.
  • Safe re-indexing: only accepted documents are eligible for embedding; quarantine prevents “bad updates” from replacing good historical data.

Common mistakes include validating only the schema (ignoring “empty but valid” documents), overwriting last-known-good processed artifacts, and failing to capture enough context to reproduce an error. When you implement weekly refresh later, this quarantine and lineage foundation is what lets you update incrementally with confidence and explain exactly why a retrieved answer cites a particular term’s catalog page.
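The accept/reject/quarantine decision can be sketched as a single function over the canonical document. The field names follow the model from Section 2.4, and the markers, length threshold, and course-code pattern are illustrative assumptions to adapt to your schema.

```python
import re

COURSE_CODE = re.compile(r"\b[A-Z]{2,5}[ -]?\d{3,4}\b")
BLOCK_MARKERS = ("access denied", "enable cookies", "please verify you are human")

def validate(doc: dict) -> tuple[str, str]:
    """Return (decision, reason): accept, reject, or quarantine."""
    # Schema validation: required fields must be present and non-empty.
    for field in ("doc_id", "doc_type", "title", "body_text"):
        if not doc.get(field):
            return "reject", f"missing required field: {field}"
    body = doc["body_text"]
    # Content validation: blocked pages and empty-but-valid documents.
    if any(marker in body.lower() for marker in BLOCK_MARKERS):
        return "quarantine", "looks like a blocked or interstitial page"
    if len(body) < 80:
        return "quarantine", "body too short to be a real catalog page"
    # Structural check by doc type.
    if doc["doc_type"] == "course" and not COURSE_CODE.search(body):
        return "quarantine", "course doc without a recognizable course code"
    return "accept", "ok"
```

Note the split: schema failures are rejects, while content oddities are quarantined with a reason code so they can be retried or routed to extractor fixes.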

Chapter milestones
  • Build connectors for HTML, PDFs, and structured exports
  • Normalize text and structure into a canonical document format
  • Handle redirects, pagination, and multi-term versions
  • Validate and store raw vs processed artifacts
  • Create lineage logs for traceability
Chapter quiz

1. Why does Chapter 2 emphasize producing a canonical document model before chunking and indexing?

Show answer
Correct answer: So downstream chunking and indexing can rely on consistent, repeatable structure and metadata
The chapter’s goal is trustworthy downstream processing, which depends on normalizing diverse sources into a consistent canonical format.

2. Which design choice best supports safe weekly re-ingestion with traceability?

Show answer
Correct answer: Store raw artifacts and processed artifacts separately, with lineage logs describing fetch time and applied extractors/rules
Repeatability and auditability require immutable raw captures, reproducible processed outputs, and lineage logs capturing provenance and transformations.

3. A team scrapes only rendered text from catalog pages and discards URLs and timestamps. What risk does Chapter 2 highlight?

Show answer
Correct answer: You lose the ability to trace where text came from and when it was fetched, making audits and safe refresh difficult
Discarding URLs/timestamps breaks provenance and makes it hard to detect changes and maintain traceability.

4. What is the main reason to handle multi-term versions (e.g., 2024–2025 vs 2025–2026) during ingestion and normalization?

Show answer
Correct answer: To ensure retrieval can filter/return the correct term’s rules instead of outdated program requirements
Preserving term/version metadata prevents the system from serving outdated catalog rules during retrieval.

5. Which workflow aligns with the chapter’s goal of keeping bad documents out of the index?

Show answer
Correct answer: Run validation checks and quarantine failures before they reach chunking/indexing
The chapter calls for validation plus a quarantine process to prevent malformed or low-quality documents from contaminating the index.

Chapter 3: Clean, Deduplicate, and Enrich the Catalog

A Retrieval-Augmented Generation (RAG) system is only as reliable as the catalog data it retrieves. Course catalogs are deceptively messy: PDFs converted to text introduce hyphenation artifacts, bullet lists collapse into run-on sentences, and the same course appears across terms with tiny edits. If you embed and index this data “as-is,” you will amplify noise, fragment recall, and increase hallucination risk because the model sees conflicting snippets as equally plausible.

This chapter builds the middle of your pipeline: deterministic cleaning, normalization, deduplication, and metadata enrichment. The goal is not to create a perfect catalog; it’s to create a repeatable workflow that produces stable text for embeddings and structured metadata for filtering and ranking. You will also learn to generate audit reports that surface missing fields and contradictions early, before they degrade retrieval quality.

Engineering judgment matters here. Over-cleaning can remove meaning (e.g., stripping punctuation that distinguishes “C++” from “C”), while under-cleaning can leave spurious tokens that dominate embeddings. Similarly, deduplication is not “delete anything similar”; it is “group representations of the same real-world course offering and keep a canonical record plus provenance.” The outcome should be a catalog schema that is retrieval-friendly: clean text for chunks, plus metadata fields such as normalized course code, credit range, prerequisites, term, campus, and program relationships that enable precise filters in your vector index.

  • Build deterministic text cleaning rules and unit tests to prevent regressions.
  • Normalize course identifiers and credit formats to stabilize joins and filters.
  • Deduplicate near-identical offerings across terms without losing provenance.
  • Enrich entries with parsed entities (credits, prereqs, outcomes) for better retrieval.
  • Resolve cross-links into an explicit relationship graph.
  • Produce reproducible audits that quantify quality and guide weekly refreshes.

By the end of this chapter, you should have a pipeline stage that turns raw catalog pages into canonical, linked, and auditable course records ready for chunking and indexing.

Practice note for this chapter's milestones (deterministic cleaning rules and unit tests, deduplicating near-identical entries across terms, extracting entities into metadata, resolving cross-links, and producing audit reports): for each one, document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.


Sections in this chapter
Section 3.1: Text cleaning rules (whitespace, bullets, artifacts)

Start with deterministic cleaning rules that transform raw catalog text into stable, embedding-friendly content. Deterministic means: the same input always produces the same output, regardless of runtime order or environment. This matters because embeddings and dedup hashes change when whitespace or punctuation changes, which can cause unnecessary re-indexing and churn.

Typical artifacts include: repeated whitespace, page headers/footers, broken hyphenation from PDF extraction (e.g., “pre- requisite”), bullet characters (•, ◦) that become random Unicode, and “smart quotes” that vary by source. A practical approach is to define a “clean_text(text) -> text” function with a small sequence of rules, each justified and unit-tested.

  • Whitespace normalization: collapse runs of spaces, normalize newlines, trim leading/trailing space.
  • Bullet normalization: map common bullet glyphs to a standard “- ” prefix; preserve list boundaries with newlines.
  • Artifact removal: drop known boilerplate lines (e.g., “Catalog Year 2025–2026”), but only via exact patterns or anchored regex to avoid deleting meaningful content.
  • Hyphenation repair: join line-broken words cautiously (only when a hyphen occurs at line end and the next line starts with a letter).
  • Unicode normalization: normalize to NFC (or NFKC if you have strong reasons) to stabilize comparisons.

Common mistakes: stripping all punctuation (breaks course codes like “MATH-101”), deleting all short lines (often removes prerequisites), and removing parentheses (often contain credit ranges or co-requisite notes). Keep a “lossy vs lossless” mindset: cleaning should remove noise but preserve semantics. Finally, write unit tests using real samples: a PDF-extracted course description, an HTML page, and a copied syllabus snippet. Your tests should assert exact expected outputs, so a future change (like a new regex) cannot silently change the downstream embeddings.
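A minimal clean_text along the lines described, with a fixed rule order so the output is deterministic. The boilerplate pattern is a placeholder example; real catalogs will need their own anchored patterns, each named and logged.

```python
import re

BULLET_GLYPHS = "•◦▪●"
# Anchored patterns only: a rule must match a whole line to delete it.
ANCHORED_BOILERPLATE = [re.compile(r"^Catalog Year \d{4}[–-]\d{4}\s*$", re.MULTILINE)]

def clean_text(text: str) -> str:
    """Deterministic cleaning: fixed rule order, same input -> same output."""
    for pat in ANCHORED_BOILERPLATE:
        text = pat.sub("", text)
    for glyph in BULLET_GLYPHS:                      # normalize bullets to "- "
        text = re.sub(rf"{glyph}[ \t]*", "- ", text)
    text = re.sub(r"(\w)-\n([a-z])", r"\1\2", text)  # pre-\nrequisite -> prerequisite
    text = re.sub(r"[ \t]+", " ", text)              # collapse horizontal whitespace
    text = re.sub(r"\n{3,}", "\n\n", text)           # cap blank-line runs
    return text.strip()
```

The hyphenation rule only fires when the hyphen sits at a line end and a lowercase letter follows, so course codes like "MATH-101" survive, and each rule is trivially unit-testable in isolation.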

Section 3.2: Normalizing course codes, titles, and credit formats

Normalization is about creating a canonical representation for fields that drive joins, filters, and grouping. In course catalogs, the biggest offenders are course codes (spacing and punctuation), titles (case and subtitles), and credits (integers, decimals, ranges, or “variable”). If these fields are inconsistent, you cannot reliably deduplicate across terms or resolve cross-links like prerequisites.

For course codes, define a strict parser that produces both a display form and a normalized key. For example, map “CS 101”, “CS-101”, and “cs101” to subject=CS, number=101, and normalized key CS101. Keep suffixes (e.g., “101L”, “101A”) and separators used in your institution (e.g., “CSCI 1100”). Store the original raw string for audit and explainability.
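A strict parser of this shape might look like the following sketch; the regex reflects common US-style codes (2-5 letter subject, 3-4 digit number, optional suffix) and should be adjusted to your institution's conventions.

```python
import re

CODE_RE = re.compile(r"^\s*([A-Za-z]{2,5})[\s\-_]*(\d{3,4})([A-Za-z]?)\s*$")

def parse_course_code(raw: str):
    """Map 'CS 101', 'CS-101', 'cs101', 'BIO101L' to one canonical form.

    Returns None for unparseable input so callers can route it to audit.
    """
    m = CODE_RE.match(raw)
    if not m:
        return None
    subject, number, suffix = m.group(1).upper(), m.group(2), m.group(3).upper()
    return {
        "raw": raw,  # keep the original string for audit and explainability
        "subject": subject,
        "number": number,
        "suffix": suffix,
        "display": f"{subject} {number}{suffix}",
        "key": f"{subject}{number}{suffix}",
    }
```

Returning None instead of guessing keeps bad codes out of your join keys; the audit report in Section 3.6 is where those unparsed strings surface.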

For titles, avoid aggressive rewriting. Normalize whitespace, trim trailing punctuation, and optionally store a lowercase “title_key” for matching, but keep the human-readable title intact for generation and citations. If the catalog sometimes includes subtitles like “Introduction to Biology: Cells and Systems,” consider splitting into title and subtitle only if you can do it reliably; otherwise, keep a single field to avoid brittle heuristics.

Credits need a structured schema. A practical model is: credits_min, credits_max, and credits_text. Parse “3 credits” as (3,3), “3–4 credits” as (3,4), and “variable” as (null,null) with credits_text="Variable". This enables filter queries like “courses with at least 4 credits” without guessing from raw text. The key engineering judgment: never discard the original credit string, because edge cases (labs, contact hours) may need manual review later.
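The (credits_min, credits_max, credits_text) model translates directly into a small parser. This is a sketch covering the three cases named above; real catalogs will add formats (contact hours, "credit hours") that extend the patterns.

```python
import re

RANGE_RE = re.compile(r"(\d+(?:\.\d+)?)\s*(?:[–-]|to)\s*(\d+(?:\.\d+)?)")
SINGLE_RE = re.compile(r"(\d+(?:\.\d+)?)")

def parse_credits(text: str) -> dict:
    """'3 credits' -> (3, 3); '3–4 credits' -> (3, 4); 'Variable' -> (None, None).

    The raw string is always kept in credits_text for manual review.
    """
    result = {"credits_min": None, "credits_max": None, "credits_text": text}
    if "variable" in text.lower():
        return result
    m = RANGE_RE.search(text)
    if m:
        result["credits_min"] = float(m.group(1))
        result["credits_max"] = float(m.group(2))
        return result
    m = SINGLE_RE.search(text)
    if m:
        result["credits_min"] = result["credits_max"] = float(m.group(1))
    return result
```

With this schema, "courses with at least 4 credits" becomes a simple `credits_max >= 4` filter instead of a text search.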

Section 3.3: Duplicate detection (hashing and similarity)

Course catalogs often repeat the same course across terms, campuses, or delivery modes, with minor edits (“updated textbook,” “same outcomes, new wording”). If you index all variants separately, retrieval can surface multiple near-identical chunks and waste context window budget. Deduplication should reduce redundancy while preserving provenance: you want one canonical course entity with linked offerings and source URLs.

Use a two-stage approach: fast candidate generation with hashing, then confirmation with similarity. First, build a stable “dedup key” from normalized fields (e.g., CS101 + normalized title + credits range). Then compute a content hash on cleaned description text (e.g., SHA-256). Exact hash matches are safe duplicates. For near-duplicates, compute similarity on a normalized representation (e.g., TF-IDF cosine, MinHash/LSH for shingles, or a lightweight embedding similarity if you already compute embeddings). Set thresholds conservatively and review borderline cases.

  • Exact dedup: same normalized code and exact description hash → merge automatically.
  • Near dedup: same normalized code, similarity above threshold (e.g., 0.92) → merge, but keep both texts and choose a canonical.
  • Collision guard: different codes but very high similarity → do not merge automatically; flag for audit (cross-listed courses or data errors).

Canonical selection should be deterministic: prefer the newest catalog year, the most complete record (fewest missing fields), or a trusted source system over scraped HTML. Store a merged_from list with term/campus/source identifiers so weekly refreshes can update offerings without changing the canonical entity unnecessarily. Common mistakes include deduplicating purely on title (“Special Topics” repeats with different content) and ignoring suffixes (“BIO101” vs “BIO101L”). Your dedup logic should be unit-tested with known pairs: identical, near-identical, and deceptively similar but distinct.
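The two-stage decision logic can be sketched with stdlib tools; here Jaccard similarity over word 3-grams stands in for TF-IDF cosine or MinHash, which scale better but behave similarly for this illustration.

```python
import hashlib

def shingles(text: str, n: int = 3) -> set:
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def similarity(a: str, b: str) -> float:
    """Jaccard similarity over word 3-grams; stand-in for TF-IDF/MinHash."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb) if (sa | sb) else 0.0

def dedup_decision(key_a: str, key_b: str, desc_a: str, desc_b: str,
                   threshold: float = 0.92) -> str:
    """Two-stage decision: exact hash, then similarity, then collision guard."""
    same_key = key_a == key_b
    ha = hashlib.sha256(desc_a.encode()).hexdigest()
    hb = hashlib.sha256(desc_b.encode()).hexdigest()
    if same_key and ha == hb:
        return "merge_exact"
    sim = similarity(desc_a, desc_b)
    if same_key and sim >= threshold:
        return "merge_near"
    if not same_key and sim >= threshold:
        return "flag_for_audit"  # cross-listed course or data error
    return "distinct"
```

The collision guard is the important branch: near-identical text under different codes is exactly the cross-listing case that must go to audit, never auto-merge.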

Section 3.4: Metadata enrichment via parsing and heuristics

Metadata is what makes RAG retrieval precise. Clean text helps semantic matching, but structured fields enable filtering (“only undergraduate,” “offered online,” “requires CHEM101”). Enrichment is the step where you extract entities—credits, prerequisites, learning outcomes—from the cleaned text and store them in a schema designed for retrieval and evaluation.

Start with rule-based parsing because it is explainable and stable. For prerequisites and co-requisites, build a small grammar that recognizes patterns like “Prerequisite(s):”, “Pre-req:”, “Must have completed”, and lists of course codes separated by commas, “and/or.” Don’t try to perfectly model logical expressions on day one; instead, capture two layers: (1) a raw prerequisite string, and (2) a parsed list of referenced course codes. Later, you can improve logic handling (AND/OR groups) as a separate iteration.

  • Credits: parse into min/max as in Section 3.2; store contact hours if present.
  • Prereqs/coreqs: extract course references; store both text and normalized codes.
  • Restrictions: “for majors only,” “junior standing,” “permission of instructor.” Store as tags.
  • Outcomes/topics: detect outcome sections (“Students will be able to…”) and store as a list for retrieval boosts.

Heuristics should be scoped and measurable. For example, detect delivery mode from phrases (“online,” “hybrid”) only if your false positive rate is acceptable; otherwise, treat it as unknown. A common mistake is to over-interpret prose and create “confident” metadata that is wrong, which leads to filtered searches excluding the correct course. Prefer conservative enrichment with explicit “unknown” states, and produce audit flags when parsing fails (e.g., “Prerequisite mentioned but no codes parsed”). This sets you up for better recall and fewer hallucinations because the retriever can constrain results to relevant subsets without relying solely on embeddings.
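The two-layer prerequisite capture (raw string plus referenced codes, no AND/OR modeling yet) can be sketched as follows; the header patterns cover the phrasings named above and will grow as your audits surface misses.

```python
import re

PREREQ_HEADER = re.compile(
    r"(?:Prerequisite\(s\)|Prerequisites?|Pre-?reqs?)\s*:\s*(.+)", re.IGNORECASE
)
CODE_TOKEN = re.compile(r"\b([A-Z]{2,5})\s?-?\s?(\d{3,4}[A-Z]?)\b")

def extract_prereqs(description: str) -> dict:
    """Two layers: the raw prerequisite sentence plus normalized codes.

    AND/OR logic is deliberately not modeled yet; that is a later iteration.
    """
    m = PREREQ_HEADER.search(description)
    raw = m.group(1).strip() if m else None
    codes = []
    if raw:
        for subj, num in CODE_TOKEN.findall(raw):
            code = f"{subj}{num}"
            if code not in codes:
                codes.append(code)
    return {"prereq_text": raw, "prereq_codes": codes}
```

A useful audit flag falls out naturally: prereq_text present but prereq_codes empty means "prerequisite mentioned but no codes parsed."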

Section 3.5: Relationship graph (prereq/coreq, program-to-course)

Catalog pages are full of cross-links: course pages reference other courses, and program pages reference required courses, electives, and rules (“Choose two from…”). For RAG, resolving these cross-links into an explicit relationship graph provides two benefits: it improves retrieval (you can pull connected context) and it reduces hallucinations (you can validate generated answers against known relationships).

Model relationships as edges between canonical entities. At minimum, you need: COURSE --prereq--> COURSE, COURSE --coreq--> COURSE, PROGRAM --requires--> COURSE, and PROGRAM --elective_group--> COURSE. Each edge should have provenance (source URL, catalog year) and confidence (parsed vs manually curated). When parsing program requirements, treat rule text as first-class data: store the raw rule paragraph, then store extracted course references and group labels (e.g., “Math foundation electives”).

  • Cross-listing: represent as equivalence links (CS205 ≡ IT205) rather than deduping them away.
  • Unresolved references: if a prereq code does not exist in your catalog snapshot, create a placeholder node and flag it for audit.
  • Versioning: relationships change by year; store effective_year so the retriever can filter by a student’s catalog year.

Practical retrieval pattern: if the user asks “Can I take CS201 next term?” retrieve CS201 plus its prereq nodes and the student’s program rules node (if available). This creates grounded answers with citations. Common mistakes: flattening program rules into one long chunk with no structure (hard to filter and cite), and ignoring year/version, which causes the system to mix requirements from different catalogs. A small, explicit graph keeps your pipeline honest and enables downstream guardrails like “only cite prerequisites that exist as edges in the graph.”

Section 3.6: Quality checks and reproducible audits

Cleaning and enrichment are only trustworthy if you can measure their outputs over time. Build quality checks that run every pipeline execution and produce audit artifacts you can diff week to week. This is where you turn ad hoc data wrangling into a maintainable system that supports incremental updates and safe re-indexing.

Define a set of invariants and completeness checks tied to your schema. Examples: every course must have a normalized code, a title, and at least one source URL; credits must either parse into min/max or be explicitly labeled variable; prerequisite text that contains a course-code-like token must produce at least one parsed reference; and every relationship edge must point to a known node or a placeholder with an “unresolved” flag.

  • Missing fields report: counts and lists by field (e.g., missing credits, missing description).
  • Conflicts report: same normalized code with different credit ranges or substantially different titles.
  • Dedup report: clusters merged, canonical chosen, and similarity scores for near-duplicates.
  • Link resolution report: unresolved prereq/program references and newly resolved ones.

Make audits reproducible by storing run metadata (timestamp, git commit, input snapshot IDs) and emitting machine-readable outputs (CSV/JSON) plus a human-readable summary. Unit tests should cover deterministic cleaning and parsing rules, while integration tests can validate end-to-end expectations on a small fixture catalog. Common mistakes include only checking averages (which hides outliers) and not keeping historical audits (which makes regressions hard to detect). When you later evaluate retrieval quality (recall, MRR), these audits help you connect a retrieval regression back to a concrete upstream issue—like a parsing change that stopped extracting prerequisites—so you can fix the pipeline rather than “tuning prompts” to compensate.
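A skeleton of a machine-readable audit covering two of the reports above (missing fields and credit conflicts); the record schema here is a simplified assumption, and a real version would also emit dedup clusters and link-resolution results.

```python
def audit(courses: list[dict]) -> dict:
    """Machine-readable audit: missing fields and conflicting credit ranges."""
    missing = {"credits": [], "description": [], "title": []}
    by_code = {}
    conflicts = []
    for c in courses:
        for field in missing:
            if not c.get(field):
                missing[field].append(c.get("code", "?"))
        code = c.get("code")
        # Same normalized code seen before with different credits -> conflict.
        if code in by_code and by_code[code].get("credits") != c.get("credits"):
            conflicts.append(code)
        by_code[code] = c
    return {"missing": missing, "conflicts": conflicts, "total": len(courses)}
```

Because the output is plain JSON-serializable data, week-over-week diffs are trivial, which is what makes regressions visible rather than anecdotal.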

Chapter milestones
  • Create deterministic cleaning rules and unit tests
  • Deduplicate near-identical course entries across terms
  • Extract entities (credits, prerequisites, outcomes) into metadata
  • Resolve cross-links (course codes, program requirements)
  • Produce audit reports for missing and conflicting fields
Chapter quiz

1. Why is embedding and indexing course catalog text “as-is” risky for a RAG system?

Show answer
Correct answer: It amplifies noise and conflicting snippets, fragmenting recall and increasing hallucination risk
Messy conversions and inconsistent entries create noisy, conflicting chunks that the model may treat as equally plausible.

2. What is the primary goal of deterministic cleaning, normalization, deduplication, and enrichment in this chapter?

Correct answer: To build a repeatable workflow that yields stable embedding text and structured metadata for filtering/ranking
The focus is repeatability and retrieval-friendly outputs: stable text plus metadata for better filtering and ranking.

3. Which approach best matches the chapter’s definition of deduplication for course offerings across terms?

Correct answer: Group representations of the same real-world course and keep a canonical record plus provenance
Deduplication is about grouping the same course across terms while preserving provenance, not blindly deleting similar text.

4. What is a key tradeoff highlighted in the chapter regarding cleaning rules?

Correct answer: Over-cleaning can remove meaning (e.g., stripping punctuation that distinguishes “C++” from “C”), while under-cleaning can leave spurious tokens that dominate embeddings
Both over- and under-cleaning can harm retrieval quality, so cleaning requires engineering judgment.

5. How do audit reports contribute to retrieval quality in the pipeline described?

Correct answer: They surface missing fields and contradictions early so issues can be fixed before degrading retrieval
Audits quantify missing/conflicting fields early, helping prevent downstream retrieval degradation during refreshes.

Chapter 4: Chunk Strategy for High-Recall Retrieval

Chunking is where a RAG pipeline becomes “course-catalog aware.” Cleaning and normalization make text consistent, but chunking determines what your retriever can actually find. In education catalogs, users rarely ask for an entire page; they ask for one prerequisite, one credit rule, one elective option, one deadline, or one exception clause. Your chunk strategy should therefore be built around how people ask questions—and how policy language is enforced—so retrieval returns the smallest, most self-contained evidence that still preserves meaning.

This chapter treats chunking as an engineering design choice with measurable outcomes. You will align chunk boundaries to user intent, implement section-aware chunking for structured pages, add overlap and context windows without exploding your index, attach metadata filters to reduce false matches, and run experiments to choose a default policy. The goal is high recall (you can retrieve the needed evidence) without sacrificing precision (you retrieve mostly relevant evidence) and without creating brittle, hard-to-maintain rules.

A practical way to think about chunking is: every chunk should be “answer-ready.” If retrieved alone, it should typically contain enough context to support a grounded answer, or at least clearly indicate what it refers to. When it cannot, your pipeline should have a planned mechanism for context stitching (for example, pulling the parent header or adjacent chunks). Done well, chunking reduces hallucinations because the model sees the exact governing text, not a loose paraphrase.

  • Outcome: higher recall for targeted questions (prereqs, credits, grading, residency, transfer, deadlines).
  • Outcome: fewer irrelevant matches because chunks carry section structure and metadata filters.
  • Outcome: stable, repeatable chunking that supports weekly refreshes and safe re-indexing later.

The sections that follow give you concrete templates and the judgment calls you must make. You will see common failure modes (too big, too small, too repetitive), a hierarchical approach that respects headings, and a validation routine based on “golden questions” rather than intuition.

Practice note for Choose chunk boundaries aligned to how users ask questions: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Implement section-aware chunking for structured pages: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Add overlap and context windows without bloating the index: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Attach metadata filters to chunks for precise retrieval: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Run chunking experiments and pick a default policy: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Sections in this chapter
Section 4.1: Chunking objectives and failure modes

Start by defining what “good chunking” means for your catalog RAG. In most course-catalog assistants, your primary retrieval objective is high recall under narrow, specific queries: “Does CS 201 require CS 101?” “How many credits are needed for the certificate?” “Can I retake a course for a higher grade?” If your chunk boundaries hide the exact sentence that answers these, the model will guess. If your chunk boundaries are too broad, you will retrieve a chunk that contains the answer but is buried among unrelated details, making generation less reliable.

Common failure modes appear in predictable patterns. Overly large chunks (e.g., whole pages or 2,000+ token blocks) dilute similarity signals and often retrieve for the wrong reasons (a shared word like “credits”). Overly small chunks (e.g., single sentences) can lose referents like “this program” or “students,” and can break policy clauses that depend on exceptions and definitions. Another failure mode is boundary mismatch with user questions: users ask about prerequisites and grading, but your chunks follow arbitrary character limits, splitting the prereq list from the course description or splitting a policy rule from its exception.

  • Recall failure: the retriever cannot find the one paragraph that governs the rule.
  • Precision failure: many chunks match superficial keywords (e.g., “transfer”) but are about different programs.
  • Context failure: the chunk contains the rule but not the scope (“applies to undergraduate,” “effective Fall 2025”).

Define two measurable targets before implementing: (1) retrieval recall@k for known-answer questions, and (2) median chunk size and index growth (cost). Your chunk strategy is “done” when recall stops improving meaningfully compared to the added cost and complexity. This framing keeps you from over-optimizing a clever chunker that is hard to maintain during weekly refreshes.

Section 4.2: Document-type templates (course vs policy vs program)

Course catalogs are not one document type; they are a bundle of genres. If you apply one universal chunk rule, you will underperform on at least one genre. Build templates by document type so chunk boundaries align with how users ask questions and how the content is structured.

Course pages are usually short and queryable by fields: title, description, credits, prerequisites, corequisites, restrictions, outcomes, offering terms. A strong default is “section-aware course chunks”: one chunk for the core description block plus one chunk per key field group (e.g., prerequisites + restrictions together). Users often ask “what are the prereqs?” so keep prereqs as a single coherent chunk, not scattered.

Policy pages (grading, withdrawal, academic standing, transfer) are rule-dense and exception-heavy. Chunk by heading and subheading, preserving rule + exception together. Avoid splitting numbered lists mid-list; policy meaning is often in list structure (“all of the following,” “any one of”).

Program pages (degrees, certificates, minors) combine narrative plus requirements tables and footnotes. Users ask “How many credits?” “Which courses satisfy X?” “What GPA is required?” Chunk around requirement groups: “Core requirements,” “Electives,” “Capstone,” “Residency,” and “Progression/continuation rules.” If a page contains a course list table, consider converting each row to a normalized textual record and chunking by requirement group so retrieval can connect the list to its governing header.

  • Template principle: chunk boundaries should align with user intent (prereqs, credits, eligibility) rather than page layout alone.
  • Template principle: keep governing text and its scope together (who/when/which program).

In implementation, store the chosen template name (e.g., doc_template=course_v2) as metadata so you can analyze performance by document type and update policies without rethinking everything at once.

Section 4.3: Hierarchical chunking with headings

Section-aware chunking becomes more robust when it is hierarchical. Instead of producing flat chunks from raw text, build a tree: document → headings (H1/H2/H3) → paragraphs/lists/tables. Each chunk is then emitted with a “path” that captures where it came from (for example: Program Requirements → Core Courses → Capstone). This path is extremely valuable both for retrieval quality and for explainable citations.

A practical hierarchical algorithm looks like this: parse the HTML (or converted markdown) into blocks, identify heading levels, and maintain a stack representing the current section path. Aggregate content under a heading until you hit a token budget, then emit a chunk. If a section is tiny (e.g., one sentence), merge it with its parent or nearest sibling so it remains answer-ready. If a section is huge (e.g., an entire “Policies” section), split at subheadings first; if no subheadings exist, split at paragraph boundaries while duplicating the section title into each chunk.

  • Chunk text should begin with a compact header prefix: “Course: BIO 210 — Genetics | Section: Prerequisites”.
  • Preserve lists: keep bullet/numbered lists intact whenever possible, because list semantics often encode requirements.
  • Stable IDs: generate chunk IDs from (document_id + section_path + ordinal) so re-indexing is safe and deterministic.

Engineering judgment: do not treat headings as purely decorative. In catalogs, headings usually represent policy scope or requirement groups. Including the heading path inside the chunk text improves embedding relevance, and storing it as metadata improves filtering and faceting later. This is one of the simplest ways to improve recall without adding more chunks.
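The hierarchical algorithm described above can be sketched as a heading-stack walk. This assumes block parsing and heading-level detection happen upstream; each block arrives as a `(level, text)` pair with `level=None` for body text, and token counting is approximated by word count:

```python
# Sketch of the heading-stack chunker described above. Assumes blocks are
# pre-parsed (level, text) pairs; level is the heading depth or None for body.
import hashlib

def chunk_blocks(doc_id, blocks, max_tokens=400):
    path, buf, ordinal, chunks = [], [], 0, []

    def emit():
        nonlocal ordinal, buf
        if not buf:
            return
        section_path = " → ".join(path) or "(root)"
        # Header prefix duplicated into the chunk text for embedding relevance.
        text = f"{section_path}\n" + "\n".join(buf)
        # Stable ID from (document_id + section_path + ordinal).
        chunk_id = hashlib.sha1(
            f"{doc_id}|{section_path}|{ordinal}".encode()
        ).hexdigest()[:16]
        chunks.append({"chunk_id": chunk_id, "section_path": section_path, "text": text})
        ordinal += 1
        buf = []

    for level, text in blocks:
        if level is not None:           # heading: close current chunk, adjust path
            emit()
            path[:] = path[: level - 1]
            path.append(text)
            ordinal = 0
        else:
            buf.append(text)
            if sum(len(t.split()) for t in buf) > max_tokens:  # crude token proxy
                emit()
    emit()
    return chunks
```

Because IDs derive only from the document ID, section path, and ordinal, re-running the chunker on unchanged input yields identical IDs, which is what keeps re-indexing safe and deterministic.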

Section 4.4: Overlap, max tokens, and context stitching

Overlap is a tool, not a requirement. You use overlap to prevent boundary cuts from losing critical context, but uncontrolled overlap inflates your index and can cause near-duplicate retrieval results. The right approach is to first choose sensible boundaries (headings, paragraphs, lists) and then apply minimal overlap only where natural boundaries are unavailable.

Set a default max chunk size in tokens based on your embedding model and the typical density of catalog text. Many teams start with 250–500 tokens for courses and 400–800 tokens for policies/program rules, then tune empirically. The danger of a single global max is that short course pages become fragmented unnecessarily, while long policy pages still contain multi-rule blocks. Prefer per-template budgets.

Use overlap strategically in two scenarios: (1) long paragraphs that you must split, and (2) sequences where the “scope sentence” precedes multiple clauses (e.g., “Students must meet all of the following…”). In these cases, a 10–15% overlap or a “prefix duplication” of the scope sentence is often enough. Prefix duplication is cheaper than full overlap because you can add a short context line (header + scope) to each child chunk rather than repeating entire paragraphs.

  • Context windows: store pointers to previous/next chunk IDs within the same section path for optional stitching at query time.
  • Stitching rule: if the top chunk is retrieved but contains unresolved pronouns (“this program”), pull the parent heading chunk or the immediately previous chunk.
  • Dedup control: when retrieving top-k, collapse near-identical chunks by chunk_hash or section_path to avoid wasting context budget.

Common mistake: using large overlaps to compensate for poor boundaries. This hides the real issue and bloats costs. A better workflow is: fix boundaries first (section-aware), then add minimal overlap or header prefixes, then add stitching only for the minority of cases where users need “one more paragraph” to make the rule interpretable.
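The prefix-duplication idea can be sketched as follows, assuming sentences are already split; `split_with_scope_prefix` is a hypothetical helper name:

```python
# Sketch of "prefix duplication": when a long paragraph must be split,
# repeat the header and scope sentence at the top of each piece instead
# of overlapping whole paragraphs. Sentence splitting is assumed upstream.
def split_with_scope_prefix(header, scope_sentence, sentences, max_sentences=3):
    pieces = []
    for i in range(0, len(sentences), max_sentences):
        body = " ".join(sentences[i : i + max_sentences])
        pieces.append(f"{header} | {scope_sentence} {body}")
    return pieces
```

Each child chunk stays interpretable on its own (the scope travels with it), at the cost of a one-line prefix rather than a 10–15% paragraph overlap.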

Section 4.5: Metadata-on-chunk design (filters and faceting)

Metadata turns “similar text retrieval” into “precise catalog retrieval.” For course catalogs, you almost always need to filter by institution, academic year/term, campus, program, and sometimes modality (online/in-person), level (undergrad/grad), and department. If you attach metadata only at the document level, you miss the chance to filter at chunk granularity—especially when a single page contains mixed information (e.g., multiple concentrations, multiple effective dates, cross-listed courses).

Design metadata in two layers: (1) inherited document metadata applied to every chunk, and (2) section-derived metadata extracted during chunking. Inherited metadata includes institution_id, source_url, catalog_year, doc_type, and last_updated. Section-derived metadata includes course_code (for course pages), program_id, requirement_group (Core/Electives/Capstone), policy_topic (Withdrawal/Grading/Transfer), and effective_term if present in the section.

  • Filter-first retrieval: apply strict filters (institution, catalog_year, doc_type) before vector similarity to reduce false positives.
  • Faceting: store fields that let you summarize results (“Top matches are all from Transfer Policy, 2025-26 catalog”).
  • Debuggability: include section_path and chunk_ordinal so you can inspect why a chunk matched.

Practical outcome: when a user asks “Can I transfer credits into the BS in Data Science?”, you can filter by program_id and policy_topic=transfer while still using embeddings for semantic match. This reduces hallucinations because the model is less likely to see an irrelevant transfer rule from a different program or year.
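A minimal sketch of the two-layer metadata design and filter-first retrieval, using the field names from this section (the helper names are illustrative):

```python
# Sketch of two-layer chunk metadata: inherited document fields plus
# section-derived fields, with strict filters applied before similarity.
def chunk_metadata(doc_meta, section_meta):
    inherited = {
        k: doc_meta[k]
        for k in ("institution_id", "source_url", "catalog_year", "doc_type", "last_updated")
        if k in doc_meta
    }
    return {**inherited, **section_meta}  # section-derived fields win on overlap

def filter_first(chunks, **filters):
    """Apply strict metadata filters before any vector-similarity step."""
    return [
        c for c in chunks
        if all(c["metadata"].get(k) == v for k, v in filters.items())
    ]
```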

Section 4.6: Golden questions to validate chunk quality

Chunking quality should be validated with retrieval experiments, not gut feel. Create a small, stable evaluation set of “golden questions” based on real student/advisor queries and the rules that often cause support tickets. Each question must map to a specific authoritative passage in your catalog, so you can judge whether retrieval brought back the correct chunk(s). This is how you compare chunk policies and pick a default.

Build your golden set across document types: course prerequisites, credit totals, GPA thresholds, admissions requirements, retake policies, transfer limits, residency requirements, modality constraints, and effective-date changes. For each question, record the expected document_id and section_path (or a canonical citation). Then measure recall@k and MRR (mean reciprocal rank). Recall@k tells you if the right evidence appears anywhere in the top-k; MRR tells you if it appears early enough to reliably enter the model context.

  • Chunk sanity checks: does the retrieved chunk contain the rule plus its scope? Does it preserve list semantics?
  • Boundary checks: are exceptions and definitions separated from the main rule?
  • Noise checks: are you retrieving near-duplicates due to excessive overlap?

When results are poor, diagnose by category. If recall is low for prerequisites, your course template may be splitting prereqs across chunks or not including course codes in chunk text. If MRR is low for policies, your policy chunks may be too large or missing heading prefixes that provide semantic anchors. Iterate by changing one variable at a time (max tokens, heading aggregation, overlap, prefix strategy, metadata filters) and rerun the same golden set. Once you select a default policy, lock it behind a version (e.g., chunk_policy=v3) so weekly refreshes remain consistent and changes are deliberate, measurable upgrades.
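The golden-question loop can be sketched like this, assuming a `retrieve` function that returns chunks in relevance order, each carrying `doc_id` and `section_path`:

```python
# Sketch of the golden-question evaluation described above: each question
# maps to an expected (doc_id, section_path) citation; we score retrieval
# with recall@k and MRR. `retrieve` is an assumed interface, not a library call.
def evaluate(golden, retrieve, k=5):
    hits, rr_sum = 0, 0.0
    for q in golden:
        results = retrieve(q["question"])[:k]
        for rank, chunk in enumerate(results, start=1):
            if (chunk["doc_id"], chunk["section_path"]) == q["expected"]:
                hits += 1           # correct evidence appeared in top-k
                rr_sum += 1.0 / rank  # reciprocal rank of first correct hit
                break
    n = len(golden)
    return {"recall_at_k": hits / n, "mrr": rr_sum / n}
```

Running this same fixed set against each candidate chunk policy (and again on every weekly refresh) turns “feels okay in demos” into a number you can trend.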

Chapter milestones
  • Choose chunk boundaries aligned to how users ask questions
  • Implement section-aware chunking for structured pages
  • Add overlap and context windows without bloating the index
  • Attach metadata filters to chunks for precise retrieval
  • Run chunking experiments and pick a default policy
Chapter quiz

1. Why does the chapter argue that chunking is especially important for course catalogs in a RAG pipeline?

Correct answer: Because it determines what the retriever can find for targeted, policy-specific questions
Users ask for specific rules (prereqs, credits, deadlines), so chunking controls whether the retriever can surface the right evidence.

2. What does the chapter mean by making each chunk "answer-ready"?

Correct answer: A chunk should contain enough context to support a grounded answer on its own, or clearly indicate what it refers to
Answer-ready chunks are self-contained evidence units; if they lack context, the pipeline should stitch context via headers or adjacent chunks.

3. How does the chapter recommend handling structured pages to improve retrieval quality?

Correct answer: Use section-aware chunking that respects headings and hierarchy
Section-aware chunking preserves meaning and reduces irrelevant matches by aligning chunks to document structure.

4. What is the intended benefit of adding overlap and context windows "without bloating the index"?

Correct answer: Preserve continuity across boundaries while controlling index growth and duplication
Overlap/context windows help keep necessary surrounding context, but the design must avoid excessive redundancy and index explosion.

5. According to the chapter, what is a recommended way to validate and choose a default chunking policy?

Correct answer: Run experiments and evaluate using "golden questions" rather than intuition
The chapter frames chunking as an engineering choice with measurable outcomes, validated via experiments and golden-question evaluation.

Chapter 5: Embed, Index, and Retrieve with Controls

By the time you reach embedding and indexing, your pipeline has already done the hard work: cleaned course titles, normalized credit formats, separated “policy” text from “course description” text, and chunked pages into retrieval-sized units. Now you need to make those chunks searchable at production scale—fast, filterable, and stable week after week.

This chapter focuses on the engineering judgment that turns “we have embeddings” into “we have reliable retrieval.” In course catalogs, users ask for precise constraints (“online only,” “meets Gen Ed Area B,” “prerequisite is CSCI 101,” “available Spring”), and the system must retrieve text that actually supports those claims. That requires disciplined versioning of embedding models, a vector database schema that matches your metadata strategy, and retrieval controls like top-k limits, hybrid search, reranking, and citations.

We’ll also address two problems that appear only in real deployments: index drift (when model changes silently degrade retrieval) and indexing correctness (when chunks duplicate or disappear across refreshes). Finally, we’ll introduce practical evaluation metrics—recall, MRR, and nDCG—that let you track whether retrieval quality is getting better, not just “feels okay in demos.”

  • Practical outcome: a repeatable embed→index→retrieve loop with guardrails that reduce hallucinations and make weekly refresh safe.
  • Common mistake: treating embeddings and indexing as a one-time step, rather than a controlled system with versions, idempotency, and evaluation.

The rest of this chapter breaks the work into six concrete sections: choosing and versioning an embedding model, designing vector namespaces and metadata filters, implementing hybrid retrieval with rerankers, making indexing idempotent with a document-to-chunk map, storing citation-ready fields, and evaluating retrieval quality with standard metrics.

Practice note for Select an embedding model and define versioning rules: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Create a vector index with hybrid search and metadata filters: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Implement upserts, deletes, and idempotent indexing: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Add retrieval controls: top-k, reranking, and citations: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Benchmark latency and cost for catalog-scale traffic: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Sections in this chapter
Section 5.1: Embedding model selection and drift considerations

Embedding model selection is not just about “best average score.” In course catalogs, your queries include short codes (e.g., “CSCI 101”), domain phrases (“upper-division writing intensive”), and policy terms (“repeatable for credit”). Pick a model that performs well on your mix of short queries and medium-length chunks. If you have multilingual catalogs or international programs, confirm cross-lingual performance early rather than bolting it on later.

Define versioning rules before you index a single vector. At minimum, store an embedding_model_id (vendor + model name), embedding_dim, and embedding_revision (a date or semantic version you control). Many teams get burned when a provider updates a model behind a stable name; even small shifts can change nearest-neighbor results. Treat embedding generation like a compiled artifact: if the “compiler” changes, you must rebuild or isolate.

  • Rule 1: Never mix embeddings from different model versions in the same searchable space unless your DB explicitly supports per-vector model separation.
  • Rule 2: If you upgrade the model, create a new index or namespace, run A/B evaluation, then cut over.
  • Rule 3: Log the embedding version alongside each chunk to support audits and rollbacks.
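Rule 3 (and the isolation demanded by Rules 1–2) can be made concrete with a small version record that is stored with every chunk and baked into the namespace name; the fields and naming scheme below are illustrative:

```python
# Sketch of an embedding version record per Rules 1-3 above. Field names
# and the namespace format are illustrative assumptions, not a standard.
from dataclasses import dataclass

@dataclass(frozen=True)
class EmbeddingVersion:
    model_id: str   # vendor + model name, e.g. "acme/text-embed-3"
    dim: int        # embedding_dim
    revision: str   # a revision you control, e.g. "2025-06-01"

    def namespace(self, env: str, institution: str) -> str:
        # Baking the version into the namespace makes cross-version mixing impossible.
        return f"{env}/{institution}/{self.model_id.replace('/', '-')}@{self.revision}"
```

Upgrading the model then means constructing a new `EmbeddingVersion`, indexing into the new namespace, A/B evaluating, and cutting over, with the old namespace kept briefly for rollback.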

Watch for drift in both data and model. Data drift happens when the catalog changes style (new templates, new program rules format, new abbreviations). Model drift happens when you change embedding models, rerankers, chunking, or tokenization. A practical control is to maintain a small, fixed “golden query set” (20–50 representative questions) and run it weekly. If recall or MRR drops, you have an early warning that something in the pipeline changed.

Finally, match the length of what you embed to your chunking. If your chunks are 150–400 tokens, a general-purpose embedding model works well; if you embed entire pages, similarity becomes noisy. The model choice and chunk size are a coupled decision: optimize them together.

Section 5.2: Vector DB schema and namespace design

A vector database becomes maintainable when its schema matches how your product filters results. For course catalogs, you almost always need metadata filters such as institution, term (or academic year), program, department, campus, delivery_mode (online/in-person/hybrid), and document_type (course page, syllabus, program rule, policy). If you don’t store these as filterable fields, you will attempt to “prompt” the model into respecting constraints—and it will fail intermittently.

Start with a stable primary identifier strategy. A good pattern is a globally unique chunk_id plus a doc_id that ties chunks back to the source document. For example: {institution}:{catalog_year}:{doc_type}:{doc_slug}#{chunk_index}. Keep IDs deterministic so reprocessing the same input yields the same IDs (critical for upserts and deletes).
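A minimal sketch of that ID pattern as pure functions, so reprocessing the same input yields the same IDs (the slug rule is simplified):

```python
# Sketch of the deterministic ID pattern quoted above:
# {institution}:{catalog_year}:{doc_type}:{doc_slug}#{chunk_index}.
# Pure functions of their arguments, so IDs are stable across runs.
import re

def make_doc_id(institution, catalog_year, doc_type, title):
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return f"{institution}:{catalog_year}:{doc_type}:{slug}"

def make_chunk_id(doc_id, chunk_index):
    return f"{doc_id}#{chunk_index}"
```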

Namespace design is your lever for safe upgrades. Common approaches:

  • By institution (separate namespaces per school) to simplify data governance and reduce cross-school contamination.
  • By embedding version (separate namespaces per model) to avoid mixing vectors and to enable rollback.
  • By environment (dev/stage/prod) to prevent accidental production writes.

Many teams combine these: prod/{institution}/{embedding_version}. Inside each namespace, store metadata for filtering. Avoid the mistake of putting “term” in the namespace if you need cross-term comparisons (e.g., “what changed from 2024 to 2025?”). Store term as metadata unless you truly want hard separation.

Finally, design for scale: you will likely have tens to hundreds of thousands of chunks across catalogs, syllabi, and policies. Plan index parameters (HNSW ef_construction, M values, or IVF list counts) based on expected latency and recall requirements. Choose defaults, then tune after you benchmark with realistic traffic and filters enabled.

Section 5.3: Hybrid search (BM25 + vectors) and rerankers

Course catalogs are a textbook case for hybrid retrieval. Pure vector similarity can miss exact identifiers like “MATH 221” or policy phrases that users quote verbatim, while pure keyword search struggles with paraphrases (“writing intensive” vs “W-intensive requirement”). Hybrid search combines both: run a BM25 (or similar lexical) search and a vector search, merge candidates, then optionally rerank.

A practical workflow looks like this:

  • Apply metadata filters first (institution, term, program) to shrink the candidate space.
  • Run BM25 to capture exact matches (course codes, prerequisites, named requirements).
  • Run vector search to capture semantic matches (paraphrased queries, concept-level intent).
  • Union the top candidates (e.g., 50 lexical + 50 vector), then rerank down to the final top-k.
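One common way to union and de-duplicate the two candidate lists is reciprocal-rank fusion (RRF), which merges by rank rather than raw score, since lexical and vector scores are not directly comparable. A sketch, with each candidate list given as a ranked sequence of `chunk_id`s:

```python
# Sketch of the candidate-merge step using reciprocal-rank fusion (RRF).
# The constant 60 is the conventional RRF damping term; inputs are ranked
# chunk_id lists from the lexical and vector searches.
def merge_candidates(bm25_hits, vector_hits, limit=100):
    scores = {}
    for hits in (bm25_hits, vector_hits):
        for rank, chunk_id in enumerate(hits, start=1):
            # Duplicate IDs across lists accumulate score, so agreement wins.
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (60 + rank)
    merged = sorted(scores, key=scores.get, reverse=True)
    return merged[:limit]
```

The merged list then goes to the reranker, which handles fine-grained relevance; RRF only needs to get the right candidates into the pool.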

Rerankers are especially valuable when chunks are similar (multiple courses share “3 credits, lecture”). A cross-encoder reranker (or a lightweight LLM reranker) can read the query and each candidate chunk and produce a relevance score. Use reranking when you need precision and citations—e.g., showing the exact paragraph supporting “Students must earn a C or better.” The tradeoff is cost and latency.

Retrieval controls should be explicit and measurable:

  • top-k: pick a small final set (often 5–10) to reduce hallucination risk and keep context small.
  • score thresholds: if top results fall below a relevance threshold, return “not found” behavior rather than forcing an answer.
  • diversity controls: avoid returning 10 chunks from the same page by limiting per-doc results.

Common mistakes include reranking without caching (causing expensive repeat calls) and merging lexical/vector results without de-duplicating by doc_id. Treat hybrid retrieval as a pipeline with clear stages and instrumentation, not a single “search” call.

Section 5.4: Idempotent indexing and document-to-chunk mapping

Weekly refresh is where many RAG systems become unreliable. If your indexing job creates new IDs every run, you will accumulate duplicates, inflate cost, and degrade retrieval (“same paragraph appears 4 times”). The fix is idempotent indexing: running the pipeline twice on the same inputs should produce the same index state.

Implement idempotency with a document-to-chunk mapping and deterministic chunk IDs. Store a doc_registry table (or collection) with:

  • doc_id, source_url, doc_type, term, last_seen_at
  • content_checksum of the cleaned, normalized text (not raw HTML)
  • chunk_count and chunk_id_prefix
  • embedding_version used for the current indexed state

On each refresh, compute the cleaned content checksum. If the checksum is unchanged and the embedding version is unchanged, skip embedding and indexing entirely. If the checksum changed, regenerate chunks and embeddings and upsert by deterministic chunk_id. If the doc disappeared from the crawl (no longer present), schedule its chunks for delete based on doc_id.
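The refresh decision can be sketched as a pure function over a `doc_registry` entry; passing `clean_text=None` models a document that disappeared from the crawl, and the action strings are illustrative:

```python
# Sketch of the per-document refresh decision described above, keyed on the
# checksum of cleaned text (not raw HTML) plus the embedding version.
import hashlib

def refresh_action(registry_entry, clean_text, embedding_version):
    if clean_text is None:                      # doc vanished from the crawl
        return "delete_chunks", None
    checksum = hashlib.sha256(clean_text.encode()).hexdigest()
    if registry_entry is None:                  # never seen before
        return "index_new", checksum
    if (registry_entry["content_checksum"] == checksum
            and registry_entry["embedding_version"] == embedding_version):
        return "skip", checksum                 # nothing changed: idempotent no-op
    return "reembed_and_upsert", checksum       # content or model changed
```

Running this twice on the same inputs yields "skip" the second time, which is exactly the idempotency property the chapter asks for.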

For deletes, avoid scanning the whole vector index. Keep a secondary mapping of doc_id → list of chunk_ids (or store doc_id as metadata and use “delete by filter” if your DB supports it safely). This enables precise cleanup when program requirements are retired or URLs change.

A subtle but important control: define what constitutes a “document.” A course page might include multiple sections (description, outcomes, prerequisites). If your chunker changes, chunk boundaries shift. Deterministic IDs should be based on stable anchors (section IDs, headings, or rule numbers) when possible, not just “chunk 0, chunk 1.” This reduces churn and makes diffs meaningful across weeks.

Section 5.5: Citation-ready storage (source URL, section, checksum)

Retrieval without citations is a hallucination trap in education settings. Users need to know where a rule or requirement came from, and your system needs an audit trail when content changes. Make every chunk “citation-ready” by storing the fields needed to render a precise reference.

At minimum, store these metadata fields per chunk:

  • source_url: canonical URL after redirects and normalization
  • source_title: page title or course name/code
  • section_path: heading hierarchy like “Program Requirements → Core Courses → Math”
  • section_anchor: HTML id or generated anchor to link directly to the section
  • term_effective: academic year/term applicability
  • content_checksum: checksum of the chunk text (or of the parent doc plus offset)

The checksum matters for two reasons. First, it supports idempotent indexing and change detection. Second, it enables citation integrity: when an answer cites a chunk, you can verify that the chunk text still matches what was indexed at the time of the response. If the catalog updates mid-week, your UI can warn “source updated since this answer was generated.”

Store offsets when possible: char_start/char_end within the cleaned document, or token offsets. Offsets let you reconstruct an exact snippet and highlight it in the UI. If your original sources are PDFs or scanned syllabi, store page numbers and extraction confidence scores; low-confidence OCR text should be down-weighted in retrieval or require stricter thresholds.
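
A citation-ready chunk record can be sketched with the fields listed above. This is an illustrative shape, not a required schema; the field names follow the list in this section, and `citation_still_valid` shows the integrity check described earlier.

```python
import hashlib
from dataclasses import dataclass

@dataclass
class ChunkRecord:
    chunk_id: str
    text: str                 # the cleaned text that was embedded
    source_url: str           # canonical URL after redirects
    source_title: str
    section_path: str         # e.g. "Program Requirements > Core Courses > Math"
    section_anchor: str       # HTML id or generated anchor
    term_effective: str
    char_start: int           # offsets within the cleaned document
    char_end: int
    content_checksum: str = ""

    def __post_init__(self):
        # Checksum the exact chunk text so citations can be verified later.
        if not self.content_checksum:
            self.content_checksum = hashlib.sha256(
                self.text.encode("utf-8")).hexdigest()

def citation_still_valid(record: ChunkRecord, current_text: str) -> bool:
    """Verify the cited chunk still matches what was indexed."""
    live = hashlib.sha256(current_text.encode("utf-8")).hexdigest()
    return live == record.content_checksum
```

When `citation_still_valid` returns False, the UI can render the "source updated since this answer was generated" warning described above.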

Common mistakes include citing only the top-level page URL (users can’t find the relevant part) and failing to preserve the cleaned text used for embedding (citations drift because you display a different version than you embedded). Align what you embed, what you cite, and what you display.

Section 5.6: Retrieval evaluation (recall, MRR, nDCG) basics

You cannot improve retrieval reliably without measurement. In catalog RAG, the most useful evaluations are small, consistent, and tied to user intent. Build a labeled set of queries with expected sources—e.g., 100 questions across admissions, prerequisites, program rules, delivery modes, and credit requirements. Each query should map to one or more “relevant” chunks (doc_id + section_path).

Three baseline metrics cover most needs:

  • Recall@k: Did the relevant chunk appear in the top k retrieved candidates? This tells you if retrieval is missing key sources.
  • MRR@k (Mean Reciprocal Rank): How early does the first relevant chunk appear? This matters because most systems only pass the top few chunks to the LLM.
  • nDCG@k: Rewards correct ordering when multiple chunks are relevant (useful for program rules where several sections may apply).
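
The three metrics above can be computed with a few lines of standard-library Python. This sketch assumes binary relevance (a chunk is either relevant or not), which is usually sufficient for catalog goldens.

```python
import math

def recall_at_k(ranked, relevant, k):
    # Fraction of relevant chunks that appear in the top-k candidates.
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)

def mrr_at_k(ranked, relevant, k):
    # Reciprocal rank of the first relevant chunk (0 if none in top-k).
    for i, chunk_id in enumerate(ranked[:k], start=1):
        if chunk_id in relevant:
            return 1.0 / i
    return 0.0

def ndcg_at_k(ranked, relevant, k):
    # Binary-relevance nDCG: rewards placing relevant chunks earlier.
    dcg = sum(1.0 / math.log2(i + 1)
              for i, chunk_id in enumerate(ranked[:k], start=1)
              if chunk_id in relevant)
    ideal = sum(1.0 / math.log2(i + 1)
                for i in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0
```

Averaging these per-query scores over the labeled set gives the aggregate numbers to track week over week.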

Use evaluation to tune controls. If Recall@50 is high but MRR@10 is low, candidate generation is fine but ranking is weak; consider reranking or better hybrid weighting. If Recall drops when filters are enabled, your metadata normalization is likely inconsistent (e.g., “Online” vs “online” vs “distance”). If recall is strong but the model still hallucinates, reduce top-k, tighten score thresholds, and use citation-required prompting so the model answers only from retrieved text.

Don’t ignore latency and cost during evaluation. Track time spent in each stage: filter + vector search, BM25 search, rerank, and final context assembly. For catalog-scale traffic, you often win by caching frequent queries, caching reranker results for popular pages, and using a two-tier approach (cheap retrieval first, expensive rerank only when ambiguity is high).

Most importantly, evaluate after every change in embedding version, chunking rules, filters, or indexing settings. Retrieval is a system; small changes can cause large shifts. Metrics give you the confidence to ship improvements without breaking students’ ability to find correct, citable program information.

Chapter milestones
  • Select an embedding model and define versioning rules
  • Create a vector index with hybrid search and metadata filters
  • Implement upserts, deletes, and idempotent indexing
  • Add retrieval controls: top-k, reranking, and citations
  • Benchmark latency and cost for catalog-scale traffic
Chapter quiz

1. Why does Chapter 5 emphasize embedding model versioning rules in a course-catalog RAG pipeline?

Show answer
Correct answer: To prevent index drift where silent model changes degrade retrieval over time
Versioning guards against index drift so retrieval stays stable week after week as models or settings change.

2. A user asks: “online only, Gen Ed Area B, offered Spring.” What chapter concept is most directly required to satisfy these precise constraints?

Show answer
Correct answer: Metadata filters aligned with the vector database schema
These constraints are best enforced via a schema and metadata filters that match the catalog’s structured fields.

3. Which approach best addresses indexing correctness when running weekly refreshes so chunks don’t duplicate or disappear?

Show answer
Correct answer: Idempotent indexing using upserts/deletes and a document-to-chunk map
Idempotency plus a document-to-chunk map ensures consistent updates and reliable deletes across refreshes.

4. How do retrieval controls like top-k limits, reranking, and citations contribute to reliable retrieval in this chapter?

Show answer
Correct answer: They limit and refine results and provide citation-ready support to reduce hallucinations
Top-k and reranking control result quality, while citations ensure the retrieved text supports the system’s claims.

5. Which set of metrics does the chapter recommend to track whether retrieval quality is actually improving over time?

Show answer
Correct answer: Recall, MRR, and nDCG
Recall, MRR, and nDCG are standard retrieval metrics for measuring ranking and coverage quality.

Chapter 6: Weekly Refresh, Monitoring, and Operations

Once your course catalog is cleaned, chunked, embedded, and indexed, the work is not “done”—it becomes an operational system. Course catalogs change constantly: prerequisites are revised, delivery modes shift, tuition notes update, programs get renamed, and entire pages move. A RAG pipeline that is correct on day one can quietly drift into being misleading by week six if you do not manage refresh, monitoring, and governance. This chapter turns your pipeline into a dependable weekly service.

The main operational goal is simple: keep retrieval accurate and current without breaking production. That breaks down into five practical responsibilities: (1) detect what changed, (2) update only what needs updating (or rebuild safely when needed), (3) orchestrate runs with retries and alerts, (4) monitor freshness, coverage, and retrieval quality, and (5) document runbooks and governance so the system survives staff turnover and vendor/source changes.

Engineering judgment matters here because the “right” approach depends on your sources (CMS pages, PDFs, SIS exports), your indexing scheme (single index vs multi-collection), and your risk tolerance. A university catalog may be updated weekly but can have high stakes; a bootcamp catalog might change daily but is lower risk. The patterns below aim for safe defaults: incremental updates when possible, guarded by evaluation and rollback mechanisms.

Practice note for this chapter’s milestones (designing the weekly refresh workflow and incremental detection, implementing safe re-indexing with backfills and rollbacks, adding monitoring for freshness, coverage, and retrieval quality, creating runbooks for failures, source changes, and schema updates, and shipping a production checklist with handoff documentation): for each one, document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Sections in this chapter
Section 6.1: Change detection (diffing, checksums, sitemaps, feeds)

Weekly refresh starts with knowing what changed. The most common mistake is “re-scrape everything” with no visibility into deltas, which increases cost, introduces unnecessary churn in embeddings, and makes debugging harder. Instead, design a change-detection layer that outputs a stable list of impacted source documents (pages, PDFs, JSON rows) and the type of change (new, modified, deleted, moved).

Use layered detection so you are resilient to imperfect signals:

  • Sitemaps and RSS/Atom feeds: If the catalog site provides a sitemap with lastmod or an update feed, treat it as a hint—not ground truth. Some CMSs fail to update lastmod for template edits.
  • HTTP headers: ETag and Last-Modified can be powerful for HTML pages and PDFs, but proxies sometimes strip or misreport them. Record these values but verify when critical.
  • Checksums: Fetch content and compute a canonical checksum (e.g., SHA-256) after normalization (remove nav/footer, collapse whitespace, standardize dates when appropriate). This reduces false positives from cosmetic changes.
  • Diffing: Store a previous canonical text snapshot and diff it to classify changes. A small edit in “contact email” should not trigger re-embedding of an entire 20-page policy PDF if your chunking can isolate the affected portion.
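
The diffing layer can be sketched with the standard library’s difflib. The 5% threshold below is an illustrative default, not a recommendation; tune it against your own sources. (Note that `SequenceMatcher` enables "autojunk" heuristics for strings of 200+ characters, so production code should compare at paragraph or section granularity.)

```python
import difflib

def changed_ratio(prev: str, curr: str) -> float:
    """Similarity-based change size: 0.0 = identical, 1.0 = fully rewritten."""
    return 1.0 - difflib.SequenceMatcher(None, prev, curr).ratio()

def classify_edit(prev: str, curr: str, minor_threshold: float = 0.05) -> str:
    """A tiny edit (e.g. a contact email) should not force a full re-embed."""
    ratio = changed_ratio(prev, curr)
    if ratio == 0.0:
        return "unchanged"
    return "minor" if ratio < minor_threshold else "major"
```

Running `classify_edit` per section (rather than per document) is what lets a one-line contact change avoid re-embedding a 20-page PDF.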

For course catalogs, canonicalization is essential. If your extraction includes a “Last updated” timestamp inside the page body, your checksum will change every week even when the course content didn’t. Build a “noise removal” step that strips known volatile regions (banner alerts, rotating testimonials, timestamps, cookie prompts). Keep a log that explains what was removed—otherwise you may accidentally remove meaningful policy notes.

Output of this section should be a durable change manifest, for example: {source_id, url, content_hash, prior_hash, change_type, detected_at}. Downstream steps (chunking, embedding, indexing) should consume this manifest, not re-derive changes independently.
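
The manifest shape above can be sketched as follows. The noise-stripping pattern is an example only (extend it for your own volatile regions), and deletes are detected separately, by a source being absent from the crawl, so only new/unchanged/modified are shown here.

```python
import hashlib
import re
from datetime import datetime, timezone

# Volatile region stripped before hashing -- an example pattern only.
TIMESTAMP_RE = re.compile(r"Last updated:.*$", re.MULTILINE)

def canonicalize(text: str) -> str:
    text = TIMESTAMP_RE.sub("", text)         # drop volatile "Last updated" lines
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace

def content_hash(text: str) -> str:
    # SHA-256 of the canonical text: cosmetic edits do not change the hash.
    return hashlib.sha256(canonicalize(text).encode("utf-8")).hexdigest()

def manifest_entry(source_id: str, url: str, text: str, prior_hash):
    """One row of the change manifest consumed by downstream steps."""
    new_hash = content_hash(text)
    if prior_hash is None:
        change = "new"
    elif prior_hash == new_hash:
        change = "unchanged"
    else:
        change = "modified"
    return {"source_id": source_id, "url": url,
            "content_hash": new_hash, "prior_hash": prior_hash,
            "change_type": change,
            "detected_at": datetime.now(timezone.utc).isoformat()}
```

Because the timestamp is stripped before hashing, a page whose only change is its "Last updated" line correctly classifies as unchanged.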

Section 6.2: Incremental vs full rebuild strategies

With a change manifest in hand, decide whether to update incrementally or do a full rebuild. Incremental updates are the default for weekly refresh because they are cheaper, faster, and reduce embedding churn. But full rebuilds are sometimes safer—especially after schema changes, chunking strategy updates, or extraction fixes that would otherwise leave your index in an inconsistent mixed state.

A practical decision rule:

  • Incremental update when only a minority of sources changed, and your chunking boundaries are stable (e.g., course pages with predictable sections). Update affected chunks only, delete removed chunks, and upsert new ones.
  • Full rebuild when you changed normalization rules, chunking parameters, metadata schema, embedding model version, or you suspect widespread extraction errors (e.g., the CMS changed markup site-wide).

Implement incremental safely by making your chunk identifiers deterministic. A common pattern is chunk_id = hash(source_id + chunk_type + chunk_start_offset + chunk_end_offset) or a stable semantic anchor like course_code + term + section_heading. If chunk IDs change every run, you cannot reliably delete or update; your index will accumulate duplicates and retrieval will degrade.
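
Both ID patterns named above can be sketched in a few lines; the exact fields you hash are a design choice for your schema.

```python
import hashlib

def chunk_id(source_id: str, chunk_type: str, start: int, end: int) -> str:
    """Deterministic ID from stable inputs: same chunk -> same ID every run."""
    key = f"{source_id}|{chunk_type}|{start}|{end}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]

def semantic_chunk_id(course_code: str, term: str, heading: str) -> str:
    """Stable semantic anchor: survives small offset shifts in the page."""
    slug = heading.lower().replace(" ", "-")
    return f"{course_code}:{term}:{slug}"
```

The semantic variant is preferable when headings are stable, because re-running the chunker after minor edits yields the same IDs even though character offsets moved.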

Safe re-indexing often means using versioned collections or shadow indexes. For example, write to catalog_v2026_03_25 during a rebuild, run evaluation and sanity checks, then atomically switch the production alias from catalog_current to the new version. Keep the previous version for a rollback window (e.g., 2–4 weeks) and document the rollback procedure. For incremental runs, you can still use versioning by batching updates into a “delta” collection and periodically compacting into a clean full rebuild.

Backfills are another operational need: you may add a new source (department pages) or fix extraction for PDFs. Treat backfills like controlled full rebuilds for a subset of sources: run them in a separate job, validate, then merge. Avoid mixing backfills with the routine weekly refresh until you can measure the impact on coverage and retrieval quality.

Section 6.3: Orchestration (schedules, retries, alerting)

Orchestration turns your pipeline into a service. A weekly refresh workflow should be scheduled, idempotent, and observable. Whether you use Airflow, Dagster, Prefect, GitHub Actions, or a managed ETL tool, the design principles are the same: clear task boundaries, retries for transient failures, and alerts that map to actionable runbooks.

Start by decomposing the refresh into stages aligned with your RAG pipeline:

  • Discover: crawl sitemaps/feeds, list candidates
  • Fetch & extract: download pages/PDFs, parse text
  • Normalize: clean, standardize metadata, canonicalize text
  • Detect changes: compute checksums/diffs, produce manifest
  • Chunk: generate chunk records with stable IDs
  • Embed: create embeddings, batch requests, handle rate limits
  • Index: upsert/delete in vector DB, maintain filters
  • Validate: run eval set, coverage checks, smoke queries
  • Publish: switch alias or finalize delta, notify stakeholders

Idempotency is crucial: if the job dies halfway through embedding, rerunning should not create duplicates. Use a run identifier and write intermediate outputs (manifest, chunk table) to durable storage. Then embedding and indexing can be “resume-able” by selecting only records missing embeddings or missing index confirmation.

Retries should be selective. Retry network and rate-limit failures with exponential backoff; do not blindly retry deterministic parse errors (e.g., malformed PDF) without routing them to a quarantine queue. Alerting should be layered: (1) a “run failed” page for critical failures that stop publishing, (2) warnings for partial coverage drops (e.g., 5% fewer pages indexed), and (3) informational alerts for cost anomalies (embedding spend spike).
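
Selective retry with backoff can be sketched as below. Treating `TimeoutError`/`ConnectionError` as transient is an assumption for illustration; map your HTTP client’s and embedding API’s actual exception types, and route everything else to a quarantine queue instead of looping.

```python
import random
import time

# Assumed transient failures -- adapt to your client libraries' exceptions.
TRANSIENT = (TimeoutError, ConnectionError)

def with_retries(fn, *args, max_attempts=4, base_delay=1.0, quarantine=None):
    """Retry transient failures with exponential backoff; quarantine the rest."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn(*args)
        except TRANSIENT:
            if attempt == max_attempts:
                raise
            # Exponential backoff with jitter: ~1s, ~2s, ~4s between attempts.
            time.sleep(base_delay * 2 ** (attempt - 1) * (0.5 + random.random()))
        except Exception as exc:
            # Deterministic failure (e.g. malformed PDF): don't loop.
            if quarantine is not None:
                quarantine.append((args, repr(exc)))
            return None
```

The quarantine list feeds the warning-level alerts described above, so a handful of broken PDFs never blocks the whole refresh.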

Finally, add operational guardrails: a “circuit breaker” that prevents publishing if evaluation metrics regress beyond a threshold or if freshness falls behind your SLA (e.g., more than 10% of courses older than 14 days). This is how you prevent silent degradation.

Section 6.4: Observability metrics (freshness, errors, latency, cost)

You cannot operate what you cannot measure. For RAG in course catalogs, observability is not just system uptime; it includes data freshness, index coverage, and retrieval behavior. A common failure mode is a pipeline that “succeeds” technically but produces an index missing key departments due to a source change. Your dashboards should catch that before users do.

Track four categories of metrics:

  • Freshness: age of indexed content by source type (course pages vs program rules), percent of documents/chunks updated within the expected window, and “staleness budget” remaining before breach.
  • Errors: extraction failures by parser, HTTP error rates by domain, embedding API failures, and indexing upsert/delete failures. Include top failing URLs and error samples.
  • Latency: time per stage and end-to-end run time. Also track time-to-availability after a change is detected (important for weekly/term deadlines).
  • Cost: pages fetched, tokens embedded, embedding API spend, vector DB storage growth, and query-time costs if applicable.

Coverage is the metric that ties data operations to educational outcomes. Maintain expected counts: number of courses by campus/term, number of program pages, number of policy documents. Compare current index counts to a baseline and alert on sudden drops or spikes. Spikes can indicate duplication (chunk ID instability) or a template change that caused chunk explosion.
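
A coverage check against baseline counts can be sketched as below; the 5% drop and 25% spike thresholds are illustrative defaults, not recommendations.

```python
def coverage_alerts(baseline: dict, current: dict,
                    drop_pct: float = 5.0, spike_pct: float = 25.0) -> list:
    """Compare current index counts to a baseline and flag anomalies."""
    alerts = []
    for key, expected in baseline.items():
        if expected == 0:
            continue  # no meaningful percentage against a zero baseline
        actual = current.get(key, 0)
        delta_pct = 100.0 * (actual - expected) / expected
        if delta_pct <= -drop_pct:
            alerts.append(f"DROP {key}: {expected} -> {actual} ({delta_pct:.1f}%)")
        elif delta_pct >= spike_pct:
            # Spikes often mean duplication from unstable chunk IDs.
            alerts.append(f"SPIKE {key}: {expected} -> {actual} (+{delta_pct:.1f}%)")
    return alerts
```

Keying the baseline by campus/term (e.g. `"courses:main:2026SP"`) is what lets this catch a single missing department before users do.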

Include retrieval quality proxies in ops dashboards, even if you also run formal evaluations. Examples: percentage of user queries with zero retrieved results, average top-k similarity score, and distribution of metadata filters used (e.g., campus=online). Sudden shifts often indicate metadata mapping failures.

Instrument your pipeline with structured logs that include source_id, run_id, schema_version, and embedding_model. When someone asks, “Why did this course answer cite the wrong prerequisite?” you need to trace back to the specific chunk and the run that produced it.

Section 6.5: Regression testing with evaluation sets

Weekly refresh changes your index; regression testing ensures it does not change your product behavior in harmful ways. The key is to treat retrieval like any other component: you need a repeatable evaluation set, metrics, and pass/fail thresholds. Many teams only test generation outputs, but in RAG the biggest leverage is testing retrieval first.

Build a small but representative evaluation set from real catalog questions and edge cases:

  • Course facts: “What are the prerequisites for CS-201?” “How many credits is BIO 110?”
  • Program rules: “Can I double count electives?” “Minimum GPA to progress?”
  • Policy constraints: “Are online students eligible for internships?”
  • Ambiguity: cross-listed courses, renamed programs, multiple campuses

For each query, store expected documents or chunk IDs (goldens), plus acceptable alternates. Then compute retrieval metrics such as Recall@k (did we retrieve the right source anywhere in top k?) and MRR (how high did it appear?). Track these metrics per slice: by source type, department, campus, and document format (HTML vs PDF). A weekly run that passes overall recall can still fail for a single campus if that subset was missed.

Use regression tests to validate schema and filters too. For example, if your system relies on metadata filters like campus, term, credential_level, add tests that confirm filtered retrieval returns results and does not leak cross-campus content. This directly reduces hallucinations: when retrieval is empty or off-scope, the generator tends to “fill in” unless you enforce guardrails like “answer only from citations.”

Operationally, wire the evaluation step into orchestration: run it after indexing, before publishing. If metrics regress beyond a tolerance (e.g., Recall@5 drops by 5 points, or key queries fail), block the alias switch and trigger an investigation. This is the backbone of safe re-indexing.
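
The publish gate can be sketched as a pure function the orchestrator calls after evaluation; the metric names and 5-point tolerances below are example values, not prescribed thresholds.

```python
def should_publish(baseline: dict, current: dict, max_drop: dict = None):
    """Block the alias switch if any tracked metric regresses past tolerance."""
    # Example tolerances: a 0.05 absolute drop in either metric blocks publish.
    max_drop = max_drop or {"recall_at_5": 0.05, "mrr_at_10": 0.05}
    failures = []
    for metric, tolerance in max_drop.items():
        drop = baseline[metric] - current[metric]
        if drop > tolerance:
            failures.append(f"{metric} regressed by {drop:.3f} (> {tolerance})")
    return (len(failures) == 0, failures)
```

Returning the failure reasons (rather than a bare boolean) gives the alert, and the runbook it links to, something concrete to report.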

Section 6.6: Governance: approvals, versioning, and deprecation

Governance is how you keep a weekly refresh system aligned with institutional trust. In education, incorrect course information can cause real harm (missed prerequisites, delayed graduation). Governance does not mean bureaucracy; it means having explicit, lightweight controls on what changes, who approves it, and how you roll it back.

Start with versioning everywhere it matters:

  • Schema version: when you add/rename metadata fields (e.g., delivery_mode), increment a schema version and maintain backwards compatibility in the index filters until clients are updated.
  • Pipeline version: tag releases of extraction/normalization/chunking code, and record the version in each run’s metadata.
  • Embedding model version: changing models can shift similarity behavior; treat it like a major change requiring a rebuild and expanded evaluation.

Approvals should focus on high-risk changes: new chunking strategy, new sources, new filter fields, and deprecation of old indexes. A practical workflow is: engineer proposes change with expected impact, runs backfill/shadow index, shares evaluation results, then a designated owner (data lead or product lead) approves publishing. Keep the approval artifact (ticket, pull request) linked to the run_id.

Deprecation is often neglected. If you keep old collections forever, cost grows and teams accidentally query the wrong index. Define a retention policy: keep the last N versions or the last X weeks, then delete after confirming no clients depend on them. Document an “index alias contract” so consumers only use catalog_current (or a similar stable handle), never a raw version name.

Finally, write runbooks and handoff documentation as part of governance. Each alert should link to a runbook: what failed, likely causes (source HTML changed, PDF parser bug, embedding rate limits), how to diagnose (logs, sample URLs), and the safe actions (retry stage, quarantine documents, rollback alias). This is what makes the system operable by the next person, not just the builder.

Chapter milestones
  • Design the weekly refresh workflow and incremental detection
  • Implement safe re-indexing with backfills and rollbacks
  • Add monitoring for freshness, coverage, and retrieval quality
  • Create runbooks for failures, source changes, and schema updates
  • Ship a production checklist and handoff documentation
Chapter quiz

1. Why does Chapter 6 emphasize that a RAG pipeline is not “done” after initial cleaning, chunking, embedding, and indexing?

Show answer
Correct answer: Because course catalogs change and the pipeline can drift into returning misleading results without ongoing operations
The chapter stresses catalogs change constantly, so without refresh, monitoring, and governance, retrieval can become outdated or misleading.

2. What is the main operational goal described for running the pipeline as a weekly service?

Show answer
Correct answer: Keep retrieval accurate and current without breaking production
The chapter states the core goal is to keep retrieval accurate/current while maintaining production stability.

3. Which set best matches the chapter’s five operational responsibilities?

Show answer
Correct answer: Detect changes; update only what needs updating or rebuild safely; orchestrate runs with retries/alerts; monitor freshness/coverage/retrieval quality; document runbooks/governance
The chapter explicitly lists these five responsibilities as the operational breakdown.

4. According to the chapter, what factors most influence the “right” operational approach for refresh and re-indexing?

Show answer
Correct answer: Source types, indexing scheme, and risk tolerance
It depends on sources (CMS/PDF/SIS), indexing scheme (single vs multi-collection), and risk tolerance.

5. What does the chapter present as a safe default pattern for keeping the system current?

Show answer
Correct answer: Prefer incremental updates when possible, guarded by evaluation and rollback mechanisms
The chapter recommends incremental updates when possible, with evaluation and rollback to manage risk.