AI In EdTech & Career Growth — Intermediate
Turn messy catalogs into a weekly-refreshed RAG index that works.
Course catalogs are messy by nature: multiple sources, frequent updates, inconsistent formatting, and critical edge cases like prerequisites and program rules. If you want accurate AI answers for learners, advisors, or sales teams, you need more than a chatbot—you need a reliable RAG (Retrieval-Augmented Generation) data pipeline that keeps your index clean, structured, and fresh.
This book-style course walks you through building a practical, production-ready pipeline for course catalogs: ingest, normalize, clean, chunk, embed, index, and refresh on a weekly cadence. The goal is not just “it works on a demo,” but “it keeps working after the catalog changes.”
By the end, you will have a blueprint you can implement in your stack (Python + your choice of vector database) to power search, advising, and learner-support experiences.
Chapter 1 establishes the data contract: what counts as a “document,” how IDs work, what freshness means, and what success looks like. Chapter 2 brings sources into a canonical format while preserving lineage. Chapter 3 makes the data trustworthy through cleaning, deduplication, enrichment, and audits.
Once your catalog is consistent, Chapter 4 focuses on chunking—where most RAG systems succeed or fail. You will design chunk boundaries that match real user questions and support precise retrieval. Chapter 5 turns those chunks into an index: embeddings, hybrid retrieval, reranking, idempotent upserts, and citation-ready storage. Finally, Chapter 6 operationalizes everything with weekly refresh, change detection, monitoring, regression tests, and governance.
This course is designed for EdTech and education-adjacent teams—product engineers, data engineers, ML engineers, and technical product managers—who need a dependable approach to catalog RAG. If you are responsible for search, advising, enrollment funnels, or learner support, this pipeline will directly improve answer quality and user trust.
If you want to ship a RAG system that stays accurate as your catalog evolves, this course gives you the structure, milestones, and operational thinking to do it right. Register for free to begin, or browse all courses to compare related learning paths.
Senior Machine Learning Engineer, Retrieval & Data Platforms
Sofia Chen designs retrieval systems and data pipelines for education and marketplace platforms. She has led production RAG deployments covering ingestion, indexing, evaluation, and monitoring. Her focus is building pragmatic systems that stay accurate as catalogs change.
A course catalog looks simple until you try to answer real student and advisor questions reliably. “Can I take CS 302 next term?” is not just a search problem; it mixes prerequisites, term availability, campus rules, transfer credit policies, and sometimes exceptions. Retrieval-Augmented Generation (RAG) can help, but only if the pipeline is designed around the catalog’s true user journeys and a strict data contract that keeps documents stable, fresh, and governable.
This chapter sets the foundation for the rest of the course: you will pick the highest-value journeys, define what a “retrieval unit” is in your system (course vs. program rule vs. policy vs. FAQ), and draft a contract for fields, identifiers, and freshness rules. You will also decide what “good” looks like—acceptance criteria and success metrics for answers—so your team can evaluate changes without subjective arguments. Finally, you’ll establish privacy, licensing, and source-of-truth governance so the system stays compliant and trustworthy.
Engineering judgment matters most at this stage. Many catalog RAG efforts fail not because embeddings are “bad,” but because the upstream data contract is fuzzy: duplicate courses, ambiguous IDs, mismatched campuses, or PDFs that drift out of date. The work in Chapter 1 is deliberately operational: you are designing the boundaries of your system so it can be cleaned, chunked, indexed, refreshed, and evaluated repeatably.
In the sections that follow, you’ll translate “catalog” into an explicit set of documents, IDs, and rules. This becomes the contract your ingestion and indexing pipeline must satisfy every run—especially during weekly refreshes and incremental updates later in the course.
Practice note for Select high-value user journeys (search, advising, prerequisites): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Draft the catalog data contract (fields, IDs, freshness rules): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Choose retrieval unit types (course, program, policy, FAQ): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Set acceptance criteria and success metrics for answers: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Plan privacy, licensing, and source-of-truth governance: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Start by selecting high-value user journeys, because they determine your RAG pattern. In education, three journeys tend to dominate: (1) search and discovery (“Find evening data analytics courses under $X”), (2) advising and planning (“What should I take next to finish the minor?”), and (3) prerequisites and eligibility (“Can I enroll if I have MATH 101 from a community college?”). Each journey expects different evidence and different tolerance for ambiguity.
For search/discovery, RAG often behaves like semantic search with filtering: retrieve courses using embeddings, then apply metadata filters (campus, modality, level, term) and re-rank by structured signals (credit hours, department). For advising, you need multi-document retrieval where answers cite a course page plus a program rule plus a policy snippet. For prerequisites, treat it as a high-precision compliance query: the assistant should be conservative, cite the prerequisite rule verbatim, and fall back to “I can’t confirm” when sources conflict.
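The "retrieve, then filter, then rank" pattern above can be sketched in a few lines. This is a toy illustration, not a specific vector database API: the two-dimensional vectors, document shapes, and the `retrieve` helper are all assumptions for the example; a real system would push the metadata filters down into the vector store's query.

```python
from math import sqrt

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_vec, docs, filters, k=3):
    """Apply hard metadata filters first, then rank survivors semantically."""
    candidates = [
        d for d in docs
        if all(d["meta"].get(key) == value for key, value in filters.items())
    ]
    candidates.sort(key=lambda d: cosine(query_vec, d["vec"]), reverse=True)
    return candidates[:k]

docs = [
    {"id": "CS-302", "vec": [0.9, 0.1], "meta": {"campus": "main", "level": "ug"}},
    {"id": "CS-501", "vec": [0.8, 0.2], "meta": {"campus": "main", "level": "grad"}},
    {"id": "DA-210", "vec": [0.1, 0.9], "meta": {"campus": "downtown", "level": "ug"}},
]
hits = retrieve([1.0, 0.0], docs, {"campus": "main", "level": "ug"})
# Only CS-302 survives the campus + level filters, however similar CS-501 is.
```

The design point is that "graduate-only" is enforced as a hard constraint, not left to embedding similarity.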
Choose retrieval unit types explicitly. A common catalog set is: Course (title, description, credits, prereq text, learning outcomes), Program (degree requirements, electives, residency rules), Policy (grading, repeats, transfer credit, academic standing), and FAQ (registrar explanations that students actually understand). Don’t overload “course documents” with program rules; mixing them harms retrieval because embeddings learn broad topical similarity instead of answering the specific question.
Define acceptance criteria early. For example: “Every answer must cite at least one authoritative source; prerequisite answers must include the exact prerequisite statement; if the requested term is unknown, the assistant must explicitly state that availability is not guaranteed.” Tie these to measurable success metrics like recall@k for known queries, MRR for search ranking quality, and “citation coverage” (percentage of responses with citations to the correct retrieval units). This is how you reduce hallucinations: you constrain the system with retrieval patterns and measurable requirements instead of hoping prompting will compensate.
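The metrics named above are straightforward to compute once you have a labeled query set. A minimal sketch (the query/result shapes are illustrative):

```python
def recall_at_k(relevant, retrieved, k):
    """Fraction of known-relevant doc IDs that appear in the top-k results."""
    hits = sum(1 for doc_id in relevant if doc_id in retrieved[:k])
    return hits / len(relevant) if relevant else 0.0

def mrr(queries):
    """Mean reciprocal rank over (relevant_set, ranked_results) pairs."""
    total = 0.0
    for relevant, ranked in queries:
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(queries) if queries else 0.0

queries = [
    ({"CS-302"}, ["CS-302", "CS-101"]),      # first hit at rank 1
    ({"MATH-101"}, ["CS-101", "MATH-101"]),  # first hit at rank 2
]
print(mrr(queries))  # (1 + 0.5) / 2 = 0.75
```

Citation coverage is just the share of responses whose cited IDs land in the expected retrieval-unit set, computable the same way.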
Before drafting a data contract, inventory your sources and decide which is the source of truth for each field. Course catalogs are often split across systems: a public website (marketing-friendly text), a SIS (official codes, credits, prereqs), an LMS (syllabi and schedules), a CMS (policy pages), and a long tail of PDFs (handbooks, archived catalogs, articulation agreements). RAG will faithfully reproduce whatever you index—so governance starts with knowing what you are ingesting.
Make a simple table for each source: system, data type, update cadence, export method (API, database view, scraping, SFTP), license/permissions, PII risk, and authoritativeness. For example, the SIS may be authoritative for credits and requisites, while the web CMS is authoritative for program narrative descriptions. PDFs are the most dangerous: they are often outdated, hard to parse, and duplicated across directories. If you include PDFs, record versioning signals (publication date, effective term) and plan to suppress older versions at retrieval time.
Practical workflow: start with web + SIS + CMS policy pages; add syllabi later only if you can control privacy and versioning. Syllabi can improve answers about workload and topics, but they also introduce instructor names, email addresses, office hours, and assessment details that may change frequently. If you cannot guarantee safe redaction and term scoping, keep syllabi out of the index and instead link to them as external references.
Common mistake: treating “the catalog” as a single dataset. In reality, you have competing truth claims. Resolve conflicts with explicit precedence rules (e.g., “If SIS credits differ from web credits, use SIS”) and embed that rule in your pipeline so it is consistently applied. This is a core part of planning licensing and source-of-truth governance: your RAG system must be able to explain where information came from and why that source was chosen.
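Field-level precedence can be encoded directly so it is applied the same way on every run. A sketch under stated assumptions: the source names (`sis`, `web`) and field choices are illustrative, and a real pipeline would load this table from the data contract rather than hard-coding it.

```python
# Earlier sources in each list win for that field.
PRECEDENCE = {"credits": ["sis", "web"], "description": ["web", "sis"]}

def resolve(records):
    """records: {source_name: {field: value}} -> (canonical record, provenance).
    Provenance records which source supplied each field, for audits and citations."""
    canonical, provenance = {}, {}
    for fld, order in PRECEDENCE.items():
        for source in order:
            value = records.get(source, {}).get(fld)
            if value is not None:
                canonical[fld] = value
                provenance[fld] = source
                break
    return canonical, provenance

merged, prov = resolve({
    "sis": {"credits": 4, "description": None},
    "web": {"credits": 3, "description": "Intro to algorithms."},
})
# credits comes from the SIS (precedence), the narrative from the web CMS.
```

Keeping the provenance map alongside the canonical record is what lets the assistant later answer “where did this number come from?”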
A RAG index is only as stable as its identifiers. If the same course appears with three slightly different titles across systems, your embeddings will cluster them, but retrieval will produce inconsistent citations and your weekly refresh may create duplicates. Your data contract must define canonical identifiers and deterministic deduplication keys for each retrieval unit type.
For courses, define a canonical course_id that is stable across years and systems. Many institutions have a SIS internal ID plus a human-readable code (e.g., “CS-302”). Use both: store the SIS ID as the primary key when available, and store the display code as a searchable attribute. If your institution reuses course codes over time, include an “effective_start_term” and “effective_end_term” in the identity or at least in the versioning fields so old and new definitions do not collide.
For programs and policies, IDs are often missing. Create them deterministically using normalized paths: for example, program:{campus}:{slug} or policy:{department}:{url_path}. Your dedup key should be reproducible from raw inputs, not assigned manually during ingestion. That way, incremental updates can upsert correctly and safe re-indexing becomes possible.
Dedup strategy should combine structural keys (IDs, URLs, SIS keys) with content fingerprints (hash of normalized text) to detect when two sources represent the same entity. A practical approach is: (1) choose the best canonical record by precedence rules, (2) attach aliases for other identifiers (old codes, cross-listed codes), and (3) keep a source_map field listing all contributing sources with timestamps. This supports auditing, troubleshooting, and citations (“This prerequisite comes from SIS rule text, updated on…”).
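Deterministic IDs and content fingerprints are both a few lines of hashing. A minimal sketch (the ID scheme `program:{campus}:{slug}` follows the pattern above; the normalization rules are an assumption you should match to your own contract):

```python
import hashlib
import re

def canonical_id(kind, *parts):
    """Deterministic ID from normalized parts, e.g. program:{campus}:{slug}.
    Reproducible from raw inputs, so upserts hit the same record every run."""
    slug = ":".join(re.sub(r"[^a-z0-9]+", "-", p.lower()).strip("-") for p in parts)
    return f"{kind}:{slug}"

def content_fingerprint(text):
    """Hash of whitespace-normalized text; flags two sources describing
    the same entity even when formatting differs."""
    normalized = " ".join(text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

pid = canonical_id("program", "Downtown", "/programs/Data-Analytics/")
# -> "program:downtown:programs-data-analytics"
same = content_fingerprint("CS 302  Algorithms") == content_fingerprint("cs 302 algorithms")
# -> True: case and spacing differences do not change the fingerprint
```

Structural keys decide identity; fingerprints only flag candidate duplicates for the precedence rules to resolve.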
Common mistake: using the URL as the only ID. URLs change during website redesigns; your index will fragment, and metrics will degrade silently. Treat URLs as attributes, not identity. Your contract should specify which IDs are immutable, which are versioned, and how to handle mergers (e.g., two departments cross-listing a course). This design choice pays off later when you implement weekly refresh with incremental updates.
Metadata is not optional in catalog RAG; it is how you prevent plausible but wrong answers. Your retrieval should not only be semantic—it should be constrained by filters that reflect real academic structure. The minimum metadata set typically includes: campus, modality (in-person, online, hybrid), term/effective period, and level (undergraduate, graduate, continuing education). Add department/school, credit range, and language if relevant.
Draft these fields as part of the catalog data contract with explicit allowed values. For example, campus should be an enum (“main”, “downtown”, “virtual”), not free text (“Main Campus”, “main campus”, “MAIN”). Modality should be derived consistently: if the SIS uses codes, map them to your canonical set. Term is tricky: distinguish between effective_term (the rules that govern the course description and prerequisites) and offered_terms (when sections are typically scheduled). Many catalogs conflate them; your pipeline should not.
Use metadata to support the user journeys from Section 1.1. For search, metadata enables faceting and fast narrowing. For advising, it allows you to exclude irrelevant campuses or levels. For prerequisites, it prevents mixing rules across terms (“In 2022, the prerequisite was…”) and reduces hallucinations by ensuring retrieved chunks come from the correct effective period.
Common mistake: stuffing everything into the embedding text and hoping similarity search will “figure it out.” Embeddings are good at topical similarity but weak at strict constraints like “graduate-only” or “available this fall.” Put constraints into metadata fields and enforce them at query time (vector search + filters). Also define metadata for retrieval unit type so your index can retrieve a policy chunk when the question is about grading, not the course description.
Practical outcome: by the end of this section you should have a metadata dictionary in your contract: field name, type, allowed values, derivation rule, and whether it is filterable in the vector database. This becomes the bridge between raw catalog sources and reliable retrieval.
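One entry of such a metadata dictionary, with a derivation rule attached, might look like this. The alias table and field set are illustrative assumptions, not a standard schema:

```python
# Free-text campus variants mapped to the canonical enum.
CAMPUS_ALIASES = {
    "main campus": "main", "main": "main",
    "downtown": "downtown",
    "online": "virtual", "virtual": "virtual",
}

def derive_campus(raw):
    """Derivation rule: normalize free text to the enum, or flag as unknown
    so validation can quarantine the record instead of indexing bad metadata."""
    return CAMPUS_ALIASES.get(raw.strip().lower(), "unknown")

METADATA_DICTIONARY = {
    "campus": {"type": "enum", "allowed": ["main", "downtown", "virtual"],
               "derive": derive_campus, "filterable": True},
    "modality": {"type": "enum", "allowed": ["in-person", "online", "hybrid"],
                 "filterable": True},
}

print(derive_campus("MAIN Campus"))  # main
```

The `filterable` flag records which fields must be indexed as filters in the vector database, not just stored.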
Freshness is where catalog RAG becomes operational. Students will ask questions that depend on the current term, newly approved curriculum changes, or updated policies. Your pipeline needs explicit freshness rules and a service-level agreement (SLA) so everyone knows when the assistant is allowed to answer confidently.
Start by defining “freshness” per retrieval unit type. Course descriptions might change once or twice a year; section schedules may change daily during registration; policies can change mid-year; FAQs might change whenever staff update the website. If your assistant is aimed at catalog rules (not real-time scheduling), you may intentionally exclude schedule data and set the SLA accordingly (“rules updated weekly; schedules not included”). Be clear in the contract about scope: it is better to be reliably incomplete than inconsistently current.
In the data contract, include fields like source_updated_at, ingested_at, effective_start_term, and expires_at. Then define thresholds: e.g., “Policies must be ingested within 7 days of source update,” “Course pages within 14 days,” “If source_updated_at is unknown for a PDF, treat it as low-trust and require an effective term in the text to be retrievable.”
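A staleness check against per-unit-type SLAs is then a small, testable function. The day thresholds below mirror the examples above but are assumptions to tune in your own contract:

```python
from datetime import datetime, timedelta, timezone

# Illustrative per-unit-type SLAs (days from source update to ingestion).
SLA_DAYS = {"policy": 7, "course": 14, "program": 14}

def is_stale(doc_type, source_updated_at, ingested_at):
    """True if ingestion lagged the source update beyond the unit-type SLA.
    Unknown types fall back to the strictest (7-day) threshold."""
    sla = timedelta(days=SLA_DAYS.get(doc_type, 7))
    return (ingested_at - source_updated_at) > sla

updated = datetime(2025, 1, 1, tzinfo=timezone.utc)
print(is_stale("policy", updated, updated + timedelta(days=10)))  # True
print(is_stale("course", updated, updated + timedelta(days=10)))  # False
```

The same predicate can run at query time over retrieved chunks to trigger the “may be outdated” warning.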
Acceptance criteria should include freshness behaviors. For example: “If the retrieved sources are older than the SLA, the answer must warn that information may be outdated and provide a link to the authoritative source.” This is a guardrail that reduces hallucinations by making staleness visible instead of silently generating a confident response.
Common mistake: a single global refresh schedule (“we reindex weekly”) without unit-level SLAs. That hides risk: a daily-changing policy page and a yearly course description do not deserve the same monitoring. Plan now for weekly refresh with incremental updates and safe re-indexing later: you will need stable IDs (Section 1.3) plus a freshness policy (this section) to decide what to upsert, what to delete, and what to keep as historical versions.
A catalog RAG system is an information system, so treat it like one: define a threat model. The main threats are not adversarial hackers; they are routine institutional risks that produce harmful answers. Three high-impact threats are PII leakage, policy errors, and outdated information.
PII leakage often enters through syllabi, LMS exports, advising notes, or “helpful” PDFs that include names, emails, student examples, or accommodation details. Your governance rule should be simple: if a source can contain student data or instructor contact details not intended for public distribution, do not index it unless you have a redaction pipeline and clear permission. Enforce this with allowlists (approved domains, repositories) and automated scanning for common PII patterns. In the contract, include a pii_risk classification per source and block high-risk inputs by default.
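The automated scan can start as simple pattern matching. This is a minimal sketch only: a production pipeline should use a vetted PII-detection library or managed service, and these two regexes are assumptions that will miss many real-world formats.

```python
import re

# Minimal patterns for two common PII categories; deliberately incomplete.
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def scan_pii(text):
    """Return the names of PII patterns found in a document's text.
    A non-empty result should block indexing by default."""
    return sorted(name for name, pat in PII_PATTERNS.items() if pat.search(text))

print(scan_pii("Office hours: email jdoe@example.edu or call 555-123-4567"))
# ['email', 'phone']
```

Pair the scan result with the per-source pii_risk classification: a hit on a high-risk source quarantines the document rather than merely logging it.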
Policy errors are more subtle. If two policy pages conflict (e.g., repeat policy updated, old PDF still online), naive retrieval may surface both and the model may synthesize an incorrect hybrid. Mitigations: (1) strict source precedence and deprecation rules, (2) effective-term metadata and filtering, and (3) answer acceptance criteria requiring direct quotation for high-stakes rules (prerequisites, graduation requirements, financial obligations). When in doubt, the assistant should refuse to generalize and instead direct the user to an official office.
Outdated info is inevitable unless you plan for it. Your freshness SLAs (Section 1.5) should drive runtime guardrails: if retrieved documents are stale, the assistant must label the answer as potentially outdated. Also consider “outdated by design” risks: archived catalogs are valuable for alumni, but harmful for current students if mixed into the same index without term filters. Separate indexes or strict term-based filtering prevents accidental retrieval of old rules.
Finally, licensing and governance matter even for “public” catalogs. Some institutions restrict reuse or have accessibility obligations. Document who owns each source, what can be stored, and how citations should be displayed. A practical outcome of this threat model is a short, enforceable checklist embedded into ingestion: approved sources only, PII scan pass, effective term present (or low-trust flag), and de-duplication complete. This is how you keep the assistant trustworthy as you scale the pipeline.
1. Why does the chapter argue that questions like “Can I take CS 302 next term?” are not just a simple search problem?
2. What is the main purpose of defining a strict catalog data contract in a RAG pipeline?
3. How does the chapter suggest you should define the system’s “retrieval unit”?
4. What is the benefit of setting acceptance criteria and success metrics early in the project?
5. According to the chapter, why do many catalog RAG efforts fail even when embeddings are not the problem?
Your RAG system is only as reliable as the documents you feed it. Course catalogs look simple—pages with titles, credits, prerequisites—but in practice they come from a messy mix of HTML pages, PDF handbooks, and structured exports from SIS/CMS tools. This chapter focuses on the first half of the pipeline: building connectors, capturing raw artifacts, and normalizing everything into a canonical document model that downstream chunking and indexing can trust.
The engineering goal is repeatability. You should be able to run ingestion weekly, detect what changed, and re-index safely without losing traceability. That means designing connectors that handle redirects and pagination, dealing with “multi-term” versions (e.g., 2024–2025 vs 2025–2026), and storing both raw and processed artifacts. It also means keeping lineage logs that answer: Where did this text come from? When did we fetch it? Which extractor and cleaning rules were applied?
In course catalogs, common mistakes start early: scraping only rendered text without URLs and timestamps; stripping too aggressively and losing tables that contain requirements; ignoring PDF layout and gluing together unrelated columns; and failing to preserve term/version metadata so retrieval returns outdated program rules. The practical outcome of this chapter is a robust ingest-and-normalize layer that produces consistent, well-labeled JSON documents, ready for chunking and embedding in later chapters.
Throughout the chapter, treat ingestion as a data product: raw artifacts are immutable, processed artifacts are reproducible, and every record has a stable identity. That mindset sets you up for incremental updates and evaluation later, because you can compare retrieval performance across catalog terms and extraction versions without guessing what changed.
Practice note for Build connectors for HTML, PDFs, and structured exports: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Normalize text and structure into a canonical document format: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Handle redirects, pagination, and multi-term versions: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Validate and store raw vs processed artifacts: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Create lineage logs for traceability: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Most catalogs offer at least two extraction paths: crawl the public website (HTML) or pull structured data via an export/API (SIS report, CMS feed, or vendor endpoint). Prefer APIs when they are stable, authenticated appropriately, and contain the fields you need (course code, title, credits, prerequisites, term, campus). APIs reduce boilerplate noise and usually provide explicit identifiers—critical for incremental refresh.
Crawling is often unavoidable because program rules, narrative descriptions, and policy pages may exist only as web content or PDFs. Crawlers must handle redirects (HTTP 301/302), canonical URLs, pagination (A–Z indexes, “load more” lists), and robots constraints. A practical pattern is a two-stage crawl: first discover URLs from index pages and sitemaps, then fetch and archive each page with response headers and a content hash.
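The archive-with-content-hash step can be sketched offline. Here the `archive` dict stands in for your blob store's metadata index, and the URL is hypothetical; a real fetcher would also persist the raw bytes and response headers alongside the digest.

```python
import hashlib

def content_hash(body: bytes) -> str:
    """Stable digest of the fetched response body."""
    return hashlib.sha256(body).hexdigest()

def detect_change(url, body, archive):
    """Compare this fetch's hash to the archived one; record new versions.
    Returns True when the page changed (or was seen for the first time)."""
    digest = content_hash(body)
    changed = archive.get(url) != digest
    if changed:
        archive[url] = digest  # in production, also write raw bytes + headers
    return changed

archive = {}
print(detect_change("https://catalog.example.edu/cs-302", b"v1", archive))  # True
print(detect_change("https://catalog.example.edu/cs-302", b"v1", archive))  # False
print(detect_change("https://catalog.example.edu/cs-302", b"v2", archive))  # True
```

This is the hook that makes weekly refresh incremental: unchanged pages skip re-extraction and re-embedding entirely.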
Engineering judgment: combine sources rather than choosing one. If an API provides course records but not the narrative around prerequisites exceptions or residency requirements, ingest both and reconcile them in your canonical model using stable keys (e.g., institution_id + course_code + term_id). Track the source priority so downstream retrieval can prefer the authoritative field (API credits) while still indexing helpful explanations (HTML).
Record catalog_year or effective_term at discovery time, not after parsing. A common mistake is discovering URLs without recording the discovery context. If a course appears under multiple pages (department listing, search results, archived version), you need to know which path produced it so you can de-duplicate and keep the correct version. Log both the discovered URL and the referrer/index page that contained it; this becomes part of your lineage story.
HTML catalog pages mix valuable text with navigation menus, footers, cookie banners, and “related links.” If you embed all of it, retrieval quality drops because the vector index learns irrelevant patterns (“Apply Now”, “Contact Us”) and pushes down the real course details. The goal is to extract the main content consistently while preserving structure like headings, lists, and tables that carry meaning.
Start with DOM parsing (e.g., Cheerio/BeautifulSoup) rather than regex. Identify the main content container using site-specific selectors when possible (main, article, vendor-specific classes). When site structure varies, add heuristics: choose the node with the highest text density, exclude nodes with repeated link lists, and remove elements by role/class (nav, footer, aside). Preserve heading levels (h1-h4) because they help later chunking maintain semantic boundaries.
Boilerplate removal is where over-cleaning can hurt. Course pages often include tables for credit breakdowns or prerequisite logic. Convert tables into readable text (row-wise “Key: Value” lines) rather than dropping them. Keep link targets when they are meaningful (e.g., “MATH 101” links) by converting anchors into “text (URL)” or capturing resolved identifiers in metadata.
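The row-wise “Key: Value” conversion mentioned above can be sketched on already-parsed rows (pure Python, so it works with whatever DOM parser produced the cells; the sample rows are illustrative):

```python
def table_to_text(rows):
    """Convert a parsed HTML table (list of rows, first row = header) into
    row-wise 'Key: Value' lines so requirement tables survive cleaning."""
    if not rows:
        return ""
    header, *body = rows
    lines = []
    for row in body:
        pairs = [f"{h}: {cell}" for h, cell in zip(header, row)]
        lines.append("; ".join(pairs))
    return "\n".join(lines)

rows = [["Course", "Credits"], ["CS 302", "4"], ["MATH 210", "3"]]
print(table_to_text(rows))
# Course: CS 302; Credits: 4
# Course: MATH 210; Credits: 3
```

The resulting text keeps each fact attached to its column label, which is what embeddings need to retrieve “how many credits is CS 302?” correctly.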
Record both requested_url and final_url for every fetch. Common mistakes include stripping all lists (losing learning outcomes), flattening headings (making chunks incoherent), and not capturing the page title reliably (some catalogs render it via JavaScript; you may need server-side rendering or a fallback to OpenGraph metadata). Treat HTML extraction as a versioned component: when selectors change, bump an extractor version and record it in lineage logs so you can explain differences in downstream retrieval.
PDF catalogs are frequently the “source of truth” for policy and program rules, but they are the most error-prone to extract. Unlike HTML, PDFs encode layout, not reading order. Two-column pages, headers/footers, hyphenated line breaks, and scanned images can produce text that looks plausible but is semantically scrambled—exactly the kind of input that causes RAG hallucinations because the model tries to reconcile contradictory fragments.
Use a layered approach. First, attempt text-based extraction (pdfminer, PyMuPDF, or similar) and capture per-page text with coordinates when possible. Then apply layout-aware reconstruction: detect columns by clustering x-coordinates, remove repeating headers/footers by matching near-identical lines across pages, and join hyphenated words only when the hyphen occurs at line end and the next line begins with a lowercase continuation.
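Two of those reconstruction steps are small enough to sketch directly: dropping lines that repeat on most pages, and rejoining end-of-line hyphenation. The 60% repeat threshold and the sample pages are assumptions for illustration.

```python
from collections import Counter

def strip_repeating_lines(pages, threshold=0.6):
    """Drop lines (headers/footers) that appear on at least `threshold`
    of pages. `pages` is a list of per-page line lists."""
    counts = Counter(line for page in pages for line in set(page))
    n = len(pages)
    return [[l for l in page if counts[l] / n < threshold] for page in pages]

def join_hyphenation(lines):
    """Join 'evalu-' + 'ated' only when the line ends with a hyphen and the
    next line begins with a lowercase continuation."""
    out = []
    for line in lines:
        if out and out[-1].endswith("-") and line[:1].islower():
            out[-1] = out[-1][:-1] + line
        else:
            out.append(line)
    return out

pages = [["2024-2025 Catalog", "Transfer credit is evalu-", "ated per policy."],
         ["2024-2025 Catalog", "Repeats require approval."]]
cleaned = strip_repeating_lines(pages)
print(join_hyphenation(cleaned[0]))
# ['Transfer credit is evaluated per policy.']
```

The lowercase-continuation rule deliberately leaves real hyphens (e.g., “2024-2025”) untouched.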
For scanned PDFs, integrate OCR (Tesseract or managed OCR) but do it selectively: OCR only pages with low text yield to control cost and error rates. Store confidence scores and mark OCR-derived text so you can quarantine or down-rank it later if it proves noisy.
Store page_number, bbox (if available), and extraction method (text vs OCR) with every extracted block. A practical check is to compute “reading coherence” metrics: average line length, percentage of repeated header lines, and frequency of isolated single words. Spikes often indicate column mixing or footer pollution. Instead of forcing every PDF through the same path, route problematic files to a quarantine or “manual rules required” bucket, and continue indexing the clean majority.
Normalization means every source—HTML, PDF, structured export—becomes the same kind of object downstream. A canonical JSON model is the contract between ingestion and chunking/indexing. Design it to support retrieval filtering (term, campus, credential), provenance (citations), and safe reprocessing (raw artifact references). Avoid embedding-specific assumptions here; focus on representing content and metadata cleanly.
A practical document model for catalogs typically includes: a stable doc_id, doc_type (course, program, policy, syllabus), source (html/pdf/api), and effective metadata (catalog year, term range). Add title, body_text, and structure elements (headings, lists, tables) either as a simplified markdown-like array or as an ordered block list. Store urls (requested/final/canonical), plus raw_artifact_uri pointing to immutable storage (S3/GCS/blob).
For RAG, citations matter. Include spans or content_blocks with offsets and source anchors: HTML CSS selectors/XPaths, or PDF page numbers. This enables later “answer with citations” behavior and makes debugging straightforward when a retrieved chunk seems wrong.
Make doc_id deterministic (e.g., a hash of institution + canonical_url + effective_term + doc_type). Record extractor_version and normalized_at timestamps. Store effective_from/effective_to or catalog_year explicitly; do not bury versioning in free text. Common mistakes are treating each web page as a unique “document” without consolidating variants (same course page across terms) or losing key fields by flattening everything into a single text blob. Your canonical model should be rich enough to support filters later (e.g., retrieve only undergraduate policies for 2025–2026) while still being simple enough that every connector can populate it.
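A canonical document model along these lines might look like the following dataclass. The field names are illustrative, not a standard, and a real model would also carry the content blocks, URLs, and source_map described above.

```python
import hashlib
from dataclasses import dataclass, field

@dataclass
class CatalogDoc:
    """Minimal canonical document; extend with blocks, URLs, and lineage."""
    institution: str
    canonical_url: str
    effective_term: str
    doc_type: str            # course | program | policy | syllabus
    title: str
    body_text: str
    source: str              # html | pdf | api
    raw_artifact_uri: str
    extractor_version: str
    doc_id: str = field(init=False)

    def __post_init__(self):
        # Deterministic ID: identical inputs always yield the same doc_id,
        # which makes weekly upserts idempotent.
        key = "|".join([self.institution, self.canonical_url,
                        self.effective_term, self.doc_type])
        self.doc_id = hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]

doc = CatalogDoc("example-u", "https://catalog.example.edu/cs-302", "2025FA",
                 "course", "CS 302", "Algorithms…", "html",
                 "s3://raw/cs-302.html", "v3")
```

Because doc_id is derived rather than assigned, re-running ingestion against unchanged sources produces the same identities, so nothing duplicates.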
Once you have extracted text, normalize it so your embeddings reflect meaning rather than encoding quirks. Catalogs often contain smart quotes, non-breaking spaces, weird bullet characters, and department abbreviations that vary by source. Normalize too little and you get duplicate near-identical chunks; normalize too aggressively and you erase distinctions (e.g., “C++” becomes “C”). The goal is consistent, reversible-enough text cleaning.
Start with Unicode normalization (typically NFKC) to standardize look-alike characters, then normalize whitespace: convert non-breaking spaces to spaces, collapse repeated spaces, and preserve intentional newlines around headings and lists. Standardize bullets and numbering into a consistent representation. Keep case as-is for course codes (e.g., “CS 101”) and preserve punctuation that changes meaning. When you remove artifacts like page headers (“2024–2025 Catalog”), do it via anchored patterns and record the rule name in your processing log.
Language handling matters for multilingual institutions. Detect language at the document or block level and store language metadata. If you plan to embed with multilingual models, keep the original text; if you translate, store both original and translated fields and mark which one is indexed. Never silently translate without lineage—retrieval and citations become hard to trust.
A practical outcome is fewer duplicates and cleaner chunks later. You should be able to run normalization twice and get the same output (idempotency). If normalization is not idempotent, it will be hard to trust incremental refresh because documents will appear to “change” every run.
Validation is the gate that protects your vector index from malformed or misleading documents. Treat ingestion as producing two artifact tiers: raw (immutable bytes + fetch metadata) and processed (canonical JSON + normalized text). Validation runs on processed artifacts and decides: accept, reject, or quarantine for review/retry. This is where you prevent downstream hallucinations caused by partial pages, mixed columns, or missing term metadata.
Define schema validation (required fields, types, allowed enums) and content validation (minimum text length, presence of key fields like title, detection of “Access Denied” or cookie-wall pages). Add structural checks by doc type: course docs should contain a course code pattern; program rules should include headings or section numbering; syllabus docs should include instructor/date only if you intend to keep those fields. For HTML, validate that the final URL matches the expected domain and that redirects did not land on a generic landing page.
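A validator in this spirit might return an accept/quarantine decision with machine-readable reasons; the field names, the 200-character threshold, and the block-page markers below are illustrative assumptions:

```python
import re

REQUIRED_FIELDS = {"doc_id", "doc_type", "title", "body_text", "source_url"}
COURSE_CODE = re.compile(r"\b[A-Z]{2,4}[ -]?\d{3,4}[A-Z]?\b")
BLOCK_MARKERS = ("Access Denied", "Enable JavaScript", "Please verify you are human")

def validate(doc: dict) -> tuple:
    """Return ('accept' | 'quarantine', reasons) for a processed document."""
    reasons = []
    missing = REQUIRED_FIELDS - doc.keys()
    if missing:                                   # schema validation
        reasons.append("missing_fields:" + ",".join(sorted(missing)))
    body = doc.get("body_text", "")
    if len(body) < 200:                           # "empty but valid" guard
        reasons.append("body_too_short")
    if any(marker in body for marker in BLOCK_MARKERS):
        reasons.append("blocked_page")
    if doc.get("doc_type") == "course" and not COURSE_CODE.search(body):
        reasons.append("no_course_code")          # structural check by doc type
    return ("accept" if not reasons else "quarantine", reasons)
```

Every reason code doubles as the error code you store alongside the quarantined item.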
Quarantine is not failure; it’s controlled uncertainty. Store quarantined items with an error code, sample text snippet, and pointers to raw artifacts so you can debug quickly. Many issues are transient (timeouts, anti-bot interstitials) and should be retried with backoff. Others require extractor updates (new CSS selectors, changed PDF layout). Your lineage logs should record every decision: fetched → parsed → normalized → validated → accepted/quarantined, with timestamps and versions.
Common mistakes include validating only the schema (ignoring “empty but valid” documents), overwriting last-known-good processed artifacts, and failing to capture enough context to reproduce an error. When you implement weekly refresh later, this quarantine and lineage foundation is what lets you update incrementally with confidence and explain exactly why a retrieved answer cites a particular term’s catalog page.
1. Why does Chapter 2 emphasize producing a canonical document model before chunking and indexing?
2. Which design choice best supports safe weekly re-ingestion with traceability?
3. A team scrapes only rendered text from catalog pages and discards URLs and timestamps. What risk does Chapter 2 highlight?
4. What is the main reason to handle multi-term versions (e.g., 2024–2025 vs 2025–2026) during ingestion and normalization?
5. Which workflow aligns with the chapter’s goal of keeping bad documents out of the index?
A Retrieval-Augmented Generation (RAG) system is only as reliable as the catalog data it retrieves. Course catalogs are deceptively messy: PDFs converted to text introduce hyphenation artifacts, bullet lists collapse into run-on sentences, and the same course appears across terms with tiny edits. If you embed and index this data “as-is,” you will amplify noise, fragment recall, and increase hallucination risk because the model sees conflicting snippets as equally plausible.
This chapter builds the middle of your pipeline: deterministic cleaning, normalization, deduplication, and metadata enrichment. The goal is not to create a perfect catalog; it’s to create a repeatable workflow that produces stable text for embeddings and structured metadata for filtering and ranking. You will also learn to generate audit reports that surface missing fields and contradictions early, before they degrade retrieval quality.
Engineering judgment matters here. Over-cleaning can remove meaning (e.g., stripping punctuation that distinguishes “C++” from “C”), while under-cleaning can leave spurious tokens that dominate embeddings. Similarly, deduplication is not “delete anything similar”; it is “group representations of the same real-world course offering and keep a canonical record plus provenance.” The outcome should be a catalog schema that is retrieval-friendly: clean text for chunks, plus metadata fields such as normalized course code, credit range, prerequisites, term, campus, and program relationships that enable precise filters in your vector index.
By the end of this chapter, you should have a pipeline stage that turns raw catalog pages into canonical, linked, and auditable course records ready for chunking and indexing.
Practice note for Create deterministic cleaning rules and unit tests: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Deduplicate near-identical course entries across terms: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Extract entities (credits, prerequisites, outcomes) into metadata: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Resolve cross-links (course codes, program requirements): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Produce audit reports for missing and conflicting fields: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Start with deterministic cleaning rules that transform raw catalog text into stable, embedding-friendly content. Deterministic means: the same input always produces the same output, regardless of runtime order or environment. This matters because embeddings and dedup hashes change when whitespace or punctuation changes, which can cause unnecessary re-indexing and churn.
Typical artifacts include: repeated whitespace, page headers/footers, broken hyphenation from PDF extraction (e.g., “pre- requisite”), bullet characters (•, ◦) that become random Unicode, and “smart quotes” that vary by source. A practical approach is to define a “clean_text(text) -> text” function with a small sequence of rules, each justified and unit-tested.
Common mistakes: stripping all punctuation (breaks course codes like “MATH-101”), deleting all short lines (often removes prerequisites), and removing parentheses (often contain credit ranges or co-requisite notes). Keep a “lossy vs lossless” mindset: cleaning should remove noise but preserve semantics. Finally, write unit tests using real samples: a PDF-extracted course description, an HTML page, and a copied syllabus snippet. Your tests should assert exact expected outputs, so a future change (like a new regex) cannot silently change the downstream embeddings.
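A sketch of such a clean_text with a small, ordered rule list (the rules shown are examples, each deliberately narrow so it can be unit-tested in isolation):

```python
import re

# Ordered, deterministic rules: same input, same output, every run.
RULES = [
    (re.compile(r"(\w)-\s+(\w)"), r"\1\2"),      # rejoin PDF hyphenation: "pre- requisite"
    (re.compile(r"[\u2022\u25e6\u00b7]"), "-"),  # unify stray bullet characters
    (re.compile(r"[ \t]+"), " "),                # collapse runs of spaces/tabs
]

def clean_text(text: str) -> str:
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text.strip()
```

Note that the hyphenation rule leaves course codes like "MATH-101" untouched because it only fires when whitespace follows the hyphen.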
Normalization is about creating a canonical representation for fields that drive joins, filters, and grouping. In course catalogs, the biggest offenders are course codes (spacing and punctuation), titles (case and subtitles), and credits (integers, decimals, ranges, or “variable”). If these fields are inconsistent, you cannot reliably deduplicate across terms or resolve cross-links like prerequisites.
For course codes, define a strict parser that produces both a display form and a normalized key. For example, map “CS 101”, “CS-101”, and “cs101” to subject=CS, number=101, and normalized key CS101. Keep suffixes (e.g., “101L”, “101A”) and separators used in your institution (e.g., “CSCI 1100”). Store the original raw string for audit and explainability.
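One possible parser, assuming codes of the shape "letters + number + optional suffix" (adjust the pattern to your institution's conventions):

```python
import re

CODE_RE = re.compile(r"^\s*([A-Za-z]{2,5})[\s-]*(\d{3,4})([A-Za-z]?)\s*$")

def parse_course_code(raw: str):
    """'CS 101', 'CS-101', 'cs101' all normalize to key 'CS101'.

    Returns None when the string does not look like a course code,
    and always preserves the raw input for audit and explainability.
    """
    m = CODE_RE.match(raw)
    if not m:
        return None
    subject = m.group(1).upper()
    number = m.group(2)
    suffix = m.group(3).upper()
    return {"subject": subject, "number": number, "suffix": suffix,
            "key": f"{subject}{number}{suffix}", "raw": raw}
```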
For titles, avoid aggressive rewriting. Normalize whitespace, trim trailing punctuation, and optionally store a lowercase “title_key” for matching, but keep the human-readable title intact for generation and citations. If the catalog sometimes includes subtitles like “Introduction to Biology: Cells and Systems,” consider splitting into title and subtitle only if you can do it reliably; otherwise, keep a single field to avoid brittle heuristics.
Credits need a structured schema. A practical model is: credits_min, credits_max, and credits_text. Parse “3 credits” as (3,3), “3–4 credits” as (3,4), and “variable” as (null,null) with credits_text="Variable". This enables filter queries like “courses with at least 4 credits” without guessing from raw text. The key engineering judgment: never discard the original credit string, because edge cases (labs, contact hours) may need manual review later.
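A sketch of that credits parser (the regexes and field names are assumptions to adapt; note it handles both hyphen and en-dash ranges):

```python
import re

RANGE_RE = re.compile(r"(\d+(?:\.\d+)?)\s*[-\u2013]\s*(\d+(?:\.\d+)?)")
SINGLE_RE = re.compile(r"(\d+(?:\.\d+)?)")

def parse_credits(text: str) -> dict:
    """'3 credits' -> (3,3); '3–4 credits' -> (3,4); 'Variable' -> (None, None).

    credits_text always keeps the original string for manual review later.
    """
    m = RANGE_RE.search(text)
    if m:
        return {"credits_min": float(m.group(1)),
                "credits_max": float(m.group(2)), "credits_text": text}
    m = SINGLE_RE.search(text)
    if m:
        value = float(m.group(1))
        return {"credits_min": value, "credits_max": value, "credits_text": text}
    return {"credits_min": None, "credits_max": None, "credits_text": text}
```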
Course catalogs often repeat the same course across terms, campuses, or delivery modes, with minor edits (“updated textbook,” “same outcomes, new wording”). If you index all variants separately, retrieval can surface multiple near-identical chunks and waste context window budget. Deduplication should reduce redundancy while preserving provenance: you want one canonical course entity with linked offerings and source URLs.
Use a two-stage approach: fast candidate generation with hashing, then confirmation with similarity. First, build a stable “dedup key” from normalized fields (e.g., CS101 + normalized title + credits range). Then compute a content hash on cleaned description text (e.g., SHA-256). Exact hash matches are safe duplicates. For near-duplicates, compute similarity on a normalized representation (e.g., TF-IDF cosine, MinHash/LSH for shingles, or a lightweight embedding similarity if you already compute embeddings). Set thresholds conservatively and review borderline cases.
Canonical selection should be deterministic: prefer the newest catalog year, the most complete record (fewest missing fields), or a trusted source system over scraped HTML. Store a merged_from list with term/campus/source identifiers so weekly refreshes can update offerings without changing the canonical entity unnecessarily. Common mistakes include deduplicating purely on title (“Special Topics” repeats with different content) and ignoring suffixes (“BIO101” vs “BIO101L”). Your dedup logic should be unit-tested with known pairs: identical, near-identical, and deceptively similar but distinct.
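The two stages can be sketched as follows; the word-shingle Jaccard here is a small-scale stand-in for MinHash/LSH, not a production choice:

```python
import hashlib

def content_hash(cleaned_text: str) -> str:
    """Stage 1: exact-duplicate detection on cleaned description text."""
    return hashlib.sha256(cleaned_text.encode("utf-8")).hexdigest()

def dedup_key(code_key: str, title: str, credits_min, credits_max) -> str:
    """Candidate-generation key from normalized fields."""
    return f"{code_key}|{title.lower().strip()}|{credits_min}-{credits_max}"

def jaccard_shingles(a: str, b: str, k: int = 5) -> float:
    """Stage 2: near-duplicate confirmation on k-word shingles."""
    def shingles(s):
        words = s.lower().split()
        return {tuple(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0
```

Exact hash matches merge automatically; Jaccard scores near your threshold go to a review queue rather than being merged silently.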
Metadata is what makes RAG retrieval precise. Clean text helps semantic matching, but structured fields enable filtering (“only undergraduate,” “offered online,” “requires CHEM101”). Enrichment is the step where you extract entities—credits, prerequisites, learning outcomes—from the cleaned text and store them in a schema designed for retrieval and evaluation.
Start with rule-based parsing because it is explainable and stable. For prerequisites and co-requisites, build a small grammar that recognizes patterns like “Prerequisite(s):”, “Pre-req:”, “Must have completed”, and lists of course codes separated by commas, “and/or.” Don’t try to perfectly model logical expressions on day one; instead, capture two layers: (1) a raw prerequisite string, and (2) a parsed list of referenced course codes. Later, you can improve logic handling (AND/OR groups) as a separate iteration.
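A minimal two-layer extractor in this spirit (the header patterns below cover only colon-style labels; phrasings like "Must have completed" would need extra rules):

```python
import re

PREREQ_HEADER = re.compile(
    r"(?:Prerequisite\(s\)|Prerequisites?|Pre-req)\s*:\s*(.+)", re.IGNORECASE)
# Case-sensitive on purpose, so words like "and" never match as codes.
CODE_TOKEN = re.compile(r"\b([A-Z]{2,5})[\s-]?(\d{3,4}[A-Z]?)\b")

def extract_prereqs(text: str) -> dict:
    """Layer 1: raw prerequisite string. Layer 2: referenced course codes."""
    m = PREREQ_HEADER.search(text)
    if not m:
        return {"raw": None, "codes": []}
    raw = m.group(1).strip()
    codes = [f"{subject}{number}" for subject, number in CODE_TOKEN.findall(raw)]
    return {"raw": raw, "codes": codes}
```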
Heuristics should be scoped and measurable. For example, detect delivery mode from phrases (“online,” “hybrid”) only if your false positive rate is acceptable; otherwise, treat it as unknown. A common mistake is to over-interpret prose and create “confident” metadata that is wrong, which leads to filtered searches excluding the correct course. Prefer conservative enrichment with explicit “unknown” states, and produce audit flags when parsing fails (e.g., “Prerequisite mentioned but no codes parsed”). This sets you up for better recall and fewer hallucinations because the retriever can constrain results to relevant subsets without relying solely on embeddings.
Catalog pages are full of cross-links: course pages reference other courses, and program pages reference required courses, electives, and rules (“Choose two from…”). For RAG, resolving these cross-links into an explicit relationship graph provides two benefits: it improves retrieval (you can pull connected context) and it reduces hallucinations (you can validate generated answers against known relationships).
Model relationships as edges between canonical entities. At minimum, you need: COURSE --prereq--> COURSE, COURSE --coreq--> COURSE, PROGRAM --requires--> COURSE, and PROGRAM --elective_group--> COURSE. Each edge should have provenance (source URL, catalog year) and confidence (parsed vs manually curated). When parsing program requirements, treat rule text as first-class data: store the raw rule paragraph, then store extracted course references and group labels (e.g., “Math foundation electives”).
Version edges with effective_year so the retriever can filter by a student’s catalog year. Practical retrieval pattern: if the user asks “Can I take CS201 next term?”, retrieve CS201 plus its prereq nodes and the student’s program rules node (if available). This produces grounded answers with citations. Common mistakes: flattening program rules into one long chunk with no structure (hard to filter and cite), and ignoring year/version, which causes the system to mix requirements from different catalogs. A small, explicit graph keeps your pipeline honest and enables downstream guardrails like “only cite prerequisites that exist as edges in the graph.”
Cleaning and enrichment are only trustworthy if you can measure their outputs over time. Build quality checks that run every pipeline execution and produce audit artifacts you can diff week to week. This is where you turn ad hoc data wrangling into a maintainable system that supports incremental updates and safe re-indexing.
Define a set of invariants and completeness checks tied to your schema. Examples: every course must have a normalized code, a title, and at least one source URL; credits must either parse into min/max or be explicitly labeled variable; prerequisite text that contains a course-code-like token must produce at least one parsed reference; and every relationship edge must point to a known node or a placeholder with an “unresolved” flag.
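These invariants can be expressed as a small audit function; the field names assume the schema built earlier in this chapter:

```python
def audit_courses(courses: list) -> list:
    """Emit one machine-readable flag per violated invariant.

    Write the flags out every run and diff them week to week.
    """
    flags = []
    for course in courses:
        doc_id = course.get("doc_id")
        if not course.get("code_key"):
            flags.append({"doc_id": doc_id, "flag": "missing_normalized_code"})
        if not course.get("source_urls"):
            flags.append({"doc_id": doc_id, "flag": "missing_source_url"})
        credits_text = (course.get("credits_text") or "").lower()
        if course.get("credits_min") is None and "variable" not in credits_text:
            flags.append({"doc_id": doc_id, "flag": "unparsed_credits"})
    return flags
```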
Make audits reproducible by storing run metadata (timestamp, git commit, input snapshot IDs) and emitting machine-readable outputs (CSV/JSON) plus a human-readable summary. Unit tests should cover deterministic cleaning and parsing rules, while integration tests can validate end-to-end expectations on a small fixture catalog. Common mistakes include only checking averages (which hides outliers) and not keeping historical audits (which makes regressions hard to detect). When you later evaluate retrieval quality (recall, MRR), these audits help you connect a retrieval regression back to a concrete upstream issue—like a parsing change that stopped extracting prerequisites—so you can fix the pipeline rather than “tuning prompts” to compensate.
1. Why is embedding and indexing course catalog text “as-is” risky for a RAG system?
2. What is the primary goal of deterministic cleaning, normalization, deduplication, and enrichment in this chapter?
3. Which approach best matches the chapter’s definition of deduplication for course offerings across terms?
4. What is a key tradeoff highlighted in the chapter regarding cleaning rules?
5. How do audit reports contribute to retrieval quality in the pipeline described?
Chunking is where a RAG pipeline becomes “course-catalog aware.” Cleaning and normalization make text consistent, but chunking determines what your retriever can actually find. In education catalogs, users rarely ask for an entire page; they ask for one prerequisite, one credit rule, one elective option, one deadline, or one exception clause. Your chunk strategy should therefore be built around how people ask questions—and how policy language is enforced—so retrieval returns the smallest, most self-contained evidence that still preserves meaning.
This chapter treats chunking as an engineering design choice with measurable outcomes. You will align chunk boundaries to user intent, implement section-aware chunking for structured pages, add overlap and context windows without exploding your index, attach metadata filters to reduce false matches, and run experiments to choose a default policy. The goal is high recall (you can retrieve the needed evidence) without sacrificing precision (you retrieve mostly relevant evidence) and without creating brittle, hard-to-maintain rules.
A practical way to think about chunking is: every chunk should be “answer-ready.” If retrieved alone, it should typically contain enough context to support a grounded answer, or at least clearly indicate what it refers to. When it cannot, your pipeline should have a planned mechanism for context stitching (for example, pulling the parent header or adjacent chunks). Done well, chunking reduces hallucinations because the model sees the exact governing text, not a loose paraphrase.
The sections that follow give you concrete templates and the judgment calls you must make. You will see common failure modes (too big, too small, too repetitive), a hierarchical approach that respects headings, and a validation routine based on “golden questions” rather than intuition.
Practice note for Choose chunk boundaries aligned to how users ask questions: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Implement section-aware chunking for structured pages: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Add overlap and context windows without bloating the index: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Attach metadata filters to chunks for precise retrieval: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Run chunking experiments and pick a default policy: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Start by defining what “good chunking” means for your catalog RAG. In most course-catalog assistants, your primary retrieval objective is high recall under narrow, specific queries: “Does CS 201 require CS 101?” “How many credits are needed for the certificate?” “Can I retake a course for a higher grade?” If your chunk boundaries hide the exact sentence that answers these, the model will guess. If your chunk boundaries are too broad, you will retrieve a chunk that contains the answer but is buried among unrelated details, making generation less reliable.
Common failure modes appear in predictable patterns. Overly large chunks (e.g., whole pages or 2,000+ token blocks) dilute similarity signals and often retrieve for the wrong reasons (a shared word like “credits”). Overly small chunks (e.g., single sentences) can lose referents like “this program” or “students,” and can break policy clauses that depend on exceptions and definitions. Another failure mode is boundary mismatch with user questions: users ask about prerequisites and grading, but your chunks follow arbitrary character limits, splitting the prereq list from the course description or splitting a policy rule from its exception.
Define two measurable targets before implementing: (1) retrieval recall@k for known-answer questions, and (2) median chunk size and index growth (cost). Your chunk strategy is “done” when recall stops improving meaningfully compared to the added cost and complexity. This framing keeps you from over-optimizing a clever chunker that is hard to maintain during weekly refreshes.
Course catalogs are not one document type; they are a bundle of genres. If you apply one universal chunk rule, you will underperform on at least one genre. Build templates by document type so chunk boundaries align with how users ask questions and how the content is structured.
Course pages are usually short and queryable by fields: title, description, credits, prerequisites, corequisites, restrictions, outcomes, offering terms. A strong default is “section-aware course chunks”: one chunk for the core description block plus one chunk per key field group (e.g., prerequisites + restrictions together). Users often ask “what are the prereqs?” so keep prereqs as a single coherent chunk, not scattered.
Policy pages (grading, withdrawal, academic standing, transfer) are rule-dense and exception-heavy. Chunk by heading and subheading, preserving rule + exception together. Avoid splitting numbered lists mid-list; policy meaning is often in list structure (“all of the following,” “any one of”).
Program pages (degrees, certificates, minors) combine narrative plus requirements tables and footnotes. Users ask “How many credits?” “Which courses satisfy X?” “What GPA is required?” Chunk around requirement groups: “Core requirements,” “Electives,” “Capstone,” “Residency,” and “Progression/continuation rules.” If a page contains a course list table, consider converting each row to a normalized textual record and chunking by requirement group so retrieval can connect the list to its governing header.
In implementation, store the chosen template name (e.g., doc_template=course_v2) as metadata so you can analyze performance by document type and update policies without rethinking everything at once.
Section-aware chunking becomes more robust when it is hierarchical. Instead of producing flat chunks from raw text, build a tree: document → headings (H1/H2/H3) → paragraphs/lists/tables. Each chunk is then emitted with a “path” that captures where it came from (for example: Program Requirements → Core Courses → Capstone). This path is extremely valuable both for retrieval quality and for explainable citations.
A practical hierarchical algorithm looks like this: parse the HTML (or converted markdown) into blocks, identify heading levels, and maintain a stack representing the current section path. Aggregate content under a heading until you hit a token budget, then emit a chunk. If a section is tiny (e.g., one sentence), merge it with its parent or nearest sibling so it remains answer-ready. If a section is huge (e.g., an entire “Policies” section), split at subheadings first; if no subheadings exist, split at paragraph boundaries while duplicating the section title into each chunk.
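A compact version of that stack-based walk (whitespace word counts stand in for a real tokenizer, and the merge-tiny-sections step is omitted for brevity):

```python
def chunk_blocks(blocks, max_tokens=400):
    """blocks: list of (level, text); level 0 = body text, 1-3 = heading depth.

    Maintains a heading-path stack and emits a chunk whenever the
    token budget is hit or a new heading begins.
    """
    path, chunks, buf, count = [], [], [], 0

    def flush():
        nonlocal buf, count
        if buf:
            chunks.append({"path": " > ".join(path), "text": "\n".join(buf)})
            buf, count = [], 0

    for level, text in blocks:
        if level > 0:                    # heading: close chunk, update path stack
            flush()
            del path[level - 1:]
            path.append(text)
        else:
            tokens = len(text.split())   # crude token proxy
            if count + tokens > max_tokens:
                flush()
            buf.append(text)
            count += tokens
    flush()
    return chunks
```

Each emitted chunk carries its section path, which you can both embed inside the text and store as metadata.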
Engineering judgment: do not treat headings as purely decorative. In catalogs, headings usually represent policy scope or requirement groups. Including the heading path inside the chunk text improves embedding relevance, and storing it as metadata improves filtering and faceting later. This is one of the simplest ways to improve recall without adding more chunks.
Overlap is a tool, not a requirement. You use overlap to prevent boundary cuts from losing critical context, but uncontrolled overlap inflates your index and can cause near-duplicate retrieval results. The right approach is to first choose sensible boundaries (headings, paragraphs, lists) and then apply minimal overlap only where natural boundaries are unavailable.
Set a default max chunk size in tokens based on your embedding model and the typical density of catalog text. Many teams start with 250–500 tokens for courses and 400–800 tokens for policies/program rules, then tune empirically. The danger of a single global max is that short course pages become fragmented unnecessarily, while long policy pages still contain multi-rule blocks. Prefer per-template budgets.
Use overlap strategically in two scenarios: (1) long paragraphs that you must split, and (2) sequences where the “scope sentence” precedes multiple clauses (e.g., “Students must meet all of the following…”). In these cases, a 10–15% overlap or a “prefix duplication” of the scope sentence is often enough. Prefix duplication is cheaper than full overlap because you can add a short context line (header + scope) to each child chunk rather than repeating entire paragraphs.
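Prefix duplication can be sketched in a few lines (the heading/scope framing is an assumption about how your chunker labels sections):

```python
def split_with_scope_prefix(heading: str, scope_sentence: str, clauses: list) -> list:
    """Duplicate a short context line into each child chunk.

    Cheaper than full overlap: children repeat only the heading and
    scope sentence, not entire preceding paragraphs.
    """
    prefix = f"{heading}: {scope_sentence}"
    return [f"{prefix}\n{clause}" for clause in clauses]
```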
Common mistake: using large overlaps to compensate for poor boundaries. This hides the real issue and bloats costs. A better workflow is: fix boundaries first (section-aware), then add minimal overlap or header prefixes, then add stitching only for the minority of cases where users need “one more paragraph” to make the rule interpretable.
Metadata turns “similar text retrieval” into “precise catalog retrieval.” For course catalogs, you almost always need to filter by institution, academic year/term, campus, program, and sometimes modality (online/in-person), level (undergrad/grad), and department. If you attach metadata only at the document level, you miss the chance to filter at chunk granularity—especially when a single page contains mixed information (e.g., multiple concentrations, multiple effective dates, cross-listed courses).
Design metadata in two layers: (1) inherited document metadata applied to every chunk, and (2) section-derived metadata extracted during chunking. Inherited metadata includes institution_id, source_url, catalog_year, doc_type, and last_updated. Section-derived metadata includes course_code (for course pages), program_id, requirement_group (Core/Electives/Capstone), policy_topic (Withdrawal/Grading/Transfer), and effective_term if present in the section.
Practical outcome: when a user asks “Can I transfer credits into the BS in Data Science?”, you can filter by program_id and policy_topic=transfer while still using embeddings for semantic match. This reduces hallucinations because the model is less likely to see an irrelevant transfer rule from a different program or year.
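The two layers compose naturally as a dictionary merge; the filter below is an in-memory stand-in for a vector database's metadata filter, shown only to make the pattern concrete:

```python
def build_chunk_meta(doc_meta: dict, section_meta: dict) -> dict:
    """Section-derived fields extend (and, on conflict, override) inherited ones."""
    return {**doc_meta, **section_meta}

def filter_chunks(chunks: list, **criteria) -> list:
    """Keep only chunks whose metadata matches every criterion exactly."""
    return [c for c in chunks
            if all(c["meta"].get(k) == v for k, v in criteria.items())]
```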
Chunking quality should be validated with retrieval experiments, not gut feel. Create a small, stable evaluation set of “golden questions” based on real student/advisor queries and the rules that often cause support tickets. Each question must map to a specific authoritative passage in your catalog, so you can judge whether retrieval brought back the correct chunk(s). This is how you compare chunk policies and pick a default.
Build your golden set across document types: course prerequisites, credit totals, GPA thresholds, admissions requirements, retake policies, transfer limits, residency requirements, modality constraints, and effective-date changes. For each question, record the expected document_id and section_path (or a canonical citation). Then measure recall@k and MRR (mean reciprocal rank). Recall@k tells you if the right evidence appears anywhere in the top-k; MRR tells you if it appears early enough to reliably enter the model context.
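Both metrics are a few lines of code; this sketch assumes each golden question maps to exactly one expected chunk id:

```python
def recall_at_k(ranked_ids: list, expected_id: str, k: int) -> float:
    """1.0 if the expected evidence appears anywhere in the top-k, else 0.0."""
    return 1.0 if expected_id in ranked_ids[:k] else 0.0

def mean_reciprocal_rank(runs: list) -> float:
    """runs: list of (ranked_ids, expected_id) pairs.

    Averages 1/rank of the first correct result; contributes 0 when the
    expected id never appears.
    """
    total = 0.0
    for ranked_ids, expected_id in runs:
        for rank, rid in enumerate(ranked_ids, start=1):
            if rid == expected_id:
                total += 1.0 / rank
                break
    return total / len(runs)
```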
When results are poor, diagnose by category. If recall is low for prerequisites, your course template may be splitting prereqs across chunks or not including course codes in chunk text. If MRR is low for policies, your policy chunks may be too large or missing heading prefixes that provide semantic anchors. Iterate by changing one variable at a time (max tokens, heading aggregation, overlap, prefix strategy, metadata filters) and rerun the same golden set. Once you select a default policy, lock it behind a version (e.g., chunk_policy=v3) so weekly refreshes remain consistent and changes are deliberate, measurable upgrades.
1. Why does the chapter argue that chunking is especially important for course catalogs in a RAG pipeline?
2. What does the chapter mean by making each chunk "answer-ready"?
3. How does the chapter recommend handling structured pages to improve retrieval quality?
4. What is the intended benefit of adding overlap and context windows "without bloating the index"?
5. According to the chapter, what is a recommended way to validate and choose a default chunking policy?
By the time you reach embedding and indexing, your pipeline has already done the hard work: cleaned course titles, normalized credit formats, separated “policy” text from “course description” text, and chunked pages into retrieval-sized units. Now you need to make those chunks searchable at production scale—fast, filterable, and stable week after week.
This chapter focuses on the engineering judgment that turns “we have embeddings” into “we have reliable retrieval.” In course catalogs, users ask for precise constraints (“online only,” “meets Gen Ed Area B,” “prerequisite is CSCI 101,” “available Spring”), and the system must retrieve text that actually supports those claims. That requires disciplined versioning of embedding models, a vector database schema that matches your metadata strategy, and retrieval controls like top-k limits, hybrid search, reranking, and citations.
We’ll also address two problems that appear only in real deployments: index drift (when model changes silently degrade retrieval) and indexing correctness (when chunks duplicate or disappear across refreshes). Finally, we’ll introduce practical evaluation metrics—recall, MRR, and nDCG—that let you track whether retrieval quality is getting better, not just “feels okay in demos.”
The rest of this chapter breaks the work into six concrete sections: choosing and versioning an embedding model, designing vector namespaces and metadata filters, implementing hybrid retrieval with rerankers, making indexing idempotent with a document-to-chunk map, storing citation-ready fields, and evaluating retrieval quality with standard metrics.
Practice note for Select an embedding model and define versioning rules: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Create a vector index with hybrid search and metadata filters: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Implement upserts, deletes, and idempotent indexing: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Add retrieval controls: top-k, reranking, and citations: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Benchmark latency and cost for catalog-scale traffic: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Embedding model selection is not just about “best average score.” In course catalogs, your queries include short codes (e.g., “CSCI 101”), domain phrases (“upper-division writing intensive”), and policy terms (“repeatable for credit”). Pick a model that performs well on your mix of short queries and medium-length chunks. If you have multilingual catalogs or international programs, confirm cross-lingual performance early rather than bolting it on later.
Define versioning rules before you index a single vector. At minimum, store an embedding_model_id (vendor + model name), embedding_dim, and embedding_revision (a date or semantic version you control). Many teams get burned when a provider updates a model behind a stable name; even small shifts can change nearest-neighbor results. Treat embedding generation like a compiled artifact: if the “compiler” changes, you must rebuild or isolate.
Watch for drift in both data and model. Data drift happens when the catalog changes style (new templates, new program rules format, new abbreviations). Model drift happens when you change embedding models, rerankers, chunking, or tokenization. A practical control is to maintain a small, fixed “golden query set” (20–50 representative questions) and run it weekly. If recall or MRR drops, you have an early warning that something in the pipeline changed.
Finally, choose an embedding length strategy aligned to your chunking. If your chunks are 150–400 tokens, a general-purpose embedding model works well. If you embed entire pages, similarity becomes noisy. The model choice and chunk size are a coupled decision—optimize them together.
A vector database becomes maintainable when its schema matches how your product filters results. For course catalogs, you almost always need metadata filters such as institution, term (or academic year), program, department, campus, delivery_mode (online/in-person/hybrid), and document_type (course page, syllabus, program rule, policy). If you don’t store these as filterable fields, you will attempt to “prompt” the model into respecting constraints—and it will fail intermittently.
Start with a stable primary identifier strategy. A good pattern is a globally unique chunk_id plus a doc_id that ties chunks back to the source document. For example: {institution}:{catalog_year}:{doc_type}:{doc_slug}#{chunk_index}. Keep IDs deterministic so reprocessing the same input yields the same IDs (critical for upserts and deletes).
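A minimal sketch of this ID scheme in Python (the function and argument names are illustrative, not from the chapter):

```python
def make_doc_id(institution: str, catalog_year: str, doc_type: str, doc_slug: str) -> str:
    """Deterministic document ID: the same inputs always yield the same ID."""
    return f"{institution}:{catalog_year}:{doc_type}:{doc_slug}"

def make_chunk_id(doc_id: str, chunk_index: int) -> str:
    """Chunk ID ties back to its parent document via the doc_id prefix."""
    return f"{doc_id}#{chunk_index}"

# Reprocessing the same input yields the same IDs, which is what makes
# upserts and deletes safe across weekly runs.
doc_id = make_doc_id("state-u", "2025", "course", "csci-101")
chunk_id = make_chunk_id(doc_id, 2)
```

Because the IDs are pure functions of the input, a re-run upserts over the same records instead of creating duplicates.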
Namespace design is your lever for safe upgrades. Common approaches:

- Per-environment namespaces (e.g., dev, staging, prod), so experiments never touch production vectors.
- Per-institution namespaces, when you serve multiple catalogs and want hard isolation between them.
- Per-embedding-version namespaces, so a model upgrade can be indexed and evaluated side by side before cutover.
Many teams combine these: prod/{institution}/{embedding_version}. Inside each namespace, store metadata for filtering. Avoid the mistake of putting “term” in the namespace if you need cross-term comparisons (e.g., “what changed from 2024 to 2025?”). Store term as metadata unless you truly want hard separation.
Finally, design for scale: you will likely have tens to hundreds of thousands of chunks across catalogs, syllabi, and policies. Plan index parameters (HNSW ef_construction, M values, or IVF list counts) based on expected latency and recall requirements. Choose defaults, then tune after you benchmark with realistic traffic and filters enabled.
Course catalogs are a textbook case for hybrid retrieval. Pure vector similarity can miss exact identifiers like “MATH 221” or policy phrases that users quote verbatim, while pure keyword search struggles with paraphrases (“writing intensive” vs “W-intensive requirement”). Hybrid search combines both: run a BM25 (or similar lexical) search and a vector search, merge candidates, then optionally rerank.
A practical workflow looks like this:

- Run a lexical (BM25) search and a vector search in parallel over the same filtered candidate set.
- Merge the two ranked lists (e.g., reciprocal rank fusion or weighted score blending) and de-duplicate by doc_id.
- Optionally rerank the merged top candidates with a cross-encoder, then assemble the final top-k context with citations.
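One common way to merge the lexical and vector candidate lists is reciprocal rank fusion (RRF). This sketch assumes each retriever returns chunk IDs in rank order; the constant k=60 is a conventional default, not a chapter recommendation:

```python
from collections import defaultdict

def rrf_merge(lexical_ids, vector_ids, k: int = 60):
    """Reciprocal rank fusion: score = sum of 1/(k + rank) across lists.
    Robust to incomparable raw scores (BM25 vs cosine similarity)."""
    scores = defaultdict(float)
    for ranked in (lexical_ids, vector_ids):
        for rank, chunk_id in enumerate(ranked, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; items found by both retrievers rise.
    return sorted(scores, key=scores.get, reverse=True)

merged = rrf_merge(["c1", "c2", "c3"], ["c2", "c4", "c1"])
```

Because RRF uses only ranks, it sidesteps the problem of normalizing BM25 scores against cosine similarities before blending.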
Rerankers are especially valuable when chunks are similar (multiple courses share “3 credits, lecture”). A cross-encoder reranker (or a lightweight LLM reranker) can read the query and each candidate chunk and produce a relevance score. Use reranking when you need precision and citations—e.g., showing the exact paragraph supporting “Students must earn a C or better.” The tradeoff is cost and latency.
Retrieval controls should be explicit and measurable:

- Top-k limits: cap how many chunks reach the prompt; a larger k raises recall but invites noise and hallucination.
- Score thresholds: drop candidates below a minimum similarity or reranker score instead of padding the context.
- Metadata filters: enforce hard constraints (term, campus, delivery_mode) at query time rather than hoping the model respects them.
- Citations: require every answer to reference the retrieved chunks that support it.
Common mistakes include reranking without caching (causing expensive repeat calls) and merging lexical/vector results without de-duplicating by doc_id. Treat hybrid retrieval as a pipeline with clear stages and instrumentation, not a single “search” call.
Weekly refresh is where many RAG systems become unreliable. If your indexing job creates new IDs every run, you will accumulate duplicates, inflate cost, and degrade retrieval (“same paragraph appears 4 times”). The fix is idempotent indexing: running the pipeline twice on the same inputs should produce the same index state.
Implement idempotency with a document-to-chunk mapping and deterministic chunk IDs. Store a doc_registry table (or collection) with:

- doc_id and source URL
- checksum of the cleaned content
- embedding model ID and version used
- the list of chunk_ids generated from the document
- last_seen timestamp from the most recent crawl
On each refresh, compute the cleaned content checksum. If the checksum is unchanged and the embedding version is unchanged, skip embedding and indexing entirely. If the checksum changed, regenerate chunks and embeddings and upsert by deterministic chunk_id. If the doc disappeared from the crawl (no longer present), schedule its chunks for delete based on doc_id.
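The refresh decision described here can be captured as a small pure function; the field and value names are illustrative:

```python
def refresh_action(registry_entry, crawled_checksum, embedding_version):
    """Decide what to do with one document during a weekly refresh.

    registry_entry: dict with 'checksum' and 'embedding_version' from the
    doc_registry, or None if the doc is new. crawled_checksum is None when
    the document no longer appears in the crawl.
    """
    if crawled_checksum is None:
        return "delete"    # doc disappeared: remove its chunks by doc_id
    if registry_entry is None:
        return "index"     # new doc: chunk, embed, upsert
    if (registry_entry["checksum"] == crawled_checksum
            and registry_entry["embedding_version"] == embedding_version):
        return "skip"      # nothing changed: no embedding, no indexing
    return "reindex"       # content or model changed: regenerate and upsert
```

Keeping this logic in one pure function makes the refresh behavior easy to unit-test before wiring it into the orchestrator.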
For deletes, avoid scanning the whole vector index. Keep a secondary mapping of doc_id → list of chunk_ids (or store doc_id as metadata and use “delete by filter” if your DB supports it safely). This enables precise cleanup when program requirements are retired or URLs change.
A subtle but important control: define what constitutes a “document.” A course page might include multiple sections (description, outcomes, prerequisites). If your chunker changes, chunk boundaries shift. Deterministic IDs should be based on stable anchors (section IDs, headings, or rule numbers) when possible, not just “chunk 0, chunk 1.” This reduces churn and makes diffs meaningful across weeks.
Retrieval without citations is a hallucination trap in education settings. Users need to know where a rule or requirement came from, and your system needs an audit trail when content changes. Make every chunk “citation-ready” by storing the fields needed to render a precise reference.
At minimum, store these metadata fields per chunk:

- source_url (and page number for PDFs)
- section_path or heading trail within the document
- doc_id and chunk_id
- checksum of the cleaned document text
- char_start/char_end offsets within the cleaned document
- catalog_year or term, plus the extraction timestamp
The checksum matters for two reasons. First, it supports idempotent indexing and change detection. Second, it enables citation integrity: when an answer cites a chunk, you can verify that the chunk text still matches what was indexed at the time of the response. If the catalog updates mid-week, your UI can warn “source updated since this answer was generated.”
Store offsets when possible: char_start/char_end within the cleaned document, or token offsets. Offsets let you reconstruct an exact snippet and highlight it in the UI. If your original sources are PDFs or scanned syllabi, store page numbers and extraction confidence scores; low-confidence OCR text should be down-weighted in retrieval or require stricter thresholds.
Common mistakes include citing only the top-level page URL (users can’t find the relevant part) and failing to preserve the cleaned text used for embedding (citations drift because you display a different version than you embedded). Align what you embed, what you cite, and what you display.
You cannot improve retrieval reliably without measurement. In catalog RAG, the most useful evaluations are small, consistent, and tied to user intent. Build a labeled set of queries with expected sources—e.g., 100 questions across admissions, prerequisites, program rules, delivery modes, and credit requirements. Each query should map to one or more “relevant” chunks (doc_id + section_path).
Three baseline metrics cover most needs:

- Recall@k: did any relevant chunk appear in the top k results? This measures candidate generation.
- MRR (mean reciprocal rank): how high did the first relevant chunk rank? This measures ordering quality.
- nDCG: how well does the full ranking order multiple relevant chunks by graded relevance?
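As a minimal sketch, Recall@k and MRR can be computed over a labeled query set like this (data shapes are illustrative; nDCG follows the same pattern with graded relevance):

```python
def recall_at_k(retrieved, relevant, k: int) -> float:
    """Fraction of queries with at least one relevant chunk in the top k."""
    hits = sum(1 for r, rel in zip(retrieved, relevant)
               if set(r[:k]) & set(rel))
    return hits / len(retrieved)

def mrr(retrieved, relevant) -> float:
    """Mean reciprocal rank of the first relevant chunk (0 if none found)."""
    total = 0.0
    for r, rel in zip(retrieved, relevant):
        for rank, chunk_id in enumerate(r, start=1):
            if chunk_id in rel:
                total += 1.0 / rank
                break
    return total / len(retrieved)

# Two queries: the first finds its golden chunk at rank 2, the second misses.
retrieved = [["a", "b", "c"], ["x", "y", "z"]]
relevant = [["b"], ["q"]]
```

Running the same functions weekly against the golden set gives you the early-warning signal for drift described earlier in the chapter.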
Use evaluation to tune controls. If Recall@50 is high but MRR@10 is low, your candidate generation is fine but ranking is weak—consider reranking or better hybrid weighting. If Recall drops when filters are enabled, your metadata normalization is likely inconsistent (e.g., “Online” vs “online” vs “distance”). If recall is strong but the model still hallucinates, reduce top-k, tighten score thresholds, and ensure citation-required prompting only answers from retrieved text.
Don’t ignore latency and cost during evaluation. Track time spent in each stage: filter + vector search, BM25 search, rerank, and final context assembly. For catalog-scale traffic, you often win by caching frequent queries, caching reranker results for popular pages, and using a two-tier approach (cheap retrieval first, expensive rerank only when ambiguity is high).
Most importantly, evaluate after every change in embedding version, chunking rules, filters, or indexing settings. Retrieval is a system; small changes can cause large shifts. Metrics give you the confidence to ship improvements without breaking students’ ability to find correct, citable program information.
1. Why does Chapter 5 emphasize embedding model versioning rules in a course-catalog RAG pipeline?
2. A user asks: “online only, Gen Ed Area B, offered Spring.” What chapter concept is most directly required to satisfy these precise constraints?
3. Which approach best addresses indexing correctness when running weekly refreshes so chunks don’t duplicate or disappear?
4. How do retrieval controls like top-k limits, reranking, and citations contribute to reliable retrieval in this chapter?
5. Which set of metrics does the chapter recommend to track whether retrieval quality is actually improving over time?
Once your course catalog is cleaned, chunked, embedded, and indexed, the work is not “done”—it becomes an operational system. Course catalogs change constantly: prerequisites are revised, delivery modes shift, tuition notes update, programs get renamed, and entire pages move. A RAG pipeline that is correct on day one can quietly drift into being misleading by week six if you do not manage refresh, monitoring, and governance. This chapter turns your pipeline into a dependable weekly service.
The main operational goal is simple: keep retrieval accurate and current without breaking production. That breaks down into five practical responsibilities: (1) detect what changed, (2) update only what needs updating (or rebuild safely when needed), (3) orchestrate runs with retries and alerts, (4) monitor freshness, coverage, and retrieval quality, and (5) document runbooks and governance so the system survives staff turnover and vendor/source changes.
Engineering judgment matters here because the “right” approach depends on your sources (CMS pages, PDFs, SIS exports), your indexing scheme (single index vs multi-collection), and your risk tolerance. A university catalog may be updated weekly but can have high stakes; a bootcamp catalog might change daily but is lower risk. The patterns below aim for safe defaults: incremental updates when possible, guarded by evaluation and rollback mechanisms.
Practice note for Design the weekly refresh workflow and incremental detection: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Implement safe re-indexing with backfills and rollbacks: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Add monitoring for freshness, coverage, and retrieval quality: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Create runbooks for failures, source changes, and schema updates: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Ship a production checklist and handoff documentation: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Weekly refresh starts with knowing what changed. The most common mistake is “re-scrape everything” with no visibility into deltas, which increases cost, introduces unnecessary churn in embeddings, and makes debugging harder. Instead, design a change-detection layer that outputs a stable list of impacted source documents (pages, PDFs, JSON rows) and the type of change (new, modified, deleted, moved).
Use layered detection so you are resilient to imperfect signals:

- Sitemaps and feeds: if a source exposes lastmod or an update feed, treat it as a hint—not ground truth. Some CMSs fail to update lastmod for template edits.
- HTTP caching headers: ETag and Last-Modified can be powerful for HTML pages and PDFs, but proxies sometimes strip or misreport them. Record these values but verify when critical.
- Content checksums: hash the cleaned, canonicalized text of each document. This is your ground truth; headers and feeds only tell you where to look first.

For course catalogs, canonicalization is essential. If your extraction includes a "Last updated" timestamp inside the page body, your checksum will change every week even when the course content didn't. Build a "noise removal" step that strips known volatile regions (banner alerts, rotating testimonials, timestamps, cookie prompts). Keep a log that explains what was removed—otherwise you may accidentally remove meaningful policy notes.
Output of this section should be a durable change manifest, for example: {source_id, url, content_hash, prior_hash, change_type, detected_at}. Downstream steps (chunking, embedding, indexing) should consume this manifest, not re-derive changes independently.
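A sketch of the canonicalize-then-hash step that feeds the manifest; the volatile-region patterns and field names are illustrative and would be extended per source:

```python
import hashlib
import re
from datetime import datetime, timezone

# Known volatile regions to strip before hashing (illustrative patterns).
TIMESTAMP_RE = re.compile(r"last updated:.*", re.IGNORECASE)
WHITESPACE_RE = re.compile(r"\s+")

def canonical_hash(text: str) -> str:
    """Hash the content with volatile regions removed and whitespace
    normalized, so layout churn does not trigger false changes."""
    cleaned = TIMESTAMP_RE.sub("", text)
    cleaned = WHITESPACE_RE.sub(" ", cleaned).strip()
    return hashlib.sha256(cleaned.encode("utf-8")).hexdigest()

def manifest_entry(source_id, url, text, prior_hash):
    """Build one row of the durable change manifest."""
    content_hash = canonical_hash(text)
    if prior_hash is None:
        change_type = "new"
    elif content_hash == prior_hash:
        change_type = "unchanged"
    else:
        change_type = "modified"
    return {"source_id": source_id, "url": url,
            "content_hash": content_hash, "prior_hash": prior_hash,
            "change_type": change_type,
            "detected_at": datetime.now(timezone.utc).isoformat()}
```

Note that a page whose only difference is the "Last updated" line hashes identically, which is exactly the false-positive this step exists to suppress.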
With a change manifest in hand, decide whether to update incrementally or do a full rebuild. Incremental updates are the default for weekly refresh because they are cheaper, faster, and reduce embedding churn. But full rebuilds are sometimes safer—especially after schema changes, chunking strategy updates, or extraction fixes that would otherwise leave your index in an inconsistent mixed state.
A practical decision rule:

- Default to incremental: only documents flagged in the change manifest are re-chunked, re-embedded, and upserted.
- Trigger a full rebuild when the chunking policy, extraction logic, metadata schema, or embedding model version changes; a partial update would leave the index in a mixed, inconsistent state.
- If the weekly delta covers a large fraction of the corpus (for example, a template change touched most pages), prefer a rebuild anyway: incremental bookkeeping no longer saves much.
Implement incremental safely by making your chunk identifiers deterministic. A common pattern is chunk_id = hash(source_id + chunk_type + chunk_start_offset + chunk_end_offset) or a stable semantic anchor like course_code + term + section_heading. If chunk IDs change every run, you cannot reliably delete or update; your index will accumulate duplicates and retrieval will degrade.
Safe re-indexing often means using versioned collections or shadow indexes. For example, write to catalog_v2026_03_25 during a rebuild, run evaluation and sanity checks, then atomically switch the production alias from catalog_current to the new version. Keep the previous version for a rollback window (e.g., 2–4 weeks) and document the rollback procedure. For incremental runs, you can still use versioning by batching updates into a “delta” collection and periodically compacting into a clean full rebuild.
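Real vector databases expose their own aliasing or collection-swap APIs; this toy registry only illustrates the publish-and-rollback bookkeeping described above:

```python
class IndexAliases:
    """Map stable handles (e.g. 'catalog_current') to versioned collections,
    keeping recent versions around for a rollback window."""

    def __init__(self, keep_versions: int = 3):
        self.aliases = {}
        self.history = {}  # alias -> prior collections, newest first
        self.keep_versions = keep_versions

    def publish(self, alias: str, collection: str) -> None:
        """Point the alias at a new collection; retain the old one."""
        old = self.aliases.get(alias)
        if old is not None:
            self.history.setdefault(alias, []).insert(0, old)
            del self.history[alias][self.keep_versions:]
        self.aliases[alias] = collection

    def rollback(self, alias: str) -> str:
        """Revert the alias to the most recent prior collection."""
        prior = self.history[alias].pop(0)
        self.aliases[alias] = prior
        return prior
```

Consumers only ever read `aliases["catalog_current"]`, which is the "index alias contract" governance section later in the chapter asks you to document.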
Backfills are another operational need: you may add a new source (department pages) or fix extraction for PDFs. Treat backfills like controlled full rebuilds for a subset of sources: run them in a separate job, validate, then merge. Avoid mixing backfills with the routine weekly refresh until you can measure the impact on coverage and retrieval quality.
Orchestration turns your pipeline into a service. A weekly refresh workflow should be scheduled, idempotent, and observable. Whether you use Airflow, Dagster, Prefect, GitHub Actions, or a managed ETL tool, the design principles are the same: clear task boundaries, retries for transient failures, and alerts that map to actionable runbooks.
Start by decomposing the refresh into stages aligned with your RAG pipeline:

1. Detect changes and emit the change manifest.
2. Extract and clean impacted documents; quarantine failures.
3. Chunk and embed the changed content.
4. Upsert and delete in the vector index.
5. Run retrieval evaluation against the golden set.
6. Publish (switch the alias) only if checks pass.
Idempotency is crucial: if the job dies halfway through embedding, rerunning should not create duplicates. Use a run identifier and write intermediate outputs (manifest, chunk table) to durable storage. Then embedding and indexing can be “resume-able” by selecting only records missing embeddings or missing index confirmation.
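The resume logic can be sketched as a pure function over ID sets; the names are illustrative:

```python
def pending_work(chunk_ids, embedded_ids, indexed_ids):
    """Resume a half-finished run: embed only chunks without embeddings,
    index only chunks embedded but not yet confirmed in the vector DB."""
    to_embed = [c for c in chunk_ids if c not in embedded_ids]
    to_index = [c for c in chunk_ids
                if c in embedded_ids and c not in indexed_ids]
    return to_embed, to_index
```

Because the function only selects missing work, rerunning a failed job processes the remainder instead of duplicating what already succeeded.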
Retries should be selective. Retry network and rate-limit failures with exponential backoff; do not blindly retry deterministic parse errors (e.g., malformed PDF) without routing them to a quarantine queue. Alerting should be layered: (1) a “run failed” page for critical failures that stop publishing, (2) warnings for partial coverage drops (e.g., 5% fewer pages indexed), and (3) informational alerts for cost anomalies (embedding spend spike).
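A sketch of selective retries, assuming transient failures are surfaced as a dedicated exception type; anything else propagates immediately so it can be routed to the quarantine queue:

```python
import random
import time

class TransientError(Exception):
    """Network or rate-limit failure worth retrying."""

def with_retries(fn, max_attempts: int = 5, base_delay: float = 1.0,
                 sleep=time.sleep):
    """Retry only TransientError, with exponential backoff plus jitter.
    Deterministic failures (e.g., a malformed PDF raising ValueError)
    are NOT retried; they propagate on the first attempt."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
```

The injectable `sleep` parameter keeps the backoff testable without real waiting.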
Finally, add operational guardrails: a “circuit breaker” that prevents publishing if evaluation metrics regress beyond a threshold or if freshness falls behind your SLA (e.g., more than 10% of courses older than 14 days). This is how you prevent silent degradation.
You cannot operate what you cannot measure. For RAG in course catalogs, observability is not just system uptime; it includes data freshness, index coverage, and retrieval behavior. A common failure mode is a pipeline that “succeeds” technically but produces an index missing key departments due to a source change. Your dashboards should catch that before users do.
Track four categories of metrics:

- Freshness: age of each indexed document relative to its source; the share of documents past the SLA.
- Coverage: counts of indexed documents and chunks versus expected baselines, sliced by department, campus, and document type.
- Retrieval quality: golden-set recall and MRR, plus production proxies like the zero-result rate.
- Pipeline health and cost: stage durations, failure and retry rates, embedding spend.
Coverage is the metric that ties data operations to educational outcomes. Maintain expected counts: number of courses by campus/term, number of program pages, number of policy documents. Compare current index counts to a baseline and alert on sudden drops or spikes. Spikes can indicate duplication (chunk ID instability) or a template change that caused chunk explosion.
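A simple baseline comparison with separate drop and spike tolerances might look like this (the thresholds are illustrative):

```python
def coverage_alerts(current_counts, baseline_counts,
                    drop_tol: float = 0.05, spike_tol: float = 0.25):
    """Compare current index counts to a baseline per slice (department,
    campus, doc type). Drops suggest missed sources; spikes suggest
    duplication or chunk explosion from a template change."""
    alerts = []
    for key, base in baseline_counts.items():
        cur = current_counts.get(key, 0)
        if base and cur < base * (1 - drop_tol):
            alerts.append((key, "drop", cur, base))
        elif base and cur > base * (1 + spike_tol):
            alerts.append((key, "spike", cur, base))
    return alerts
```

Slicing by department or campus is what catches the "index missing one department" failure mode before users do.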
Include retrieval quality proxies in ops dashboards, even if you also run formal evaluations. Examples: percentage of user queries with zero retrieved results, average top-k similarity score, and distribution of metadata filters used (e.g., campus=online). Sudden shifts often indicate metadata mapping failures.
Instrument your pipeline with structured logs that include source_id, run_id, schema_version, and embedding_model. When someone asks, “Why did this course answer cite the wrong prerequisite?” you need to trace back to the specific chunk and the run that produced it.
Weekly refresh changes your index; regression testing ensures it does not change your product behavior in harmful ways. The key is to treat retrieval like any other component: you need a repeatable evaluation set, metrics, and pass/fail thresholds. Many teams only test generation outputs, but in RAG the biggest leverage is testing retrieval first.
Build a small but representative evaluation set from real catalog questions and edge cases:

- Prerequisite and course-code lookups ("What is the prerequisite for CSCI 101?")
- Program and degree requirements, including multi-step rules
- Policy questions users quote verbatim ("repeatable for credit")
- Delivery mode, campus, and term constraints
- Known-tricky formats: PDFs, tables, and recently renamed programs
For each query, store expected documents or chunk IDs (goldens), plus acceptable alternates. Then compute retrieval metrics such as Recall@k (did we retrieve the right source anywhere in top k?) and MRR (how high did it appear?). Track these metrics per slice: by source type, department, campus, and document format (HTML vs PDF). A weekly run that passes overall recall can still fail for a single campus if that subset was missed.
Use regression tests to validate schema and filters too. For example, if your system relies on metadata filters like campus, term, credential_level, add tests that confirm filtered retrieval returns results and does not leak cross-campus content. This directly reduces hallucinations: when retrieval is empty or off-scope, the generator tends to “fill in” unless you enforce guardrails like “answer only from citations.”
Operationally, wire the evaluation step into orchestration: run it after indexing, before publishing. If metrics regress beyond a tolerance (e.g., Recall@5 drops by 5 points, or key queries fail), block the alias switch and trigger an investigation. This is the backbone of safe re-indexing.
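The publish gate can be a small predicate wired between indexing and the alias switch; the metric names and tolerances here are illustrative:

```python
def should_publish(current: dict, baseline: dict,
                   max_recall_drop: float = 0.05) -> bool:
    """Block the alias switch when retrieval metrics regress beyond
    tolerance or any must-pass golden query fails."""
    if baseline["recall_at_5"] - current["recall_at_5"] > max_recall_drop:
        return False
    if not current.get("key_queries_pass", False):
        return False
    return True
```

Returning False here should leave the previous versioned collection live and open an investigation, never publish a degraded index.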
Governance is how you keep a weekly refresh system aligned with institutional trust. In education, incorrect course information can cause real harm (missed prerequisites, delayed graduation). Governance does not mean bureaucracy; it means having explicit, lightweight controls on what changes, who approves it, and how you roll it back.
Start with versioning everywhere it matters:

- Embedding model: record the embedding model ID and revision with every vector; rebuild or isolate when it changes.
- Chunking policy: version the policy (e.g., chunk_policy=v3) so chunk boundaries change only deliberately.
- Index collections: publish versioned collections behind a stable alias so you can roll back.
- Metadata schema: when you add or rename a filter field (e.g., delivery_mode), increment a schema version and maintain backwards compatibility in the index filters until clients are updated.
Deprecation is often neglected. If you keep old collections forever, cost grows and teams accidentally query the wrong index. Define a retention policy: keep the last N versions or the last X weeks, then delete after confirming no clients depend on them. Document an “index alias contract” so consumers only use catalog_current (or a similar stable handle), never a raw version name.
Finally, write runbooks and handoff documentation as part of governance. Each alert should link to a runbook: what failed, likely causes (source HTML changed, PDF parser bug, embedding rate limits), how to diagnose (logs, sample URLs), and the safe actions (retry stage, quarantine documents, rollback alias). This is what makes the system operable by the next person, not just the builder.
1. Why does Chapter 6 emphasize that a RAG pipeline is not “done” after initial cleaning, chunking, embedding, and indexing?
2. What is the main operational goal described for running the pipeline as a weekly service?
3. Which set best matches the chapter’s five operational responsibilities?
4. According to the chapter, what factors most influence the “right” operational approach for refresh and re-indexing?
5. What does the chapter present as a safe default pattern for keeping the system current?