AI Certifications & Exam Prep — Intermediate
Pass OCI Generative AI by building production-style RAG apps end to end.
This course is a short, technical, book-style blueprint designed to help you prepare for the OCI Generative AI Professional exam by building a realistic Retrieval-Augmented Generation (RAG) application on Oracle Cloud. Instead of memorizing disconnected facts, you’ll connect exam objectives to concrete implementation decisions: tenancy setup, secure retrieval, prompting for grounded answers, evaluation, and production-minded operations.
Across six tightly sequenced chapters, you’ll create a working RAG pipeline—from ingestion and embeddings to retrieval, generation, guardrails, and testing—while continuously mapping your work back to certification-style scenarios. The result is both exam confidence and a portfolio-ready reference architecture you can reuse at work.
You’ll design and implement a complete RAG workflow that can answer questions from your private documents with citations, while respecting security boundaries. Along the way, you’ll practice the skills that typically appear in exam prompts: choosing the right services, configuring IAM, troubleshooting access, and validating correctness and safety.
Each chapter is structured like a focused technical essay: concepts first, then a milestone-driven build plan, then exam-style reasoning. The progression is intentional: you start by translating the exam blueprint into architecture decisions, then lay down OCI foundations, build retrieval, add generation, harden the system, and finally test and prepare for exam scenarios.
This course is ideal if you’re comfortable with basic Python and APIs and want a pragmatic path to the OCI Generative AI Professional certification. It’s also a strong fit for cloud engineers and app developers who need to demonstrate they can build RAG systems responsibly: secure, observable, and testable.
Exam-aligned engineering: every chapter connects hands-on artifacts to the kinds of choices exam questions test.
Production-minded RAG: you’ll focus on evaluation, guardrails, and operational readiness—areas candidates often neglect.
Reusable templates: you’ll leave with checklists for IAM, retrieval tuning, prompt patterns, and regression tests.
If you’re ready to turn certification prep into a real build, register for free and begin Chapter 1. Prefer to compare options first? You can also browse all courses on Edu AI.
By the end, you won’t just “know about” RAG on OCI—you’ll be able to explain and defend your design choices, test your system for reliability, and walk into the exam with a structured approach to scenario questions.
Senior Machine Learning Engineer, Cloud AI & Search Systems
Sofia Chen is a senior machine learning engineer specializing in retrieval-augmented generation, vector search, and cloud-native ML delivery on Oracle Cloud. She has designed secure, testable AI services for enterprise teams and mentors engineers preparing for OCI certification exams.
This course is an exam-prep guide that behaves like an engineering playbook: every objective you study should turn into something you can build, measure, and defend in a design review. The OCI Generative AI certification expects you to understand not only what Retrieval-Augmented Generation (RAG) is, but how it is operationalized on Oracle Cloud Infrastructure (OCI): identity boundaries, networking, observability, data handling, and repeatable evaluation. In this chapter you will convert the exam blueprint into a reference RAG architecture and a personal study plan. The goal is to remove “mystery points” from the exam by making each domain a concrete task: provision a compartment and policies, build a minimal pipeline, enforce safety and data controls, and prove it works with metrics.
Think of the rest of the book as a set of lab increments. If you can run the pipeline end-to-end (ingest → chunk → embed → index → retrieve → generate) and then harden it (guardrails, grounding, logging, cost control), you will naturally cover the bulk of the exam scope. Throughout, you will practice engineering judgment: where to place components, how to reduce latency, how to tune relevance, and how to avoid common security mistakes that invalidate an otherwise correct solution.
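The end-to-end loop described above can be sketched in a few dozen lines. This is a toy, assuming nothing about OCI: the bag-of-words "embedding" stands in for a real embeddings endpoint and the "generator" is just a prompt template, but the stage boundaries (ingest → chunk → embed → index → retrieve → generate) match the pipeline you will harden in later chapters.

```python
# Toy RAG skeleton: ingest -> chunk -> embed -> index -> retrieve -> generate.
# The Counter-based "embedding" is a stand-in; swap embed() for a real
# embeddings client in an actual build.
import math
from collections import Counter

def chunk(text: str, size: int = 40, overlap: int = 10) -> list[str]:
    """Split text into word windows of `size` with `overlap` words of continuity."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - overlap, 1), step)]

def embed(text: str) -> Counter:
    # Bag-of-words term counts as a throwaway vector representation.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(index, query: str, k: int = 2):
    q = embed(query)
    return sorted(index, key=lambda c: cosine(q, c["vec"]), reverse=True)[:k]

def generate(question: str, chunks) -> str:
    # In a real system this is an LLM call with grounding instructions.
    context = "\n".join(c["text"] for c in chunks)
    return f"Answer '{question}' ONLY from:\n{context}"

docs = {"policy.txt": "Compartments isolate resources. IAM policies grant least privilege access."}
index = [{"doc": d, "text": c, "vec": embed(c)}
         for d, t in docs.items() for c in chunk(t, size=6, overlap=2)]
hits = retrieve(index, "what do IAM policies grant?", k=1)
```

Running the full loop on even a one-document corpus makes the exam-relevant boundaries concrete: retrieval quality is decided before the generator ever runs.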
Practice note for every objective in this chapter (identify exam domains and translate them into build tasks; set up the project repo, environment, and OCI authentication path; design a baseline RAG architecture for Oracle Cloud; define success criteria for latency, accuracy, safety, and cost targets; create a study-and-build plan with checkpoints and mock questions): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
The certification scope typically spans three overlapping areas: (1) core OCI foundations that enable AI workloads (IAM, compartments, networking, logging/monitoring), (2) generative AI usage patterns (prompting, RAG, embeddings, vector search, evaluation), and (3) production concerns (security, cost, reliability). Your first translation step is to map each objective to a “build task.” If the blueprint mentions identity and access, your build task is not reading docs—it is creating a compartment structure, dynamic groups (if you use compute), policies for Object Storage/Logging, and verifying that the runtime can read documents and write traces without over-privilege.
Common pitfalls are predictable. Many candidates memorize service names but cannot explain data flow end-to-end: where documents land, who can read them, which subnet the app runs in, and where logs go. Another frequent mistake is mixing responsibilities: letting the model “remember” enterprise facts instead of retrieving them, or indexing raw PDFs without text normalization and chunking. Finally, exam scenarios often probe least privilege and boundary thinking; overly broad policies (“manage all-resources in tenancy”) might make a lab work, but they are a design anti-pattern you should recognize and avoid.
Practical outcome for this section: create a one-page “objective → artifact” mapping. Example artifacts include: a repo with Infrastructure-as-Code (or CLI scripts), a minimal RAG app that runs against OCI services, and a dashboard view showing request latency and retrieval quality signals. Your study time becomes measurable output, not passive reading.
RAG succeeds when you clearly separate retrieval responsibilities from generation responsibilities. Retrieval is about finding the right evidence quickly and consistently: document parsing, chunking strategy, embeddings, vector indexing, filtering, and ranking. Generation is about writing a helpful answer that is constrained by retrieved evidence: prompt structure, grounding, citations, refusal behaviors, and formatting. If you blur the line—asking the LLM to “guess” missing facts—you create hallucinations and compliance risks.
In a baseline enterprise document QA flow, the user question is embedded, the vector index returns top-k chunks, and the generator model is prompted with those chunks plus instructions. Engineering judgment shows up immediately. Chunk size affects recall vs. precision: large chunks preserve context but may dilute similarity; small chunks improve specificity but can lose meaning. Overlap helps continuity but increases index size and cost. Retrieval also benefits from metadata filters (department, document type, time range) to narrow search space before similarity scoring, which improves both speed and relevance.
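Metadata pre-filtering can be sketched as a plain predicate applied before similarity scoring. The field names (`department`, `doc_type`) are illustrative, not an OCI schema; the point is that filtering first shrinks the candidate set, so the kNN step is both faster and more on-topic.

```python
# Narrow the candidate set with metadata filters BEFORE vector scoring.
def prefilter(chunks, department=None, doc_type=None):
    out = chunks
    if department:
        out = [c for c in out if c["meta"]["department"] == department]
    if doc_type:
        out = [c for c in out if c["meta"]["doc_type"] == doc_type]
    return out

chunks = [
    {"text": "PTO policy ...",    "meta": {"department": "HR", "doc_type": "policy"}},
    {"text": "VPN setup ...",     "meta": {"department": "IT", "doc_type": "howto"}},
    {"text": "Hiring policy ...", "meta": {"department": "HR", "doc_type": "policy"}},
]
candidates = prefilter(chunks, department="HR")  # only HR chunks reach similarity scoring
```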
Common mistakes include embedding the question and searching without any normalization (case, punctuation, boilerplate), indexing duplicate content, or ignoring “near-duplicate” chunks that crowd out diverse results. Another mistake is not validating retrieval independently. You should be able to answer: “Did the right chunk appear in the top-k?” before asking “Did the model answer well?” Practical outcome: define a minimal retrieval test set (a few known Q/A pairs with expected source documents) and track hit-rate@k as a first offline metric, even before you worry about perfect generation.
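A minimal hit-rate@k harness might look like the following, assuming a `retrieve_fn` that returns ranked source-document IDs; the stub retriever here stands in for a real vector query so the metric logic is testable on its own.

```python
# Offline retrieval check: did the expected source appear in the top-k?
def hit_rate_at_k(golden, retrieve_fn, k=3):
    hits = 0
    for question, expected_doc in golden:
        if expected_doc in retrieve_fn(question, k):
            hits += 1
    return hits / len(golden)

# Stub standing in for a real vector query (always returns the same ranking).
def fake_retrieve(question, k):
    return ["hr-handbook", "it-faq"][:k]

golden = [
    ("how many PTO days?",   "hr-handbook"),
    ("how do I reset VPN?",  "it-faq"),
    ("expense limits?",      "finance-policy"),
]
score = hit_rate_at_k(golden, fake_retrieve, k=2)  # 2 of 3 expected docs found
```

Tracking this number as you vary chunk size, overlap, and top-k gives you a retrieval signal that is independent of generation quality.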
A reference OCI RAG architecture is easiest to understand as layers: data, indexing, application runtime, model services, and operations. On the data side, Object Storage is a common landing zone for documents and extracted text. A processing component (a container, function, or job) reads objects, extracts text (PDF/HTML/Office), chunks it, and writes clean chunks plus metadata back to storage or directly into your index pipeline. For indexing, you need a vector-capable store. Depending on your chosen OCI-friendly stack, this might be a managed service you operate in your tenancy or a database/service that supports vector search; the architectural principle is consistent: store embeddings, chunk text, and metadata; support kNN search and filters.
For generation and embeddings, use OCI Generative AI endpoints where appropriate, keeping model calls inside the tenancy boundary and ensuring network paths are controlled. The application runtime can be a containerized API (for example on OCI Container Instances or Kubernetes) that exposes an endpoint: /ask triggers retrieval and generation, /ingest triggers processing, and /health reports dependencies. Networking typically includes a VCN with private subnets for app components, service gateways/private endpoints for access to OCI services, and tight security lists/NSGs. IAM binds it together: policies for reading Object Storage buckets, invoking Generative AI, writing logs, and (if needed) accessing databases.
Observability is not optional. Configure Logging for application logs and audit trails, and add metrics for latency across stages: embed time, vector query time, model generation time. Practical outcome: draw a data-flow diagram with arrows labeled by protocol and identity (which principal calls what). In exams and in real life, clarity about “who calls whom, from where, with what permissions” is often the difference between a correct and an incomplete architecture.
Most RAG projects begin in a notebook, but they pass the exam—and survive production—only when you can move from exploratory code to a service with controlled configuration. A strong workflow starts with a repo layout that separates notebooks (experiments), src/ (reusable pipeline code), infra/ (OCI provisioning scripts), and tests/ (retrieval and regression checks). Keep a single configuration contract (YAML or environment variables) that can run locally and in OCI, so you avoid “works on my machine” drift.
OCI authentication should be explicit. Locally you might use an OCI config profile with an API key, but for deployed workloads you should prefer resource principals or instance principals so secrets are not copied into containers. This is an exam-relevant habit: treat credentials as short-lived and scoped. Store sensitive values in OCI Vault (or equivalent secret storage), retrieve them at runtime, and never commit them to Git. Another practical pattern is to define a thin “client factory” module that creates OCI SDK clients using either local config or resource principal automatically, so the same code path runs in both environments.
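One way to sketch the client factory is to branch on the `OCI_RESOURCE_PRINCIPAL_VERSION` environment variable, which OCI sets in resource-principal environments. The commented lines show the usual `oci` SDK entry points; verify them against your installed SDK version before relying on them.

```python
# Decide the auth path once, in one module, so local runs and deployed runs
# share the same code. The OCI SDK calls are shown as comments; this sketch
# only demonstrates the selection logic.
import os

def auth_mode(env=os.environ) -> str:
    if "OCI_RESOURCE_PRINCIPAL_VERSION" in env:
        # Deployed workload: no secrets in the image.
        # signer = oci.auth.signers.get_resource_principals_signer()
        # client = oci.object_storage.ObjectStorageClient(config={}, signer=signer)
        return "resource_principal"
    # Local development: API key from the OCI config file.
    # config = oci.config.from_file(profile_name=env.get("OCI_CLI_PROFILE", "DEFAULT"))
    # client = oci.object_storage.ObjectStorageClient(config)
    return "local_config"

mode = auth_mode({"OCI_RESOURCE_PRINCIPAL_VERSION": "2.2"})
```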
Common mistakes include hardcoding compartment OCIDs, mixing dev and prod buckets, or logging full prompts and retrieved content without redaction. Practical outcome: implement a minimal “auth path” checklist in your repo: how to run locally, how to run in OCI, where secrets live, and how to rotate them. This reduces setup friction and directly supports exam scenarios around secure operations.
RAG architecture is as much about boundaries as it is about models. Your design must clearly state what data is allowed to leave a compartment, a VCN, or the tenancy—and what must never leave. Enterprise documents, user queries that contain sensitive identifiers, and retrieved passages often fall into “must remain inside tenancy” by default. That requirement influences service selection, network routing, logging strategy, and even prompt templates.
Start by classifying data: public, internal, confidential, regulated. Then enforce boundaries with technical controls. Use private networking patterns where possible (private endpoints/service gateways) so calls to OCI services do not traverse the public internet. Apply IAM policies that allow only the app runtime to read the source buckets and only the ingestion pipeline to write processed chunks. For logging, default to metadata logging (timings, doc IDs, similarity scores) and carefully gate any content logging behind explicit debug flags with redaction. For prompts, apply guardrails: instructions to cite sources, refuse unsupported claims, and avoid revealing system prompts or sensitive content not present in retrieved context.
A common mistake is accidental exfiltration through observability: writing full retrieved chunks into logs for troubleshooting. Another is uncontrolled caching in client-side apps. Practical outcome: write a “data handling contract” for your app: what gets stored (chunks, embeddings, metadata), retention periods, who can access it, and how you prevent cross-tenant or cross-compartment leakage. Even when not explicitly asked, this mindset aligns with exam expectations for secure, compliant OCI deployments.
To prepare efficiently, you need a rubric that connects objectives to labs and to evidence that you learned the skill. Build a simple table with three columns: exam objective, lab checkpoint, and verification artifact. For example: “Provision IAM and networking for AI workload” maps to a checkpoint where your RAG service runs in a private subnet with least-privilege policies; the artifact is a policy snippet, a network diagram, and a successful run that writes logs to OCI Logging. “Implement vector search and relevance tuning” maps to a checkpoint where you can adjust chunk size, overlap, top-k, and metadata filters, and show retrieval hit-rate improvements on a small golden set.
Your review loop should be iterative: implement, measure, adjust, and document. Define success criteria early so you can tell when you are done. At minimum, set targets for latency (p95 end-to-end and per-stage), accuracy (retrieval hit-rate@k and a qualitative answer review), safety (refusal/grounding behavior, no sensitive leakage in logs), and cost (embedding/index size and model call frequency). Avoid the trap of optimizing only the model response; many RAG failures are retrieval failures, and many cost blowups come from over-chunking and overly large top-k contexts.
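Per-stage p95 is easy to compute offline from raw timing samples. This sketch uses the nearest-rank percentile method, and the sample numbers are made up for illustration.

```python
# Nearest-rank p95 over per-stage latency samples (milliseconds).
import math

def p95(samples_ms: list) -> float:
    ordered = sorted(samples_ms)
    rank = math.ceil(0.95 * len(ordered))  # nearest-rank: ceil(p * n)
    return ordered[rank - 1]

stages = {
    "embed":    [12, 15, 11, 14, 90],       # one slow outlier dominates p95
    "vector":   [8, 9, 7, 10, 8],
    "generate": [420, 390, 610, 400, 450],
}
report = {stage: p95(times) for stage, times in stages.items()}
```

Breaking latency down per stage is what lets you see, for instance, that a p95 regression came from the vector query rather than the model call.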
Practical outcome: a personal study-and-build plan with dates and checkpoints. Each checkpoint ends with a short written note: what changed, what metric moved, what risk remains. This turns exam prep into a portfolio of verified competencies and makes later chapters faster, because you will already have the reference architecture, repo, and evaluation habits in place.
1. What is the chapter’s core approach to preparing for the OCI Generative AI certification?
2. Which set of concerns does the chapter emphasize as necessary to operationalize RAG on OCI (beyond knowing what RAG is)?
3. Which option best represents the minimal end-to-end RAG pipeline described in the chapter?
4. According to the chapter, what is the purpose of defining success criteria such as latency, accuracy, safety, and cost targets?
5. Which action best reflects the chapter’s method for turning an exam domain into a concrete task?
Before you build a Retrieval-Augmented Generation (RAG) pipeline on Oracle Cloud Infrastructure (OCI), you need foundations that won’t collapse under real enterprise constraints: least-privilege access, predictable networking, measurable observability, and safe configuration handling. Many “it worked on my laptop” AI prototypes fail not because the model is wrong, but because credentials leak into logs, services can’t reach endpoints privately, or nobody can explain why latency spiked during a release.
This chapter sets up the OCI primitives you’ll reuse throughout the course: compartments and tags for resource hygiene, IAM policies and dynamic groups for secure automation, VCN patterns for private access, logging/monitoring/audit baselines for production traceability, and a secrets strategy that keeps endpoints and keys out of source control. You will end by validating connectivity with a small “hello LLM” call—your first proof that identity, network, and configuration are aligned.
Think of these foundations as the scaffolding for every exam objective that follows: ingestion pipelines, vector search, prompt safety, and evaluation all depend on controlled access and reliable telemetry.
Practice note for every objective in this chapter (create compartments, policies, and dynamic groups for least privilege; stand up networking basics and private access patterns; configure logging, monitoring, and audit for AI services; establish a secure secrets strategy for keys and endpoints; validate connectivity with a small “hello LLM” call): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Start by treating your OCI tenancy like a multi-environment software product. The simplest robust pattern is a top-level compartment for the course or program (for example, genai-rag), with child compartments for dev, test, and prod. This allows you to scope IAM policies cleanly and to quarantine experimental work without risking production data or quotas. A frequent mistake is placing everything in the root compartment “temporarily”—and then never moving it. That breaks audit clarity and makes least privilege harder because policies become too broad.
Use tags to enforce hygiene and cost attribution. Define a small tag namespace (such as CostCenter, Environment, Owner, DataSensitivity) and require them for major resources: VCNs, subnets, compute, databases, object storage buckets, logging, and API gateways. If you do nothing else, tag Environment and Owner; those two tags dramatically reduce the “what is this and who pays?” problem during incidents.
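A pre-provisioning tag check can be as small as a set difference. The tag names mirror the namespace suggested above and are otherwise illustrative.

```python
# Fail fast if a resource is missing required governance tags.
REQUIRED_TAGS = {"Environment", "Owner"}

def missing_tags(resource: dict) -> set:
    """Return the required tags absent from the resource definition."""
    return REQUIRED_TAGS - set(resource.get("tags", {}))

bucket = {"name": "rag-dev-docs", "tags": {"Environment": "dev"}}
gaps = missing_tags(bucket)  # 'Owner' is missing
```

Wiring a check like this into your provisioning scripts turns tag hygiene from a convention into an enforced gate.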
Adopt a predictable naming convention that encodes project, environment, and resource type (for example, rag-dev-vcn, rag-prod-logs). Consistency beats cleverness. Practical outcome: by the time you build indexing and retrieval, you’ll be able to isolate test datasets, rotate credentials safely, and produce an audit-friendly record of where enterprise documents live and who accessed them.
Generative AI workloads involve multiple identities: humans (developers, operators), services (functions, compute instances, container workloads), and sometimes external callers (CI/CD systems). OCI IAM gives you the vocabulary to express least privilege: groups for humans, dynamic groups for resources, and policies that grant permissions in specific compartments.
For a RAG app, avoid “one policy to rule them all.” Instead, separate roles: a RAG-Developer group that can manage dev resources; a RAG-Operator group that can read logs/metrics and restart services; and a RAG-ReadOnly group for auditors or reviewers. Then use a dynamic group for the runtime (for example, an instance, OKE worker nodes, or Functions) so the app can read secrets, write logs, and call AI services without embedding user credentials.
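As a hedged sketch, the role separation above might translate into OCI policy statements like the following; the group, dynamic-group, and compartment names are hypothetical, and you should verify resource-type names (such as generative-ai-family and secret-family) against current OCI documentation.

```text
Allow group RAG-Developer to manage all-resources in compartment genai-rag:dev

Allow group RAG-Operator to read log-groups in compartment genai-rag:prod
Allow group RAG-Operator to read metrics in compartment genai-rag:prod

Allow dynamic-group rag-runtime to read objects in compartment genai-rag:prod
Allow dynamic-group rag-runtime to use generative-ai-family in compartment genai-rag:prod
Allow dynamic-group rag-runtime to read secret-family in compartment genai-rag:prod
```

Note the shape of each statement (who, verb, resource type, where): exam scenarios frequently hinge on choosing the narrowest verb and the right compartment scope.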
Engineering judgment: start strict and open only what you must. When something fails, don’t immediately widen a policy. First identify whether the failure is IAM, network, or endpoint configuration. In OCI, many “permission denied” errors are actually missing compartment scoping or the wrong principal (user API key vs. resource principal). Getting this right early makes later guardrails (grounding, citations, data protection) enforceable because you’ll know which identity accessed which dataset and when.
RAG apps are networked systems: they ingest documents from storage, call embedding and LLM endpoints, query a vector store, and emit logs and metrics. Your default goal is controlled egress and predictable paths. Create a VCN with at least two subnets: a private subnet for app runtimes (compute/OKE/functions), and a public subnet only if you need an internet-facing load balancer or bastion-style access. Keep databases, caches, and internal services private.
Decide how the app reaches OCI services. The most common pattern is private workloads with controlled outbound access via a NAT Gateway for internet egress, plus a Service Gateway for private access to OCI services that support it (so traffic stays on Oracle’s network). If you need inbound administration, use a Bastion service or a locked-down jump host rather than opening SSH broadly.
Practical outcome: when you later implement vector search and relevance tuning, you can trust that latency measurements reflect your system design—not accidental internet routing. And when you introduce enterprise document sources, private access patterns reduce exposure and simplify compliance arguments.
Generative AI systems require more observability than typical CRUD apps because failures often appear as “bad answers” rather than explicit errors. Establish an observability baseline now: Logging for application and service logs, Monitoring for metrics, Alarms for actionable thresholds, and Audit for governance-grade records of API calls.
In practice, create log groups per environment (dev/test/prod) and decide what your RAG app must log. At minimum: request ID, document IDs retrieved, embedding model/version, top-k scores, and the final prompt token count. Avoid logging raw prompts or retrieved text by default if they may contain sensitive data—log hashes, IDs, and short summaries instead. A common mistake is dumping full documents into logs “for debugging,” which becomes a data leak and a retention liability.
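One way to keep runs correlatable without storing content is to fingerprint retrieved chunks instead of logging them; the field names in this sketch are illustrative.

```python
# Structured log record with content fingerprints instead of raw chunk text.
import hashlib
import json

def log_record(request_id, doc_ids, scores, prompt_tokens, retrieved_texts):
    return json.dumps({
        "request_id": request_id,
        "doc_ids": doc_ids,
        "topk_scores": scores,
        "prompt_tokens": prompt_tokens,
        # Stable short fingerprints: enough to correlate runs, no leaked text.
        "chunk_sha256": [hashlib.sha256(t.encode()).hexdigest()[:12]
                         for t in retrieved_texts],
    })

rec = log_record("req-42", ["doc-7"], [0.83], 512, ["confidential passage ..."])
```

If a user later reports a bad answer, the fingerprints let you confirm whether retrieval returned the same chunks, without the log itself becoming sensitive data.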
Engineering judgment: build “debuggability” into the app contract. If a user reports hallucination, you should be able to reconstruct what context was retrieved, what model was called, and what configuration was active, without exposing the underlying sensitive text. This discipline directly supports later evaluation and regression testing because you can correlate answer quality shifts with configuration changes.
RAG apps touch many sensitive values: API keys (if you must use them), private endpoints, database passwords, signing keys, and occasionally third-party credentials for content ingestion. The rule is simple: no secrets in source control, no secrets in container images, and no secrets in plain environment files committed to repositories. Use OCI Vault for secrets and encryption keys, and grant access via IAM policies to the workload’s dynamic group.
Separate configuration into (1) non-secret runtime config (region, compartment OCIDs, model names, index names), and (2) secrets (tokens, passwords, private keys). Non-secret config can live in environment variables or config files delivered by your deployment pipeline. Secrets should be fetched at runtime from Vault, ideally cached in memory with a short TTL and refreshed safely. If you need to rotate secrets, rotation should not require a code change—only a Vault update.
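The runtime-fetch-with-TTL pattern can be sketched with an injected `fetch` function standing in for an OCI Vault secrets lookup; because the fetcher is pluggable, rotating a secret requires only a Vault update, not a code change.

```python
# In-memory secret cache with a short TTL. `fetch` is injected so the same
# class works with a real Vault client or a test stub.
import time

class SecretCache:
    def __init__(self, fetch, ttl_seconds=300, clock=time.monotonic):
        self._fetch = fetch          # e.g. a Vault secret-bundle lookup
        self._ttl = ttl_seconds
        self._clock = clock
        self._store = {}             # name -> (value, fetched_at)

    def get(self, name: str) -> str:
        hit = self._store.get(name)
        if hit and self._clock() - hit[1] < self._ttl:
            return hit[0]            # fresh enough: serve from memory
        value = self._fetch(name)    # expired or missing: refetch
        self._store[name] = (value, self._clock())
        return value

calls = []
def fake_fetch(name):
    calls.append(name)
    return f"value-of-{name}"

cache = SecretCache(fake_fetch, ttl_seconds=300)
first = cache.get("db-password")
second = cache.get("db-password")  # served from cache; no second fetch
```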
To validate everything end-to-end, perform a small “hello LLM” call from the same runtime environment your RAG service will use (not from your laptop). This confirms the principal, network routes, DNS resolution, and secret retrieval are correct. Keep the prompt harmless and generic, and record only minimal metadata (status, latency, token counts) in logs.
Generative AI is cost-sensitive by design: token-based billing, embedding at ingestion time, vector storage growth, and bursty query traffic. Build cost and quota awareness into your foundation so you don’t discover limits during a demo or an exam lab. Start by reviewing service limits and quotas relevant to your architecture: model endpoints, request rates, object storage capacity, networking limits, and any vector database constraints. Treat quotas as design inputs, not afterthoughts.
Establish budgets per compartment (dev/test/prod) and use tagging to attribute spend. In dev, set tighter budgets and alert thresholds. Also adopt safe defaults in code and deployment: cap max tokens, limit top-k retrieval, and enforce request timeouts. A common mistake is leaving “max tokens” unconstrained during testing; one runaway prompt or loop can turn into a surprise bill and noisy logs.
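Safe defaults are easiest to enforce with a single clamping function at the request boundary; the caps below are illustrative, not recommendations.

```python
# Clamp every request against hard caps so a misconfigured caller cannot
# blow past the token, retrieval, or timeout budget.
from dataclasses import dataclass

MAX_TOKENS_CAP = 1024
TOP_K_CAP = 10
TIMEOUT_CAP_S = 30.0

@dataclass(frozen=True)
class RequestLimits:
    max_tokens: int
    top_k: int
    timeout_s: float

def clamp_limits(max_tokens: int, top_k: int, timeout_s: float) -> RequestLimits:
    return RequestLimits(
        max_tokens=min(max_tokens, MAX_TOKENS_CAP),
        top_k=min(top_k, TOP_K_CAP),
        timeout_s=min(timeout_s, TIMEOUT_CAP_S),
    )

limits = clamp_limits(max_tokens=50_000, top_k=100, timeout_s=600)
```

Putting the caps in code (rather than only in docs) means a runaway test loop degrades gracefully instead of producing a surprise bill.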
Practical outcome: your later RAG evaluation and regression suites can run predictably without accidentally exhausting limits. Cost awareness also improves engineering judgment: you’ll know when to optimize chunking and retrieval (cheaper) versus calling larger models more often (expensive). With compartments, IAM, networking, observability, and secrets in place, you’re ready to build the first real pipeline components confidently.
1. Which foundation best addresses the risk of AI prototypes failing due to over-permissioned access and insecure automation?
2. What problem is the chapter’s networking guidance primarily trying to prevent for generative AI workloads on OCI?
3. Why does Chapter 2 stress configuring logging, monitoring, and audit before building the RAG pipeline?
4. Which practice aligns with the chapter’s goal of keeping endpoints and keys out of source control and logs?
5. What is the purpose of ending the chapter with a small “hello LLM” call?
A Retrieval-Augmented Generation (RAG) app is only as trustworthy as its corpus pipeline. In exam terms, this chapter maps directly to objectives around building an end-to-end RAG workflow (ingest → chunk → embed → index → retrieve), plus the operational controls that make it repeatable and auditable on Oracle Cloud Infrastructure (OCI). In engineering terms, you are building a “document supply chain”: raw files arrive, are normalized, broken into retrieval-sized units, enriched with metadata, embedded, indexed, and then continuously maintained.
The most common failure mode is treating ingestion and indexing as a one-time script. Real enterprise corpora change daily: policies are revised, PDFs are re-exported, Confluence pages move, and access rights change. Your ingestion layer therefore needs clear patterns for batch and incremental processing, plus an approach to change data capture (CDC) so the index reflects the source of truth. Next, your chunking strategy determines retrieval quality and cost: large chunks often reduce the number of embeddings but can bury the answer; tiny chunks improve pinpointing but increase index size and retrieval noise. Metadata is the third leg of the stool—without it you cannot filter by access control, freshness, or source provenance, and you cannot produce citations with confidence.
Once chunking and metadata are stable, you can generate embeddings and construct a vector index. Here you make model choices (dimension, multilingual support, domain fit), operational choices (caching, normalization), and platform choices (where the index lives and how it is queried). Finally, you run retrieval experiments—vary top-k, apply filters, compare chunk sizes—so you can tune relevance before you ever connect a large language model (LLM) to generation. The practical outcome is a packaged indexing job you can run repeatedly (e.g., daily), producing predictable artifacts and logs suitable for regression testing.
This chapter focuses on the ingestion + vector index layer. In later chapters, you will connect this to the generation step, grounding, citations, and evaluation. For now, think like a systems builder: design for repeatability, traceability, and controlled change.
Practice note for Ingest documents and normalize formats for retrieval readiness: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Implement chunking strategies and metadata enrichment: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Generate embeddings and build a vector index: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Run retrieval experiments and tune top-k + filtering: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Package the indexing job for repeatable runs: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Ingestion is the controlled movement of documents from a source system into a retrieval-ready store. Start by classifying sources: object storage buckets (PDF/DOCX), wikis (HTML), ticketing systems, shared drives, and databases. Each class has different update behavior, which drives your ingestion pattern.
Batch ingestion is simplest: on a schedule, scan a location and reprocess everything. It’s acceptable for small corpora or early prototypes, but it becomes costly as you scale because you repeatedly parse and embed unchanged content. A common mistake is using batch forever and then being surprised by long indexing windows and exploding embedding costs.
Incremental ingestion processes only new or modified items since the last run. Practically, this means maintaining a watermark: a last-seen timestamp, an object ETag, or a content hash. For OCI Object Storage, you can use object metadata (last modified) plus ETag/versioning where available. Incremental runs should be idempotent: re-running the job should not create duplicates or inconsistent states.
CDC (Change Data Capture) concepts apply when your source is a system of record (database, content platform) that emits changes. The key idea is to ingest “events” (create/update/delete) rather than periodically scanning. Even if you cannot implement true CDC, you can emulate it by computing a stable content fingerprint per document and comparing it on each run. Do not ignore deletes—stale chunks remaining in the index are a major cause of hallucinated answers because the retriever keeps surfacing content that no longer exists.
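The fingerprint-and-compare approach above can be sketched in a few lines. This is a minimal illustration, not a production connector: `plan_sync` and its inputs are hypothetical names, and a real job would persist the indexed fingerprints (e.g., in an Object Storage manifest) between runs.

```python
import hashlib

def fingerprint(text: str) -> str:
    """Stable content hash used as a change-detection watermark."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def plan_sync(source: dict[str, str], indexed: dict[str, str]) -> dict[str, list[str]]:
    """Compare source docs (id -> text) against previously indexed fingerprints
    (id -> hash). Returns ids to (re)embed and ids to tombstone. Re-running with
    the same inputs yields the same plan, which keeps the job idempotent."""
    upserts = [doc_id for doc_id, text in source.items()
               if indexed.get(doc_id) != fingerprint(text)]
    deletes = [doc_id for doc_id in indexed if doc_id not in source]
    return {"upsert": sorted(upserts), "delete": sorted(deletes)}
```

Note that deletes are first-class output: documents that vanished from the source become tombstones rather than stale chunks lingering in the index.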
Normalization belongs in ingestion: convert documents to a canonical text form (e.g., UTF-8), preserve headings and lists, extract tables carefully, and record extraction warnings. In OCI workflows, write the normalized text and a structured manifest (document_id, source_uri, extracted_at, hash) to Object Storage so downstream chunking and embedding can be rerun deterministically.
Chunking turns normalized documents into retrieval units. The goal is to maximize the chance that the retriever returns a chunk containing the answer while minimizing noise and cost. There is no universal chunk size; you choose based on document style, question types, and token budgets.
A practical baseline is semantic-ish chunking by structure: split by headings, then by paragraphs, and only fall back to fixed token windows when sections are extremely long. Fixed-size chunking (e.g., 400–800 tokens) is easy to implement, but it can cut definitions in half or separate a procedure from its constraints. Structure-aware chunking tends to improve relevance because boundaries align with meaning.
Overlap (e.g., 10–20% of chunk length) reduces boundary loss: if the answer spans two chunks, overlap increases the odds that at least one chunk contains the full context. The tradeoff is straightforward: overlap increases chunk count, embedding calls, index size, and sometimes near-duplicate retrieval results. A common mistake is applying large overlap everywhere; instead, use more overlap for narrative text and less for bullet-heavy specs where boundaries are already concise.
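The fixed-window fallback with overlap can be sketched as follows. This is a simplified illustration over pre-tokenized input; real pipelines would split by headings and paragraphs first and only fall back to this window when a section is too long.

```python
def chunk_tokens(tokens: list[str], size: int = 600, overlap: int = 90) -> list[list[str]]:
    """Fixed-size windows with overlap (here ~15% of the chunk size).

    Each window starts (size - overlap) tokens after the previous one, so
    adjacent chunks share `overlap` tokens at their boundary.
    """
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, max(len(tokens) - overlap, 1), step)]
```

Tuning `size` and `overlap` per corpus (more overlap for narrative text, less for bullet-heavy specs) follows directly from the tradeoff described above.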
Engineering judgment: tune chunk size to your retrieval objective. If questions are “Where do I click?” or “What is the definition of X?”, smaller chunks often win. If questions require multi-step context (policy + exceptions), larger chunks may be necessary. Measure rather than guess: build a small golden set of queries and compare recall@k as you vary chunk size and overlap.
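The recall@k comparison described above needs only a tiny harness. A minimal sketch, assuming a golden set that maps each query to the chunk IDs known to contain the answer:

```python
def recall_at_k(golden: dict[str, set[str]], retrieved: dict[str, list[str]], k: int = 5) -> float:
    """Fraction of golden queries for which at least one relevant chunk
    appears in the top-k retrieved results."""
    if not golden:
        return 0.0
    hits = sum(1 for query, relevant in golden.items()
               if relevant & set(retrieved.get(query, [])[:k]))
    return hits / len(golden)
```

Run this for each (chunk size, overlap) configuration against the same golden set, and let the numbers, not intuition, pick the winner.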
Finally, keep chunk identifiers stable. A good pattern is doc_id + section_path + chunk_index. Stable IDs make reindexing and deduplication possible, and they allow you to attach citations that are consistent across runs.
Metadata is what turns vector search from “similar text” into “correct text for this user, right now, from the right place.” You should design metadata deliberately before you embed anything, because changing metadata later can force large reindex operations.
Start with source metadata: source_uri (where it came from), source_type (pdf/wiki/db), title, author/owner if known, and a human-friendly citation label. This directly supports grounded answers with citations and enables debugging when retrieval returns surprising chunks.
Next is access control metadata. For enterprise QA, you must prevent retrieval of content the caller should not see. Add fields like access_group_ids, classification (public/internal/confidential), and optionally tenancy_id or business unit. The retrieval query should always apply filters based on the caller’s entitlements. A common mistake is applying access checks only at the UI or generation layer; by then you have already leaked content into the prompt or logs.
Freshness metadata lets you prefer the newest policy or exclude obsolete material. Store source_last_modified, ingested_at, and a computed content_version (hash). This supports ranking tweaks (boost recent) and enables you to find stale chunks during audits.
Lineage is how you make the pipeline testable: record the extractor version, chunker version, embedding model name/version, and indexing job run_id. When retrieval quality changes after a deployment, lineage allows you to correlate the change with pipeline modifications rather than guessing. In OCI, write the lineage fields into each indexed record and mirror them in your object-store manifests so you can reproduce an index build exactly.
Embeddings convert each chunk into a numeric vector that captures semantic similarity. Your embedding choices shape both retrieval quality and operational cost.
Model choice: prefer a model that matches your language and domain. If you have multilingual content, validate cross-language retrieval explicitly (query in English, answer in Japanese, etc.). If your documents contain many product codes or acronyms, test whether the model preserves those distinctions. In OCI Generative AI contexts, align with supported embedding models and note their intended use cases; exam scenarios often emphasize selecting appropriate models for retrieval versus generation.
Dimension affects index size and sometimes accuracy. Higher dimensions can represent nuance but cost more in storage and compute. Don’t assume “bigger is better”; run retrieval experiments with a fixed golden set and compare recall/precision at k. Also plan for model upgrades: store the embedding model identifier in metadata so you can run parallel indexes during migration.
Normalization: many vector databases and libraries assume cosine similarity; for cosine, L2-normalizing vectors is common. If your index uses inner product or Euclidean distance, ensure your embedding output and index metric match. A subtle but frequent bug is mixing normalized vectors with a distance metric that expects raw magnitudes, leading to degraded relevance that is hard to diagnose.
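The relationship between normalization and metric choice is easy to verify in code. A small sketch: after L2 normalization, the inner product of two vectors equals their cosine similarity, which is exactly the invariant your index metric must agree with.

```python
import math

def l2_normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length (L2 norm = 1)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]

def cosine_via_dot(a: list[float], b: list[float]) -> float:
    """For unit vectors, inner product and cosine similarity coincide."""
    ua, ub = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(ua, ub))
```

If your index is configured for Euclidean distance but you feed it normalized vectors (or vice versa), rankings shift subtly, which is the hard-to-diagnose relevance bug the paragraph warns about.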
Caching and batching: embeddings are often the most expensive step. Cache by (chunk_text_hash, model_id) so reruns don’t re-embed unchanged chunks. Batch embedding calls to reduce overhead, but keep batches small enough to handle retries cleanly. Persist intermediate artifacts (chunk JSONL + embeddings) to Object Storage so you can rerun indexing without recomputing embeddings when only the index backend changes.
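The cache-by-(content hash, model id) idea can be sketched as a thin wrapper around any embedding call. This is an in-memory illustration with a hypothetical `embed_fn`; a production version would persist the cache (e.g., alongside the chunk JSONL in Object Storage).

```python
import hashlib

class EmbeddingCache:
    """Cache keyed by (chunk_text_hash, model_id) so reruns skip unchanged
    chunks, and a model upgrade naturally invalidates every entry."""

    def __init__(self, embed_fn, model_id: str):
        self._embed_fn = embed_fn
        self._model_id = model_id
        self._store: dict[tuple[str, str], list[float]] = {}
        self.misses = 0  # number of real embedding calls made

    def get(self, text: str) -> list[float]:
        key = (hashlib.sha256(text.encode("utf-8")).hexdigest(), self._model_id)
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._embed_fn(text)
        return self._store[key]
```

Because the model identifier is part of the key, running two indexes in parallel during a model migration never mixes vectors from different models.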
On Oracle Cloud, you have multiple ways to host and query a vector index depending on scale, latency requirements, and the systems you already operate. The key is to separate concerns: (1) where vectors live, (2) how you filter by metadata, and (3) how you run similarity search efficiently.
A common enterprise pattern is to use an Oracle-managed datastore that supports vector search alongside rich filtering. In those setups, you model each chunk as a row/document with columns/fields for metadata plus a vector column/field for embeddings. This gives you two critical capabilities: hybrid querying (vector similarity + metadata predicates) and operational governance (backups, access policies, auditing). Another viable pattern is to keep vectors in a dedicated vector engine while storing metadata in a relational store; this can scale well but makes filtering and consistency harder because you must join results across systems.
Query patterns to practice for RAG: (1) top-k similarity for a user query embedding; (2) filtered top-k by access labels and source type; (3) time-bounded retrieval for “latest policy”; and (4) hybrid lexical + vector when exact terms matter (error codes, part numbers). If your platform supports it, hybrid search often improves precision for technical corpora because it blends semantic similarity with keyword constraints.
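Pattern (2), filtered top-k, is worth internalizing because it is where access control meets relevance. A minimal in-memory sketch (field names like `access_group_ids` are illustrative; a real deployment pushes these predicates into the vector store's query language):

```python
def filtered_top_k(query_vec, records, k=5, caller_groups=None, source_type=None):
    """Apply metadata predicates BEFORE ranking, then take the top-k by
    similarity. Filtering first means unauthorized chunks never compete."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    candidates = [
        r for r in records
        if (caller_groups is None or set(r["access_group_ids"]) & caller_groups)
        and (source_type is None or r["source_type"] == source_type)
    ]
    ranked = sorted(candidates, key=lambda r: dot(query_vec, r["vector"]), reverse=True)
    return ranked[:k]
```

The same shape extends to freshness (compare `source_last_modified` against a cutoff) and to hybrid queries, where a keyword score joins the ranking key.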
Tuning top-k is not “set it to 20 and forget it.” Larger k increases recall but also increases prompt length and the chance of injecting irrelevant context into generation. Start with k=5–10, then evaluate. If you frequently miss answers, increase k or improve chunking; if answers are verbose or off-topic, reduce k and add filters. Always log retrieval results (doc_id, score, metadata) so you can audit why a given answer was produced.
Once your index exists, the real work begins: keeping it correct as the corpus evolves. Plan for reindexing as a first-class operation, not an emergency procedure.
Reindexing triggers include: new embedding model, changed chunking rules, metadata schema updates, or major source format changes (e.g., a wiki export layout). The safe approach is to build a new index in parallel, run retrieval regression tests against a golden set, then switch traffic. This avoids partial migrations where half the corpus is embedded differently.
Deduplication prevents repeated content from dominating retrieval. Duplicates come from mirrored sources, re-exported PDFs, and overlapping chunk strategies. Implement a two-layer defense: document-level dedupe using a normalized document hash, and chunk-level dedupe using a chunk_text_hash. When duplicates are legitimate (e.g., the same policy appears in two portals), preserve both but tag them with a canonical_source field so you can prefer the authoritative location.
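The chunk-level half of that defense is a normalized hash plus a seen-set. A minimal sketch (document-level dedupe works the same way over whole-document hashes):

```python
import hashlib

def _normalized_hash(text: str) -> str:
    """Hash after collapsing whitespace and case, so trivially re-exported
    copies of the same text collide."""
    canonical = " ".join(text.split()).lower()
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def dedupe_chunks(chunks: list[dict]) -> list[dict]:
    """Keep the first occurrence of each distinct chunk text; drop the rest."""
    seen: set[str] = set()
    kept = []
    for chunk in chunks:
        key = _normalized_hash(chunk["text"])
        if key not in seen:
            seen.add(key)
            kept.append(chunk)
    return kept
```

When duplicates are legitimate, skip the drop and instead record the hash collision so a `canonical_source` tag can prefer the authoritative copy at retrieval time.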
Drift is the gradual mismatch between the index and reality: access rights change, documents are deleted, or “latest” guidance moves. Drift shows up as users seeing outdated answers or citations pointing to missing pages. Combat drift with scheduled incremental runs, delete propagation (tombstones), and freshness checks that alert when high-traffic documents haven’t been re-ingested recently.
Finally, package the indexing job for repeatable runs: configuration-driven source connectors, deterministic manifests, structured logs, and a run_id that ties together ingestion, chunking, embedding, and indexing. In OCI, this packaging supports operational handoff: you can run the job in a CI/CD pipeline or scheduled automation, capture logs centrally, and treat the vector index as a governed artifact rather than an ad-hoc cache.
1. Why does Chapter 3 emphasize designing ingestion and indexing as a repeatable pipeline rather than a one-time script?
2. What is the main trade-off described when choosing chunk sizes for retrieval?
3. According to the chapter, why is metadata considered essential in the ingestion + indexing layer?
4. What is the purpose of running retrieval experiments before connecting an LLM to generation?
5. Which outcome best matches the chapter’s definition of a well-built ingestion + vector index layer on OCI?
In Chapters 1–3 you built the ingredients of Retrieval-Augmented Generation (RAG): data ingestion, chunking, embeddings, and a vector index. This chapter turns those ingredients into an application that behaves like an enterprise QA assistant: it retrieves the right evidence, uses the model to answer while staying grounded, and exposes reliable APIs for chat and document question answering.
A production RAG app is not “vector search + a prompt.” It is an orchestration of retrieval decisions (filters, query rewriting, hybrid search), relevance tuning (reranking hooks), generation controls (grounding, citations, refusals), and operational guardrails (timeouts, rate limits, logging). You will also add conversation memory in a way that helps the user without leaking sensitive context across sessions.
Throughout this chapter, keep one engineering principle in mind: every step should either improve relevance, improve safety, or improve reliability. If a step cannot be measured or observed (via logs/metrics), it will be hard to debug later. By the end, you will be able to run an end-to-end demo on a representative dataset and explain how each component maps to OCI Generative AI exam objectives and real-world implementation choices.
Practice note for Implement retrieval with filters, reranking hooks, and citations: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Build prompting templates for grounded answers and tool-style behavior: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Create an API layer for chat + QA workflows: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Add conversation memory without leaking sensitive context: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Run end-to-end demos on a representative dataset: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Retrieval quality often fails before vector search even runs: the user’s question may be ambiguous, missing nouns, or overloaded with chatty context. Query rewriting is a lightweight step that converts the raw message into a search-ready query that matches your document language. Pair it with intent detection so you can choose the right retrieval strategy (QA vs. policy lookup vs. troubleshooting vs. “how-to”).
Practical workflow: first, classify intent and constraints. For example, detect whether the user is asking for (a) a factual answer with citations, (b) a procedural checklist, (c) a comparison, or (d) an action request outside scope. Then rewrite the query into 1–3 concise search queries, preserving key entities (product names, dates, region, service). In an OCI setting, you might also map to metadata filters (compartment, doc type, business unit, effective date) to reduce noise and improve security boundaries.
Common mistakes include rewriting that drops critical nouns (“it,” “this policy”) or expanding the query with hallucinated details. To avoid that, enforce a rule: rewritten queries must only use entities present in the conversation or in a controlled glossary. Track this in logs: raw query, rewritten query, detected intent, and any extracted filters. These artifacts become your debugging trail when retrieval seems “random.”
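The "entities must come from the conversation or a controlled glossary" rule can be enforced mechanically. A coarse sketch using token membership as a proxy (a real check would use entity extraction rather than whitespace tokens):

```python
def rewrite_is_grounded(rewritten: str, conversation: str, glossary: set[str]) -> bool:
    """Reject rewritten queries containing tokens that appear neither in the
    conversation so far nor in a controlled glossary. This is a cheap guard
    against the rewriter inventing entities (dates, regions, versions)."""
    allowed = set(conversation.lower().split()) | {g.lower() for g in glossary}
    return all(token in allowed for token in rewritten.lower().split())
```

Run this check before retrieval; a failed check is itself worth logging alongside the raw and rewritten queries.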
Vector search is strong at semantic similarity but can miss exact matches (error codes, part numbers, legal terms). Lexical search (BM25-style keyword matching) is strong at exactness but weak on paraphrase. Hybrid search combines both, typically by retrieving candidates from both systems and merging them, then reranking the merged set.
A practical pattern is a two-stage retriever: Stage 1 retrieves a broad candidate set (e.g., top 50) using vector similarity, lexical match, or both. Stage 2 reranks those candidates using a cross-encoder reranker or a model-based scoring hook. Even if you do not yet deploy a dedicated reranker model, design the hook now: it is an interface that takes (query, chunk) pairs and returns a relevance score. Your OCI implementation can start with a weighted blend of scores (vector score + keyword score + metadata boosts) and later upgrade to a learned reranker.
Citations depend on stable chunk identifiers. If reranking changes the top-k, your citation list should follow the reranked order, and you should keep enough text for quoting (e.g., 1–2 sentences) without exceeding prompt limits. A common mistake is retrieving too few candidates (top 5) and then attempting reranking; reranking cannot rescue missing evidence. Another mistake is using hybrid retrieval but forgetting to normalize scores, causing one subsystem to dominate and degrade results. Build a small tuning harness: run a golden set of questions, track Recall@K, and compare configurations (vector-only vs lexical-only vs hybrid + rerank).
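The score-normalization pitfall above is concrete: raw vector similarities might live in [0, 1] while lexical scores can be arbitrary BM25 magnitudes, so an unnormalized blend lets one subsystem dominate. A minimal sketch of a weighted blend with min-max normalization, which can later be swapped for a learned reranker behind the same hook:

```python
def min_max(scores: list[float]) -> list[float]:
    """Rescale scores into [0, 1]; constant lists map to a neutral 0.5."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.5] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]

def blend(vector_scores: list[float], keyword_scores: list[float],
          w_vec: float = 0.7, w_kw: float = 0.3) -> list[float]:
    """Normalize each subsystem's scores BEFORE blending so neither
    dominates purely because of its native scale."""
    v, k = min_max(vector_scores), min_max(keyword_scores)
    return [w_vec * a + w_kw * b for a, b in zip(v, k)]
```

The weights here are illustrative starting points; tune them against the same golden set you use for Recall@K.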
The generator is where relevance becomes an answer, and where safety failures become user-facing incidents. Your prompt must clearly define grounded behavior: use the retrieved context, cite sources, and refuse when evidence is missing. Treat the prompt as a contract between your application and the model.
A robust RAG prompt typically includes: (1) role and task, (2) grounding rules, (3) citation format, (4) refusal criteria, and (5) response style. Keep the rules short and testable. For example: “If the answer is not explicitly supported by the provided sources, say you cannot confirm and ask a clarifying question.” This reduces hallucinations more effectively than adding more context.
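The five-part contract can be made concrete as a template. This is a sketch of one plausible shape, not a prescribed OCI prompt format; the wording and citation markers are illustrative.

```python
GROUNDED_PROMPT = """You are an enterprise QA assistant.

Task: answer the user's question using ONLY the numbered sources below.
Rules:
1. If the answer is not explicitly supported by the sources, say you cannot
   confirm it and ask one clarifying question.
2. Cite every claim with the matching source marker, e.g. [S1].
3. Keep the answer under 150 words.

Sources:
{sources}

Question: {question}
"""

def build_prompt(question: str, chunks: list[str]) -> str:
    """Number each evidence chunk so citations map back to stable sources."""
    sources = "\n".join(f"[S{i + 1}] {chunk}" for i, chunk in enumerate(chunks))
    return GROUNDED_PROMPT.format(sources=sources, question=question)
```

Keeping the evidence block separate from conversation history, as the next section argues, means only `{sources}` content is ever eligible for citation.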
Engineering judgment: do not overload the prompt with 20 chunks “just in case.” Too much context increases confusion and can lower accuracy. Prefer fewer, higher-quality chunks after reranking. Another common mistake is mixing chat history directly into the context block; this can cause the model to cite user messages as “sources.” Separate “conversation” from “evidence,” and only the evidence block should be eligible for citations.
Finally, make refusals user-helpful: suggest what evidence is needed (“I don’t have a source covering EMEA exceptions; do you have the policy revision date?”) or propose a safe next step. In enterprise settings, refusals are part of user trust, not a failure mode to hide.
Many RAG apps fail not because the answer is wrong, but because the output is inconsistent and hard to integrate. Response shaping solves this by asking the model to produce structured outputs (often JSON) that your API layer can validate. This is especially important for tool-style behavior: returning “answer + citations + follow-ups + extracted fields” in a predictable format.
Define a schema that matches your product needs. A common pattern is an object with an answer, a list of citations (chunk IDs), a confidence level, and optional follow-up questions or extracted fields.
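One way to sketch such a schema and its strict server-side check follows. The field names and the lightweight type-map validation are illustrative; in practice you might use a JSON Schema validator or Pydantic models instead.

```python
import json

# Hypothetical response contract: required field name -> expected Python type.
RESPONSE_SCHEMA = {
    "answer": str,
    "citations": list,            # stable chunk IDs backing the answer
    "confidence": str,            # e.g. "high" | "medium" | "low"
    "follow_up_questions": list,
}

def validate_response(raw: str) -> dict:
    """Parse model output and verify every required field exists with the
    expected type. Any failure raises, which is the signal to send a
    repair prompt rather than pass bad output to clients."""
    data = json.loads(raw)  # raises ValueError on invalid JSON
    for field, expected_type in RESPONSE_SCHEMA.items():
        if not isinstance(data.get(field), expected_type):
            raise ValueError(f"field {field!r} missing or wrong type")
    return data
```

The key design choice is that validation failures are loud: they trigger the repair loop instead of silently degrading into "model variability."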
Then enforce it. Validate JSON strictly server-side; if the model returns invalid JSON, re-ask with a repair prompt (“Return only valid JSON that matches this schema…”). Avoid letting invalid outputs silently pass to clients, because that becomes a reliability bug disguised as “model variability.”
Structured extraction is also how you turn RAG into workflows. For example, a troubleshooting assistant can extract “error_code,” “component,” and “recommended_runbook,” then your app can fetch the runbook and run a second RAG pass. Keep extraction grounded: require that each extracted field include a citation or “unknown.” The common mistake is to treat extracted fields as authoritative even when the context is weak; instead, mark uncertain fields and request confirmation.
Conversation memory improves usability, but it is also a major source of data leakage. Design memory intentionally: what should persist, for how long, and under what security boundaries. Separate short-term context (recent turns) from long-term preferences (user role, preferred format) and from retrieved evidence (documents). Never treat memory as evidence.
A practical design uses three stores: a short-term store for recent turns (summarized as the conversation grows), a long-term store for durable preferences (user role, preferred format), and a reference store for evidence (chunk IDs rather than document text).
Summarization is not just compression; it is policy enforcement. Your summarizer prompt should explicitly drop secrets, credentials, and personal identifiers, and keep only task-relevant facts. Common mistakes include summarizing verbatim (which preserves secrets) or summarizing with invented details. To mitigate, require the summary to be derived only from the conversation, and run a lightweight PII/secret scan before storing it.
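The pre-storage scan can start as simple pattern matching. A deliberately lightweight sketch (the patterns are illustrative and far from exhaustive; real deployments layer a proper PII/secret detection service on top):

```python
import re

# Illustrative patterns only: email addresses and key=value style secrets.
SECRET_PATTERNS = [
    re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    re.compile(r"(?i)\b(password|api[_-]?key|secret)\b\s*[:=]\s*\S+"),
]

def safe_to_store(summary: str) -> bool:
    """Block a conversation summary from being persisted if it matches any
    known secret/PII pattern. Fail closed: when in doubt, do not store."""
    return not any(pattern.search(summary) for pattern in SECRET_PATTERNS)
```

A blocked summary should be regenerated with an explicit instruction to drop the offending material, not stored with the match crudely masked.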
When you retrieve documents, do not store retrieved chunks in memory by default. That can accidentally persist licensed or sensitive text beyond its intended scope. Instead, store references (chunk IDs) and re-retrieve as needed. This also keeps citations consistent with the index and reduces prompt bloat over time.
Your RAG system becomes real when it is callable as a service. A clean API layer lets you support chat, single-shot QA, batch evaluation, and admin operations (index refresh, health checks). Start with a small set of REST endpoints and make reliability features first-class: timeouts, retries, idempotency, and rate limits.
A practical minimal API set mirrors those workflows: for example, a chat endpoint for conversational turns, a QA endpoint for single-shot answers with citations, a batch endpoint for evaluation runs, and admin endpoints for index refresh and health checks. (Endpoint names and payload shapes are up to your conventions; what matters is that each workflow is independently callable and observable.)
Timeout budgeting is the difference between a “demo” and a product. Assign budgets per stage: query rewrite (small), retrieval (bounded), rerank (bounded), generation (largest). If generation times out, return a partial response with citations and a message to retry, rather than hanging. Rate limits should protect both cost and stability; implement per-user and per-tenant quotas, and return clear error payloads for clients to handle gracefully.
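Per-stage budgets can be expressed as data and checked uniformly. A simplified sketch with illustrative budget values; it measures wall-clock time after the stage completes, whereas a production version would enforce budgets with request timeouts or cancellation so a slow stage is actually cut short:

```python
import time

# Illustrative per-stage budgets in seconds: generation gets the largest share.
STAGE_BUDGETS_S = {"rewrite": 0.5, "retrieval": 1.0, "rerank": 1.0, "generation": 8.0}

def run_with_budget(stage: str, fn, *args):
    """Run one pipeline stage and raise if it exceeded its time budget,
    so overruns surface as explicit errors instead of silent slowness."""
    start = time.monotonic()
    result = fn(*args)
    elapsed = time.monotonic() - start
    if elapsed > STAGE_BUDGETS_S[stage]:
        raise TimeoutError(f"{stage} took {elapsed:.2f}s, budget {STAGE_BUDGETS_S[stage]}s")
    return result
```

The caller decides what a budget breach means per stage: for generation, return the partial response with citations and a retry hint rather than hanging.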
To run an end-to-end demo on a representative dataset, script a few scenarios: (1) a straightforward factual question with citations, (2) a question requiring filters (region, date), (3) a query that needs rewriting, (4) a case where evidence is missing and the system must refuse or ask a clarifier. Log each stage (intent, rewritten query, retrieved chunk IDs, reranked top-k, prompt token count, latency). These logs are your practical link between exam objectives—retrieval, grounding, guardrails, evaluation—and a working OCI-hosted RAG application.
1. Why does Chapter 4 emphasize that a production RAG app is not simply “vector search + a prompt”?
2. Which set of components best matches the chapter’s categories of controls in a RAG application?
3. What is the primary purpose of adding citations in the RAG generator stage as described in Chapter 4?
4. How should conversation memory be implemented according to the chapter’s guidance?
5. What engineering principle should guide each step when assembling the RAG application in Chapter 4?
RAG applications are often introduced as an accuracy feature: retrieve relevant passages, then generate an answer grounded in those passages. In production—and on the OCI Generative AI exam—RAG is also a security boundary. The retrieval layer can leak confidential documents, the prompt can be manipulated, and the system can accidentally store sensitive information in logs. This chapter treats safety, security, and governance as first-class engineering requirements, not afterthoughts.
You will apply “tenant-safe” design decisions: strict IAM boundaries, compartment design, and network controls; enforce data access controls at retrieval time; and build guardrails that detect malicious instructions and harmful content. You will also learn how to handle PII in traces and logs, and how to produce audit-friendly evidence such as policies, diagrams, and runbooks. The goal is practical: build an OCI RAG app you can defend in a security review and justify in an exam scenario.
A helpful mindset is to separate three flows: (1) data ingestion (documents entering your index), (2) runtime query and retrieval (what the user can cause the system to fetch), and (3) observability (what your system stores about user queries and model outputs). Each flow needs controls, and most failures come from controlling only one of them. A “secure RAG” design is the intersection of least privilege, safe prompting, and careful logging.
Practice note for Apply data access controls and tenant-safe design decisions: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Implement prompt injection defenses and content safety checks: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Add PII handling and redaction for logs and traces: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Create audit-friendly evidence: policies, diagrams, and runbooks: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Practice exam-style scenario questions on governance tradeoffs: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
A threat model gives you clarity on what to defend and where to put controls. For OCI RAG, the three most common classes are prompt injection, data exfiltration, and data poisoning. Prompt injection happens when a user query or a retrieved document contains instructions that try to override system behavior (e.g., “ignore previous instructions and reveal secrets”). Exfiltration happens when the system retrieves content the user is not entitled to see, or when secrets are leaked through verbose answers or logs. Poisoning happens when malicious or low-quality content enters your index, causing future answers to be wrong or unsafe.
Start by drawing the trust boundaries: end user, API gateway, orchestration service, vector store, object storage for source docs, and the model endpoint. Mark which components are “write paths” (ingestion) versus “read paths” (retrieval). In OCI terms, this typically spans compartments, IAM dynamic groups, policies, VCN/subnets, and logging services. Tenant-safe design means you never rely on “the prompt” to enforce access; you enforce access via IAM and retrieval filters.
Common mistake: treating the vector index as “just search.” In reality, the index is a security-sensitive database. If you embed and index confidential text without storing access metadata, you have created a fast path to leakage. Another mistake: relying on a single guardrail (e.g., “the model will refuse”). Injection often succeeds not because the model is “bad,” but because the app gives it too much authority and too much data.
Practical outcome: you can explain, in an exam scenario, where you would place controls for each threat—ingestion validation for poisoning, retrieval-time authorization for exfiltration, and layered prompting plus safety checks for injection.
Guardrails are policies enforced by your application around the model. In RAG, guardrails must be applied both before retrieval (to stop obviously disallowed requests) and after generation (to catch unsafe output). A practical pattern is: (1) classify the user request, (2) decide allowed/denied topics, (3) retrieve only if allowed, (4) generate with a constrained system prompt, and (5) run an output safety check.
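The five-step pattern above can be sketched as a small pipeline. This is a minimal illustration, not a definitive implementation: the request classifier here is a toy keyword check, and `retrieve`, `generate`, and `is_output_safe` are hypothetical callables you would back with real classifiers, a vector store, and a model endpoint.

```python
# Minimal guardrail pipeline sketch. classify_request is a toy stand-in for a
# real request classifier; retrieve/generate/is_output_safe are hypothetical
# callables supplied by the application.

REFUSAL = "I can't help with that request."

def classify_request(query: str) -> str:
    """Toy classifier: deny queries containing obviously disallowed phrases."""
    denied_markers = ("ignore previous instructions", "reveal secrets")
    return "deny" if any(m in query.lower() for m in denied_markers) else "allow"

def answer(query: str, retrieve, generate, is_output_safe) -> str:
    # Steps (1)-(2): classify and apply allow/deny policy BEFORE any retrieval.
    if classify_request(query) == "deny":
        return REFUSAL
    # Step (3): retrieve only for allowed requests.
    chunks = retrieve(query)
    # Step (4): generate with a constrained system prompt (inside `generate`).
    draft = generate(query, chunks)
    # Step (5): post-generation safety check on the output.
    return draft if is_output_safe(draft) else REFUSAL
```

Note that the refusal path before retrieval means disallowed requests never touch the vector store, which matters for the data-handling point made later in this chapter.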
Allow/deny topic controls should be explicit. For enterprise RAG, create a policy table: allowed business tasks (e.g., “summarize policy,” “answer HR questions”), disallowed tasks (e.g., “write malware,” “bypass security”), and conditional tasks (e.g., “legal advice requires a disclaimer and citations”). Implement the policy as code, not as a document: a small rules engine or configuration file versioned in Git. This supports repeatable behavior and auditability.
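A policy table implemented as code might look like the sketch below. The category names and decision values are illustrative assumptions; the important properties are that the policy is versionable data, not prose, and that unknown categories fail closed.

```python
# Policy-as-code sketch: a versioned mapping from task category to decision.
# Category names and the "conditional" requirements are illustrative.

POLICY = {
    "summarize_policy": {"decision": "allow"},
    "hr_question":      {"decision": "allow"},
    "legal_advice":     {"decision": "conditional",
                         "requirements": ["disclaimer", "citations"]},
    "write_malware":    {"decision": "deny"},
    "bypass_security":  {"decision": "deny"},
}

def evaluate_policy(category: str) -> dict:
    # Unknown categories default to deny: fail closed, not open.
    return POLICY.get(category, {"decision": "deny"})
```

Checking this file into Git gives you the repeatability and audit trail the paragraph above calls for.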
Safety classifiers can be model-based or rule-based. Rule-based checks catch straightforward issues (PII patterns, profanity, obvious jailbreak phrases). Model-based classifiers can detect nuanced harmful intent, but they must be tuned to reduce false positives for your domain. A common engineering judgment is to combine them: rules for high-precision blocking, classifier scores for risk-based decisions (allow, refuse, or require escalation).
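One way to combine the two signal types is sketched below: rules give a high-precision hard block, and a classifier score (here a plain float in [0, 1], standing in for a real model-based safety classifier) drives a risk-banded decision. The patterns and thresholds are illustrative and would need tuning for your domain.

```python
# Combining rule-based and model-based safety signals. The jailbreak patterns
# and the 0.5/0.8 thresholds are illustrative assumptions to tune per domain.
import re

JAILBREAK_PATTERNS = [
    re.compile(r"ignore (all )?previous instructions", re.I),
    re.compile(r"disregard (your|the) (rules|system prompt)", re.I),
]

def safety_decision(text: str, classifier_score: float) -> str:
    # Rules first: high-precision hard block for known-bad phrasing.
    if any(p.search(text) for p in JAILBREAK_PATTERNS):
        return "refuse"
    # Classifier score second: risk-banded decision.
    if classifier_score >= 0.8:
        return "refuse"
    if classifier_score >= 0.5:
        return "escalate"
    return "allow"
```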
Common mistake: placing all guardrails in the system prompt and trusting compliance. Prompts are helpful, but they are not enforcement. Another mistake: only filtering outputs. If you retrieve sensitive text first, you may already have violated data-handling rules even if you refuse later. Practical outcome: you can describe a layered guardrail pipeline and justify why checks occur before and after retrieval.
Secure retrieval is the core “tenant-safe” control in RAG. The retrieval system must return only documents the caller is authorized to see, regardless of how the prompt is written. Treat your vector store like a database with row-level security: each chunk (or document) carries access metadata, and every query applies an authorization filter.
Implement this with two design steps. First, at ingestion time, attach metadata such as compartment, business unit, document classification, and allowed principals/groups. The key is that the metadata is derived from a trusted source (document management system, object storage prefix policies, or an ingestion service running with controlled IAM). Second, at query time, compute the caller’s entitlements from IAM (groups, claims, tenancy/compartment context) and apply filters to vector search. If your vector database does not support expressive filters, you must enforce the filter in an outer layer (retrieve candidates, then hard-filter) and accept the performance hit—security comes first.
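The outer-layer fallback described above (retrieve candidates, then hard-filter) can be sketched as follows. The chunk metadata shape and the entitlement model are assumptions for illustration; the point is that the check is deterministic and derived from IAM context, never from the prompt.

```python
# Retrieval-time authorization sketch: candidates from vector search are
# hard-filtered against caller entitlements. Metadata fields (compartment,
# classification, allowed_groups) are illustrative assumptions.

def authorized(chunk_meta: dict, caller: dict) -> bool:
    # Deterministic checks derived from IAM context, never from the prompt.
    if chunk_meta["compartment"] != caller["compartment"]:
        return False
    if chunk_meta["classification"] == "restricted":
        return bool(set(chunk_meta["allowed_groups"]) & set(caller["groups"]))
    return True

def secure_retrieve(candidates: list, caller: dict, k: int = 5) -> list:
    # Filter BEFORE truncating to k, so denied chunks never reach the prompt.
    allowed = [c for c in candidates if authorized(c["meta"], caller)]
    return allowed[:k]
```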
Common mistake: filtering only by “tenant” but not by department or project, which becomes a cross-team data leak. Another mistake: storing ACL metadata in a separate system and forgetting to join it at query time. The exam often frames this as a governance tradeoff: performance and simplicity versus strict authorization. The correct posture in regulated environments is deterministic authorization, even if it reduces recall or increases latency.
Practical outcome: you can articulate how to implement row-level security concepts for vector retrieval, and how IAM/compartment design supports those controls on OCI.
Privacy failures in RAG systems rarely come from the model itself; they come from what the application stores. Prompts, retrieved passages, and generated answers are highly likely to contain PII, credentials, internal identifiers, or confidential text. Your default posture should be logging minimization: store the least amount of data needed for troubleshooting and evaluation, and keep it for the shortest acceptable time.
Start by classifying your observability data. Separate metrics (counts, latencies, error rates), traces (request path and timings), and content logs (prompts, retrieved chunks, outputs). Metrics are usually safe; content logs are the highest risk. For PII handling, implement redaction at the edge: before writing any logs, detect common PII (emails, phone numbers, national IDs) and redact or hash them. Prefer irreversible hashing for identifiers you only need for correlation. If you must store text for debugging, use sampling, strict access control, and short retention.
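Edge redaction and correlation hashing can be sketched as below. The patterns are deliberately simple illustrations; production detection needs locale-aware patterns and review, and national-ID formats vary by country.

```python
# Edge redaction sketch: run before ANY log write. The regexes are simple
# illustrations, not production-grade PII detection.
import hashlib
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")

def redact(text: str) -> str:
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    return text

def correlation_id(user_id: str) -> str:
    # Irreversible hash for identifiers you keep only for correlation.
    return hashlib.sha256(user_id.encode()).hexdigest()[:16]
```

Hashing (rather than encrypting) identifiers means a log leak cannot be reversed into user identities, while equal inputs still correlate across log lines.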
Common mistake: enabling verbose request/response logging in production “temporarily” and never disabling it. Another mistake: capturing retrieved source passages verbatim in centralized logs, effectively duplicating confidential documents into a broader-access system. Practical outcome: you can describe a compliant logging strategy with PII redaction, minimal retention, and controlled access—an answer that maps cleanly to governance expectations in exam scenarios.
Hallucinations are a governance issue because they can create business risk: wrong procedures, incorrect policy advice, or fabricated citations. In RAG, you reduce hallucinations by increasing grounding and by signaling uncertainty to downstream systems and users. The objective is not “zero hallucinations,” but controlled behavior with measurable quality.
Grounding checks begin at retrieval. If retrieval returns low-relevance chunks, generation becomes guesswork. Use practical confidence signals: top-k similarity scores, score gaps between the first and subsequent results, and the presence/absence of required policy sections. Establish thresholds: if similarity is below a configured value, the assistant should ask a clarifying question or refuse with “I don’t have enough information in the approved sources.” This is an engineering judgment that trades answer rate for correctness.
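The threshold logic above can be sketched as a small decision function. The score semantics (higher means more similar) and the threshold values are assumptions to tune per corpus and embedding model; one possible reading of a small gap between the top two scores is ambiguity worth a clarifying question.

```python
# Confidence-gated answering sketch. min_top and min_gap are illustrative
# thresholds; scores are assumed sorted descending, higher = more similar.

REFUSAL = "I don't have enough information in the approved sources."

def grounding_decision(scores: list, min_top: float = 0.75,
                       min_gap: float = 0.05) -> str:
    if not scores or scores[0] < min_top:
        return "refuse"      # best match is too weak to trust
    if len(scores) > 1 and scores[0] - scores[1] < min_gap:
        return "clarify"     # near-tied candidates: ask a clarifying question
    return "answer"
```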
Require citations and enforce them. A robust approach is to include document IDs and chunk identifiers in the context and instruct the model to cite them. Then add a post-generation validator that checks whether the answer contains citations and whether cited chunks were actually retrieved. If validation fails, regenerate with stricter instructions or return a safe fallback. Another practical technique is “quote then explain”: include short quoted spans from sources, then the model’s explanation. This improves auditability and user trust.
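A post-generation citation validator might look like the sketch below. The `[doc:chunk]` marker format is an illustrative convention chosen for this example, not an OCI requirement; any parseable citation format works.

```python
# Citation validator sketch: checks that citations exist and that every cited
# chunk was actually retrieved. The [doc:chunk] format is an assumed convention.
import re

CITATION = re.compile(r"\[([\w-]+):([\w-]+)\]")

def validate_citations(answer: str, retrieved_ids: set) -> dict:
    cited = set(CITATION.findall(answer))  # set of (doc_id, chunk_id) tuples
    return {
        "has_citations": bool(cited),
        # Every cited chunk must be a subset of what retrieval returned.
        "all_grounded": cited <= retrieved_ids,
        "unknown_citations": cited - retrieved_ids,
    }
```

On a failed check, the orchestrator would regenerate with stricter instructions or return the safe fallback, as described above.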
Common mistake: treating a good-looking answer as “correct” without verifying it is supported by retrieved content. Another mistake: measuring only user satisfaction, not factuality/grounding. Practical outcome: you can implement grounding validation and confidence-based behaviors that reduce risk and support exam objectives around guardrails, citations, and evaluation.
Governance is how you prove you did the right things. On the exam, scenario questions often ask which control or artifact best addresses an auditor, a security team, or a compliance requirement. In real projects, the same artifacts reduce friction and speed approvals. Your goal is “audit-friendly evidence”: documents that map risks to controls and show operational readiness.
Create a controls map that ties each risk area to specific OCI and application controls. Example: “prevent cross-user data leakage” maps to compartment strategy, IAM policies, dynamic groups for runtime identity, and retrieval-time ACL filtering. “Detect prompt injection” maps to input classifiers, system prompt constraints, and refusal behaviors. “Protect PII in logs” maps to redaction middleware, restricted log access, and retention settings. Keep the map short and actionable—auditors and exam graders look for clarity, not volume.
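A controls map can itself live as versioned data. The sketch below encodes the examples from this section; the risk names and enforcement locations are illustrative.

```python
# Controls map sketch: each risk maps to (control, enforcement point) pairs.
# Entries are illustrative examples drawn from this section.

CONTROLS_MAP = {
    "cross-user data leakage": [
        ("compartment strategy and IAM policies", "OCI IAM"),
        ("retrieval-time ACL filtering", "application"),
    ],
    "prompt injection": [
        ("input classifiers", "application"),
        ("system prompt constraints and refusal behavior", "application"),
    ],
    "PII in logs": [
        ("redaction middleware", "application"),
        ("restricted log access and short retention", "OCI Logging"),
    ],
}

def controls_for(risk: str) -> list:
    return CONTROLS_MAP.get(risk, [])
```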
Common mistake: producing a beautiful diagram with no indication of enforcement points (where authorization is checked, where secrets are stored, where logs are redacted). Another mistake: writing policies that do not match implementation—an exam trap is choosing a “policy” answer when the scenario demands a technical enforcement control. Practical outcome: you can select and justify governance tradeoffs, and you can produce artifacts that demonstrate security, privacy, and reliability for OCI RAG systems.
1. In Chapter 5, why is the retrieval layer treated as a security boundary in a production OCI RAG app?
2. Which set of controls best represents the chapter’s “tenant-safe” design decisions?
3. What is the main purpose of implementing prompt injection defenses and content safety checks in an OCI RAG system?
4. The chapter recommends separating controls across three flows. Which flow is most directly about preventing sensitive data from being stored in logs and traces?
5. In an exam-style governance scenario, which approach best matches the chapter’s definition of a “secure RAG” design?
A RAG application is only “done” when it is measurable, testable, deployable, and operable. In earlier chapters you built ingestion, chunking, embeddings, vector indexing, retrieval, and grounded generation. Now you will turn that pipeline into an engineering system: you will define what “good” means, prove it with offline metrics, protect it with regression tests, ship it with safe deployment patterns, and run it with observability and rollback. This chapter also aligns those activities to OCI Generative AI exam objectives so you can reason through scenarios under time pressure.
A practical mental model is to treat your RAG app as two coupled subsystems: retrieval and generation. Retrieval determines whether the model can possibly answer correctly; generation determines whether it answers faithfully and safely given the retrieved evidence. Your evaluation set, metrics, and tests must therefore separate retrieval failures (no relevant chunks found) from generation failures (hallucination, missing citations, unsafe phrasing), because the fixes are different. The goal is to create an improvement loop you can run repeatedly: adjust chunking/indexing/ranking, rerun retrieval metrics; adjust prompts/guardrails, rerun faithfulness and citation checks; then verify latency and cost constraints with caching and batching.
Finally, you will operationalize the service on OCI with clear environment separation, configuration discipline, and observability. That last step is where many RAG teams stumble: they validate locally, then discover in production that network egress, IAM policies, rate limits, and logging choices affect latency and failure modes. By the end of this chapter, you should be able to ship a RAG service that you can test automatically, monitor confidently, and improve intentionally—and you will have a concrete exam readiness plan to close gaps quickly.
Practice note for this chapter's milestones (build an evaluation set and compute offline RAG metrics; create automated regression tests for retrieval and generation; deploy the service with observability and a rollback strategy; optimize latency and cost with caching and batching; run a final readiness sprint of timed practice and gap closure): for each one, document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Start evaluation by deciding what you are trying to prove. A RAG system is not evaluated by “does it sound good,” but by whether it reliably supports specific user workflows. Convert stakeholder needs into user stories (e.g., “As a support engineer, I need the precise troubleshooting steps for error X with references to the official runbook”). For each story, define acceptance criteria that are checkable: must include a citation, must name the exact policy clause, must answer within N seconds, must refuse to answer when sources are missing.
From those stories, build a golden set (also called an evaluation set): a curated list of questions, expected answer traits, and expected supporting sources. Keep it small enough to maintain (often 50–200 items) but representative: include easy lookups, multi-hop questions that require combining two documents, ambiguous queries, and “negative” queries where the answer should be “not found” or “requires human review.” Store metadata alongside each item: document IDs, section anchors, effective dates, and whether the query is time-sensitive.
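A golden-set item might be structured as below. The field names are illustrative assumptions; the point is to make expected sources and checkable answer traits explicit per question, including "negative" items whose correct behavior is a refusal.

```python
# Golden-set item sketch. Field names and the `kind` vocabulary are
# illustrative; adapt them to your corpus and acceptance criteria.
from dataclasses import dataclass

@dataclass
class GoldenItem:
    question: str
    expected_sources: list      # document/section IDs that must be retrieved
    expected_traits: list       # checkable properties of a good answer
    kind: str = "lookup"        # lookup | multi_hop | ambiguous | negative
    time_sensitive: bool = False

GOLDEN_SET = [
    GoldenItem(
        question="What is the escalation path for error X?",
        expected_sources=["runbook-12#escalation"],
        expected_traits=["has_citation", "names_policy_clause"],
    ),
    GoldenItem(
        question="What is our policy on a topic not in the corpus?",
        expected_sources=[],
        expected_traits=["refuses"],
        kind="negative",
    ),
]
```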
Common mistakes include using only “happy path” questions, writing expected answers that are too vague to score, and changing the corpus without updating golden sources. A practical outcome of this section is a living evaluation artifact that enables offline testing before you deploy: your team can argue about acceptance criteria once, then let metrics and tests enforce them continuously.
Offline RAG metrics are your fast feedback loop. Begin with retrieval metrics because they cap system performance: if relevant content is not retrieved, generation cannot succeed. The most practical retrieval metric is recall@k: for each golden question, does the top-k retrieved set contain at least one chunk that maps to the expected source section? Choose k based on your prompt context budget (often 3–8). Track recall@1 and recall@5 separately; recall@1 reflects ranking quality, while recall@5 reflects embedding/index coverage.
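The recall@k computation is small enough to sketch directly. The input shape (per-question lists of retrieved chunk IDs plus the expected source set) is an assumption matching the golden-set structure described above.

```python
# recall@k sketch: for each golden question, check whether any of the top-k
# retrieved chunk IDs maps to an expected source section.

def recall_at_k(results: list, k: int) -> float:
    """results: [{"retrieved": [chunk_id, ...], "expected": {chunk_id, ...}}]"""
    if not results:
        return 0.0
    hits = sum(
        1 for r in results
        if set(r["retrieved"][:k]) & set(r["expected"])
    )
    return hits / len(results)
```

Running this at k=1 and k=5 on the same results gives you the ranking-quality versus coverage split described above.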
Then measure generation quality with faithfulness (sometimes called groundedness): does the answer’s claims match the retrieved text? You can approximate this offline by extracting atomic claims from the answer and verifying they are entailed by the retrieved passages, using either deterministic heuristics (string/number matching, citation-required rules) or an LLM-as-judge rubric that checks “supported vs unsupported.” Keep the rubric strict and consistent; a permissive judge hides hallucinations.
Citation coverage is a metric that links your guardrails to user trust: what fraction of key answer statements have citations, and do those citations point to the correct chunks? A simple operational rule is “every paragraph must have at least one citation,” but stronger is “every factual claim must be cited.” Track missing citations as a separate failure class from wrong citations.
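The simpler "every paragraph must have at least one citation" rule can be measured as below. The `[doc:chunk]` marker format and the blank-line paragraph split are assumed conventions for this sketch.

```python
# Citation coverage sketch under the paragraph-level rule. The [doc:chunk]
# citation format and blank-line paragraph splitting are assumed conventions.
import re

CITATION = re.compile(r"\[[\w-]+:[\w-]+\]")

def citation_coverage(answer: str) -> float:
    paragraphs = [p for p in answer.split("\n\n") if p.strip()]
    if not paragraphs:
        return 0.0
    cited = sum(1 for p in paragraphs if CITATION.search(p))
    return cited / len(paragraphs)
```

Tracking missing citations (coverage below 1.0) separately from wrong citations (markers not in the retrieved set) keeps the two failure classes distinct, as recommended above.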
Finally, measure latency end-to-end and by stage: retrieval time, rerank time (if used), model generation time, and any post-processing (citation formatting, safety filtering). Combine this with cost metrics: token usage, embedding calls, vector queries. This is where caching and batching become engineering levers: cache embeddings for repeated queries; cache retrieval results for popular questions; batch embedding during ingestion; and batch vector inserts. The practical outcome is a scorecard you can run on every change: recall@k, faithfulness, citation coverage, and p95 latency, with thresholds tied to the acceptance criteria you defined in Section 6.1.
Metrics tell you “how good,” but tests tell you “what broke.” Build an automated test harness that runs in CI and covers three layers: unit tests, integration tests, and prompt regression tests. Unit tests validate deterministic code paths: chunking boundaries, metadata extraction, PII redaction, citation formatter, and query normalization. These tests should not call external OCI services; use fixtures and snapshots.
Integration tests validate the RAG pipeline across service boundaries: ingest a small known corpus, embed, index, retrieve, and generate. Keep it small and stable so failures are actionable. In OCI terms, integration tests often run against a non-production compartment with constrained policies and budgets. Validate IAM assumptions explicitly: a common failure is that local credentials have broad permissions, while the CI runtime or deployment runtime does not.
Prompt regression testing is uniquely important in RAG because prompt edits can improve one scenario while degrading another. Use your golden set as a regression suite: run the same prompts and compare outputs with tolerant assertions. Avoid brittle string equality; instead assert properties such as “contains a citation,” “does not mention forbidden topics,” “matches expected document ID,” “answers ‘unknown’ when retrieval confidence is low.” Keep a “known issues” allowlist but timebox it—allowlists tend to grow and silently normalize bad behavior.
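Tolerant property assertions might look like the sketch below. The property names and the output record shape are assumptions; the point is that the checker returns which properties failed rather than comparing full strings.

```python
# Tolerant regression assertions sketch: property checks instead of brittle
# string equality. Property names and record shapes are illustrative.
import re

def check_properties(output: dict, expected: dict) -> list:
    """Return the list of violated properties (empty list means pass)."""
    failures = []
    if expected.get("requires_citation") and not re.search(r"\[[\w-]+:[\w-]+\]", output["text"]):
        failures.append("missing citation")
    for topic in expected.get("forbidden_topics", []):
        if topic.lower() in output["text"].lower():
            failures.append(f"mentions forbidden topic: {topic}")
    if expected.get("expect_unknown") and "don't have enough information" not in output["text"]:
        failures.append("should have refused on low confidence")
    return failures
```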
The practical outcome is confidence: you can refactor chunking, switch an embedding model, or tune top-k and reranking without guessing. Your harness will tell you if retrieval degraded, if citations broke, or if generation started hallucinating.
Deploying a RAG service on OCI is primarily about controlling change. Use multiple environments—at minimum dev, staging, prod—with separate compartments, network controls, and policies. Staging should mirror production shape: same subnet types, similar access to OCI Generative AI endpoints, same logging and rate limits. If staging is “too small” or “too permissive,” you will learn the wrong lessons.
Adopt CI/CD basics that favor reproducibility: build an immutable artifact (container image), run unit tests and golden-set regressions, then deploy to staging with infrastructure-as-code (Terraform/OCI Resource Manager). Promote to production only if staging passes SLO and correctness gates. Configuration should be externalized (environment variables or OCI Vault/Secrets), not baked into images: model OCID, top-k, rerank flag, max context tokens, safety thresholds, and logging verbosity.
Plan rollback as a first-class feature. Keep the prior version deployable, and ensure configuration changes are versioned too. Two common rollback pitfalls in RAG are (1) changing the index schema or embedding model without a parallel index, and (2) re-chunking the corpus without preserving prior chunk IDs, which breaks citations. A safer pattern is blue/green or canary deployment: route a small percentage of traffic to the new version, compare metrics (latency, refusal rate, citation coverage), then ramp up.
Optimize latency and cost during deployment design, not after. Add caching layers deliberately: cache embeddings for repeated queries, cache retrieval results for frequent questions, and consider response caching when the corpus is stable and the query is identical. Use batching in ingestion pipelines (embedding calls, vector inserts) and consider asynchronous ingestion so production traffic is not impacted. The practical outcome is a deployment process you can repeat under exam-style constraints: least privilege IAM, predictable networking, observable services, and a rollback story when changes regress quality.
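The embedding cache plus batching pattern can be sketched as below. `embed_batch` is a hypothetical callable standing in for a real embedding endpoint; the hash-based cache key and the dedupe-then-batch policy are design assumptions.

```python
# Embedding cache and batching sketch. `embed_batch` is a hypothetical
# callable (list[str] -> list of vectors) backed by a real embedding endpoint.
import hashlib

class EmbeddingCache:
    def __init__(self, embed_batch):
        self._embed_batch = embed_batch
        self._cache = {}

    @staticmethod
    def _key(text: str) -> str:
        return hashlib.sha256(text.encode()).hexdigest()

    def embed(self, texts: list) -> list:
        # Dedupe, then make ONE batched call for all cache misses instead of
        # one endpoint call per text.
        missing = [t for t in dict.fromkeys(texts)
                   if self._key(t) not in self._cache]
        if missing:
            for t, vec in zip(missing, self._embed_batch(missing)):
                self._cache[self._key(t)] = vec
        return [self._cache[self._key(t)] for t in texts]
```

The same shape works for ingestion batching: accumulate chunks, then embed and insert in batches so production query traffic is not impacted.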
Once deployed, you need to know whether the system is healthy and whether answers remain trustworthy. Build dashboards that reflect both platform health and RAG-specific quality signals. At the platform layer, track request rate, error rate, and latency (p50/p95/p99). At the RAG layer, track retrieval confidence, top-k hit distribution, citation coverage, refusal rate, and “no relevant context found” frequency. Use structured logs so you can correlate each answer with retrieved chunk IDs and model parameters.
Alerting should be tied to SLOs (service level objectives). Define a small set that matters: p95 latency under a threshold, error rate under a threshold, and a quality SLO such as “citation coverage above X%” or “faithfulness violations below Y%,” measured via periodic offline sampling. Avoid alert storms by using burn-rate alerts for SLOs rather than raw metric spikes.
Create incident playbooks that map symptoms to likely causes and actions. Example: if latency spikes, check model endpoint latency vs vector search latency; if retrieval recall drops, check index freshness and ingestion jobs; if faithfulness drops, inspect prompt changes, context truncation, or top-k reductions. Include rollback steps and communication templates. Treat “model changed behavior” as an expected incident class: log model versions, and schedule periodic re-evaluations of the golden set.
The practical outcome is operational maturity: you can detect regressions quickly, respond consistently, and demonstrate to auditors and stakeholders that the system is controlled, monitored, and aligned to enterprise expectations.
The OCI Generative AI exam is scenario-heavy: you are expected to choose the most appropriate action given constraints (security, cost, latency, correctness). Anchor your reasoning in the lifecycle you practiced: define objectives → implement RAG pipeline → add guardrails → evaluate → deploy → operate. When reading a question, identify which stage is failing. If users report wrong answers with correct sources retrieved, it is a generation/faithfulness issue; if answers are vague and uncited, it is likely prompt/citation enforcement; if answers are “I don’t know” too often, it may be retrieval recall, chunking, or top-k.
Use elimination tactics grounded in OCI best practices. Prefer least privilege IAM and compartment separation; prefer staging that mirrors production; prefer observability with logs/metrics; prefer rollback-capable deployments. When multiple choices appear plausible, choose the one that reduces risk while preserving measurability—for example, adding a golden set and offline metrics is often a better first step than changing models blindly.
Run a final readiness sprint over 5–7 days. Timebox daily sessions: (1) review one objective area (IAM/networking, RAG components, evaluation/operations), (2) implement or re-implement a small hands-on task (e.g., compute recall@k on a golden set, add a prompt regression gate, configure a canary rollout), and (3) record gaps as “facts to memorize” vs “skills to practice.” Your review plan should end with a timed full practice test and a focused gap closure day where you revisit only the missed objective categories.
The practical outcome is exam confidence rooted in engineering habit: you will recognize patterns (retrieval vs generation vs operations), select the safest OCI-aligned approach, and justify choices using measurable criteria like recall@k, citation coverage, latency, and SLOs.
1. Why does Chapter 6 recommend treating a RAG app as two coupled subsystems (retrieval and generation) when evaluating quality?
2. A user report shows the model produced a plausible answer that lacks citations and appears invented, even though relevant chunks were retrieved. What type of failure is this and what should you adjust first?
3. What is the primary purpose of building an evaluation set and computing offline RAG metrics in Chapter 6?
4. Which approach best matches the chapter’s improvement loop for a RAG system?
5. According to Chapter 6, why do many RAG teams encounter issues after validating locally but moving to production on OCI?