HELP

+40 722 606 166

messenger@eduailast.com

OCI Generative AI Pro Prep: Build & Test RAG Apps on Oracle Cloud

AI Certifications & Exam Prep — Intermediate

OCI Generative AI Pro Prep: Build & Test RAG Apps on Oracle Cloud

OCI Generative AI Pro Prep: Build & Test RAG Apps on Oracle Cloud

Pass OCI Generative AI by building production-style RAG apps end to end.

Intermediate oci · oracle-cloud · rag · generative-ai

Build what the OCI Generative AI exam expects you to understand

This course is a short, technical, book-style blueprint designed to help you prepare for the OCI Generative AI Professional exam by building a realistic Retrieval-Augmented Generation (RAG) application on Oracle Cloud. Instead of memorizing disconnected facts, you’ll connect exam objectives to concrete implementation decisions: tenancy setup, secure retrieval, prompting for grounded answers, evaluation, and production-minded operations.

Across six tightly sequenced chapters, you’ll create a working RAG pipeline—from ingestion and embeddings to retrieval, generation, guardrails, and testing—while continuously mapping your work back to certification-style scenarios. The result is both exam confidence and a portfolio-ready reference architecture you can reuse at work.

What you’ll build (end to end)

You’ll design and implement a complete RAG workflow that can answer questions from your private documents with citations, while respecting security boundaries. Along the way, you’ll practice the skills that typically appear in exam prompts: choosing the right services, configuring IAM, troubleshooting access, and validating correctness and safety.

  • A repeatable ingestion and chunking process with metadata you can filter on
  • An embeddings + vector index layer optimized for recall and latency
  • A retriever + generator orchestration with grounded prompts and citations
  • Defenses against prompt injection and data exfiltration attempts
  • An evaluation and regression-testing harness to prevent quality drift

How the “book” is structured

Each chapter reads like a focused technical chapter: concepts first, then a milestone-driven build plan, then exam-style reasoning. The progression is intentional: you start by translating the exam blueprint into architecture decisions, then you lay down OCI foundations, then you build retrieval, then you add generation, then you harden the system, and finally you test and prepare for exam scenarios.

Who this is for

This course is ideal if you’re comfortable with basic Python and APIs and want a pragmatic path to the OCI Generative AI Professional certification. It’s also a strong fit for cloud engineers and app developers who need to demonstrate they can build RAG systems responsibly: secure, observable, and testable.

What makes this different

  • Exam-aligned engineering: every chapter connects hands-on artifacts to the kinds of choices exam questions test.

  • Production-minded RAG: you’ll focus on evaluation, guardrails, and operational readiness—areas candidates often neglect.

  • Reusable templates: you’ll leave with checklists for IAM, retrieval tuning, prompt patterns, and regression tests.

Get started

If you’re ready to turn certification prep into a real build, you can Register free and begin Chapter 1. Prefer to compare options first? You can also browse all courses on Edu AI.

By the end, you won’t just “know about” RAG on OCI—you’ll be able to explain and defend your design choices, test your system for reliability, and walk into the exam with a structured approach to scenario questions.

What You Will Learn

  • Map OCI Generative AI exam objectives to hands-on RAG implementations
  • Provision core OCI resources for generative AI workloads (IAM, networking, logging)
  • Build a complete RAG pipeline: ingest, chunk, embed, index, retrieve, generate
  • Implement vector search and relevance tuning for enterprise document QA
  • Add guardrails: prompt safety, grounding, citations, and data protection controls
  • Evaluate and test RAG apps with offline metrics, golden sets, and regression suites
  • Deploy and operate RAG services with observability, cost controls, and CI/CD basics

Requirements

  • Basic Python proficiency (functions, packages, virtual environments)
  • Familiarity with REST APIs and JSON
  • General cloud concepts (regions, IAM, networking) helpful but not required
  • An Oracle Cloud account with permissions to create common resources
  • Local dev environment: Python 3.10+ and a code editor

Chapter 1: Exam Blueprint to Reference RAG Architecture

  • Identify exam domains and translate them into build tasks
  • Set up the project repo, environment, and OCI authentication path
  • Design a baseline RAG architecture for Oracle Cloud
  • Define success criteria: latency, accuracy, safety, and cost targets
  • Create a study-and-build plan with checkpoints and mock questions

Chapter 2: OCI Foundations for Generative AI Workloads

  • Create compartments, policies, and dynamic groups for least privilege
  • Stand up networking basics and private access patterns
  • Configure logging, monitoring, and audit for AI services
  • Establish a secure secrets strategy for keys and endpoints
  • Validate connectivity with a small “hello LLM” call

Chapter 3: Build the Ingestion + Vector Index Layer

  • Ingest documents and normalize formats for retrieval readiness
  • Implement chunking strategies and metadata enrichment
  • Generate embeddings and build a vector index
  • Run retrieval experiments and tune top-k + filtering
  • Package the indexing job for repeatable runs

Chapter 4: Assemble the RAG Application (Retriever + Generator)

  • Implement retrieval with filters, reranking hooks, and citations
  • Build prompting templates for grounded answers and tool-style behavior
  • Create an API layer for chat + QA workflows
  • Add conversation memory without leaking sensitive context
  • Run end-to-end demos on a representative dataset

Chapter 5: Safety, Security, and Governance for OCI RAG

  • Apply data access controls and tenant-safe design decisions
  • Implement prompt injection defenses and content safety checks
  • Add PII handling and redaction for logs and traces
  • Create audit-friendly evidence: policies, diagrams, and runbooks
  • Practice exam-style scenario questions on governance tradeoffs

Chapter 6: Testing, Evaluation, Deployment, and Final Exam Readiness

  • Build an evaluation set and compute offline RAG metrics
  • Create automated regression tests for retrieval and generation
  • Deploy the service with observability and rollback strategy
  • Optimize latency and cost with caching and batching
  • Run a final readiness sprint: timed practice and gap closure

Sofia Chen

Senior Machine Learning Engineer, Cloud AI & Search Systems

Sofia Chen is a senior machine learning engineer specializing in retrieval-augmented generation, vector search, and cloud-native ML delivery on Oracle Cloud. She has designed secure, testable AI services for enterprise teams and mentors engineers preparing for OCI certification exams.

Chapter 1: Exam Blueprint to Reference RAG Architecture

This course is an exam-prep guide that behaves like an engineering playbook: every objective you study should turn into something you can build, measure, and defend in a design review. The OCI Generative AI certification expects you to understand not only what Retrieval-Augmented Generation (RAG) is, but how it is operationalized on Oracle Cloud Infrastructure (OCI): identity boundaries, networking, observability, data handling, and repeatable evaluation. In this chapter you will convert the exam blueprint into a reference RAG architecture and a personal study plan. The goal is to remove “mystery points” from the exam by making each domain a concrete task: provision a compartment and policies, build a minimal pipeline, enforce safety and data controls, and prove it works with metrics.

Think of the rest of the book as a set of lab increments. If you can run the pipeline end-to-end (ingest → chunk → embed → index → retrieve → generate) and then harden it (guardrails, grounding, logging, cost control), you will naturally cover the bulk of the exam scope. Throughout, you will practice engineering judgment: where to place components, how to reduce latency, how to tune relevance, and how to avoid common security mistakes that invalidate an otherwise correct solution.

Practice note for Identify exam domains and translate them into build tasks: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Set up the project repo, environment, and OCI authentication path: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Design a baseline RAG architecture for Oracle Cloud: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Define success criteria: latency, accuracy, safety, and cost targets: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Create a study-and-build plan with checkpoints and mock questions: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Identify exam domains and translate them into build tasks: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Set up the project repo, environment, and OCI authentication path: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Design a baseline RAG architecture for Oracle Cloud: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Define success criteria: latency, accuracy, safety, and cost targets: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Sections in this chapter
Section 1.1: OCI Generative AI certification scope and common pitfalls

The certification scope typically spans three overlapping areas: (1) core OCI foundations that enable AI workloads (IAM, compartments, networking, logging/monitoring), (2) generative AI usage patterns (prompting, RAG, embeddings, vector search, evaluation), and (3) production concerns (security, cost, reliability). Your first translation step is to map each objective to a “build task.” If the blueprint mentions identity and access, your build task is not reading docs—it is creating a compartment structure, dynamic groups (if you use compute), policies for Object Storage/Logging, and verifying that the runtime can read documents and write traces without over-privilege.

Common pitfalls are predictable. Many candidates memorize service names but cannot explain data flow end-to-end: where documents land, who can read them, which subnet the app runs in, and where logs go. Another frequent mistake is mixing responsibilities: letting the model “remember” enterprise facts instead of retrieving them, or indexing raw PDFs without text normalization and chunking. Finally, exam scenarios often probe least privilege and boundary thinking; overly broad policies (“manage all-resources in tenancy”) might make a lab work, but it is a design anti-pattern you should recognize and avoid.

Practical outcome for this section: create a one-page “objective → artifact” mapping. Example artifacts include: a repo with Infrastructure-as-Code (or CLI scripts), a minimal RAG app that runs against OCI services, and a dashboard view showing request latency and retrieval quality signals. Your study time becomes measurable output, not passive reading.

Section 1.2: RAG fundamentals: retrieval vs. generation responsibilities

RAG succeeds when you clearly separate retrieval responsibilities from generation responsibilities. Retrieval is about finding the right evidence quickly and consistently: document parsing, chunking strategy, embeddings, vector indexing, filtering, and ranking. Generation is about writing a helpful answer that is constrained by retrieved evidence: prompt structure, grounding, citations, refusal behaviors, and formatting. If you blur the line—asking the LLM to “guess” missing facts—you create hallucinations and compliance risks.

In a baseline enterprise document QA flow, the user question is embedded, the vector index returns top-k chunks, and the generator model is prompted with those chunks plus instructions. Engineering judgment shows up immediately. Chunk size affects recall vs. precision: large chunks preserve context but may dilute similarity; small chunks improve specificity but can lose meaning. Overlap helps continuity but increases index size and cost. Retrieval also benefits from metadata filters (department, document type, time range) to narrow search space before similarity scoring, which improves both speed and relevance.

Common mistakes include embedding the question and searching without any normalization (case, punctuation, boilerplate), indexing duplicate content, or ignoring “near-duplicate” chunks that crowd out diverse results. Another mistake is not validating retrieval independently. You should be able to answer: “Did the right chunk appear in the top-k?” before asking “Did the model answer well?” Practical outcome: define a minimal retrieval test set (a few known Q/A pairs with expected source documents) and track hit-rate@k as a first offline metric, even before you worry about perfect generation.

Section 1.3: Oracle Cloud reference components for a RAG app

A reference OCI RAG architecture is easiest to understand as layers: data, indexing, application runtime, model services, and operations. On the data side, Object Storage is a common landing zone for documents and extracted text. A processing component (a container, function, or job) reads objects, extracts text (PDF/HTML/Office), chunks it, and writes clean chunks plus metadata back to storage or directly into your index pipeline. For indexing, you need a vector-capable store. Depending on your chosen OCI-friendly stack, this might be a managed service you operate in your tenancy or a database/service that supports vector search; the architectural principle is consistent: store embeddings, chunk text, and metadata; support kNN search and filters.

For generation and embeddings, use OCI Generative AI endpoints where appropriate, keeping model calls inside the tenancy boundary and ensuring network paths are controlled. The application runtime can be a containerized API (for example on OCI Container Instances or Kubernetes) that exposes an endpoint: /ask triggers retrieval and generation, /ingest triggers processing, and /health reports dependencies. Networking typically includes a VCN with private subnets for app components, service gateways/private endpoints for access to OCI services, and tight security lists/NSGs. IAM binds it together: policies for reading Object Storage buckets, invoking Generative AI, writing logs, and (if needed) accessing databases.

Observability is not optional. Configure Logging for application logs and audit trails, and add metrics for latency across stages: embed time, vector query time, model generation time. Practical outcome: draw a data-flow diagram with arrows labeled by protocol and identity (which principal calls what). In exams and in real life, clarity about “who calls whom, from where, with what permissions” is often the difference between a correct and an incomplete architecture.

Section 1.4: Dev workflow: local notebooks vs. services, secrets, and config

Most RAG projects begin in a notebook, but they pass the exam—and survive production—only when you can move from exploratory code to a service with controlled configuration. A strong workflow starts with a repo layout that separates notebooks (experiments), src/ (reusable pipeline code), infra/ (OCI provisioning scripts), and tests/ (retrieval and regression checks). Keep a single configuration contract (YAML or environment variables) that can run locally and in OCI, so you avoid “works on my machine” drift.

OCI authentication should be explicit. Locally you might use an OCI config profile with an API key, but for deployed workloads you should prefer resource principals or instance principals so secrets are not copied into containers. This is an exam-relevant habit: treat credentials as short-lived and scoped. Store sensitive values in OCI Vault (or equivalent secret storage), retrieve them at runtime, and never commit them to Git. Another practical pattern is to define a thin “client factory” module that creates OCI SDK clients using either local config or resource principal automatically, so the same code path runs in both environments.

Common mistakes include hardcoding compartment OCIDs, mixing dev and prod buckets, or logging full prompts and retrieved content without redaction. Practical outcome: implement a minimal “auth path” checklist in your repo: how to run locally, how to run in OCI, where secrets live, and how to rotate them. This reduces setup friction and directly supports exam scenarios around secure operations.

Section 1.5: Data and model boundaries: what must never leave the tenancy

RAG architecture is as much about boundaries as it is about models. Your design must clearly state what data is allowed to leave a compartment, a VCN, or the tenancy—and what must never leave. Enterprise documents, user queries that contain sensitive identifiers, and retrieved passages often fall into “must remain inside tenancy” by default. That requirement influences service selection, network routing, logging strategy, and even prompt templates.

Start by classifying data: public, internal, confidential, regulated. Then enforce boundaries with technical controls. Use private networking patterns where possible (private endpoints/service gateways) so calls to OCI services do not traverse the public internet. Apply IAM policies that allow only the app runtime to read the source buckets and only the ingestion pipeline to write processed chunks. For logging, default to metadata logging (timings, doc IDs, similarity scores) and carefully gate any content logging behind explicit debug flags with redaction. For prompts, apply guardrails: instructions to cite sources, refuse unsupported claims, and avoid revealing system prompts or sensitive content not present in retrieved context.

A common mistake is accidental exfiltration through observability: writing full retrieved chunks into logs for troubleshooting. Another is uncontrolled caching in client-side apps. Practical outcome: write a “data handling contract” for your app: what gets stored (chunks, embeddings, metadata), retention periods, who can access it, and how you prevent cross-tenant or cross-compartment leakage. Even when not explicitly asked, this mindset aligns with exam expectations for secure, compliant OCI deployments.

Section 1.6: Building a mini exam rubric: objectives, labs, and review loop

To prepare efficiently, you need a rubric that connects objectives to labs and to evidence that you learned the skill. Build a simple table with three columns: exam objective, lab checkpoint, and verification artifact. For example: “Provision IAM and networking for AI workload” maps to a checkpoint where your RAG service runs in a private subnet with least-privilege policies; the artifact is a policy snippet, a network diagram, and a successful run that writes logs to OCI Logging. “Implement vector search and relevance tuning” maps to a checkpoint where you can adjust chunk size, overlap, top-k, and metadata filters, and show retrieval hit-rate improvements on a small golden set.

Your review loop should be iterative: implement, measure, adjust, and document. Define success criteria early so you can tell when you are done. At minimum, set targets for latency (p95 end-to-end and per-stage), accuracy (retrieval hit-rate@k and a qualitative answer review), safety (refusal/grounding behavior, no sensitive leakage in logs), and cost (embedding/index size and model call frequency). Avoid the trap of optimizing only the model response; many RAG failures are retrieval failures, and many cost blowups come from over-chunking and overly large top-k contexts.

Practical outcome: a personal study-and-build plan with dates and checkpoints. Each checkpoint ends with a short written note: what changed, what metric moved, what risk remains. This turns exam prep into a portfolio of verified competencies and makes later chapters faster, because you will already have the reference architecture, repo, and evaluation habits in place.

Chapter milestones
  • Identify exam domains and translate them into build tasks
  • Set up the project repo, environment, and OCI authentication path
  • Design a baseline RAG architecture for Oracle Cloud
  • Define success criteria: latency, accuracy, safety, and cost targets
  • Create a study-and-build plan with checkpoints and mock questions
Chapter quiz

1. What is the chapter’s core approach to preparing for the OCI Generative AI certification?

Show answer
Correct answer: Translate each exam objective into something you can build, measure, and defend in a design review
The chapter frames exam prep as an engineering playbook: every domain becomes a concrete build-and-evaluate task.

2. Which set of concerns does the chapter emphasize as necessary to operationalize RAG on OCI (beyond knowing what RAG is)?

Show answer
Correct answer: Identity boundaries, networking, observability, data handling, and repeatable evaluation
The chapter stresses operationalization on OCI, including security boundaries, infrastructure placement, and measurable evaluation.

3. Which option best represents the minimal end-to-end RAG pipeline described in the chapter?

Show answer
Correct answer: Ingest → chunk → embed → index → retrieve → generate
The chapter explicitly lists the pipeline order from ingestion through generation.

4. According to the chapter, what is the purpose of defining success criteria such as latency, accuracy, safety, and cost targets?

Show answer
Correct answer: To prove the system works with metrics and reduce “mystery points” on the exam
Success criteria make performance and compliance measurable, supporting defensible designs and repeatable evaluation.

5. Which action best reflects the chapter’s method for turning an exam domain into a concrete task?

Show answer
Correct answer: Provision a compartment and policies, then build a minimal pipeline and validate it with metrics
The chapter suggests making each domain tangible (e.g., compartments/policies, pipeline build, safety controls) and validating via metrics.

Chapter 2: OCI Foundations for Generative AI Workloads

Before you build a Retrieval-Augmented Generation (RAG) pipeline on Oracle Cloud Infrastructure (OCI), you need foundations that won’t collapse under real enterprise constraints: least-privilege access, predictable networking, measurable observability, and safe configuration handling. Many “it worked on my laptop” AI prototypes fail not because the model is wrong, but because credentials leak into logs, services can’t reach endpoints privately, or nobody can explain why latency spiked during a release.

This chapter sets up the OCI primitives you’ll reuse throughout the course: compartments and tags for resource hygiene, IAM policies and dynamic groups for secure automation, VCN patterns for private access, logging/monitoring/audit baselines for production traceability, and a secrets strategy that keeps endpoints and keys out of source control. You will end by validating connectivity with a small “hello LLM” call—your first proof that identity, network, and configuration are aligned.

  • Outcome: a tenancy layout that supports multiple RAG environments (dev/test/prod) and clean cost allocation.
  • Outcome: IAM and networking that allow apps to call OCI AI services securely, including private patterns where appropriate.
  • Outcome: observability and secrets practices that prevent common operational and security mistakes.

Think of these foundations as the scaffolding for every exam objective that follows: ingestion pipelines, vector search, prompt safety, and evaluation all depend on controlled access and reliable telemetry.

Practice note for Create compartments, policies, and dynamic groups for least privilege: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Stand up networking basics and private access patterns: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Configure logging, monitoring, and audit for AI services: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Establish a secure secrets strategy for keys and endpoints: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Validate connectivity with a small “hello LLM” call: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Create compartments, policies, and dynamic groups for least privilege: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Stand up networking basics and private access patterns: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Configure logging, monitoring, and audit for AI services: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Establish a secure secrets strategy for keys and endpoints: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Sections in this chapter
Section 2.1: Tenancy structure: compartments, tags, and resource hygiene

Start by treating your OCI tenancy like a multi-environment software product. The simplest robust pattern is a top-level compartment for the course or program (for example, genai-rag), with child compartments for dev, test, and prod. This allows you to scope IAM policies cleanly and to quarantine experimental work without risking production data or quotas. A frequent mistake is placing everything in the root compartment “temporarily”—and then never moving it. That breaks audit clarity and makes least privilege harder because policies become too broad.

Use tags to enforce hygiene and cost attribution. Define a small tag namespace (such as CostCenter, Environment, Owner, DataSensitivity) and require them for major resources: VCNs, subnets, compute, databases, object storage buckets, logging, and API gateways. If you do nothing else, tag Environment and Owner; those two tags dramatically reduce the “what is this and who pays?” problem during incidents.

  • Compartment strategy tip: keep shared services (for example, a central logging bucket or network hub) in a separate shared compartment, and grant read/emit permissions from dev/test/prod.
  • Naming conventions: include environment and region where helpful (e.g., rag-dev-vcn, rag-prod-logs). Consistency beats cleverness.
  • Lifecycle hygiene: decide upfront how to delete or archive ingestion artifacts (documents, embeddings exports) so that experiments don’t accumulate unnoticed.

Practical outcome: by the time you build indexing and retrieval, you’ll be able to isolate test datasets, rotate credentials safely, and produce an audit-friendly record of where enterprise documents live and who accessed them.

Section 2.2: IAM: policies, groups, dynamic groups, and principals

Generative AI workloads involve multiple identities: humans (developers, operators), services (functions, compute instances, container workloads), and sometimes external callers (CI/CD systems). OCI IAM gives you the vocabulary to express least privilege: groups for humans, dynamic groups for resources, and policies that grant permissions in specific compartments.

For a RAG app, avoid “one policy to rule them all.” Instead, separate roles: a RAG-Developer group that can manage dev resources; a RAG-Operator group that can read logs/metrics and restart services; and a RAG-ReadOnly group for auditors or reviewers. Then use a dynamic group for the runtime (for example, an instance, OKE worker nodes, or Functions) so the app can read secrets, write logs, and call AI services without embedding user credentials.

  • Common mistake: giving developers broad tenancy-wide permissions “for speed,” then forgetting to remove them. Scope policies to compartments and use separate compartments for environments.
  • Principal choice: prefer resource principals (dynamic group + policy) for workloads running on OCI. This avoids distributing API keys and reduces rotation burden.
  • Least-privilege mindset: grant only what the app needs: read a specific vault secret, use a specific AI endpoint, write to a specific log group.

Engineering judgement: start strict and open only what you must. When something fails, don’t immediately widen a policy. First identify whether the failure is IAM, network, or endpoint configuration. In OCI, many “permission denied” errors are actually missing compartment scoping or using the wrong principal (user API key vs resource principal). Getting this right early makes later guardrails (grounding, citations, data protection) enforceable because you’ll know which identity accessed which dataset and when.

Section 2.3: Networking for AI apps: VCN, subnets, gateways, private endpoints

RAG apps are networked systems: they ingest documents from storage, call embedding and LLM endpoints, query a vector store, and emit logs and metrics. Your default goal is controlled egress and predictable paths. Create a VCN with at least two subnets: a private subnet for app runtimes (compute/OKE/functions), and a public subnet only if you need an internet-facing load balancer or bastion-style access. Keep databases, caches, and internal services private.

Decide how the app reaches OCI services. The most common pattern is private workloads with controlled outbound access via a NAT Gateway for internet egress, plus a Service Gateway for private access to OCI services that support it (so traffic stays on Oracle’s network). If you need inbound administration, use a Bastion service or a locked-down jump host rather than opening SSH broadly.

  • Private access pattern: run the RAG API in a private subnet; expose only an API Gateway or Load Balancer in a public subnet.
  • Route tables and security lists/NSGs: model flows explicitly: app → AI endpoint, app → object storage, app → logging. Overly permissive “allow all egress” rules are a common security review failure.
  • DNS and endpoints: confirm that your runtime can resolve service endpoints from inside the VCN; misconfigured DNS is a frequent cause of “mysterious timeouts.”

Practical outcome: when you later implement vector search and relevance tuning, you can trust that latency measurements reflect your system design—not accidental internet routing. And when you introduce enterprise document sources, private access patterns reduce exposure and simplify compliance arguments.

Section 2.4: Observability: Logging, Monitoring, Alarms, Audit baseline

Generative AI systems require more observability than typical CRUD apps because failures often appear as “bad answers” rather than explicit errors. Establish an observability baseline now: Logging for application and service logs, Monitoring for metrics, Alarms for actionable thresholds, and Audit for governance-grade records of API calls.

In practice, create log groups per environment (dev/test/prod) and decide what your RAG app must log. At minimum: request ID, document IDs retrieved, embedding model/version, top-k scores, and the final prompt token count. Avoid logging raw prompts or retrieved text by default if they may contain sensitive data—log hashes, IDs, and short summaries instead. A common mistake is dumping full documents into logs “for debugging,” which becomes a data leak and a retention liability.

  • Monitoring signals: latency (p50/p95), error rates, rate-limit responses, token usage, vector query time, and retrieval hit rate.
  • Alarms: trigger on sustained 5xx errors, unusual token spikes, or retrieval returning zero results above a minimum score.
  • Audit baseline: verify that Audit is enabled and that you can answer: who created an endpoint, who changed a policy, and when secrets were accessed.

Engineering judgement: build “debuggability” into the app contract. If a user reports hallucination, you should be able to reconstruct what context was retrieved, what model was called, and what configuration was active—without exposing the underlying sensitive text. This discipline directly supports later evaluation and regression testing because you can correlate answer quality shifts with configuration changes.

Section 2.5: Secrets management and configuration patterns

RAG apps touch many sensitive values: API keys (if you must use them), private endpoints, database passwords, signing keys, and occasionally third-party credentials for content ingestion. The rule is simple: no secrets in source control, no secrets in container images, and no secrets in plain environment files committed to repositories. Use OCI Vault for secrets and encryption keys, and grant access via IAM policies to the workload’s dynamic group.

Separate configuration into (1) non-secret runtime config (region, compartment OCIDs, model names, index names), and (2) secrets (tokens, passwords, private keys). Non-secret config can live in environment variables or config files delivered by your deployment pipeline. Secrets should be fetched at runtime from Vault, ideally cached in memory with a short TTL and refreshed safely. If you need to rotate secrets, rotation should not require a code change—only a Vault update.

  • Common mistake: logging environment variables during startup. Many frameworks print config on boot; disable or redact it.
  • Endpoint hygiene: treat internal endpoints and OCIDs as sensitive in enterprise contexts; they can reveal architecture.
  • Practical pattern: store a single “app config” secret containing a JSON map of credentials, versioned in Vault, and validate schema at startup.

To validate everything end-to-end, perform a small “hello LLM” call from the same runtime environment your RAG service will use (not from your laptop). This confirms the principal, network routes, DNS resolution, and secret retrieval are correct. Keep the prompt harmless and generic, and record only minimal metadata (status, latency, token counts) in logs.

Section 2.6: Cost and quota awareness: limits, budgeting, and safe defaults

Generative AI is cost-sensitive by design: token-based billing, embedding at ingestion time, vector storage growth, and bursty query traffic. Build cost and quota awareness into your foundation so you don’t discover limits during a demo or an exam lab. Start by reviewing service limits and quotas relevant to your architecture: model endpoints, request rates, object storage capacity, networking limits, and any vector database constraints. Treat quotas as design inputs, not afterthoughts.

Establish budgets per compartment (dev/test/prod) and use tagging to attribute spend. In dev, set tighter budgets and alert thresholds. Also adopt safe defaults in code and deployment: cap max tokens, limit top-k retrieval, and enforce request timeouts. A common mistake is leaving “max tokens” unconstrained during testing; one runaway prompt or loop can turn into a surprise bill and noisy logs.

  • Budgeting practice: create alerts at 50%, 80%, and 100% of monthly budget; route them to the on-call channel.
  • Quota planning: request quota increases before load testing; don’t interpret throttling as “the model is down.”
  • Cost guardrails: default to smaller models for dev, batch embeddings, and implement caching for repeated retrieval queries.

Practical outcome: your later RAG evaluation and regression suites can run predictably without accidentally exhausting limits. Cost awareness also improves engineering judgement: you’ll know when to optimize chunking and retrieval (cheaper) versus calling larger models more often (expensive). With compartments, IAM, networking, observability, and secrets in place, you’re ready to build the first real pipeline components confidently.

Chapter milestones
  • Create compartments, policies, and dynamic groups for least privilege
  • Stand up networking basics and private access patterns
  • Configure logging, monitoring, and audit for AI services
  • Establish a secure secrets strategy for keys and endpoints
  • Validate connectivity with a small “hello LLM” call
Chapter quiz

1. Which foundation best addresses the risk of AI prototypes failing due to over-permissioned access and insecure automation?

Show answer
Correct answer: Use compartments with IAM policies and dynamic groups to enforce least privilege
The chapter emphasizes least-privilege access using compartments, IAM policies, and dynamic groups to secure automated workloads.

2. What problem is the chapter’s networking guidance primarily trying to prevent for generative AI workloads on OCI?

Show answer
Correct answer: Services being unable to reach required endpoints privately in enterprise environments
It highlights VCN patterns and private access so services can reach endpoints securely without relying on public exposure.

3. Why does Chapter 2 stress configuring logging, monitoring, and audit before building the RAG pipeline?

Show answer
Correct answer: To provide production traceability and explain issues like latency spikes during releases
The chapter frames observability as essential for measurable operations and post-change diagnosis (e.g., latency spikes).

4. Which practice aligns with the chapter’s goal of keeping endpoints and keys out of source control and logs?

Show answer
Correct answer: Establish a secrets strategy to store and retrieve configuration securely
A secure secrets strategy prevents credentials and endpoints from leaking into source control or logs.

5. What is the purpose of ending the chapter with a small “hello LLM” call?

Show answer
Correct answer: To validate that identity, network, and configuration are correctly aligned
The “hello LLM” call is presented as the first proof that IAM, networking, and configuration foundations are working together.

Chapter 3: Build the Ingestion + Vector Index Layer

A Retrieval-Augmented Generation (RAG) app is only as trustworthy as its corpus pipeline. In exam terms, this chapter maps directly to objectives around building an end-to-end RAG workflow (ingest → chunk → embed → index → retrieve), plus the operational controls that make it repeatable and auditable on Oracle Cloud Infrastructure (OCI). In engineering terms, you are building a “document supply chain”: raw files arrive, are normalized, broken into retrieval-sized units, enriched with metadata, embedded, indexed, and then continuously maintained.

The most common failure mode is treating ingestion and indexing as a one-time script. Real enterprise corpora change daily: policies are revised, PDFs are re-exported, Confluence pages move, and access rights change. Your ingestion layer therefore needs clear patterns for batch and incremental processing, plus an approach to change data capture (CDC) so the index reflects the source of truth. Next, your chunking strategy determines retrieval quality and cost: large chunks often reduce the number of embeddings but can bury the answer; tiny chunks improve pinpointing but increase index size and retrieval noise. Metadata is the third leg of the stool—without it you cannot filter by access control, freshness, or source provenance, and you cannot produce citations with confidence.

Once chunking and metadata are stable, you can generate embeddings and construct a vector index. Here you make model choices (dimension, multilingual support, domain fit), operational choices (caching, normalization), and platform choices (where the index lives and how it is queried). Finally, you run retrieval experiments—vary top-k, apply filters, compare chunk sizes—so you can tune relevance before you ever connect a large language model (LLM) to generation. The practical outcome is a packaged indexing job you can run repeatedly (e.g., daily), producing predictable artifacts and logs suitable for regression testing.

This chapter focuses on the ingestion + vector index layer. In later chapters, you will connect this to the generation step, grounding, citations, and evaluation. For now, think like a systems builder: design for repeatability, traceability, and controlled change.

Practice note for Ingest documents and normalize formats for retrieval readiness: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Implement chunking strategies and metadata enrichment: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Generate embeddings and build a vector index: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Run retrieval experiments and tune top-k + filtering: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Package the indexing job for repeatable runs: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Ingest documents and normalize formats for retrieval readiness: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Implement chunking strategies and metadata enrichment: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Sections in this chapter
Section 3.1: Document ingestion patterns: batch, incremental, and CDC concepts

Ingestion is the controlled movement of documents from a source system into a retrieval-ready store. Start by classifying sources: object storage buckets (PDF/DOCX), wikis (HTML), ticketing systems, shared drives, and databases. Each class has different update behavior, which drives your ingestion pattern.

Batch ingestion is simplest: on a schedule, scan a location and reprocess everything. It’s acceptable for small corpora or early prototypes, but it becomes costly as you scale because you repeatedly parse and embed unchanged content. A common mistake is using batch forever and then being surprised by long indexing windows and exploding embedding costs.

Incremental ingestion processes only new or modified items since the last run. Practically, this means maintaining a watermark: a last-seen timestamp, an object ETag, or a content hash. For OCI Object Storage, you can use object metadata (last modified) plus ETag/versioning where available. Incremental runs should be idempotent: re-running the job should not create duplicates or inconsistent states.

CDC (Change Data Capture) concepts apply when your source is a system of record (database, content platform) that emits changes. The key idea is to ingest “events” (create/update/delete) rather than periodically scanning. Even if you cannot implement true CDC, you can emulate it by computing a stable content fingerprint per document and comparing it on each run. Do not ignore deletes—stale chunks remaining in the index are a major cause of hallucinated answers because the retriever keeps surfacing content that no longer exists.

Normalization belongs in ingestion: convert documents to a canonical text form (e.g., UTF-8), preserve headings and lists, extract tables carefully, and record extraction warnings. In OCI workflows, write the normalized text and a structured manifest (document_id, source_uri, extracted_at, hash) to Object Storage so downstream chunking and embedding can be rerun deterministically.

Section 3.2: Chunking and overlap: tradeoffs for recall and cost

Chunking turns normalized documents into retrieval units. The goal is to maximize the chance that the retriever returns a chunk containing the answer while minimizing noise and cost. There is no universal chunk size; you choose based on document style, question types, and token budgets.

A practical baseline is semantic-ish chunking by structure: split by headings, then by paragraphs, and only fall back to fixed token windows when sections are extremely long. Fixed-size chunking (e.g., 400–800 tokens) is easy to implement, but it can cut definitions in half or separate a procedure from its constraints. Structure-aware chunking tends to improve relevance because boundaries align with meaning.

Overlap (e.g., 10–20% of chunk length) reduces boundary loss: if the answer spans two chunks, overlap increases the odds at least one chunk contains the full context. The tradeoff is straightforward: overlap increases chunk count, embedding calls, index size, and sometimes near-duplicate retrieval results. A common mistake is applying large overlap everywhere; instead, use more overlap for narrative text and less for bullet-heavy specs where boundaries are already concise.

Engineering judgement: tune chunk size to your retrieval objective. If questions are “Where do I click?” or “What is the definition of X?”, smaller chunks often win. If questions require multi-step context (policy + exceptions), larger chunks may be necessary. Measure rather than guess: build a small golden set of queries and compare recall@k as you vary chunk size and overlap.

Finally, keep chunk identifiers stable. A good pattern is doc_id + section_path + chunk_index. Stable IDs make reindexing and deduplication possible, and they allow you to attach citations that are consistent across runs.

Section 3.3: Metadata design: source, access labels, freshness, and lineage

Metadata is what turns vector search from “similar text” into “correct text for this user, right now, from the right place.” You should design metadata deliberately before you embed anything, because changing metadata later can force large reindex operations.

Start with source metadata: source_uri (where it came from), source_type (pdf/wiki/db), title, author/owner if known, and a human-friendly citation label. This directly supports grounded answers with citations and enables debugging when retrieval returns surprising chunks.

Next is access control metadata. For enterprise QA, you must prevent retrieval of content the caller should not see. Add fields like access_group_ids, classification (public/internal/confidential), and optionally tenancy_id or business unit. The retrieval query should always apply filters based on the caller’s entitlements. A common mistake is applying access checks only at the UI or generation layer; by then you have already leaked content into the prompt or logs.

Freshness metadata lets you prefer the newest policy or exclude obsolete material. Store source_last_modified, ingested_at, and a computed content_version (hash). This supports ranking tweaks (boost recent) and enables you to find stale chunks during audits.

Lineage is how you make the pipeline testable: record the extractor version, chunker version, embedding model name/version, and indexing job run_id. When retrieval quality changes after a deployment, lineage allows you to correlate the change with pipeline modifications rather than guessing. In OCI, write the lineage fields into each indexed record and mirror them in your object-store manifests so you can reproduce an index build exactly.

Section 3.4: Embeddings: model choice, dimension, normalization, caching

Embeddings convert each chunk into a numeric vector that captures semantic similarity. Your embedding choices shape both retrieval quality and operational cost.

Model choice: prefer a model that matches your language and domain. If you have multilingual content, validate cross-language retrieval explicitly (query in English, answer in Japanese, etc.). If your documents contain many product codes or acronyms, test whether the model preserves those distinctions. In OCI Generative AI contexts, align with supported embedding models and note their intended use cases; exam scenarios often emphasize selecting appropriate models for retrieval versus generation.

Dimension affects index size and sometimes accuracy. Higher dimensions can represent nuance but cost more in storage and compute. Don’t assume “bigger is better”; run retrieval experiments with a fixed golden set and compare recall/precision at k. Also plan for model upgrades: store the embedding model identifier in metadata so you can run parallel indexes during migration.

Normalization: many vector databases and libraries assume cosine similarity; for cosine, L2-normalizing vectors is common. If your index uses inner product or Euclidean distance, ensure your embedding output and index metric match. A subtle but frequent bug is mixing normalized vectors with a distance metric that expects raw magnitudes, leading to degraded relevance that is hard to diagnose.

Caching and batching: embeddings are often the most expensive step. Cache by (chunk_text_hash, model_id) so reruns don’t re-embed unchanged chunks. Batch embedding calls to reduce overhead, but keep batches small enough to handle retries cleanly. Persist intermediate artifacts (chunk JSONL + embeddings) to Object Storage so you can rerun indexing without recomputing embeddings when only the index backend changes.

Section 3.5: Vector index options and query patterns on Oracle Cloud

On Oracle Cloud, you have multiple ways to host and query a vector index depending on scale, latency requirements, and the systems you already operate. The key is to separate concerns: (1) where vectors live, (2) how you filter by metadata, and (3) how you run similarity search efficiently.

A common enterprise pattern is to use an Oracle-managed datastore that supports vector search alongside rich filtering. In those setups, you model each chunk as a row/document with columns/fields for metadata plus a vector column/field for embeddings. This gives you two critical capabilities: hybrid querying (vector similarity + metadata predicates) and operational governance (backups, access policies, auditing). Another viable pattern is to keep vectors in a dedicated vector engine while storing metadata in a relational store; this can scale well but makes filtering and consistency harder because you must join results across systems.

Query patterns to practice for RAG: (1) top-k similarity for a user query embedding; (2) filtered top-k by access labels and source type; (3) time-bounded retrieval for “latest policy”; and (4) hybrid lexical + vector when exact terms matter (error codes, part numbers). If your platform supports it, hybrid search often improves precision for technical corpora because it blends semantic similarity with keyword constraints.

Tuning top-k is not “set it to 20 and forget it.” Larger k increases recall but also increases prompt length and the chance of injecting irrelevant context into generation. Start with k=5–10, then evaluate. If you frequently miss answers, increase k or improve chunking; if answers are verbose or off-topic, reduce k and add filters. Always log retrieval results (doc_id, score, metadata) so you can audit why a given answer was produced.

Section 3.6: Reindexing, deduplication, and drift: keeping the corpus healthy

Once your index exists, the real work begins: keeping it correct as the corpus evolves. Plan for reindexing as a first-class operation, not an emergency procedure.

Reindexing triggers include: new embedding model, changed chunking rules, metadata schema updates, or major source format changes (e.g., a wiki export layout). The safe approach is to build a new index in parallel, run retrieval regression tests against a golden set, then switch traffic. This avoids partial migrations where half the corpus is embedded differently.

Deduplication prevents repeated content from dominating retrieval. Duplicates come from mirrored sources, re-exported PDFs, and overlapping chunk strategies. Implement a two-layer defense: document-level dedupe using a normalized document hash, and chunk-level dedupe using a chunk_text_hash. When duplicates are legitimate (e.g., the same policy appears in two portals), preserve both but tag them with a canonical_source field so you can prefer the authoritative location.

Drift is the gradual mismatch between the index and reality: access rights change, documents are deleted, or “latest” guidance moves. Drift shows up as users seeing outdated answers or citations pointing to missing pages. Combat drift with scheduled incremental runs, delete propagation (tombstones), and freshness checks that alert when high-traffic documents haven’t been re-ingested recently.

Finally, package the indexing job for repeatable runs: configuration-driven source connectors, deterministic manifests, structured logs, and a run_id that ties together ingestion, chunking, embedding, and indexing. In OCI, this packaging supports operational handoff: you can run the job in a CI/CD pipeline or scheduled automation, capture logs centrally, and treat the vector index as a governed artifact rather than an ad-hoc cache.

Chapter milestones
  • Ingest documents and normalize formats for retrieval readiness
  • Implement chunking strategies and metadata enrichment
  • Generate embeddings and build a vector index
  • Run retrieval experiments and tune top-k + filtering
  • Package the indexing job for repeatable runs
Chapter quiz

1. Why does Chapter 3 emphasize designing ingestion and indexing as a repeatable pipeline rather than a one-time script?

Show answer
Correct answer: Enterprise corpora change frequently, so the index must stay aligned with the source of truth through batch/incremental processing and change tracking
The chapter highlights daily changes (revisions, moves, access changes) and the need for batch/incremental patterns and CDC so the index remains accurate and auditable.

2. What is the main trade-off described when choosing chunk sizes for retrieval?

Show answer
Correct answer: Larger chunks lower embedding count but can bury answers; smaller chunks improve pinpointing but increase index size and retrieval noise
Chunk size directly impacts both retrieval quality and operational cost (embedding count, index size, and noise).

3. According to the chapter, why is metadata considered essential in the ingestion + indexing layer?

Show answer
Correct answer: It enables filtering (e.g., access control, freshness, provenance) and supports confident citations/traceability
Metadata is described as the “third leg of the stool” because it enables critical filters and provenance/citation confidence.

4. What is the purpose of running retrieval experiments before connecting an LLM to generation?

Show answer
Correct answer: To tune relevance by varying top-k, applying filters, and comparing chunk sizes before generation is introduced
The chapter stresses tuning retrieval (top-k, filters, chunking) to improve relevance before any generation step is added.

5. Which outcome best matches the chapter’s definition of a well-built ingestion + vector index layer on OCI?

Show answer
Correct answer: A packaged indexing job that can run repeatedly (e.g., daily) and produces predictable artifacts and logs suitable for regression testing
The chapter’s practical outcome is a repeatable, auditable indexing job with predictable artifacts and logs for controlled change and testing.

Chapter 4: Assemble the RAG Application (Retriever + Generator)

In Chapters 1–3 you built the ingredients of Retrieval-Augmented Generation (RAG): data ingestion, chunking, embeddings, and a vector index. This chapter turns those ingredients into an application that behaves like an enterprise QA assistant: it retrieves the right evidence, uses the model to answer while staying grounded, and exposes reliable APIs for chat and document question answering.

A production RAG app is not “vector search + a prompt.” It is an orchestration of retrieval decisions (filters, query rewriting, hybrid search), relevance tuning (reranking hooks), generation controls (grounding, citations, refusals), and operational guardrails (timeouts, rate limits, logging). You will also add conversation memory in a way that helps the user without leaking sensitive context across sessions.

Throughout this chapter, keep one engineering principle in mind: every step should either improve relevance, improve safety, or improve reliability. If a step cannot be measured or observed (via logs/metrics), it will be hard to debug later. By the end, you will be able to run an end-to-end demo on a representative dataset and explain how each component maps to OCI Generative AI exam objectives and real-world implementation choices.

Practice note for Implement retrieval with filters, reranking hooks, and citations: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Build prompting templates for grounded answers and tool-style behavior: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Create an API layer for chat + QA workflows: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Add conversation memory without leaking sensitive context: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Run end-to-end demos on a representative dataset: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Implement retrieval with filters, reranking hooks, and citations: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Build prompting templates for grounded answers and tool-style behavior: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Create an API layer for chat + QA workflows: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Add conversation memory without leaking sensitive context: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Run end-to-end demos on a representative dataset: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Sections in this chapter
Section 4.1: Query rewriting and intent detection for better retrieval

Section 4.1: Query rewriting and intent detection for better retrieval

Retrieval quality often fails before vector search even runs: the user’s question may be ambiguous, missing nouns, or overloaded with chatty context. Query rewriting is a lightweight step that converts the raw message into a search-ready query that matches your document language. Pair it with intent detection so you can choose the right retrieval strategy (QA vs. policy lookup vs. troubleshooting vs. “how-to”).

Practical workflow: first, classify intent and constraints. For example, detect whether the user is asking for (a) a factual answer with citations, (b) a procedural checklist, (c) a comparison, or (d) an action request outside scope. Then rewrite the query into 1–3 concise search queries, preserving key entities (product names, dates, region, service). In an OCI setting, you might also map to metadata filters (compartment, doc type, business unit, effective date) to reduce noise and improve security boundaries.

  • Rewrite pattern: “User asks: ‘Does this apply to us in EMEA?’” → “EMEA applicability of <policy name>” and “regional exceptions EMEA <policy name>”.
  • Constraint extraction: pull out time (“as of 2025”), location (“EMEA”), system (“OCI Logging”), or document type (“runbook”).
  • Refuse early: if intent is disallowed (e.g., requesting secrets), short-circuit before retrieval to avoid spreading sensitive terms into logs and vector queries.

Common mistakes include rewriting that drops critical nouns (“it,” “this policy”) or expanding the query with hallucinated details. To avoid that, enforce a rule: rewritten queries must only use entities present in the conversation or in a controlled glossary. Track this in logs: raw query, rewritten query, detected intent, and any extracted filters. These artifacts become your debugging trail when retrieval seems “random.”

Section 4.2: Reranking and hybrid search concepts: lexical + vector

Section 4.2: Reranking and hybrid search concepts: lexical + vector

Vector search is strong at semantic similarity but can miss exact matches (error codes, part numbers, legal terms). Lexical search (BM25-style keyword matching) is strong at exactness but weak on paraphrase. Hybrid search combines both, typically by retrieving candidates from both systems and merging them, then reranking the merged set.

A practical pattern is a two-stage retriever: Stage 1 retrieves a broad candidate set (e.g., top 50) using vector similarity, lexical match, or both. Stage 2 reranks those candidates using a cross-encoder reranker or a model-based scoring hook. Even if you do not yet deploy a dedicated reranker model, design the hook now: it is an interface that takes (query, chunk) pairs and returns a relevance score. Your OCI implementation can start with a weighted blend of scores (vector score + keyword score + metadata boosts) and later upgrade to a learned reranker.

  • Metadata boosts: prefer newer documents, “official” sources, or specific doc types (e.g., “policy” outranks “wiki comment”).
  • Filters first, then rank: apply security and tenant filters before ranking so irrelevant documents never enter the candidate pool.
  • Diversity control: cap chunks per document to avoid returning 10 near-duplicate chunks from one PDF.

Citations depend on stable chunk identifiers. If reranking changes the top-k, your citation list should follow the reranked order, and you should keep enough text for quoting (e.g., 1–2 sentences) without exceeding prompt limits. A common mistake is retrieving too few candidates (top 5) and then attempting reranking; reranking cannot rescue missing evidence. Another mistake is using hybrid retrieval but forgetting to normalize scores, causing one subsystem to dominate and degrade results. Build a small tuning harness: run a golden set of questions, track Recall@K, and compare configurations (vector-only vs lexical-only vs hybrid + rerank).

Section 4.3: Prompt engineering for RAG: grounding, citations, refusals

Section 4.3: Prompt engineering for RAG: grounding, citations, refusals

The generator is where relevance becomes an answer, and where safety failures become user-facing incidents. Your prompt must clearly define grounded behavior: use the retrieved context, cite sources, and refuse when evidence is missing. Treat the prompt as a contract between your application and the model.

A robust RAG prompt typically includes: (1) role and task, (2) grounding rules, (3) citation format, (4) refusal criteria, and (5) response style. Keep the rules short and testable. For example: “If the answer is not explicitly supported by the provided sources, say you cannot confirm and ask a clarifying question.” This reduces hallucinations more effectively than adding more context.

  • Grounding block: provide retrieved chunks with IDs, titles, and URLs. Use a consistent delimiter so you can parse citations later.
  • Citation rules: require at least one citation per factual claim, or per paragraph in longer answers.
  • Refusal rules: refuse requests for secrets, personal data, or instructions that violate policy; also refuse when sources do not support the claim.

Engineering judgment: do not overload the prompt with 20 chunks “just in case.” Too much context increases confusion and can lower accuracy. Prefer fewer, higher-quality chunks after reranking. Another common mistake is mixing chat history directly into the context block; this can cause the model to cite user messages as “sources.” Separate “conversation” from “evidence,” and only the evidence block should be eligible for citations.

Finally, make refusals user-helpful: suggest what evidence is needed (“I don’t have a source covering EMEA exceptions; do you have the policy revision date?”) or propose a safe next step. In enterprise settings, refusals are part of user trust, not a failure mode to hide.

Section 4.4: Response shaping: JSON outputs, schemas, and structured extraction

Section 4.4: Response shaping: JSON outputs, schemas, and structured extraction

Many RAG apps fail not because the answer is wrong, but because the output is inconsistent and hard to integrate. Response shaping solves this by asking the model to produce structured outputs (often JSON) that your API layer can validate. This is especially important for tool-style behavior: returning “answer + citations + follow-ups + extracted fields” in a predictable format.

Define a schema that matches your product needs. A common pattern is:

  • answer_text: the final user-facing answer
  • citations: array of {chunk_id, title, url, quote}
  • confidence: low/medium/high based on evidence coverage
  • follow_up_questions: clarifiers when evidence is incomplete
  • extracted_entities: optional key-value fields (service name, region, version)

Then enforce it. Validate JSON strictly server-side; if the model returns invalid JSON, re-ask with a repair prompt (“Return only valid JSON that matches this schema…”). Avoid letting invalid outputs silently pass to clients, because that becomes a reliability bug disguised as “model variability.”

Structured extraction is also how you turn RAG into workflows. For example, a troubleshooting assistant can extract “error_code,” “component,” and “recommended_runbook,” then your app can fetch the runbook and run a second RAG pass. Keep extraction grounded: require that each extracted field include a citation or “unknown.” The common mistake is to treat extracted fields as authoritative even when the context is weak; instead, mark uncertain fields and request confirmation.

Section 4.5: Session and memory design: short-term, long-term, and summarization

Section 4.5: Session and memory design: short-term, long-term, and summarization

Conversation memory improves usability, but it is also a major source of data leakage. Design memory intentionally: what should persist, for how long, and under what security boundaries. Separate short-term context (recent turns) from long-term preferences (user role, preferred format) and from retrieved evidence (documents). Never treat memory as evidence.

A practical design uses three stores:

  • Short-term buffer: last N turns (e.g., 6–10) for coherence; redact sensitive tokens before storage.
  • Long-term profile: explicit user preferences (tone, format), stored with consent and scoped to the user/tenant.
  • Summary memory: periodically summarize older conversation into a compact, non-sensitive note (“User is investigating OCI Logging retention settings for project X”).

Summarization is not just compression; it is policy enforcement. Your summarizer prompt should explicitly drop secrets, credentials, and personal identifiers, and keep only task-relevant facts. Common mistakes include summarizing verbatim (which preserves secrets) or summarizing with invented details. To mitigate, require the summary to be derived only from the conversation, and run a lightweight PII/secret scan before storing it.

When you retrieve documents, do not store retrieved chunks in memory by default. That can accidentally persist licensed or sensitive text beyond its intended scope. Instead, store references (chunk IDs) and re-retrieve as needed. This also keeps citations consistent with the index and reduces prompt bloat over time.

Section 4.6: API and app patterns: REST endpoints, rate limits, and timeouts

Section 4.6: API and app patterns: REST endpoints, rate limits, and timeouts

Your RAG system becomes real when it is callable as a service. A clean API layer lets you support chat, single-shot QA, batch evaluation, and admin operations (index refresh, health checks). Start with a small set of REST endpoints and make reliability features first-class: timeouts, retries, idempotency, and rate limits.

A practical minimal API set:

  • POST /qa: takes {question, filters, session_id}; returns structured JSON with answer + citations.
  • POST /chat: takes {message, session_id}; uses memory + retrieval; returns assistant turn + citations.
  • POST /ingest: admin-only; triggers ingestion for new documents and returns job status.
  • GET /health: verifies index connectivity and model availability.

Timeout budgeting is the difference between a “demo” and a product. Assign budgets per stage: query rewrite (small), retrieval (bounded), rerank (bounded), generation (largest). If generation times out, return a partial response with citations and a message to retry, rather than hanging. Rate limits should protect both cost and stability; implement per-user and per-tenant quotas, and return clear error payloads for clients to handle gracefully.

To run an end-to-end demo on a representative dataset, script a few scenarios: (1) a straightforward factual question with citations, (2) a question requiring filters (region, date), (3) a query that needs rewriting, (4) a case where evidence is missing and the system must refuse or ask a clarifier. Log each stage (intent, rewritten query, retrieved chunk IDs, reranked top-k, prompt token count, latency). These logs are your practical link between exam objectives—retrieval, grounding, guardrails, evaluation—and a working OCI-hosted RAG application.

Chapter milestones
  • Implement retrieval with filters, reranking hooks, and citations
  • Build prompting templates for grounded answers and tool-style behavior
  • Create an API layer for chat + QA workflows
  • Add conversation memory without leaking sensitive context
  • Run end-to-end demos on a representative dataset
Chapter quiz

1. Why does Chapter 4 emphasize that a production RAG app is not simply “vector search + a prompt”?

Show answer
Correct answer: Because production RAG requires orchestration across retrieval decisions, relevance tuning, generation controls, and operational guardrails
The chapter frames production RAG as a coordinated system: retrieval (filters/query rewriting/hybrid), reranking, grounded generation with citations/refusals, and reliability guardrails.

2. Which set of components best matches the chapter’s categories of controls in a RAG application?

Show answer
Correct answer: Retrieval decisions, reranking hooks, generation controls, and operational guardrails
The chapter explicitly groups the work into retrieval decisions, relevance tuning via reranking hooks, generation controls (grounding/citations/refusals), and operational guardrails.

3. What is the primary purpose of adding citations in the RAG generator stage as described in Chapter 4?

Show answer
Correct answer: To ensure answers are grounded in retrieved evidence and auditable
Citations support grounded answers by tying claims back to retrieved evidence, aligning with the chapter’s focus on safety and reliability.

4. How should conversation memory be implemented according to the chapter’s guidance?

Show answer
Correct answer: Add memory in a way that helps the user while preventing sensitive context from leaking across sessions
Chapter 4 calls for conversation memory that improves the experience without leaking sensitive context between sessions.

5. What engineering principle should guide each step when assembling the RAG application in Chapter 4?

Show answer
Correct answer: Every step should either improve relevance, improve safety, or improve reliability, and be observable via logs/metrics
The chapter stresses measurable improvements (relevance/safety/reliability) and observability (logs/metrics) to support debugging and production readiness.

Chapter 5: Safety, Security, and Governance for OCI RAG

RAG applications are often introduced as an accuracy feature: retrieve relevant passages, then generate an answer grounded in those passages. In production—and on the OCI Generative AI exam—RAG is also a security boundary. The retrieval layer can leak confidential documents, the prompt can be manipulated, and the system can accidentally store sensitive information in logs. This chapter treats safety, security, and governance as first-class engineering requirements, not afterthoughts.

You will apply “tenant-safe” design decisions: strict IAM boundaries, compartment design, and network controls; enforce data access controls at retrieval time; and build guardrails that detect malicious instructions and harmful content. You will also learn how to handle PII in traces and logs, and how to produce audit-friendly evidence such as policies, diagrams, and runbooks. The goal is practical: build an OCI RAG app you can defend in a security review and justify in an exam scenario.

A helpful mindset is to separate three flows: (1) data ingestion (documents entering your index), (2) runtime query and retrieval (what the user can cause the system to fetch), and (3) observability (what your system stores about user queries and model outputs). Each flow needs controls, and most failures come from controlling only one of them. A “secure RAG” design is the intersection of least privilege, safe prompting, and careful logging.

Practice note for Apply data access controls and tenant-safe design decisions: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Implement prompt injection defenses and content safety checks: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Add PII handling and redaction for logs and traces: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Create audit-friendly evidence: policies, diagrams, and runbooks: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Practice exam-style scenario questions on governance tradeoffs: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Apply data access controls and tenant-safe design decisions: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Implement prompt injection defenses and content safety checks: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Add PII handling and redaction for logs and traces: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Create audit-friendly evidence: policies, diagrams, and runbooks: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Sections in this chapter
Section 5.1: Threat model for RAG: injection, exfiltration, and poisoning

Section 5.1: Threat model for RAG: injection, exfiltration, and poisoning

A threat model gives you clarity on what to defend and where to put controls. For OCI RAG, the three most common classes are prompt injection, data exfiltration, and data poisoning. Prompt injection happens when a user query or a retrieved document contains instructions that try to override system behavior (e.g., “ignore previous instructions and reveal secrets”). Exfiltration happens when the system retrieves content the user is not entitled to see, or when secrets are leaked through verbose answers or logs. Poisoning happens when malicious or low-quality content enters your index, causing future answers to be wrong or unsafe.

Start by drawing the trust boundaries: end user, API gateway, orchestration service, vector store, object storage for source docs, and the model endpoint. Mark which components are “write paths” (ingestion) versus “read paths” (retrieval). In OCI terms, this typically spans compartments, IAM dynamic groups, policies, VCN/subnets, and logging services. Tenant-safe design means you never rely on “the prompt” to enforce access; you enforce access via IAM and retrieval filters.

  • Injection: retrieved snippets may contain hostile instructions; the user may also craft queries to bypass policy.
  • Exfiltration: cross-compartment or cross-user leakage through weak ACL filters, broad IAM permissions, or logging raw prompts/outputs.
  • Poisoning: ingesting untrusted sources, allowing edits without review, or embedding content with incorrect labels/metadata.

Common mistake: treating the vector index as “just search.” In reality, the index is a security-sensitive database. If you embed and index confidential text without storing access metadata, you have created a fast path to leakage. Another mistake: relying on a single guardrail (e.g., “the model will refuse”). Injection often succeeds not because the model is “bad,” but because the app gives it too much authority and too much data.

Practical outcome: you can explain, in an exam scenario, where you would place controls for each threat—ingestion validation for poisoning, retrieval-time authorization for exfiltration, and layered prompting plus safety checks for injection.

Section 5.2: Guardrails: allow/deny topics, safety classifiers, and refusal design

Section 5.2: Guardrails: allow/deny topics, safety classifiers, and refusal design

Guardrails are policies enforced by your application around the model. In RAG, guardrails must be applied both before retrieval (to stop obviously disallowed requests) and after generation (to catch unsafe output). A practical pattern is: (1) classify the user request, (2) decide allowed/denied topics, (3) retrieve only if allowed, (4) generate with a constrained system prompt, and (5) run an output safety check.

Allow/deny topic controls should be explicit. For enterprise RAG, create a policy table: allowed business tasks (e.g., “summarize policy,” “answer HR questions”), disallowed tasks (e.g., “write malware,” “bypass security”), and conditional tasks (e.g., “legal advice requires a disclaimer and citations”). Implement the policy as code, not as a document: a small rules engine or configuration file versioned in Git. This supports repeatable behavior and auditability.

Safety classifiers can be model-based or rule-based. Rule-based checks catch straightforward issues (PII patterns, profanity, obvious jailbreak phrases). Model-based classifiers can detect nuanced harmful intent, but they must be tuned to reduce false positives for your domain. A common engineering judgment is to combine them: rules for high-precision blocking, classifier scores for risk-based decisions (allow, refuse, or require escalation).

  • Refusal design: refuse with a short reason, offer safe alternatives, and avoid repeating harmful content.
  • Don’t leak policy internals: refusal text should not reveal exact triggers or system prompts.
  • Consistent UX: users should see predictable outcomes across similar requests.

Common mistake: placing all guardrails in the system prompt and trusting compliance. Prompts are helpful, but they are not enforcement. Another mistake: only filtering outputs. If you retrieve sensitive text first, you may already have violated data-handling rules even if you refuse later. Practical outcome: you can describe a layered guardrail pipeline and justify why checks occur before and after retrieval.

Section 5.3: Secure retrieval: ACL filters, row-level security concepts, least privilege

Section 5.3: Secure retrieval: ACL filters, row-level security concepts, least privilege

Secure retrieval is the core “tenant-safe” control in RAG. The retrieval system must return only documents the caller is authorized to see, regardless of how the prompt is written. Treat your vector store like a database with row-level security: each chunk (or document) carries access metadata, and every query applies an authorization filter.

Implement this with two design steps. First, at ingestion time, attach metadata such as compartment, business unit, document classification, and allowed principals/groups. The key is that the metadata is derived from a trusted source (document management system, object storage prefix policies, or an ingestion service running with controlled IAM). Second, at query time, compute the caller’s entitlements from IAM (groups, claims, tenancy/compartment context) and apply filters to vector search. If your vector database does not support expressive filters, you must enforce the filter in an outer layer (retrieve candidates, then hard-filter) and accept the performance hit—security comes first.

  • Least privilege: the runtime service identity should have read access only to the index and documents it needs, not broad Object Storage access.
  • Separation of duties: ingestion roles differ from query roles; do not reuse the same dynamic group for both.
  • Network controls: keep retrieval components in private subnets; use service gateways/private endpoints where possible.

Common mistake: filtering only by “tenant” but not by department or project, which becomes a cross-team data leak. Another mistake: storing ACL metadata in a separate system and forgetting to join it at query time. The exam often frames this as a governance tradeoff: performance and simplicity versus strict authorization. The correct posture in regulated environments is deterministic authorization, even if it reduces recall or increases latency.

Practical outcome: you can articulate how to implement row-level security concepts for vector retrieval, and how IAM/compartment design supports those controls on OCI.

Section 5.4: Privacy and compliance basics: PII, retention, and logging minimization

Section 5.4: Privacy and compliance basics: PII, retention, and logging minimization

Privacy failures in RAG systems rarely come from the model itself; they come from what the application stores. Prompts, retrieved passages, and generated answers are highly likely to contain PII, credentials, internal identifiers, or confidential text. Your default posture should be logging minimization: store the least amount of data needed for troubleshooting and evaluation, and keep it for the shortest acceptable time.

Start by classifying your observability data. Separate metrics (counts, latencies, error rates), traces (request path and timings), and content logs (prompts, retrieved chunks, outputs). Metrics are usually safe; content logs are the highest risk. For PII handling, implement redaction at the edge: before writing any logs, detect common PII (emails, phone numbers, national IDs) and redact or hash them. Prefer irreversible hashing for identifiers you only need for correlation. If you must store text for debugging, use sampling, strict access control, and short retention.

  • Retention policy: set explicit retention for logs and stored conversations; document the rationale.
  • Data residency: ensure storage locations meet regulatory requirements (region selection matters).
  • Access: restrict who can view traces/content; treat it as sensitive data with audit trails.

Common mistake: enabling verbose request/response logging in production “temporarily” and never disabling it. Another mistake: capturing retrieved source passages verbatim in centralized logs, effectively duplicating confidential documents into a broader-access system. Practical outcome: you can describe a compliant logging strategy with PII redaction, minimal retention, and controlled access—an answer that maps cleanly to governance expectations in exam scenarios.

Section 5.5: Model risk and hallucinations: grounding checks and confidence signals

Section 5.5: Model risk and hallucinations: grounding checks and confidence signals

Hallucinations are a governance issue because they can create business risk: wrong procedures, incorrect policy advice, or fabricated citations. In RAG, you reduce hallucinations by increasing grounding and by signaling uncertainty to downstream systems and users. The objective is not “zero hallucinations,” but controlled behavior with measurable quality.

Grounding checks begin at retrieval. If retrieval returns low-relevance chunks, generation becomes guesswork. Use practical confidence signals: top-k similarity scores, score gaps between the first and subsequent results, and the presence/absence of required policy sections. Establish thresholds: if similarity is below a configured value, the assistant should ask a clarifying question or refuse with “I don’t have enough information in the approved sources.” This is an engineering judgment that trades answer rate for correctness.

Require citations and enforce them. A robust approach is to include document IDs and chunk identifiers in the context and instruct the model to cite them. Then add a post-generation validator that checks whether the answer contains citations and whether cited chunks were actually retrieved. If validation fails, regenerate with stricter instructions or return a safe fallback. Another practical technique is “quote then explain”: include short quoted spans from sources, then the model’s explanation. This improves auditability and user trust.

  • Refuse when ungrounded: design refusals for missing evidence, not just unsafe topics.
  • Regression tests: track hallucination rate on a golden set as you tune prompts and retrieval.
  • Change control: prompt updates are production changes; version and review them.

Common mistake: treating a good-looking answer as “correct” without verifying it is supported by retrieved content. Another mistake: measuring only user satisfaction, not factuality/grounding. Practical outcome: you can implement grounding validation and confidence-based behaviors that reduce risk and support exam objectives around guardrails, citations, and evaluation.

Section 5.6: Governance artifacts for exams: controls mapping and documentation

Section 5.6: Governance artifacts for exams: controls mapping and documentation

Governance is how you prove you did the right things. On the exam, scenario questions often ask which control or artifact best addresses an auditor, a security team, or a compliance requirement. In real projects, the same artifacts reduce friction and speed approvals. Your goal is “audit-friendly evidence”: documents that map risks to controls and show operational readiness.

Create a controls map that ties each risk area to specific OCI and application controls. Example: “prevent cross-user data leakage” maps to compartment strategy, IAM policies, dynamic groups for runtime identity, and retrieval-time ACL filtering. “Detect prompt injection” maps to input classifiers, system prompt constraints, and refusal behaviors. “Protect PII in logs” maps to redaction middleware, restricted log access, and retention settings. Keep the map short and actionable—auditors and exam graders look for clarity, not volume.

  • Architecture diagram: include trust boundaries, data stores, and where filtering/redaction occurs.
  • Runbooks: incident response steps for leakage reports, model misbehavior, and ingestion poisoning.
  • Operational checklists: prompt/version change review, index rebuild procedure, access review cadence.

Common mistake: producing a beautiful diagram with no indication of enforcement points (where authorization is checked, where secrets are stored, where logs are redacted). Another mistake: writing policies that do not match implementation—an exam trap is choosing a “policy” answer when the scenario demands a technical enforcement control. Practical outcome: you can select and justify governance tradeoffs, and you can produce artifacts that demonstrate security, privacy, and reliability for OCI RAG systems.

Chapter milestones
  • Apply data access controls and tenant-safe design decisions
  • Implement prompt injection defenses and content safety checks
  • Add PII handling and redaction for logs and traces
  • Create audit-friendly evidence: policies, diagrams, and runbooks
  • Practice exam-style scenario questions on governance tradeoffs
Chapter quiz

1. In Chapter 5, why is the retrieval layer treated as a security boundary in a production OCI RAG app?

Show answer
Correct answer: Because it can expose confidential documents if access controls aren’t enforced at retrieval time
The chapter emphasizes that retrieval can leak sensitive content, so controls must be enforced at retrieval time as part of a tenant-safe design.

2. Which set of controls best represents the chapter’s “tenant-safe” design decisions?

Show answer
Correct answer: Strict IAM boundaries, compartment design, and network controls
Tenant-safe design is described as combining IAM boundaries, compartment structure, and network controls to prevent cross-tenant or unauthorized access.

3. What is the main purpose of implementing prompt injection defenses and content safety checks in an OCI RAG system?

Show answer
Correct answer: To detect malicious instructions and reduce harmful content in generated outputs
The chapter frames guardrails as mechanisms to detect malicious instructions and harmful content, addressing prompt manipulation risks.

4. The chapter recommends separating controls across three flows. Which flow is most directly about preventing sensitive data from being stored in logs and traces?

Show answer
Correct answer: Observability (what the system stores about user queries and model outputs)
Observability covers what gets recorded in logs/traces, where PII handling and redaction are needed to avoid storing sensitive information.

5. In an exam-style governance scenario, which approach best matches the chapter’s definition of a “secure RAG” design?

Show answer
Correct answer: Use least privilege, safe prompting, and careful logging together across ingestion, retrieval, and observability
The chapter defines secure RAG as the intersection of least privilege, safe prompting, and careful logging, applied across all three flows.

Chapter 6: Testing, Evaluation, Deployment, and Final Exam Readiness

A RAG application is only “done” when it is measurable, testable, deployable, and operable. In earlier chapters you built ingestion, chunking, embeddings, vector indexing, retrieval, and grounded generation. Now you will turn that pipeline into an engineering system: you will define what “good” means, prove it with offline metrics, protect it with regression tests, ship it with safe deployment patterns, and run it with observability and rollback. This chapter also aligns those activities to OCI Generative AI exam objectives so you can reason through scenarios under time pressure.

A practical mental model is to treat your RAG app as two coupled subsystems: retrieval and generation. Retrieval determines whether the model can possibly answer correctly; generation determines whether it answers faithfully and safely given the retrieved evidence. Your evaluation set, metrics, and tests must therefore separate retrieval failures (no relevant chunks found) from generation failures (hallucination, missing citations, unsafe phrasing), because the fixes are different. The goal is to create an improvement loop you can run repeatedly: adjust chunking/indexing/ranking, rerun retrieval metrics; adjust prompts/guardrails, rerun faithfulness and citation checks; then verify latency and cost constraints with caching and batching.

Finally, you will operationalize the service on OCI with clear environment separation, configuration discipline, and observability. That last step is where many RAG teams stumble: they validate locally, then discover in production that network egress, IAM policies, rate limits, and logging choices affect latency and failure modes. By the end of this chapter, you should be able to ship a RAG service that you can test automatically, monitor confidently, and improve intentionally—and you will have a concrete exam readiness plan to close gaps quickly.

Practice note for Build an evaluation set and compute offline RAG metrics: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Create automated regression tests for retrieval and generation: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Deploy the service with observability and rollback strategy: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Optimize latency and cost with caching and batching: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Run a final readiness sprint: timed practice and gap closure: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Build an evaluation set and compute offline RAG metrics: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Create automated regression tests for retrieval and generation: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Deploy the service with observability and rollback strategy: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Sections in this chapter
Section 6.1: Evaluation design: golden sets, user stories, and acceptance criteria

Section 6.1: Evaluation design: golden sets, user stories, and acceptance criteria

Start evaluation by deciding what you are trying to prove. A RAG system is not evaluated by “does it sound good,” but by whether it reliably supports specific user workflows. Convert stakeholder needs into user stories (e.g., “As a support engineer, I need the precise troubleshooting steps for error X with references to the official runbook”). For each story, define acceptance criteria that are checkable: must include a citation, must name the exact policy clause, must answer within N seconds, must refuse to answer when sources are missing.

From those stories, build a golden set (also called an evaluation set): a curated list of questions, expected answer traits, and expected supporting sources. Keep it small enough to maintain (often 50–200 items) but representative: include easy lookups, multi-hop questions that require combining two documents, ambiguous queries, and “negative” queries where the answer should be “not found” or “requires human review.” Store metadata alongside each item: document IDs, section anchors, effective dates, and whether the query is time-sensitive.

  • Coverage: Include all major document types (policies, runbooks, tickets, FAQs), plus edge cases like scanned PDFs or tables.
  • Stability: Version the golden set in Git; tie each question to a snapshot of the corpus so regressions are attributable to changes.
  • Separability: Tag each item as retrieval-critical, generation-critical, or both, so you can isolate failures.

Common mistakes include using only “happy path” questions, writing expected answers that are too vague to score, and changing the corpus without updating golden sources. A practical outcome of this section is a living evaluation artifact that enables offline testing before you deploy: your team can argue about acceptance criteria once, then let metrics and tests enforce them continuously.

Section 6.2: Metrics for RAG: recall@k, faithfulness, citation coverage, latency

Section 6.2: Metrics for RAG: recall@k, faithfulness, citation coverage, latency

Offline RAG metrics are your fast feedback loop. Begin with retrieval metrics because they cap system performance: if relevant content is not retrieved, generation cannot succeed. The most practical retrieval metric is recall@k: for each golden question, does the top-k retrieved set contain at least one chunk that maps to the expected source section? Choose k based on your prompt context budget (often 3–8). Track recall@1 and recall@5 separately; recall@1 reflects ranking quality, while recall@5 reflects embedding/index coverage.

Then measure generation quality with faithfulness (sometimes called groundedness): does the answer’s claims match the retrieved text? You can approximate this offline by extracting atomic claims from the answer and verifying they are entailed by the retrieved passages, using either deterministic heuristics (string/number matching, citation-required rules) or an LLM-as-judge rubric that checks “supported vs unsupported.” Keep the rubric strict and consistent; a permissive judge hides hallucinations.

Citation coverage is a metric that links your guardrails to user trust: what fraction of key answer statements have citations, and do those citations point to the correct chunks? A simple operational rule is “every paragraph must have at least one citation,” but stronger is “every factual claim must be cited.” Track missing citations as a separate failure class from wrong citations.

Finally, measure latency end-to-end and by stage: retrieval time, rerank time (if used), model generation time, and any post-processing (citation formatting, safety filtering). Combine this with cost metrics: token usage, embedding calls, vector queries. This is where caching and batching become engineering levers: cache embeddings for repeated queries; cache retrieval results for popular questions; batch embedding during ingestion; and batch vector inserts. The practical outcome is a scorecard you can run on every change: recall@k, faithfulness, citation coverage, and p95 latency, with thresholds tied to the acceptance criteria you defined in Section 6.1.

Section 6.3: Test harnesses: unit, integration, and prompt regression testing

Section 6.3: Test harnesses: unit, integration, and prompt regression testing

Metrics tell you “how good,” but tests tell you “what broke.” Build an automated test harness that runs in CI and covers three layers: unit tests, integration tests, and prompt regression tests. Unit tests validate deterministic code paths: chunking boundaries, metadata extraction, PII redaction, citation formatter, and query normalization. These tests should not call external OCI services; use fixtures and snapshots.

Integration tests validate the RAG pipeline across service boundaries: ingest a small known corpus, embed, index, retrieve, and generate. Keep it small and stable so failures are actionable. In OCI terms, integration tests often run against a non-production compartment with constrained policies and budgets. Validate IAM assumptions explicitly: a common failure is that local credentials have broad permissions, while the CI runtime or deployment runtime does not.

Prompt regression testing is uniquely important in RAG because prompt edits can improve one scenario while degrading another. Use your golden set as a regression suite: run the same prompts and compare outputs with tolerant assertions. Avoid brittle string equality; instead assert properties such as “contains a citation,” “does not mention forbidden topics,” “matches expected document ID,” “answers ‘unknown’ when retrieval confidence is low.” Keep a “known issues” allowlist but timebox it—allowlists tend to grow and silently normalize bad behavior.

  • Determinism controls: fix temperature/top_p for tests; log model version and parameters.
  • Failure triage: when a test fails, capture retrieved chunks, prompt, model response, and scoring output as an artifact.
  • Regression gates: block merges if recall@k or faithfulness drops beyond a threshold.

The practical outcome is confidence: you can refactor chunking, switch an embedding model, or tune top-k and reranking without guessing. Your harness will tell you if retrieval degraded, if citations broke, or if generation started hallucinating.

Section 6.4: Deployment patterns: environments, CI/CD basics, and configuration

Section 6.4: Deployment patterns: environments, CI/CD basics, and configuration

Deploying a RAG service on OCI is primarily about controlling change. Use multiple environments—at minimum dev, staging, prod—with separate compartments, network controls, and policies. Staging should mirror production shape: same subnet types, similar access to OCI Generative AI endpoints, same logging and rate limits. If staging is “too small” or “too permissive,” you will learn the wrong lessons.

Adopt CI/CD basics that favor reproducibility: build an immutable artifact (container image), run unit tests and golden-set regressions, then deploy to staging with infrastructure-as-code (Terraform/OCI Resource Manager). Promote to production only if staging passes SLO and correctness gates. Configuration should be externalized (environment variables or OCI Vault/Secrets), not baked into images: model OCID, top-k, rerank flag, max context tokens, safety thresholds, and logging verbosity.

Plan rollback as a first-class feature. Keep the prior version deployable, and ensure configuration changes are versioned too. Two common rollback pitfalls in RAG are (1) changing the index schema or embedding model without a parallel index, and (2) re-chunking the corpus without preserving prior chunk IDs, which breaks citations. A safer pattern is blue/green or canary deployment: route a small percentage of traffic to the new version, compare metrics (latency, refusal rate, citation coverage), then ramp up.

Optimize latency and cost during deployment design, not after. Add caching layers deliberately: cache embeddings for repeated queries, cache retrieval results for frequent questions, and consider response caching when the corpus is stable and the query is identical. Use batching in ingestion pipelines (embedding calls, vector inserts) and consider asynchronous ingestion so production traffic is not impacted. The practical outcome is a deployment process you can repeat under exam-style constraints: least privilege IAM, predictable networking, observable services, and a rollback story when changes regress quality.

Section 6.5: Operations: dashboards, alerting, incident playbooks, and SLOs

Section 6.5: Operations: dashboards, alerting, incident playbooks, and SLOs

Once deployed, you need to know whether the system is healthy and whether answers remain trustworthy. Build dashboards that reflect both platform health and RAG-specific quality signals. At the platform layer, track request rate, error rate, and latency (p50/p95/p99). At the RAG layer, track retrieval confidence, top-k hit distribution, citation coverage, refusal rate, and “no relevant context found” frequency. Use structured logs so you can correlate each answer with retrieved chunk IDs and model parameters.

Alerting should be tied to SLOs (service level objectives). Define a small set that matters: p95 latency under a threshold, error rate under a threshold, and a quality SLO such as “citation coverage above X%” or “faithfulness violations below Y%,” measured via periodic offline sampling. Avoid alert storms by using burn-rate alerts for SLOs rather than raw metric spikes.

Create incident playbooks that map symptoms to likely causes and actions. Example: if latency spikes, check model endpoint latency vs vector search latency; if retrieval recall drops, check index freshness and ingestion jobs; if faithfulness drops, inspect prompt changes, context truncation, or top-k reductions. Include rollback steps and communication templates. Treat “model changed behavior” as an expected incident class: log model versions, and schedule periodic re-evaluations of the golden set.

  • Data protection: log redaction, access controls for transcripts, and Vault-managed secrets.
  • Grounding enforcement: reject answers when retrieval returns low-confidence context; prefer safe refusals over speculative answers.
  • Regression cadence: nightly golden-set runs; weekly expanded evaluations on a larger sample.

The practical outcome is operational maturity: you can detect regressions quickly, respond consistently, and demonstrate to auditors and stakeholders that the system is controlled, monitored, and aligned to enterprise expectations.

Section 6.6: Exam strategy: scenario reasoning, elimination tactics, and review plan

Section 6.6: Exam strategy: scenario reasoning, elimination tactics, and review plan

The OCI Generative AI exam is scenario-heavy: you are expected to choose the most appropriate action given constraints (security, cost, latency, correctness). Anchor your reasoning in the lifecycle you practiced: define objectives → implement RAG pipeline → add guardrails → evaluate → deploy → operate. When reading a question, identify which stage is failing. If users report wrong answers with correct sources retrieved, it is a generation/faithfulness issue; if answers are vague and uncited, it is likely prompt/citation enforcement; if answers are “I don’t know” too often, it may be retrieval recall, chunking, or top-k.

Use elimination tactics grounded in OCI best practices. Prefer least privilege IAM and compartment separation; prefer staging that mirrors production; prefer observability with logs/metrics; prefer rollback-capable deployments. When multiple choices appear plausible, choose the one that reduces risk while preserving measurability—for example, adding a golden set and offline metrics is often a better first step than changing models blindly.

Run a final readiness sprint over 5–7 days. Timebox daily sessions: (1) review one objective area (IAM/networking, RAG components, evaluation/operations), (2) implement or re-implement a small hands-on task (e.g., compute recall@k on a golden set, add a prompt regression gate, configure a canary rollout), and (3) record gaps as “facts to memorize” vs “skills to practice.” Your review plan should end with a timed full practice test and a focused gap closure day where you revisit only the missed objective categories.

The practical outcome is exam confidence rooted in engineering habit: you will recognize patterns (retrieval vs generation vs operations), select the safest OCI-aligned approach, and justify choices using measurable criteria like recall@k, citation coverage, latency, and SLOs.

Chapter milestones
  • Build an evaluation set and compute offline RAG metrics
  • Create automated regression tests for retrieval and generation
  • Deploy the service with observability and rollback strategy
  • Optimize latency and cost with caching and batching
  • Run a final readiness sprint: timed practice and gap closure
Chapter quiz

1. Why does Chapter 6 recommend treating a RAG app as two coupled subsystems (retrieval and generation) when evaluating quality?

Show answer
Correct answer: Because retrieval determines if relevant evidence is available, while generation determines whether the response is faithful and safe given that evidence
The chapter emphasizes separating retrieval failures (no relevant chunks) from generation failures (hallucinations, missing citations, unsafe phrasing) since they require different fixes.

2. A user report shows the model produced a plausible answer that lacks citations and appears invented, even though relevant chunks were retrieved. What type of failure is this and what should you adjust first?

Show answer
Correct answer: Generation failure; adjust prompts and guardrails first
If evidence was retrieved but the answer is unfaithful or missing citations, it’s a generation failure; the chapter suggests improving prompts/guardrails and rerunning faithfulness/citation checks.

3. What is the primary purpose of building an evaluation set and computing offline RAG metrics in Chapter 6?

Show answer
Correct answer: To define what “good” means and prove improvements before deploying changes
The chapter frames evaluation and offline metrics as the way to make the system measurable and to support an improvement loop before shipping changes.

4. Which approach best matches the chapter’s improvement loop for a RAG system?

Show answer
Correct answer: Adjust chunking/indexing/ranking and rerun retrieval metrics; adjust prompts/guardrails and rerun faithfulness/citation checks; then verify latency and cost with caching and batching
The chapter lays out an iterative loop that separately validates retrieval and generation, then checks operational constraints like latency and cost using caching and batching.

5. According to Chapter 6, why do many RAG teams encounter issues after validating locally but moving to production on OCI?

Show answer
Correct answer: Because production factors like network egress, IAM policies, rate limits, and logging choices affect latency and failure modes
The chapter highlights operational realities on OCI—egress, IAM, rate limits, and logging—that can change performance and reliability compared to local testing.
More Courses
Edu AI Last
AI Course Assistant
Hi! I'm your AI tutor for this course. Ask me anything — from concept explanations to hands-on examples.