Career Transition Lab: Local LLM Assistant with Ollama + FastAPI

Go from AI-curious to shipping a local LLM assistant you can demo.

Beginner ollama · docker · fastapi · local-llm

Build a local LLM assistant—and a credible AI portfolio project

This course is a short, book-style lab for career switchers who want a practical, demonstrable AI project: a local LLM assistant you can run on your own machine. Instead of relying on paid hosted APIs, you’ll use Ollama for local model serving, Docker for repeatable environments, and FastAPI to expose a clean backend API with streaming chat responses. By the end, you’ll have a project that looks and feels like real work: clear API contracts, sensible guardrails, reproducible setup, and an evaluation approach you can explain in interviews.

The course is designed for beginners who can write basic Python and use the terminal, but who may be new to LLM app engineering. Each chapter builds on the previous one: you’ll start with a career-focused blueprint and a working baseline, then add model discipline, a proper API layer, containerization, retrieval grounding, and finally testing/evaluation plus portfolio packaging.

What you’ll build

You’ll implement a local assistant service with:

  • Ollama-managed local models (pulled, run, and tuned consistently)
  • A FastAPI backend with typed request/response models and OpenAPI docs
  • Streaming responses for a modern chat feel
  • Docker + docker-compose so anyone can run it with one command
  • Optional RAG (retrieval-augmented generation) using local documents to ground answers
  • Basic guardrails (validation, timeouts, error handling, simple safety policies)
  • Lightweight evaluation to track quality and prevent regressions

Why local LLMs are a strong career transition project

Local inference forces you to think like an engineer: model selection, latency constraints, memory limits, and reproducible deployment. These are highly transferable skills for AI-adjacent roles such as AI engineer, ML platform associate, backend engineer for LLM features, or product-minded prototyper. You’ll also learn how to communicate tradeoffs (privacy vs performance, cost vs quality, RAG vs fine-tuning), which is exactly what interviewers probe for.

How the course is structured (6 chapters, no fluff)

Each chapter ends in a tangible checkpoint. You’ll gradually evolve a basic local chat into a containerized assistant API with retrieval grounding and a portfolio-ready delivery. The goal is not just “it runs on my laptop,” but “a reviewer can reproduce it, understand it, and see engineering judgment.”

Who this is for

  • Career changers building a first AI project that can be demoed reliably
  • Software developers new to LLM tooling who want a concrete, local-first stack
  • Analysts or data folks who want to move toward AI application engineering

Get started

If you want a structured path from setup to a polished demo, start here and ship chapter by chapter. Register free to track your progress, or browse all courses to compare learning paths.

What You Will Learn

  • Explain how local LLM inference works and when to choose it over hosted APIs
  • Run and manage models with Ollama, including model selection and prompt templates
  • Containerize an LLM-powered service with Docker for repeatable local deployment
  • Build a FastAPI backend that streams tokens and exposes chat-style endpoints
  • Add guardrails: input validation, timeouts, retries, and basic safety filters
  • Implement lightweight RAG using local documents to ground answers
  • Measure quality with simple eval sets and latency/cost-style metrics for local apps
  • Package a portfolio-ready assistant with README, demo scripts, and deployment notes

Requirements

  • Basic Python (functions, modules, virtual environments)
  • Comfort using the command line (cd, ls, environment variables)
  • A computer with 16 GB RAM (recommended); Apple Silicon or a modern x86 CPU
  • Docker Desktop installed
  • Git installed (GitHub account helpful but not required)

Chapter 1: Your Career Transition Blueprint + Local LLM Fundamentals

  • Define your target role and portfolio narrative for a local LLM assistant
  • Set up dev environment: Python, Git, Docker, and project structure
  • Understand local LLM constraints: latency, memory, privacy, and tradeoffs
  • Create a baseline CLI chat to validate end-to-end inference
  • Checkpoint: publish a clean repo with a working local chat demo

Chapter 2: Ollama Deep Dive—Models, Prompts, and Reliable Inference

  • Install and run Ollama; pull and compare 2–3 models
  • Design a prompt template and test behaviors with a mini test set
  • Tune generation settings for stability and speed
  • Build a small Python wrapper client for Ollama requests
  • Checkpoint: document model choice and reproducible commands

Chapter 3: FastAPI Assistant API—Chat, Streaming, and Contracts

  • Create FastAPI service with health checks and configuration
  • Implement /chat endpoint with conversation memory
  • Add streaming responses for better UX
  • Define request/response models with Pydantic and OpenAPI docs
  • Checkpoint: a documented API that others can call locally

Chapter 4: Dockerize the Stack—Ollama + FastAPI for One-Command Runs

  • Write a Dockerfile for the FastAPI service
  • Create docker-compose for FastAPI + Ollama with volumes
  • Add environment configuration and secrets handling patterns
  • Optimize container startup and local dev workflows
  • Checkpoint: run the full assistant stack with docker compose up

Chapter 5: Add Retrieval (RAG) + Guardrails for a Portfolio-Ready Assistant

  • Ingest local docs and chunk text for retrieval
  • Implement embeddings and a simple vector store option
  • Compose prompts that cite retrieved context
  • Add guardrails: input constraints, content policies, and rate limiting
  • Checkpoint: assistant answers grounded questions with citations

Chapter 6: Ship It—Testing, Evaluation, and Career Packaging

  • Create smoke tests and contract tests for the API
  • Build a small evaluation set and measure quality/latency
  • Prepare a demo script and screenshots for your portfolio
  • Write a deployment and usage guide for reviewers
  • Checkpoint: publish a polished project that supports your career transition

Sofia Chen

Senior Machine Learning Engineer, LLM Applications

Sofia Chen builds production LLM features for developer tools, focusing on reliable inference, evaluation, and API design. She has coached career switchers into AI roles by helping them ship small, defensible portfolio projects with strong engineering fundamentals.

Chapter 1: Your Career Transition Blueprint + Local LLM Fundamentals

This course is built around a single career lever: you will ship a local LLM assistant end-to-end (model runtime, API service, packaging, and guardrails) and present it as a portfolio narrative that makes sense to hiring managers. “Local” is not a gimmick—running inference on your own machine changes the engineering constraints, the risk profile, and the kinds of problems you can solve for employers (privacy, cost control, offline workflows, and predictable deployments).

In this chapter you will define your target role and a story for your portfolio, set up a development environment (Python, Git, Docker, and a clean project structure), and learn the fundamentals and tradeoffs of local inference. You’ll also create a baseline CLI chat to validate end-to-end inference before you touch FastAPI, streaming, or retrieval. That simple chat demo becomes your first publishable checkpoint: a clean repository that any reviewer can clone and run.

Throughout the course, keep a practical mindset: you are not “learning LLMs,” you are learning how to build reliable LLM-powered software. Your strongest signal will be judgment: choosing local vs hosted, selecting an appropriate model, building with repeatability, and documenting your decisions.

Practice note for this chapter’s milestones (defining your target role and portfolio narrative, setting up the dev environment, understanding local LLM constraints, building the baseline CLI chat, and publishing the checkpoint repo): for each one, document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Sections in this chapter
Section 1.1: Career outcomes and what hiring managers look for

Hiring managers rarely hire for “LLM enthusiasm.” They hire for outcomes: a working system, clear problem framing, and evidence you can ship safely. For a career transition, your best strategy is to target a role family (e.g., backend engineer with AI, AI product engineer, ML/LLM engineer, platform engineer) and craft a portfolio narrative that matches the responsibilities of that role.

In this course, the portfolio artifact is a local LLM assistant with an API, containerization, and basic guardrails. That maps cleanly to real job requirements: integrating model inference into services, handling latency and failures, designing APIs, and documenting deployment. When a reviewer opens your repository, they should quickly understand: (1) what the assistant does, (2) how to run it locally, (3) why you chose local inference, and (4) what tradeoffs you accepted.

  • What hiring managers look for: reproducible setup, minimal “magic,” sensible defaults, and visible error handling.
  • Signals that you can work on a team: clear README, consistent formatting, meaningful commits, and an explicit license.
  • Signals of engineering judgment: model choice justified by hardware constraints, timeouts and retries, and a plan for safety/abuse.

Common mistake: building an impressive demo that can’t be run without hours of troubleshooting. Another mistake: burying your intent. State your target role in the README (“Built as a local-first assistant to demonstrate inference + API + packaging”) and add a short “Design decisions” section explaining your local vs hosted choice, model selection, and constraints.

Section 1.2: Local vs hosted LLMs—decision framework

Choosing local inference versus a hosted API is not ideological; it’s a product and operations decision. Hosted LLM APIs (OpenAI, Anthropic, etc.) typically offer best-in-class quality, managed scaling, and fast iteration. Local inference (via Ollama in this course) offers privacy control, offline capability, predictable cost, and tighter control over deployment environments.

Use a simple decision framework based on five axes:

  • Data sensitivity: If prompts contain regulated or confidential data, local inference can reduce exposure and simplify compliance—though you still need local security practices.
  • Cost and predictability: Hosted APIs are variable cost per token; local inference shifts cost to hardware and electricity. For heavy internal usage, local can be cheaper.
  • Latency and UX: Hosted models can be faster and higher quality; local can be slower, but streaming tokens can keep UX responsive.
  • Quality requirements: If you need frontier reasoning, hosted may be necessary. If you need a controllable assistant for internal docs, local can be sufficient.
  • Operational maturity: Hosted reduces ops burden; local requires versioning models, managing memory, and containerizing services.

Practical workflow: start with local for development if your goal is learning and reproducibility, then keep an escape hatch to hosted for quality benchmarks. A good portfolio repository documents this explicitly (“Local-first. Optional hosted adapter later.”). Common mistake: picking a model that barely fits your machine and blaming “LLMs are slow.” Instead, treat the runtime constraint as part of the design: model size, context length, and concurrency should match your environment.

Section 1.3: Hardware sizing and performance expectations

Local LLM constraints define your engineering reality: memory is the gate, latency is the tax, and privacy is the reward. You need basic sizing intuition to choose models responsibly in Ollama. The two most common limiting factors are RAM/VRAM capacity and sustained token generation speed (tokens/second).

As a rule of thumb, larger parameter models need more memory, but quantization reduces memory usage at some cost to quality. A smaller, well-chosen model running reliably is better than a large model that swaps memory or crashes. Expect these practical tradeoffs:

  • Latency: First token time can be noticeable locally; streaming output helps user experience even if total time is longer.
  • Memory: If the model does not fit comfortably, you’ll see slowdowns or failures. Leave headroom for the OS, Docker, and your Python service.
  • Context length: Larger context windows increase memory and compute. Don’t default to the maximum—choose what your use case needs.
  • Concurrency: One local model instance may handle only a small number of simultaneous chats before performance degrades.

Practical expectation setting for your demo: optimize for “works every time” rather than “largest model available.” Your baseline CLI chat (coming later in this chapter) is your measurement tool. Record simple metrics in your README: model name, quantization, approximate tokens/sec on your machine, and peak memory usage if you can. Common mistake: ignoring thermal throttling on laptops and then being surprised by inconsistent performance. When testing, keep conditions consistent (plugged in, similar background load).
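To make those README metrics concrete, here is a minimal Python sketch for timing a streamed response. `measure_stream` accepts any iterable of tokens, so you can feed it the chunks your Ollama client yields; the function names are illustrative, not part of any library:

```python
import time

def tokens_per_second(token_count: int, elapsed_s: float) -> float:
    """Throughput metric to record in your README alongside the model name."""
    if elapsed_s <= 0:
        raise ValueError("elapsed time must be positive")
    return token_count / elapsed_s

def measure_stream(token_iter):
    """Consume a token stream, returning (first-token latency, tokens/sec).

    `token_iter` is any iterable of generated tokens -- for example the
    chunks your Ollama client yields while streaming a response.
    """
    start = time.perf_counter()
    first_token_at = None
    count = 0
    for _ in token_iter:
        if first_token_at is None:
            first_token_at = time.perf_counter() - start
        count += 1
    total = time.perf_counter() - start
    return first_token_at, tokens_per_second(count, total)
```

Run the measurement a few times under the same conditions (plugged in, similar background load) and record the median rather than the best run.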

Section 1.4: Project scaffolding and dependency management

Your goal is a repository that runs the same way for you, a reviewer, and future-you. That means a predictable project layout, pinned dependencies, and a development environment that anticipates the next chapters (FastAPI, streaming, Docker, and lightweight RAG). Start with Python, Git, and Docker installed, then scaffold a small but disciplined structure.

A practical starter layout looks like this:

  • app/ — your Python package (later: FastAPI routes, services, RAG utilities).
  • scripts/ — one-off helpers (e.g., run CLI chat, download sample docs).
  • tests/ — even a few smoke tests signal seriousness.
  • docker/ or a top-level Dockerfile — for repeatable local deployment.
  • .env.example — documented configuration defaults without secrets.

Dependency management is where many transitioners stumble. Pick one method and stick to it (requirements.txt with pip-tools, Poetry, or uv). Pin versions for anything runtime-critical. For this chapter’s CLI chat, keep dependencies minimal: an HTTP client (or the official Ollama client if you use it), and a small CLI loop. Save the heavier web dependencies for later chapters so failures are easier to isolate.

Common mistake: mixing global Python packages with project packages and losing track of what’s installed. Use a virtual environment, commit a lockfile if your tool supports it, and document one canonical setup path in the README. Your future Docker container will thank you.
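As a sketch of the “centralize configuration” advice, a small settings loader with safe defaults keeps the CLI chat, FastAPI service, and Docker container reading the same values. The variable names mirror the .env.example pattern used in this course; adapt them to your project:

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class Settings:
    ollama_host: str
    model: str
    request_timeout_s: float

def load_settings(env=os.environ) -> Settings:
    """Read configuration from the environment with safe defaults.

    Passing `env` explicitly makes the loader easy to test; in
    production it simply reads os.environ.
    """
    return Settings(
        ollama_host=env.get("OLLAMA_HOST", "http://localhost:11434"),
        model=env.get("OLLAMA_MODEL", "llama3.1"),
        request_timeout_s=float(env.get("REQUEST_TIMEOUT_S", "60")),
    )
```

Later chapters can reuse `load_settings()` unchanged inside FastAPI and Docker, which is exactly the “one canonical setup path” the README should promise.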

Section 1.5: Prompting basics for assistants (system/user/assistant roles)

Before you build an API, validate that you can reliably “steer” the model. Assistant prompting is easiest to reason about when you separate messages by role: system (rules and identity), user (the request), and assistant (the model’s responses). This structure becomes the foundation for chat endpoints and later for RAG grounding.

In Ollama, you will choose a model and define a baseline prompt template. Your baseline CLI chat should do three things: (1) send a system message that defines the assistant behavior, (2) keep a short conversation history, and (3) print streamed tokens so you can observe latency and truncation. Keep the system message short and testable, for example: “You are a local-first coding assistant. Ask clarifying questions if requirements are missing. Do not invent files or commands that were not discussed.”

  • Engineering judgment: Avoid giant prompts that try to solve safety, style, and domain expertise all at once. Start minimal, then add constraints when you observe failure modes.
  • Common mistake: putting all instructions in the user message. System messages are more reliable for stable behavior.
  • Practical outcome: a repeatable prompt template you can later share across CLI, FastAPI, and Docker deployments.

When testing prompts, change one variable at a time: model choice, temperature (if exposed), context length, or the system message. Record what changed and why. This is portfolio-grade practice because it mirrors real production debugging.
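The three behaviors above can be sketched as a small message builder. This is illustrative plumbing, not a specific client API: the returned list matches the role-structured shape that chat-style endpoints (including Ollama’s /api/chat) expect, and the history cap keeps the prompt inside the context window:

```python
SYSTEM_PROMPT = (
    "You are a local-first coding assistant. Ask clarifying questions "
    "if requirements are missing. Do not invent files or commands "
    "that were not discussed."
)

def build_messages(history, user_input, max_turns=6):
    """Assemble one chat turn: system rules, recent history, new request.

    `history` is a list of {"role": ..., "content": ...} dicts; only the
    last `max_turns` entries are kept so old turns don't crowd out the
    system message.
    """
    recent = history[-max_turns:]
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        *recent,
        {"role": "user", "content": user_input},
    ]
```

When you wire this to a real endpoint, append the model’s reply to `history` as an `assistant` message so follow-up questions keep working.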

Section 1.6: Repo hygiene—README, .env patterns, and licensing

Your checkpoint for this chapter is to publish a clean repository with a working local chat demo. “Clean” is not aesthetic; it’s operational. A reviewer should be able to clone, run one or two commands, and see the assistant respond via Ollama locally. This is where README quality becomes a career skill.

Your README should include: a one-paragraph project goal, prerequisites (Ollama, Python, Docker if used), setup steps, a “Run the CLI chat” section, and a small troubleshooting section (model not found, port conflicts, slow performance). Add a short “Design decisions” section that explains why you chose local inference and what limitations exist on typical hardware.

Use .env patterns correctly: never commit secrets. Instead, commit .env.example with safe defaults (e.g., OLLAMA_HOST, model name, request timeout). Add .gitignore entries for .env, __pycache__, virtualenv folders, and local data directories. Keep configuration values centralized so later you can reuse them in FastAPI and Docker without duplication.

  • Licensing: choose an explicit license (MIT/Apache-2.0 are common for portfolios). Without a license, many teams treat the code as “all rights reserved.”
  • Commit hygiene: make small, meaningful commits: “Add CLI chat loop,” “Document setup,” “Add .env.example.”

Common mistake: shipping a demo that depends on your personal machine state (downloaded models, untracked files, or undocumented environment variables). Treat your repository as a product: if it cannot be reproduced, it does not count. This mindset is the backbone for the next chapters, where you will containerize the service and expose a streaming FastAPI backend.

Chapter milestones
  • Define your target role and portfolio narrative for a local LLM assistant
  • Set up dev environment: Python, Git, Docker, and project structure
  • Understand local LLM constraints: latency, memory, privacy, and tradeoffs
  • Create a baseline CLI chat to validate end-to-end inference
  • Checkpoint: publish a clean repo with a working local chat demo
Chapter quiz

1. Why does running an LLM locally meaningfully change the "engineering constraints" compared to using a hosted model?

Correct answer: It introduces tradeoffs like latency and memory limits while improving privacy, cost control, offline workflows, and deployment predictability
The chapter emphasizes that local inference changes the constraints (latency, memory) and risk profile while enabling privacy, cost control, offline use, and predictable deployments.

2. What is the main portfolio signal the course aims to produce for hiring managers?

Correct answer: An end-to-end shipped local LLM assistant with a clear portfolio narrative and documented decisions
The course is built around shipping an end-to-end local assistant and presenting it as a hiring-manager-friendly story.

3. Why build a baseline CLI chat before adding FastAPI, streaming, or retrieval?

Correct answer: To validate end-to-end inference with the simplest possible working demo before adding complexity
The chapter frames the CLI chat as a minimal validation of end-to-end inference prior to more advanced components.

4. Which checkpoint best reflects the chapter’s definition of a "publishable" first milestone?

Correct answer: A clean repository that a reviewer can clone and run to use a working local chat demo
The chapter states the first publishable checkpoint is a clean repo with a working local chat demo.

5. According to the chapter, what mindset and skill will be the strongest signal of competence throughout the course?

Correct answer: Judgement in choosing approaches (local vs hosted, model selection), building repeatably, and documenting decisions to create reliable software
The chapter stresses a practical mindset—building reliable LLM-powered software—and highlights judgement, repeatability, and documentation as key signals.

Chapter 2: Ollama Deep Dive—Models, Prompts, and Reliable Inference

This chapter turns “it runs on my machine” into “it runs reliably, repeatably, and explainably.” You’ll go beyond pulling a model and asking questions: you’ll learn how Ollama manages models locally, how to choose between 2–3 candidates, how to build prompts that stay stable across turns, and how to tune sampling for speed without chaotic outputs. You’ll also write a tiny Python wrapper so your FastAPI backend can call Ollama consistently, and you’ll finish with a checkpoint: a short, reproducible runbook that documents your model choice and the exact commands you used.

Local inference is a product decision as much as a technical one. You trade cloud convenience for control: predictable costs, offline operation, and tighter data boundaries. But you also inherit responsibility for model selection, resource constraints, and operational guardrails (timeouts, retries, and safety filters). The goal here is to develop engineering judgment: know what to measure, what to standardize, and what failure modes to expect before you build an assistant people depend on.

The workflow in this chapter is deliberately iterative. First, install and run Ollama. Next, pull and compare a few models. Then, design a prompt template and test it with a mini test set (a handful of representative prompts and expected traits). After that, tune generation settings to stabilize behavior. Finally, wrap inference calls in a small client so your API layer can stream or respond in a predictable, debuggable way.

Practice note for this chapter’s milestones (installing Ollama and comparing 2–3 models, designing and testing a prompt template, tuning generation settings, building the Python wrapper client, and documenting your model choice): for each one, document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Sections in this chapter
Section 2.1: Ollama architecture and local model lifecycle

Ollama is a local model runner that exposes a simple interface for downloading, storing, and serving LLMs on your machine. Conceptually, it sits between your application (CLI, Python, FastAPI) and a model runtime optimized for local inference. The key idea: you don’t “install” a model like a Python package; you pull a model artifact (weights + metadata), keep it in a local store, and then run inference requests against it. This lifecycle matters because repeatability depends on pinning model versions and controlling how they’re invoked.

Start by installing Ollama for your OS, then verify the daemon is running. Pull a model with a command like ollama pull llama3.1 (names vary by registry and version). Run it interactively with ollama run llama3.1 to validate that your GPU/CPU path is working and that token generation is fast enough for your use case. In practice, the “first token latency” is a big usability factor: if it’s too slow, you’ll need a smaller model, more quantization, or a different runtime setup.

  • Model store: Pulled models live locally; ensure you have disk space and a cleanup plan.
  • Reproducibility: Prefer explicit model tags/versions so “the same prompt” means the same model tomorrow.
  • Operational habit: Keep a small runbook: install steps, pull commands, and how to verify with a known prompt.

Common mistake: treating model upgrades as “transparent.” Even minor model revisions can change formatting and compliance. For a career-transition portfolio project, document your lifecycle clearly: which model you chose, why, and the exact commands required to reproduce your environment.
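A runbook verification step like the one suggested above can be automated. This sketch assumes Ollama’s local HTTP API is listening on its default port and that /api/tags lists installed models; confirm the endpoint and response shape against your Ollama version:

```python
import json
import urllib.request

def list_local_models(host="http://localhost:11434"):
    """Ask the Ollama daemon which models are installed locally."""
    with urllib.request.urlopen(f"{host}/api/tags", timeout=5) as resp:
        payload = json.load(resp)
    return [m["name"] for m in payload.get("models", [])]

def require_model(installed, wanted):
    """Fail fast if the model tag pinned in your runbook is missing."""
    if wanted not in installed:
        raise RuntimeError(
            f"model {wanted!r} not installed; run: ollama pull {wanted}"
        )
    return True
```

Calling `require_model(list_local_models(), "llama3.1")` at startup gives a reviewer an actionable error message instead of a cryptic inference failure.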

Section 2.2: Choosing models (size, context, instruction tuning)

Model choice is where many local projects succeed or fail. Bigger isn’t automatically better: a 70B model might be “smarter,” but if your machine can’t serve it with acceptable latency, your assistant becomes unusable. Start by comparing 2–3 models that fit your hardware constraints—often a small (e.g., ~7–8B), a medium (e.g., ~13–14B), and an alternative architecture or tuning. Pull each candidate and run the same mini test set to compare output quality and stability.

Evaluate three practical dimensions. First is size: larger models tend to follow instructions more reliably and hallucinate less on complex tasks, but they cost more memory and time. Second is context length: if you plan to do lightweight RAG later (grounding answers in local documents), context window becomes important because you’ll be injecting excerpts. Third is instruction tuning: choose instruction-tuned or chat-tuned variants for assistant behavior; base models may require more careful prompting and can drift into unhelpful completions.

  • Speed target: Decide upfront what “good enough” feels like (e.g., response starts within 1–2 seconds and streams steadily).
  • Task fit: Use prompts that resemble your real assistant tasks (summarize a resume bullet, draft an email, critique a project plan).
  • Failure sensitivity: If your assistant must follow formatting rules (JSON, bullet lists), prefer models known for instruction compliance.

Engineering judgment: choose the smallest model that reliably passes your mini test set. This keeps costs (compute, battery, thermal throttling) manageable and makes Dockerized local deployment more predictable later.
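One way to make “reliably passes your mini test set” mechanical is a tiny comparison harness. The `generate` callable below is a placeholder for your own inference wrapper; injecting it keeps the harness testable without a running model:

```python
def passes_traits(output: str, must_include=(), max_chars=None) -> bool:
    """Cheap behavioral check: required substrings and a length cap."""
    if max_chars is not None and len(output) > max_chars:
        return False
    return all(s.lower() in output.lower() for s in must_include)

def compare_models(generate, models, test_set):
    """Run every test prompt against every candidate model.

    `generate(model, prompt) -> str` is your inference call (for example
    a thin wrapper over Ollama); it is passed in so the harness itself
    has no model dependency. Returns {model: number of passing cases}.
    """
    scores = {}
    for model in models:
        scores[model] = sum(
            passes_traits(generate(model, case["prompt"]), **case["checks"])
            for case in test_set
        )
    return scores
```

Record the resulting scores table in your README next to the model sizes; it is exactly the kind of evidence that backs up a “smallest model that passes” claim.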

Section 2.3: Prompt templates and conversation state

A prompt template is your assistant’s “contract.” It defines role, scope, tone, and constraints so the model behaves consistently across requests. Without a template, you’ll see drift: verbose answers when you want concise ones, refusal to follow formatting, or context confusion after a few turns. A solid template typically includes: system instructions (non-negotiable rules), developer instructions (task framing), user message, and optional context blocks (retrieved text, policies, or examples).

Conversation state is equally important. Chat-style inference works by providing prior messages back to the model each turn. Locally, you control exactly what gets sent—this is a feature and a risk. If you naively append everything forever, you’ll hit context limits and slow inference. If you trim too aggressively, the assistant forgets key constraints. A practical approach is to keep a short, structured memory: recent turns plus a running summary of important facts.

Build a mini test set (5–10 prompts) that probes your template’s behavior. Include at least: a request for structured formatting, a refusal-sensitive request (to see if it handles safety constraints), a long-input request (to test truncation), and a “follow-up question” that depends on prior context. Run this test set against each model and template revision. Treat prompt design like code: version it, document changes, and avoid ad-hoc edits that aren’t tested.

  • Common mistake: Putting contradictory instructions (e.g., “be concise” and “provide exhaustive detail”).
  • Practical outcome: A template file you can reuse in your FastAPI service, plus a small set of test prompts stored in your repo.
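The template-plus-state approach above can be sketched as follows; the system rules, the 10-turn cap, and the running-summary slot are illustrative choices, not requirements:

```python
# A versioned prompt template plus the message-assembly step.
SYSTEM_TEMPLATE = """\
You are a concise career-transition assistant.
Rules (non-negotiable):
- Answer in at most 5 bullet points unless asked for prose.
- If you are unsure, say so instead of inventing facts.
"""

def build_messages(history: list[dict], user_message: str, summary: str = "") -> list[dict]:
    """Assemble the chat payload: system rules, optional running summary,
    recent turns, then the new user message."""
    messages = [{"role": "system", "content": SYSTEM_TEMPLATE}]
    if summary:
        messages.append({"role": "system", "content": f"Conversation summary: {summary}"})
    messages.extend(history[-10:])  # keep only recent turns to respect context limits
    messages.append({"role": "user", "content": user_message})
    return messages
```

Storing SYSTEM_TEMPLATE in its own file and importing it keeps the template versioned alongside your code.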

Section 2.4: Sampling parameters (temperature, top_p, max tokens)

Sampling parameters are your main levers for stability, creativity, and speed. Many “model quality” complaints are actually parameter misconfiguration. If your assistant must be reliable (career guidance, planning, summarizing), you generally want lower randomness. If you’re brainstorming, you can increase randomness—but do it deliberately and test the results.

Temperature controls how deterministic the model is. Lower values (e.g., 0.1–0.3) reduce variance and make outputs more repeatable; higher values (e.g., 0.7–1.0) increase creativity but also increase the chance of format breaks and hallucinations. top_p (nucleus sampling) restricts choices to a probability mass; many teams keep temperature moderate and use top_p (e.g., 0.9) to avoid extreme tokens. max tokens (or max output) caps response length—critical for latency and for preventing the model from rambling when it’s unsure.

  • Stability preset (recommended for APIs): temperature 0.2, top_p 0.9, max tokens tuned per endpoint.
  • Brainstorm preset: temperature 0.8, top_p 0.95, shorter max tokens to keep ideas punchy.
  • Speed note: Lower max tokens often improves perceived performance more than any other tweak.

Common mistake: increasing temperature to “fix” bland responses when the real issue is prompt clarity. First tighten the instructions and examples; then adjust sampling. Record your final settings in your checkpoint runbook so teammates (or future you) can reproduce behavior exactly.
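One way to keep these settings reproducible is to store them as named presets; the values below are the starting points suggested above, and num_predict is Ollama's option name for capping output length:

```python
# Named sampling presets; tune per endpoint and record final values.
PRESETS = {
    "stable":     {"temperature": 0.2, "top_p": 0.9,  "num_predict": 512},
    "brainstorm": {"temperature": 0.8, "top_p": 0.95, "num_predict": 256},
}

def options_for(preset: str, **overrides) -> dict:
    """Return a copy of a preset, optionally overriding individual knobs."""
    opts = dict(PRESETS[preset])
    opts.update(overrides)
    return opts

# Example payload fragment for Ollama:
# {"model": "llama3:8b", "prompt": "...", "options": options_for("stable")}
```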

Section 2.5: Structured outputs with JSON and schema expectations

Once you connect a model to FastAPI, you’ll quickly want structured outputs: JSON objects you can validate, store, and render. The challenge is that LLMs are probabilistic text generators; they can produce trailing commentary, invalid quotes, or partial JSON when token limits hit. Your job is to design for “JSON-first” behavior and enforce it with validation and retries.

Start by specifying the schema in your prompt template. Be explicit: required keys, allowed values, and constraints (string length, enum choices). Ask for only JSON, no prose. Then implement a post-processing step that parses JSON and validates it (Pydantic is a good fit in Python). If parsing fails, retry once with a corrective message that includes the validation error and repeats the schema. Keep retries limited to avoid infinite loops and runaway latency.

  • Prompt tip: Provide a minimal example of valid JSON that matches your schema.
  • Guardrail: Set max tokens high enough for the full JSON response, or you’ll get truncated objects.
  • Safety: Never execute generated code or trust generated URLs; treat JSON as untrusted input until validated.

Practical outcome: you can now build endpoints like /chat (free-form) and /extract (strict JSON) with different templates and sampling presets. This separation keeps your system predictable and easier to debug.

Section 2.6: Debugging inference issues and common failure modes

When local inference fails, symptoms can look similar—timeouts, garbled outputs, or sudden slowness—but root causes differ. Build a debugging checklist and apply it systematically. First, confirm whether the issue is resource-related (RAM/VRAM exhaustion, CPU throttling), prompt-related (too long, conflicting instructions), or transport-related (client timeouts, streaming handling). Logging is essential: record model name/version, prompt length (tokens or characters), sampling settings, and latency for first token and full completion.

Write a small Python wrapper client for Ollama requests to standardize calls. This wrapper should: set timeouts, support retries with backoff for transient failures, and optionally stream tokens for responsive UIs. Even if your first FastAPI version is minimal, your wrapper becomes the single place to handle request formatting, error mapping, and observability fields (request id, duration, token counts if available).

  • Failure mode: Output truncation → increase max tokens or shorten context; add “stop” sequences if supported.
  • Failure mode: Repetitive loops → lower temperature, adjust top_p, or tighten instructions.
  • Failure mode: JSON invalid → enforce schema, parse+validate, retry once with corrective prompt.
  • Failure mode: Slow after a few runs → check thermal throttling, memory pressure, and background processes.

Checkpoint: document your final model choice and the reproducible commands: installation notes, ollama pull lines, the exact model tag, your prompt template file path, and your generation settings. This “paper trail” is what turns an experiment into an engineering artifact you can confidently containerize and expose through a streaming FastAPI service in the next stage.

Chapter milestones
  • Install and run Ollama; pull and compare 2–3 models
  • Design a prompt template and test behaviors with a mini test set
  • Tune generation settings for stability and speed
  • Build a small Python wrapper client for Ollama requests
  • Checkpoint: document model choice and reproducible commands
Chapter quiz

1. What is the main shift in mindset Chapter 2 aims to achieve when moving from basic local inference to a dependable assistant?

Show answer
Correct answer: From “it runs on my machine” to “it runs reliably, repeatably, and explainably”
The chapter emphasizes reliability, repeatability, and explainability rather than one-off success.

2. Why does the chapter have you pull and compare 2–3 models instead of committing to the first model that works?

Show answer
Correct answer: To make a grounded model choice based on measured behavior and constraints
Comparing a few candidates supports engineering judgment around selection under local resource constraints and desired behavior.

3. What is the purpose of creating a prompt template and validating it with a mini test set?

Show answer
Correct answer: To ensure prompt behavior stays stable across turns and can be checked against representative cases
A template plus mini test set helps standardize and verify consistent behavior on representative prompts.

4. According to the chapter, why tune generation settings during local inference development?

Show answer
Correct answer: To stabilize outputs and improve speed without chaotic behavior
The chapter highlights tuning sampling to balance stability and speed while avoiding erratic outputs.

5. What is the practical reason for writing a small Python wrapper client for Ollama requests in this chapter?

Show answer
Correct answer: So the FastAPI backend can call Ollama consistently in a predictable, debuggable way
A wrapper standardizes inference calls for your API layer, supporting predictable responses/streaming and easier debugging.

Chapter 3: FastAPI Assistant API—Chat, Streaming, and Contracts

In Chapter 2 you proved you can run a model locally with Ollama. In this chapter, you turn that capability into a small, reliable service that other tools (a CLI, a web UI, a teammate’s script) can call. The goal is not “a demo endpoint that works once,” but an API that behaves predictably: it validates inputs, documents its contract, streams tokens for a good user experience, and fails gracefully when the model is slow or unavailable.

Think like an engineer building a product surface area. A local LLM is a dependency that can be heavy, variable in latency, and occasionally error-prone. FastAPI gives you a clean way to wrap that dependency behind a consistent interface. The key is to decide what your API promises: what a request looks like, what responses look like (including errors), how streaming is delivered, and how conversation memory is stored and scoped. Those promises are your “contracts,” and they matter even when you are the only user—because future-you will integrate this service into other projects.

We’ll build up from a health-checked FastAPI service with configuration, then add a /chat endpoint that supports conversation context, then add streaming so the UI can render partial output immediately. Along the way you’ll define Pydantic models that appear automatically in OpenAPI docs, and you’ll learn where common mistakes happen: blocking calls in async routes, over-trusting user input, letting requests run forever, and returning inconsistent error shapes.

Practice note for Create FastAPI service with health checks and configuration: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Implement /chat endpoint with conversation memory: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Add streaming responses for better UX: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Define request/response models with Pydantic and OpenAPI docs: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Checkpoint: a documented API that others can call locally: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Sections in this chapter
Section 3.1: FastAPI project layout and dependency injection basics

A clean project layout pays off quickly because an LLM assistant API mixes concerns: HTTP routing, configuration, model calling, memory, and safety checks. A practical layout for this chapter looks like: app/main.py (FastAPI app), app/api/chat.py (routes), app/core/config.py (settings), app/services/ollama_client.py (model calls), and app/services/memory.py (conversation storage). Keeping these separated lets you test and replace parts without rewriting your endpoints.

Use dependency injection (DI) to pass shared resources into routes. In FastAPI, DI is usually done with Depends(). For example, you can inject a Settings object and an OllamaClient into your /chat handler. DI prevents “global variable soup” and makes it obvious what an endpoint needs to run. It also makes it easier to swap implementations later (for example, replacing in-memory conversation storage with Redis).

Start with operational basics: a /healthz endpoint that returns a simple JSON payload, plus a /readyz endpoint that can optionally check whether Ollama is reachable. Health checks are not just for Kubernetes; they help you debug quickly when something fails. A common mistake is to do expensive checks on every health request. Instead, keep /healthz cheap, and reserve deeper checks for /readyz or an admin-only path.

  • Create FastAPI(title=..., version=...) and include routers under a prefix like /v1.
  • Add structured logging early. Even basic request IDs in logs help when debugging streaming endpoints.
  • Keep environment-based configuration (model name, host, timeout) out of code and in settings.

The practical outcome of this section is a service skeleton you can run locally with uvicorn app.main:app --reload and confidently tell whether it’s up, whether it’s ready, and which pieces are responsible for what.

Section 3.2: Pydantic models and validation for safer prompts

Your assistant is only as stable as its inputs. Without validation, clients can send malformed JSON, absurdly long prompts, or unexpected roles that break your logic. Pydantic models are your first guardrail: they define the request/response contract and generate OpenAPI docs automatically. For a chat-style endpoint, a typical request model includes a messages list, each with role and content, plus optional fields like conversation_id, temperature, and max_tokens.

Validation is not only about types; it’s about constraints that reflect engineering judgment. For example, enforce min_length for message content (avoid empty messages), cap max_length (protect memory and latency), and restrict roles to a known set (system, user, assistant). Consider adding a rule that only the first message may be system, or that a request must include at least one user message.

It’s also the right place to implement lightweight “prompt safety” rules. You are not building a full policy engine, but you can block obviously dangerous payloads (e.g., attempts to exfiltrate secrets from server prompts) or strip control characters. Another common mistake is to silently modify user input in ways that surprise clients. Prefer explicit validation errors (HTTP 422) with clear messages so callers can fix their requests.

  • Define request models: ChatRequest, ChatMessage.
  • Define response models: ChatResponse with assistant_message, conversation_id, and basic usage/latency metrics.
  • Use field constraints (constr, conint) and custom validators for cross-field rules.

The practical outcome is an API that “fails fast” with readable errors, protects your service from runaway payloads, and makes your assistant’s behavior easier to reason about—especially when you later add retrieval-augmented generation (RAG) or tool calling.
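The constraints described above can be expressed directly in the models. The sketch below uses Pydantic v2 syntax, and the specific field limits are illustrative choices:

```python
from typing import Literal, Optional
from pydantic import BaseModel, Field, model_validator

class ChatMessage(BaseModel):
    role: Literal["system", "user", "assistant"]
    content: str = Field(min_length=1, max_length=8000)  # caps protect memory/latency

class ChatRequest(BaseModel):
    messages: list[ChatMessage] = Field(min_length=1, max_length=50)
    conversation_id: Optional[str] = None
    temperature: float = Field(default=0.2, ge=0.0, le=2.0)
    max_tokens: int = Field(default=512, ge=1, le=4096)

    @model_validator(mode="after")
    def check_roles(self) -> "ChatRequest":
        # Cross-field rules: only the first message may be `system`,
        # and at least one message must come from the user.
        if any(m.role == "system" for m in self.messages[1:]):
            raise ValueError("only the first message may have role 'system'")
        if not any(m.role == "user" for m in self.messages):
            raise ValueError("at least one user message is required")
        return self
```

FastAPI turns any validation failure here into an HTTP 422 with a readable error body, which is exactly the "fail fast" behavior you want.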

Section 3.3: Calling Ollama from FastAPI (sync/async patterns)

Ollama exposes a local HTTP API, so your FastAPI service is essentially an HTTP-to-HTTP adapter with added contracts, memory, and safety. The main decision is whether your route handlers will be synchronous or asynchronous—and whether the underlying HTTP client is blocking. If you write async def routes but call Ollama using a blocking library, you may stall the event loop and degrade concurrency. This is a classic mistake when adding streaming later.

A practical approach is to implement an OllamaClient service with both sync and async methods. For sync, requests is straightforward and fine for low concurrency. For async, use httpx.AsyncClient so multiple requests can overlap while waiting on tokens. Keep the Ollama URL, model name, and timeout in configuration so you can swap models (e.g., llama3 vs. mistral) without code changes.

Conversation memory usually means “send prior messages back to the model.” You can store a short history per conversation_id in-memory for this chapter (a dictionary plus timestamps). Engineering judgment matters here: cap how many turns you keep (e.g., last 10 messages) and cap total characters. Otherwise, long chats will become slower and more expensive. Also decide whether the server generates conversation_id when missing, and whether clients can reset memory.

  • Build a prompt payload from validated messages and your server-side system prompt template.
  • Attach conversation history in a predictable order; do not “mix” user and assistant messages incorrectly.
  • Return consistent metadata: model used, latency, and conversation_id.

The practical outcome is a working /chat endpoint that produces responses deterministically from a known request shape, while giving you enough flexibility to tune performance and model choice as you iterate.
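The history caps described above can be sketched as a small in-memory store; the turn and character limits are illustrative, and the actual model call would go through your client (httpx.AsyncClient for async routes):

```python
import time

class ConversationMemory:
    """Per-conversation history with turn and size caps (a sketch; a
    production version might swap this for Redis)."""

    def __init__(self, max_turns: int = 10, max_chars: int = 8000):
        self.max_turns, self.max_chars = max_turns, max_chars
        self._store: dict[str, list[dict]] = {}

    def append(self, conversation_id: str, role: str, content: str) -> None:
        history = self._store.setdefault(conversation_id, [])
        history.append({"role": role, "content": content, "ts": time.time()})

    def get(self, conversation_id: str) -> list[dict]:
        """Return recent turns trimmed to both caps, dropping oldest first."""
        history = self._store.get(conversation_id, [])[-self.max_turns:]
        total = 0
        kept: list[dict] = []
        for msg in reversed(history):  # walk newest-first so new turns survive
            total += len(msg["content"])
            if total > self.max_chars:
                break
            kept.append(msg)
        return list(reversed(kept))   # restore chronological order
```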

Section 3.4: Server-sent events or streaming token strategies

Streaming is a user-experience upgrade that also changes how you think about HTTP responses. Instead of waiting 5–30 seconds for a full completion, clients can render tokens as they arrive. This is especially important for local inference, where latency can vary based on CPU/GPU load. In FastAPI, the simplest approach is server-sent events (SSE) using text/event-stream with a StreamingResponse.

There are two common token strategies. First, “true token streaming,” where you pass stream=true to Ollama and forward each chunk as it arrives. Second, “simulated streaming,” where you request the full response and then yield it in small pieces. True streaming is preferred: it reduces time-to-first-token and supports cancellation. Simulated streaming can be acceptable for prototypes but wastes latency and memory.

Design your stream protocol deliberately. SSE typically emits events like event: token with JSON payloads containing delta text, and a final event: done containing the full assembled message and metadata. A common mistake is to stream plain text without framing; clients then struggle to detect completion or parse errors. Another mistake is to forget to flush or to buffer too much, which makes “streaming” feel like a delayed batch response.

  • Choose SSE for browser-friendly streaming; consider WebSockets only if you need bidirectional interactivity.
  • Send periodic keep-alives for long generations to avoid proxy timeouts.
  • Support client cancellation by handling disconnects and stopping upstream reads.

The practical outcome is a streaming /chat/stream (or a query flag like ?stream=true) that feels responsive, is easy to consume from a UI, and is robust under variable local inference speed.

Section 3.5: Error handling, timeouts, and graceful degradation

Local LLM services fail in predictable ways: Ollama might not be running, the model might not be pulled, generation can be slow, and the machine can run out of memory. Your API should treat these as normal conditions, not surprises. Start with explicit timeouts on outbound Ollama calls. Without them, a single request can hang until the client gives up, tying up server resources. Pair timeouts with clear HTTP status codes and consistent error shapes.

Define a small error schema, for example: {"error": {"code": "UPSTREAM_TIMEOUT", "message": "...", "retryable": true}}. Use it everywhere: validation errors, upstream failures, and internal exceptions. When you stream, errors need special care: you may already have sent partial data. In SSE, you can emit an event: error before closing the stream, so clients can show a useful message instead of a silent failure.

Retries are a judgment call. Retrying a long generation often makes things worse, but retrying a transient network error to the local Ollama process can help. If you implement retries, keep them conservative (e.g., 1 retry, short backoff) and never retry non-idempotent operations without thinking. Also implement graceful degradation paths: if streaming fails mid-way, you might fall back to returning the partial text you already received with a partial=true flag, rather than discarding everything.

  • Set per-request deadlines (client timeout) and upstream deadlines (Ollama timeout).
  • Translate upstream errors into stable API errors; do not leak raw stack traces.
  • Log enough context (model, conversation_id, latency) to debug performance issues.

The practical outcome is an assistant API that feels dependable. Even when something goes wrong, callers get predictable signals about what happened and what to do next.
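The error schema and translation step can be sketched as below; the status-code mapping and messages are illustrative choices:

```python
def api_error(code: str, message: str, retryable: bool) -> dict:
    """One error shape used everywhere: validation, upstream, and internal."""
    return {"error": {"code": code, "message": message, "retryable": retryable}}

def map_upstream_failure(exc: Exception, partial_text: str = "") -> tuple[int, dict]:
    """Translate upstream failures into stable (status, body) pairs; any
    partial streamed text is preserved instead of being discarded."""
    if isinstance(exc, TimeoutError):
        body = api_error("UPSTREAM_TIMEOUT", "Model did not respond in time.", retryable=True)
        status = 504
    elif isinstance(exc, ConnectionError):
        body = api_error("UPSTREAM_UNAVAILABLE", "Is Ollama running and the model pulled?", retryable=True)
        status = 503
    else:
        body = api_error("INTERNAL_ERROR", "Unexpected server error.", retryable=False)
        status = 500
    if partial_text:
        body["partial"] = True
        body["partial_text"] = partial_text
    return status, body
```

Routing every failure through one mapper keeps error shapes consistent and prevents raw stack traces from leaking to clients.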

Section 3.6: API usability—OpenAPI, examples, and curl recipes

An internal API is only useful if other people (or other programs) can call it without reading your source code. FastAPI’s automatic OpenAPI docs are a major advantage—if you feed them good models and examples. Add example payloads directly in your Pydantic schemas and route decorators so the interactive docs show realistic chat requests and streaming usage. Document what conversation_id means, whether memory persists across restarts, and what limits you enforce (message length, number of turns, timeout).

Provide two primary endpoints: a non-streaming POST /v1/chat for simple clients, and a streaming variant (either POST /v1/chat/stream or POST /v1/chat?stream=true). Include /healthz and /readyz so local callers can programmatically detect availability. This checkpoint matters for career transition projects: you are demonstrating that you can build “callable” services with contracts, not just notebooks.

Include copy-paste recipes. A minimal curl example for non-streaming might post JSON with messages, while streaming might use -N to disable buffering and show events in real time. Also be explicit about content types: application/json for regular requests and text/event-stream for SSE responses. A common mistake is to ship an API that works in Swagger UI but lacks real-world examples for terminals, Python scripts, or front-end fetch calls.

  • Use tags and summaries so endpoints are grouped cleanly in docs.
  • Document error responses and include examples of upstream failure messages.
  • Version your API under /v1 to keep freedom to change later.

The practical outcome is your chapter checkpoint: a documented local assistant API that others can call, confidently integrate, and debug—with clear contracts for chat, streaming, and operational health.

Chapter milestones
  • Create FastAPI service with health checks and configuration
  • Implement /chat endpoint with conversation memory
  • Add streaming responses for better UX
  • Define request/response models with Pydantic and OpenAPI docs
  • Checkpoint: a documented API that others can call locally
Chapter quiz

1. Why does Chapter 3 emphasize building an API that "behaves predictably" rather than a one-off demo endpoint?

Show answer
Correct answer: Because other tools and future integrations depend on stable input validation, documented contracts, and consistent errors
The chapter frames the API as a product surface area used by CLIs, UIs, and scripts, so it must validate, document, and fail consistently.

2. In the chapter’s framing, what are the API’s "contracts"?

Show answer
Correct answer: The promised shapes and behavior of requests, responses (including errors), streaming delivery, and how conversation memory is scoped
Contracts are the explicit promises the API makes about inputs/outputs, streaming semantics, error shapes, and memory handling.

3. What is the primary user-experience reason for adding streaming responses to the /chat endpoint?

Show answer
Correct answer: So the client can render partial output immediately while the model is still generating
Streaming lets clients display tokens incrementally instead of waiting for the full completion.

4. How do Pydantic models help achieve the chapter’s goal of a reliable assistant API?

Show answer
Correct answer: They validate inputs and define request/response shapes that appear automatically in OpenAPI documentation
Pydantic enforces schemas and produces clear, documented contracts via OpenAPI.

5. Which situation best matches the chapter’s warning about common mistakes in FastAPI LLM services?

Show answer
Correct answer: An async route makes blocking calls and allows requests to run forever, causing unreliable behavior when the model is slow or unavailable
The chapter flags pitfalls like blocking calls in async routes and letting requests run indefinitely, especially with variable-latency model dependencies.

Chapter 4: Dockerize the Stack—Ollama + FastAPI for One-Command Runs

In Chapter 3 you proved your assistant works on your machine. In this chapter you make it work on any machine—reliably, repeatedly, and with one command. Docker is the lever that turns “it runs on my laptop” into a portable, shareable, team-friendly system. For career transitions, this is not a nice-to-have: employers want reproducible environments, clear configuration boundaries, and deployable artifacts.

The stack you’re containerizing has two moving parts: a FastAPI service that exposes your chat endpoints and streams tokens, and an Ollama runtime that hosts the model. Docker lets each component have its own filesystem, dependencies, and lifecycle, while docker compose ties them together with a shared network and persistent volumes.

As you build this, keep an engineering mindset: choose defaults that minimize surprises (pinned versions, explicit ports, explicit volumes), avoid storing secrets in images, and structure compose files so development is fast while production is predictable. By the end, you’ll run docker compose up and get a working assistant stack—API, model runtime, and persistence—without manual setup steps.

Practice note for Write a Dockerfile for the FastAPI service: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Create docker-compose for FastAPI + Ollama with volumes: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Add environment configuration and secrets handling patterns: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Optimize container startup and local dev workflows: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Checkpoint: run the full assistant stack with docker compose up: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Sections in this chapter
Section 4.1: Docker essentials for AI apps (images, volumes, networks)

AI applications amplify normal Docker concerns because models are large, startup can be slow, and state matters. Three Docker primitives drive most of your design: images, volumes, and networks.

Images are immutable templates. Your FastAPI image should contain only your app code and Python dependencies. Avoid baking model files into the image; it makes builds huge and rebuilds painful. Treat the image as “the code + runtime,” not “the data.” Pin base images (for example, python:3.11-slim) so rebuilds don’t drift.

Volumes are essential for local LLM work. Ollama stores downloaded models and runtime state under its data directory (commonly /root/.ollama inside the container). Without a volume, every container recreation forces a model re-download, which is slow and often mistaken for “the stack is broken.” In compose, bind a named volume to that directory so models persist across restarts. Similarly, if you add lightweight RAG later, mount your local documents directory read-only into the API container (or a separate indexer container) instead of copying documents into the image.

Networks are how containers talk. Compose automatically creates a private network and registers services by name, so your FastAPI container can reach Ollama at http://ollama:11434. The common mistake is pointing your app to localhost. Inside a container, localhost refers to that container, not your host and not other services. Use the compose service name as the hostname.

  • Judgment call: keep persistence explicit. If a service needs data across restarts, use a named volume and document it.
  • Common mistake: binding host directories with mismatched permissions (especially on Linux) and then debugging “mysterious” write failures.
  • Practical outcome: after this section, you should know where model state lives and how services will resolve each other’s hostnames.
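In compose terms, the three primitives come together in a few lines. This is a minimal sketch; the service name (ollama), volume name, and image tags are illustrative, and the full file in Section 4.4 adds health checks and ports:

```yaml
services:
  ollama:
    image: ollama/ollama          # model runtime container
    volumes:
      - ollama_models:/root/.ollama   # named volume: pulled models survive recreation

  api:
    build: .                      # image = code + runtime, no model data baked in
    environment:
      # Compose DNS resolves the service name; never use localhost here
      - OLLAMA_BASE_URL=http://ollama:11434

volumes:
  ollama_models:                  # explicit, documented persistence
```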

With these basics, you’re ready to containerize the API cleanly and keep the model runtime stateful without turning your application image into a giant artifact.

Section 4.2: Containerizing FastAPI (Uvicorn/Gunicorn choices)

Your FastAPI container should build fast, start fast, and expose a predictable port. A straightforward Dockerfile is usually best for this course: install dependencies, copy code, and run Uvicorn. Later, you can choose Gunicorn for multi-worker production deployments, but be careful with streaming token endpoints.

A practical Dockerfile pattern looks like this:

  • Use python:3.11-slim (or similar) as the base.
  • Set WORKDIR (for example, /app).
  • Copy dependency files first (pyproject.toml/poetry.lock or requirements.txt) to maximize build caching.
  • Install dependencies, then copy the rest of the application code.
  • Run uvicorn app.main:app --host 0.0.0.0 --port 8000.
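That pattern, written out as a sketch (the requirements.txt filename and app/ layout are assumptions; adapt to your repository):

```dockerfile
FROM python:3.11-slim

WORKDIR /app

# Copy the dependency manifest first so this layer caches across code-only changes
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Now copy the application code (changes here don't invalidate the pip layer)
COPY app ./app

# Bind to 0.0.0.0 so the server is reachable from outside the container
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]
```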

For token streaming, Uvicorn is often the simplest and most predictable in containers. If you adopt Gunicorn, use the Uvicorn worker class (for example, -k uvicorn.workers.UvicornWorker) and test streaming carefully. Multi-worker setups can introduce subtle issues: streaming responses may behave differently across workers, and you may need sticky sessions if you add websockets or stateful connections.

Two common mistakes: (1) binding to 127.0.0.1 inside the container, which makes the service unreachable from outside; always bind to 0.0.0.0. (2) forgetting to expose or map the port in compose, then assuming the server didn’t start.

Configuration belongs in environment variables, not hard-coded constants. In this chapter you’ll pass OLLAMA_BASE_URL, request timeouts, and safety/validation settings via compose. That keeps the image reusable in dev, testing, and production-like runs without rebuilding.

Practical outcome: you can build your FastAPI image once and run it anywhere, and you’ve made an intentional server choice (Uvicorn now; Gunicorn later when you need more concurrency).

Section 4.3: Ollama in containers—models, persistence, and ports

Ollama is the model runtime in your stack. When running it in a container, you care about three things: how the API is reached, where models are stored, and how to manage model downloads without slowing down your workflow.

By default, Ollama listens on port 11434. In compose, you’ll typically publish it to the host (for example, 11434:11434) so you can test it directly with curl from your machine. Even if you don’t publish it, your FastAPI service can still reach it internally using the compose network. Publishing is mainly for developer convenience.

Persistence is non-negotiable: mount a named volume to Ollama’s data directory so model pulls survive container recreation. Without that volume, every docker compose up after a cleanup triggers a fresh download. This is one of the most frequent “why is this taking forever?” moments for first-time local LLM users.

Model management should be intentional. Decide which model(s) you need for your assistant’s role (smaller for speed, larger for quality), and pull them once. A practical pattern is to keep the model name in an environment variable like OLLAMA_MODEL so switching models doesn’t require code changes. You can also separate “startup” from “model warmup”: start the containers first, then run a one-time ollama pull step against the running service to populate the volume.
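One way to keep that configuration in environment variables is a small settings helper. The variable names follow the text; the defaults (the model name and the REQUEST_TIMEOUT_S variable) are assumptions for illustration:

```python
import os


def ollama_settings() -> dict:
    """Read runtime configuration from the environment.

    Defaults are illustrative. In compose, OLLAMA_BASE_URL should point
    at the service name (http://ollama:11434), not localhost.
    """
    return {
        "base_url": os.getenv("OLLAMA_BASE_URL", "http://ollama:11434"),
        # Switching models is a config change, not a code change
        "model": os.getenv("OLLAMA_MODEL", "llama3.2"),
        "request_timeout_s": float(os.getenv("REQUEST_TIMEOUT_S", "120")),
    }
```

Because the defaults live in one place, the same image runs in dev and production-like modes with different compose environment blocks.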

  • Judgment call: avoid pulling models during image build. Builds should be deterministic and fast; model downloads are large, can fail, and don’t belong in the FastAPI image.
  • Common mistake: pointing your API to http://localhost:11434 from inside the container. In compose, use http://ollama:11434.

Practical outcome: you’ll have a stable Ollama container whose model cache persists, with ports and URLs that behave consistently across restarts and across machines.

Section 4.4: docker-compose orchestration and health checks

docker-compose.yml (or compose.yml) is where the stack becomes one command. You define two services—api and ollama—attach volumes, map ports, and inject configuration. Compose also gives you a shared DNS-based network so services can find each other by name.

A well-structured compose file does more than start containers; it expresses operational intent. Add a health check for Ollama so the API doesn’t hammer it during startup. Health checks are especially useful because model runtimes can be “running” but not yet ready to respond. A simple health check can call an Ollama endpoint (for example, a tags/list endpoint) and retry until it succeeds.

Then, use depends_on with a health condition (supported in modern Compose implementations) to delay API startup until Ollama is healthy. Even with that, your API should still implement timeouts and retries when calling Ollama—health checks reduce failures, they don’t eliminate them.

Mount volumes explicitly:

  • A named volume for Ollama model storage (persistent).
  • An optional bind mount for your FastAPI code in dev (live reload), but not in production mode.
  • An optional read-only bind mount for documents if you’re preparing for lightweight RAG (for example, ./docs:/docs:ro), which keeps grounding material outside the image.
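A sketch of the health check and startup ordering described above. The curl-based probe assumes curl is available in the Ollama image; if it isn’t, swap in whatever probe your image supports. Intervals and retry counts are starting points, not recommendations:

```yaml
services:
  ollama:
    image: ollama/ollama
    volumes:
      - ollama_models:/root/.ollama
    healthcheck:
      # Probe a cheap endpoint and retry until the runtime answers
      test: ["CMD-SHELL", "curl -sf http://localhost:11434/api/tags || exit 1"]
      interval: 5s
      timeout: 3s
      retries: 12

  api:
    build: .
    ports:
      - "8000:8000"
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    depends_on:
      ollama:
        condition: service_healthy   # delay API startup until Ollama responds

volumes:
  ollama_models:
```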

Practical outcome: “run the full assistant stack” becomes docker compose up, and you gain resilience against race conditions where the API starts before the model runtime is actually ready.

Section 4.5: Development vs production compose profiles

One compose file can support both a rapid local development loop and a production-like run, but only if you separate concerns. Compose profiles are a clean way to do this: define a dev profile that enables hot reload and bind mounts, and a default (or prod) profile that runs immutable containers with conservative settings.

In development, prioritize feedback speed:

  • Mount your source directory into the API container and run Uvicorn with --reload so code changes take effect immediately.
  • Expose ports openly (API on 8000, Ollama on 11434) so you can test with local tools.
  • Use environment variables via an .env file for non-sensitive config (model name, log level), but keep secrets out of Git.

In production-like mode, prioritize predictability and safety:

  • No bind mounts of application code; run the built image only.
  • Consider not publishing the Ollama port to the host if only the API needs it.
  • Set explicit CPU/memory expectations where supported, and set stricter timeouts.
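One way to express the dev/prod split with compose profiles. The service names and mount paths are illustrative; the point is that each mode is selected by a flag (docker compose --profile dev up vs --profile prod up) rather than by editing YAML:

```yaml
services:
  api:
    build: .
    profiles: ["prod"]          # immutable image, no bind mounts
    ports:
      - "8000:8000"

  api-dev:
    build: .
    profiles: ["dev"]           # started only with --profile dev
    command: uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload
    volumes:
      - ./app:/app/app          # bind mount: code changes reload instantly
    ports:
      - "8000:8000"
```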

For configuration and secrets handling, adopt patterns you can explain in an interview: environment variables for runtime config, .env for local developer convenience, and Docker secrets (or your platform’s secret manager) for real credentials. Even if your assistant is fully local today, practicing “secrets hygiene” now prevents bad habits later.

Practical outcome: you can switch between fast iteration and a stable deployment posture with a single flag, without editing YAML each time.

Section 4.6: Troubleshooting: permissions, ports, and resource limits

When Dockerizing AI workloads, failures cluster around three areas: filesystem permissions, port confusion, and resource limits. Knowing the patterns saves hours.

Permissions: If Ollama can’t write to its model directory, you’ll see repeated download failures or corrupted caches. This often happens with bind mounts on Linux where host directory ownership doesn’t match the container user. Prefer named volumes for Ollama storage because Docker manages permissions more predictably. If you must use a bind mount, verify ownership and consider running the container with a user that can write to the mount point.

Ports and hostnames: If your API can’t reach Ollama, check the URL from the API container’s perspective. Inside compose, the correct host is usually the service name (ollama), not localhost. If you can curl http://ollama:11434 from inside the API container but not from your host, you likely forgot to publish the port. If you can reach it from the host but not from the API container, you likely misconfigured the base URL.
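The localhost-inside-a-container mistake is common enough that you can flag it at startup. A hypothetical check along these lines (the function and its message are illustrative, not part of any library):

```python
from urllib.parse import urlparse


def diagnose_base_url(url: str, in_container: bool = True) -> str:
    """Flag the classic misconfiguration: a containerized API pointing
    at localhost instead of the compose service name."""
    host = urlparse(url).hostname
    if in_container and host in ("localhost", "127.0.0.1"):
        return ("localhost inside a container is the container itself; "
                "use the compose service name, e.g. http://ollama:11434")
    return "ok"
```

Calling this once when the API boots, and logging the result, turns an hour of confusion into one clear log line.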

Resource limits: Local inference is heavy. If the model loads slowly or requests time out, check CPU/RAM availability. Containers don’t magically create compute; they compete for the same host resources. On Docker Desktop, ensure the VM has enough memory allocated. Also watch for timeouts: your FastAPI client calls to Ollama should set a realistic read timeout for streaming responses. Too short looks like “random failures,” especially on first token generation when the model is cold.

  • Practical debugging workflow: check container logs, then exec into the API container and test connectivity with curl, then validate volume mounts with docker volume inspect and directory listings.
  • Common mistake: assuming a health check guarantees readiness for a specific model. Even after Ollama is up, the first request can be slower due to model load.

Checkpoint outcome: you can bring the entire assistant up with docker compose up, verify the API endpoint responds, and confirm Ollama’s models persist across restarts—making your local LLM assistant a reproducible artifact instead of a fragile setup.

Chapter milestones
  • Write a Dockerfile for the FastAPI service
  • Create docker-compose for FastAPI + Ollama with volumes
  • Add environment configuration and secrets handling patterns
  • Optimize container startup and local dev workflows
  • Checkpoint: run the full assistant stack with docker compose up
Chapter quiz

1. What is the main reason Chapter 4 introduces Docker for the Ollama + FastAPI assistant stack?

Correct answer: To make the stack portable and reproducible across machines with a one-command run
The chapter emphasizes turning “runs on my laptop” into a reliable, shareable system that starts with one command.

2. Why does the chapter recommend containerizing FastAPI and Ollama as separate services rather than bundling them into one container?

Correct answer: It allows each component to have its own filesystem, dependencies, and lifecycle while compose connects them
Docker keeps each service isolated, and docker compose ties them together via networking and shared resources like volumes.

3. In the chapter’s recommended compose setup, what is the purpose of using persistent volumes?

Correct answer: To preserve important state (such as model/runtime data) across container restarts
Persistent volumes keep data from being lost when containers are recreated, supporting repeatable runs without re-setup.

4. Which configuration practice best matches the chapter’s guidance on minimizing surprises and keeping deployments predictable?

Correct answer: Use pinned versions and explicit ports/volumes instead of relying on implicit defaults
The chapter calls for pinned versions and explicit configuration boundaries to make behavior consistent across machines.

5. Which approach aligns with the chapter’s recommendation for handling secrets when Dockerizing the stack?

Correct answer: Keep secrets out of images and provide them via environment configuration patterns
The chapter explicitly warns against storing secrets in images and promotes clear environment configuration boundaries.

Chapter 5: Add Retrieval (RAG) + Guardrails for a Portfolio-Ready Assistant

By now you have a local LLM running via Ollama and a FastAPI service that can stream chat responses. That is a strong prototype, but it still has a common portfolio problem: it can sound confident while being wrong, and it can’t reliably answer questions about your data (resume, project docs, internal notes, PDFs, readme files). This chapter upgrades your assistant into something you can responsibly demo: it will ground answers in local documents using lightweight Retrieval-Augmented Generation (RAG), cite the context it used, and apply practical guardrails (input limits, content policy checks, timeouts, retries, and rate limiting).

The engineering mindset here is simple: don’t ask the model to “remember” or “guess” what you can retrieve deterministically. For a career-transition portfolio, grounded answers with citations are more persuasive than a clever prompt. At the same time, local deployments have their own constraints: CPU-only environments, limited RAM, and the need for predictable latency. We’ll keep the pipeline small, testable, and transparent.

  • Goal: ingest local docs → chunk → embed → store vectors → retrieve top-k chunks per question → assemble context → prompt LLM to answer with citations.
  • Guardrails: validate input, constrain retrieval size, enforce timeouts, implement retries, and add basic safety filters plus rate limiting.
  • Checkpoint outcome: your assistant answers grounded questions, cites sources, and fails safely when context is missing.

As you implement this, focus on two practical outcomes recruiters can understand: (1) “it cites where facts came from,” and (2) “it’s robust under messy user input.”

Practice note for Ingest local docs and chunk text for retrieval: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Implement embeddings and a simple vector store option: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Compose prompts that cite retrieved context: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Add guardrails: input constraints, content policies, and rate limiting: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Checkpoint: assistant answers grounded questions with citations: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Sections in this chapter
Section 5.1: When to use RAG vs fine-tuning for local assistants
Section 5.2: Document ingestion and chunking strategies
Section 5.3: Embeddings options (local and lightweight choices)
Section 5.4: Retrieval pipeline and context assembly
Section 5.5: Safety and reliability guardrails for local deployments
Section 5.6: Observability basics—logs, tracing, and prompt/version tracking

Section 5.1: When to use RAG vs fine-tuning for local assistants

RAG and fine-tuning solve different problems. Fine-tuning changes model behavior (style, format, domain language) by updating weights. RAG keeps the model fixed and changes what it sees at inference time by attaching relevant documents to the prompt. For most portfolio assistants that answer questions about personal projects, company policies, or internal notes, RAG is the default because the knowledge changes often and you want answers that can be traced to sources.

Use RAG when: your facts live in documents; you need citations; you expect the content to change weekly; you want to keep everything local; and you want a clear “I don’t know” path when retrieval fails. RAG also reduces hallucinations because you can instruct the model to only answer from retrieved context and to quote or cite chunk IDs.

Use fine-tuning (or lightweight alternatives like prompt templates and system instructions) when: you need consistent formatting (JSON schemas, ticket templates), specialized tone, or tool-using behavior; and the change is stable across many tasks. Fine-tuning is not an efficient way to “store” your documents—especially if you need to update them frequently. It also makes it harder to prove where an answer came from, which matters when you demo reliability.

Engineering judgment: start with RAG, then add small behavioral shaping. For example, keep a stable system prompt that defines policies (cite sources, refuse unsafe requests, admit uncertainty) and rely on RAG for facts. A common mistake is trying to solve accuracy by “prompting harder” without retrieval. Another mistake is overstuffing the prompt with too many chunks; the model becomes slow and confused, and citations become meaningless.

For this course, your portfolio-ready story is: “I built a local assistant that retrieves from my project docs and responds with citations; it’s safer and more reliable than a pure chat bot.”

Section 5.2: Document ingestion and chunking strategies

Ingestion is the unglamorous work that determines whether RAG feels magical or broken. Your pipeline should take a folder of local documents and produce normalized text plus metadata (source path, title, page number, headings). Start by supporting the formats you can parse reliably: .txt, .md, and .pdf (via a PDF text extractor). If PDF extraction is noisy, consider exporting to markdown first for your portfolio demo; clean inputs lead to cleaner retrieval.

Chunking is the next critical decision. Retrieval works best when each chunk is small enough to be specific, but large enough to contain the answer. A practical baseline is 400–800 tokens per chunk with 10–20% overlap. Overlap preserves continuity for concepts that span chunk boundaries. If you chunk too small (e.g., single sentences), retrieval becomes brittle and you lose context. If you chunk too large (e.g., entire documents), you retrieve irrelevant material and waste the context window.

Prefer structure-aware chunking: split markdown by headings, then split long sections by paragraph boundaries. Keep metadata that helps citations: {doc_id, source, heading, chunk_id, start_char, end_char}. That metadata becomes your citation mechanism and your debugging tool when users ask, “Where did that come from?”

  • Common mistake: chunking by fixed character counts without respecting paragraphs—this often breaks code blocks and lists, making chunks hard to interpret.
  • Common mistake: discarding metadata—without it, you cannot produce credible citations.
  • Practical workflow: ingest → normalize whitespace → remove boilerplate (nav menus, repeated footers) → chunk → store.

Keep ingestion idempotent. If the docs folder hasn’t changed, don’t rebuild everything. For a local demo, a simple approach is to compute a hash of each file and re-embed only changed files. This improves iteration speed and makes your system feel like a product rather than a script.
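As a sketch, paragraph-respecting chunking with overlap and citation metadata might look like this. Character counts stand in for the 400–800 token guideline, and the field names follow the metadata list above; tune both for your corpus:

```python
import hashlib


def chunk_paragraphs(text, source, max_chars=2000, overlap=200):
    """Greedy paragraph-based chunking with character overlap.

    Paragraphs are kept whole; when a chunk fills up, its tail is
    carried into the next chunk so concepts spanning a boundary
    stay retrievable.
    """
    paras = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, buf = [], ""
    for p in paras:
        if buf and len(buf) + len(p) + 2 > max_chars:
            chunks.append(buf)
            buf = buf[-overlap:]  # overlap: carry the tail forward
        buf = (buf + "\n\n" + p) if buf else p
    if buf:
        chunks.append(buf)
    return [
        {"doc_id": hashlib.sha256(source.encode()).hexdigest()[:12],
         "source": source, "chunk_id": i, "text": c}
        for i, c in enumerate(chunks)
    ]


def file_hash(path):
    """Content hash used to skip re-embedding unchanged files."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()
```

Comparing file_hash against a stored value per file is the simple idempotency check described above: unchanged files are skipped, changed files are re-chunked and re-embedded.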

Section 5.3: Embeddings options (local and lightweight choices)

Embeddings convert each chunk into a numeric vector so you can do similarity search. In hosted systems you might call an embeddings API, but for this course the point is local, repeatable deployment. You have three practical options: (1) run a local embedding model, (2) reuse an Ollama-provided embedding endpoint if available in your setup, or (3) choose a lightweight CPU-friendly model via a Python library.

For local-first portfolios, CPU-friendly sentence embedding models are often sufficient. The tradeoff is quality vs speed: smaller models embed faster and use less RAM, but may retrieve less accurately on nuanced queries. If you can afford it, run embeddings on GPU; if not, optimize the rest of the pipeline (good chunking, clean text, sane top-k) to compensate.

Vector store choices can remain simple. A “vector store” is just: vectors + metadata + a way to search. For lightweight options, you can use:

  • In-memory + numpy for small corpora (hundreds to a few thousand chunks). Easiest to understand and great for teaching.
  • SQLite to persist metadata and vectors (often stored as blobs), paired with brute-force cosine similarity. Good enough for demos.
  • FAISS for faster similarity search at larger scales. Adds a dependency but improves latency significantly.
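The in-memory + numpy option really is just a few lines. This is a teaching sketch, not a production index; it assumes all vectors come from one embedding model and share one dimensionality:

```python
import numpy as np


class TinyVectorStore:
    """Brute-force cosine-similarity store: vectors + metadata + search."""

    def __init__(self):
        self.vectors, self.metadata = [], []

    def add(self, vector, meta):
        v = np.asarray(vector, dtype=np.float32)
        # Pre-normalize so search is a plain dot product
        self.vectors.append(v / (np.linalg.norm(v) + 1e-12))
        self.metadata.append(meta)

    def search(self, query, k=4):
        if not self.vectors:
            return []
        q = np.asarray(query, dtype=np.float32)
        q = q / (np.linalg.norm(q) + 1e-12)
        sims = np.stack(self.vectors) @ q   # cosine similarity per chunk
        top = np.argsort(sims)[::-1][:k]
        return [(float(sims[i]), self.metadata[i]) for i in top]
```

When the corpus outgrows brute force, the same add/search interface can be backed by SQLite or FAISS without touching the rest of the pipeline.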

Engineering judgment: keep it boring until you need performance. In a portfolio context, correctness and debuggability beat “enterprise architecture.” Make sure embeddings are deterministic: same text → same vector. Version your embedding model name and chunking parameters; changing either invalidates your existing index and can silently degrade retrieval.

A common mistake is mixing embeddings from different models in one index. Another is embedding raw PDFs with headers, footers, and page numbers intact; retrieval then matches on repeated boilerplate instead of content. Clean inputs and consistent model/version tracking are your best friends.

Section 5.4: Retrieval pipeline and context assembly

The retrieval pipeline is the “glue” that turns a user question into grounded context for the LLM. The basic sequence is: (1) embed the user query, (2) similarity-search your vector store, (3) select top-k chunks, (4) assemble a context block with citations, and (5) prompt the LLM to answer using only that context.

Top-k selection is not arbitrary. Start with k=4 to 8 and cap the total context to a token budget (for example, 1,500–2,500 tokens depending on your model’s context window and your desired latency). If you always include too much context, the model may “average” across chunks and produce vague answers. If you include too little, it may miss a key detail and hallucinate. A practical technique is to retrieve more (say k=12) then re-rank or filter by simple heuristics (e.g., drop chunks with similarity below a threshold).

Context assembly should be explicit and auditable. Build a block like:

  • [S1] source=docs/resume.md#Projects chunk=12
  • ...chunk text...
  • [S2] source=docs/projectA/README.md chunk=03

Then instruct the model: “Use only sources S1–S#; when you state a fact, cite like [S1]. If the sources do not contain the answer, say you can’t find it.” This single instruction dramatically improves trustworthiness. It also enables your checkpoint: answers grounded with citations.

In FastAPI, treat retrieval as a separate function/module so you can test it without the LLM. A common mistake is coupling retrieval and generation tightly; when outputs are wrong you can’t tell whether retrieval failed or generation drifted. Add a debug mode that returns retrieved chunks and scores alongside the answer (or logs them). That makes demos compelling: you can show the evidence.

Finally, handle the “no results” path. If similarity scores are low, return a safe response that asks a clarifying question or suggests which document set to add. Don’t pass empty context and hope the model guesses—this is where hallucinations are born.
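Putting steps (3)–(5) together, context assembly plus the no-results path could be sketched as follows. The score floor and the metadata field names are assumptions; they should match whatever your retrieval layer returns:

```python
def assemble_context(chunks, min_score=0.25):
    """Build a labeled, auditable context block from retrieved chunks.

    `chunks` is a list of (score, meta) pairs where meta carries
    'source', 'chunk_id', and 'text'. Below `min_score` we treat
    retrieval as empty so the caller can fail safely.
    """
    kept = [(s, m) for s, m in chunks if s >= min_score]
    if not kept:
        return None  # caller returns a safe "can't find it" reply instead
    lines = []
    for i, (score, meta) in enumerate(kept, start=1):
        lines.append(f"[S{i}] source={meta['source']} chunk={meta['chunk_id']} score={score:.2f}")
        lines.append(meta["text"])
    instruction = (
        f"Use only sources S1-S{len(kept)}; when you state a fact, cite like [S1]. "
        "If the sources do not contain the answer, say you can't find it."
    )
    return instruction + "\n\n" + "\n".join(lines)
```

Because this is a pure function, it is testable without the LLM, and returning the block in a debug response is exactly the "show the evidence" demo move described above.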

Section 5.5: Safety and reliability guardrails for local deployments

Guardrails are what turn a clever demo into something you can responsibly share. Local deployments are not automatically safer; they simply change the risk profile. You still need to protect your service from abusive inputs, accidental data leakage, runaway latency, and outputs that violate your intended use (for example, generating harmful instructions or exposing sensitive document excerpts beyond what the user should see).

Start with input constraints. Validate message length, reject extremely long prompts, and limit attachments. In FastAPI, enforce max characters per message and max messages per request. Add a request size limit at the server/proxy layer if possible. This prevents memory spikes during embedding and context assembly.

Add timeouts and retries. Local inference can stall if the machine is under load. Use a per-request timeout for embedding and generation. Retries should be selective: retry transient failures (model server unavailable), but do not retry policy violations or invalid inputs. If you stream tokens, ensure you handle client disconnects cleanly so you don’t keep generating after the user has gone away.

Implement basic content policies as a layered approach:

  • Pre-check: block or warn on clearly disallowed requests (e.g., self-harm instructions, malware). Keep rules simple and transparent.
  • Generation constraints: system prompt that forbids unsafe content and requires citations for factual claims.
  • Post-check: scan the model output for disallowed patterns and redact if necessary.

Add rate limiting even locally, especially if you plan to expose the API on your network. Rate limiting protects your machine from accidental loops (e.g., a front-end bug spamming requests). A simple token bucket per IP or per API key is enough for a portfolio project.
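A token bucket is only a few lines. This sketch is in-memory and per-process, which is fine for a single-instance portfolio API; keep one bucket per IP or API key in a dict:

```python
import time


class TokenBucket:
    """Refill `rate` tokens per second, up to `capacity`; each request
    spends one token. A request is allowed only if a token is available."""

    def __init__(self, rate=1.0, capacity=5.0):
        self.rate, self.capacity = rate, capacity
        self.tokens, self.last = capacity, time.monotonic()

    def allow(self):
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

In FastAPI this plugs in naturally as a dependency that returns HTTP 429 when allow() is False.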

Common mistakes: forgetting to cap retrieval context (leading to prompt overflow), not handling empty retrieval (leading to hallucinations), and logging raw documents or prompts in a way that leaks sensitive content. Your guardrails should include a policy for what gets logged and what must be redacted.

Section 5.6: Observability basics—logs, tracing, and prompt/version tracking

When your assistant answers incorrectly, you need to know why. Observability is the difference between guessing and debugging. For a local RAG assistant, focus on three things: structured logs, lightweight tracing, and version tracking for prompts and indices.

Structured logs should capture: request ID, user/session identifier (non-sensitive), model name, embedding model name, chunking parameters, retrieval top-k, similarity scores, and latency per stage (embed time, retrieval time, generation time). Avoid logging raw user content by default; instead log hashes or truncated previews. If you need full payloads for debugging, gate them behind a local-only debug flag and redact document text.

Tracing can be as simple as timing spans in code, but if you want a portfolio-level touch, integrate OpenTelemetry to create spans for: ingest, embed_query, vector_search, assemble_context, and generate. Even without a full tracing backend, exporting to console during development helps identify bottlenecks (often embeddings on CPU).
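Even without OpenTelemetry, stage timing can start as a tiny context manager. The stage and field names below mirror the spans listed above but are otherwise illustrative:

```python
import json
import time
import uuid
from contextlib import contextmanager


@contextmanager
def span(record, name):
    """Record the wall-clock duration of one pipeline stage into `record`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        record.setdefault("timings_ms", {})[name] = round(
            (time.perf_counter() - start) * 1000, 1
        )


# One structured log record per request, including version tracking
record = {"request_id": str(uuid.uuid4()), "prompt_version": "v3", "index_version": 2}
with span(record, "vector_search"):
    time.sleep(0.01)  # stand-in for the real similarity search
print(json.dumps(record))
```

Emitting one JSON line per request like this gives you greppable logs now and a painless upgrade path to real tracing later.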

Prompt and version tracking is essential because small prompt changes can cause big behavior shifts. Store your system prompt and RAG prompt template in versioned files. Log the prompt version used per request. Do the same for your vector index: track the embedding model, chunk size/overlap, and the document corpus hash. When you rebuild the index, increment an index version and log it. This makes your assistant reproducible: you can rerun a demo later and explain differences when outputs change.

  • Practical checkpoint: run a grounded query (“What did I build in Project X?”), confirm the response includes citations like [S1], and verify logs show which chunks were used and how long each stage took.
  • Debugging workflow: if the answer is wrong, first inspect retrieval results; only after retrieval looks correct should you tune prompts or model parameters.

With observability in place, your assistant becomes an engineering artifact, not a black box. That is exactly the kind of maturity that signals readiness for AI-adjacent roles—especially in teams that care about reliability and auditability.

Chapter milestones
  • Ingest local docs and chunk text for retrieval
  • Implement embeddings and a simple vector store option
  • Compose prompts that cite retrieved context
  • Add guardrails: input constraints, content policies, and rate limiting
  • Checkpoint: assistant answers grounded questions with citations
Chapter quiz

1. Why does Chapter 5 add retrieval (RAG) to the assistant instead of relying on the model to "remember" information?

Correct answer: To ground answers in deterministically retrieved local documents and reduce confident mistakes
The chapter emphasizes retrieving facts from local docs rather than asking the model to guess, making demos more reliable.

2. Which pipeline best matches the chapter’s RAG goal from documents to an answer?

Correct answer: Ingest docs → chunk → embed → store vectors → retrieve top-k chunks → assemble context → prompt LLM with citations
Chapter 5 outlines a small, testable RAG pipeline that retrieves top-k chunks and prompts the model to answer with citations.

3. What is the primary purpose of requiring citations in the assistant’s responses?

Correct answer: To show where facts came from by tying claims to retrieved context
Citations make answers persuasive and auditable by pointing to the retrieved source context.

4. Which set of guardrails best reflects the chapter’s recommended approach for a robust local assistant?

Correct answer: Validate input, constrain retrieval size, enforce timeouts, implement retries, add safety filters, and rate limit
The chapter lists practical guardrails including input constraints, timeouts, retries, safety filters, and rate limiting.

5. What should the assistant do at the checkpoint when relevant context is missing from the retrieved documents?

Correct answer: Fail safely rather than guessing, since answers must be grounded in available context
A key outcome is that the assistant stays grounded and fails safely when it cannot retrieve supporting context.

Chapter 6: Ship It—Testing, Evaluation, and Career Packaging

You can build a local LLM assistant that “works on your machine” in an afternoon. Shipping something that other people can run, trust, and evaluate is a different skill—and it’s exactly the skill hiring managers look for in career transitioners. This chapter turns your Ollama + FastAPI project into a polished artifact: tested, measured, documented, and easy to demo.

The goal is not enterprise perfection. The goal is professional reliability: your API should fail predictably, your quality should be measurable, latency should be explainable, and the repo should tell a clear story. You’ll implement smoke tests and contract tests, create a small evaluation set for quality and latency, and prepare a demo script and screenshots for your portfolio. You’ll also write a deployment/usage guide so reviewers can run it in minutes—then package the whole thing as a checkpoint: a published project that supports your career transition.

Throughout this chapter, lean into engineering judgment. For local inference, variability is normal: model updates change outputs, different machines have different performance, and prompt changes can shift behavior dramatically. Your job is to make those changes visible and manageable with tests, evals, profiling, and clear documentation.

Practice note (this applies to every milestone in this chapter, from the smoke and contract tests through the final publishing checkpoint): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.


Sections in this chapter
Section 6.1: Testing strategy (unit, integration, and golden prompts)

Testing an LLM-powered API is partly traditional software testing and partly “behavior checking.” Start with a simple pyramid: unit tests for pure functions, integration tests for your FastAPI app, and a small set of golden prompts that act like snapshot tests for your assistant’s behavior.

Unit tests should cover anything deterministic: input validation, request parsing, safety filter rules, document chunking, and prompt-template rendering. Common mistake: skipping unit tests because “the model is nondeterministic.” Your glue code is deterministic—test it. Example: a function that clamps max_tokens, rejects empty messages, or enforces a context window should be unit tested with edge cases.

Integration tests should verify your API contract. Use FastAPI’s TestClient (or httpx) to call endpoints the way a real client would. Create smoke tests that run fast and answer “is the service alive?”: health endpoint returns 200, chat endpoint returns JSON with required fields, and streaming endpoints yield multiple chunks. Add contract tests that lock down request/response schemas and error shapes. Common mistake: returning different error formats from different code paths; reviewers hate this because it makes clients brittle.

  • Smoke test: GET /health returns {"status":"ok"} and includes model name/version.
  • Contract test: invalid input returns 422 with a consistent error schema; timeout returns 504 with a stable message.
  • Streaming test: server-sent events or chunked responses contain a delta field and conclude with a final “done” marker.

Golden prompts are curated prompts with expected properties. Don’t assert exact wording; assert invariants (must cite retrieved documents, must refuse unsafe request, must answer in bullets). Store them in tests/golden.json with the prompt, retrieval context (if any), and expected checks. When you update prompts or swap Ollama models, run golden prompts to catch regressions early.
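A golden-prompt runner along these lines keeps the idea concrete; the check names and the shape of the tests/golden.json entries are illustrative, not a fixed format:

```python
# Hypothetical golden-prompt checker: each case stores a prompt plus
# named property checks (invariants) rather than exact expected wording.
import re

CHECKS = {
    "mentions_citation": lambda text: bool(re.search(r"\[\d+\]|\(source:", text)),
    "uses_bullets": lambda text: text.lstrip().startswith(("-", "*", "•")),
    "refuses": lambda text: any(
        phrase in text.lower() for phrase in ("can't help", "cannot help")
    ),
}

def run_golden_case(case: dict, model_output: str) -> list:
    """Return the names of the checks that failed for one golden case."""
    return [name for name in case["checks"] if not CHECKS[name](model_output)]

# Example case shaped like one entry in tests/golden.json:
case = {"prompt": "Summarize the README", "checks": ["uses_bullets"]}
assert run_golden_case(case, "- point one\n- point two") == []
assert run_golden_case(case, "A long paragraph.") == ["uses_bullets"]
```

Because only invariants are asserted, swapping Ollama models changes wording without breaking the suite unless a real behavior regresses.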

Practical outcome: you can show a reviewer a one-command test run (pytest) that proves the API is predictable and the assistant’s key behaviors remain intact across changes.

Section 6.2: Lightweight evals—rubrics, expected answers, regressions

Tests tell you the system runs; evals tell you the system is good. Keep evals lightweight: 20–50 examples that represent your target use case (career-transition assistant, doc-grounded Q&A, or whatever theme you chose). Your evaluation set should include easy wins, ambiguous questions, and “gotchas” that previously broke the system.

Build a small dataset in a simple format (evals/cases.jsonl): input messages, optional documents to retrieve from, and an expected answer sketch. An expected answer sketch is not a full script; it’s a checklist. For example: “mentions 3 steps, warns about secrets, references README section X.” This avoids the common mistake of brittle exact-match evaluation, especially with local LLM variability.
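One line of such a file might look like this; the field names (`messages`, `documents`, `expected_sketch`) are assumptions about your schema, not a standard:

```python
# One illustrative line from evals/cases.jsonl. The "expected answer sketch"
# is a checklist of properties, not a transcript to match exactly.
import json

line = json.dumps({
    "messages": [{"role": "user", "content": "How do I run the project?"}],
    "documents": ["README.md#quickstart"],
    "expected_sketch": [
        "mentions docker compose up",
        "warns not to commit .env",
        "references the README quickstart",
    ],
})
case = json.loads(line)
assert len(case["expected_sketch"]) == 3
```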

Add a rubric with 3–5 criteria you can score quickly:

  • Correctness: aligns with provided docs; no hallucinated commands.
  • Grounding: cites or clearly uses retrieved snippets when RAG is enabled.
  • Safety: refuses disallowed requests; avoids leaking secrets.
  • Usefulness: gives actionable steps, not generic advice.
  • Format: follows required style (bullets, short paragraphs, etc.).

Measure regressions by tracking scores over time. When you change the prompt template, add a safety filter, or switch Ollama models, rerun the eval set and record deltas. A practical workflow is to store a baseline report in evals/reports/ and compare new runs in CI. You’re not chasing perfect scores; you’re demonstrating that you can detect and explain changes.
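A minimal scorer and regression check might look like this; the 0–2 scoring scale and the per-case report shape are assumptions layered on the rubric above:

```python
# Sketch: rubric totals plus a regression check between a baseline report
# (e.g. stored in evals/reports/) and a new run.
CRITERIA = ["correctness", "grounding", "safety", "usefulness", "format"]

def total(scores: dict) -> int:
    """Sum the 0-2 score for each rubric criterion."""
    return sum(scores[c] for c in CRITERIA)

def regressions(baseline: dict, current: dict, tolerance: int = 0) -> list:
    """Case IDs whose total score dropped by more than `tolerance`."""
    return [
        case_id for case_id in baseline
        if case_id in current
        and total(current[case_id]) < total(baseline[case_id]) - tolerance
    ]

baseline = {"case-1": dict.fromkeys(CRITERIA, 2)}              # perfect 10/10
current = {"case-1": {**dict.fromkeys(CRITERIA, 2), "grounding": 0}}  # 8/10
assert regressions(baseline, current) == ["case-1"]
```

A small `tolerance` keeps ordinary local-model variance from flagging every run; a sustained drop still surfaces.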

Include latency in the eval output: time-to-first-token, tokens/sec, and total duration. Common mistake: only measuring total time. For chat UX, time-to-first-token matters more than total completion.
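Time-to-first-token falls out of wrapping the chunk iterator; `fake_stream` below stands in for an Ollama streaming response, and in a real run you would wrap the actual iterator the same way:

```python
# Sketch: latency instrumentation around a streaming generation call.
import time

def measure_stream(chunks):
    start = time.perf_counter()
    first_token_at = None
    n_tokens = 0
    for _ in chunks:
        if first_token_at is None:
            first_token_at = time.perf_counter()  # first chunk arrived
        n_tokens += 1
    total = time.perf_counter() - start
    return {
        "time_to_first_token_s": (first_token_at - start) if first_token_at else None,
        "tokens_per_s": n_tokens / total if total > 0 else 0.0,
        "total_s": total,
    }

fake_stream = iter(["Hel", "lo", "!"])
metrics = measure_stream(fake_stream)
assert metrics["time_to_first_token_s"] <= metrics["total_s"]
```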

Practical outcome: you can show an “evaluation card” in your repo—what you measured, how you scored it, and what improved after iterations.

Section 6.3: Performance profiling and tuning (context, caching, batching)

Local inference performance is a product feature. Reviewers will try your demo and immediately feel whether it’s responsive. Profile first, then tune. Start by adding simple instrumentation around your FastAPI endpoints: log request size, retrieved chunk count, prompt token estimate, time-to-first-token, total duration, and error rates.

Context management is the fastest lever. Long prompts are expensive. Common mistake: blindly appending entire chat history and large retrieved passages. Instead, implement a policy: keep the last N turns, summarize older turns, and cap retrieved context by characters or estimated tokens. If you use RAG, retrieve fewer but higher-quality chunks (e.g., top 3) and include metadata (title/path) rather than extra text.
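That policy is a few lines of code. The ~4-characters-per-token estimate is a rough heuristic, not a real tokenizer, and the function names are illustrative:

```python
# Sketch: keep the last N turns and cap retrieved context by estimated tokens.
def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)  # crude heuristic; swap in a tokenizer if needed

def build_context(history: list, chunks: list,
                  max_turns: int = 6, max_context_tokens: int = 1000):
    recent = history[-max_turns:]          # drop (or summarize) older turns
    kept, budget = [], max_context_tokens
    for chunk in chunks:                   # chunks assumed ordered by retrieval score
        cost = estimate_tokens(chunk)
        if cost > budget:
            break                          # stop once the token budget is spent
        kept.append(chunk)
        budget -= cost
    return recent, kept

history = [{"role": "user", "content": f"turn {i}"} for i in range(10)]
recent, kept = build_context(history, ["short chunk", "x" * 8000])
assert len(recent) == 6 and kept == ["short chunk"]
```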

Caching improves repeat interactions. Cache at the right layer:

  • Cache document embeddings and chunk indexes on disk so restarts are fast.
  • Cache retrieval results for identical queries (short TTL) to speed demos.
  • Cache prompt-template rendering (a cheap win that also reduces noise).

Be careful about caching model outputs: unless the caching is explicit and visible, it can hide regressions. A safe compromise is caching only for demo endpoints, or adding a cache=false query parameter for evaluation runs.
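The first bullet above (embedding caching) can be sketched as a content-hash keyed file cache; the .cache/embeddings layout and function names are assumptions:

```python
# Sketch: on-disk embedding cache keyed by a content hash, so restarts
# don't re-embed unchanged documents.
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path(".cache/embeddings")

def cached_embedding(text: str, embed_fn):
    """Return the embedding for `text`, reusing a cached JSON file if present."""
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    key = hashlib.sha256(text.encode()).hexdigest()
    path = CACHE_DIR / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text())
    vector = embed_fn(text)
    path.write_text(json.dumps(vector))
    return vector

# Demo with a fake embedder; clear this key first so the run is repeatable.
demo_key = hashlib.sha256(b"hello").hexdigest()
(CACHE_DIR / f"{demo_key}.json").unlink(missing_ok=True)

calls = []
def fake_embed(text):
    calls.append(text)
    return [0.1, 0.2]

assert cached_embedding("hello", fake_embed) == [0.1, 0.2]
assert cached_embedding("hello", fake_embed) == [0.1, 0.2]
assert len(calls) == 1  # second call was served from disk
```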

Batching is useful if you evaluate many prompts. Instead of firing 50 separate requests, write an eval runner that submits requests sequentially with controlled concurrency (e.g., 2–4 workers) to avoid thrashing the CPU/GPU. Common mistake: maxing concurrency on a laptop and then concluding the model is “slow.” You saturated resources.
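Bounded concurrency is one `ThreadPoolExecutor` away; `ask` below is a stand-in for your real HTTP call:

```python
# Sketch: an eval runner with controlled concurrency (2-4 workers)
# instead of firing every request at once.
from concurrent.futures import ThreadPoolExecutor

def run_evals(prompts, ask, workers: int = 3):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves prompt order, so results line up with inputs.
        return list(pool.map(ask, prompts))

def ask(prompt: str) -> str:
    return f"answer to: {prompt}"  # replace with a real request to your API

results = run_evals([f"q{i}" for i in range(6)], ask)
assert results[0] == "answer to: q0" and len(results) == 6
```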

Tuning judgment: prefer small, explainable optimizations over mysterious tweaks. Your portfolio story should say, “I reduced median time-to-first-token from 1.8s to 0.7s by trimming context and limiting retrieval to 3 chunks,” not “I changed random settings until it felt better.”

Practical outcome: your repo includes a performance note with baseline metrics and specific changes tied to measurable improvements.

Section 6.4: Security basics—CORS, auth options, and secrets hygiene

Even a local-first assistant deserves security basics, because reviewers will judge your professional instincts. Focus on three areas: CORS, authentication options, and secrets hygiene. The goal is to show you know what matters and can implement sensible defaults.

CORS: if you have a frontend (even a simple HTML page), lock down allowed origins. Common mistake: allow_origins=["*"] forever. For local dev, allow http://localhost:5173 (or your port). In production-like demos, use an environment variable ALLOWED_ORIGINS and parse a comma-separated list.

Auth options: you don’t need a full OAuth setup for a portfolio project, but you should offer a minimal option. A practical pattern is an API key header (e.g., X-API-Key) validated by middleware. Document how to disable it for local demo (AUTH_MODE=none) and how to enable it (AUTH_MODE=api_key). Common mistake: shipping “security theater” (keys hardcoded in code) instead of real configuration.

Secrets hygiene: never commit keys, tokens, or private documents. Use .env.example and document required variables. Ensure your Docker image does not bake secrets at build time; pass them at runtime. If your RAG uses local files, add data/ to .gitignore and provide a small public sample dataset for reviewers.

Also include simple guardrails you built earlier—timeouts, retries, and input validation—and explain how they prevent abuse (e.g., huge payloads or prompt injection attempts in documents). Practical outcome: a reviewer sees responsible defaults and a clear security section in your guide.

Section 6.5: Portfolio packaging—README structure and project storytelling

A good portfolio project is a product: it explains itself, runs quickly, and shows evidence. Your README is the interface. Aim for “reviewer success in 10 minutes.” Include a demo script and screenshots so someone can evaluate without deep setup.

Use a predictable README structure:

  • What it does: one paragraph and a screenshot/GIF of the chat UI or curl output.
  • Architecture: a diagram or bullet list: FastAPI → Ollama → (optional) RAG index.
  • Quickstart: exact commands for Ollama model pull, environment variables, and docker compose up or uvicorn.
  • API: example requests for chat + streaming; document error formats.
  • Testing: how to run smoke/contract tests and golden prompts.
  • Evaluation: your small eval set, rubric, and latest report.
  • Deployment & usage guide: ports, volumes, model selection, and troubleshooting.
  • Limitations: honesty builds trust (hardware dependency, model variance, context limits).

Project storytelling matters for career transitions. Tie your engineering choices to user outcomes: “Local inference avoids API costs and supports offline use,” “Streaming improves perceived latency,” “RAG grounds answers in documents.” Common mistake: listing tools without rationale. Hiring managers want to see decision-making.

Prepare a demo script with 3–5 scenarios: one normal question, one doc-grounded question, one refusal/safety example, and one performance example (“watch time-to-first-token”). Capture screenshots of each and store them in docs/images. Practical outcome: your portfolio can be evaluated asynchronously and still land its message.

Section 6.6: Interview readiness—talking points and system design walkthrough

Your final checkpoint is not just publishing code; it’s being able to explain it under interview pressure. Practice a short system design walkthrough that starts from requirements and ends at trade-offs. Keep it crisp: problem, architecture, key endpoints, testing/evals, performance, and security.

Prepare talking points aligned to the course outcomes:

  • Why local LLM inference? Offline, cost control, privacy; trade-off is hardware variability and ops responsibility.
  • Model selection in Ollama: how you chose a model (quality vs speed), how you pinned versions, and how you manage prompt templates.
  • Streaming tokens: why it improves UX, how you implemented it, and how you test it.
  • Guardrails: input validation, timeouts, retries, safety filters; what failure looks like and how clients handle it.
  • RAG grounding: chunking strategy, retrieval limits, and how you prevent prompt injection via documents.

Walk through a request end-to-end: client sends chat message → FastAPI validates → optional retrieval fetches top chunks → prompt is assembled with system + user + context → Ollama generates tokens → streaming response → logging captures metrics. Then discuss where you would scale: move to a queue, add rate limiting, add observability, or separate retrieval into its own service.

Common mistake: sounding like you “followed a tutorial.” Replace that with evidence: show test outputs, eval reports, and performance numbers. When asked “what would you improve next?” reference your limitations section and propose the next iteration (better eval coverage, more robust auth, improved retrieval).

Practical outcome: you can present your project as a small but complete system—engineered, measured, and documented—ready to support your career transition into AI.

Chapter milestones
  • Create smoke tests and contract tests for the API
  • Build a small evaluation set and measure quality/latency
  • Prepare a demo script and screenshots for your portfolio
  • Write a deployment and usage guide for reviewers
  • Checkpoint: publish a polished project that supports your career transition
Chapter quiz

1. What is the main shift in focus Chapter 6 emphasizes compared to building a local LLM assistant that only works on your machine?

Correct answer: Turning the project into a polished artifact others can run, trust, and evaluate
The chapter focuses on professional reliability—tested, measured, documented, and easy to demo—so others can run and assess it.

2. Why does Chapter 6 include both smoke tests and contract tests for the API?

Correct answer: To confirm the API runs at all and that its behavior matches an expected interface/behavior contract
Smoke tests check basic functionality, while contract tests verify predictable API behavior and interface expectations.

3. What is the purpose of building a small evaluation set in this chapter?

Correct answer: To measure quality and latency so performance is visible and explainable
The eval set provides a measurable way to track output quality and latency rather than guessing.

4. How does the chapter suggest you handle natural variability in local inference (model updates, machine differences, prompt changes)?

Correct answer: Make changes visible and manageable using tests, evals, profiling, and clear documentation
Variability is expected; the professional approach is to surface and manage it with measurement and documentation.

5. Which set of deliverables best matches the “career packaging” goal of the chapter?

Correct answer: A demo script and screenshots, plus a deployment/usage guide so reviewers can run it in minutes
The chapter emphasizes making the project easy to demo and easy for reviewers to run quickly through clear portfolio materials and guides.