Career Transitions Into AI — Beginner
Go from AI-curious to shipping a local LLM assistant you can demo.
This course is a short, book-style lab for career switchers who want a practical, demonstrable AI project: a local LLM assistant you can run on your own machine. Instead of relying on paid hosted APIs, you’ll use Ollama for local model serving, Docker for repeatable environments, and FastAPI to expose a clean backend API with streaming chat responses. By the end, you’ll have a project that looks and feels like real work: clear API contracts, sensible guardrails, reproducible setup, and an evaluation approach you can explain in interviews.
The course is designed for beginners who can write basic Python and use the terminal, but who may be new to LLM app engineering. Each chapter builds on the previous one: you’ll start with a career-focused blueprint and a working baseline, then add model discipline, a proper API layer, containerization, retrieval grounding, and finally testing/evaluation plus portfolio packaging.
You’ll implement a local assistant service with: Ollama for local model serving, a FastAPI backend that streams chat responses, Docker for reproducible packaging, lightweight retrieval grounding (RAG), basic guardrails, and an evaluation approach you can explain in interviews.
Local inference forces you to think like an engineer: model selection, latency constraints, memory limits, and reproducible deployment. These are highly transferable skills for AI-adjacent roles such as AI engineer, ML platform associate, backend engineer for LLM features, or product-minded prototyper. You’ll also learn how to communicate tradeoffs (privacy vs performance, cost vs quality, RAG vs fine-tuning), which is exactly what interviewers probe for.
Each chapter ends in a tangible checkpoint. You’ll gradually evolve a basic local chat into a containerized assistant API with retrieval grounding and a portfolio-ready delivery. The goal is not just “it runs on my laptop,” but “a reviewer can reproduce it, understand it, and see engineering judgment.”
If you want a structured path from setup to a polished demo, start here and ship chapter by chapter. Register free to track your progress, or browse all courses to compare learning paths.
Senior Machine Learning Engineer, LLM Applications
Sofia Chen builds production LLM features for developer tools, focusing on reliable inference, evaluation, and API design. She has coached career switchers into AI roles by helping them ship small, defensible portfolio projects with strong engineering fundamentals.
This course is built around a single career lever: you will ship a local LLM assistant end-to-end (model runtime, API service, packaging, and guardrails) and present it as a portfolio narrative that makes sense to hiring managers. “Local” is not a gimmick—running inference on your own machine changes the engineering constraints, the risk profile, and the kinds of problems you can solve for employers (privacy, cost control, offline workflows, and predictable deployments).
In this chapter you will define your target role and a story for your portfolio, set up a development environment (Python, Git, Docker, and a clean project structure), and learn the fundamentals and tradeoffs of local inference. You’ll also create a baseline CLI chat to validate end-to-end inference before you touch FastAPI, streaming, or retrieval. That simple chat demo becomes your first publishable checkpoint: a clean repository that any reviewer can clone and run.
Throughout the course, keep a practical mindset: you are not “learning LLMs,” you are learning how to build reliable LLM-powered software. Your strongest signal will be judgment—choosing local vs hosted, selecting an appropriate model, building with repeatability, and documenting your decisions.
Practice note for Define your target role and portfolio narrative for a local LLM assistant: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Set up dev environment: Python, Git, Docker, and project structure: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Understand local LLM constraints: latency, memory, privacy, and tradeoffs: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Create a baseline CLI chat to validate end-to-end inference: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Checkpoint: publish a clean repo with a working local chat demo: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Hiring managers rarely hire for “LLM enthusiasm.” They hire for outcomes: a working system, clear problem framing, and evidence you can ship safely. For a career transition, your best strategy is to target a role family (e.g., backend engineer with AI, AI product engineer, ML/LLM engineer, platform engineer) and craft a portfolio narrative that matches the responsibilities of that role.
In this course, the portfolio artifact is a local LLM assistant with an API, containerization, and basic guardrails. That maps cleanly to real job requirements: integrating model inference into services, handling latency and failures, designing APIs, and documenting deployment. When a reviewer opens your repository, they should quickly understand: (1) what the assistant does, (2) how to run it locally, (3) why you chose local inference, and (4) what tradeoffs you accepted.
Common mistake: building an impressive demo that can’t be run without hours of troubleshooting. Another mistake: burying your intent. State your target role in the README (“Built as a local-first assistant to demonstrate inference + API + packaging”) and add a short “Design decisions” section explaining your local vs hosted choice, model selection, and constraints.
Choosing local inference versus a hosted API is not ideological; it’s a product and operations decision. Hosted LLM APIs (OpenAI, Anthropic, etc.) typically offer best-in-class quality, managed scaling, and fast iteration. Local inference (via Ollama in this course) offers privacy control, offline capability, predictable cost, and tighter control over deployment environments.
Use a simple decision framework based on five axes: output quality, latency, cost predictability, privacy and data control, and deployment control (offline capability and environment predictability). Score local and hosted options against your project’s actual needs on each axis rather than defaulting to either.
Practical workflow: start with local for development if your goal is learning and reproducibility, then keep an escape hatch to hosted for quality benchmarks. A good portfolio repository documents this explicitly (“Local-first. Optional hosted adapter later.”). Common mistake: picking a model that barely fits your machine and blaming “LLMs are slow.” Instead, treat the runtime constraint as part of the design: model size, context length, and concurrency should match your environment.
Local LLM constraints define your engineering reality: memory is the gate, latency is the tax, and privacy is the reward. You need basic sizing intuition to choose models responsibly in Ollama. The two most common limiting factors are RAM/VRAM capacity and sustained token generation speed (tokens/second).
As a rule of thumb, larger parameter models need more memory, but quantization reduces memory usage at some cost to quality. A smaller, well-chosen model running reliably is better than a large model that swaps memory or crashes. Expect these practical tradeoffs: larger models improve instruction-following but raise memory use and first-token latency; heavier quantization shrinks the memory footprint but can degrade output quality; and longer context windows make retrieval grounding easier later but increase both memory pressure and generation time.
Practical expectation setting for your demo: optimize for “works every time” rather than “largest model available.” Your baseline CLI chat (coming later in this chapter) is your measurement tool. Record simple metrics in your README: model name, quantization, approximate tokens/sec on your machine, and peak memory usage if you can. Common mistake: ignoring thermal throttling on laptops and then being surprised by inconsistent performance. When testing, keep conditions consistent (plugged in, similar background load).
Your goal is a repository that runs the same way for you, a reviewer, and future-you. That means a predictable project layout, pinned dependencies, and a development environment that anticipates the next chapters (FastAPI, streaming, Docker, and lightweight RAG). Start with Python, Git, and Docker installed, then scaffold a small but disciplined structure.
A practical starter layout looks like this:
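One reasonable sketch (directory and file names here are suggestions, not requirements) separates the CLI entry point, configuration, and docs from the start:

```
local-assistant/
├── README.md          # goal, setup, run steps, design decisions
├── .env.example       # safe config defaults (no secrets)
├── .gitignore
├── requirements.txt   # pinned dependencies
├── app/
│   ├── __init__.py
│   ├── cli_chat.py    # this chapter's baseline chat loop
│   └── config.py      # centralized settings read from the environment
└── tests/
```

Keeping configuration in one module pays off in later chapters, when FastAPI and Docker need the same values.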
Dependency management is where many transitioners stumble. Pick one method and stick to it (requirements.txt with pip-tools, Poetry, or uv). Pin versions for anything runtime-critical. For this chapter’s CLI chat, keep dependencies minimal: an HTTP client (or the official Ollama client if you use it), and a small CLI loop. Save the heavier web dependencies for later chapters so failures are easier to isolate.
Common mistake: mixing global Python packages with project packages and losing track of what’s installed. Use a virtual environment, commit a lockfile if your tool supports it, and document one canonical setup path in the README. Your future Docker container will thank you.
Before you build an API, validate that you can reliably “steer” the model. Assistant prompting is easiest to reason about when you separate messages by role: system (rules and identity), user (the request), and assistant (the model’s responses). This structure becomes the foundation for chat endpoints and later for RAG grounding.
In Ollama, you will choose a model and define a baseline prompt template. Your baseline CLI chat should do three things: (1) send a system message that defines the assistant behavior, (2) keep a short conversation history, and (3) print streamed tokens so you can observe latency and truncation. Keep the system message short and testable, for example: “You are a local-first coding assistant. Ask clarifying questions if requirements are missing. Do not invent files or commands that were not discussed.”
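A minimal sketch of that loop’s core pieces, assuming Ollama’s default local endpoint and its line-delimited JSON streaming format (the model tag is a placeholder for whatever you pulled):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # Ollama's default endpoint
MODEL = "llama3.1"  # placeholder: use the tag you actually pulled

SYSTEM_PROMPT = (
    "You are a local-first coding assistant. Ask clarifying questions "
    "if requirements are missing."
)

def build_messages(history, user_input):
    """Prepend the system rule, then prior turns, then the new user message."""
    return (
        [{"role": "system", "content": SYSTEM_PROMPT}]
        + history
        + [{"role": "user", "content": user_input}]
    )

def stream_chat(messages):
    """Yield response tokens from Ollama's streaming /api/chat endpoint."""
    payload = json.dumps({"model": MODEL, "messages": messages, "stream": True})
    req = urllib.request.Request(
        OLLAMA_URL,
        data=payload.encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        for line in resp:  # Ollama streams one JSON object per line
            chunk = json.loads(line)
            if not chunk.get("done"):
                yield chunk["message"]["content"]

# Usage (requires a running Ollama daemon):
#   history = []
#   reply = "".join(stream_chat(build_messages(history, "Hello!")))
#   history += [{"role": "user", "content": "Hello!"},
#               {"role": "assistant", "content": reply}]
```

Printing each yielded token as it arrives lets you observe first-token latency and truncation directly, which is exactly the measurement this chapter asks for.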
When testing prompts, change one variable at a time: model choice, temperature (if exposed), context length, or the system message. Record what changed and why. This is portfolio-grade practice because it mirrors real production debugging.
Your checkpoint for this chapter is to publish a clean repository with a working local chat demo. “Clean” is not aesthetic; it’s operational. A reviewer should be able to clone, run one or two commands, and see the assistant respond via Ollama locally. This is where README quality becomes a career skill.
Your README should include: a one-paragraph project goal, prerequisites (Ollama, Python, Docker if used), setup steps, a “Run the CLI chat” section, and a small troubleshooting section (model not found, port conflicts, slow performance). Add a short “Design decisions” section that explains why you chose local inference and what limitations exist on typical hardware.
Use .env patterns correctly: never commit secrets. Instead, commit .env.example with safe defaults (e.g., OLLAMA_HOST, model name, request timeout). Add .gitignore entries for .env, __pycache__, virtualenv folders, and local data directories. Keep configuration values centralized so later you can reuse them in FastAPI and Docker without duplication.
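A minimal .env.example might look like the following; the variable names beyond OLLAMA_HOST are illustrative, so match them to whatever your config module actually reads:

```
# Copy to .env and adjust locally; commit this file, never the real .env
OLLAMA_HOST=http://localhost:11434
MODEL_NAME=llama3.1
REQUEST_TIMEOUT_SECONDS=120
```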
Common mistake: shipping a demo that depends on your personal machine state (downloaded models, untracked files, or undocumented environment variables). Treat your repository as a product: if it cannot be reproduced, it does not count. This mindset is the backbone for the next chapters, where you will containerize the service and expose a streaming FastAPI backend.
1. Why does running an LLM locally meaningfully change the "engineering constraints" compared to using a hosted model?
2. What is the main portfolio signal the course aims to produce for hiring managers?
3. Why build a baseline CLI chat before adding FastAPI, streaming, or retrieval?
4. Which checkpoint best reflects the chapter’s definition of a "publishable" first milestone?
5. According to the chapter, what mindset and skill will be the strongest signal of competence throughout the course?
This chapter turns “it runs on my machine” into “it runs reliably, repeatably, and explainably.” You’ll go beyond pulling a model and asking questions: you’ll learn how Ollama manages models locally, how to choose between 2–3 candidates, how to build prompts that stay stable across turns, and how to tune sampling for speed without chaotic outputs. You’ll also write a tiny Python wrapper so your FastAPI backend can call Ollama consistently, and you’ll finish with a checkpoint: a short, reproducible runbook that documents your model choice and the exact commands you used.
Local inference is a product decision as much as a technical one. You trade cloud convenience for control: predictable costs, offline operation, and tighter data boundaries. But you also inherit responsibility for model selection, resource constraints, and operational guardrails (timeouts, retries, and safety filters). The goal here is to develop engineering judgment: know what to measure, what to standardize, and what failure modes to expect before you build an assistant people depend on.
The workflow in this chapter is deliberately iterative. First, install and run Ollama. Next, pull and compare a few models. Then, design a prompt template and test it with a mini test set (a handful of representative prompts and expected traits). After that, tune generation settings to stabilize behavior. Finally, wrap inference calls in a small client so your API layer can stream or respond in a predictable, debuggable way.
Practice note for Install and run Ollama; pull and compare 2-3 models: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Design a prompt template and test behaviors with a mini test set: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Tune generation settings for stability and speed: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Build a small Python wrapper client for Ollama requests: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Checkpoint: document model choice and reproducible commands: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Ollama is a local model runner that exposes a simple interface for downloading, storing, and serving LLMs on your machine. Conceptually, it sits between your application (CLI, Python, FastAPI) and a model runtime optimized for local inference. The key idea: you don’t “install” a model like a Python package; you pull a model artifact (weights + metadata), keep it in a local store, and then run inference requests against it. This lifecycle matters because repeatability depends on pinning model versions and controlling how they’re invoked.
Start by installing Ollama for your OS, then verify the daemon is running. Pull a model with a command like ollama pull llama3.1 (names vary by registry and version). Run it interactively with ollama run llama3.1 to validate that your GPU/CPU path is working and that token generation is fast enough for your use case. In practice, the “first token latency” is a big usability factor: if it’s too slow, you’ll need a smaller model, more quantization, or a different runtime setup.
Common mistake: treating model upgrades as “transparent.” Even minor model revisions can change formatting and compliance. For a career-transition portfolio project, document your lifecycle clearly: which model you chose, why, and the exact commands required to reproduce your environment.
Model choice is where many local projects succeed or fail. Bigger isn’t automatically better: a 70B model might be “smarter,” but if your machine can’t serve it with acceptable latency, your assistant becomes unusable. Start by comparing 2–3 models that fit your hardware constraints—often a small (e.g., ~7–8B), a medium (e.g., ~13–14B), and an alternative architecture or tuning. Pull each candidate and run the same mini test set to compare output quality and stability.
Evaluate three practical dimensions. First is size: larger models tend to follow instructions more reliably and hallucinate less on complex tasks, but they cost more memory and time. Second is context length: if you plan to do lightweight RAG later (grounding answers in local documents), context window becomes important because you’ll be injecting excerpts. Third is instruction tuning: choose instruction-tuned or chat-tuned variants for assistant behavior; base models may require more careful prompting and can drift into unhelpful completions.
Engineering judgment: choose the smallest model that reliably passes your mini test set. This keeps costs (compute, battery, thermal throttling) manageable and makes Dockerized local deployment more predictable later.
A prompt template is your assistant’s “contract.” It defines role, scope, tone, and constraints so the model behaves consistently across requests. Without a template, you’ll see drift: verbose answers when you want concise ones, refusal to follow formatting, or context confusion after a few turns. A solid template typically includes: system instructions (non-negotiable rules), developer instructions (task framing), user message, and optional context blocks (retrieved text, policies, or examples).
Conversation state is equally important. Chat-style inference works by providing prior messages back to the model each turn. Locally, you control exactly what gets sent—this is a feature and a risk. If you naively append everything forever, you’ll hit context limits and slow inference. If you trim too aggressively, the assistant forgets key constraints. A practical approach is to keep a short, structured memory: recent turns plus a running summary of important facts.
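The trimming idea can be sketched as a small helper, with the turn budget and summary wording as illustrative choices:

```python
def trim_history(history, max_turns=6, summary=None):
    """Keep system rules, an optional running summary, and only the most
    recent turns, so context stays bounded without losing key constraints."""
    system = [m for m in history if m["role"] == "system"]
    rest = [m for m in history if m["role"] != "system"]
    kept = rest[-max_turns:]  # most recent user/assistant messages
    summary_block = (
        [{"role": "system", "content": f"Conversation summary: {summary}"}]
        if summary
        else []
    )
    return system + summary_block + kept
```

The running summary itself can be produced by the model periodically; what matters is that trimming is a deliberate policy you can test, not an accident of context overflow.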
Build a mini test set (5–10 prompts) that probes your template’s behavior. Include at least: a request for structured formatting, a refusal-sensitive request (to see if it handles safety constraints), a long-input request (to test truncation), and a “follow-up question” that depends on prior context. Run this test set against each model and template revision. Treat prompt design like code: version it, document changes, and avoid ad-hoc edits that aren’t tested.
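A mini test set can be plain data plus cheap programmatic checks; the prompts and checks below are examples to adapt:

```python
# Each case pairs a prompt with cheap, automatable checks on the reply.
MINI_TEST_SET = [
    {
        "prompt": "List three debugging steps as a numbered list.",
        "checks": [lambda out: "1." in out and "3." in out],
    },
    {
        "prompt": "Reply with the single word OK.",
        "checks": [lambda out: len(out.split()) <= 3],
    },
]

def score(run_model, cases=MINI_TEST_SET):
    """Run each prompt through run_model (a callable: prompt -> text)
    and count how many cases pass all of their checks."""
    passed = 0
    for case in cases:
        out = run_model(case["prompt"])
        if all(check(out) for check in case["checks"]):
            passed += 1
    return passed, len(cases)
```

Because run_model is just a callable, the same scorer works for every model and template revision you compare.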
Sampling parameters are your main levers for stability, creativity, and speed. Many “model quality” complaints are actually parameter misconfiguration. If your assistant must be reliable (career guidance, planning, summarizing), you generally want lower randomness. If you’re brainstorming, you can increase randomness—but do it deliberately and test the results.
Temperature controls how deterministic the model is. Lower values (e.g., 0.1–0.3) reduce variance and make outputs more repeatable; higher values (e.g., 0.7–1.0) increase creativity but also increase the chance of format breaks and hallucinations. top_p (nucleus sampling) restricts choices to a probability mass; many teams keep temperature moderate and use top_p (e.g., 0.9) to avoid extreme tokens. max tokens (or max output) caps response length—critical for latency and for preventing the model from rambling when it’s unsure.
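One way to make these choices explicit is to keep named presets and select one per task; the specific values below are starting-point assumptions to tune against your own test set (num_predict is Ollama's option for capping output tokens):

```python
# Assumed preset values; tune them against your own mini test set.
SAMPLING_PRESETS = {
    # Repeatable answers for summaries, extraction, and planning.
    "stable": {"temperature": 0.2, "top_p": 0.9, "num_predict": 512},
    # Looser sampling for brainstorming; expect more format breaks.
    "creative": {"temperature": 0.8, "top_p": 0.95, "num_predict": 512},
}

def options_for(task):
    """Map a task style to an Ollama options dict."""
    return SAMPLING_PRESETS["creative" if task == "brainstorm" else "stable"]
```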
Common mistake: increasing temperature to “fix” bland responses when the real issue is prompt clarity. First tighten the instructions and examples; then adjust sampling. Record your final settings in your checkpoint runbook so teammates (or future you) can reproduce behavior exactly.
Once you connect a model to FastAPI, you’ll quickly want structured outputs: JSON objects you can validate, store, and render. The challenge is that LLMs are probabilistic text generators; they can produce trailing commentary, invalid quotes, or partial JSON when token limits hit. Your job is to design for “JSON-first” behavior and enforce it with validation and retries.
Start by specifying the schema in your prompt template. Be explicit: required keys, allowed values, and constraints (string length, enum choices). Ask for only JSON, no prose. Then implement a post-processing step that parses JSON and validates it (Pydantic is a good fit in Python). If parsing fails, retry once with a corrective message that includes the validation error and repeats the schema. Keep retries limited to avoid infinite loops and runaway latency.
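A stdlib-only sketch of the parse, validate, retry loop (the schema is hypothetical; in a real service a Pydantic model would replace the hand-rolled validate):

```python
import json

SCHEMA_KEYS = {"title": str, "priority": str}  # hypothetical extraction schema
ALLOWED_PRIORITY = {"low", "medium", "high"}

def validate(payload):
    """Return the parsed object, or raise ValueError with a message the
    model can act on in a corrective retry."""
    obj = json.loads(payload)  # JSONDecodeError is a ValueError subclass
    for key, typ in SCHEMA_KEYS.items():
        if key not in obj or not isinstance(obj[key], typ):
            raise ValueError(f"missing or invalid key: {key}")
    if obj["priority"] not in ALLOWED_PRIORITY:
        raise ValueError(f"priority must be one of {sorted(ALLOWED_PRIORITY)}")
    return obj

def extract(call_model, prompt, max_retries=1):
    """Ask for JSON only; on a validation failure, retry with the error text."""
    reply = call_model(prompt)
    for attempt in range(max_retries + 1):
        try:
            return validate(reply)
        except ValueError as err:
            if attempt == max_retries:  # bounded retries: no infinite loops
                raise RuntimeError("model never produced valid JSON") from err
            reply = call_model(
                f"{prompt}\nYour previous reply was invalid ({err}). "
                "Return only JSON matching the schema, with no prose."
            )
```

The key design point is that the validation error is fed back to the model verbatim, and the retry budget is fixed, so latency stays bounded.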
Practical outcome: you can now build endpoints like /chat (free-form) and /extract (strict JSON) with different templates and sampling presets. This separation keeps your system predictable and easier to debug.
When local inference fails, symptoms can look similar—timeouts, garbled outputs, or sudden slowness—but root causes differ. Build a debugging checklist and apply it systematically. First, confirm whether the issue is resource-related (RAM/VRAM exhaustion, CPU throttling), prompt-related (too long, conflicting instructions), or transport-related (client timeouts, streaming handling). Logging is essential: record model name/version, prompt length (tokens or characters), sampling settings, and latency for first token and full completion.
Write a small Python wrapper client for Ollama requests to standardize calls. This wrapper should: set timeouts, support retries with backoff for transient failures, and optionally stream tokens for responsive UIs. Even if your first FastAPI version is minimal, your wrapper becomes the single place to handle request formatting, error mapping, and observability fields (request id, duration, token counts if available).
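The retry-with-backoff core of such a wrapper can be sketched transport-agnostically, so it is testable without a running daemon; in production the injected transport would POST to Ollama's /api/chat:

```python
import time

class OllamaClient:
    """Thin wrapper that standardizes timeouts, retries, and observability.
    `transport` is any callable (payload dict -> response dict); it is
    injected here so the retry logic can be tested without a model server."""

    def __init__(self, transport, retries=2, backoff=0.5):
        self.transport = transport
        self.retries = retries
        self.backoff = backoff

    def chat(self, model, messages, options=None):
        payload = {
            "model": model,
            "messages": messages,
            "options": options or {},
            "stream": False,
        }
        last_err = None
        for attempt in range(self.retries + 1):
            start = time.monotonic()
            try:
                resp = self.transport(payload)
                resp["duration_s"] = time.monotonic() - start  # observability
                return resp
            except ConnectionError as err:  # transient: back off and retry
                last_err = err
                time.sleep(self.backoff * (2 ** attempt))
        raise RuntimeError(
            f"Ollama unreachable after {self.retries + 1} attempts"
        ) from last_err
```

Only transient transport failures are retried; validation or prompt errors should surface immediately rather than being masked by retries.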
Checkpoint: document your final model choice and the reproducible commands: installation notes, ollama pull lines, the exact model tag, your prompt template file path, and your generation settings. This “paper trail” is what turns an experiment into an engineering artifact you can confidently containerize and expose through a streaming FastAPI service in the next stage.
1. What is the main shift in mindset Chapter 2 aims to achieve when moving from basic local inference to a dependable assistant?
2. Why does the chapter have you pull and compare 2–3 models instead of committing to the first model that works?
3. What is the purpose of creating a prompt template and validating it with a mini test set?
4. According to the chapter, why tune generation settings during local inference development?
5. What is the practical reason for writing a small Python wrapper client for Ollama requests in this chapter?
In Chapter 2 you proved you can run a model locally with Ollama. In this chapter, you turn that capability into a small, reliable service that other tools (a CLI, a web UI, a teammate’s script) can call. The goal is not “a demo endpoint that works once,” but an API that behaves predictably: it validates inputs, documents its contract, streams tokens for a good user experience, and fails gracefully when the model is slow or unavailable.
Think like an engineer building a product surface area. A local LLM is a dependency that can be heavy, variable in latency, and occasionally error-prone. FastAPI gives you a clean way to wrap that dependency behind a consistent interface. The key is to decide what your API promises: what a request looks like, what responses look like (including errors), how streaming is delivered, and how conversation memory is stored and scoped. Those promises are your “contracts,” and they matter even when you are the only user—because future-you will integrate this service into other projects.
We’ll build up from a health-checked FastAPI service with configuration, then add a /chat endpoint that supports conversation context, then add streaming so the UI can render partial output immediately. Along the way you’ll define Pydantic models that appear automatically in OpenAPI docs, and you’ll learn where common mistakes happen: blocking calls in async routes, over-trusting user input, letting requests run forever, and returning inconsistent error shapes.
Practice note for Create FastAPI service with health checks and configuration: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Implement /chat endpoint with conversation memory: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Add streaming responses for better UX: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Define request/response models with Pydantic and OpenAPI docs: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Checkpoint: a documented API that others can call locally: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
A clean project layout pays off quickly because an LLM assistant API mixes concerns: HTTP routing, configuration, model calling, memory, and safety checks. A practical layout for this chapter looks like: app/main.py (FastAPI app), app/api/chat.py (routes), app/core/config.py (settings), app/services/ollama_client.py (model calls), and app/services/memory.py (conversation storage). Keeping these separated lets you test and replace parts without rewriting your endpoints.
Use dependency injection (DI) to pass shared resources into routes. In FastAPI, DI is usually done with Depends(). For example, you can inject a Settings object and an OllamaClient into your /chat handler. DI prevents “global variable soup” and makes it obvious what an endpoint needs to run. It also makes it easier to swap implementations later (for example, replacing in-memory conversation storage with Redis).
Start with operational basics: a /healthz endpoint that returns a simple JSON payload, plus a /readyz endpoint that can optionally check whether Ollama is reachable. Health checks are not just for Kubernetes; they help you debug quickly when something fails. A common mistake is to do expensive checks on every health request. Instead, keep /healthz cheap, and reserve deeper checks for /readyz or an admin-only path.
Create the app with FastAPI(title=..., version=...) and include routers under a prefix like /v1.

The practical outcome of this section is a service skeleton you can run locally with uvicorn app.main:app --reload and confidently tell whether it’s up, whether it’s ready, and which pieces are responsible for what.
Your assistant is only as stable as its inputs. Without validation, clients can send malformed JSON, absurdly long prompts, or unexpected roles that break your logic. Pydantic models are your first guardrail: they define the request/response contract and generate OpenAPI docs automatically. For a chat-style endpoint, a typical request model includes a messages list, each with role and content, plus optional fields like conversation_id, temperature, and max_tokens.
Validation is not only about types; it’s about constraints that reflect engineering judgment. For example, enforce min_length for message content (avoid empty messages), cap max_length (protect memory and latency), and restrict roles to a known set (system, user, assistant). Consider adding a rule that only the first message may be system, or that a request must include at least one user message.
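These constraints might look like the following, assuming Pydantic v2; the exact limits (8,000 characters, 50 messages) are illustrative defaults, not recommendations:

```python
from typing import Literal, Optional

from pydantic import BaseModel, Field, model_validator


class ChatMessage(BaseModel):
    role: Literal["system", "user", "assistant"]
    # min_length rejects empty messages; max_length protects memory and latency.
    content: str = Field(min_length=1, max_length=8_000)


class ChatRequest(BaseModel):
    messages: list[ChatMessage] = Field(min_length=1, max_length=50)
    conversation_id: Optional[str] = None
    temperature: float = Field(default=0.7, ge=0.0, le=2.0)
    max_tokens: int = Field(default=512, ge=1, le=4_096)

    @model_validator(mode="after")
    def check_roles(self) -> "ChatRequest":
        # Cross-field rule: only the first message may be a system prompt.
        if any(m.role == "system" for m in self.messages[1:]):
            raise ValueError("only the first message may have role 'system'")
        # Cross-field rule: at least one user message must be present.
        if not any(m.role == "user" for m in self.messages):
            raise ValueError("request must include at least one user message")
        return self
```

FastAPI turns these validation failures into HTTP 422 responses automatically, which gives you the explicit, readable errors described above.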
It’s also the right place to implement lightweight “prompt safety” rules. You are not building a full policy engine, but you can block obviously dangerous payloads (e.g., attempts to exfiltrate secrets from server prompts) or strip control characters. Another common mistake is to silently modify user input in ways that surprise clients. Prefer explicit validation errors (HTTP 422) with clear messages so callers can fix their requests.
Key deliverables for this section:

- Request models: ChatRequest and ChatMessage.
- Response model: ChatResponse with assistant_message, conversation_id, and basic usage/latency metrics.
- Constrained types (constr, conint) and custom validators for cross-field rules.

The practical outcome is an API that “fails fast” with readable errors, protects your service from runaway payloads, and makes your assistant’s behavior easier to reason about—especially when you later add retrieval-augmented generation (RAG) or tool calling.
Ollama exposes a local HTTP API, so your FastAPI service is essentially an HTTP-to-HTTP adapter with added contracts, memory, and safety. The main decision is whether your route handlers will be synchronous or asynchronous—and whether the underlying HTTP client is blocking. If you write async def routes but call Ollama using a blocking library, you may stall the event loop and degrade concurrency. This is a classic mistake when adding streaming later.
A practical approach is to implement an OllamaClient service with both sync and async methods. For sync, requests is straightforward and fine for low concurrency. For async, use httpx.AsyncClient so multiple requests can overlap while waiting on tokens. Keep the Ollama URL, model name, and timeout in configuration so you can swap models (e.g., llama3 vs. mistral) without code changes.
Conversation memory usually means “send prior messages back to the model.” You can store a short history per conversation_id in-memory for this chapter (a dictionary plus timestamps). Engineering judgment matters here: cap how many turns you keep (e.g., last 10 messages) and cap total characters. Otherwise, long chats will become slower and more expensive. Also decide whether the server generates conversation_id when missing, and whether clients can reset memory.
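One way to sketch capped in-memory storage (the class name, cap values, and message shape are assumptions):

```python
import time


class ConversationMemory:
    """In-memory conversation store with caps on turns and total characters."""

    def __init__(self, max_messages: int = 10, max_chars: int = 8_000) -> None:
        self.max_messages = max_messages
        self.max_chars = max_chars
        self._store: dict[str, list[dict]] = {}
        self._touched: dict[str, float] = {}

    def append(self, conversation_id: str, message: dict) -> None:
        history = self._store.setdefault(conversation_id, [])
        history.append(message)
        self._touched[conversation_id] = time.time()
        # Cap turns first (drop oldest), then total characters.
        del history[:-self.max_messages]
        while (sum(len(m["content"]) for m in history) > self.max_chars
               and len(history) > 1):
            history.pop(0)

    def get(self, conversation_id: str) -> list[dict]:
        return list(self._store.get(conversation_id, []))
```

The timestamps make it easy to add expiry later; swapping this class for a Redis-backed implementation changes nothing about your routes if you inject it via Depends.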
The practical outcome is a working /chat endpoint that responds predictably to a known request shape, while giving you enough flexibility to tune performance and model choice as you iterate.
Streaming is a user-experience upgrade that also changes how you think about HTTP responses. Instead of waiting 5–30 seconds for a full completion, clients can render tokens as they arrive. This is especially important for local inference, where latency can vary based on CPU/GPU load. In FastAPI, the simplest approach is server-sent events (SSE) using text/event-stream with a StreamingResponse.
There are two common token strategies. First, “true token streaming,” where you pass stream=true to Ollama and forward each chunk as it arrives. Second, “simulated streaming,” where you request the full response and then yield it in small pieces. True streaming is preferred: it reduces time-to-first-token and supports cancellation. Simulated streaming can be acceptable for prototypes but wastes latency and memory.
Design your stream protocol deliberately. SSE typically emits events like event: token with JSON payloads containing delta text, and a final event: done containing the full assembled message and metadata. A common mistake is to stream plain text without framing; clients then struggle to detect completion or parse errors. Another mistake is to forget to flush or to buffer too much, which makes “streaming” feel like a delayed batch response.
The practical outcome is a streaming /chat/stream (or a query flag like ?stream=true) that feels responsive, is easy to consume from a UI, and is robust under variable local inference speed.
Local LLM services fail in predictable ways: Ollama might not be running, the model might not be pulled, generation can be slow, and the machine can run out of memory. Your API should treat these as normal conditions, not surprises. Start with explicit timeouts on outbound Ollama calls. Without them, a single request can hang until the client gives up, tying up server resources. Pair timeouts with clear HTTP status codes and consistent error shapes.
Define a small error schema, for example: {"error": {"code": "UPSTREAM_TIMEOUT", "message": "...", "retryable": true}}. Use it everywhere: validation errors, upstream failures, and internal exceptions. When you stream, errors need special care: you may already have sent partial data. In SSE, you can emit an event: error before closing the stream, so clients can show a useful message instead of a silent failure.
Retries are a judgment call. Retrying a long generation often makes things worse, but retrying a transient network error to the local Ollama process can help. If you implement retries, keep them conservative (e.g., 1 retry, short backoff) and never retry non-idempotent operations without thinking. Also implement graceful degradation paths: if streaming fails mid-way, you might fall back to returning the partial text you already received with a partial=true flag, rather than discarding everything.
The practical outcome is an assistant API that feels dependable. Even when something goes wrong, callers get predictable signals about what happened and what to do next.
An internal API is only useful if other people (or other programs) can call it without reading your source code. FastAPI’s automatic OpenAPI docs are a major advantage—if you feed them good models and examples. Add example payloads directly in your Pydantic schemas and route decorators so the interactive docs show realistic chat requests and streaming usage. Document what conversation_id means, whether memory persists across restarts, and what limits you enforce (message length, number of turns, timeout).
Provide two primary endpoints: a non-streaming POST /v1/chat for simple clients, and a streaming variant (either POST /v1/chat/stream or POST /v1/chat?stream=true). Include /healthz and /readyz so local callers can programmatically detect availability. This checkpoint matters for career transition projects: you are demonstrating that you can build “callable” services with contracts, not just notebooks.
Include copy-paste recipes. A minimal curl example for non-streaming might post JSON with messages, while streaming might use -N to disable buffering and show events in real time. Also be explicit about content types: application/json for regular requests and text/event-stream for SSE responses. A common mistake is to ship an API that works in Swagger UI but lacks real-world examples for terminals, Python scripts, or front-end fetch calls.
Version your API under /v1 to keep freedom to change later.

The practical outcome is your chapter checkpoint: a documented local assistant API that others can call, confidently integrate, and debug—with clear contracts for chat, streaming, and operational health.
1. Why does Chapter 3 emphasize building an API that "behaves predictably" rather than a one-off demo endpoint?
2. In the chapter’s framing, what are the API’s "contracts"?
3. What is the primary user-experience reason for adding streaming responses to the /chat endpoint?
4. How do Pydantic models help achieve the chapter’s goal of a reliable assistant API?
5. Which situation best matches the chapter’s warning about common mistakes in FastAPI LLM services?
In Chapter 3 you proved your assistant works on your machine. In this chapter you make it work on any machine—reliably, repeatedly, and with one command. Docker is the lever that turns “it runs on my laptop” into a portable, shareable, team-friendly system. For career transitions, this is not a nice-to-have: employers want reproducible environments, clear configuration boundaries, and deployable artifacts.
The stack you’re containerizing has two moving parts: a FastAPI service that exposes your chat endpoints and streams tokens, and an Ollama runtime that hosts the model. Docker lets each component have its own filesystem, dependencies, and lifecycle, while docker compose ties them together with a shared network and persistent volumes.
As you build this, keep an engineering mindset: choose defaults that minimize surprises (pinned versions, explicit ports, explicit volumes), avoid storing secrets in images, and structure compose files so development is fast while production is predictable. By the end, you’ll run docker compose up and get a working assistant stack—API, model runtime, and persistence—without manual setup steps.
Practice note for Write a Dockerfile for the FastAPI service: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Create docker-compose for FastAPI + Ollama with volumes: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Add environment configuration and secrets handling patterns: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Optimize container startup and local dev workflows: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Checkpoint: run the full assistant stack with docker compose up: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
AI applications amplify normal Docker concerns because models are large, startup can be slow, and state matters. Three Docker primitives drive most of your design: images, volumes, and networks.
Images are immutable templates. Your FastAPI image should contain only your app code and Python dependencies. Avoid baking model files into the image; it makes builds huge and rebuilds painful. Treat the image as “the code + runtime,” not “the data.” Pin base images (for example, python:3.11-slim) so rebuilds don’t drift.
Volumes are essential for local LLM work. Ollama stores downloaded models and runtime state under its data directory (commonly /root/.ollama inside the container). Without a volume, every container recreation forces a model re-download, which is slow and often mistaken for “the stack is broken.” In compose, bind a named volume to that directory so models persist across restarts. Similarly, if you add lightweight RAG later, mount your local documents directory read-only into the API container (or a separate indexer container) instead of copying documents into the image.
Networks are how containers talk. Compose automatically creates a private network and registers services by name, so your FastAPI container can reach Ollama at http://ollama:11434. The common mistake is pointing your app to localhost. Inside a container, localhost refers to that container, not your host and not other services. Use the compose service name as the hostname.
With these basics, you’re ready to containerize the API cleanly and keep the model runtime stateful without turning your application image into a giant artifact.
Your FastAPI container should build fast, start fast, and expose a predictable port. A straightforward Dockerfile is usually best for this course: install dependencies, copy code, and run Uvicorn. Later, you can choose Gunicorn for multi-worker production deployments, but be careful with streaming token endpoints.
A practical Dockerfile pattern looks like this:
- Start from python:3.11-slim (or similar) as the base.
- Set a WORKDIR (for example, /app).
- Copy dependency manifests first (pyproject.toml/poetry.lock or requirements.txt) to maximize build caching.
- Install dependencies, then copy the application code.
- Run the server with uvicorn app.main:app --host 0.0.0.0 --port 8000.

For token streaming, Uvicorn is often the simplest and most predictable in containers. If you adopt Gunicorn, use the Uvicorn worker class (for example, -k uvicorn.workers.UvicornWorker) and test streaming carefully. Multi-worker setups can introduce subtle issues: streaming responses may behave differently across workers, and you may need sticky sessions if you add websockets or stateful connections.
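That pattern might look like the following Dockerfile sketch; the requirements.txt name and the app/ package layout are assumptions to adapt to your project:

```dockerfile
# Pin the base image so rebuilds don't drift
FROM python:3.11-slim

WORKDIR /app

# Copy dependency manifests first to maximize layer caching
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application code last (it changes most often)
COPY app ./app

EXPOSE 8000

# Bind to 0.0.0.0 so the service is reachable from outside the container
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]
```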
Two common mistakes: (1) binding to 127.0.0.1 inside the container, which makes the service unreachable from outside; always bind to 0.0.0.0. (2) forgetting to expose or map the port in compose, then assuming the server didn’t start.
Configuration belongs in environment variables, not hard-coded constants. In this chapter you’ll pass OLLAMA_BASE_URL, request timeouts, and safety/validation settings via compose. That keeps the image reusable in dev, testing, and production-like runs without rebuilding.
Practical outcome: you can build your FastAPI image once and run it anywhere, and you’ve made an intentional server choice (Uvicorn now; Gunicorn later when you need more concurrency).
Ollama is the model runtime in your stack. When running it in a container, you care about three things: how the API is reached, where models are stored, and how to manage model downloads without slowing down your workflow.
By default, Ollama listens on port 11434. In compose, you’ll typically publish it to the host (for example, 11434:11434) so you can test it directly with curl from your machine. Even if you don’t publish it, your FastAPI service can still reach it internally using the compose network. Publishing is mainly for developer convenience.
Persistence is non-negotiable: mount a named volume to Ollama’s data directory so model pulls survive container recreation. Without that volume, every docker compose up after a cleanup triggers a fresh download. This is one of the most frequent “why is this taking forever?” moments for first-time local LLM users.
Model management should be intentional. Decide which model(s) you need for your assistant’s role (smaller for speed, larger for quality), and pull them once. A practical pattern is to keep the model name in an environment variable like OLLAMA_MODEL so switching models doesn’t require code changes. You can also separate “startup” from “model warmup”: start the containers first, then run a one-time ollama pull step against the running service to populate the volume.
Common mistake: pointing the API at http://localhost:11434 from inside the container. In compose, use http://ollama:11434.

Practical outcome: you’ll have a stable Ollama container whose model cache persists, with ports and URLs that behave consistently across restarts and across machines.
docker-compose.yml (or compose.yml) is where the stack becomes one command. You define two services—api and ollama—attach volumes, map ports, and inject configuration. Compose also gives you a shared DNS-based network so services can find each other by name.
A well-structured compose file does more than start containers; it expresses operational intent. Add a health check for Ollama so the API doesn’t hammer it during startup. Health checks are especially useful because model runtimes can be “running” but not yet ready to respond. A simple health check can call an Ollama endpoint (for example, a tags/list endpoint) and retry until it succeeds.
Then, use depends_on with a health condition (supported in modern Compose implementations) to delay API startup until Ollama is healthy. Even with that, your API should still implement timeouts and retries when calling Ollama—health checks reduce failures, they don’t eliminate them.
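A compose sketch wiring this together; the service names, the llama3 default, and the `ollama list` healthcheck command are assumptions to adapt (and you would pin the Ollama image tag rather than use latest):

```yaml
services:
  api:
    build: .
    ports:
      - "8000:8000"
    environment:
      # Service name, not localhost: containers resolve each other by name
      OLLAMA_BASE_URL: "http://ollama:11434"
      OLLAMA_MODEL: "llama3"
    depends_on:
      ollama:
        condition: service_healthy

  ollama:
    image: ollama/ollama:latest   # pin a specific tag for reproducible runs
    ports:
      - "11434:11434"             # published for developer convenience
    volumes:
      - ollama_models:/root/.ollama
    healthcheck:
      # The CLI answers once the runtime is actually ready to serve
      test: ["CMD", "ollama", "list"]
      interval: 5s
      timeout: 3s
      retries: 20

volumes:
  ollama_models:
```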
Mount volumes explicitly:
- A named volume for Ollama’s model cache (its data directory, commonly /root/.ollama) so pulls persist.
- A read-only bind mount for local documents (for example, ./docs:/docs:ro), which keeps grounding material outside the image.

Practical outcome: “run the full assistant stack” becomes docker compose up, and you gain resilience against race conditions where the API starts before the model runtime is actually ready.
One compose file can support both a rapid local development loop and a production-like run, but only if you separate concerns. Compose profiles are a clean way to do this: define a dev profile that enables hot reload and bind mounts, and a default (or prod) profile that runs immutable containers with conservative settings.
In development, prioritize feedback speed:
- Bind-mount your source code and run Uvicorn with --reload so code changes take effect immediately.
- Publish ports (API on 8000, Ollama on 11434) so you can test with local tools.
- Use a .env file for non-sensitive config (model name, log level), but keep secrets out of Git.

In production-like mode, prioritize predictability and safety: run immutable images (no bind mounts, no --reload) with pinned versions and conservative settings.
For configuration and secrets handling, adopt patterns you can explain in an interview: environment variables for runtime config, .env for local developer convenience, and Docker secrets (or your platform’s secret manager) for real credentials. Even if your assistant is fully local today, practicing “secrets hygiene” now prevents bad habits later.
Practical outcome: you can switch between fast iteration and a stable deployment posture with a single flag, without editing YAML each time.
When Dockerizing AI workloads, failures cluster around three areas: filesystem permissions, port confusion, and resource limits. Knowing the patterns saves hours.
Permissions: If Ollama can’t write to its model directory, you’ll see repeated download failures or corrupted caches. This often happens with bind mounts on Linux where host directory ownership doesn’t match the container user. Prefer named volumes for Ollama storage because Docker manages permissions more predictably. If you must use a bind mount, verify ownership and consider running the container with a user that can write to the mount point.
Ports and hostnames: If your API can’t reach Ollama, check the URL from the API container’s perspective. Inside compose, the correct host is usually the service name (ollama), not localhost. If you can curl http://ollama:11434 from inside the API container but not from your host, you likely forgot to publish the port. If you can reach it from the host but not from the API container, you likely misconfigured the base URL.
Resource limits: Local inference is heavy. If the model loads slowly or requests time out, check CPU/RAM availability. Containers don’t magically create compute; they compete for the same host resources. On Docker Desktop, ensure the VM has enough memory allocated. Also watch for timeouts: your FastAPI client calls to Ollama should set a realistic read timeout for streaming responses. Too short looks like “random failures,” especially on first token generation when the model is cold.
Debugging workflow: exec into the API container and probe Ollama with curl, then validate volume mounts with docker volume inspect and directory listings.

Checkpoint outcome: you can bring the entire assistant up with docker compose up, verify the API endpoint responds, and confirm Ollama’s models persist across restarts—making your local LLM assistant a reproducible artifact instead of a fragile setup.
1. What is the main reason Chapter 4 introduces Docker for the Ollama + FastAPI assistant stack?
2. Why does the chapter recommend containerizing FastAPI and Ollama as separate services rather than bundling them into one container?
3. In the chapter’s recommended compose setup, what is the purpose of using persistent volumes?
4. Which configuration practice best matches the chapter’s guidance on minimizing surprises and keeping deployments predictable?
5. Which approach aligns with the chapter’s recommendation for handling secrets when Dockerizing the stack?
By now you have a local LLM running via Ollama and a FastAPI service that can stream chat responses. That is a strong prototype, but it still has a common portfolio problem: it can sound confident while being wrong, and it can’t reliably answer questions about your data (resume, project docs, internal notes, PDFs, readme files). This chapter upgrades your assistant into something you can responsibly demo: it will ground answers in local documents using lightweight Retrieval-Augmented Generation (RAG), cite the context it used, and apply practical guardrails (input limits, content policy checks, timeouts, retries, and rate limiting).
The engineering mindset here is simple: don’t ask the model to “remember” or “guess” what you can retrieve deterministically. For a career-transition portfolio, grounded answers with citations are more persuasive than a clever prompt. At the same time, local deployments have their own constraints: CPU-only environments, limited RAM, and the need for predictable latency. We’ll keep the pipeline small, testable, and transparent.
As you implement this, focus on two practical outcomes recruiters can understand: (1) “it cites where facts came from,” and (2) “it’s robust under messy user input.”
Practice note for Ingest local docs and chunk text for retrieval: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Implement embeddings and a simple vector store option: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Compose prompts that cite retrieved context: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Add guardrails: input constraints, content policies, and rate limiting: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Checkpoint: assistant answers grounded questions with citations: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
RAG and fine-tuning solve different problems. Fine-tuning changes model behavior (style, format, domain language) by updating weights. RAG keeps the model fixed and changes what it sees at inference time by attaching relevant documents to the prompt. For most portfolio assistants that answer questions about personal projects, company policies, or internal notes, RAG is the default because the knowledge changes often and you want answers that can be traced to sources.
Use RAG when: your facts live in documents; you need citations; you expect the content to change weekly; you want to keep everything local; and you want a clear “I don’t know” path when retrieval fails. RAG also reduces hallucinations because you can instruct the model to only answer from retrieved context and to quote or cite chunk IDs.
Use fine-tuning (or lightweight alternatives like prompt templates and system instructions) when: you need consistent formatting (JSON schemas, ticket templates), specialized tone, or tool-using behavior; and the change is stable across many tasks. Fine-tuning is not an efficient way to “store” your documents—especially if you need to update them frequently. It also makes it harder to prove where an answer came from, which matters when you demo reliability.
Engineering judgment: start with RAG, then add small behavioral shaping. For example, keep a stable system prompt that defines policies (cite sources, refuse unsafe requests, admit uncertainty) and rely on RAG for facts. A common mistake is trying to solve accuracy by “prompting harder” without retrieval. Another mistake is overstuffing the prompt with too many chunks; the model becomes slow and confused, and citations become meaningless.
For this course, your portfolio-ready story is: “I built a local assistant that retrieves from my project docs and responds with citations; it’s safer and more reliable than a pure chat bot.”
Ingestion is the unglamorous work that determines whether RAG feels magical or broken. Your pipeline should take a folder of local documents and produce normalized text plus metadata (source path, title, page number, headings). Start by supporting the formats you can parse reliably: .txt, .md, and .pdf (via a PDF text extractor). If PDF extraction is noisy, consider exporting to markdown first for your portfolio demo; clean inputs lead to cleaner retrieval.
Chunking is the next critical decision. Retrieval works best when each chunk is small enough to be specific, but large enough to contain the answer. A practical baseline is 400–800 tokens per chunk with 10–20% overlap. Overlap preserves continuity for concepts that span chunk boundaries. If you chunk too small (e.g., single sentences), retrieval becomes brittle and you lose context. If you chunk too large (e.g., entire documents), you retrieve irrelevant material and waste the context window.
Prefer structure-aware chunking: split markdown by headings, then split long sections by paragraph boundaries. Keep metadata that helps citations: {doc_id, source, heading, chunk_id, start_char, end_char}. That metadata becomes your citation mechanism and your debugging tool when users ask, “Where did that come from?”
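A paragraph-aware baseline chunker, using characters as a cheap proxy for tokens (the defaults and the character heuristic are assumptions; it emits the start_char/end_char metadata described above):

```python
def chunk_text(text: str, max_chars: int = 2000, overlap: int = 200) -> list[dict]:
    """Split text into overlapping chunks, breaking at paragraph boundaries
    when one falls inside the window."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            # Prefer a paragraph boundary (blank line) inside the window.
            cut = text.rfind("\n\n", start, end)
            if cut > start:
                end = cut
        chunks.append({
            "chunk_id": len(chunks),
            "start_char": start,
            "end_char": end,
            "text": text[start:end],
        })
        if end == len(text):
            break
        # Overlap preserves continuity across chunk boundaries.
        start = max(end - overlap, start + 1)
    return chunks
```

For markdown, you would first split on headings and run this only on long sections, carrying the heading into each chunk's metadata for citations.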
Keep ingestion idempotent. If the docs folder hasn’t changed, don’t rebuild everything. For a local demo, a simple approach is to compute a hash of each file and re-embed only changed files. This improves iteration speed and makes your system feel like a product rather than a script.
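The hash-based skip could be sketched like this (the manifest filename and supported extensions are assumptions):

```python
import hashlib
import json
from pathlib import Path


def files_to_reembed(docs_dir: str, manifest_path: str) -> list[Path]:
    """Return only files whose content hash changed since the last run."""
    manifest_file = Path(manifest_path)
    old = json.loads(manifest_file.read_text()) if manifest_file.exists() else {}
    new, changed = {}, []
    for path in sorted(Path(docs_dir).rglob("*")):
        if not path.is_file() or path.suffix not in {".txt", ".md", ".pdf"}:
            continue
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        new[str(path)] = digest
        if old.get(str(path)) != digest:
            changed.append(path)
    # Persist the new manifest so the next run skips unchanged files.
    manifest_file.write_text(json.dumps(new, indent=2))
    return changed
```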
Embeddings convert each chunk into a numeric vector so you can do similarity search. In hosted systems you might call an embeddings API, but for this course the point is local, repeatable deployment. You have three practical options: (1) run a local embedding model, (2) reuse an Ollama-provided embedding endpoint if available in your setup, or (3) choose a lightweight CPU-friendly model via a Python library.
For local-first portfolios, CPU-friendly sentence embedding models are often sufficient. The tradeoff is quality vs speed: smaller models embed faster and use less RAM, but may retrieve less accurately on nuanced queries. If you can afford it, run embeddings on GPU; if not, optimize the rest of the pipeline (good chunking, clean text, sane top-k) to compensate.
Vector store choices can remain simple. A “vector store” is just: vectors + metadata + a way to search. For lightweight options, you can use an in-memory list of vectors scanned with cosine similarity, a small index persisted to disk with JSON or SQLite, or an embedded vector library (for example, FAISS or Chroma) once linear scan becomes too slow.
Engineering judgment: keep it boring until you need performance. In a portfolio context, correctness and debuggability beat “enterprise architecture.” Make sure embeddings are deterministic: same text → same vector. Version your embedding model name and chunking parameters; changing either invalidates your existing index and can silently degrade retrieval.
A common mistake is mixing embeddings from different models in one index. Another is embedding raw PDFs with headers, footers, and page numbers intact; retrieval then matches on repeated boilerplate instead of content. Clean inputs and consistent model/version tracking are your best friends.
The retrieval pipeline is the “glue” that turns a user question into grounded context for the LLM. The basic sequence is: (1) embed the user query, (2) similarity-search your vector store, (3) select top-k chunks, (4) assemble a context block with citations, and (5) prompt the LLM to answer using only that context.
Top-k selection is not arbitrary. Start with k=4 to 8 and cap the total context to a token budget (for example, 1,500–2,500 tokens depending on your model’s context window and your desired latency). If you always include too much context, the model may “average” across chunks and produce vague answers. If you include too little, it may miss a key detail and hallucinate. A practical technique is to retrieve more (say k=12) then re-rank or filter by simple heuristics (e.g., drop chunks with similarity below a threshold).
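The retrieve-more-then-filter technique can be sketched like this (the threshold, budget, and the ~4-characters-per-token estimate are illustrative assumptions):

```python
def select_context(scored_chunks, k=12, min_score=0.25, token_budget=2000):
    """Over-retrieve, drop weak matches, then pack chunks under a token budget.

    scored_chunks: list of (similarity, chunk_dict), highest similarity first.
    Token counts are estimated at roughly 4 characters per token.
    """
    candidates = [c for c in scored_chunks[:k] if c[0] >= min_score]
    selected, used = [], 0
    for score, chunk in candidates:
        est_tokens = len(chunk["text"]) // 4 + 1
        if used + est_tokens > token_budget:
            break  # stop before overflowing the model's context window
        selected.append(chunk)
        used += est_tokens
    return selected
```

If `selected` comes back empty, you are in the “no results” path discussed below and should not pass empty context to the model.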
Context assembly should be explicit and auditable. Build a block like:
[S1] source=docs/resume.md#Projects chunk=12
...chunk text...
[S2] source=docs/projectA/README.md chunk=03
...chunk text...

Then instruct the model: “Use only sources S1–S#; when you state a fact, cite like [S1]. If the sources do not contain the answer, say you can’t find it.” This single instruction dramatically improves trustworthiness. It also enables your checkpoint: answers grounded with citations.
In FastAPI, treat retrieval as a separate function/module so you can test it without the LLM. A common mistake is coupling retrieval and generation tightly; when outputs are wrong you can’t tell whether retrieval failed or generation drifted. Add a debug mode that returns retrieved chunks and scores alongside the answer (or logs them). That makes demos compelling: you can show the evidence.
Finally, handle the “no results” path. If similarity scores are low, return a safe response that asks a clarifying question or suggests which document set to add. Don’t pass empty context and hope the model guesses—this is where hallucinations are born.
Guardrails are what turn a clever demo into something you can responsibly share. Local deployments are not automatically safer; they simply change the risk profile. You still need to protect your service from abusive inputs, accidental data leakage, runaway latency, and outputs that violate your intended use (for example, generating harmful instructions or exposing sensitive document excerpts beyond what the user should see).
Start with input constraints. Validate message length, reject extremely long prompts, and limit attachments. In FastAPI, enforce max characters per message and max messages per request. Add a request size limit at the server/proxy layer if possible. This prevents memory spikes during embedding and context assembly.
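A minimal validator along these lines could look as follows (the specific limits are example values you should tune for your model and hardware):

```python
MAX_MESSAGE_CHARS = 4000  # example limit, not a recommendation
MAX_MESSAGES = 20         # example limit per request

def validate_chat_request(messages):
    """Reject oversized requests before they reach embedding or generation."""
    if not messages:
        raise ValueError("request must contain at least one message")
    if len(messages) > MAX_MESSAGES:
        raise ValueError(f"too many messages (max {MAX_MESSAGES})")
    for m in messages:
        text = m.get("content", "")
        if not text.strip():
            raise ValueError("empty message content")
        if len(text) > MAX_MESSAGE_CHARS:
            raise ValueError(f"message too long (max {MAX_MESSAGE_CHARS} chars)")
    return messages
```

In FastAPI you would raise an HTTP 422 from this check, keeping the error shape consistent with your other validation failures.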
Add timeouts and retries. Local inference can stall if the machine is under load. Use a per-request timeout for embedding and generation. Retries should be selective: retry transient failures (model server unavailable), but do not retry policy violations or invalid inputs. If you stream tokens, ensure you handle client disconnects cleanly so you don’t keep generating after the user has gone away.
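Selective retries can be sketched with a small wrapper (the exception classes and backoff values are assumptions; map them to your HTTP client's actual errors):

```python
import time

# Retry only transient failures; never retry invalid input or policy errors.
TRANSIENT = (ConnectionError, TimeoutError)

def call_with_retry(fn, retries=2, backoff=0.5):
    """Call fn(), retrying transient failures with exponential backoff."""
    for attempt in range(retries + 1):
        try:
            return fn()
        except TRANSIENT:
            if attempt == retries:
                raise  # exhausted retries: surface the failure
            time.sleep(backoff * (2 ** attempt))
```

A `ValueError` from input validation passes straight through, which is the behavior you want: retrying a bad request just wastes compute.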
Implement basic content policies as a layered approach: validate and filter inputs before they reach the model, state the intended use and refusal behavior in the system prompt, and check outputs before returning them (for example, redacting document excerpts the user should not see). No single layer is reliable on its own; together they cover most failure modes.
Add rate limiting even locally, especially if you plan to expose the API on your network. Rate limiting protects your machine from accidental loops (e.g., a front-end bug spamming requests). A simple token bucket per IP or per API key is enough for a portfolio project.
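A token bucket of the kind described can be sketched in a few lines (capacity and refill rate are example values):

```python
import time

class TokenBucket:
    """Per-client limiter: `capacity` burst requests, refilled at `rate`/sec."""

    def __init__(self, capacity=10, rate=1.0):
        self.capacity = capacity
        self.rate = rate
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

buckets = {}  # key by client IP or API key

def allowed(client_id: str) -> bool:
    return buckets.setdefault(client_id, TokenBucket()).allow()
```

In FastAPI this check fits naturally into a dependency or middleware that returns HTTP 429 when `allow()` is False.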
Common mistakes: forgetting to cap retrieval context (leading to prompt overflow), not handling empty retrieval (leading to hallucinations), and logging raw documents or prompts in a way that leaks sensitive content. Your guardrails should include a policy for what gets logged and what must be redacted.
When your assistant answers incorrectly, you need to know why. Observability is the difference between guessing and debugging. For a local RAG assistant, focus on three things: structured logs, lightweight tracing, and version tracking for prompts and indices.
Structured logs should capture: request ID, user/session identifier (non-sensitive), model name, embedding model name, chunking parameters, retrieval top-k, similarity scores, and latency per stage (embed time, retrieval time, generation time). Avoid logging raw user content by default; instead log hashes or truncated previews. If you need full payloads for debugging, gate them behind a local-only debug flag and redact document text.
Tracing can be as simple as timing spans in code, but if you want a portfolio-level touch, integrate OpenTelemetry to create spans for: ingest, embed_query, vector_search, assemble_context, and generate. Even without a full tracing backend, exporting to console during development helps identify bottlenecks (often embeddings on CPU).
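If you skip OpenTelemetry for now, the same idea works as a plain timing context manager (the stage names follow the list above; `time.sleep` stands in for real work):

```python
import time
from contextlib import contextmanager

timings = {}  # per-request in practice; module-level here for brevity

@contextmanager
def span(name: str):
    """Record wall-clock duration per pipeline stage (poor man's tracing)."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = time.perf_counter() - start

# Usage: wrap each stage, then log `timings` with the request ID.
with span("embed_query"):
    time.sleep(0.01)  # stand-in for embedding work
with span("vector_search"):
    time.sleep(0.01)  # stand-in for similarity search
```

Even this crude version answers the most common question in local RAG debugging: which stage is eating the latency.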
Prompt and version tracking is essential because small prompt changes can cause big behavior shifts. Store your system prompt and RAG prompt template in versioned files. Log the prompt version used per request. Do the same for your vector index: track the embedding model, chunk size/overlap, and the document corpus hash. When you rebuild the index, increment an index version and log it. This makes your assistant reproducible: you can rerun a demo later and explain differences when outputs change.
With observability in place, your assistant becomes an engineering artifact, not a black box. That is exactly the kind of maturity that signals readiness for AI-adjacent roles—especially in teams that care about reliability and auditability.
1. Why does Chapter 5 add retrieval (RAG) to the assistant instead of relying on the model to "remember" information?
2. Which pipeline best matches the chapter’s RAG goal from documents to an answer?
3. What is the primary purpose of requiring citations in the assistant’s responses?
4. Which set of guardrails best reflects the chapter’s recommended approach for a robust local assistant?
5. What should the assistant do at the checkpoint when relevant context is missing from the retrieved documents?
You can build a local LLM assistant that “works on your machine” in an afternoon. Shipping something that other people can run, trust, and evaluate is a different skill—and it’s exactly the skill hiring managers look for in career transitioners. This chapter turns your Ollama + FastAPI project into a polished artifact: tested, measured, documented, and easy to demo.
The goal is not enterprise perfection. The goal is professional reliability: your API should fail predictably, your quality should be measurable, latency should be explainable, and the repo should tell a clear story. You’ll implement smoke tests and contract tests, create a small evaluation set for quality and latency, and prepare a demo script and screenshots for your portfolio. You’ll also write a deployment/usage guide so reviewers can run it in minutes—then package the whole thing as a checkpoint: a published project that supports your career transition.
Throughout this chapter, lean into engineering judgment. For local inference, variability is normal: model updates change outputs, different machines have different performance, and prompt changes can shift behavior dramatically. Your job is to make those changes visible and manageable with tests, evals, profiling, and clear documentation.
Practice note for this chapter’s tasks — creating smoke and contract tests for the API, building a small evaluation set for quality/latency, preparing a demo script and screenshots, writing a deployment and usage guide, and publishing the checkpoint project: for each one, document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Testing an LLM-powered API is partly traditional software testing and partly “behavior checking.” Start with a simple pyramid: unit tests for pure functions, integration tests for your FastAPI app, and a small set of golden prompts that act like snapshot tests for your assistant’s behavior.
Unit tests should cover anything deterministic: input validation, request parsing, safety filter rules, document chunking, and prompt-template rendering. Common mistake: skipping unit tests because “the model is nondeterministic.” Your glue code is deterministic—test it. Example: a function that clamps max_tokens, rejects empty messages, or enforces a context window should be unit tested with edge cases.
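The `max_tokens` clamp mentioned above makes a good first unit test (the bounds are illustrative):

```python
def clamp_max_tokens(requested, floor=1, ceiling=1024):
    """Deterministic glue code like this deserves plain unit tests."""
    if requested is None:
        return ceiling
    return max(floor, min(int(requested), ceiling))

def test_clamp_defaults_when_missing():
    assert clamp_max_tokens(None) == 1024

def test_clamp_enforces_bounds():
    assert clamp_max_tokens(0) == 1        # below floor
    assert clamp_max_tokens(99999) == 1024  # above ceiling
    assert clamp_max_tokens(256) == 256     # in range, unchanged
```

Nothing here touches a model, so these tests run in milliseconds and never flake.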
Integration tests should verify your API contract. Use FastAPI’s TestClient (or httpx) to call endpoints the way a real client would. Create smoke tests that run fast and answer “is the service alive?”: health endpoint returns 200, chat endpoint returns JSON with required fields, and streaming endpoints yield multiple chunks. Add contract tests that lock down request/response schemas and error shapes. Common mistake: returning different error formats from different code paths; reviewers hate this because it makes clients brittle.
Example contract checks:

- GET /health returns {"status":"ok"} and includes model name/version.
- Invalid input returns 422 with a consistent error schema; a timeout returns 504 with a stable message.
- Streaming responses emit chunks with a delta field and conclude with a final “done” marker.

Golden prompts are curated prompts with expected properties. Don’t assert exact wording; assert invariants (must cite retrieved documents, must refuse unsafe request, must answer in bullets). Store them in tests/golden.json with the prompt, retrieval context (if any), and expected checks. When you update prompts or swap Ollama models, run golden prompts to catch regressions early.
Practical outcome: you can show a reviewer a one-command test run (pytest) that proves the API is predictable and the assistant’s key behaviors remain intact across changes.
Tests tell you the system runs; evals tell you the system is good. Keep evals lightweight: 20–50 examples that represent your target use case (career-transition assistant, doc-grounded Q&A, or whatever theme you chose). Your evaluation set should include easy wins, ambiguous questions, and “gotchas” that previously broke the system.
Build a small dataset in a simple format (evals/cases.jsonl): input messages, optional documents to retrieve from, and an expected answer sketch. An expected answer sketch is not a full script; it’s a checklist. For example: “mentions 3 steps, warns about secrets, references README section X.” This avoids the common mistake of brittle exact-match evaluation, especially with local LLM variability.
Add a rubric with 3–5 criteria you can score quickly: for example, correctness, groundedness (cites the right source), completeness, format/tone, and appropriate refusal. Score each on a small scale (0–2) so a full eval run takes minutes, not hours.
Measure regressions by tracking scores over time. When you change the prompt template, add a safety filter, or switch Ollama models, rerun the eval set and record deltas. A practical workflow is to store a baseline report in evals/reports/ and compare new runs in CI. You’re not chasing perfect scores; you’re demonstrating that you can detect and explain changes.
Include latency in the eval output: time-to-first-token, tokens/sec, and total duration. Common mistake: only measuring total time. For chat UX, time-to-first-token matters more than total completion.
Practical outcome: you can show an “evaluation card” in your repo—what you measured, how you scored it, and what improved after iterations.
Local inference performance is a product feature. Reviewers will try your demo and immediately feel whether it’s responsive. Profile first, then tune. Start by adding simple instrumentation around your FastAPI endpoints: log request size, retrieved chunk count, prompt token estimate, time-to-first-token, total duration, and error rates.
Context management is the fastest lever. Long prompts are expensive. Common mistake: blindly appending entire chat history and large retrieved passages. Instead, implement a policy: keep the last N turns, summarize older turns, and cap retrieved context by characters or estimated tokens. If you use RAG, retrieve fewer but higher-quality chunks (e.g., top 3) and include metadata (title/path) rather than extra text.
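The keep-last-N policy can be sketched as follows (summarizing older turns is left out; `keep_last` and the character budget are example values):

```python
def trim_history(messages, keep_last=6, max_context_chars=6000):
    """Keep the system prompt and the last N turns, capped by total size.

    `messages` is a list of {"role": ..., "content": ...} dicts, oldest first.
    """
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"][-keep_last:]
    # Drop the oldest kept turns until the rough character budget is met.
    while turns and sum(len(m["content"]) for m in system + turns) > max_context_chars:
        turns.pop(0)
    return system + turns
```

The system prompt is always preserved; only conversational turns are trimmed, which keeps behavior stable while bounding prompt cost.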
Caching improves repeat interactions. Cache at the right layer: embeddings for unchanged documents (keyed by content hash), retrieval results for repeated queries, and — cautiously — model outputs for fixed demo prompts.
Be careful caching model outputs unless you’re explicit; it can hide regressions. A safe compromise is caching only for demo endpoints or adding a cache=false query parameter during evaluation runs.
Batching is useful if you evaluate many prompts. Instead of firing 50 separate requests, write an eval runner that submits requests sequentially with controlled concurrency (e.g., 2–4 workers) to avoid thrashing the CPU/GPU. Common mistake: maxing concurrency on a laptop and then concluding the model is “slow.” You saturated resources.
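A controlled-concurrency eval runner is a few lines with the standard library (`call_model` is whatever function sends one case to your API; the worker count is an example):

```python
from concurrent.futures import ThreadPoolExecutor

def run_evals(cases, call_model, max_workers=3):
    """Run eval cases with limited concurrency so a laptop isn't saturated.

    Results come back in the same order as `cases`, which keeps reports stable.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(call_model, cases))
```

Because `pool.map` preserves input order, you can diff one eval report against the previous run line by line.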
Tuning judgment: prefer small, explainable optimizations over mysterious tweaks. Your portfolio story should say, “I reduced median time-to-first-token from 1.8s to 0.7s by trimming context and limiting retrieval to 3 chunks,” not “I changed random settings until it felt better.”
Practical outcome: your repo includes a performance note with baseline metrics and specific changes tied to measurable improvements.
Even a local-first assistant deserves security basics because reviewers will judge your professional instincts. Start with the basics: CORS, authentication options, and secrets hygiene. The goal is to show you know what matters and can implement sensible defaults.
CORS: if you have a frontend (even a simple HTML page), lock down allowed origins. Common mistake: allow_origins=["*"] forever. For local dev, allow http://localhost:5173 (or your port). In production-like demos, use an environment variable ALLOWED_ORIGINS and parse a comma-separated list.
Auth options: you don’t need a full OAuth setup for a portfolio project, but you should offer a minimal option. A practical pattern is an API key header (e.g., X-API-Key) validated by middleware. Document how to disable it for local demo (AUTH_MODE=none) and how to enable it (AUTH_MODE=api_key). Common mistake: shipping “security theater” (keys hardcoded in code) instead of real configuration.
Secrets hygiene: never commit keys, tokens, or private documents. Use .env.example and document required variables. Ensure your Docker image does not bake secrets at build time; pass them at runtime. If your RAG uses local files, add data/ to .gitignore and provide a small public sample dataset for reviewers.
Also include simple guardrails you built earlier—timeouts, retries, and input validation—and explain how they prevent abuse (e.g., huge payloads or prompt injection attempts in documents). Practical outcome: a reviewer sees responsible defaults and a clear security section in your guide.
A good portfolio project is a product: it explains itself, runs quickly, and shows evidence. Your README is the interface. Aim for “reviewer success in 10 minutes.” Include a demo script and screenshots so someone can evaluate without deep setup.
Use a predictable README structure:

- What it is: one paragraph describing the assistant and who it’s for.
- Quickstart: a copy-paste path to run it with docker compose up or uvicorn.
- Architecture: a short diagram or bullet flow (FastAPI → retrieval → Ollama).
- Testing and evals: how to run the tests and eval set, with sample results.
- Security and limitations: what is enforced and what is out of scope.

Project storytelling matters for career transitions. Tie your engineering choices to user outcomes: “Local inference avoids API costs and supports offline use,” “Streaming improves perceived latency,” “RAG grounds answers in documents.” Common mistake: listing tools without rationale. Hiring managers want to see decision-making.
Prepare a demo script with 3–5 scenarios: one normal question, one doc-grounded question, one refusal/safety example, and one performance example (“watch time-to-first-token”). Capture screenshots of each, and store them in docs/images. Practical outcome: your portfolio can be evaluated asynchronously and still lands your message.
Your final checkpoint is not just publishing code; it’s being able to explain it under interview pressure. Practice a short system design walkthrough that starts from requirements and ends at trade-offs. Keep it crisp: problem, architecture, key endpoints, testing/evals, performance, and security.
Prepare talking points aligned to the course outcomes: local model serving with Ollama, a streaming FastAPI backend, containerized deployment, retrieval grounding with citations, and measurable testing and evaluation.
Walk through a request end-to-end: client sends chat message → FastAPI validates → optional retrieval fetches top chunks → prompt is assembled with system + user + context → Ollama generates tokens → streaming response → logging captures metrics. Then discuss where you would scale: move to a queue, add rate limiting, add observability, or separate retrieval into its own service.
Common mistake: sounding like you “followed a tutorial.” Replace that with evidence: show test outputs, eval reports, and performance numbers. When asked “what would you improve next?” reference your limitations section and propose the next iteration (better eval coverage, more robust auth, improved retrieval).
Practical outcome: you can present your project as a small but complete system—engineered, measured, and documented—ready to support your career transition into AI.
1. What is the main shift in focus Chapter 6 emphasizes compared to building a local LLM assistant that only works on your machine?
2. Why does Chapter 6 include both smoke tests and contract tests for the API?
3. What is the purpose of building a small evaluation set in this chapter?
4. How does the chapter suggest you handle natural variability in local inference (model updates, machine differences, prompt changes)?
5. Which set of deliverables best matches the “career packaging” goal of the chapter?