AI Certifications & Exam Prep — Intermediate
Deploy production ML on Kubernetes with cert-ready skills in 6 chapters.
Shipping machine learning is not the same as shipping a web app. Model artifacts are large, dependencies can be fragile, latency targets are unforgiving, and reliability issues show up under load. Kubernetes is the standard control plane for running production systems, and it has become the default platform for MLOps teams that need predictable deployments, scalable inference, and auditable operations.
This course is a short technical book in six chapters that builds from fundamentals to production-grade operation. The goal is practical certification readiness: you will learn the cluster skills and model-serving patterns that show up in hands-on exams and real work.
You will package an ML inference service into a secure container, deploy it to Kubernetes, expose it with the right networking primitives, and scale it safely. You’ll also learn how to ship changes with Helm and GitOps, and how to operate your service using observability signals and incident-style debugging.
Chapter 1 establishes the Kubernetes mental model for MLOps and sets up a lab workflow that supports fast iteration. Chapter 2 focuses on packaging: image construction, artifact handling, performance considerations, and security hygiene. Chapter 3 turns your container into a real service with Kubernetes manifests, networking, and configuration management. Chapter 4 adds the production layer—autoscaling, rollouts, and reliability controls. Chapter 5 shows how teams actually deliver: Helm packaging, GitOps reconciliation, and controlled promotion across environments. Chapter 6 ties everything together with observability and certification-style practice tasks that force you to deploy, validate, debug, and improve under time constraints.
This is designed for engineers preparing for Kubernetes-in-MLOps responsibilities and certification-style evaluations: ML engineers moving toward deployment ownership, platform engineers supporting model serving, and data scientists who need to operationalize models beyond notebooks. If you already know the basics of Docker and can read YAML, you’re ready for this intermediate track.
Set up your practice environment (local or managed), then follow the chapters in order. Each chapter ends with a checkpoint milestone that mirrors common exam tasks: deploy, expose, secure, scale, and troubleshoot.
By the end, you’ll have a repeatable blueprint for packaging and serving models on Kubernetes—and the operational instincts to scale, roll out updates safely, and prove your system is healthy using measurable signals.
Senior MLOps Engineer, Kubernetes & Model Serving
Sofia Chen is a Senior MLOps Engineer who designs Kubernetes platforms for model training and real-time inference. She has led production deployments using GitOps, service meshes, and GPU scheduling across regulated environments. Her teaching focuses on practical, exam-ready skills that map directly to real cluster operations.
This course is about turning a trained model into a dependable service: packaged securely, deployed repeatably, scaled predictably, and operated calmly under real production constraints. Kubernetes is not “the place you run containers”; it is the control plane that turns containerized software into a managed, observable system. For MLOps, that means your inference code, model files, and runtime configuration must be expressed in Kubernetes-native terms (pods, deployments, services, ingress, config, and identity) so the platform can do its job.
In this first chapter, you will map the end-to-end model serving lifecycle to Kubernetes primitives, stand up a practice cluster, and build a certification-style workflow that makes your labs reproducible. You will also run a first inference service in a pod and expose it locally. Throughout, keep an “exam mindset”: practice the muscle memory of reading manifests, predicting behavior, and troubleshooting from symptoms to root cause.
As you read, treat each concept as a lever you will pull later: containers for reproducibility, controllers for desired state, services for stable networking, and configuration/identity primitives for safety. Reliability and scale (probes, resource limits, disruption budgets, autoscaling, rollouts) will be recurring themes; you are laying the mental model that makes those topics intuitive rather than memorized.
Practice note for Map the end-to-end model serving lifecycle to Kubernetes primitives: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Stand up a practice cluster and validate kubectl, contexts, and namespaces: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Build a certification-style study plan and lab workflow: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Run your first inference service in a pod and expose it locally: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Checkpoint: troubleshoot common cluster and networking issues: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Most ML systems end up supporting both batch inference and online inference, and Kubernetes helps because it can schedule and manage both styles using the same core primitives. Batch inference is usually throughput-oriented: you run a job over a large dataset, accept higher latency per item, and care about retry behavior, resource efficiency, and data locality. Online inference is latency-oriented: you serve predictions on demand with tight SLOs, handle bursts, and prioritize fast rollouts and safe fallbacks.
In Kubernetes terms, batch often maps naturally to Jobs and CronJobs (or workflow engines layered on top), while online inference maps to long-running Deployments fronted by Services and optionally Ingress. The same cluster can host both, but your engineering judgment changes: batch can tolerate node churn and preemption if it retries; online services need readiness gates, stable networking, and careful rollout strategies.
A practical lifecycle mapping looks like this: package inference code and dependencies into a container image; store it in a registry; deploy it as a Pod template under a controller (Deployment for online, Job for batch); wire stable access using Service/Ingress; provide model and runtime configuration via ConfigMaps/Secrets; grant access to external systems (object storage, feature store, model registry) via ServiceAccounts and RBAC; then scale with HPA/VPA and roll forward with controlled updates. Common mistakes include treating models as “just files” copied manually into nodes, using “latest” tags that break reproducibility, or shipping secrets inside images. Kubernetes gives you the tools to avoid these pitfalls, but only if you consistently express everything as declarative, reviewable configuration.
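The online half of this mapping can be sketched as a Deployment plus Service pair. All names, the namespace, the image tag, and the ConfigMap reference below are hypothetical placeholders for illustration:

```yaml
# Hypothetical Deployment + Service pair for an online inference API.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inference
  namespace: reco-dev
spec:
  replicas: 2
  selector:
    matchLabels:
      app: inference
  template:
    metadata:
      labels:
        app: inference
    spec:
      serviceAccountName: inference-sa   # identity for external access
      containers:
        - name: server
          image: registry.example.com/inference:1.4.2  # immutable tag, never "latest"
          ports:
            - containerPort: 8080
          envFrom:
            - configMapRef:
                name: inference-config   # model URI, device, batch size, etc.
---
apiVersion: v1
kind: Service
metadata:
  name: inference
  namespace: reco-dev
spec:
  selector:
    app: inference
  ports:
    - port: 80
      targetPort: 8080
```

A batch run would swap the Deployment for a Job (with a completion count and restart policy) while reusing the same image, ConfigMap, and ServiceAccount objects.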
Kubernetes becomes much easier when you can name what is happening. A node is a worker machine (VM or bare metal). A pod is the smallest schedulable unit: one or more containers sharing networking and volumes. You rarely create pods directly in production; you create a controller that maintains a desired number of pods. For model serving, that is typically a Deployment (or StatefulSet if you need stable identities, which is uncommon for stateless inference).
Networking is where many early labs fail. Inside the cluster, the CNI (Container Network Interface) plugin provides pod-to-pod networking and assigns IPs. A Service gives a stable virtual IP and DNS name that load-balances to pods matching labels. The cluster DNS (commonly CoreDNS) turns Service names like inference.default.svc.cluster.local into reachable endpoints. If DNS is broken, everything looks “randomly down,” especially libraries that call internal services.
For exam readiness and real operations, build a habit of tracing symptoms to layers. If a request fails, ask: is the pod running? is it ready? is the Service selecting endpoints? does DNS resolve? is network policy blocking traffic? A classic mistake is assuming a Deployment guarantees availability. It only guarantees the controller is trying. Real availability depends on image pulls, probes, resource pressure, CNI health, and whether the Service has endpoints. When you later add autoscaling, canary rollouts, or node upgrades, these fundamentals explain most “mysterious” inference outages.
ML teams often share clusters across environments (dev, staging, prod) and across multiple services (features, training, serving). Namespaces are the first boundary: they scope names, quotas, and many policies. A practical pattern is one namespace per team per environment (for example, reco-dev, reco-staging, reco-prod) so resource limits, network rules, and access controls are easier to reason about.
Your day-to-day safety depends on kubectl contexts. A context combines cluster + user credentials + default namespace. Many incidents start with “I applied the manifest to the wrong cluster.” Make it muscle memory to check context before running commands, and to set an explicit namespace in scripts. In certification scenarios, this also saves time: you will switch clusters or namespaces frequently, and context hygiene prevents subtle errors.
RBAC (Role-Based Access Control) defines what identities can do. For inference services, the identity is typically a ServiceAccount attached to the pod. Grant the minimum permissions needed—ideally none unless your service must call the Kubernetes API. For external systems (S3/GCS, secret managers), prefer cloud-native workload identity integrations over broad node credentials. A common mistake in ML is over-privileging “just to make it work,” which becomes a long-term risk when the same service later gains access to production data. Good RBAC is also good debugging: when permissions are tight and explicit, failures are clearer and easier to audit.
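To make "minimum permissions" concrete, here is a hypothetical least-privilege setup. Most serving pods need no Kubernetes API access at all; the Role below exists only to show the shape of a narrowly scoped grant, and every name is a placeholder:

```yaml
# Hypothetical least-privilege identity for an inference pod.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: inference-sa
  namespace: reco-dev
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: read-own-config
  namespace: reco-dev
rules:
  - apiGroups: [""]
    resources: ["configmaps"]
    resourceNames: ["inference-config"]  # one named object, one verb
    verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: inference-read-own-config
  namespace: reco-dev
subjects:
  - kind: ServiceAccount
    name: inference-sa
    namespace: reco-dev
roleRef:
  kind: Role
  name: read-own-config
  apiGroup: rbac.authorization.k8s.io
```

If the service never calls the Kubernetes API, attach the ServiceAccount with no Role at all; the binding above is only needed once an API dependency genuinely exists.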
You will troubleshoot constantly in MLOps: image pulls, crash loops, bad configuration, failed DNS, slow startups due to model loading, and miswired Services. The fastest path to competence is mastering a small set of kubectl commands and using them in a consistent order.
Start with a short, repeatable command sequence: kubectl get pods -n reco-dev to check pod status, kubectl get deploy,svc,ing -n reco-dev to inspect the surrounding objects, and kubectl exec -it pod/name -- sh to look inside a running container. Run your first inference service as a single pod to keep variables low. Use a small container that serves HTTP (a minimal FastAPI/Flask server or a known demo image) and verify you can port-forward and call a /predict endpoint locally. Then practice reading failure modes: if the port-forward connects but requests hang, check container port binding; if logs show “model file not found,” confirm volume mounts or image contents; if the pod restarts, look for OOMKills and set realistic memory requests. These are the same mechanics you will use later with Deployments and Services—only with more moving parts.
Reproducible serving starts with a disciplined container workflow. Your inference image should contain the serving code and pinned dependencies; your model artifact can be baked into the image for small models, or fetched at startup for large models and frequent updates. Either way, the registry is the distribution point, and your tags are part of your release contract.
Use immutable references whenever possible. In practice, that means tagging images with a version (1.4.2) and also recording the digest (@sha256:...) in deployment manifests for high-assurance environments. Avoid using latest in Kubernetes manifests; it breaks rollback reasoning and can cause non-deterministic behavior across nodes due to caching. For exam tasks and real incidents, being able to say “this pod is running image digest X” is a major debugging advantage.
Security and size matter for ML images. Common mistakes include huge images that slow rollout (hurting availability) and running as root by default. Prefer slim base images, multi-stage builds, and non-root users. Store credentials for the registry using Kubernetes Secrets (imagePullSecrets) or cloud-native identity, not in Dockerfiles. Finally, separate “model versioning” from “code versioning” deliberately: sometimes you will redeploy the same server code with a new model, and sometimes the reverse. Your workflow should support both without forcing ad-hoc manual steps.
You need a practice environment that is fast to reset and close enough to production to build correct instincts. For local labs, kind (Kubernetes in Docker) is excellent for speed, reproducibility, and GitOps-style iteration. minikube can be easier when you need add-ons, ingress controllers, or a VM-based setup. The trade-off is realism: local clusters may differ from managed clusters in load balancers, storage classes, and cloud identity integrations.
For certification-style study, use a repeatable lab workflow: script cluster creation, set contexts predictably, and keep a “clean room” namespace for each exercise. Practice validating your environment first: check kubectl version, confirm current context, create a namespace, and run a simple pod. Then layer complexity: Deployment + Service, then ingress, then configuration objects, then autoscaling. This staged approach mirrors how you should build real inference platforms—small, verifiable steps.
Managed clusters (EKS/GKE/AKS) add production-relevant components: real cloud load balancers, IAM integration, and managed control planes. They also introduce additional failure modes (quota limits, identity misconfiguration, cloud networking). A practical pattern is to learn mechanics locally, then validate once per week on a managed cluster to ensure your manifests behave under real ingress and DNS. When you hit issues, use a troubleshooting checkpoint mindset: isolate whether it is Kubernetes (pods, services, DNS), your application (listening port, model load time), or the environment (CNI, firewall rules, cloud load balancer provisioning). That habit—narrowing the layer before changing anything—is what turns “trial-and-error” into engineering.
1. Why does the chapter describe Kubernetes as more than “the place you run containers” in an MLOps serving context?
2. What is the main reason inference code, model files, and runtime configuration should be expressed in Kubernetes-native terms?
3. In the chapter’s mapping of model serving to Kubernetes primitives, which primitive primarily provides stable networking to reach a changing set of pods?
4. What best reflects the chapter’s “exam mindset” approach to learning Kubernetes for MLOps?
5. Which set of concerns does the chapter highlight as recurring reliability and scale themes you are building intuition for?
Model serving on Kubernetes succeeds or fails long before you write a Deployment. The “unit of delivery” in most MLOps platforms is a container image: it contains your inference API, runtime dependencies, and often (directly or indirectly) the model artifacts. This chapter focuses on packaging inference services into secure, reproducible images that are small, fast to start, and compatible with CPU and GPU environments.
The certification-level skill here is engineering judgment: knowing what to pin, what to isolate, what to scan, and what to version. You will build deterministic container images (so the same commit produces the same bits), choose an artifact loading strategy (baked-in model vs pulled at startup), and harden your images (non-root, minimal base images, and vulnerability scanning). You’ll also practice the supply-chain “checkpoint” workflow: publish versioned images and verify SBOM and scan outputs so a cluster operator can trust what you deploy.
Common mistakes in this space are predictable: shipping huge images that slow cold starts, relying on unpinned dependencies that silently change behavior, running as root with writable filesystems, or mixing model versioning into an untraceable pile of “latest” tags. The goal of this chapter is to turn those pitfalls into repeatable patterns you can apply in CI/CD and later integrate into Helm/GitOps workflows.
Practice note for Containerize an inference API with deterministic builds: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Optimize images for size, cold start, and CPU/GPU compatibility: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Implement model artifact loading strategies and caching: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Harden images with non-root users, minimal bases, and scanning: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Checkpoint: publish versioned images and verify SBOM/scan outputs: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
An inference container is easiest to maintain when you separate three concerns: the API surface, the model loader, and the dependency layer. The API surface is the HTTP/gRPC contract (for example, FastAPI or Flask for HTTP). The model loader is code responsible for fetching/initializing the model, warming caches, and exposing a single “predict()” interface. The dependency layer is everything needed to run predict deterministically: Python packages, system libraries (glibc, libgomp), and optional acceleration stacks (CUDA, cuDNN, MKL/OpenBLAS).
Structure your repo so that the API can be tested without needing a production model artifact. A practical pattern is: app/ for the web server, model/ for loader and pre/post-processing, and requirements.lock (or equivalent) for pinned dependencies. The loader should support configuration via environment variables (model URI, device type, batch size), because Kubernetes will later inject these safely through ConfigMaps/Secrets. Even if Chapter 3 focuses on manifests, you should design now for “12-factor” runtime configuration: no hard-coded endpoints, tokens, or file paths.
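The 12-factor configuration surface can be made explicit with a small settings object. This is a minimal sketch; the environment variable names (MODEL_URI, DEVICE, BATCH_SIZE) are illustrative conventions, not a standard:

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class ServingConfig:
    """Runtime configuration for the model loader, read once at startup.

    Every field comes from the environment so Kubernetes can inject it
    via ConfigMaps/Secrets; nothing is hard-coded in the image.
    """
    model_uri: str
    device: str
    batch_size: int

    @classmethod
    def from_env(cls) -> "ServingConfig":
        return cls(
            model_uri=os.environ["MODEL_URI"],       # fail fast if missing
            device=os.environ.get("DEVICE", "cpu"),  # safe default
            batch_size=int(os.environ.get("BATCH_SIZE", "8")),
        )

# Simulate what a ConfigMap would inject in the cluster.
os.environ.setdefault("MODEL_URI", "s3://models/reco/2.1.0")
cfg = ServingConfig.from_env()
print(cfg.device, cfg.batch_size)
```

Failing fast on a missing MODEL_URI is deliberate: a crash at startup with a clear KeyError is far easier to debug than a pod that comes up ready and serves with a wrong or empty model path.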
To keep startup predictable, treat model initialization as an explicit lifecycle step. Avoid importing heavy libraries at module import time in a way that triggers large allocations before your server process is ready. Instead, load the model in an application startup hook and expose a readiness flag only when loading completes. This becomes critical for Kubernetes readiness probes: if your container reports ready too early, traffic will hit a half-initialized model and produce timeouts that look like random flakiness.
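The readiness-gated lifecycle above can be sketched framework-agnostically. This is a toy stand-in (the "model" is a trivial function and the load delay is simulated); in FastAPI or Flask the same pattern lives in a startup hook with /healthz returning 503 until the flag flips:

```python
import threading
import time

class ModelServer:
    """Sketch of a readiness-gated loader: a health endpoint should
    report not-ready until load() completes, so Kubernetes keeps the
    pod out of Service endpoints while the model is initializing."""

    def __init__(self):
        self._model = None
        self._ready = threading.Event()

    def load(self):
        # Stand-in for expensive work: deserializing weights, warming caches.
        time.sleep(0.01)
        self._model = lambda x: x * 2   # hypothetical predict function
        self._ready.set()               # only now report ready

    def start(self):
        # Run loading in a startup hook, off the request path.
        threading.Thread(target=self.load, daemon=True).start()

    @property
    def ready(self) -> bool:
        return self._ready.is_set()

    def predict(self, x):
        if not self.ready:
            raise RuntimeError("model not loaded")  # maps to HTTP 503
        return self._model(x)

server = ModelServer()
server.start()
server._ready.wait(timeout=5)   # in production, the readiness probe polls instead
print(server.predict(21))       # → 42
```

The key property: predict() refuses to run before the flag is set, which is exactly the contract a Kubernetes readiness probe needs you to expose over HTTP.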
Finally, define a deterministic “contract” between API and loader. For example, your API handler should validate inputs, then call predict(), then post-process. Keeping this boundary clear makes it easier to swap loaders later (baked-in model vs remote pull) without touching HTTP logic.
Deterministic builds start with two decisions: pin what you can, and separate build-time tools from runtime. Multi-stage Docker builds let you compile or assemble dependencies in a “builder” stage and copy only the results into a minimal runtime stage. This reduces image size, attack surface, and cold start time.
A practical Dockerfile pattern for Python is: (1) builder stage installs pinned wheels (from a lockfile) into a virtual environment or a wheelhouse directory; (2) runtime stage uses a slim base image, copies the venv/wheels and your app code, and sets a non-root user. Pin the base image by digest when you need strict reproducibility (for example, python:3.11-slim@sha256:...) and pin your Python dependencies with a lockfile (pip-tools, Poetry lock, uv lock). “Floating” versions (like numpy>=1.20) are a classic cause of “it worked yesterday” incidents.
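A minimal sketch of that two-stage pattern follows. The base tag, lockfile name, UID, and the uvicorn entrypoint are assumptions you would adapt to your own stack; uvicorn must be present in the lockfile for the CMD to work:

```dockerfile
# Hypothetical two-stage build: heavy tooling stays in the builder,
# the runtime stage ships only the venv, the app code, and a non-root user.
FROM python:3.11-slim AS builder
WORKDIR /build
COPY requirements.lock .
RUN python -m venv /opt/venv \
    && /opt/venv/bin/pip install --no-cache-dir -r requirements.lock

FROM python:3.11-slim
# Non-root user; reinforced later by the pod securityContext.
RUN useradd --create-home --uid 10001 serving
COPY --from=builder /opt/venv /opt/venv
COPY app/ /home/serving/app/
USER 10001
ENV PATH="/opt/venv/bin:$PATH"
WORKDIR /home/serving
EXPOSE 8080
# Assumes uvicorn is pinned in requirements.lock.
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8080"]
```

For strict reproducibility, replace the python:3.11-slim references with digest-pinned forms (python:3.11-slim@sha256:...) as described above.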
Enable deterministic installs by controlling indexes and caches in CI. For example, prefer building wheels once and reusing them, rather than compiling from source each build (which can vary with system libraries). When you must compile (common with scientific stacks), compile in the builder stage and copy only shared libraries needed at runtime. Keep the runtime stage free of compilers and package managers whenever possible.
Keep the runtime image lean: exclude build caches (such as the pip cache), and avoid copying test data and notebooks into the image. CPU/GPU compatibility deserves attention early. A CPU image built on a slim Debian base will not magically run on GPU nodes without CUDA libraries. Maintain separate image variants (for example, -cpu and -gpu) and keep their dependency graphs explicit. This makes Kubernetes scheduling and node selection straightforward later, and it prevents accidentally shipping a GPU-sized image to CPU-only clusters.
Serving requires a decision: do you bake model artifacts into the image, or pull them at startup? Both can be correct, and the best choice depends on model size, update frequency, and your supply-chain requirements.
Baking-in means the model file(s) are copied into the image during build. This improves startup reliability (no network dependency) and makes the image self-contained, which simplifies air-gapped deployments. It also creates a single immutable artifact: “image X contains model Y.” The drawback is slower rebuilds and pushes for large models, and potentially large registry storage costs. Baking-in is most attractive for smaller models or when you need strict reproducibility and fast rollback: you can roll back by switching image tags.
Pulling at startup means the image contains code to download the model from object storage or a model registry when the container boots. This keeps images small and lets you update models without rebuilding application code. However, it introduces new failure modes: network outages, credential issues, throttling, and “thundering herd” behavior when many replicas start simultaneously. Mitigate this with caching strategies: download once to a shared PersistentVolume, or implement a local cache directory with checksum validation and backoff retries. If you’re running on Kubernetes, an init container is often the cleanest approach: it downloads the model to a volume, and the main container starts only after the model is present.
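The init-container approach can be sketched as a pod template fragment. The fetcher image, model URI, and file names are hypothetical; the point is the ordering guarantee (the server container cannot start until the init container exits successfully):

```yaml
# Hypothetical init-container pattern: the model lands on a shared
# volume before the serving container is allowed to start.
spec:
  volumes:
    - name: model-cache
      emptyDir: {}
  initContainers:
    - name: fetch-model
      image: registry.example.com/model-fetcher:1.0.0
      env:
        - name: MODEL_URI
          value: s3://models/reco/2.1.0   # injected via ConfigMap in practice
      volumeMounts:
        - name: model-cache
          mountPath: /models
  containers:
    - name: server
      image: registry.example.com/inference:1.4.2
      env:
        - name: MODEL_PATH
          value: /models/model.bin
      volumeMounts:
        - name: model-cache
          mountPath: /models
          readOnly: true                  # the server never mutates the artifact
```

Mounting the cache read-only in the main container is a cheap safety win: the artifact on disk provably matches what the fetcher validated.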
Whichever approach you choose, treat the model as a first-class versioned artifact. Model version should be explicit (for example, a registry version or object path), and your runtime logs should print the resolved version and checksum. This is invaluable when debugging drift or unexpected predictions in production.
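Logging the resolved version and checksum at startup is a few lines. This sketch uses a throwaway temp file as a stand-in for a real artifact; the log-line format is illustrative:

```python
import hashlib
import pathlib
import tempfile

def model_fingerprint(path: str, version: str) -> str:
    """Compute a sha256 checksum for a model file and format the
    startup log line that ties "which model is this?" to evidence."""
    digest = hashlib.sha256(pathlib.Path(path).read_bytes()).hexdigest()
    return f"model version={version} sha256={digest[:12]}... path={path}"

# Demo with a throwaway file standing in for a real artifact.
with tempfile.NamedTemporaryFile(suffix=".bin", delete=False) as f:
    f.write(b"fake model weights")
    artifact = f.name

print(model_fingerprint(artifact, "2.1.0"))
```

Emitting this once per process start means any pod's logs can answer "which exact bytes is this replica serving?" during an incident, without exec'ing into the container.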
Inference performance is rarely just “add more CPU.” Packaging choices influence latency, throughput, and how quickly pods become ready. Start with threading: many ML libraries use native thread pools (OpenMP, MKL, OpenBLAS). If you run multiple worker processes (for example, Gunicorn workers) and each process uses many BLAS threads, you can oversubscribe the CPU and slow everything down. Set thread-related environment variables intentionally (for example, OMP_NUM_THREADS, MKL_NUM_THREADS) and align them with Kubernetes CPU limits. A common practical approach is “one process per core” for CPU-bound workloads and a small fixed number of BLAS threads per process.
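Pinning the thread pools can be done in code before the heavy imports, or equivalently via env in the Dockerfile/manifest. A minimal sketch (the per_process heuristic is a rule of thumb, not a universal answer):

```python
import os

# Thread pools are sized when the native libraries initialize, so these
# must be set before importing numpy/torch/etc., not after.
THREAD_VARS = ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS")

def pin_blas_threads(per_process: int = 2) -> dict:
    """Align native thread pools with the container's CPU budget.

    With N worker processes under a CPU limit of C cores, a reasonable
    starting point is per_process = max(1, C // N); benchmark from there.
    setdefault lets an operator still override via the manifest.
    """
    for var in THREAD_VARS:
        os.environ.setdefault(var, str(per_process))
    return {var: os.environ[var] for var in THREAD_VARS}

print(pin_blas_threads(per_process=2))
# Only after this point should the ML stack be imported.
```

Because the values are read at library initialization, setting them after `import numpy` silently does nothing, which is a classic source of "limits had no effect" confusion.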
BLAS choice matters. MKL can be faster for some workloads but may increase image size and licensing/redistribution constraints depending on your base image and distribution. OpenBLAS is widely available and often sufficient. The key is consistency: choose one stack, document it, and test under the same container runtime constraints you’ll have in production.
Cold start time is influenced by image pull time and model initialization time. To reduce pull time, keep images small and avoid large layers that change frequently. Arrange Docker layers so the most stable layers (base OS, pinned dependencies) are built first, and your frequently changing application code is near the end. This improves layer caching in CI and in the node’s image cache.
Finally, validate performance inside the container, not on your laptop environment. Benchmarking with the same container image you deploy catches surprises like missing CPU instruction support, different glibc behavior, or thread pool defaults that change under cgroup CPU limits.
Container security for model serving is mostly about eliminating unnecessary privilege and reducing what can be exploited. Start by running as a non-root user. In Dockerfile terms, create a dedicated user/group and switch with USER. In Kubernetes terms, you’ll later reinforce this with a pod security context (runAsNonRoot, runAsUser) and disallow privilege escalation. This prevents a large class of “breakout” scenarios from becoming cluster-level incidents.
Next, aim for a read-only root filesystem. Many inference services don’t need to write anywhere except temporary directories. If your framework needs a writable cache (for tokenizers, compiled kernels, or model downloads), mount an explicit writable volume at a known path (for example, /tmp or /cache). The practice forces you to know exactly what the process writes, which is also helpful for reproducibility and debugging.
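These two controls translate into a pod-level security context like the following sketch (UID, image, and mount path are placeholders):

```yaml
# Hypothetical pod hardening that mirrors the image hardening:
# non-root, no privilege escalation, read-only root filesystem,
# with one explicit writable volume for framework caches.
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 10001
  containers:
    - name: server
      image: registry.example.com/inference:1.4.2
      securityContext:
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
      volumeMounts:
        - name: scratch
          mountPath: /cache   # the only place the process may write
  volumes:
    - name: scratch
      emptyDir: {}
```

If the pod crashes with "read-only file system" errors after enabling this, the stack trace tells you exactly which path the framework writes to, which is precisely the knowledge the paragraph above argues you should have.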
Base image selection is a security decision. Minimal bases reduce CVE count and patch workload. Distroless images can be excellent for runtime, but be careful: debugging becomes harder, and some Python stacks assume shell utilities exist. A common compromise is a slim Debian/Ubuntu base for early maturity, and a later move to distroless once you have strong observability and a stable dependency graph.
Common mistakes include shipping secrets inside images (API keys in environment defaults or baked config files) and keeping package managers in runtime layers. Your image should not contain credentials, and production runtime should obtain secrets through Kubernetes Secrets with least-privilege ServiceAccounts and scoped access policies.
Versioning is where packaging meets operations. A Kubernetes cluster can only run what you can identify and retrieve, so define a tagging strategy that supports rollbacks, audits, and reproducibility. Avoid relying on latest. Instead, publish immutable tags tied to source control (for example, app:1.4.2 and app:git-3f2c1d7) and record the image digest (sha256:...) in deployment manifests or release notes. Digests are the ground truth: tags can be moved; digests cannot.
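In a manifest, the difference between the two reference forms looks like this; the sha256 value is an all-zeros placeholder, not a real digest:

```yaml
containers:
  - name: server
    # Tag form: readable and convenient, but a tag can be re-pointed later.
    # image: registry.example.com/inference:1.4.2
    # Digest form: immutable; what high-assurance manifests should pin.
    image: registry.example.com/inference@sha256:0000000000000000000000000000000000000000000000000000000000000000
```

A common compromise is to keep the human-readable tag in a comment or annotation while the image field itself pins the digest.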
Model versioning should be explicit and independent. If you bake the model into the image, include the model version in the image tag (for example, inference:app1.4.2-model2.1.0) or as OCI labels. If you pull at startup, pass the model version as configuration and log the resolved artifact ID and checksum at startup. Either way, your operational question should be answerable: “Which model version served this request?”
Provenance is the supply-chain checkpoint: prove where the image came from and what it contains. In practice, this means: build in CI, sign images (for example, Sigstore/cosign), attach SBOMs and provenance attestations, and verify them before deployment in higher environments. Even if your exam scope focuses on Kubernetes objects, packaging discipline is what makes GitOps safe: Git points to a digest, that digest has a signature and SBOM, and scanning output is stored and reviewed.
When you treat images as immutable, verifiable artifacts, the rest of Kubernetes operations become calmer: rollouts are predictable, incidents are diagnosable, and scaling events don’t introduce surprise behavior because “the image changed under the same tag.”
1. Why does Chapter 2 describe the container image as the primary “unit of delivery” for model serving on Kubernetes?
2. What is the main goal of building deterministic container images for an inference service?
3. When choosing a model artifact loading strategy, what trade-off is the chapter highlighting between baking the model into the image vs pulling it at startup?
4. Which practice best aligns with the chapter’s image hardening guidance?
5. What is the purpose of the supply-chain “checkpoint” workflow described in the chapter?
Model serving on Kubernetes is mostly an exercise in making the implicit explicit: what runs (container + command), how it is reached (networking objects), how it is configured (config and secrets), and what it is allowed to do (identity and RBAC). In MLOps, these details determine whether your inference service is reproducible, safe to operate, and predictable under load. This chapter walks through the concrete manifests you will write for a model API and the supporting objects that let traffic reach it securely, while keeping runtime configuration manageable and least-privilege by default.
A good mental model is “three layers”: (1) workload controller (usually a Deployment) that manages Pods; (2) Service and Ingress that turn Pods into a stable network endpoint; and (3) runtime configuration and permissions (ConfigMaps, Secrets, ServiceAccounts/RBAC). Reliability features—probes and lifecycle hooks—tie it all together by making Kubernetes aware of when a Pod is actually ready to serve and when it should be restarted.
In practice, you will build manifests iteratively. Start with a minimal Deployment for your model server container, add resource requests/limits, then layer on a Service, then Ingress/TLS patterns. Only after connectivity is verified should you tighten permissions with a dedicated ServiceAccount and RBAC, and finally refine health checks for safe rollouts. This workflow keeps debugging tractable: you isolate “my container doesn’t start” from “my networking is wrong” from “my config isn’t being read.”
As you read, pay attention to engineering judgement points: which controller to use for serving, which Service type to pick, when to terminate TLS at the edge, what should be a Secret vs ConfigMap, and how strict probes should be for GPU-heavy model initialization. These are common exam and real-world failure points.
Practice note for Write Deployment and Service manifests for a model API: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Expose inference endpoints with Ingress and TLS-ready patterns: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Manage configuration with ConfigMaps and Secrets: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Apply RBAC and ServiceAccounts for least privilege: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Checkpoint: validate connectivity paths and configuration reload behavior: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Most inference APIs are long-running HTTP/gRPC servers, so the default controller is a Deployment. Deployments provide a ReplicaSet for stable, declarative scaling and rolling updates. A Job, by contrast, is for finite work that should complete (batch scoring, offline embedding generation, one-time model conversion). Choosing the wrong controller creates operational pain: if you run a server as a Job, Kubernetes will consider “running forever” as “never completing,” and your platform automation may treat it as stuck. If you run batch scoring as a Deployment, the controller restarts the process even after it exits successfully, causing repeated runs and duplicate work.
A practical Deployment manifest for a model API should include at minimum: labels (for Service selection), a container image pinned by digest for reproducibility, a container port, and resource requests/limits. For example, request CPU/memory that reflects steady-state inference, and set limits to prevent noisy-neighbor behavior. If your model loads large weights at startup, plan for higher memory during initialization; otherwise you will see OOMKills that look like “random crashes” during rollouts.
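A minimal sketch of such a Deployment (the name, image reference, and resource numbers are illustrative assumptions, not the chapter's exact manifest):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: fraud-model
  labels:
    app: fraud-model
spec:
  replicas: 2
  selector:
    matchLabels:
      app: fraud-model          # must match the Pod template labels
  template:
    metadata:
      labels:
        app: fraud-model        # what the Service will select on
    spec:
      containers:
        - name: server
          # Pin by digest for reproducibility; placeholder shown here.
          image: registry.example.com/fraud-model@sha256:<digest>
          ports:
            - name: http
              containerPort: 8080
          resources:
            requests:           # steady-state inference footprint
              cpu: 500m
              memory: 2Gi       # headroom for loading weights at startup
            limits:
              cpu: "1"
              memory: 2Gi       # exceeding this OOMKills the container
```

Note the memory request covers model load, not just steady-state serving, to avoid startup OOMKills during rollouts.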
Common mistakes include: relying on the image tag :latest (breaks reproducibility), omitting resource requests (HPA and scheduling behave poorly), and mixing app and infra responsibilities (embedding Ingress logic in the container rather than using Kubernetes objects). Keep the Pod spec focused: run one primary model server container; add a sidecar only when there is a clear need (e.g., a metrics exporter or a model file syncer).
Engineering judgement: use a Deployment for online inference; use a Job (or CronJob) for scheduled or one-off inference tasks; consider a StatefulSet only if each replica must maintain stable identity or local state (rare for stateless serving). This clarity helps you write manifests that match the runtime behavior you actually want.
A Pod IP is ephemeral; a Service creates a stable virtual IP and DNS name that load-balances traffic to matching Pods. For internal serving, you will usually create a ClusterIP Service, then optionally add an Ingress for external access. The typical pattern for a model API is: Deployment labels like app: fraud-model, Service selector app: fraud-model, and a named port (e.g., http on 8080) so other objects can reference it cleanly.
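That pattern can be sketched as a ClusterIP Service (names and ports are illustrative, matching the `app: fraud-model` labels described above):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: fraud-model
spec:
  type: ClusterIP          # default; internal-only virtual IP + DNS name
  selector:
    app: fraud-model       # must match the Pod labels exactly, or you get zero endpoints
  ports:
    - name: http
      port: 8080           # port callers use: fraud-model.<namespace>.svc:8080
      targetPort: http     # resolves to the named container port on the Pod
```

Using a named `targetPort` lets you change the container port later without touching every object that references it.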
Service types matter. ClusterIP is internal-only and is the default; it is ideal for microservice-to-microservice calls or when Ingress will front the service. NodePort exposes a port on every node; it is simple but rarely the best production choice because it expands the attack surface and complicates TLS and routing. LoadBalancer provisions a cloud load balancer (when supported) and is common for direct L4 exposure (gRPC, TCP) or simple setups without an Ingress controller. A headless Service (clusterIP: None) returns Pod IPs directly via DNS and is used for stateful/discovery patterns—usually not required for stateless inference, but relevant if you run per-replica caching layers or specialized routing.
Practical workflow: write the Service immediately after the Deployment and test connectivity inside the cluster. Use an ephemeral debug Pod (e.g., kubectl run -it --rm with a curl image) to call http://<service>.<namespace>.svc.cluster.local:8080/health. If this fails, do not jump to Ingress debugging—fix selector labels, ports, and readiness first. A frequent mistake is mismatched labels: the Service selector does not match Pod labels, resulting in zero endpoints. Another is targeting the wrong port: the Service port and targetPort must map correctly to the container port.
Engineering judgement: default to ClusterIP + Ingress for HTTP-based inference. Use LoadBalancer when you need raw L4 exposure or your environment lacks an Ingress controller. Avoid NodePort unless you have a specific controlled-network use case.
Ingress is the standard Kubernetes API for HTTP(S) routing into the cluster. It works in conjunction with an Ingress controller (NGINX, Traefik, HAProxy, cloud-native controllers) that actually implements the routing. The core value for model serving is clean separation: your model server stays a simple HTTP app behind a ClusterIP Service, while Ingress handles hostnames, paths, TLS termination, and edge behaviors like request size limits.
Routing fundamentals: an Ingress rule typically binds a host (e.g., api.example.com) and a path prefix (e.g., /fraud) to a backend Service and port. This enables multi-model routing patterns where one domain fronts multiple inference services. Be deliberate with path handling: some controllers rewrite paths, some preserve them; mismatches lead to “404 only through Ingress” bugs. Use annotations (controller-specific metadata) to configure timeouts and maximum body sizes—important when requests include images or large feature payloads.
TLS concepts: Ingress can terminate TLS at the edge and forward plain HTTP to your Service, or it can use TLS passthrough depending on controller features. For most MLOps teams, edge termination is simplest: clients connect via HTTPS; the Ingress presents a certificate and routes to internal Services. “TLS-ready” patterns include reserving a stable hostname, referencing a Secret containing the TLS cert/key, and choosing a certificate workflow (manual, external PKI, or cert-manager). A common mistake is creating an Ingress with a TLS section but forgetting the Secret or using a Secret in the wrong namespace; Ingress Secrets are namespace-scoped.
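An edge-termination sketch tying these pieces together (hostname, Secret name, and `ingressClassName` are assumptions for your environment):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: fraud-model
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - api.example.com
      secretName: api-example-com-tls   # must exist in this same namespace
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /fraud                # multi-model routing: one host, many paths
            pathType: Prefix
            backend:
              service:
                name: fraud-model
                port:
                  name: http
```

If the `secretName` Secret is missing or in another namespace, most controllers fall back to a default certificate, which surfaces as browser TLS warnings rather than an obvious error.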
Practical outcome: once ClusterIP connectivity works, add Ingress and validate the full path: DNS → Ingress IP → Ingress rule → Service endpoints → Pod. If something fails, inspect in that order. Many teams waste hours debugging application code when the issue is an Ingress annotation (timeout too low for cold-start models) or an incorrect service port reference.
Model servers rarely run with zero configuration. You will set things like log levels, model artifact locations, feature store endpoints, timeouts, and authentication settings. Kubernetes provides two first-class objects: ConfigMaps for non-sensitive configuration and Secrets for sensitive values (API keys, tokens, database passwords, private certificates). The operational goal is to keep images immutable while injecting environment-specific configuration at runtime.
You can consume ConfigMaps/Secrets as environment variables or as mounted files. Environment variables are straightforward and work well for small settings (e.g., MODEL_NAME, LOG_LEVEL). Mounted files are better for structured config (YAML/JSON), certificates, or when your application watches files for reload. A common mistake is putting large blobs into env vars and then struggling with quoting/formatting; use a mounted file instead.
Rotation and reload behavior matter in production. When you update a ConfigMap or Secret, Pods do not automatically restart. If values are injected as env vars, a restart is required to take effect. If values are mounted as files, Kubernetes updates the projected volume content (with some delay), but your application must reread the files. Therefore, decide explicitly: do you want “restart to apply” (simple, predictable) or “live reload” (more complex, sometimes necessary)? For ML serving, many teams prefer controlled restarts via rolling updates so changes are auditable and correlate cleanly with metrics.
Practical pattern: store non-secret config in a ConfigMap, secrets in a Secret, mount both under /etc/modelserver/, and have the container read a single config file at startup. If you need to rotate credentials without downtime, implement short TTL credentials where possible and design the server to re-auth on demand. Common mistakes include committing Secret manifests to source control unencrypted and using the same Secret across namespaces without clear ownership; both violate the principle of least exposure.
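The pattern above can be sketched with two objects (names, keys, and values here are illustrative assumptions):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: modelserver-config
data:
  config.yaml: |
    log_level: info
    model_uri: s3://models/fraud/2.1.0   # non-secret: artifact location
---
apiVersion: v1
kind: Secret
metadata:
  name: modelserver-secrets
stringData:
  api-token: "<injected by CI, never committed in plain text>"
```

In the Pod spec, both are mounted as read-only volumes under /etc/modelserver/, and the server reads a single config file at startup, so a rolling restart applies changes predictably.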
Kubernetes identity for Pods is expressed through a ServiceAccount. RBAC then defines what that identity can do. For inference workloads, the safest default is: the Pod should not be able to list or modify Kubernetes resources unless it truly needs to. Many serving containers only need network access to upstream dependencies and do not need Kubernetes API access at all. In that case, you still create a dedicated ServiceAccount but grant it no additional permissions, and you can even disable automounting of service account tokens if supported by your platform.
When permissions are required (e.g., the model server needs to read a ConfigMap that contains routing rules, or it participates in leader election, or it reports custom metrics through the Kubernetes API), define a Role scoped to the namespace rather than a ClusterRole. Then bind it with a RoleBinding to the ServiceAccount used by the Deployment. Keep rules narrow: list the exact API groups, resources, and verbs. A common anti-pattern is granting cluster-admin to “make it work,” which can turn a compromised container into a cluster compromise.
Practical workflow: (1) create ServiceAccount fraud-infer-sa; (2) update the Deployment to reference it; (3) run with zero RBAC permissions; (4) if the application fails due to forbidden errors, add the minimal Role rules needed. This “fail closed” approach is both exam-relevant and operationally sound.
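A "fail closed" sketch of that workflow (the Role shown is the kind you would add only after observing forbidden errors; all names are illustrative):

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: fraud-infer-sa
automountServiceAccountToken: false   # no API token in the Pod unless proven necessary
---
# Added later, only if the app genuinely needs it: read one named ConfigMap.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: fraud-infer-read-config
rules:
  - apiGroups: [""]
    resources: ["configmaps"]
    resourceNames: ["routing-rules"]   # exact object, not all ConfigMaps
    verbs: ["get", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: fraud-infer-read-config
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: fraud-infer-read-config
subjects:
  - kind: ServiceAccount
    name: fraud-infer-sa
```

The Deployment then sets `serviceAccountName: fraud-infer-sa` in its Pod spec; with no bindings at all, the workload has zero Kubernetes API permissions.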
Engineering judgement: separate human permissions from workload permissions. Developers may have broad access for troubleshooting, but the running inference Pods should be tightly constrained. This reduces blast radius and aligns with regulated environments where model endpoints handle sensitive features or PII-derived signals.
Kubernetes can only manage reliability if it can observe application health. Probes are the mechanism: startupProbe answers “has the process finished initializing?”, readinessProbe answers “should this Pod receive traffic?”, and livenessProbe answers “is this Pod stuck and should it be restarted?” For ML inference, these should be designed around real serving behavior, not generic “port open” checks.
Startup time is a key difference between typical web apps and model servers. Loading weights, warming caches, or initializing GPU contexts can take tens of seconds or minutes. Without a startupProbe, Kubernetes may run liveness checks too early and restart the Pod repeatedly, causing a crash loop that looks like a capacity problem. Use a generous startupProbe threshold and keep its endpoint lightweight (e.g., “process started and model file present”), while readiness should represent “model is loaded and can answer a small request.”
Readiness gates rollouts safely. During a rolling update, only ready Pods should receive traffic through the Service endpoints. If your readinessProbe is too lax, you can send traffic to a Pod that will time out, causing a brief outage during deploys. If it is too strict or too slow, rollouts take longer and autoscaling may misinterpret the service as unhealthy. Liveness probes should be conservative; restarting a model server can be expensive. Prefer detecting deadlocks or unrecoverable states rather than transient upstream failures.
Lifecycle hooks add practical control. A preStop hook plus a termination grace period can let the server stop accepting new requests and drain in-flight ones, reducing 5xx spikes during scale-down and rollouts. In the chapter checkpoint, validate that traffic flows end-to-end (Pod → Service → Ingress) and that configuration changes behave as expected: if config is env-based, a rollout should apply it; if file-based, confirm whether your server reloads without restart. These checks turn “manifests exist” into “the system is operable.”
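The three-probe split can be sketched on the container spec like this (endpoint paths and timings are assumptions to tune against your own startup profile):

```yaml
# Container-spec fragment for a slow-loading model server.
startupProbe:
  httpGet:
    path: /startupz          # lightweight: process up, model file present
    port: http
  periodSeconds: 10
  failureThreshold: 30       # tolerates up to ~5 minutes of initialization
readinessProbe:
  httpGet:
    path: /readyz            # model loaded and can answer a small request
    port: http
  periodSeconds: 5
livenessProbe:
  httpGet:
    path: /healthz           # conservative: deadlock detection only
    port: http
  periodSeconds: 20
  failureThreshold: 3        # restarts are expensive; don't trip on transients
```

Until the startupProbe succeeds, Kubernetes suppresses the liveness probe, which is what prevents the premature-restart crash loop described above.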
1. Which sequence best matches the chapter’s recommended iterative workflow for building a Kubernetes serving setup while keeping debugging tractable?
2. In the chapter’s “three layers” mental model, what is the primary role of Service and Ingress?
3. Why does the chapter highlight probes and lifecycle hooks as reliability features for model serving?
4. What is the chapter’s guidance on permissions for a model-serving workload?
5. Which choice best reflects the chapter’s distinction between configuration types and why it matters for operations?
Once your model server is running in Kubernetes, the next engineering challenge is keeping it fast and available as demand changes and as you ship new model versions. Inference workloads are often bursty (a marketing email triggers a spike), sensitive to tail latency (p95/p99), and resource-hungry in specific ways (CPU vectorization, memory for batching/caches, or GPUs for deep nets). This chapter focuses on scaling and reliability as a single discipline: right-sizing resources so the scheduler can place pods efficiently, autoscaling to match traffic, rollout strategies that protect SLOs during upgrades, and resilience patterns that keep the service healthy through node drains and failures.
A practical mindset for this chapter: treat your model endpoint like a production API. Define what “good” means (latency and error rate SLOs), then design Kubernetes behaviors—requests/limits, autoscalers, rollouts, and disruption handling—to defend those targets. You will also run load tests to surface failure modes that don’t show up at idle, such as queue buildup, GC pauses, memory pressure eviction, and slow-start during canaries.
The goal is not to “turn on every feature,” but to make deliberate tradeoffs. Aggressive scaling can thrash caches and rack up cold-start costs; overly conservative scaling can violate latency targets. Safe rollouts can slow delivery, while risky rollouts can trigger incidents. By the end of this chapter, you should be able to choose a scaling and reliability plan that matches your model’s compute profile and your organization’s tolerance for risk.
Practice note for Right-size resources with requests/limits and QoS classes: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Configure HPA for inference workloads and test scaling triggers: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Roll out updates safely using rolling, blue/green, and canary patterns: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Improve resilience with PDBs, anti-affinity, and graceful shutdown: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Checkpoint: run load tests and observe SLO-impacting failure modes: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Kubernetes scheduling decisions start with resource requests. If your inference container requests 500m CPU and 1Gi memory, the scheduler finds a node with at least that much allocatable capacity. Limits cap what a container can consume. The difference matters: requests determine placement and QoS; limits determine throttling (CPU) and termination risk (memory).
QoS classes are derived from your requests/limits. Guaranteed means CPU and memory requests equal limits for all containers in the pod. This is the most stable under pressure and is least likely to be evicted. Burstable means requests are set but limits exceed requests (or only some are set). BestEffort means no requests/limits and is first to be evicted. For model serving, Burstable is common (allow bursts), but for latency-critical endpoints on busy clusters, Guaranteed is often worth the higher reservation cost.
Engineering judgment: set requests to the sustained usage you expect under typical load, not the absolute peak, but do not under-request memory. Memory is not compressible; if a pod exceeds its memory limit, it will be OOMKilled. CPU is compressible; if a pod exceeds its CPU limit, it will be throttled, which often increases latency without crashing—sometimes worse than a crash because it silently degrades.
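A Guaranteed-class resources fragment illustrating these rules (numbers are illustrative assumptions, not sizing advice):

```yaml
# Container-spec fragment: requests == limits for all resources → Guaranteed QoS.
resources:
  requests:
    cpu: "1"        # sustained usage under typical load, not absolute peak
    memory: 4Gi     # model weights + batching buffers; memory is not compressible
  limits:
    cpu: "1"        # exceeding this throttles (silent latency), it does not crash
    memory: 4Gi     # exceeding this OOMKills the container
```

Dropping the CPU limit (or raising it above the request) would move the pod to Burstable, trading eviction priority for burst headroom.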
Eviction happens when a node is under resource pressure (especially memory). Your defense is accurate requests, reasonable limits, and readiness/liveness probes (covered later) so Kubernetes can remove unhealthy pods quickly. For inference servers with warmup cost (model load time), plan for the cost of restarts: store models locally (image-baked or fast object storage), and avoid “death by eviction” by reserving enough memory for the model plus batching buffers.
The Horizontal Pod Autoscaler (HPA) changes replica counts based on observed metrics. For inference, the key is choosing a metric that correlates with user experience. CPU utilization is the default, but many model servers are limited by request concurrency, queue depth, or GPU utilization rather than CPU. Start with CPU if you lack better signals, then graduate to custom metrics.
HPA works best when requests are set correctly, because CPU utilization is measured relative to the request. If you request 1 CPU but typically use 200m, HPA may never scale even though latency is rising due to Python GIL or downstream I/O. A good workflow is: run a load test, measure steady-state CPU and memory, set requests accordingly, then tune HPA targets.
Memory-based HPA is possible but can be tricky. Memory tends to climb with caching and fragmentation; scaling on memory can cause oscillations and does not always improve latency. Use memory signals when memory pressure is the primary failure mode (OOM/evictions), and pair it with right-sized limits.
Custom metrics are often best for inference. Examples include requests per second per pod, in-flight requests, or p95 latency (used carefully). With Prometheus Adapter (or a managed metrics pipeline), you can scale on a metric like queue length or concurrency. This aligns scaling with SLOs: if queue depth increases, add replicas before latency explodes.
Set minReplicas high enough to avoid cold-start penalties for large models; the “cheapest” latency is the one you never make users pay. To test triggers, run a controlled load test (constant RPS, then step increases) and watch: replica count, CPU throttling, request latency, and error rate. Verify that scaling events happen before SLO violation, not after. If new pods take 60–120 seconds to become Ready because they load a model, you may need higher baseline replicas, faster image pull, or a pre-loading strategy.
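A CPU-based HPA sketch following this guidance (target names and numbers are assumptions to tune against your own load tests):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: fraud-model
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: fraud-model
  minReplicas: 3            # warm baseline so users never pay model cold-start
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60   # scale before saturation; measured relative to requests
```

Because utilization is computed against the CPU request, this only behaves sensibly once requests reflect measured steady-state usage.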
Not every inference service scales horizontally. Some models need a lot of memory per replica or benefit from large CPU allocations for vectorized kernels. This is where vertical scaling concepts matter. The Vertical Pod Autoscaler (VPA) recommends (and optionally applies) new requests/limits based on historical usage. For many teams, VPA is first used in “recommendation mode” to inform better sizing rather than automatically changing pods.
Understand the tradeoff: VPA often requires restarting pods to apply new resource values, which can conflict with strict availability goals unless you have enough replicas and disruption handling. It can also fight with HPA if both are enabled on the same resource dimensions. A common pattern is: HPA scales replicas on concurrency/CPU, while VPA provides recommendations for memory requests to reduce OOM risk.
Node scaling is the other half of the story. If HPA wants more pods but the cluster has no room, pods will stay Pending. The Cluster Autoscaler (or a managed equivalent) adds nodes when it sees unschedulable pods. For inference, this can be essential during large traffic spikes, but it adds time-to-capacity: provisioning nodes and pulling images can take minutes.
Engineering judgment: decide where you want elasticity. If your product needs rapid scale-up, you may keep spare capacity (buffer nodes, higher minReplicas) to avoid waiting on node provisioning. If cost is paramount and traffic is predictable, you can lean more on cluster autoscaling and scheduled scaling (e.g., higher replicas during business hours).
For exam readiness and real operations, be able to articulate: HPA changes replicas; VPA changes pod size; cluster autoscaler changes node count. Reliable scaling requires coordinating all three layers with startup time, image pull time, and model load time in mind.
Shipping a new model or server version is a reliability event. Kubernetes Deployments provide rolling updates controlled by maxSurge and maxUnavailable. For example, maxSurge: 25% allows extra pods above desired replicas during rollout, while maxUnavailable: 0 prevents reducing capacity. For latency-sensitive inference, setting maxUnavailable: 0 is a common baseline so you don’t sacrifice throughput during an update.
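That baseline can be sketched as a Deployment strategy fragment (the deadline value is an assumption for slow model loads):

```yaml
# Deployment-spec fragment: roll out without ever reducing serving capacity.
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%          # temporary extra pods above desired replicas
      maxUnavailable: 0      # never drop below desired capacity mid-rollout
  progressDeadlineSeconds: 600   # generous, so slow model loads aren't marked failed
```

With `maxUnavailable: 0`, the rollout only proceeds as fast as new pods become Ready, which makes an honest readiness probe essential.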
However, rolling updates alone do not validate model quality or performance. A new model might be slower, use more memory, or change outputs in undesirable ways. That’s where blue/green and canary patterns come in. Blue/green stands up a full new version (green) alongside the old (blue) and switches traffic at once—simple rollback, but it can double capacity temporarily. Canary sends a small percentage of traffic to the new version first and gradually increases it while you observe metrics.
In Kubernetes, canaries can be implemented with multiple Deployments behind the same Service (using label selectors carefully), or with a service mesh/Ingress controller that supports weighted routing. For ML, canaries should evaluate more than HTTP 200s: watch latency, memory, GPU utilization, and domain metrics (e.g., prediction distribution drift, business KPI). Also ensure your readiness probe reflects “model loaded and ready,” not just “process is up,” otherwise traffic may hit pods still warming up.
Practical guardrails: set a generous progressDeadlineSeconds for slow model loads; ensure rollback is automated or one command away; and capture baseline p95 latency before starting. Rollouts are successful when users don’t notice. Your aim is controlled exposure, fast detection, and fast rollback. Treat every new model artifact as a change that could impact SLOs, and design the Deployment strategy accordingly.
Resilience is the set of Kubernetes behaviors that keep your service available when the cluster changes: node upgrades, spot interruptions, hardware failures, and routine maintenance. Start with Pod Disruption Budgets (PDBs). A PDB tells Kubernetes how many pods must remain available during voluntary disruptions (like node drains). For example, setting minAvailable: 90% (or maxUnavailable: 1) prevents a maintenance event from evicting too many inference pods at once.
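A minimal PDB sketch for that policy (name and selector are illustrative):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: fraud-model
spec:
  maxUnavailable: 1          # a node drain may evict at most one pod at a time
  selector:
    matchLabels:
      app: fraud-model       # must match the serving pods' labels
```

Note that PDBs only govern voluntary disruptions such as drains; they do not protect against node crashes, which is what anti-affinity and replica spread are for.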
Next, use anti-affinity to spread replicas across nodes (or zones) so a single node failure doesn’t take out most capacity. A practical rule: if you run 3+ replicas for a critical endpoint, prefer podAntiAffinity on hostname or topology zone. If your cluster is small, use “preferred” anti-affinity to avoid making pods unschedulable.
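The “preferred” form can be sketched as a Pod-spec fragment (labels and topology key are illustrative assumptions):

```yaml
# Pod-spec fragment: spread replicas across nodes without blocking scheduling.
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          topologyKey: kubernetes.io/hostname   # or topology.kubernetes.io/zone
          labelSelector:
            matchLabels:
              app: fraud-model
```

Swapping `preferred...` for `requiredDuringSchedulingIgnoredDuringExecution` makes the spread a hard constraint, appropriate only when you have more nodes than replicas.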
Graceful shutdown matters because load balancers and clients may continue sending traffic briefly after termination begins. Configure a preStop hook and a terminationGracePeriodSeconds that allows in-flight requests to finish and the server to stop accepting new work. Combine this with readiness probes that flip to NotReady quickly on shutdown, so traffic drains before the process exits.
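A shutdown sketch combining both knobs (timings are assumptions; size them to your longest realistic in-flight request):

```yaml
# Pod-spec fragment for draining in-flight inference requests.
spec:
  terminationGracePeriodSeconds: 45   # total budget before SIGKILL
  containers:
    - name: server
      lifecycle:
        preStop:
          exec:
            # Pause before SIGTERM so endpoint controllers and load balancers
            # stop routing new requests to this Pod first.
            command: ["sh", "-c", "sleep 10"]
```

The grace period must exceed the preStop delay plus your server’s own drain time, or Kubernetes will SIGKILL mid-request anyway.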
Be conservative with liveness checks here: a strict /healthz that fails during a dependency slowdown can cause restarts that worsen the incident. Your operational checkpoint is to create failures on purpose: drain a node, kill a pod, and simulate a dependency slowdown during a load test. Observe whether you breach SLOs and why. Reliability improves fastest when you can reproduce “bad days” safely and fix the underlying control knobs.
GPU-backed inference introduces scheduling rules beyond CPU and memory. Kubernetes does not “discover” GPUs by default; it relies on a device plugin (commonly the NVIDIA device plugin) to advertise GPU resources (e.g., nvidia.com/gpu: 1) to the scheduler. Your pod then requests GPUs as a resource limit, and Kubernetes schedules it only onto nodes that can satisfy that request.
To prevent non-GPU workloads from landing on expensive GPU nodes, clusters commonly apply taints to GPU nodes (e.g., nvidia.com/gpu=true:NoSchedule). Your inference pods must include matching tolerations to be eligible for those nodes. This is a clean separation: CPU-only services won’t accidentally consume GPU nodes, and GPU services express intent explicitly.
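Together, the resource request and toleration look like this in the Pod spec (the taint key and effect are common cluster conventions, assumed here):

```yaml
# Pod-spec fragment for a GPU-backed inference server.
spec:
  tolerations:
    - key: nvidia.com/gpu        # matches the taint applied to GPU nodes
      operator: Exists
      effect: NoSchedule
  containers:
    - name: server
      resources:
        limits:
          nvidia.com/gpu: 1      # device-plugin resource; whole-GPU, exclusive allocation
```

GPU resources are specified only as limits (requests are implied equal), and a count of 1 means the scheduler reserves that physical unit exclusively for this Pod.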
Practical considerations for reliability: GPUs amplify cold-start costs (driver initialization, model load to VRAM). Set readiness probes to ensure the model is loaded and GPU is accessible. Right-size GPU requests carefully—requesting 1 GPU means exclusive scheduling of that unit; if your model uses only a fraction, consider batching, model multiplexing frameworks, or smaller GPU types rather than over-provisioning.
For the chapter checkpoint, include GPU scenarios if relevant: run a load test, watch GPU utilization and memory, then verify that autoscaling and rollout behaviors still protect SLOs. The best outcome is predictable behavior: pods schedule where intended, scale within real capacity, and roll out safely without saturating GPUs or triggering eviction loops.
1. Why does the chapter treat scaling and reliability as a single discipline for inference services?
2. What is the most appropriate starting mindset for designing scaling and reliability for a model endpoint?
3. Which scenario best illustrates why inference workloads are challenging to scale correctly?
4. What tradeoff does the chapter warn about regarding autoscaling behavior for inference?
5. Why does the chapter include a checkpoint to run load tests when evaluating scaling and reliability?
In earlier chapters you learned how to containerize an inference service, define Deployments/Services/Ingress, and apply reliability practices like probes, resource limits, and safe rollouts. This chapter moves one layer up: platform delivery. The goal is to make deployments repeatable, auditable, and promotable across environments (dev → stage → prod) without “kubectl roulette” or hand-edited YAML. You’ll package your serving stack into a reusable Helm chart, add environment overlays, and adopt GitOps so the cluster continuously reconciles toward what Git says should exist.
Platform delivery for MLOps is tricky because you’re shipping two kinds of change: code (image tags, runtime flags) and data-like artifacts (model versions, feature configs). Your delivery system must support fast iteration in dev while enforcing guardrails in production. You want easy rollbacks, clear approvals, and minimal configuration drift. A practical design is: Helm (or Kustomize) to generate Kubernetes manifests, Git as the source of truth, and a GitOps controller (Argo CD or Flux) to apply and monitor state. Secrets and tenant boundaries need special handling so promotions don’t leak credentials or exceed shared-cluster budgets.
By the end of this chapter you should be able to deploy a model serving stack via a chart or overlays, promote releases through environments with approvals, manage secrets safely, and perform an audited rollback by reverting Git history.
Practice note for Package the serving stack into a reusable Helm chart: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Create environment overlays for dev/stage/prod promotions: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Adopt GitOps to deploy and roll back consistently: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Manage secrets and configs across environments safely: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Checkpoint: perform an audited rollback via Git history: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Helm is the “package manager” for Kubernetes: it bundles a set of manifests and the logic to parameterize them. For model serving, a single chart can include a Deployment, Service, Ingress (or Gateway), ConfigMaps, ServiceAccount/RBAC, HPA objects, and optional canary resources. The chart becomes your reusable unit of delivery—one chart, many environments—by separating templates (the what) from values (the per-environment configuration).
A Helm chart has three main parts: Chart.yaml (metadata and dependencies), values.yaml (defaults), and templates/ (YAML with Go templating). A typical inference chart exposes values like image.repository, image.tag, resources, ingress.host, replicaCount, and env. Keep values stable and predictable; avoid “clever” templates that encode business logic. When your chart is too dynamic, debugging failures becomes harder than writing plain YAML.
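A small, predictable values surface might look like the sketch below. Every name here (registry, host, env keys) is illustrative; the point is the shape: few values, each with an obvious meaning.

```yaml
# Example values.yaml for an inference chart: a documented, minimal
# interface rather than dozens of toggles.
image:
  repository: registry.example.com/model-server
  tag: "1.8.0"              # overridden per environment during promotion
replicaCount: 2
resources:
  requests: {cpu: "500m", memory: "1Gi"}
  limits:   {cpu: "1",    memory: "2Gi"}
ingress:
  host: models.dev.example.com
env:
  MODEL_VERSION: "2024-06-01"
  MAX_BATCH_SIZE: "8"
```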
Releases are central to Helm. A release is an installed instance of a chart with a specific values set in a namespace. Helm tracks release history, enabling helm rollback. In a GitOps world, you typically let the GitOps controller run Helm under the hood, but it’s still useful to understand release semantics: versioned upgrades, diffing between revisions, and how hooks or jobs run during installs.
A common anti-pattern is using --set in CI to inject secrets. Values are not a secret store; treat them as configuration, not credentials. When designing chart interfaces, think like an API designer: prefer a small set of documented values over dozens of toggles. Provide a clear values.schema.json if your tooling supports it, and include a minimal example values file per environment for readability.
Kustomize approaches reuse differently: instead of templates, it composes YAML through overlays. You define a base (shared resources) and create environment-specific overlays that apply patches, add labels, adjust namespaces, or change images. This is appealing for teams that want to keep manifests “pure Kubernetes” and avoid a templating language.
A practical layout for inference might be: k8s/base containing Deployment/Service/Ingress/ServiceAccount and reliability defaults, then k8s/overlays/dev, stage, and prod. Overlays can patch replica counts, set different ingress hosts, tune HPA thresholds, or swap a GPU nodeSelector in prod. Use strategic merge patches for small edits and JSON6902 patches for precise operations (e.g., replacing one env var without copying the entire container spec).
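A prod overlay following that layout might look like the sketch below. The namespace, names, and the GPU node label are assumptions for illustration; the JSON6902 patch shows the "precise operation" style of editing one field without copying the container spec.

```yaml
# k8s/overlays/prod/kustomization.yaml — illustrative prod overlay
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base            # shared Deployment/Service/Ingress defaults
namespace: team-a-prod
replicas:
  - name: model-server
    count: 6              # prod scale; dev overlay might set 1
patches:
  - target:
      kind: Deployment
      name: model-server
    patch: |-
      - op: add
        path: /spec/template/spec/nodeSelector
        value: {"nvidia.com/gpu.present": "true"}
```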
Overlays are excellent for environment diffs, but they can also hide complexity if patches become large and divergent. A good practice is to keep the base authoritative for security and reliability: probes, resource limits, securityContext, pod disruption budget, and standard labels. Then overlays should mostly adjust scale and routing. If your overlay starts copying most of the Deployment, your “base” has stopped being a base.
Many platforms combine Helm and Kustomize: Helm renders a chart, then Kustomize applies organization-level overlays (namespaces, network policies, extra annotations). This hybrid approach is powerful, but establish ownership boundaries—chart owners control templates, platform owners control overlays—so changes don’t conflict.
GitOps means Git is the source of truth for desired cluster state, and an agent in the cluster continuously reconciles reality to match it. Tools like Argo CD and Flux watch a repository (or Helm registry), render manifests (Helm/Kustomize), apply them, and report health. This is a delivery mindset shift: you stop “deploying to the cluster” and start “merging to Git,” letting reconciliation perform the deployment.
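In Argo CD terms, the "watch and reconcile" contract is expressed as an Application resource. The repository URL, path, and namespaces below are placeholders; the `syncPolicy` block is where automatic reconciliation and drift reversion are enabled.

```yaml
# Illustrative Argo CD Application: the controller renders the overlay
# at the given Git path and reconciles the cluster toward it.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: model-server-prod
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/ml-deploy.git
    targetRevision: main
    path: k8s/overlays/prod
  destination:
    server: https://kubernetes.default.svc
    namespace: team-a-prod
  syncPolicy:
    automated:
      prune: true      # delete resources removed from Git
      selfHeal: true   # revert manual kubectl edits (drift)
```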
Reconciliation provides two practical benefits for MLOps: drift detection and consistent rollbacks. Drift happens when someone hot-fixes production with kubectl, or a controller mutates resources. A GitOps controller shows which objects are out of sync and can automatically revert unauthorized changes. For inference services, drift is especially dangerous because a small config edit (timeout, max batch size, auth toggle) can change performance and cost.
Rollbacks become straightforward and auditable: revert a commit, or reset to a previous tag, and reconciliation brings the cluster back. This aligns with the chapter checkpoint: an audited rollback via Git history. Instead of “what did we run last week?”, you can point to a commit SHA, a pull request, and the controller’s sync status.
To make GitOps work well, standardize labels/annotations and health checks. Ensure your manifests include readiness probes and sane resource requests so the controller can assess “Healthy” correctly. When controllers show degraded health, treat it like a failing build: investigate before promoting further.
Environment promotion is the discipline of moving the same artifact through dev → stage → prod with increasing confidence. For inference, the “artifact” typically includes the container image (serving code plus runtime dependencies) and a referenced model version (baked in or fetched). A robust promotion strategy avoids rebuilding different images per environment; instead, promote immutable images by tag or digest and adjust only environment-specific config.
A common workflow is: CI builds an image on merge to main, pushes it to a registry, and writes the image digest into the dev configuration in Git. After validation in dev, you promote by updating the stage config to the same digest (often via a pull request). Finally, production promotion requires explicit approval and change control—reviewers confirm test results, rollout plan, and impact. Tagging helps: you can tag a Git commit (e.g., serving-v1.8.0) and reference that tag in your GitOps controller, or use release branches per environment.
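Promotion-by-digest can be as simple as a per-environment config file that CI and reviewers edit. The layout and field names below are one possible convention, not a standard; the digest value is a truncated placeholder.

```yaml
# environments/dev/values.yaml — illustrative per-environment config.
# CI writes the new digest here after a build; promotion to stage/prod
# is a pull request copying the same digest into that environment file.
image:
  repository: registry.example.com/model-server
  digest: "sha256:9f2a..."   # placeholder; immutable, same bytes per env
env:
  LOG_LEVEL: debug           # environment-specific config may differ
```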
Change control does not have to mean slow. It should mean traceable: who approved, what changed, what tests ran, and what the rollback plan is. For production, include a short “operational diff” in the PR description: expected QPS, resource changes, new endpoints, and whether a canary is enabled.
Pin images by digest rather than a mutable tag such as latest. Digests guarantee that dev/stage/prod run the same bytes. If you use progressive delivery (canary/blue-green), integrate it with promotion gates: stage proves functional correctness, prod canary proves performance and error budgets. The promotion commit should also specify rollout parameters (maxUnavailable, canary weight, analysis intervals) so operations are code-reviewed, not improvised.
Managing secrets across environments is where many otherwise-solid Kubernetes delivery systems fail. The rule is simple: do not store plaintext secrets in Git. Yet GitOps requires the cluster to retrieve configuration from Git, which creates a tension you must resolve with an encryption or indirection strategy.
One pattern is Sealed Secrets (Bitnami) or similar tools that encrypt a Secret into a SealedSecret custom resource that is safe to commit. Only the controller running in the target cluster can decrypt it. This supports per-cluster encryption keys, meaning a dev sealed secret cannot be decrypted in prod and vice versa. It fits well when you have a small number of Kubernetes-native secrets (API keys, basic auth) and want Git-based workflows.
Another pattern is external secret stores such as HashiCorp Vault, AWS Secrets Manager, Azure Key Vault, or GCP Secret Manager, combined with External Secrets Operator. In this approach, Git stores references (secret names/paths), and the operator materializes Kubernetes Secrets at runtime. This is often preferable for enterprises because rotation, auditing, and access control are handled centrally.
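With External Secrets Operator, the committed resource holds only references. The store name, remote path, and key below are assumptions for illustration; the operator resolves them into a real Kubernetes Secret at runtime.

```yaml
# Illustrative ExternalSecret: Git stores the reference, the operator
# materializes the Secret, and rotation happens in the external store.
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: model-server-creds
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: vault-backend       # SecretStore managed by the platform team
    kind: SecretStore
  target:
    name: model-server-creds  # resulting Kubernetes Secret name
  data:
    - secretKey: API_KEY
      remoteRef:
        key: ml/prod/model-server
        property: api_key
```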
Regardless of approach, design your charts/overlays so secrets are referenced, not embedded: mount them as volumes or envFrom, and keep secret names stable per environment. Also be careful with logs—avoid printing configuration that might include tokens, and review your liveness/readiness endpoints to ensure they don’t expose sensitive diagnostics.
Most MLOps platforms are multi-tenant: multiple teams deploy models into shared clusters. Without guardrails, one noisy inference service can starve others, or a misconfigured workload can gain unintended access. Delivery tooling must therefore include tenancy primitives as part of “definition of done,” not as an afterthought.
Namespaces are the first boundary. Use a namespace per team or per application, and consider separate namespaces per environment (e.g., team-a-dev, team-a-prod) to simplify RBAC and reduce accidental cross-environment edits. Attach ResourceQuotas and LimitRanges so workloads must declare requests/limits and cannot exceed budget. For inference, quotas also prevent runaway autoscaling from consuming the cluster during a traffic spike.
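A namespace budget is a short manifest. The numbers below are illustrative; the effect is that every pod in the namespace must declare requests/limits and the team's total footprint is capped, which also bounds runaway autoscaling.

```yaml
# Illustrative ResourceQuota for a per-team, per-environment namespace.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-prod-quota
  namespace: team-a-prod
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 40Gi
    limits.cpu: "40"
    limits.memory: 80Gi
    pods: "50"
```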
Policies enforce security and reliability standards. Pod Security (or equivalent admission policies) should require non-root containers and disallow privileged escalation. NetworkPolicies should restrict egress and ingress so model pods can only talk to approved dependencies (feature store, model registry, metrics). If you use service meshes, standardize mTLS and authorization policies for inference endpoints.
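An egress/ingress restriction for model pods might look like the sketch below. The namespace labels are placeholders for your own conventions, and in practice you will also need an egress rule permitting DNS (port 53), which is omitted here for brevity.

```yaml
# Illustrative NetworkPolicy: model pods accept traffic only from the
# ingress namespace and may only call the feature store namespace.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: model-server-policy
  namespace: team-a-prod
spec:
  podSelector:
    matchLabels:
      app: model-server
  policyTypes: ["Ingress", "Egress"]
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: ingress-nginx
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: feature-store
```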
Multi-tenancy also affects delivery structure: keep platform-owned components (ingress controllers, cert managers, GitOps controllers) in dedicated namespaces with strict RBAC, and keep tenant applications separate. When performing the chapter checkpoint—an audited rollback—this separation ensures you can revert one model service without disturbing others, while still preserving a clear audit trail of who changed what and when.
1. What core problem does this chapter’s “platform delivery” approach aim to solve compared to manually applying YAML with kubectl?
2. In the chapter’s practical design, what is the intended role of a GitOps controller like Argo CD or Flux?
3. Why does the chapter emphasize dev → stage → prod environment overlays or promotions rather than one shared configuration for all environments?
4. Which combination best matches the chapter’s recommended delivery stack for generating and applying Kubernetes resources?
5. What makes a rollback “audited” in the chapter’s GitOps-based approach?
Model serving on Kubernetes is rarely “done” at deployment time. Production readiness means you can answer three questions quickly and confidently: Is the service healthy right now? If it’s unhealthy, why? And what change will fix it without making things worse? This chapter turns observability into a repeatable workflow you can execute under pressure—exactly the mindset expected in certification-style scenarios and real incident response.
Observability for inference is not about collecting every possible signal. It is about selecting a small set of signals that let you diagnose user-visible impact (latency and errors), capacity risk (saturation), and correctness risk (data/model drift) without drowning in noise. You will instrument your model API with logs, metrics, and traces; define SLOs and dashboards that align to user experience; and practice live debugging using kubectl alongside telemetry.
As you read, keep a production rubric in mind: you should be able to deploy a model service, scale it safely, validate it with concrete measurements, and show that it remains reliable during changes (rollouts, autoscaling, node disruption). The goal is not “observability tooling,” but operational competence: evidence-driven decisions, safe debugging tactics, and a checklist you can apply in an exam-style capstone or a real on-call rotation.
Throughout this chapter, the engineering judgment theme is “minimum sufficient visibility.” Over-instrumentation wastes resources and distracts responders; under-instrumentation turns every incident into guesswork. Your target is a thin, high-value layer of observability that is consistent across services and easy to interpret under time pressure.
Practice note for Instrument model APIs with metrics, logs, and traces: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Define SLOs and dashboards for latency, errors, and saturation: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Debug live incidents using kubectl and observability signals: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Run an exam-style capstone: deploy, scale, and validate a model service: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Final checkpoint: certification-aligned checklist and practice tasks: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
For model APIs, the “three pillars” map cleanly to how inference fails. Metrics tell you how much and how fast (rates and latencies). Logs tell you what happened (inputs rejected, timeouts, upstream failures). Traces tell you where time went (queueing, feature fetch, model runtime, postprocessing). The practical aim is correlation: a spike in p95 latency should have a trace pattern and a log signature you can confirm quickly.
Start with a minimal metrics set exposed on /metrics in Prometheus format: request rate (http_requests_total), error rate segmented by status code, and latency histograms (http_request_duration_seconds_bucket). Histograms matter because averages hide tail pain; inference users feel p95/p99. Add saturation signals that directly affect serving: in-flight requests, queue depth, and model runtime duration if it differs from end-to-end latency.
For logs, prefer structured JSON with stable keys: request_id, model_version, status, latency_ms, and error_type. Common mistake: logging raw payloads (privacy risk, cost explosion). Instead, log schema metadata (shape, feature presence), and sample when needed. Make sure the Kubernetes log pipeline (stdout/stderr) is used consistently; writing to local files often breaks collection and fills disks.
For traces, propagate a request ID (or W3C traceparent) from ingress to the app, and annotate spans around major steps: input validation, feature retrieval, model inference, and response serialization. If you only trace one thing, trace the inference path; it is the critical section where tail latency appears. In practice, tracing is your fastest tool for identifying whether the bottleneck is CPU throttling, downstream latency, or application-level contention.
Outcome: you can explain a latency spike with evidence—metrics show p95 up, traces show increased model runtime, logs show CPU throttling or timeouts—rather than guessing and redeploying blindly.
Kubernetes adds a second layer of observability: cluster and workload signals that explain why your application telemetry changed. Treat Kubernetes signals as the “ground truth” for scheduling, resource pressure, and lifecycle transitions. When an incident hits, a reliable sequence is: check Service/Ingress reachability, inspect Pod status and events, then verify resource usage and probe behavior.
Events are your first stop for “why is it restarting or Pending?” Use kubectl describe pod to read events like image pull errors, failed mounts (ConfigMaps/Secrets), probe failures, and OOM kills. A common mistake is looking only at container logs and missing that the Pod never started due to a missing Secret or an invalid service account permission.
Resource metrics connect directly to saturation. With Metrics Server installed, kubectl top pod and kubectl top node tell you whether you are CPU-bound, memory-bound, or hitting noisy neighbor effects. For inference, CPU throttling is particularly deceptive: the Pod is “Running,” but latency climbs because the container is constrained by limits. If you see high CPU usage plus increasing latency, verify requests/limits and consider HPA scaling on CPU or custom request-rate metrics.
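The CPU-based scaling path mentioned above depends on declared requests, which is one reason right-sizing matters. A minimal HPA sketch, with illustrative names and thresholds:

```yaml
# Illustrative HPA on CPU utilization; utilization is computed against
# the container's CPU *requests*, so it only works if requests are set.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: model-server
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-server
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60   # scale out before throttling sets in
```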
Audit trails and API server logs (where available) answer “what changed?” This matters in GitOps and Helm workflows: an unintended rollout can look like an outage. In practical operations, you should be able to map a time window of errors to a Deployment revision, a ConfigMap change, or an HPA scaling event. Also watch Node-level signals: evictions, disk pressure, and network issues can mimic application bugs.
Outcome: you can separate application defects from platform causes—misconfigured manifests, missing RBAC, scheduling failures, resource starvation—and choose the correct fix (manifest change, resource adjustment, or rollback) with minimal disruption.
Alerting is where many teams fail: either they page on every small blip, or they miss real user harm. Tie alerts to SLOs and burn rates so they measure user impact over time, not transient spikes. For model inference, the most defensible SLOs center on availability (successful responses) and latency (p95 under a threshold). Define them per endpoint if you have multiple behaviors (e.g., health checks vs inference).
A burn rate expresses how fast you are consuming your error budget. For example, if your SLO allows 0.1% errors over 30 days, and you suddenly hit 2% errors, you are burning budget rapidly and should page. Use multi-window alerting: a fast window catches acute incidents (e.g., 5–15 minutes), while a slow window catches chronic degradation (e.g., 1–6 hours). This reduces false positives without delaying real response.
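A fast-burn alert with two windows can be written as a Prometheus rule. This is a sketch assuming the `http_requests_total` metric from earlier and a 99.9% availability SLO; the 14.4 multiplier is the conventional "burn the monthly budget in ~2 days" threshold, and both windows must breach before paging.

```yaml
# Illustrative multi-window burn-rate alert for a 0.1% error budget.
groups:
  - name: slo-burn
    rules:
      - alert: HighErrorBudgetBurn
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m]))
          ) > (14.4 * 0.001)
          and
          (
            sum(rate(http_requests_total{status=~"5.."}[1h]))
            / sum(rate(http_requests_total[1h]))
          ) > (14.4 * 0.001)
        for: 2m
        labels:
          severity: page
        annotations:
          summary: "Fast error-budget burn (5m and 1h windows both elevated)"
```

The short window makes the alert fire quickly on acute incidents; requiring the long window too keeps a single transient spike from paging anyone.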
Noise control is a production skill and an exam differentiator. Avoid paging on symptoms that responders can’t act on (e.g., a single Pod restart) unless it correlates with user-facing impact. Prefer alerts on: elevated 5xx rate, sustained p95/p99 latency regression, saturation thresholds (CPU throttling, queue depth), and failed rollouts (Deployment not progressing). Route warnings to dashboards and tickets; reserve paging for imminent SLO breach.
Dashboards should answer: what’s the impact, what changed, and where is the bottleneck? A practical dashboard layout is “RED” (Rate, Errors, Duration) plus saturation panels (CPU, memory, in-flight). For inference, add a panel for request size and model version distribution so you can spot a bad rollout or a traffic shift. Common mistake: building dashboards that require expert interpretation; prefer a few panels with clear thresholds and annotations for deploy events.
Outcome: alerts become actionable signals tied to SLO risk, and your dashboards become a decision surface for scaling, rollback, or mitigation—rather than a wall of graphs.
Production debugging is about maximizing insight while minimizing blast radius. Your default stance should be “observe first, change second.” Start with non-invasive checks: kubectl get for status, kubectl describe for events, and logs for error signatures. Then validate traffic flow: Service endpoints, readiness gates, and Ingress routing. Only after you understand the failure mode should you modify configuration or restart components.
Ephemeral containers are the safest way to investigate a running Pod when the application image lacks debugging tools. With kubectl debug -it podname --image=busybox (or a toolbox image), you can inspect DNS, curl local endpoints, and check filesystem mounts without altering the primary container image. This is especially useful when the container crashes quickly or is distroless. A common mistake is rebuilding and redeploying just to add curl—that increases time to mitigate and introduces new variables.
Use “safe prod tactics” deliberately: scale up replicas before experimenting, shift traffic with canary or weighted routing, and keep rollbacks ready (Deployment revision history or GitOps revert). When a model service is failing, avoid deleting Pods as a first step; it can mask root causes and worsen load on remaining replicas. If you suspect a bad rollout, check kubectl rollout status and consider pausing the rollout or rolling back while you investigate traces and logs.
When debugging latency, verify resource constraints: CPU limits causing throttling, memory pressure causing GC or OOM kills, and node contention. Correlate with HPA behavior: if HPA is scaling but latency remains high, you may be bound by a downstream dependency (feature store) or by cold-start/model load time. Confirm readiness probes: an overly eager readiness can send traffic before the model is loaded; an overly strict probe can flap and remove healthy Pods.
Outcome: you can diagnose issues live using kubectl and observability signals, applying reversible, low-risk actions that restore service without guessing or causing collateral outages.
Generic web metrics are necessary but not sufficient for ML. Model services have unique failure modes: tail latency from model load and batching, correctness risk from input drift, and version skew during canaries. Your monitoring should include both system health and model-specific indicators that help you decide whether to scale, rollback, or investigate data quality.
Latency percentiles are your core serving metric. Track p50/p95/p99 and separate “model runtime” from “total request time” if possible. This distinction changes your response: if model runtime increases, investigate CPU/GPU utilization, quantization changes, or library regressions; if only end-to-end increases, look at network, queueing, or downstream calls. Also monitor cold-start latency after rollouts—model initialization time often creates short-lived p99 spikes that are invisible in averages.
For errors, segment by category: input validation (4xx), upstream/downstream timeouts, model execution failures, and resource errors (OOM). This helps you avoid the common mistake of treating all errors as identical. A surge in 422/400 suggests a client/schema change; a surge in timeouts suggests saturation or dependency slowdown; OOM suggests incorrect memory limits or batch size.
Drift monitoring is often misunderstood as “deploy a full data science pipeline in production.” In Kubernetes terms, start with conceptual hooks: log feature summary statistics (min/max, missingness rate) in a privacy-safe way, or emit counters for schema mismatches and out-of-range values. You can run lightweight background jobs (CronJobs) to compare recent feature distributions against a baseline and alert when thresholds are exceeded. Keep it separate from request path to avoid adding latency.
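A lightweight, off-request-path drift check can be scheduled as a CronJob. The image, arguments, and baseline path below are placeholders for whatever comparison script your team maintains; the manifest just shows the pattern of keeping the check out of the serving path with bounded resources.

```yaml
# Illustrative hourly drift check comparing recent feature statistics
# against a stored baseline; alerts are emitted by the job itself.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: feature-drift-check
spec:
  schedule: "0 * * * *"          # hourly
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: drift-check
              image: registry.example.com/drift-check:1.0.0
              args: ["--baseline=s3://ml-baselines/features.json",
                     "--window=1h", "--alert-threshold=0.2"]
              resources:
                requests: {cpu: "100m", memory: "256Mi"}
                limits:   {cpu: "500m", memory: "512Mi"}
```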
Outcome: you can validate not only that the service is up, but that it is performing as expected for users and that the data feeding the model remains within known operating conditions.
Certification-style tasks reward structured execution under time constraints. Treat your workflow like an incident playbook: clarify the objective, apply a consistent command sequence, and produce verifiable evidence (working endpoint, stable metrics, successful rollout). Timeboxing is crucial: spend a fixed amount of time on diagnosis before choosing a safe action (scale, rollback, adjust resources), then re-measure.
In an exam-style capstone for this course, your rubric is typically: deploy a model service with correct manifests, expose it (Service/Ingress), configure runtime settings safely (ConfigMaps/Secrets/ServiceAccount), scale it (HPA/VPA patterns), and validate reliability (probes, limits, disruption budgets, rollout strategy). Observability weaves through each step: you should prove success by checking readiness, tail latency, and error rates rather than assuming “kubectl apply succeeded.”
Common pitfalls map directly to points lost: missing readiness probes (traffic hits uninitialized model), incorrect resource requests/limits (HPA doesn’t scale, or throttling inflates latency), Services selecting the wrong labels (no endpoints), Secrets mounted incorrectly (CrashLoopBackOff), and dashboards/alerts that watch the wrong signals (CPU looks fine but p99 is broken). Another frequent issue is changing too many variables at once; in a timeboxed environment, make one change, observe, and then proceed.
Practice tasks should mimic production: deploy, generate load, observe p95/p99, trigger a rollout, and confirm the system stabilizes. Validate with kubectl rollout status, endpoint checks, and a quick look at resource usage. If something degrades, use the debug ladder: events → logs → metrics → traces → safe action. Document the evidence you used; even outside exams, that habit is what makes your operations reproducible.
Outcome: you can operate like production—fast, methodical, and evidence-driven—while mapping each action to the reliability and scaling expectations of Kubernetes-focused MLOps certifications.
1. Which set of signals best matches the chapter’s definition of “minimum sufficient visibility” for user impact and capacity risk?
2. What is the primary purpose of defining SLOs and dashboards in this chapter’s workflow?
3. During a live incident, what approach does the chapter recommend for debugging?
4. Which sequence best reflects the repeatable operational workflow taught in the chapter?
5. What outcome best demonstrates “production readiness” in the chapter’s rubric?