Sysadmin to GPU Cluster Operator: Kubernetes Inference Deploy

Career Transitions Into AI — Intermediate

Go from IT ops to running fast, reliable GPU inference on Kubernetes.

Intermediate · kubernetes · gpu · inference · llm-serving

Become the person who can run AI inference in production

This course is a short technical book for working sysadmins and Kubernetes operators who want a realistic path into AI infrastructure—without pretending you need to become a data scientist first. You’ll learn how GPU inference changes the operational game (scheduling, reliability, performance, and cost), then build up a practical, production-minded workflow for deploying and tuning model serving on Kubernetes.

The goal is simple: when a team says “we need to serve an LLM reliably on GPUs,” you’ll know how to make the cluster GPU-ready, deploy a serving stack, scale it safely, measure what matters, and keep it stable during upgrades and incidents.

What you’ll build, chapter by chapter

You’ll start by reframing your existing skills—Linux troubleshooting, networking instincts, change control, and observability—into the responsibilities of a GPU cluster operator. Then you’ll progress through GPU enablement, deployment patterns, scheduling strategy, performance tuning, and production operations.

  • GPU-ready Kubernetes: drivers, container runtime integration, and the device plugin that lets the scheduler actually place GPU workloads.
  • Inference deployment: ship a model endpoint with health checks, config/secrets, and safe rollout patterns.
  • GPU-aware scheduling: node pools, taints/tolerations, affinity, quotas, and strategies for isolation or sharing.
  • Performance tuning: benchmark correctly, tune batching and concurrency, and connect changes to real latency/throughput wins.
  • Production ops: SLOs, metrics/traces/logs, GPU monitoring, security hardening, and incident playbooks.

Who this is for

This course is designed for sysadmins, SREs, platform engineers, and Kubernetes operators who are comfortable with Linux and basic Kubernetes primitives, and want to step into AI infrastructure roles. If you’ve ever owned clusters, managed on-call, or debugged “why is this pod stuck,” you’re in the right place.

You do not need prior ML experience. When we touch model-level concepts (like quantization), it’s strictly from an operator’s perspective: what it is, why it affects performance, and what you need to watch for in production.

How you’ll learn (book-style, operator-first)

Each chapter reads like a focused section of a technical handbook: clear mental models, checklists, deployment patterns, and troubleshooting paths you can reuse on the job. You’ll repeatedly connect three viewpoints that matter in inference operations:

  • User experience: p95 latency, errors, timeouts, and correctness constraints.
  • Cluster reality: scheduling, resource fragmentation, noisy neighbors, and node drift.
  • GPU economics: utilization, right-sizing, and scaling decisions that affect burn rate.

Get started

If you’re ready to turn your operations background into AI infrastructure credibility, start here and work straight through the six chapters—each one builds on the last. When you’re ready, you can register for free to track progress, or browse all courses to pair this with adjacent platform and MLOps topics.

By the end, you’ll have a repeatable blueprint for standing up Kubernetes GPU inference that is measurable, secure, and operable—exactly the skill set hiring teams look for in GPU cluster operators and AI platform engineers.

What You Will Learn

  • Translate sysadmin skills into GPU cluster operator responsibilities and tooling
  • Install and validate NVIDIA GPU support on Kubernetes (device plugin, runtime, drivers)
  • Deploy an inference stack (runtime + API gateway) with secure configuration and secrets
  • Design GPU-aware scheduling: requests/limits, node labeling, taints/tolerations, affinity
  • Measure latency/throughput and tune performance (batching, concurrency, quantization basics)
  • Implement SLO-based observability with logs, metrics, and traces for inference workloads
  • Harden multi-tenant GPU clusters with RBAC, network policies, and image provenance
  • Run day-2 operations: upgrades, capacity planning, autoscaling, and incident response

Requirements

  • Comfort with Linux CLI, systemd, networking basics, and troubleshooting
  • Basic Kubernetes knowledge (pods, deployments, services, namespaces); kubectl experience
  • Access to a Kubernetes cluster (local or cloud) and at least one NVIDIA GPU for hands-on labs
  • Familiarity with containers and images (Docker/OCI) is helpful

Chapter 1: The GPU Inference Operator Mindset (Sysadmin → AI Ops)

  • Map sysadmin competencies to GPU inference operations responsibilities
  • Define the serving problem: latency, throughput, cost, and safety constraints
  • Create a reference architecture for Kubernetes-based inference
  • Set up a lab plan: cluster access, GPU nodes, and toolchain checklist
  • Establish an operational baseline: GitOps, environments, and change control

Chapter 2: Make Kubernetes GPU-Ready (Drivers, Runtime, Plugins)

  • Verify GPU hardware/driver health and baseline performance
  • Enable container GPU access with the correct runtime configuration
  • Install and validate the NVIDIA device plugin on Kubernetes
  • Confirm GPU scheduling works end-to-end with test workloads
  • Document node standards and drift checks for day-2 operations

Chapter 3: Deploy an Inference Service (From Image to Endpoint)

  • Containerize or select a serving runtime image and model artifact strategy
  • Deploy a GPU-backed inference Deployment with health checks
  • Expose the service safely through an Ingress/API gateway path
  • Manage configuration and secrets for model endpoints
  • Add canary and rollback mechanics for safer releases

Chapter 4: GPU Scheduling & Multi-Tenancy (Fair, Reliable, Predictable)

  • Implement GPU-aware resource requests/limits and quality-of-service
  • Control placement with taints, tolerations, affinity, and topology constraints
  • Introduce quotas and priority to prevent noisy neighbor incidents
  • Enable safe sharing strategies (MIG, time-slicing, or isolation)
  • Create runbooks for stuck scheduling and GPU resource leaks

Chapter 5: Performance Tuning for Inference (Latency, Throughput, Cost)

  • Establish a repeatable benchmark methodology for inference services
  • Tune runtime parameters: batching, concurrency, and caching
  • Apply model-level optimizations: quantization basics and compilation awareness
  • Right-size resources and reduce bottlenecks (CPU, memory, network, storage)
  • Create a cost/perf scorecard to guide change decisions

Chapter 6: Operate in Production (Observability, Security, Incidents)

  • Implement SLOs and dashboards that reflect user experience
  • Instrument logs/metrics/traces and set actionable alerts
  • Harden the cluster and the serving supply chain
  • Run incident response for latency spikes, OOMs, and GPU faults
  • Plan upgrades and lifecycle management without downtime surprises

Sofia Chen

Platform Engineer, Kubernetes & GPU Systems

Sofia Chen builds Kubernetes platforms for ML teams, focusing on GPU scheduling, inference reliability, and cost control. She has operated mixed-node clusters in production and designed SLO-driven observability for model serving stacks.

Chapter 1: The GPU Inference Operator Mindset (Sysadmin → AI Ops)

Becoming a GPU inference operator is less about “learning AI” and more about upgrading your operational instincts for a new class of workloads. The sysadmin mindset—control blast radius, standardize builds, observe everything, automate repeatability—translates directly. What changes is the shape of failure and the economics of resources: a single misconfigured runtime or scheduling rule can idle a multi-thousand-dollar GPU, while a seemingly small latency regression can break a product SLO.

This chapter establishes the mental model you’ll use for the rest of the course. You will map familiar sysadmin competencies into GPU inference operations responsibilities; define the serving problem in terms of latency, throughput, cost, and safety; sketch a reference architecture for Kubernetes-based inference; and set up a lab plan and operational baseline that matches real production constraints (GitOps, environments, and change control). The goal is practical: by the end of this chapter, you should know what “good” looks like for an inference platform and what to build first so that later optimizations are measurable, reversible, and safe.

One common mistake in career transitions is copying training-oriented playbooks into serving. Training optimizes for maximizing GPU occupancy over long, batch-heavy jobs. Inference optimizes for predictable response times, fast rollouts, and safe multi-tenancy. Your job becomes a steady trade: keep latency low, keep throughput high, keep cost under control, and keep risk contained—while the model, traffic patterns, and dependencies change underneath you.

Practice note: for each milestone above (mapping sysadmin competencies, defining the serving problem, creating a reference architecture, setting up a lab plan, and establishing an operational baseline), document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Sections in this chapter
Section 1.1: Career transition map—roles, tasks, and vocabulary

As a sysadmin, you already operate complex systems: networks, OS images, IAM, storage, monitoring, and incident response. A GPU inference operator keeps those fundamentals but applies them to an inference platform where “the application” is a model server plus a chain of dependencies. The fastest way to transition is to map old tasks to new responsibilities and learn the vocabulary that engineers and ML teams will use when they ask for help.

  • Golden images → GPU node readiness: instead of a base OS image, you manage drivers, container runtime integration, kernel compatibility, and node feature discovery. “Is the node healthy?” now includes “Does CUDA work in containers?”
  • Capacity planning → GPU scheduling: instead of CPU/RAM, you plan for GPU count, GPU memory, MIG partitions (if used), PCIe topology, and the impact of noisy neighbors. Your knobs are node labels, taints/tolerations, and resource requests/limits.
  • Service reliability → SLO-driven serving: you translate business requirements into p95 latency, error rate, and saturation signals, then build dashboards and alerts that catch regressions before users do.
  • Change management → model and runtime releases: you treat model versions like binaries: staged rollouts, canaries, rollbacks, and artifact provenance. “It’s just a new model” is never “just.”

Vocabulary you must speak comfortably includes: inference runtime (Triton, vLLM, TensorRT-LLM, TorchServe), tokenization, batching (dynamic vs static), concurrency, KV cache, quantization (fp16/int8/int4), router or gateway (rate limiting, auth, routing), and SLO vs SLA. You don’t need to be an ML researcher, but you must be fluent enough to turn requests like “we need faster responses” into concrete operational work: profile latency, adjust batching, right-size GPU requests, or add replicas safely.

The practical outcome: you become the person who can say, “Here is how we’ll run this model in Kubernetes, how it will be secured, how it will scale, and how we’ll know it’s healthy”—using tooling and discipline that looks familiar to any strong sysadmin.

Section 1.2: Inference workloads vs training—what changes operationally

Training and inference both use GPUs, but they behave like different species. Training is typically a long-running, throughput-oriented job that can tolerate queueing, warmup, and occasional retries. Inference is user-facing: it must answer within a budget (latency) and it often experiences bursty demand. Operationally, that flips your priorities from “maximize GPU utilization at all times” to “meet latency SLOs while keeping utilization economically sane.”

Inference introduces three realities that surprise sysadmins coming from general web ops. First, startup and warmup matter. Model servers may take minutes to load weights into GPU memory; autoscaling that ignores this will oscillate and drop traffic. Second, GPU memory is the true hard limit. CPU throttling is annoying; GPU out-of-memory usually kills the process or triggers aggressive fallback behavior. Third, tail latency dominates: a small fraction of slow requests can break the user experience, even if average latency looks fine.

  • Scheduling changes: you must request GPUs explicitly (e.g., nvidia.com/gpu: 1) and ensure nodes advertise the resource via the device plugin. You will also use labels/taints to keep GPU nodes dedicated and predictable.
  • Networking changes: gRPC streaming, long-lived HTTP connections, and large responses are common. Timeouts, keepalives, and load balancer settings can become performance bottlenecks.
  • Security changes: inference endpoints are attractive abuse targets (data exfiltration, prompt injection, quota exhaustion). Rate limiting and authentication are not optional “later.”
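
The explicit-request point above can be sketched as a pod-template fragment. The GPU resource name is the standard one advertised by the NVIDIA device plugin; the image and sizing values are illustrative:

```yaml
# Fragment of a Deployment pod template: GPUs must be requested explicitly,
# in whole units, via the extended resource the device plugin advertises.
containers:
  - name: inference-runtime
    image: registry.example.com/serving/runtime:v1   # illustrative image
    resources:
      requests:
        cpu: "4"            # tokenization/preprocessing is CPU-bound
        memory: 16Gi
        nvidia.com/gpu: 1
      limits:
        nvidia.com/gpu: 1   # for extended resources, request must equal limit
```

Note that Kubernetes requires extended-resource requests and limits to match, and there is no fractional GPU request at this layer—sharing is handled separately (MIG, time-slicing), as Chapter 4 covers.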

A common mistake is treating the model server like a stateless microservice. Many runtimes hold large in-memory caches (e.g., KV cache for LLMs) and behave best when requests are routed with awareness of active sessions and capacity. Another mistake is applying aggressive autoscaling without measuring cold-start time and without protecting against sudden scale-down that evicts warmed replicas.

The practical outcome: you will approach inference like operating a latency-critical service with expensive, scarce accelerators—where correct Kubernetes plumbing (runtime + drivers + device plugin), safe scaling, and disciplined rollouts matter as much as the model.

Section 1.3: Key KPIs—p50/p95 latency, QPS, tokens/sec, GPU utilization

Inference operations lives and dies on measurement. You will be asked, “Is it fast?” “Can it handle traffic?” and “How much does it cost?” Answering with anecdotes creates firefighting; answering with KPIs creates engineering. The core metrics set is small, but you must interpret it correctly and tie it to action.

  • Latency (p50/p95/p99): p50 tells you typical performance; p95/p99 tells you user pain and system contention. Tail latency often spikes due to queueing, CPU bottlenecks (tokenization), or GPU saturation.
  • Throughput: for classic inference, track QPS (queries per second). For LLMs, track tokens/sec (generated tokens per second) and separate prefill vs decode phases when possible.
  • GPU utilization: high utilization is good only if latency remains within SLO. Aim for “efficient” rather than “maxed out.” Also track GPU memory used and SM occupancy if available.
  • Error and saturation signals: HTTP/gRPC error rate, timeouts, queue depth, request rejections, OOM kills, and throttling events.

Engineering judgment is choosing which knob to turn when a metric degrades. If p95 latency worsens while GPU utilization is low, the bottleneck is likely outside the GPU (CPU preprocessing, network, serialization, or an overloaded gateway). If utilization is near 100% and p95 rises sharply, you likely need to reduce per-request cost (quantization, smaller model, faster runtime) or increase capacity (more replicas, more GPUs), possibly with smarter batching and concurrency limits.

Common mistakes include reporting only averages (which hide tail latency), mixing client-side and server-side latency without clarity, and ignoring request mix. A single dashboard number like “QPS” is meaningless if half the requests are short prompts and the other half are long generations. Establish a habit: publish a minimal “serving scorecard” per deployment that includes latency percentiles, throughput, GPU utilization, and an SLO pass/fail indicator.
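
The scorecard’s SLO pass/fail can feed alerting directly. A hedged sketch of a Prometheus alerting rule, assuming the runtime or gateway exports a request-duration histogram (the metric name here is illustrative—use whatever your stack actually emits):

```yaml
groups:
  - name: inference-slo
    rules:
      - alert: InferenceP95LatencyHigh
        # p95 over a 5-minute window, computed from histogram buckets.
        # Metric name is illustrative; substitute your runtime's metric.
        expr: |
          histogram_quantile(0.95,
            sum(rate(inference_request_duration_seconds_bucket[5m])) by (le, deployment)
          ) > 0.5
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "p95 latency above the 500ms SLO for {{ $labels.deployment }}"
```

Alerting on the percentile (not the average) is the point: this rule fires on tail-latency regressions that an average-based alert would hide.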

The practical outcome: you will create a baseline early, then use it to validate changes (new driver, new runtime, new model version, new batching) with confidence rather than guesswork.

Section 1.4: Cluster topology patterns—single vs multi-node, mixed instance types

Your Kubernetes topology choices determine how painful (or smooth) operations will be. Inference platforms usually start small—one GPU node and a few services—but production tends to become multi-node with mixed hardware, multiple environments, and strict separation of duties. Design with growth in mind without overbuilding.

Single-node (all-in-one) clusters are great for learning and early prototypes: one control plane node with a GPU, plus a gateway and runtime. The failure mode is simple, but you can’t test realistic scheduling, rolling updates, or node replacement. Multi-node clusters introduce the real problems: bin packing GPUs, isolating noisy neighbors, and keeping system components away from accelerators.

  • Dedicated GPU node pools: label nodes (e.g., node.kubernetes.io/gpu=true) and use taints (e.g., gpu=true:NoSchedule) so only GPU workloads land there. This prevents “random” system pods from consuming CPU/memory needed by inference.
  • Mixed instance types: you may have different GPU models or sizes. Use labels for GPU type (A10, L4, A100) and apply node affinity so each workload lands on compatible hardware.
  • CPU-heavy sidecars and gateways: keep routers, auth, and observability agents on non-GPU nodes when possible. Inference is often GPU-bound, but preprocessing and TLS termination can be CPU-bound.
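
The labeling/tainting strategy above translates into a placement stanza on the workload side. This is a sketch—the label keys, taint, and GPU-type values mirror the examples in this section and should follow whatever naming standard your node pools actually use:

```yaml
# Placement fragment of a GPU workload's pod template (example labels/taints).
spec:
  nodeSelector:
    node.kubernetes.io/gpu: "true"   # only consider labeled GPU nodes
  tolerations:
    - key: "gpu"                     # tolerate the taint that keeps other pods off
      operator: "Equal"
      value: "true"
      effect: "NoSchedule"
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: gpu.type        # example label recording the GPU model
                operator: In
                values: ["l4", "a10"]
```

The taint keeps non-GPU pods off the expensive nodes; the affinity keeps the workload off incompatible GPU models in a mixed fleet.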

Common mistakes: relying on default scheduling (which will place pods wherever it can), forgetting to reserve headroom for DaemonSets on GPU nodes (device plugin, monitoring), and assuming all GPUs are interchangeable. Another frequent error is underestimating network and storage: pulling multi-GB model images repeatedly can saturate registries or node disks. Plan for caching (image pre-pull, local registry mirror, or persistent volumes for model artifacts) early.

The practical outcome: you will be able to describe a reference topology—control plane separation, GPU node pools, labeling/tainting strategy, and upgrade plan—and then implement it consistently across dev, staging, and production.

Section 1.5: Serving stack overview—runtime, router, cache, vector store (optional)

An inference “service” is usually a stack, not a single deployment. Thinking in layers helps you debug faster and secure the right boundary. A practical reference architecture on Kubernetes typically includes: an inference runtime (GPU-bound), an API router/gateway (policy and routing), optional caching, and sometimes retrieval components like a vector store.

  • Runtime: model server (e.g., Triton, vLLM). It owns GPU memory, batching, and concurrency. It should expose health endpoints and metrics and should be configured via ConfigMaps/args, with secrets kept out of images.
  • Router / API gateway: terminates TLS, authenticates requests, rate limits, and routes to the right model version. This is where you enforce safety controls like request size limits and per-tenant quotas.
  • Cache: response caching or embedding caching can reduce cost and improve latency. Even a simple Redis layer can stabilize load during bursts, but it must be sized and monitored like any stateful dependency.
  • Vector store (optional): used for retrieval-augmented generation (RAG). Operationally it adds indexing, backups, and data governance. Treat it as a first-class database dependency with access control and encryption.

Secure configuration is part of the operator mindset. Store credentials (API keys, database passwords, TLS private keys) in Kubernetes Secrets (or an external secrets manager synced into the cluster). Avoid baking secrets into container images or Helm values in plain text. Use least-privilege service accounts and restrict egress where feasible; inference pods often do not need broad internet access in production.
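
As a minimal sketch of that pattern (the Secret name and key are illustrative), credentials live in a Secret and are referenced by the pod template rather than baked into the image:

```yaml
# Keep credentials out of images and Helm values: mount them at deploy time.
apiVersion: v1
kind: Secret
metadata:
  name: model-endpoint-credentials
type: Opaque
stringData:
  HF_TOKEN: "<injected by your secrets manager, never committed to Git>"
---
# In the serving runtime's pod template, reference the Secret:
# containers:
#   - name: runtime
#     envFrom:
#       - secretRef:
#           name: model-endpoint-credentials
```

With an external secrets manager, the Secret object itself is synced into the cluster rather than stored in your Git repository.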

Common mistakes include exposing the runtime directly to the internet (bypassing auth and rate limiting), mixing “admin” endpoints with public endpoints, and skipping request validation. Another is ignoring multi-tenancy: without per-tenant quotas and isolation, one client can saturate GPU concurrency and destroy p95 latency for everyone else.

The practical outcome: you will be able to sketch—and later deploy—a secure inference stack where each component has a clear responsibility, observable health, and controlled configuration, making incidents diagnosable and rollouts safe.

Section 1.6: Lab environment options—kind/k3s vs managed Kubernetes vs bare metal

To learn GPU inference operations, you need a lab where you can break things on purpose: swap drivers, test the NVIDIA device plugin, validate scheduling rules, and deploy a serving stack repeatedly. The “right” lab depends on budget, access, and how closely you need to match production.

kind/k3s is excellent for Kubernetes fundamentals and GitOps workflows, but GPU support varies. kind runs Kubernetes-in-Docker and is typically not ideal for direct GPU passthrough unless you carefully configure the host, runtime, and container toolkit. k3s on a GPU-capable VM or small server can be a good middle ground: lightweight, close to real kubelet behavior, and manageable on a single machine.

Managed Kubernetes (EKS/GKE/AKS) gets you production-grade control plane operations, autoscaling primitives, and well-documented GPU node pools. It is the fastest route to practicing real-world patterns like node labels, taints, PodDisruptionBudgets, and rolling node upgrades. The tradeoff is cost and sometimes reduced visibility into the host OS.

Bare metal (or self-managed VMs) gives maximum control and teaches the most about the GPU stack: kernel/driver compatibility, NVIDIA Container Toolkit, runtime class configuration, and troubleshooting device exposure. It also forces you to practice disciplined change control because “just update the driver” can become an outage if you don’t stage and validate.

  • Toolchain checklist: kubectl, helm or kustomize, a GitOps tool (Argo CD/Flux), container registry access, NVIDIA tooling (nvidia-smi, dcgm-exporter), and a load generator for latency/QPS tests.
  • Validation habit: after any change, verify GPU discovery (nodes report allocatable GPUs), run a small CUDA test pod, and confirm metrics/health endpoints before declaring success.

Establish an operational baseline from day one: GitOps-managed manifests, separate environments (dev/stage/prod or at least dev/prod), and change control that records what changed and why. The practical outcome is repeatability: when you later install drivers, enable the device plugin, deploy an inference runtime, and tune batching/concurrency, you can measure impact and roll back safely.

Chapter milestones
  • Map sysadmin competencies to GPU inference operations responsibilities
  • Define the serving problem: latency, throughput, cost, and safety constraints
  • Create a reference architecture for Kubernetes-based inference
  • Set up a lab plan: cluster access, GPU nodes, and toolchain checklist
  • Establish an operational baseline: GitOps, environments, and change control
Chapter quiz

1. According to the chapter, what is the core shift in becoming a GPU inference operator?

Correct answer: Upgrading sysadmin operational instincts to fit new workload failure modes and resource economics
The chapter emphasizes that the transition is less about learning AI and more about adapting sysadmin instincts to inference-specific failure shapes and costs.

2. Which set of constraints best defines the serving problem in this chapter?

Correct answer: Latency, throughput, cost, and safety
Serving is framed explicitly in terms of meeting latency and throughput goals within cost and safety constraints.

3. Why can a small configuration mistake be especially costly in GPU inference operations?

Correct answer: It can idle a multi-thousand-dollar GPU or cause an SLO-breaking latency regression
The chapter highlights that misconfigured runtimes/scheduling can waste expensive GPU resources and that small latency regressions can break product SLOs.

4. Which approach is identified as a common mistake when transitioning from training-focused work to serving-focused operations?

Correct answer: Copying training-oriented playbooks into serving
The text warns that training playbooks optimize for long batch jobs, while inference requires predictable latency, fast rollouts, and safe multi-tenancy.

5. What is the main purpose of establishing an operational baseline (GitOps, environments, change control) early?

Correct answer: So later optimizations are measurable, reversible, and safe
The chapter’s goal is to define what “good” looks like first, enabling safe and reversible iteration with measurable improvements.

Chapter 2: Make Kubernetes GPU-Ready (Drivers, Runtime, Plugins)

As a sysadmin, you’re used to building “known-good” server baselines: consistent BIOS settings, predictable kernel versions, and repeatable configuration management. GPU enablement is the same craft, but with more moving parts and tighter compatibility constraints. A Kubernetes node that merely has a GPU isn’t automatically usable for inference. The GPU must be healthy, the driver must match the CUDA expectations of your workloads, the container runtime must pass through device files correctly, and Kubernetes must advertise those resources so the scheduler can place pods.

This chapter turns GPU support into an operator workflow: verify hardware health and baseline performance; enable container GPU access with the correct runtime configuration; install and validate the NVIDIA device plugin; confirm end-to-end scheduling with test workloads; then document node standards and drift checks so day-2 operations stay stable. The goal is not only “it works once,” but “it stays working after upgrades.”

Think of GPU readiness as a chain. If any link breaks, symptoms often look similar—pods stuck Pending, containers failing at runtime, or inference latency swinging wildly. Your job is to isolate the failure domain quickly, using a repeatable validation playbook and a clear node standard. By the end of this chapter, you’ll have a practical baseline for GPU nodes and the checks that prove they’re ready for inference workloads.
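
One link in that chain—runtime configuration—can be made explicit. On clusters where the NVIDIA runtime is installed in containerd but is not the default, a RuntimeClass selects it per pod. This sketch assumes the common `nvidia` handler name configured by the NVIDIA Container Toolkit or GPU Operator; the CUDA image tag is an example:

```yaml
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia   # must match the runtime name registered in containerd's config
---
# End-to-end check: if this pod schedules and nvidia-smi succeeds, the
# driver, runtime, and device plugin links are all intact.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-runtime-check
spec:
  runtimeClassName: nvidia
  restartPolicy: Never
  containers:
    - name: smi
      image: nvidia/cuda:12.4.1-base-ubuntu22.04   # example tag; pin your own
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
```

A Pending pod points at scheduling (device plugin or taints); a scheduled pod that fails at `nvidia-smi` points at the driver or runtime link.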

Practice note: for each milestone above (verifying GPU/driver health, enabling container GPU access, installing and validating the device plugin, confirming end-to-end scheduling, and documenting node standards and drift checks), document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Sections in this chapter
Section 2.1: GPU fundamentals for operators—MIG, ECC, clocks, power limits

GPU operations start with understanding what can change underneath you. Unlike CPUs, GPU performance is heavily shaped by firmware and driver settings: ECC, clocks, power limits, and partitioning features like MIG (Multi-Instance GPU). These aren’t “nice to know”—they affect whether inference is stable and whether capacity planning is trustworthy.

MIG (available on certain data center GPUs like A100/H100) partitions one physical GPU into multiple isolated GPU instances. For operators, MIG changes the scheduling unit. Instead of advertising one large GPU, the node may advertise multiple smaller GPU resources. That can dramatically improve utilization for small models, but it also increases operational complexity: you must standardize MIG profiles per node pool, document them, and treat profile changes as a disruptive event (pods may need rescheduling, device enumeration changes, and some frameworks cache device topology).

ECC (error-correcting code memory) improves reliability by detecting/correcting memory errors, but it may reduce usable memory and slightly impact performance. For inference, ECC is usually kept enabled in data center environments. From an operator standpoint, ECC influences how you interpret “out of memory” incidents and capacity. Always capture ECC state in your node inventory and drift checks.

  • Clocks and application clocks: GPUs may downclock under thermal or power constraints, producing latency spikes. If your SLO is tail latency, you care about clock stability as much as average throughput.
  • Power limits: Datacenter GPUs can be capped (e.g., via nvidia-smi -pl). A power cap can look like “mysterious” performance regression after a maintenance window.
  • Thermals and throttling: Poor airflow or fan curves can trigger throttling. Treat these as infrastructure incidents, not “model issues.”

Baseline performance is your early-warning system. Before Kubernetes is involved, run a simple health and telemetry pass on the host: nvidia-smi for inventory and ECC, and a lightweight compute test to establish a reference (even a small CUDA sample or a known inference benchmark). Record GPU name, driver version, power limit, MIG mode, and average utilization under a controlled test. This becomes the “known-good” signature you compare against when a node misbehaves later.
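One way to make that "known-good signature" concrete is a small script that writes one record per node. This is a sketch: the `nvidia-smi --query-gpu` fields are standard, but the CSV record layout is an assumption to adapt to your inventory system.

```shell
#!/usr/bin/env sh
# Sketch: capture a "known-good" baseline record for a GPU node.
# The record layout here is an assumption; adapt it to your inventory tooling.

record_baseline() {
  # args: hostname gpu_name driver_version power_limit_w mig_mode
  echo "$1,$2,$3,$4,$5,$(date -u +%Y-%m-%dT%H:%M:%SZ)"
}

# On a real node you would collect the fields with nvidia-smi (commented
# out so the sketch runs anywhere):
#   nvidia-smi --query-gpu=name,driver_version,power.limit,mig.mode.current \
#              --format=csv,noheader
record_baseline "gpu-node-01" "A100-SXM4-40GB" "535.129.03" "400" "Disabled"
```

Store these records alongside the node standard; a later diff against a fresh capture is your first drift signal.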

Section 2.2: Driver strategy—host drivers vs images, upgrade paths, compatibility

Drivers are the most common source of GPU node drift because they sit at the boundary between the kernel and your containerized workloads. You have two strategies: install NVIDIA drivers on the host (typical for Kubernetes) or attempt to bundle everything in images. In practice, Kubernetes GPU nodes almost always rely on host drivers, because kernel modules and device management are host responsibilities.

What matters operationally is compatibility: the host driver must support the CUDA runtime expectations of your container images. CUDA in the image does not need to match the driver exactly, but it must be within the supported compatibility range. If you treat inference images as “just another container,” you’ll eventually hit failures like “CUDA driver version is insufficient for CUDA runtime version,” or subtle performance issues when libraries fall back to less efficient code paths.

Adopt an upgrade path that is boring and repeatable:

  • Pin a driver major/minor for a node pool and avoid ad-hoc upgrades on individual nodes. This keeps scheduling behavior and performance consistent.
  • Stage upgrades: canary one node (or a small pool), validate with your smoke tests, then roll forward.
  • Separate node pools by GPU class (and often by driver branch) when you have mixed hardware generations.

A common mistake is upgrading the host OS kernel (or enabling unattended upgrades) without validating the NVIDIA driver kernel module rebuild path. The result is a node that boots but loses GPU functionality. Your node standard should explicitly state: supported OS/kernel versions, driver version, and whether Secure Boot is enabled (Secure Boot can block unsigned kernel modules unless handled properly).

Finally, write down how you will detect drift. “It worked yesterday” is not evidence today. Add a simple host-level check (driver loaded, GPU visible, expected ECC/MIG/power state) and treat mismatches as noncompliance. This is the sysadmin mindset translated directly into GPU operator practice.
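A drift check can be as simple as comparing live values against pinned expected values. This sketch separates the comparison from the collection so the same check runs in CI with canned input and on a node with live `nvidia-smi` output; the expected values are assumptions standing in for your node standard.

```shell
#!/usr/bin/env sh
# Sketch: host-level drift check against the node standard.

check_value() {  # args: label expected actual
  if [ "$2" = "$3" ]; then
    echo "OK   $1=$3"
  else
    echo "DRIFT $1 expected=$2 actual=$3"
    return 1
  fi
}

# On a real node, feed live values, e.g.:
#   check_value driver "535.129.03" \
#     "$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1)"
check_value driver "535.129.03" "535.129.03"
check_value ecc    "Enabled"    "Enabled"
```

Run it from your config-management tool on a schedule and treat any DRIFT line as noncompliance, not as something to "check later."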

Section 2.3: Container runtime setup—containerd, nvidia-container-toolkit

Kubernetes doesn’t talk to GPUs directly. It schedules pods, then the container runtime (commonly containerd) and NVIDIA tooling make GPU devices available inside containers. If the runtime path is wrong, your pod may start but won’t see /dev/nvidia*, or it will fail on initialization when CUDA can’t find a device.

On modern clusters, the standard approach is containerd + NVIDIA Container Toolkit. The toolkit configures an NVIDIA-aware runtime so containers can request GPU access without privileged hacks. Operationally, your goal is to make GPU access explicit and least-privilege: only pods that request GPU resources should receive GPU devices and libraries.

Practical setup principles:

  • Configure the NVIDIA runtime correctly for containerd (often via a runtime class or runtime configuration) rather than relying on legacy Docker flags.
  • Avoid “works on this node” configurations: ensure the same runtime config is applied across the node pool via automation (cloud-init, Ansible, image baking, or a managed node image pipeline).
  • Know what lives where: the kernel driver and device files are on the host; CUDA libraries may come from the container image; the toolkit handles injection/mounting and environment setup.
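On containerd, the toolkit's configuration step typically produces a fragment like the following in the runtime config. This is a sketch of the common shape; the binary path is the toolkit default, and whether to make nvidia the default runtime (versus using a RuntimeClass) is a site-specific choice.

```toml
# /etc/containerd/config.toml — relevant fragment (restart containerd after changes)
[plugins."io.containerd.grpc.v1.cri".containerd]
  default_runtime_name = "nvidia"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
  runtime_type = "io.containerd.runc.v2"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
  BinaryName = "/usr/bin/nvidia-container-runtime"
```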

Common mistakes include forgetting to restart containerd after changing runtime configuration, mixing incompatible toolkit and driver versions, or assuming that installing CUDA toolkit packages on the host is required (usually it is not for Kubernetes inference; you primarily need the driver).

Enable container GPU access, then validate it outside Kubernetes first if you can: run a simple container that invokes nvidia-smi and a minimal CUDA call. If the container cannot see the GPU on a node, Kubernetes troubleshooting will be noisy and misleading. Getting this layer right narrows your failure domain and makes subsequent device plugin verification straightforward.

Section 2.4: NVIDIA device plugin—installation, verification, common failures

The NVIDIA device plugin is what turns “a GPU exists on this node” into a schedulable Kubernetes resource. Without it, the scheduler can’t see GPUs, and nvidia.com/gpu (or MIG resources) won’t appear in node capacity. Installation is typically done as a DaemonSet in the kube-system namespace or a dedicated GPU-operators namespace, depending on your stack.

Operator workflow for installation and verification:

  • Install the plugin using the vendor manifests or Helm chart that matches your Kubernetes version and runtime setup.
  • Confirm the DaemonSet is running on GPU nodes (and only where intended). If it schedules onto CPU-only nodes, you’ll waste resources and create confusing logs.
  • Verify node resources: check that nodes report Capacity and Allocatable for GPU resources. This is the key “Kubernetes sees it” checkpoint.

Then validate with a test workload that requests a GPU. A pod that requests nvidia.com/gpu: 1 should transition from Pending to Running on a GPU node, and inside the container, nvidia-smi should report the device. This confirms the end-to-end chain: scheduler → device plugin advertisement → runtime device injection.
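A minimal test pod for that chain might look like this; the CUDA image tag is an example, and you should pick one within your driver's supported CUDA range.

```yaml
# Sketch: minimal GPU scheduling smoke-test pod.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda-check
      image: nvidia/cuda:12.2.0-base-ubuntu22.04   # example tag
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1   # must be advertised by the device plugin
```

If this pod runs and its logs show the expected GPU, the scheduler, device plugin, and runtime injection all agree.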

Common failures map to specific layers:

  • Pods stuck Pending: often means the device plugin didn’t advertise resources (plugin not running, wrong node selectors/tolerations, or the node is tainted and the DaemonSet can’t land).
  • Container starts but no GPU visible: runtime/toolkit misconfiguration, or the container image lacks the expected utilities (you might not have nvidia-smi installed in the image even though GPU access works).
  • Plugin logs show NVML errors: typically a driver problem—module not loaded, incompatible driver, or permissions issues accessing device files.

As an operator, treat the device plugin as critical infrastructure. Pin its version, monitor its DaemonSet health, and include its logs in your standard incident triage. If the plugin is unstable, everything above it will appear unstable too.

Section 2.5: Node labeling and inventory—feature discovery, GPU classes

Once GPUs are schedulable, the next step is making scheduling predictable. This is where sysadmin inventory habits become cluster policy. You need a consistent way to answer: “Which nodes have which GPUs, which driver branch, which MIG profile, and which performance class?” Kubernetes labels and taints are the control plane vocabulary for that answer.

Start with a labeling scheme that is stable and meaningful. Examples include GPU vendor/model, memory size, MIG enabled/disabled, and a high-level GPU class label (e.g., gpu-class=inference-small, gpu-class=inference-large) that abstracts away exact SKUs. The abstraction is useful when procurement changes hardware but you want workloads to keep targeting “equivalent” capacity.

Node Feature Discovery (NFD) is a common way to populate hardware labels automatically. It reduces manual errors and helps with day-2 drift detection: if a node loses its GPU driver, labels/resources may change and the node can be quarantined. Consider pairing NFD with a policy that prevents inference workloads from landing on nodes that fail GPU feature checks.

  • Use taints for exclusivity: taint GPU nodes (e.g., gpu=true:NoSchedule) so only workloads with explicit tolerations can land there.
  • Use affinity for intent: prefer certain GPU classes for latency-sensitive inference, and reserve others for batch or testing.
  • Label for operations: labels like driver-branch=535 or mig-profile=1g.10gb help you roll upgrades and debug issues quickly.
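Taken together, the workload side of this contract can be sketched as below. The label and taint keys (gpu-class, driver-branch, gpu=true) follow this chapter's conventions, not built-in Kubernetes names.

```yaml
# Node side is applied with, e.g.:
#   kubectl label node gpu-node-01 gpu-class=inference-large driver-branch=535
#   kubectl taint node gpu-node-01 gpu=true:NoSchedule
apiVersion: v1
kind: Pod
metadata:
  name: llm-inference
spec:
  tolerations:                  # permission to land on tainted GPU nodes
    - key: gpu
      operator: Equal
      value: "true"
      effect: NoSchedule
  affinity:                     # intent: only the large-inference class
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: gpu-class
                operator: In
                values: ["inference-large"]
  containers:
    - name: server
      image: example/inference:stable   # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1
```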

Document these as part of your node standard: which labels must exist, which taints are applied, and what “compliant” looks like. This is not bureaucracy—it prevents surprise scheduling outcomes and makes capacity planning accurate. When an incident occurs, labels also let you scope blast radius: “Only nodes in gpu-class=X with driver-branch=Y are affected.”

Section 2.6: Validation playbook—smoke tests, burn-in, and acceptance criteria

GPU readiness is only real if you can prove it repeatedly. Create a validation playbook that you run during provisioning, after upgrades, and when investigating performance anomalies. The playbook should cover smoke tests (fast, automated), burn-in (longer confidence checks), and explicit acceptance criteria (what “ready” means).

Smoke tests are your first gate:

  • Host checks: nvidia-smi returns expected GPU inventory; driver version matches the node pool standard; ECC/MIG/power limits match policy.
  • Kubernetes checks: device plugin pods are Running on GPU nodes; nodes report GPU resources in allocatable; a GPU-requesting test pod schedules successfully.
  • Container checks: inside the test pod, confirm GPU visibility and run a minimal CUDA/inference call that exercises compute (not just device enumeration).
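The smoke-test gate is easier to maintain if each check takes its input as an argument, so the same functions run in CI with canned output and on nodes with live output. This is a sketch under that assumption; the expected counts come from your node standard.

```shell
#!/usr/bin/env sh
# Sketch: smoke-test helpers with injectable inputs.

check_gpu_count() {  # args: actual expected
  if [ "$1" -eq "$2" ]; then
    echo "gpu-count OK"
  else
    echo "gpu-count FAIL ($1 != $2)"
    return 1
  fi
}

check_allocatable() {  # arg: value of nvidia.com/gpu in node allocatable
  if [ -n "$1" ] && [ "$1" != "0" ]; then
    echo "allocatable OK"
  else
    echo "allocatable FAIL"
    return 1
  fi
}

# Live usage (commented out so the sketch runs anywhere):
#   check_gpu_count "$(nvidia-smi -L | wc -l)" 4
#   check_allocatable "$(kubectl get node gpu-node-01 \
#     -o jsonpath='{.status.allocatable.nvidia\.com/gpu}')"
check_gpu_count 4 4
check_allocatable "4"
```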

Burn-in catches what smoke tests miss: intermittent PCIe errors, thermal throttling under sustained load, or power limit misconfigurations. Run a controlled stress or repeated inference loop for a set duration (e.g., 30–60 minutes) and watch for XID errors, clock drops, or increasing error counts. Keep a simple baseline: expected throughput range and stable temperature/clock behavior. If you can’t hold steady under burn-in, you won’t hold an inference latency SLO in production.

Acceptance criteria should be unambiguous and auditable. For example: “Node reports nvidia.com/gpu capacity, device plugin healthy, test workload completes, no XID errors during burn-in, and performance within ±10% of baseline.” Define what happens on failure: cordon the node, label it noncompliant, and route it to remediation rather than letting workloads randomly fail.

Finally, document node standards and drift checks as day-2 operational guardrails. Store expected driver/toolkit/plugin versions, required labels/taints, and the commands/manifests for your tests. This becomes the GPU equivalent of a golden image checklist—something you can hand to another operator and get the same outcome every time.

Chapter milestones
  • Verify GPU hardware/driver health and baseline performance
  • Enable container GPU access with the correct runtime configuration
  • Install and validate the NVIDIA device plugin on Kubernetes
  • Confirm GPU scheduling works end-to-end with test workloads
  • Document node standards and drift checks for day-2 operations
Chapter quiz

1. A Kubernetes node has a GPU installed, but inference pods can’t use it. Which chain of requirements best describes what must be true for the GPU to be usable by workloads?

Correct answer: GPU health is verified, the driver matches workload CUDA expectations, the container runtime passes through GPU devices, and Kubernetes advertises GPU resources for scheduling
The chapter emphasizes GPU readiness as a chain: hardware health, compatible drivers, correct runtime pass-through, and Kubernetes resource advertisement for scheduling.

2. If GPU readiness is a chain, what is the most practical operator mindset when troubleshooting symptoms like Pending pods or runtime failures?

Correct answer: Isolate which link is broken by following a repeatable validation playbook across hardware, drivers, runtime, and Kubernetes advertisement
The chapter’s goal is quick isolation of the failure domain using a repeatable validation playbook, since different breaks can look similar.

3. Why does the chapter stress verifying GPU health and baseline performance before focusing on Kubernetes components like the device plugin?

Correct answer: Because a healthy, predictable baseline helps distinguish hardware/driver issues from later runtime or scheduling misconfigurations
A known-good baseline is needed to ensure later issues aren’t caused by underlying GPU health or driver problems.

4. What role does the NVIDIA device plugin play in making GPUs schedulable for pods?

Correct answer: It ensures Kubernetes can advertise GPU resources so the scheduler can place pods that request them
The chapter highlights that Kubernetes must advertise GPU resources for scheduling, which is addressed by installing and validating the NVIDIA device plugin.

5. Which statement best captures the chapter’s day-2 operations goal for GPU nodes?

Correct answer: Document node standards and drift checks so GPU readiness stays stable through upgrades
The chapter aims for “it stays working after upgrades,” achieved by documenting node standards and drift checks.

Chapter 3: Deploy an Inference Service (From Image to Endpoint)

This chapter turns a container image and a model artifact into a reachable, production-leaning endpoint on Kubernetes. As a sysadmin transitioning into GPU cluster operations, you already understand repeatability, change control, and failure domains. Inference deployment is the same story—just with stricter latency constraints and a new resource type (GPUs) that amplifies scheduling mistakes. Your goal is not only “it runs,” but “it runs predictably,” with health checks, secure configuration, controlled traffic exposure, and release mechanics that let you roll forward and roll back safely.

We’ll walk a practical workflow: pick a serving runtime, decide how the model is packaged and loaded, assemble the core Kubernetes objects (Deployment/Service/probes and a few guardrails), expose the service through an ingress layer with TLS and proper timeouts, wire configuration and secrets correctly, and then deploy with progressive delivery patterns (canary/blue-green) that fit inference risks. Along the way, watch for common mistakes:

  • Building a custom server when an off-the-shelf runtime would do.
  • Shipping weights inside the image with no update path.
  • Forgetting startup probes, so pods get killed during model load.
  • Setting ingress timeouts too low for long generation.
  • Leaking API keys via environment dumps or logs.

By the end, you should be able to take a GPU-capable cluster and deliver a stable endpoint that can be tested, monitored, and updated without drama—exactly the kind of operational confidence that differentiates a cluster operator from a “kubectl deployer.”

Practice note for Containerize or select a serving runtime image and model artifact strategy: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Deploy a GPU-backed inference Deployment with health checks: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Expose the service safely through an Ingress/API gateway path: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Manage configuration and secrets for model endpoints: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Add canary and rollback mechanics for safer releases: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Sections in this chapter
Section 3.1: Serving runtime options—Triton, vLLM, TGI, custom FastAPI

Your first decision is the serving runtime image. This is the “PID 1” inside the pod: it owns model loading, batching, GPU memory management, request handling, and metrics exposure. Choosing well saves weeks of custom work and prevents performance traps.

NVIDIA Triton is a general-purpose inference server for multiple frameworks (TensorRT, ONNX Runtime, PyTorch, etc.). It shines when you need mature features: dynamic batching, model repository management, multi-model serving, and strong observability. Triton is often the best default for classic inference (vision, speech, tabular) and for teams that want consistent operations across models.

vLLM focuses on large language model serving with high throughput using paged attention. It’s a strong choice when you care about token throughput and want an OpenAI-compatible API option. It typically expects GPU nodes with enough memory and benefits from careful concurrency settings. vLLM is less about “many frameworks” and more about “LLMs done efficiently.”

TGI (Text Generation Inference) is another LLM-serving runtime with production features (token streaming, batching, quantization support depending on setup). It’s a common choice when you want an opinionated, ready-to-run LLM server with decent defaults and predictable behavior.

Custom FastAPI (or similar) is appropriate when your inference logic is truly bespoke: custom pre/post-processing, multi-step pipelines, or nonstandard request/response formats. The tradeoff is that you become responsible for performance engineering (batching, threading, GPU contention), health endpoints, metrics, and safe reload behavior. A common mistake is defaulting to custom FastAPI for “control,” then rebuilding features Triton/vLLM/TGI already provide.

Operational judgement: start with an off-the-shelf runtime unless you can name the missing feature and the cost of implementing it. Also, validate GPU support early: ensure your runtime image matches your CUDA and driver expectations, and that the container runtime on the node can expose nvidia.com/gpu resources to pods.

Section 3.2: Model packaging—weights, config, init containers, persistent volumes

Next, decide how model artifacts (weights, tokenizer files, config) reach the pod. This choice impacts build times, rollout speed, cache behavior, and incident response. Treat model artifacts like large, frequently updated binaries: they need a controlled distribution strategy.

Option A: Bake weights into the image. This is simple—one image tag implies code + model. But the image becomes huge, registry pulls are slow, and rolling back means rolling back the entire image even if only weights changed. It’s acceptable for small models or early prototypes, but it fights the operational need for fast redeploys.

Option B: Mount weights from a Persistent Volume (PV). You build a smaller runtime image and mount a PVC at runtime (for example, /models). This supports faster deployments and can reuse cached weights across pod restarts on the same node (depending on storage). The risk is storage performance: slow network volumes can add seconds to minutes of startup time, and you must manage consistency (what happens if the model updates while pods are running?).

Option C: Init container downloads artifacts. A common production pattern is: an init container pulls a specific model version from object storage into an emptyDir volume shared with the main container. This makes the model version explicit and repeatable (pin by checksum or immutable path) and avoids shipping weights inside the main image. It also lets you add validation (hash check) before the server starts. The tradeoff is startup latency; mitigate it with node-local caching where possible.

  • Keep model versions immutable: reference model:v123 or a content hash path, not “latest.”
  • Separate config from weights: config belongs in Kubernetes objects (ConfigMaps) or the model repo layout, not hardcoded in the image.
  • Plan for cold starts: large models need time to download and load into GPU memory; your health strategy must reflect that.

Common mistake: treating model files like ordinary config and placing them in a ConfigMap. ConfigMaps are not for multi-GB artifacts and will fail or behave poorly. Instead, use PVs, init downloads, or a proper model repository mechanism (Triton model repo, HF cache volumes, etc.).
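The init container pattern (Option C) can be sketched as the pod template fragment below. The downloader image, bucket path, and checksum manifest name are placeholders; the structure (pinned version, shared emptyDir, validation before serving) is the point.

```yaml
# Pod template fragment: init container pulls a pinned model version
# into an emptyDir shared with the serving container.
spec:
  volumes:
    - name: model-store
      emptyDir: {}
  initContainers:
    - name: fetch-model
      image: amazon/aws-cli:2.15.0          # example downloader image
      command: ["sh", "-c"]
      args:
        - |
          aws s3 cp --recursive \
            s3://example-models/my-model/v123/ /models/ &&
          sha256sum -c /models/MANIFEST.sha256   # validate before the server starts
      volumeMounts:
        - name: model-store
          mountPath: /models
  containers:
    - name: server
      image: example/serving-runtime:stable   # placeholder
      volumeMounts:
        - name: model-store
          mountPath: /models
          readOnly: true
```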

Section 3.3: Kubernetes primitives—Deployments, Services, probes, PodDisruptionBudgets

With a runtime and packaging plan, you assemble the core Kubernetes objects. At minimum: a Deployment (or StatefulSet if you have strong identity/storage requirements) and a Service to provide stable discovery. For GPU workloads, the Deployment spec must be GPU-aware: request a GPU via resources.limits: nvidia.com/gpu: 1 (for extended resources like GPUs, requests must equal limits, so setting the limit is sufficient). Pair that with appropriate CPU/memory requests so the scheduler can place the pod correctly; starving CPU can increase latency because tokenization and networking still run on CPU.

Health checks are where inference differs from typical web apps. Model load can be slow, and GPUs can OOM during warmup. Use three probes intentionally:

  • Startup probe to allow long initialization (download + load + warmup) without the kubelet killing the pod prematurely.
  • Readiness probe that flips “ready” only when the server can accept traffic (model loaded, GPU allocated, critical dependencies available).
  • Liveness probe to detect deadlocks or crashes, but tuned so transient load spikes don’t cause restarts.

Engineering judgement: avoid “ping returns 200” readiness checks that don’t validate model availability. Many runtimes expose dedicated endpoints (for example, Triton’s readiness endpoints). If using a custom server, implement a readiness check that verifies the model is loaded and a lightweight inference path is functional.

Add a PodDisruptionBudget (PDB) to prevent voluntary disruptions (node drains, upgrades) from taking down all replicas at once. For example, with two replicas you might set minAvailable: 1. Without a PDB, routine maintenance can become an outage. Pair this with topology spread or anti-affinity if you have multiple GPU nodes, so replicas don’t land on the same node and fail together.

Common mistake: running a single replica “for cost savings” and then being surprised by downtime during node maintenance or runtime crashes. Inference endpoints are often customer-facing; budget for at least two replicas when availability matters.
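The pieces above fit together roughly like this. The probe paths follow Triton's health endpoint conventions (/v2/health/ready, /v2/health/live); adjust for your runtime. The timings and resource sizes are starting points, not recommendations.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inference
spec:
  replicas: 2
  selector:
    matchLabels: { app: inference }
  template:
    metadata:
      labels: { app: inference }
    spec:
      containers:
        - name: server
          image: example/serving-runtime:v1.4.2   # placeholder, immutable tag
          resources:
            requests: { cpu: "4", memory: 16Gi, nvidia.com/gpu: 1 }
            limits:   { cpu: "8", memory: 16Gi, nvidia.com/gpu: 1 }
          startupProbe:       # allow slow model load: up to ~10 minutes here
            httpGet: { path: /v2/health/ready, port: 8000 }
            periodSeconds: 10
            failureThreshold: 60
          readinessProbe:     # "ready" only once the model can serve
            httpGet: { path: /v2/health/ready, port: 8000 }
            periodSeconds: 5
          livenessProbe:      # loose, so load spikes don't trigger restarts
            httpGet: { path: /v2/health/live, port: 8000 }
            periodSeconds: 30
            failureThreshold: 3
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: inference-pdb
spec:
  minAvailable: 1             # keep one replica up during voluntary disruptions
  selector:
    matchLabels: { app: inference }
```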

Section 3.4: Traffic ingress—Ingress controller, TLS, timeouts, request limits

Once the Service works inside the cluster, you need a controlled way to accept external traffic. Most teams use an Ingress controller (NGINX Ingress, Traefik, HAProxy) or a gateway (Kong, Envoy Gateway, API Gateway products). The key is to treat ingress as part of the inference system: it must understand long-lived requests, streaming, and bursty traffic.

TLS should be non-negotiable. Terminate TLS at the ingress/gateway, automate certificates (for example, cert-manager), and prefer modern TLS settings. If your org requires mTLS internally, handle that between gateway and service mesh or directly between gateway and backend. Avoid exposing your inference Service as a plain LoadBalancer with no authentication or TLS—it’s an easy way to leak a costly GPU endpoint to the internet.

Inference requests can be slow relative to typical web APIs, especially for LLM generation. Configure timeouts deliberately:

  • Ingress proxy/read timeouts long enough for worst-case generation or batch inference.
  • Body size limits if clients can send large prompts or images; enforce sane maxima.
  • Connection and rate limits to protect GPUs from sudden floods that cause queue buildup and timeouts.

If you support streaming responses, ensure the ingress supports it and doesn’t buffer responses unexpectedly. Misconfigured buffering can break token streaming and inflate latency. Also consider path-based routing, such as /v1/models/my-model, to provide a stable contract while allowing internal service changes.

Operational outcome: by placing policy (TLS, auth, timeouts, request limits) at the edge, your backend pods can focus on inference. When incidents happen, you can throttle or shed load at the gateway instead of crashing GPU pods in a feedback loop.
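With NGINX Ingress, much of this policy lives in annotations. The annotation keys below are real NGINX Ingress annotations; the timeout and size values are assumptions to tune against your worst-case generation time, and the cert-manager issuer assumes cert-manager is installed.

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: inference
  annotations:
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"   # worst-case generation
    nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-body-size: "8m"       # cap prompt/image size
    nginx.ingress.kubernetes.io/proxy-buffering: "off"      # keep token streaming intact
    cert-manager.io/cluster-issuer: letsencrypt-prod        # assumes cert-manager
spec:
  ingressClassName: nginx
  tls:
    - hosts: [inference.example.com]
      secretName: inference-tls
  rules:
    - host: inference.example.com
      http:
        paths:
          - path: /v1/models/my-model    # stable contract for clients
            pathType: Prefix
            backend:
              service:
                name: inference
                port: { number: 8000 }
```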

Section 3.5: Config and secrets—ConfigMaps, Secrets, external secret stores

Inference deployments often fail operational reviews not because of accuracy, but because of poor configuration hygiene. Treat configuration as an interface: easy to change, validated, and separated from secrets. Kubernetes gives you primitives, but you must use them with discipline.

Use ConfigMaps for non-sensitive settings: model name/version selectors, runtime flags (batch size, max tokens), logging level, and feature toggles. Mount them as files when the runtime expects config files, or inject as environment variables when that’s simpler. Prefer explicit configuration over “magic defaults,” because performance tuning (concurrency, batching) becomes iterative.

Use Secrets for sensitive data: API keys for upstream services, private model repository credentials, TLS client keys, database passwords, and signing keys. Avoid putting secrets in container images, command-line args (visible in process lists), or ConfigMaps. Be cautious with environment variables too—debug endpoints and crash dumps can leak them. Mounting secrets as files with least privilege is often safer.

For higher maturity, integrate an external secret store (Vault, AWS Secrets Manager, GCP Secret Manager, Azure Key Vault) via External Secrets Operator or CSI Secret Store. This improves rotation and auditability. The operational judgement is to keep the Kubernetes Secret as a short-lived projection of a source-of-truth secret, not a long-lived artifact manually edited in-cluster.

  • Validate config at startup: fail fast if required values are missing or invalid, rather than serving partial functionality.
  • Limit RBAC: the service account running inference pods should not have broad read access to secrets across namespaces.
  • Don’t log secrets: sanitize config dumps and request logs, especially for prompt content and auth headers.

Common mistake: mixing “model endpoint configuration” with “cluster operational configuration.” Keep application config in the app namespace, and keep cluster-wide ingress/controller config owned by platform operators. This separation enables safer delegation and clearer incident ownership.
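A minimal split of the two concerns might look like this; the key names are illustrative, not any runtime's real flags, and in a mature setup the Secret would be projected from an external store rather than committed.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: inference-config
data:
  MAX_BATCH_SIZE: "16"
  MAX_TOKENS: "2048"
  LOG_LEVEL: "info"
---
apiVersion: v1
kind: Secret
metadata:
  name: model-repo-creds
type: Opaque
stringData:
  token: REPLACE_ME        # projected from an external secret store in practice
---
# Pod template fragment: env from the ConfigMap, secret as a read-only file.
spec:
  containers:
    - name: server
      envFrom:
        - configMapRef: { name: inference-config }
      volumeMounts:
        - name: repo-creds
          mountPath: /var/run/secrets/model-repo
          readOnly: true
  volumes:
    - name: repo-creds
      secret:
        secretName: model-repo-creds
```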

Section 3.6: Release patterns—blue/green, canary, progressive delivery basics

Inference releases are risky because changes can affect not just availability but output quality, latency, and cost. You need mechanics that support fast rollback and controlled exposure. Kubernetes Deployments provide rolling updates, but rolling updates alone are not always enough when “bad responses” are worse than “no response.”

Blue/green is conceptually simple: run two versions side by side (blue = current, green = new). You validate green with internal traffic and then switch the gateway/Service selector to green. Rollback is a switch back to blue. This works well when you can afford duplicate GPU capacity during the cutover window and want very clear operational control.

Canary releases shift a small percentage of traffic to the new version first. You watch metrics—error rate, p95 latency, GPU memory usage, and domain-specific signals (response validity checks, moderation rates, or offline eval proxies). If healthy, increase traffic gradually. If not, reduce to zero and investigate. Canary is more cost-efficient than blue/green but requires better traffic shaping (via gateway, service mesh, or an ingress that supports weighted routing).

Progressive delivery basics means coupling deployment steps to signals. Tools like Argo Rollouts or Flagger can automate canaries based on metrics, but even manual progressive delivery follows the same discipline: define what “good” looks like, measure it quickly, and have a rehearsed rollback path. For inference, include performance SLOs in the go/no-go gate; a model that is “correct” but doubles latency can still be a production regression.

  • Keep old replicas warm during initial rollout to avoid cold-start spikes and to enable instant rollback.
  • Version your endpoints (path or header) when clients need stability across model changes.
  • Automate rollback triggers when feasible: sustained 5xx, readiness flaps, or p95 latency beyond threshold.
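With Argo Rollouts (one option among several), a metric-gated canary can be declared roughly as follows; the weights, pause durations, and image tag are illustrative starting points, not recommendations:

```yaml
# Sketch of an Argo Rollouts canary strategy; values are illustrative.
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: llm-inference
spec:
  replicas: 4
  selector:
    matchLabels:
      app: llm-inference
  template:
    metadata:
      labels:
        app: llm-inference
    spec:
      containers:
        - name: server
          image: registry.example.com/llm-server:v1.4.2  # immutable tag, never :latest
  strategy:
    canary:
      steps:
        - setWeight: 10           # 10% of traffic to the new version
        - pause: {duration: 5m}   # watch error rate, p95 latency, GPU memory
        - setWeight: 50
        - pause: {duration: 10m}
        # promotion to 100% follows the final step; an abort shifts
        # traffic back to the stable ReplicaSet for fast rollback
```

Even without automated analysis, this structure encodes the discipline described above: controlled exposure, explicit observation windows, and a rehearsed rollback path.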

Common mistake: releasing a new model by overwriting the existing tag (for example, :latest) and letting nodes pull “whatever.” Immutable tags and controlled rollout patterns are how operators make inference systems boring—in the best way.

Chapter milestones
  • Containerize or select a serving runtime image and model artifact strategy
  • Deploy a GPU-backed inference Deployment with health checks
  • Expose the service safely through an Ingress/API gateway path
  • Manage configuration and secrets for model endpoints
  • Add canary and rollback mechanics for safer releases
Chapter quiz

1. What is the primary operational goal when deploying a GPU inference service on Kubernetes in this chapter?

Show answer
Correct answer: Make it run predictably with health checks, secure config, controlled exposure, and safe release mechanics
The chapter emphasizes predictable operation—health checks, security, traffic control, and safe roll forward/rollback—not just “it runs.”

2. Which choice best reflects the chapter’s guidance on selecting a serving approach?

Show answer
Correct answer: Prefer an off-the-shelf serving runtime when it meets needs, instead of building a custom server by default
A common mistake called out is building a custom server when an existing runtime would suffice.

3. Why does the chapter highlight using startup probes in addition to other health checks for inference pods?

Show answer
Correct answer: To prevent pods from being killed while the model is still loading during startup
Without startup probes, pods can fail health checks during model load and get restarted unnecessarily.

4. When exposing an inference service through an ingress layer, what configuration concern is specifically called out for long-running generation?

Show answer
Correct answer: Ingress timeouts that are too low can break long generation requests
The chapter warns about ingress timeouts being set too low for longer inference/generation responses.

5. Which deployment approach best matches the chapter’s recommended strategy for safer inference releases?

Show answer
Correct answer: Use progressive delivery patterns (e.g., canary/blue-green) to enable controlled rollouts and rollbacks
Progressive delivery reduces risk by controlling traffic while allowing safe forward and backward moves.

Chapter 4: GPU Scheduling & Multi-Tenancy (Fair, Reliable, Predictable)

As a sysadmin moving into GPU cluster operations, you’re no longer just keeping nodes “up.” You’re shaping how expensive, scarce accelerators are allocated so teams can ship reliable inference without stepping on each other. The difference between a cluster that feels predictable and one that feels chaotic is usually not the model code—it’s resource modeling, placement controls, and guardrails that prevent noisy-neighbor behavior.

GPU scheduling in Kubernetes is deceptively simple on the surface: request nvidia.com/gpu and the scheduler places your pod on a node with that many GPUs available. In practice, inference workloads couple GPU, CPU, memory, storage, and network. If you get those couplings wrong, you’ll see “mysterious” latency spikes, intermittent OOMs, or pods stuck in Pending even though “there are GPUs free.” This chapter gives you the operator’s toolkit: how to model resources, steer placement, enforce fairness, choose safe sharing strategies, and build runbooks that shorten incidents.

Your goal is the same one you’ve always had in operations: fair sharing under load, reliable execution under failure, and predictable performance under change. The GPU simply raises the stakes.

Practice note for this chapter's milestones (GPU-aware requests/limits and quality-of-service; placement via taints, tolerations, affinity, and topology constraints; quotas and priority against noisy neighbors; safe sharing via MIG, time-slicing, or isolation; runbooks for stuck scheduling and GPU resource leaks): for each one, document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Sections in this chapter
Section 4.1: Resource modeling—GPU as a schedulable resource; CPU/memory coupling

Kubernetes treats GPUs as an extended resource, typically exposed by the NVIDIA device plugin as nvidia.com/gpu. Unlike CPU and memory, GPUs are not overcommitted by default: requesting 1 GPU generally means exclusive assignment of a whole device to that container (unless you intentionally enable a sharing mechanism covered later). That exclusivity is good for predictability, but it can trick operators into under-modeling the rest of the pod.

Inference services usually need non-trivial CPU for request parsing, tokenization, post-processing, TLS, and telemetry. If you request a GPU but starve CPU, the pod will schedule fine and then fail your SLOs because the GPU sits idle waiting for CPU-side work. Similarly, memory requests matter even when “the model is on the GPU.” CPU RAM is used for queues, KV caches (depending on architecture), runtime buffers, and batching. Treat GPU requests as the anchor and size CPU/memory around it.

For Quality of Service (QoS), the practical rule is: set requests to what you need to hit steady-state SLOs, and set limits to cap worst-case behavior. For CPU, allowing some burst (limit > request) can be useful, but for memory you typically want request=limit to avoid node-level eviction surprises. GPU is usually request=limit (since it’s an integer device count). A common mistake is leaving CPU requests at tiny defaults (or none), which can place many GPU pods on the same node and create a CPU bottleneck that looks like “GPU latency.”
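A container resource block following these rules might look like this; the numbers are placeholders to be replaced by your own measurements:

```yaml
# Sketch: resource shape for a 1-GPU inference container.
# Values are illustrative; derive yours from benchmarks.
resources:
  requests:
    nvidia.com/gpu: 1
    cpu: "4"            # enough CPU-side work to keep the GPU fed
    memory: 16Gi        # request == limit to avoid eviction surprises
  limits:
    nvidia.com/gpu: 1   # extended resources require request == limit
    cpu: "6"            # modest burst headroom (limit > request)
    memory: 16Gi
```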

  • Model a single replica first. Measure p95 latency and throughput while varying CPU requests (e.g., 1, 2, 4 cores) with a fixed GPU request. Find where GPU utilization increases and tail latency decreases.
  • Account for concurrency. If you run multiple workers per pod, CPU and memory scale with concurrency even if GPU count stays constant.
  • Budget for sidecars. Service meshes, log shippers, and metrics exporters consume CPU/memory and can become the hidden limiter on a “GPU” node.

Operator outcome: you can justify resource requests with measurements and prevent QoS-related flakiness. You also gain a clear language for teams: “A 1-GPU inference pod is also a 4-core, 16GiB pod,” which makes capacity planning realistic.

Section 4.2: Node pools and placement—labels, taints, tolerations, affinity

GPU clusters become manageable when you stop thinking of “nodes” and start thinking of pools: groups of machines with the same GPU type, driver stack, and performance profile. A pool might be “L4 inference,” “A100 high-throughput,” or “T4 dev/test.” Node pools let you control cost, blast radius, and upgrade cadence (for example, rolling the CUDA driver only on one pool at a time).

Placement control in Kubernetes is a layered toolkit:

  • Labels describe nodes (gpu.nvidia.com/class=L4, node.kubernetes.io/instance-type=...).
  • Taints repel pods unless they explicitly tolerate them (e.g., nvidia.com/gpu=true:NoSchedule on GPU nodes).
  • Tolerations opt a pod into a tainted pool (your “I know what I’m doing” switch).
  • Affinity/anti-affinity expresses preferences or hard requirements for co-location or separation (e.g., keep replicas on different nodes).

A strong default is: taint all GPU nodes and require inference pods to tolerate the taint. This prevents accidental scheduling of non-GPU workloads onto expensive nodes and reduces “surprise” GPU node pressure from background jobs. Then, use labels plus nodeSelector or node affinity to select the right GPU class. Keep the selection logic simple and explicit; complex affinity expressions are hard to debug during incidents.
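A pod spec fragment implementing that default might look like this; the taint key and node label are illustrative and should match whatever your pools actually use:

```yaml
# Sketch: opt an inference pod into a tainted L4 pool.
# Assumes GPU nodes were tainted, e.g.:
#   kubectl taint nodes <node> nvidia.com/gpu=true:NoSchedule
spec:
  tolerations:
    - key: nvidia.com/gpu      # the "I know what I'm doing" switch
      operator: Equal
      value: "true"
      effect: NoSchedule
  nodeSelector:
    gpu.nvidia.com/class: L4   # illustrative pool label
```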

Common mistake: using only labels without taints. Labels alone don’t stop a generic CPU workload from landing on a GPU node if it fits CPU/memory. That seems harmless until CPU-heavy system jobs crowd out inference CPU, creating latency spikes while GPUs remain allocated but underfed.

Practical outcome: you can safely run mixed clusters where CPU-only tenants don’t “steal” GPU nodes, and GPU tenants land on the correct hardware class. This also sets you up for predictable upgrades: drain and rotate one labeled pool at a time without impacting unrelated tenants.

Section 4.3: Topology and performance—NUMA, PCIe locality, multi-GPU considerations

When inference gets serious, the scheduler decision “has a GPU available” is not enough. Topology matters: CPU sockets, NUMA nodes, PCIe lanes, and GPU interconnect (NVLink) can change throughput and tail latency. As an operator, you don’t need to memorize hardware diagrams, but you do need the instinct to ask: “Is the pod’s CPU close to its GPU, and are multi-GPU jobs using the right GPUs?”

NUMA effects show up when a pod’s CPU threads and memory allocations land on a different NUMA node than the GPU’s PCIe root. The symptom is counterintuitive: GPU utilization may appear fine, but end-to-end latency rises due to slower host-to-device transfers and memory access. For multi-GPU inference (tensor parallelism, large models, or high batch throughput), PCIe locality and NVLink topology affect collective communication and can dominate performance.

  • Prefer “one pod per GPU” for latency-sensitive services unless you have a deliberate sharing plan. This makes locality easier and reduces cross-talk.
  • Use pod anti-affinity for replicas so a single node failure doesn’t drop all replicas and to avoid local contention (CPU, NIC, disk) on one machine.
  • For multi-GPU pods, request all required GPUs in one pod rather than spreading across pods; Kubernetes allocates devices to a container, but the interconnect between those devices is hardware-specific.
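The anti-affinity bullet can be sketched as follows (the app label and hard requirement are illustrative; a preferred rule is a softer alternative when nodes are scarce):

```yaml
# Sketch: spread replicas across nodes so one node failure
# doesn't drop the whole service and replicas don't contend locally.
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app: llm-inference          # illustrative label
          topologyKey: kubernetes.io/hostname
```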

In practice, topology tuning is iterative: run a load test, observe p95/p99 latency, then correlate with node-level metrics (CPU steal, memory bandwidth, PCIe errors, NIC saturation). A common operator misstep is focusing only on GPU metrics (utilization, memory) and ignoring host-side bottlenecks. If your gateway, tokenizer, or batching thread is constrained, the GPU can be “busy” while the service still misses SLOs.

Practical outcome: you learn to treat inference as a full-system workload, not just a device allocation problem. This skill pays off when teams upgrade models and suddenly demand multi-GPU placement or tighter tail latency.

Section 4.4: Multi-tenancy controls—namespaces, quotas, priority classes, limit ranges

Multi-tenancy is where GPU clusters fail most often: one team’s experiment can starve another team’s production inference. Kubernetes gives you strong controls, but only if you actually use them. Start by treating namespaces as tenancy boundaries for policy and accounting: per-team namespaces with scoped RBAC, network policies, and secrets management.

Then add fairness controls that are meaningful for GPUs:

  • ResourceQuota: cap total nvidia.com/gpu, CPU, and memory per namespace. This prevents runaway scaling or “just one more replica” incidents that drain the pool.
  • LimitRange: enforce minimum/maximum requests and defaults. This stops under-requesting CPU (leading to noisy neighbor CPU contention) and discourages over-requesting memory (leading to fragmentation and poor bin packing).
  • PriorityClass: express business importance. Production inference should preempt best-effort batch jobs when the cluster is full.
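Together, the three controls might be sketched like this for one tenant namespace; every number is a placeholder to be negotiated with the team:

```yaml
# Sketch: per-team guardrails; all values are illustrative.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-gpu
  namespace: team-a
spec:
  hard:
    requests.nvidia.com/gpu: "8"   # cap total GPUs for the namespace
    requests.cpu: "64"
    requests.memory: 256Gi
---
apiVersion: v1
kind: LimitRange
metadata:
  name: team-a-defaults
  namespace: team-a
spec:
  limits:
    - type: Container
      defaultRequest:              # stop accidental under-requesting
        cpu: "2"
        memory: 8Gi
      max:                         # discourage over-requesting
        cpu: "16"
        memory: 64Gi
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: prod-inference
value: 100000
globalDefault: false
description: "Production inference; may preempt best-effort batch."
```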

The engineering judgment is in choosing quotas and priorities that reflect reality. If you set quotas too low, teams will route around policy (multiple namespaces, shadow clusters). If you set them too high, you have no guardrail. A practical approach is to allocate a baseline quota per team plus a “burst” pool managed by a separate namespace or cluster autoscaler policy, with explicit approval for temporary increases.

Priority and preemption are powerful but dangerous. Preemption can evict lower-priority pods to schedule higher-priority ones, which is great for protecting production SLOs. But if you allow preemption without disruption budgets or without clear communication, you’ll create confusing outages for lower-tier workloads. Define expectations: which jobs are preemptible, how they should checkpoint, and what “best effort” means in your org.

Practical outcome: you can prevent noisy neighbor incidents before they happen, explain capacity tradeoffs in policy terms, and maintain a cluster where teams trust scheduling outcomes instead of fighting them.

Section 4.5: GPU sharing approaches—MIG, time-slicing, and when not to share

GPU sharing is tempting because it improves utilization, but it can destroy predictability if applied blindly. You have three broad strategies: hard partitioning, soft sharing, or isolation (no sharing). The correct choice depends on workload shape (latency vs throughput), tenant trust, and the blast radius you can tolerate.

MIG (Multi-Instance GPU) is hard partitioning available on certain NVIDIA GPUs (notably A100/H100 class). It splits one physical GPU into hardware-isolated slices with dedicated memory and compute resources. MIG is often the best option for multi-tenancy because it provides stronger isolation than simple sharing, and scheduling becomes “request a MIG slice” rather than “share a GPU.” Operationally, MIG adds complexity: you must configure MIG profiles on nodes and align Kubernetes resource names with what the device plugin advertises.

Time-slicing (or software-based sharing) allows multiple pods to share a GPU by time-multiplexing. It can raise aggregate throughput for bursty or dev workloads but can increase tail latency due to contention and context switching. It also increases the chance that one tenant’s behavior impacts another’s performance. Use time-slicing for non-SLO workloads, internal experimentation, or batch inference where latency variance is acceptable.
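With the NVIDIA device plugin, time-slicing is typically enabled through a config file of roughly this shape (the replica count is illustrative); each physical GPU is then advertised as multiple schedulable units:

```yaml
# Sketch: NVIDIA device plugin time-slicing config, usually delivered
# via a ConfigMap referenced by the plugin. Replica count is illustrative.
# Note: there is NO memory or fault isolation between the pods that
# share a time-sliced GPU.
version: v1
sharing:
  timeSlicing:
    resources:
      - name: nvidia.com/gpu
        replicas: 4   # one physical GPU appears as 4 allocatable GPUs
```

The operational consequence is worth stating plainly: after this change, "nvidia.com/gpu: 1" no longer means an exclusive device, so capacity dashboards and tenant expectations must be updated at the same time.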

  • When not to share: strict p99 latency SLOs, untrusted tenants, large models that nearly fill GPU memory, or workloads with unpredictable spikes.
  • When sharing works: small models, stable traffic, clear per-tenant limits, and good observability of queueing and latency.

A common mistake is enabling sharing to “fix” capacity issues without adding tenant controls (quotas, priorities) or without updating runbooks. Sharing changes failure modes: a single bad deployment can now degrade multiple services on the same GPU rather than just consuming “its” device.

Practical outcome: you can choose a sharing method that matches the organization’s risk tolerance, communicate the tradeoffs to stakeholders, and avoid the trap of chasing utilization at the expense of reliability.

Section 4.6: Troubleshooting scheduling—pending pods, device plugin issues, fragmentation

GPU scheduling incidents often look like “pods stuck in Pending” or “pod scheduled but no GPU available at runtime.” Your runbook should start with fast triage and then branch by symptom. The goal is to distinguish: (1) genuine lack of capacity, (2) placement constraints you created, (3) node-level GPU stack failures, and (4) bin-packing fragmentation.

  • Pending pods: run kubectl describe pod and read the scheduler events. Look for messages like “0/10 nodes available: insufficient nvidia.com/gpu” (capacity), “node(s) had taint … that the pod didn’t tolerate” (policy), or “didn’t match node affinity” (placement).
  • Device plugin issues: confirm the device plugin DaemonSet is running on GPU nodes, and that nodes advertise allocatable GPUs. If allocatable is zero, check driver health, runtime configuration, and the plugin logs. A node with a broken driver often stays Ready but advertises no GPU resources.
  • Runtime failures: if the pod schedules but the container can’t see the GPU, validate the container runtime configuration (NVIDIA runtime), and confirm the expected device files and libraries are present in the container environment.

Fragmentation is a frequent “we have GPUs, why can’t we schedule?” problem. Example: a cluster has many nodes with 1 free GPU each, but a workload requests 2 GPUs in a single pod; it will remain Pending. Similarly, a pod might need a GPU plus high CPU/memory; a node may have a free GPU but not enough unreserved CPU, so the scheduler can’t place the pod there. This is why resource modeling (Section 4.1) and pool design (Section 4.2) matter: they reduce odd-shaped requests that don’t fit.

Add a second runbook track for GPU resource leaks: cases where a pod terminates but the node appears to have GPUs “in use.” Often this is an accounting issue (stale pod objects, kubelet delays) or a hung process holding device files. Your steps: verify Kubernetes thinks the GPU is allocated (node allocatable vs allocated), check for orphaned processes on the node, and consider draining the node if it can’t recover cleanly. Document when a reboot is acceptable and how to do it safely for the pool.

Practical outcome: you can resolve scheduling stalls quickly, avoid guesswork, and translate symptoms into concrete fixes—whether that’s adjusting tolerations, repairing the GPU stack, or redesigning requests to reduce fragmentation.

Chapter milestones
  • Implement GPU-aware resource requests/limits and quality-of-service
  • Control placement with taints, tolerations, affinity, and topology constraints
  • Introduce quotas and priority to prevent noisy neighbor incidents
  • Enable safe sharing strategies (MIG, time-slicing, or isolation)
  • Create runbooks for stuck scheduling and GPU resource leaks
Chapter quiz

1. A team reports pods stuck in Pending even though "there are GPUs free." Based on the chapter, what is a likely cause to check first?

Show answer
Correct answer: A coupling mismatch (CPU/memory/storage/network) prevents placement despite available GPUs
The chapter emphasizes that inference workloads couple GPU with CPU/memory/storage/network; getting those wrong can leave pods Pending even when GPUs appear available.

2. Which set of controls is primarily used to steer *where* GPU workloads run for predictable placement?

Show answer
Correct answer: Taints/tolerations, affinity, and topology constraints
The chapter calls out taints, tolerations, affinity, and topology constraints as placement controls to make scheduling predictable.

3. What is the main purpose of introducing quotas and priority in a multi-tenant GPU cluster?

Show answer
Correct answer: To prevent noisy-neighbor incidents by enforcing fairness under load
Quotas and priority are described as guardrails that prevent one tenant from crowding out others (noisy-neighbor behavior).

4. Which option best describes *safe sharing strategies* for expensive GPU accelerators in this chapter?

Show answer
Correct answer: Using MIG, time-slicing, or isolation to share while controlling interference
The chapter explicitly lists MIG, time-slicing, and isolation as strategies to enable safe sharing.

5. Why does the chapter recommend creating runbooks for stuck scheduling and GPU resource leaks?

Show answer
Correct answer: To reduce incident time by providing repeatable operator steps for common GPU scheduling/resource issues
Runbooks are positioned as an operator tool to shorten incidents when pods won’t schedule or GPU resources leak.

Chapter 5: Performance Tuning for Inference (Latency, Throughput, Cost)

Inference is where your Kubernetes and sysadmin instincts matter most: you are turning scarce GPU time into a user-visible service with measurable latency and controllable cost. “Fast” is not a single number. A model can have excellent tokens/second but terrible p95 latency under bursty traffic; it can also be low-latency but expensive due to poor GPU utilization. This chapter gives you a repeatable workflow: benchmark like you mean it, tune the runtime, apply model-level optimizations, remove system bottlenecks, scale safely, and then translate results into a cost/performance scorecard that supports change decisions.

As a GPU cluster operator, your job is not to “make graphs look good.” Your job is to meet an SLO (for example: p95 end-to-end latency under 800 ms for short prompts; or p99 under 4 s for long responses) while keeping throughput high and cost predictable. The practical loop looks like: define a workload profile, measure under controlled conditions, change one variable at a time, and record both performance and cost impact. If you can’t reproduce a benchmark, you can’t trust improvements.

Throughout the sections, you’ll see the same pattern repeated: establish a baseline; separate cold-start behavior from steady-state; watch queueing; validate resource headroom; and always track p50/p95/p99, not just averages. Your output should be a small set of dashboards and a written tuning log that another operator can follow, not a one-off “hero” tuning session.

Practice note for this chapter's milestones (a repeatable benchmark methodology for inference services; runtime tuning of batching, concurrency, and caching; model-level optimizations such as quantization basics and compilation awareness; right-sizing resources and reducing CPU/memory/network/storage bottlenecks; a cost/perf scorecard to guide change decisions): for each one, document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Sections in this chapter
Section 5.1: Benchmark design—workloads, datasets, warmup, and statistical pitfalls

Performance tuning starts with a benchmark you can rerun after every change. For inference services, your “workload” is more than requests per second (RPS). It includes input sizes, output sizes (max tokens), request mix (chat vs embeddings vs rerank), and concurrency patterns (steady load, spiky bursts, diurnal ramps). Start by writing down two to three representative scenarios, such as: (1) short prompts, short outputs; (2) long prompts, short outputs; (3) long prompts, long outputs. Tie each scenario to an SLO and a business intent (interactive chat vs background summarization).

Choose or create a dataset that matches production shape. Avoid benchmarking with a single fixed prompt: it hides tokenizer variance and cache behavior. For LLMs, record distributions: prompt tokens, completion tokens, and sampling parameters. For vision or speech, record typical input resolution/length. Save the dataset in version control or object storage with a content hash so you can prove comparability across runs.

Warmup is mandatory. GPUs, model runtimes, and kernels have “first-run” costs (graph compilation, CUDA context init, page faults). If you include those in your steady-state latency, you’ll chase the wrong problem. Run a warmup phase until metrics stabilize, then start measuring. Separate cold-start SLOs (pod startup + model load) from steady-state SLOs (request latency). In Kubernetes terms, measure: time-to-Ready, time-to-first-token (TTFT), and time-to-last-token (TTLT).

  • Measure the right percentiles: report p50/p95/p99. Averages lie under queueing.
  • Use enough samples: small N makes p99 meaningless. Increase duration or request count.
  • Control variables: pin model version, runtime version, node type, and driver stack. Record them in the benchmark output.
  • Beware coordinated omission: if your load generator waits for responses before sending more, you may under-measure latency under saturation. Use an open-loop generator when testing capacity.

Finally, capture both service-side and client-side timing. Client-side includes network and gateway overhead; service-side isolates model execution and queue time. If you can’t separate “compute time” from “waiting in line,” you’ll misattribute bottlenecks and apply the wrong tuning knob.

Section 5.2: Runtime knobs—batching, speculative decoding concepts, max tokens, concurrency

Most inference performance wins come from runtime configuration rather than Kubernetes changes. The core tradeoff is simple: you want higher GPU utilization (throughput) without creating excessive queueing (latency). Batching and concurrency are the two levers that control how requests share GPU time.

Batching combines multiple requests into a single GPU execution step. It improves throughput by amortizing overhead, but it can increase p95 latency because requests wait to form a batch. Use dynamic batching with a small max batch size and a short batch delay (for example, a few milliseconds) for interactive workloads. A common mistake is setting batch delay too high, which makes TTFT feel “stuck” even when tokens/sec looks great. Benchmark both TTFT and TTLT when adjusting batching.

Concurrency is how many in-flight requests the runtime accepts. Too low and you underutilize the GPU; too high and you create queueing and memory pressure (KV cache growth for LLMs). Many operators tune concurrency by watching two metrics: GPU utilization and request queue time. Increase concurrency until utilization is consistently high, then stop when p95 latency starts rising sharply—this knee is your practical operating point.
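The knee-finding procedure can be sketched as a simple sweep analysis. The helper below is illustrative (the 20% jump threshold is an assumption you would tune): given measured (concurrency, p95) pairs, it returns the last concurrency setting before p95 rises sharply.

```python
def find_knee(sweep, max_rel_increase=0.2):
    """sweep: [(concurrency, p95_ms), ...] sorted by concurrency.

    Return the last concurrency before p95 jumps by more than
    max_rel_increase between adjacent steps -- the practical
    operating point described in the text.
    """
    knee = sweep[0][0]
    for (c_prev, lat_prev), (_, lat_next) in zip(sweep, sweep[1:]):
        if lat_next > lat_prev * (1 + max_rel_increase):
            return c_prev
        knee = _
    return knee
```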

Max tokens (and related limits like max sequence length) are safety rails that also shape performance. Unbounded outputs can hijack capacity and destroy tail latency. Set sane defaults and enforce per-tenant limits if you run multi-tenant inference. For chat systems, consider separate pools: a low-latency pool with tight max tokens and a “long-form” pool with looser limits and different SLOs.

Speculative decoding (conceptually) uses a smaller “draft” model to propose tokens and a larger model to verify them, often improving perceived latency and throughput. Operationally, it introduces new knobs: draft model selection, acceptance rate, and extra memory footprint. Treat it like a runtime feature that changes your resource profile; benchmark it under your real prompt lengths, because acceptance rates vary with task type. Do not assume it’s always a win—under some distributions it adds overhead.
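A back-of-the-envelope model makes the speculative-decoding tradeoff visible. Assume the draft proposes k tokens per step and each is accepted independently with a fixed probability; real acceptance rates are prompt-dependent, so treat this strictly as a planning estimate, not a benchmark substitute.

```python
def expected_tokens_per_step(accept_rate, k):
    """Expected tokens committed per verification step: (1 - a^(k+1)) / (1 - a).
    At a=0 this is 1: the target model's own token is still emitted."""
    a = accept_rate
    if a >= 1.0:
        return k + 1
    return (1 - a ** (k + 1)) / (1 - a)

def speedup(accept_rate, k, t_draft, t_target):
    """Throughput speedup versus plain decoding at t_target per token.
    One speculative step costs k draft forwards plus one target forward."""
    tokens = expected_tokens_per_step(accept_rate, k)
    return tokens * t_target / (k * t_draft + t_target)
```

Plugging in a low acceptance rate shows a speedup below 1.0 even with a cheap draft model, which is exactly the "not always a win" caveat above.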

Caching can mean prompt caching, prefix caching, or KV cache reuse depending on the runtime. It can dramatically reduce compute for repeated prefixes (system prompts, templates), but it increases memory usage and can complicate multi-tenant isolation. The operational judgment: enable caching when request repetition is high and memory headroom exists; otherwise, disable it to protect tail latency and avoid OOM cascades.

Section 5.3: Kernel/model optimizations—FP16/INT8, quantization tradeoffs, TensorRT awareness

Once runtime knobs are sensible, model-level and kernel-level optimizations are the next step. As an operator, you don’t need to invent new quantization methods, but you do need to understand what changes risk accuracy, what changes affect memory, and what changes require different deployment artifacts.

FP16/BF16 is usually the default for modern GPU inference because it reduces memory bandwidth and often speeds kernels while keeping accuracy close to FP32. The practical win is often higher throughput and lower memory use (larger effective batch/concurrency). Verify that your runtime and model weights are actually using the intended precision; it’s easy to think you’re on FP16 while falling back to FP32 due to unsupported ops.

INT8 quantization can yield significant speedups and memory savings, especially for transformer-heavy workloads, but it introduces calibration and potential accuracy regressions. The key tradeoff: better cost/perf versus quality risk and operational complexity. Treat INT8 as a controlled rollout: benchmark latency/throughput and run task-relevant quality checks (even lightweight ones) before full promotion. A common mistake is relying on generic perplexity metrics while your production task is summarization, extraction, or code completion—choose a quality proxy that matches the workload.

Quantization formats matter operationally. Post-training quantization is simpler to adopt; quantization-aware training can yield better quality but requires training infrastructure. Some runtimes support weight-only quantization (reduces weight memory) while compute remains higher precision; others quantize activations too. Ask: what’s the bottleneck—compute, memory, or bandwidth? Pick the method that addresses the real constraint.

Compilation awareness (TensorRT, ahead-of-time engines) changes the deployment lifecycle. A TensorRT engine can be faster than a generic runtime path, but it is sensitive to GPU architecture, driver/CUDA versions, and sometimes dynamic shapes. That means you must version and cache the compiled artifacts, and you may need a build step per GPU type. Operationally, build engines in CI or a controlled build job, store them in an artifact repository, and ensure your pods validate engine compatibility at startup. If you compile on startup, cold-start times can explode and break your readiness expectations.

Finally, keep a “known-good” baseline path (for example, FP16 without compilation) so you can roll back quickly if an optimization introduces instability. Performance wins are only wins if the service stays reliable.

Section 5.4: System bottlenecks—CPU pinning, IO, networking, request queueing

When GPU utilization is low, the GPU is often not the problem. Inference services have CPU work (tokenization, request parsing, TLS, logging), memory pressure (KV cache, page cache), and network overhead (gateway hops, gRPC/HTTP2). Your sysadmin background shines here: treat the node like a system, not just a scheduler target.

CPU pinning and isolation: Tokenization and networking can become CPU-bound, especially at high RPS. Ensure your pods request enough CPU and consider pinning critical pods to dedicated cores on GPU nodes (where supported by your cluster policy). A common failure mode is running the GPU container with tiny CPU requests; Kubernetes then throttles it under contention, creating “mysterious” latency spikes while GPU sits idle.

IO and model loading: Cold starts are often dominated by pulling large images and loading multi-GB weights. Use local SSD or a fast network filesystem for model artifacts, and avoid repeated downloads by using node-level caching or an init container that validates presence. Measure time-to-Ready separately from request latency so you don’t confuse scaling issues with inference performance.

Networking: If you route through an API gateway and service mesh, you add hops and sometimes per-request overhead (mTLS, retries). For high-throughput internal services, gRPC can reduce overhead versus JSON/HTTP, but pick one protocol per path and keep it consistent so measurements stay comparable. Watch for packet drops, conntrack exhaustion, or overly aggressive timeouts that cause retries and amplify load. Always include gateway latency and upstream timeouts in your tracing so tail latency is attributable.

Request queueing: Queueing is the invisible killer of p95/p99. You may see stable compute time but rising end-to-end latency because requests pile up. Expose metrics for queue depth, time spent waiting, and active in-flight requests. If queue time rises, you can respond by lowering concurrency (to protect latency), adding replicas (to increase capacity), or changing routing (to protect interactive traffic). The mistake is to push concurrency higher “to use the GPU,” which can worsen tail latency and trigger OOM due to growing per-request state.
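A crude queueing model shows why queue time is so dangerous near saturation. The M/M/1 formula below is a simplification (inference arrivals and service times are not exponential), but the shape holds: waiting time grows hyperbolically as arrival rate approaches service capacity.

```python
def mm1_wait_ms(arrival_rps, service_rps):
    """Mean queueing delay (ms) in an M/M/1 queue: Wq = rho / (mu - lambda).
    Returns infinity at or beyond saturation (arrival >= service rate)."""
    if arrival_rps >= service_rps:
        return float("inf")
    rho = arrival_rps / service_rps
    return 1000.0 * rho / (service_rps - arrival_rps)
```

At 50% load on a replica serving 100 req/s the model predicts about 10 ms of queueing; at 90% load it predicts about 90 ms, and the curve goes vertical from there. This is the quantitative reason that pushing concurrency higher "to use the GPU" destroys p99.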

Practical outcome: you should be able to explain low GPU utilization using a short list of bottlenecks (CPU throttle, network overhead, queueing, IO) and validate fixes with before/after traces and node-level metrics.

Section 5.5: Autoscaling for inference—HPA/VPA concepts, custom metrics, scale limits

Autoscaling for GPU inference is different from stateless web apps because scale events are slow (image pull + model load) and GPUs are expensive. You scale to protect SLOs, not to chase perfect utilization. Use Horizontal Pod Autoscaler (HPA) when you can add replicas; consider Vertical Pod Autoscaler (VPA) for right-sizing CPU/memory requests, but be cautious with GPU workloads because restarts are disruptive.

HPA signals: CPU utilization is usually the wrong primary metric for GPU inference. Prefer custom metrics that correlate with user experience and saturation: request queue depth, in-flight requests, p95 latency, or GPU duty cycle (with care). Queue depth is often the most actionable: it rises early and directly indicates backlog. Configure HPA to react before p95 explodes, and validate with load tests that include bursts.
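For reference, the documented HPA scaling algorithm is a simple ratio: desiredReplicas = ceil(currentReplicas × currentMetricValue / targetMetricValue), clamped to your configured bounds. The sketch below applies it with per-pod queue depth as the custom metric (the min/max defaults are illustrative):

```python
import math

def hpa_desired_replicas(current_replicas, current_metric, target_metric,
                         min_replicas=1, max_replicas=8):
    """Kubernetes HPA formula: desired = ceil(current * metric / target),
    clamped to [min_replicas, max_replicas]. Here the metric could be
    average queue depth per pod."""
    desired = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, desired))
```

Working the numbers by hand like this before a load test makes it obvious how aggressively a given target metric will scale you, and where the max-replicas cap will bite.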

Scale limits and stabilization: Set a max replicas limit based on GPU capacity and a min replicas based on cold-start tolerance. Use stabilization windows to avoid thrashing when traffic fluctuates. A classic mistake is letting HPA scale to zero for large models, then discovering that the first user after idle waits minutes for a warm pod. For interactive services, keep a warm pool (min replicas > 0) or use a smaller “routing” deployment that can shed load gracefully.

VPA for CPU/memory: Tokenization and networking need consistent CPU headroom; VPA can help discover realistic requests/limits and reduce throttling. Apply VPA in recommendation mode first, then roll changes intentionally. Avoid frequent evictions during peak hours.

Cluster-level constraints: Pods cannot scale beyond available GPUs. Pair HPA with Cluster Autoscaler (or a GPU-aware node provisioning system) and make sure node provisioning time is part of your SLO strategy. If adding nodes takes 10–15 minutes, HPA alone won’t save p95 during sudden spikes. In that case, plan reserved headroom, or implement admission control and graceful degradation (lower max tokens, faster model) under load.

The operator deliverable here is a scaling policy that is predictable: it should be obvious when and why the service adds replicas, what the upper bound is, and what happens when you hit it.

Section 5.6: FinOps for GPUs—utilization targets, bin-packing, and reserved capacity strategy

Performance tuning is incomplete without cost discipline. GPUs magnify waste: a small misconfiguration can burn thousands per month. FinOps for inference starts with a utilization target that matches your risk tolerance. For interactive workloads, you might accept 40–60% steady utilization to preserve headroom for bursts; for batch workloads, you might target 70–90% with looser latency SLOs.

Build a cost/perf scorecard: For every meaningful change (batching, quantization, new runtime, autoscaling tweak), record: p50/p95/p99 latency, throughput (requests/sec and tokens/sec), GPU utilization, GPU memory headroom, error rate, and cost per 1k requests (or per 1M tokens). Tie the scorecard to a specific node type and pricing model so the math is explicit. This turns tuning from opinion into evidence.
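The cost columns of the scorecard are simple arithmetic once you pin the node price and steady-state throughput. A sketch with illustrative inputs (pricing and throughput numbers are examples, not recommendations):

```python
def cost_per_1k_requests(gpu_hourly_usd, replicas, rps):
    """Cost of serving 1k requests at steady state: requests served per
    hour = rps * 3600; total hourly spend = price * replicas."""
    hourly_cost = gpu_hourly_usd * replicas
    requests_per_hour = rps * 3600
    return 1000.0 * hourly_cost / requests_per_hour

def cost_per_1m_tokens(gpu_hourly_usd, replicas, tokens_per_sec):
    """Same calculation normalized per 1M generated tokens."""
    tokens_per_hour = tokens_per_sec * 3600
    return 1_000_000.0 * gpu_hourly_usd * replicas / tokens_per_hour
```

Recomputing these two numbers after every tuning change (batching, quantization, autoscaling tweaks) is what turns the scorecard from opinion into evidence.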

Bin-packing and scheduling: Use GPU-aware scheduling to pack compatible workloads onto the same node when safe. Techniques include node labeling by GPU type, taints/tolerations for isolation, and affinity to keep latency-sensitive pods on less-contended nodes. If you use MIG or fractional GPU sharing, track contention carefully—bin-packing can improve cost but hurt tail latency if memory or PCIe bandwidth becomes shared bottlenecks.

Reserved capacity strategy: If your traffic baseline is stable, reserved instances/committed use can reduce unit cost significantly. Keep on-demand capacity for bursts and experimentation. The operational trick is to align reservations with “always-on” services and keep a smaller flexible pool for unpredictable demand. Review monthly utilization and resize reservations; over-reserving is just as wasteful as underutilizing GPUs.

Common cost mistakes: running too many warm replicas “just in case,” using oversized GPU types for small models, ignoring CPU/memory overhead that forces bigger nodes, and failing to cap max tokens (which can create unbounded cost per request). Your best lever is often policy: enforce limits, route long requests differently, and publish a clear SLO/cost contract to internal users.

Practical outcome: you can justify why a configuration is “best” not only because it’s fastest, but because it meets SLOs at the lowest cost per unit of useful work.

Chapter milestones
  • Establish a repeatable benchmark methodology for inference services
  • Tune runtime parameters: batching, concurrency, and caching
  • Apply model-level optimizations: quantization basics and compilation awareness
  • Right-size resources and reduce bottlenecks (CPU, memory, network, storage)
  • Create a cost/perf scorecard to guide change decisions
Chapter quiz

1. Why does the chapter argue that “fast” is not a single number for inference services?

Correct answer: Because high throughput (tokens/second) can still coincide with poor p95 latency under bursty traffic, and low latency can still be expensive due to poor GPU utilization
The chapter emphasizes balancing latency percentiles, throughput, and cost; optimizing one can worsen another.

2. Which workflow best matches the chapter’s recommended repeatable benchmarking loop?

Correct answer: Define a workload profile, measure under controlled conditions, change one variable at a time, and record performance and cost impact
Repeatability and isolating variables are required to trust performance gains and understand cost impact.

3. When validating inference performance against an SLO, which metric approach does the chapter prioritize?

Correct answer: Track p50/p95/p99 (not just averages) to reflect user experience under varying conditions
Percentiles capture tail behavior (e.g., p95/p99) that determines whether SLOs are met.

4. What is the purpose of separating cold-start behavior from steady-state measurements in benchmarks?

Correct answer: To avoid mixing initialization effects with normal serving performance and to make results more reproducible
Cold starts can skew latency and throughput; separating them helps produce trustworthy baselines and comparisons.

5. According to the chapter, what should the output of a tuning effort look like for another operator to follow?

Correct answer: A small set of dashboards plus a written tuning log documenting baseline, changes, and cost/perf outcomes
The chapter stresses reproducibility and handoff: dashboards and a tuning log are more valuable than a one-off “hero” session.

Chapter 6: Operate in Production (Observability, Security, Incidents)

Running GPU inference in production is less about “getting a model to respond” and more about continuously delivering predictable user experience under changing load, changing code, and changing infrastructure. As a sysadmin transitioning into a GPU cluster operator, your advantage is discipline: you already think in terms of uptime, change control, blast radius, and recovery. In this chapter you’ll translate that discipline into SLO-driven operations, observability that is actionable (not noisy), practical security controls, and incident response patterns specific to GPU inference.

The biggest mistake teams make in production inference is optimizing the wrong thing. They obsess over peak GPU utilization while users complain about tail latency; they add alerts on every metric and end up ignoring all of them; they treat security as a one-time checklist rather than an operational posture. Your goal is to build a system where you can answer, quickly and confidently: “Are users getting the experience we promised?” and “If not, what should we do next?”

We’ll structure operations around five realities: (1) inference is latency-sensitive and tail latency matters; (2) GPUs are scarce and saturation is common; (3) failures often present as performance degradation (not hard downtime); (4) the cluster is a security boundary as much as it is a scheduler; and (5) upgrades are inevitable—plan for them or they will plan for you. The sections below walk through concrete workflows you can apply immediately.

Practice note: for each of this chapter's milestones—implementing SLOs and dashboards that reflect user experience; instrumenting logs, metrics, and traces with actionable alerts; hardening the cluster and the serving supply chain; running incident response for latency spikes, OOMs, and GPU faults; and planning upgrades without downtime surprises—document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Sections in this chapter
Section 6.1: SLOs for inference—error budgets, latency objectives, saturation signals

Start production operations by defining Service Level Objectives (SLOs) that reflect user experience. For inference, “availability” alone is insufficient; a 200 OK that arrives in 10 seconds is often a failure. A practical SLO set is: request success rate, latency (p50/p95/p99), and quality-of-service constraints (timeouts, max queue time). Choose objectives based on product needs, then map them to measurable signals at the API boundary (gateway or inference service).

Define an error budget: if your SLO is 99.9% success per 30 days, your budget is ~43 minutes of “badness.” Badness includes errors and slow responses beyond the latency SLO. This is where engineering judgment matters: decide what counts as an error for your users. Common approach: any request that exceeds a hard timeout or exceeds p99 latency target is a budget burn event. Tie the budget to change policy: when burn rate is high, freeze risky deploys and focus on reliability work.
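The budget math is worth automating so burn decisions are mechanical rather than argued. A minimal sketch (the window length is configurable; thresholds belong in your budget policy, not in code):

```python
def error_budget_minutes(slo, window_days=30):
    """Allowed minutes of 'badness' per window: (1 - SLO) x window length."""
    return (1 - slo) * window_days * 24 * 60

def burn_rate(bad_fraction, slo):
    """Observed bad-request fraction divided by the allowed fraction.
    A burn rate of 1.0 exhausts the budget exactly at the window's end;
    higher values deplete it proportionally faster."""
    return bad_fraction / (1 - slo)
```

For a 99.9% SLO over 30 days this yields about 43.2 minutes of budget, matching the "~43 minutes" figure above; a 0.5% bad fraction is a 5x burn rate, which under the example policy in this section would stop feature releases.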

Latency objectives should include tail latency. A typical set might be p95 < 300ms and p99 < 800ms for a small model, but you must calibrate to model size and batching strategy. Beware of averaging: mean latency can look fine while p99 is awful due to queueing. To catch this, measure both service time (model execution) and queue time (time waiting for a worker/GPU slot). Queue time is often the first indicator of saturation.

  • Golden signals for inference: latency (p95/p99), traffic (RPS), errors (5xx, timeouts), saturation (queue depth, GPU memory, GPU compute, CPU throttling).
  • Budget policies: “If 2-hour burn rate > 5x, stop feature releases; if 6-hour burn rate > 2x, prioritize mitigation and capacity.”

Common mistakes: setting SLOs on internal metrics (GPU utilization) rather than user-facing metrics; setting one global SLO for all endpoints (different models and tenants behave differently); and alerting on static thresholds without burn-rate context. The practical outcome you want is a small number of dashboards and alerts that tell you whether you are consuming error budget and why.

Section 6.2: Observability stack—Prometheus/Grafana, OpenTelemetry, structured logging

Once SLOs exist, build an observability stack that can explain SLO burn. A common, production-proven foundation in Kubernetes is Prometheus for metrics collection, Grafana for dashboards, and OpenTelemetry (OTel) for traces and metrics export. Logging should be structured (JSON), consistent across services, and correlated with traces.

Metrics: instrument at the gateway and the inference service. At minimum export request count, request duration histograms, response codes, and queue length. Prefer histograms over summaries so you can aggregate correctly across pods. Use labels carefully: label cardinality can melt Prometheus if you include user IDs, prompt text, or high-cardinality request attributes. Use stable labels like model name, route, status class, and tenant (if bounded).

Tracing: use OTel auto-instrumentation where possible for HTTP/gRPC, then add custom spans for the parts that matter: admission/queueing, tokenization/preprocessing, model execution, postprocessing, and downstream calls (vector DB, feature store). For GPU inference, tracing helps you separate “model is slow” from “we are waiting for a worker.” Propagate trace IDs through the gateway and include them in logs so you can pivot from an alert to a specific slow request.

  • Actionable alert example: “p99 latency > SLO for 10m AND burn-rate > 2x” rather than “GPU utilization > 90%.”
  • Log fields to standardize: timestamp, severity, request_id/trace_id, model, endpoint, latency_ms, queue_ms, batch_size, tokens_in/out, error_type.
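A minimal emitter for those standardized fields might look like this. The function name and exact field set are illustrative; what matters is that keys stay stable so logs remain joinable with traces via request_id.

```python
import json
import time

def log_record(request_id, model, endpoint, latency_ms, queue_ms,
               batch_size, tokens_in, tokens_out,
               severity="INFO", error_type=None):
    """Serialize one structured (JSON) log line with the standardized
    fields. Note: no prompt text and no user identifiers -- those create
    cardinality and privacy problems downstream."""
    record = {
        "timestamp": time.time(),
        "severity": severity,
        "request_id": request_id,
        "model": model,
        "endpoint": endpoint,
        "latency_ms": latency_ms,
        "queue_ms": queue_ms,
        "batch_size": batch_size,
        "tokens_in": tokens_in,
        "tokens_out": tokens_out,
        "error_type": error_type,
    }
    return json.dumps(record)
```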

Common mistakes: collecting everything but using nothing; alerting on symptoms without a runbook; and forgetting to test alerts. Treat alerts as code: version them, review them, and verify they page you only for conditions requiring human intervention. The practical outcome is fast diagnosis: from a single latency alert you can determine whether the issue is load-driven queueing, a bad deployment, a downstream dependency, or infrastructure instability.

Section 6.3: GPU monitoring—DCGM metrics, utilization, memory, throttling, ECC errors

GPU inference adds a hardware dimension to operations. NVIDIA’s Data Center GPU Manager (DCGM) exposes metrics that let you distinguish “the model is busy” from “the GPU is unhealthy.” In Kubernetes, DCGM Exporter is commonly deployed as a DaemonSet, scraping per-GPU metrics into Prometheus. This is the basis for dashboards and alerts that catch GPU-specific failure modes before users feel them.

Focus on a small set of signals: GPU utilization (compute), memory used/total, memory bandwidth, temperature, power draw, and throttling reasons. A GPU can show high utilization while performance degrades due to thermal or power throttling. Similarly, memory pressure can cause fragmentation and OOMs at the framework level even when total memory appears “just below” capacity—watch allocation failures and framework logs alongside DCGM memory metrics.

Reliability signals: ECC error counts and Xid errors are critical. ECC correctable errors trending upward may predict a failing card; uncorrectable errors can crash workloads. Xid errors often correlate with driver issues, PCIe problems, or a dying GPU. Build alerts that page only when action is required (e.g., uncorrectable ECC, repeated Xid errors within a window), and create runbooks that include steps to cordon/drain the node, capture diagnostics, and quarantine hardware.

  • Dashboards to include: per-node GPU health (temp/power/throttle), per-deployment GPU usage (util/mem), and “saturation view” correlating queue depth with GPU metrics.
  • Incident pattern: latency spike + stable traffic + rising throttling = likely thermal/power; latency spike + rising queue + high mem = capacity; errors + Xid = hardware/driver.

Common mistakes: using GPU utilization as a success metric (it’s a cost metric); ignoring throttling flags; and failing to separate “node-level GPU problems” from “deployment-level load problems.” The practical outcome is a feedback loop: you can justify capacity adds, identify bad nodes quickly, and avoid chasing phantom application bugs when the GPU is faulting.

Section 6.4: Security fundamentals—RBAC, network policies, secrets, pod security controls

Production inference clusters are attractive targets: they host valuable models, credentials, and high-cost compute. Security fundamentals are operational controls that reduce blast radius and prevent common misconfigurations from becoming incidents. Start with identity and authorization: apply least-privilege RBAC. Create separate namespaces for system components, inference workloads, and shared services. Bind service accounts to narrowly scoped roles; avoid giving workload namespaces access to cluster-wide resources unless there is a clear need.

Network policies: default-deny ingress/egress at the namespace level, then explicitly allow traffic between gateway and inference pods, and from inference pods to required dependencies (object storage, vector DB). This prevents lateral movement and accidental data exfiltration. Remember that many ML runtimes try to download models or dependencies at startup; in production, prefer pre-baked images or controlled artifact repositories so egress can be constrained.

Secrets: never mount cloud credentials broadly. Use Kubernetes Secrets or an external secrets manager, but treat both as sensitive. Rotate regularly and restrict who can read them via RBAC. For inference, common secrets include model registry tokens, TLS private keys, and API gateway credentials. Encrypt secrets at rest (KMS) and ensure pods do not log secrets—structured logging helps here by enforcing field-level hygiene.

  • Pod security controls: run as non-root, drop Linux capabilities, read-only root filesystem where possible, and restrict hostPath mounts (especially dangerous on GPU nodes).
  • Node hardening: restrict SSH access, patch kernel/drivers, and isolate GPU node pools if you run mixed workloads.

Common mistakes: using a single “admin” kubeconfig in automation; allowing unrestricted egress; and mounting Docker socket or privileged containers “just to make it work.” The practical outcome is that a compromised pod cannot trivially reach the control plane, other namespaces, or sensitive systems, and your cluster remains a controlled environment even under pressure during incidents.

Section 6.5: Supply chain security—image scanning, signing, SBOMs, provenance policies

Inference workloads rely heavily on third-party dependencies: CUDA base images, Python packages, model artifacts, and custom operators. Supply chain security turns this dependency graph from an implicit risk into an explicit, enforceable policy. The goal is not perfection; it’s preventing the most likely and most damaging compromises (typosquatting, outdated vulnerable layers, untrusted artifacts).

Implement three layers: scanning, signing, and provenance. First, scan container images in CI and again in the registry for known vulnerabilities (CVEs). Treat scan results as data: set thresholds (e.g., no critical CVEs in runtime images) and build an exception process for unavoidable findings with documented compensating controls. Second, sign images (e.g., Sigstore/cosign) so the cluster can verify that only approved build pipelines produce deployable artifacts. Third, generate SBOMs (Software Bill of Materials) so you can answer “where is log4j?”-style questions quickly across your fleet.

Provenance policies and admission control make this real: use a policy engine (e.g., Gatekeeper/Kyverno) to require signatures, enforce allowed registries, and block privileged settings. For model artifacts, apply the same thinking: store models in a controlled registry/bucket, version them, and restrict who can publish. If your inference pod downloads model weights at runtime, verify checksums and require TLS; better yet, promote models through environments and pin exact versions.
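Checksum verification at startup can be a few lines. A sketch (function name and chunk size are illustrative; the expected digest would come from your model registry's pinned metadata):

```python
import hashlib
import io

def verify_artifact(stream, expected_sha256, chunk_bytes=1 << 20):
    """Stream a model artifact and compare its SHA-256 digest against the
    pinned value. The caller should refuse to load the model on mismatch."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_bytes), b""):
        digest.update(chunk)
    return digest.hexdigest() == expected_sha256

# Usage: with open(weights_path, "rb") as f: ok = verify_artifact(f, pinned)
```

Streaming in chunks matters here because weights are multi-gigabyte files; hashing them whole in memory would compete with the model for RAM during startup.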

  • Practical policy examples: “Only images from registry.company.com are allowed,” “All images must be signed by CI identity,” “Disallow :latest tags,” “Require SBOM attachment for release images.”

Common mistakes: scanning only application code but ignoring base images; allowing developers to bypass policy “temporarily” with no expiration; and treating models as data rather than as executable supply chain inputs. The practical outcome is faster, safer releases: when a vulnerability drops, you can identify affected workloads, rebuild confidently, and enforce that only verified artifacts reach production.

Section 6.6: Day-2 ops—upgrades, capacity planning, disaster recovery, postmortems

Day-2 operations is where GPU clusters succeed or fail: drivers change, Kubernetes versions advance, workloads grow, and hardware ages. Plan upgrades like you would for any critical service, with added care for GPU drivers, CUDA compatibility, and device plugins. Maintain an upgrade matrix: Kubernetes version, NVIDIA driver, container runtime, device plugin, and inference runtime versions that are known-good together. Test upgrades in a staging cluster that mirrors production node types and workloads, including canary inference traffic and synthetic load.

To avoid downtime surprises, use strategies that respect GPU scarcity. Configure PodDisruptionBudgets so you don’t evict too many inference replicas at once, and use node pools so you can roll nodes gradually (surge capacity helps). For model servers that take time to warm up, implement readiness gates that only flip once the model is loaded and a health-check inference succeeds. During upgrades, watch queue depth and p99 latency; this is often where hidden capacity tightness shows up.
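The readiness gate described above can be sketched as a small predicate (names are illustrative; in practice this logic sits behind the pod's readiness probe endpoint):

```python
def readiness(model_loaded, health_infer):
    """Report ready only when the model is loaded AND a health-check
    inference succeeds. health_infer is a callable returning truthy on
    success; any exception counts as not ready rather than crashing."""
    if not model_loaded:
        return False
    try:
        return bool(health_infer())
    except Exception:
        return False
```

The exception guard is the important detail: during a rolling upgrade, a flaky health inference should keep the pod out of rotation, not take the probe handler down with it.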

Capacity planning should be SLO-driven: forecast based on peak RPS, target latency, and concurrency per GPU. Track headroom explicitly (e.g., “we run at 60% peak GPU memory and 70% peak queue utilization”) rather than hoping autoscaling saves you. Autoscaling for GPU is slower and more expensive; combine horizontal pod autoscaling (when possible) with node autoscaling and a clear procurement lead-time plan.

  • Disaster recovery basics: back up cluster state where relevant, store manifests in Git, replicate model artifacts, and document rebuild steps. Practice a “node pool loss” scenario: can you drain bad nodes and maintain SLOs?
  • Postmortems: write blameless, timeline-driven reports. Identify contributing factors (alerts, runbooks, capacity, change process) and track action items to completion.

Common mistakes: upgrading drivers directly on live nodes without a drain/cordon workflow; lacking canaries for inference; and skipping postmortems because the service “came back.” The practical outcome is operational confidence: you can ship changes, absorb incidents like latency spikes or GPU faults, and improve reliability over time instead of relearning the same lessons.

Chapter milestones
  • Implement SLOs and dashboards that reflect user experience
  • Instrument logs/metrics/traces and set actionable alerts
  • Harden the cluster and the serving supply chain
  • Run incident response for latency spikes, OOMs, and GPU faults
  • Plan upgrades and lifecycle management without downtime surprises
Chapter quiz

1. In production GPU inference, which operational focus best reflects the chapter’s guidance on delivering predictable user experience?

Correct answer: Optimize tail latency and measure against user-experience SLOs
The chapter emphasizes SLO-driven operations and that tail latency matters more than peak utilization when users judge experience.

2. What is the primary purpose of implementing SLOs and dashboards in this chapter’s operating model?

Correct answer: To quickly answer whether users are getting the promised experience and what to do next
SLOs and dashboards should enable fast, confident decisions about user experience and next actions, not just resource efficiency.

3. Which alerting approach aligns with the chapter’s definition of actionable observability (not noisy)?

Correct answer: Alert only on signals that map to user impact and drive a clear response
The chapter warns that alerting on everything creates noise; alerts should be tied to user experience and trigger specific actions.

4. How does the chapter suggest thinking about security for a production inference platform?

Correct answer: As an operational posture that includes hardening the cluster and the serving supply chain
It explicitly cautions against checklist security and frames the cluster and supply chain as ongoing security boundaries to harden.

5. According to the chapter, why must incident response and upgrade planning account for failures differently in GPU inference environments?

Correct answer: Because failures often show up as performance degradation and upgrades are inevitable, so planning prevents downtime surprises
The chapter highlights that inference failures often manifest as latency/performance degradation and that upgrades will happen—plan to avoid surprises.