Career Transitions Into AI — Intermediate
Go from IT ops to running fast, reliable GPU inference on Kubernetes.
This course is a short technical book for working sysadmins and Kubernetes operators who want a realistic path into AI infrastructure—without pretending you need to become a data scientist first. You’ll learn how GPU inference changes the operational game (scheduling, reliability, performance, and cost), then build up a practical, production-minded workflow for deploying and tuning model serving on Kubernetes.
The goal is simple: when a team says “we need to serve an LLM reliably on GPUs,” you’ll know how to make the cluster GPU-ready, deploy a serving stack, scale it safely, measure what matters, and keep it stable during upgrades and incidents.
You’ll start by reframing your existing skills—Linux troubleshooting, networking instincts, change control, and observability—into the responsibilities of a GPU cluster operator. Then you’ll progress through GPU enablement, deployment patterns, scheduling strategy, performance tuning, and production operations.
This course is designed for sysadmins, SREs, platform engineers, and Kubernetes operators who are comfortable with Linux and basic Kubernetes primitives, and want to step into AI infrastructure roles. If you’ve ever owned clusters, managed on-call, or debugged “why is this pod stuck,” you’re in the right place.
You do not need prior ML experience. When we touch model-level concepts (like quantization), it’s strictly from an operator’s perspective: what it is, why it affects performance, and what you need to watch for in production.
Each chapter reads like a focused section of a technical handbook: clear mental models, checklists, deployment patterns, and troubleshooting paths you can reuse on the job. You’ll repeatedly connect the three viewpoints that matter in inference operations: performance (latency and throughput), cost, and operational risk.
If you’re ready to turn your operations background into AI infrastructure credibility, start here and work straight through the six chapters—each one builds on the last.
By the end, you’ll have a repeatable blueprint for standing up Kubernetes GPU inference that is measurable, secure, and operable—exactly the skill set hiring teams look for in GPU cluster operators and AI platform engineers.
Platform Engineer, Kubernetes & GPU Systems
Sofia Chen builds Kubernetes platforms for ML teams, focusing on GPU scheduling, inference reliability, and cost control. She has operated mixed-node clusters in production and designed SLO-driven observability for model serving stacks.
Becoming a GPU inference operator is less about “learning AI” and more about upgrading your operational instincts for a new class of workloads. The sysadmin mindset—control blast radius, standardize builds, observe everything, automate repeatability—translates directly. What changes is the shape of failure and the economics of resources: a single misconfigured runtime or scheduling rule can idle a multi-thousand-dollar GPU, while a seemingly small latency regression can break a product SLO.
This chapter establishes the mental model you’ll use for the rest of the course. You will map familiar sysadmin competencies into GPU inference operations responsibilities; define the serving problem in terms of latency, throughput, cost, and safety; sketch a reference architecture for Kubernetes-based inference; and set up a lab plan and operational baseline that matches real production constraints (GitOps, environments, and change control). The goal is practical: by the end of this chapter, you should know what “good” looks like for an inference platform and what to build first so that later optimizations are measurable, reversible, and safe.
One common mistake in career transitions is copying training-oriented playbooks into serving. Training optimizes for maximizing GPU occupancy over long, batch-heavy jobs. Inference optimizes for predictable response times, fast rollouts, and safe multi-tenancy. Your job becomes a steady trade: keep latency low, keep throughput high, keep cost under control, and keep risk contained—while the model, traffic patterns, and dependencies change underneath you.
Practice note for Map sysadmin competencies to GPU inference operations responsibilities: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Define the serving problem: latency, throughput, cost, and safety constraints: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Create a reference architecture for Kubernetes-based inference: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Set up a lab plan: cluster access, GPU nodes, and toolchain checklist: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Establish an operational baseline: GitOps, environments, and change control: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
As a sysadmin, you already operate complex systems: networks, OS images, IAM, storage, monitoring, and incident response. A GPU inference operator keeps those fundamentals but applies them to an inference platform where “the application” is a model server plus a chain of dependencies. The fastest way to transition is to map old tasks to new responsibilities and learn the vocabulary that engineers and ML teams will use when they ask for help.
Vocabulary you must speak comfortably includes: inference runtime (Triton, vLLM, TensorRT-LLM, TorchServe), tokenization, batching (dynamic vs static), concurrency, KV cache, quantization (fp16/int8/int4), router or gateway (rate limiting, auth, routing), and SLO vs SLA. You don’t need to be an ML researcher, but you must be fluent enough to turn requests like “we need faster responses” into concrete operational work: profile latency, adjust batching, right-size GPU requests, or add replicas safely.
The practical outcome: you become the person who can say, “Here is how we’ll run this model in Kubernetes, how it will be secured, how it will scale, and how we’ll know it’s healthy”—using tooling and discipline that looks familiar to any strong sysadmin.
Training and inference both use GPUs, but they behave like different species. Training is typically a long-running, throughput-oriented job that can tolerate queueing, warmup, and occasional retries. Inference is user-facing: it must answer within a budget (latency) and it often experiences bursty demand. Operationally, that flips your priorities from “maximize GPU utilization at all times” to “meet latency SLOs while keeping utilization economically sane.”
Inference introduces three realities that surprise sysadmins coming from general web ops. First, startup and warmup matter. Model servers may take minutes to load weights into GPU memory; autoscaling that ignores this will oscillate and drop traffic. Second, GPU memory is the true hard limit. CPU throttling is annoying; GPU out-of-memory usually kills the process or triggers aggressive fallback behavior. Third, tail latency dominates: a small fraction of slow requests can break the user experience, even if average latency looks fine.
In Kubernetes, workloads claim accelerators through explicit resource requests (e.g., nvidia.com/gpu: 1), and nodes must advertise the resource via the device plugin. You will also use labels/taints to keep GPU nodes dedicated and predictable.

A common mistake is treating the model server like a stateless microservice. Many runtimes hold large in-memory caches (e.g., KV cache for LLMs) and behave best when requests are routed with awareness of active sessions and capacity. Another mistake is applying aggressive autoscaling without measuring cold-start time and without protecting against sudden scale-down that evicts warmed replicas.
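In workload manifests, this shows up as an explicit resource limit plus scheduling hints. A minimal sketch, assuming a taint of gpu=true:NoSchedule and a node label node.kubernetes.io/gpu=true; the names, image, and values are illustrative, not prescribed by the course:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-server                      # hypothetical name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: llm-server
  template:
    metadata:
      labels:
        app: llm-server
    spec:
      nodeSelector:
        node.kubernetes.io/gpu: "true"  # land only on labeled GPU nodes
      tolerations:
        - key: gpu                      # matches the gpu=true:NoSchedule taint
          operator: Equal
          value: "true"
          effect: NoSchedule
      containers:
        - name: runtime
          image: vllm/vllm-openai:latest  # example runtime image
          resources:
            limits:
              nvidia.com/gpu: 1           # one whole GPU per replica
```

Without the nvidia.com/gpu limit the pod would schedule but never receive a device; without the toleration it could not land on a tainted GPU node at all.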
The practical outcome: you will approach inference like operating a latency-critical service with expensive, scarce accelerators—where correct Kubernetes plumbing (runtime + drivers + device plugin), safe scaling, and disciplined rollouts matter as much as the model.
Inference operations lives and dies on measurement. You will be asked, “Is it fast?” “Can it handle traffic?” and “How much does it cost?” Answering with anecdotes creates firefighting; answering with KPIs creates engineering. The core metrics set is small, but you must interpret it correctly and tie it to action.
Engineering judgment is choosing which knob to turn when a metric degrades. If p95 latency worsens while GPU utilization is low, the bottleneck is likely outside the GPU (CPU preprocessing, network, serialization, or an overloaded gateway). If utilization is near 100% and p95 rises sharply, you likely need to reduce per-request cost (quantization, smaller model, faster runtime) or increase capacity (more replicas, more GPUs), possibly with smarter batching and concurrency limits.
Common mistakes include reporting only averages (which hide tail latency), mixing client-side and server-side latency without clarity, and ignoring request mix. A single dashboard number like “QPS” is meaningless if half the requests are short prompts and the other half are long generations. Establish a habit: publish a minimal “serving scorecard” per deployment that includes latency percentiles, throughput, GPU utilization, and an SLO pass/fail indicator.
The practical outcome: you will create a baseline early, then use it to validate changes (new driver, new runtime, new model version, new batching) with confidence rather than guesswork.
Your Kubernetes topology choices determine how painful (or smooth) operations will be. Inference platforms usually start small—one GPU node and a few services—but production tends to become multi-node with mixed hardware, multiple environments, and strict separation of duties. Design with growth in mind without overbuilding.
Single-node (all-in-one) clusters are great for learning and early prototypes: one control plane node with a GPU, plus a gateway and runtime. The failure mode is simple, but you can’t test realistic scheduling, rolling updates, or node replacement. Multi-node clusters introduce the real problems: bin packing GPUs, isolating noisy neighbors, and keeping system components away from accelerators.
Label GPU nodes (e.g., node.kubernetes.io/gpu=true) and use taints (e.g., gpu=true:NoSchedule) so only GPU workloads land there. This prevents “random” system pods from consuming CPU/memory needed by inference.

Common mistakes: relying on default scheduling (which will place pods wherever it can), forgetting to reserve headroom for DaemonSets on GPU nodes (device plugin, monitoring), and assuming all GPUs are interchangeable. Another frequent error is underestimating network and storage: pulling multi-GB model images repeatedly can saturate registries or node disks. Plan for caching (image pre-pull, local registry mirror, or persistent volumes for model artifacts) early.
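In practice you would apply these with kubectl label / kubectl taint or your node-pool tooling, but the resulting node object looks roughly like this (label keys and values are illustrative):

```yaml
apiVersion: v1
kind: Node
metadata:
  name: gpu-node-01
  labels:
    node.kubernetes.io/gpu: "true"
    gpu-class: inference-large   # abstract class that survives SKU changes
spec:
  taints:
    - key: gpu
      value: "true"
      effect: NoSchedule         # only workloads with a matching toleration land here
```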
The practical outcome: you will be able to describe a reference topology—control plane separation, GPU node pools, labeling/tainting strategy, and upgrade plan—and then implement it consistently across dev, staging, and production.
An inference “service” is usually a stack, not a single deployment. Thinking in layers helps you debug faster and secure the right boundary. A practical reference architecture on Kubernetes typically includes: an inference runtime (GPU-bound), an API router/gateway (policy and routing), optional caching, and sometimes retrieval components like a vector store.
Secure configuration is part of the operator mindset. Store credentials (API keys, database passwords, TLS private keys) in Kubernetes Secrets (or an external secrets manager synced into the cluster). Avoid baking secrets into container images or Helm values in plain text. Use least-privilege service accounts and restrict egress where feasible; inference pods often do not need broad internet access in production.
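A sketch of the pattern: credentials live in a Secret (ideally synced in from an external manager), and pods reference them at runtime instead of carrying them in the image. The names and keys below are illustrative:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: gateway-credentials
type: Opaque
stringData:
  api-key: "<injected by your secrets manager>"
---
# In the gateway's pod template, reference the Secret rather than hardcoding it:
# containers:
#   - name: gateway
#     env:
#       - name: API_KEY
#         valueFrom:
#           secretKeyRef:
#             name: gateway-credentials
#             key: api-key
```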
Common mistakes include exposing the runtime directly to the internet (bypassing auth and rate limiting), mixing “admin” endpoints with public endpoints, and skipping request validation. Another is ignoring multi-tenancy: without per-tenant quotas and isolation, one client can saturate GPU concurrency and destroy p95 latency for everyone else.
The practical outcome: you will be able to sketch—and later deploy—a secure inference stack where each component has a clear responsibility, observable health, and controlled configuration, making incidents diagnosable and rollouts safe.
To learn GPU inference operations, you need a lab where you can break things on purpose: swap drivers, test the NVIDIA device plugin, validate scheduling rules, and deploy a serving stack repeatedly. The “right” lab depends on budget, access, and how closely you need to match production.
kind/k3s is excellent for Kubernetes fundamentals and GitOps workflows, but GPU support varies. kind runs Kubernetes-in-Docker and is typically not ideal for direct GPU passthrough unless you carefully configure the host, runtime, and container toolkit. k3s on a GPU-capable VM or small server can be a good middle ground: lightweight, close to real kubelet behavior, and manageable on a single machine.
Managed Kubernetes (EKS/GKE/AKS) gets you production-grade control plane operations, autoscaling primitives, and well-documented GPU node pools. It is the fastest route to practicing real-world patterns like node labels, taints, PodDisruptionBudgets, and rolling node upgrades. The tradeoff is cost and sometimes reduced visibility into the host OS.
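A PodDisruptionBudget is one of those primitives worth practicing early: during rolling node upgrades it keeps a floor of warmed replicas serving traffic. A minimal example (label and threshold are illustrative):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: inference-pdb
spec:
  minAvailable: 2              # never drain below two warmed replicas
  selector:
    matchLabels:
      app: inference-server
```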
Bare metal (or self-managed VMs) gives maximum control and teaches the most about the GPU stack: kernel/driver compatibility, NVIDIA Container Toolkit, runtime class configuration, and troubleshooting device exposure. It also forces you to practice disciplined change control because “just update the driver” can become an outage if you don’t stage and validate.
Establish an operational baseline from day one: GitOps-managed manifests, separate environments (dev/stage/prod or at least dev/prod), and change control that records what changed and why. The practical outcome is repeatability: when you later install drivers, enable the device plugin, deploy an inference runtime, and tune batching/concurrency, you can measure impact and roll back safely.
1. According to the chapter, what is the core shift in becoming a GPU inference operator?
2. Which set of constraints best defines the serving problem in this chapter?
3. Why can a small configuration mistake be especially costly in GPU inference operations?
4. Which approach is identified as a common mistake when transitioning from training-focused work to serving-focused operations?
5. What is the main purpose of establishing an operational baseline (GitOps, environments, change control) early?
As a sysadmin, you’re used to building “known-good” server baselines: consistent BIOS settings, predictable kernel versions, and repeatable configuration management. GPU enablement is the same craft, but with more moving parts and tighter compatibility constraints. A Kubernetes node that merely has a GPU isn’t automatically usable for inference. The GPU must be healthy, the driver must match the CUDA expectations of your workloads, the container runtime must pass through device files correctly, and Kubernetes must advertise those resources so the scheduler can place pods.
This chapter turns GPU support into an operator workflow: verify hardware health and baseline performance; enable container GPU access with the correct runtime configuration; install and validate the NVIDIA device plugin; confirm end-to-end scheduling with test workloads; then document node standards and drift checks so day-2 operations stay stable. The goal is not only “it works once,” but “it stays working after upgrades.”
Think of GPU readiness as a chain. If any link breaks, symptoms often look similar—pods stuck Pending, containers failing at runtime, or inference latency swinging wildly. Your job is to isolate the failure domain quickly, using a repeatable validation playbook and a clear node standard. By the end of this chapter, you’ll have a practical baseline for GPU nodes and the checks that prove they’re ready for inference workloads.
Practice note for Verify GPU hardware/driver health and baseline performance: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Enable container GPU access with the correct runtime configuration: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Install and validate the NVIDIA device plugin on Kubernetes: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Confirm GPU scheduling works end-to-end with test workloads: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Document node standards and drift checks for day-2 operations: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
GPU operations start with understanding what can change underneath you. Unlike CPUs, GPU performance is heavily shaped by firmware and driver settings: ECC, clocks, power limits, and partitioning features like MIG (Multi-Instance GPU). These aren’t “nice to know”—they affect whether inference is stable and whether capacity planning is trustworthy.
MIG (available on certain data center GPUs like A100/H100) partitions one physical GPU into multiple isolated GPU instances. For operators, MIG changes the scheduling unit. Instead of advertising one large GPU, the node may advertise multiple smaller GPU resources. That can dramatically improve utilization for small models, but it also increases operational complexity: you must standardize MIG profiles per node pool, document them, and treat profile changes as a disruptive event (pods may need rescheduling, device enumeration changes, and some frameworks cache device topology).
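With MIG enabled (and the device plugin configured for its mixed strategy), workloads request a named slice instead of a whole GPU. The exact resource names depend on your plugin configuration; this is an illustrative fragment of a container spec:

```yaml
resources:
  limits:
    nvidia.com/mig-1g.10gb: 1   # one 1g.10gb MIG instance, not a full GPU
```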
ECC (error-correcting code memory) improves reliability by detecting/correcting memory errors, but it may reduce usable memory and slightly impact performance. For inference, ECC is usually kept enabled in data center environments. From an operator standpoint, ECC influences how you interpret “out of memory” incidents and capacity. Always capture ECC state in your node inventory and drift checks.
Power limits are another silent variable: operators or vendor tooling can cap GPU power draw (e.g., with nvidia-smi -pl). A power cap can look like “mysterious” performance regression after a maintenance window.

Baseline performance is your early-warning system. Before Kubernetes is involved, run a simple health and telemetry pass on the host: nvidia-smi for inventory and ECC, and a lightweight compute test to establish a reference (even a small CUDA sample or a known inference benchmark). Record GPU name, driver version, power limit, MIG mode, and average utilization under a controlled test. This becomes the “known-good” signature you compare against when a node misbehaves later.
Drivers are the most common source of GPU node drift because they sit at the boundary between the kernel and your containerized workloads. You have two strategies: install NVIDIA drivers on the host (typical for Kubernetes) or attempt to bundle everything in images. In practice, Kubernetes GPU nodes almost always rely on host drivers, because kernel modules and device management are host responsibilities.
What matters operationally is compatibility: the host driver must support the CUDA runtime expectations of your container images. CUDA in the image does not need to match the driver exactly, but it must be within the supported compatibility range. If you treat inference images as “just another container,” you’ll eventually hit failures like CUDA driver version is insufficient for CUDA runtime version or subtle performance issues when libraries fall back to less efficient code paths.
Adopt an upgrade path that is boring and repeatable: pin a driver version per node pool, stage the new driver on a single canary node, run your smoke tests against it, and only then roll out pool by pool with a documented rollback to the previous version.
A common mistake is upgrading the host OS kernel (or enabling unattended upgrades) without validating the NVIDIA driver kernel module rebuild path. The result is a node that boots but loses GPU functionality. Your node standard should explicitly state: supported OS/kernel versions, driver version, and whether Secure Boot is enabled (Secure Boot can block unsigned kernel modules unless handled properly).
Finally, write down how you will detect drift. “It worked yesterday” is not evidence today. Add a simple host-level check (driver loaded, GPU visible, expected ECC/MIG/power state) and treat mismatches as noncompliance. This is the sysadmin mindset translated directly into GPU operator practice.
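One lightweight way to make drift detectable is to write the expected state down in a machine-readable file that a host-level check compares against nvidia-smi output. The format below is purely hypothetical, a sketch of what such an inventory might capture:

```yaml
# expected-gpu-node-state.yaml (hypothetical drift-check input)
driver_version: "535.161.08"   # example pinned driver branch
ecc: enabled
mig_mode: disabled
power_limit_watts: 300
secure_boot: enabled           # unsigned kernel modules would be blocked
```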
Kubernetes doesn’t talk to GPUs directly. It schedules pods, then the container runtime (commonly containerd) and NVIDIA tooling make GPU devices available inside containers. If the runtime path is wrong, your pod may start but won’t see /dev/nvidia*, or it will fail on initialization when CUDA can’t find a device.
On modern clusters, the standard approach is containerd + NVIDIA Container Toolkit. The toolkit configures an NVIDIA-aware runtime so containers can request GPU access without privileged hacks. Operationally, your goal is to make GPU access explicit and least-privilege: only pods that request GPU resources should receive GPU devices and libraries.
Practical setup principles: standardize the containerd and NVIDIA runtime configuration across all GPU nodes, keep Container Toolkit and driver versions within their documented compatibility range, grant GPU access only through explicit resource requests (never blanket privileged pods), and validate the runtime path with a plain GPU container before involving Kubernetes.
Common mistakes include forgetting to restart containerd after changing runtime configuration, mixing incompatible toolkit and driver versions, or assuming that installing CUDA toolkit packages on the host is required (usually it is not for Kubernetes inference; you primarily need the driver).
Enable container GPU access, then validate it outside Kubernetes first if you can: run a simple container that invokes nvidia-smi and a minimal CUDA call. If the container cannot see the GPU on a node, Kubernetes troubleshooting will be noisy and misleading. Getting this layer right narrows your failure domain and makes subsequent device plugin verification straightforward.
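Depending on your setup, containerd may expose the NVIDIA runtime either as the default runtime or via a Kubernetes RuntimeClass. In the latter case the cluster-side object is small; the handler name must match the runtime name configured in containerd:

```yaml
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia   # must match the "nvidia" runtime entry in containerd's CRI config
```

Pods then opt in with runtimeClassName: nvidia in their spec.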
The NVIDIA device plugin is what turns “a GPU exists on this node” into a schedulable Kubernetes resource. Without it, the scheduler can’t see GPUs, and nvidia.com/gpu (or MIG resources) won’t appear in node capacity. Installation is typically done as a DaemonSet in the kube-system namespace or a dedicated GPU-operators namespace, depending on your stack.
Operator workflow for installation and verification: deploy the device plugin DaemonSet, confirm its pods are Running on every GPU node, then check each node’s Capacity and Allocatable for GPU resources. This is the key “Kubernetes sees it” checkpoint.

Then validate with a test workload that requests a GPU. A pod that requests nvidia.com/gpu: 1 should transition from Pending to Running on a GPU node, and inside the container, nvidia-smi should report the device. This confirms the end-to-end chain: scheduler → device plugin advertisement → runtime device injection.
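The end-to-end checkpoint can be a disposable pod like this (the image tag is just an example; any CUDA base image that receives nvidia-smi via the toolkit’s driver injection will do):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-check
spec:
  restartPolicy: Never
  containers:
    - name: smi
      image: nvidia/cuda:12.4.1-base-ubuntu22.04  # example CUDA base image
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1   # forces scheduling onto a node advertising GPUs
```

If this pod stays Pending, suspect the device plugin or taints; if it runs but nvidia-smi fails, suspect the runtime layer.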
Common failures map to specific layers: GPUs missing from node Capacity usually point at the device plugin or its DaemonSet; pods stuck Pending point at scheduling (taints, selectors, or insufficient GPU resources); containers that start but cannot see a device point at the runtime configuration; and some failures are purely image-level (e.g., no nvidia-smi installed in the image even though GPU access works).

As an operator, treat the device plugin as critical infrastructure. Pin its version, monitor its DaemonSet health, and include its logs in your standard incident triage. If the plugin is unstable, everything above it will appear unstable too.
Once GPUs are schedulable, the next step is making scheduling predictable. This is where sysadmin inventory habits become cluster policy. You need a consistent way to answer: “Which nodes have which GPUs, which driver branch, which MIG profile, and which performance class?” Kubernetes labels and taints are the control plane vocabulary for that answer.
Start with a labeling scheme that is stable and meaningful. Examples include GPU vendor/model, memory size, MIG enabled/disabled, and a high-level GPU class label (e.g., gpu-class=inference-small, gpu-class=inference-large) that abstracts away exact SKUs. The abstraction is useful when procurement changes hardware but you want workloads to keep targeting “equivalent” capacity.
Node Feature Discovery (NFD) is a common way to populate hardware labels automatically. It reduces manual errors and helps with day-2 drift detection: if a node loses its GPU driver, labels/resources may change and the node can be quarantined. Consider pairing NFD with a policy that prevents inference workloads from landing on nodes that fail GPU feature checks.
Taint GPU nodes (e.g., gpu=true:NoSchedule) so only workloads with explicit tolerations can land there. Operational labels such as driver-branch=535 or mig-profile=1g.10gb help you roll upgrades and debug issues quickly.

Document these as part of your node standard: which labels must exist, which taints are applied, and what “compliant” looks like. This is not bureaucracy: it prevents surprise scheduling outcomes and makes capacity planning accurate. When an incident occurs, labels also let you scope blast radius: “Only nodes in gpu-class=X with driver-branch=Y are affected.”
GPU readiness is only real if you can prove it repeatedly. Create a validation playbook that you run during provisioning, after upgrades, and when investigating performance anomalies. The playbook should cover smoke tests (fast, automated), burn-in (longer confidence checks), and explicit acceptance criteria (what “ready” means).
Smoke tests are your first gate: nvidia-smi returns expected GPU inventory; the driver version matches the node pool standard; ECC/MIG/power limits match policy; the device plugin is healthy; and a minimal GPU test pod schedules and runs.

Burn-in catches what smoke tests miss: intermittent PCIe errors, thermal throttling under sustained load, or power limit misconfigurations. Run a controlled stress or repeated inference loop for a set duration (e.g., 30–60 minutes) and watch for XID errors, clock drops, or increasing error counts. Keep a simple baseline: expected throughput range and stable temperature/clock behavior. If you can’t hold steady under burn-in, you won’t hold an inference latency SLO in production.
Acceptance criteria should be unambiguous and auditable. For example: “Node reports nvidia.com/gpu capacity, device plugin healthy, test workload completes, no XID errors during burn-in, and performance within ±10% of baseline.” Define what happens on failure: cordon the node, label it noncompliant, and route it to remediation rather than letting workloads randomly fail.
Finally, document node standards and drift checks as day-2 operational guardrails. Store expected driver/toolkit/plugin versions, required labels/taints, and the commands/manifests for your tests. This becomes the GPU equivalent of a golden image checklist—something you can hand to another operator and get the same outcome every time.
1. A Kubernetes node has a GPU installed, but inference pods can’t use it. Which chain of requirements best describes what must be true for the GPU to be usable by workloads?
2. If GPU readiness is a chain, what is the most practical operator mindset when troubleshooting symptoms like Pending pods or runtime failures?
3. Why does the chapter stress verifying GPU health and baseline performance before focusing on Kubernetes components like the device plugin?
4. What role does the NVIDIA device plugin play in making GPUs schedulable for pods?
5. Which statement best captures the chapter’s day-2 operations goal for GPU nodes?
This chapter turns a container image and a model artifact into a reachable, production-leaning endpoint on Kubernetes. As a sysadmin transitioning into GPU cluster operations, you already understand repeatability, change control, and failure domains. Inference deployment is the same story—just with stricter latency constraints and a new resource type (GPUs) that amplifies scheduling mistakes. Your goal is not only “it runs,” but “it runs predictably,” with health checks, secure configuration, controlled traffic exposure, and release mechanics that let you roll forward and roll back safely.
We’ll walk a practical workflow: pick a serving runtime, decide how the model is packaged and loaded, assemble the core Kubernetes objects (Deployment/Service/probes and a few guardrails), expose the service through an ingress layer with TLS and proper timeouts, wire configuration and secrets correctly, and then deploy with progressive delivery patterns (canary/blue-green) that fit inference risks. Along the way, watch for common mistakes: building a custom server when an off-the-shelf runtime would do; shipping weights inside the image with no update path; forgetting startup probes so pods get killed during model load; setting ingress timeouts too low for long generation; and leaking API keys via environment dumps or logs.
By the end, you should be able to take a GPU-capable cluster and deliver a stable endpoint that can be tested, monitored, and updated without drama—exactly the kind of operational confidence that differentiates a cluster operator from a “kubectl deployer.”
Practice note for Containerize or select a serving runtime image and model artifact strategy: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Deploy a GPU-backed inference Deployment with health checks: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Expose the service safely through an Ingress/API gateway path: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Manage configuration and secrets for model endpoints: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Add canary and rollback mechanics for safer releases: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Your first decision is the serving runtime image. This is the “PID 1” inside the pod: it owns model loading, batching, GPU memory management, request handling, and metrics exposure. Choosing well saves weeks of custom work and prevents performance traps.
NVIDIA Triton is a general-purpose inference server for multiple frameworks (TensorRT, ONNX Runtime, PyTorch, etc.). It shines when you need mature features: dynamic batching, model repository management, multi-model serving, and strong observability. Triton is often the best default for classic inference (vision, speech, tabular) and for teams that want consistent operations across models.
vLLM focuses on large language model serving with high throughput using paged attention. It’s a strong choice when you care about token throughput and want an OpenAI-compatible API option. It typically expects GPU nodes with enough memory and benefits from careful concurrency settings. vLLM is less about “many frameworks” and more about “LLMs done efficiently.”
TGI (Text Generation Inference) is another LLM-serving runtime with production features (token streaming, batching, quantization support depending on setup). It’s a common choice when you want an opinionated, ready-to-run LLM server with decent defaults and predictable behavior.
Custom FastAPI (or similar) is appropriate when your inference logic is truly bespoke: custom pre/post-processing, multi-step pipelines, or nonstandard request/response formats. The tradeoff is that you become responsible for performance engineering (batching, threading, GPU contention), health endpoints, metrics, and safe reload behavior. A common mistake is defaulting to custom FastAPI for “control,” then rebuilding features Triton/vLLM/TGI already provide.
Operational judgement: start with an off-the-shelf runtime unless you can name the missing feature and the cost of implementing it. Also, validate GPU support early: ensure your runtime image matches your CUDA and driver expectations, and that the container runtime on the node can expose nvidia.com/gpu resources to pods.
Next, decide how model artifacts (weights, tokenizer files, config) reach the pod. This choice impacts build times, rollout speed, cache behavior, and incident response. Treat model artifacts like large, frequently updated binaries: they need a controlled distribution strategy.
Option A: Bake weights into the image. This is simple—one image tag implies code + model. But the image becomes huge, registry pulls are slow, and rolling back means rolling back the entire image even if only weights changed. It’s acceptable for small models or early prototypes, but it fights the operational need for fast redeploys.
Option B: Mount weights from a Persistent Volume (PV). You build a smaller runtime image and mount a PVC at runtime (for example, /models). This supports faster deployments and can reuse cached weights across pod restarts on the same node (depending on storage). The risk is storage performance: slow network volumes can add seconds to minutes of startup time, and you must manage consistency (what happens if the model updates while pods are running?).
Option C: Init container downloads artifacts. A common production pattern is: an init container pulls a specific model version from object storage into an emptyDir volume shared with the main container. This makes the model version explicit and repeatable (pin by checksum or immutable path) and avoids shipping weights inside the main image. It also lets you add validation (hash check) before the server starts. The tradeoff is startup latency; mitigate it with node-local caching where possible.
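The init-container pattern can be sketched as the following pod-spec fragment; the bucket path, images, and mount point are assumptions to adapt:

```yaml
spec:
  volumes:
    - name: model
      emptyDir: {}
  initContainers:
    - name: fetch-model
      image: amazon/aws-cli:2.15.0   # assumption: weights live in S3-compatible object storage
      # Pin an immutable version path; add a checksum step before letting the server start.
      args: ["s3", "cp", "s3://example-models/my-model/v123/", "/models/", "--recursive"]
      volumeMounts:
        - name: model
          mountPath: /models
  containers:
    - name: server
      image: registry.example.com/inference:v1   # hypothetical runtime image
      volumeMounts:
        - name: model
          mountPath: /models
          readOnly: true
```

Because the init container must succeed before the server starts, a bad download blocks rollout instead of producing a half-loaded endpoint.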
Whichever option you choose, pin model versions to an immutable reference: model:v123 or a content hash path, not “latest.”

Common mistake: treating model files like ordinary config and placing them in a ConfigMap. ConfigMaps are not for multi-GB artifacts and will fail or behave poorly. Instead, use PVs, init downloads, or a proper model repository mechanism (Triton model repo, HF cache volumes, etc.).
With a runtime and packaging plan, you assemble the core Kubernetes objects. At minimum: a Deployment (or StatefulSet if you have strong identity/storage requirements) and a Service to provide stable discovery. For GPU workloads, the Deployment spec must be GPU-aware: request a GPU via resources.limits: nvidia.com/gpu: 1 (and often requests equal limits). Pair that with appropriate CPU/memory requests so the scheduler can place the pod correctly; starving CPU can increase latency because tokenization and networking still run on CPU.
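A skeleton of the Deployment/Service pair might look like this; names, image, and port are placeholders:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inference
spec:
  replicas: 2
  selector:
    matchLabels:
      app: inference
  template:
    metadata:
      labels:
        app: inference
    spec:
      containers:
        - name: server
          image: registry.example.com/inference:v1   # immutable tag, never :latest
          ports:
            - containerPort: 8000
          resources:
            requests:
              cpu: "4"            # tokenization/networking still run on CPU
              memory: 16Gi
              nvidia.com/gpu: 1
            limits:
              cpu: "4"
              memory: 16Gi
              nvidia.com/gpu: 1   # GPU requests equal limits
---
apiVersion: v1
kind: Service
metadata:
  name: inference
spec:
  selector:
    app: inference
  ports:
    - port: 80
      targetPort: 8000
```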
Health checks are where inference differs from typical web apps. Model load can be slow, and GPUs can OOM during warmup. Use three probes intentionally:
- Startup probe: gives the container time to load weights and warm up without being killed; budget the failureThreshold × periodSeconds window for your slowest model load.
- Readiness probe: gates traffic on the model actually being loaded and able to serve, not merely on the process being alive.
- Liveness probe: restarts a wedged server, but keep it lenient enough that it never fires during normal model load or long-running requests.
Engineering judgement: avoid “ping returns 200” readiness checks that don’t validate model availability. Many runtimes expose dedicated endpoints (for example, Triton’s readiness endpoints). If using a custom server, implement a readiness check that verifies the model is loaded and a lightweight inference path is functional.
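For a custom server, the three probes might look like this container-spec fragment; paths, port, and thresholds are assumptions to adapt to your runtime:

```yaml
startupProbe:
  httpGet:
    path: /health/ready    # should verify the model is loaded, not just the process
    port: 8000
  periodSeconds: 10
  failureThreshold: 60     # up to ~10 minutes of model load before giving up
readinessProbe:
  httpGet:
    path: /health/ready
    port: 8000
  periodSeconds: 5
livenessProbe:
  httpGet:
    path: /health/live     # cheap "process responsive" check only
    port: 8000
  periodSeconds: 10
  failureThreshold: 3
```

The startup probe suppresses the liveness probe until it passes, which is what prevents kill-loops during slow model loads.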
Add a PodDisruptionBudget (PDB) to prevent voluntary disruptions (node drains, upgrades) from taking down all replicas at once. For example, with two replicas you might set minAvailable: 1. Without a PDB, routine maintenance can become an outage. Pair this with topology spread or anti-affinity if you have multiple GPU nodes, so replicas don’t land on the same node and fail together.
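The two-replica example maps to a minimal PDB, assuming the pods carry an app: inference label:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: inference-pdb
spec:
  minAvailable: 1          # node drains may evict at most one replica at a time
  selector:
    matchLabels:
      app: inference
```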
Common mistake: running a single replica “for cost savings” and then being surprised by downtime during node maintenance or runtime crashes. Inference endpoints are often customer-facing; budget for at least two replicas when availability matters.
Once the Service works inside the cluster, you need a controlled way to accept external traffic. Most teams use an Ingress controller (NGINX Ingress, Traefik, HAProxy) or a gateway (Kong, Envoy Gateway, API Gateway products). The key is to treat ingress as part of the inference system: it must understand long-lived requests, streaming, and bursty traffic.
TLS should be non-negotiable. Terminate TLS at the ingress/gateway, automate certificates (for example, cert-manager), and prefer modern TLS settings. If your org requires mTLS internally, handle that between gateway and service mesh or directly between gateway and backend. Avoid exposing your inference Service as a plain LoadBalancer with no authentication or TLS—it’s an easy way to leak a costly GPU endpoint to the internet.
Inference requests can be slow relative to typical web APIs, especially for LLM generation. Configure timeouts deliberately:
- Set proxy read/send timeouts above your worst-case generation time, not at a generic web default like 30 or 60 seconds.
- Keep connect timeouts short so dead backends fail fast.
- Align client, gateway, and server timeouts so one layer doesn’t silently cut a request mid-generation while the others keep working.
If you support streaming responses, ensure the ingress supports it and doesn’t buffer responses unexpectedly. Misconfigured buffering can break token streaming and inflate latency. Also consider path-based routing, such as /v1/models/my-model, to provide a stable contract while allowing internal service changes.
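With NGINX Ingress, the timeout and buffering concerns above map to annotations; the values here are illustrative, not recommendations:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: inference
  annotations:
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"   # seconds; cover worst-case generation
    nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-buffering: "off"      # required for token streaming to work
spec:
  tls:
    - hosts: ["api.example.com"]        # hypothetical host; cert via cert-manager
      secretName: inference-tls
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /v1/models/my-model   # stable path contract from the chapter
            pathType: Prefix
            backend:
              service:
                name: inference
                port:
                  number: 80
```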
Operational outcome: by placing policy (TLS, auth, timeouts, request limits) at the edge, your backend pods can focus on inference. When incidents happen, you can throttle or shed load at the gateway instead of crashing GPU pods in a feedback loop.
Inference deployments often fail operational reviews not because of accuracy, but because of poor configuration hygiene. Treat configuration as an interface: easy to change, validated, and separated from secrets. Kubernetes gives you primitives, but you must use them with discipline.
Use ConfigMaps for non-sensitive settings: model name/version selectors, runtime flags (batch size, max tokens), logging level, and feature toggles. Mount them as files when the runtime expects config files, or inject as environment variables when that’s simpler. Prefer explicit configuration over “magic defaults,” because performance tuning (concurrency, batching) becomes iterative.
Use Secrets for sensitive data: API keys for upstream services, private model repository credentials, TLS client keys, database passwords, and signing keys. Avoid putting secrets in container images, command-line args (visible in process lists), or ConfigMaps. Be cautious with environment variables too—debug endpoints and crash dumps can leak them. Mounting secrets as files with least privilege is often safer.
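Mounting a Secret as read-only files rather than environment variables might look like this pod-spec fragment; the secret name and mount path are placeholders:

```yaml
spec:
  volumes:
    - name: api-keys
      secret:
        secretName: inference-api-keys
        defaultMode: 0400              # least privilege: owner read-only
  containers:
    - name: server
      volumeMounts:
        - name: api-keys
          mountPath: /etc/secrets      # runtime reads key files from here
          readOnly: true
```

File mounts also pick up Secret updates without a pod restart (after the kubelet sync delay), which environment variables never do.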
For higher maturity, integrate an external secret store (Vault, AWS Secrets Manager, GCP Secret Manager, Azure Key Vault) via External Secrets Operator or CSI Secret Store. This improves rotation and auditability. The operational judgement is to keep the Kubernetes Secret as a short-lived projection of a source-of-truth secret, not a long-lived artifact manually edited in-cluster.
Common mistake: mixing “model endpoint configuration” with “cluster operational configuration.” Keep application config in the app namespace, and keep cluster-wide ingress/controller config owned by platform operators. This separation enables safer delegation and clearer incident ownership.
Inference releases are risky because changes can affect not just availability but output quality, latency, and cost. You need mechanics that support fast rollback and controlled exposure. Kubernetes Deployments provide rolling updates, but rolling updates alone are not always enough when “bad responses” are worse than “no response.”
Blue/green is conceptually simple: run two versions side by side (blue = current, green = new). You validate green with internal traffic and then switch the gateway/Service selector to green. Rollback is a switch back to blue. This works well when you can afford duplicate GPU capacity during the cutover window and want very clear operational control.
Canary releases shift a small percentage of traffic to the new version first. You watch metrics—error rate, p95 latency, GPU memory usage, and domain-specific signals (response validity checks, moderation rates, or offline eval proxies). If healthy, increase traffic gradually. If not, reduce to zero and investigate. Canary is more cost-efficient than blue/green but requires better traffic shaping (via gateway, service mesh, or an ingress that supports weighted routing).
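With NGINX Ingress, a canary is a second Ingress resource pointing at the new version’s Service; the weight and names here are illustrative:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: inference-canary
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-weight: "10"   # ~10% of traffic to the new version
spec:
  rules:
    - host: api.example.com           # hypothetical host
      http:
        paths:
          - path: /v1/models/my-model
            pathType: Prefix
            backend:
              service:
                name: inference-v2    # new version's Service
                port:
                  number: 80
```

Rollback is setting the weight to "0" (or deleting the canary Ingress), which is exactly the fast, rehearsable path the chapter asks for.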
Progressive delivery means coupling deployment steps to signals. Tools like Argo Rollouts or Flagger can automate canaries based on metrics, but even manual progressive delivery follows the same discipline: define what “good” looks like, measure it quickly, and have a rehearsed rollback path. For inference, include performance SLOs in the go/no-go gate; a model that is “correct” but doubles latency can still be a production regression.
Common mistake: releasing a new model by overwriting the existing tag (for example, :latest) and letting nodes pull “whatever.” Immutable tags and controlled rollout patterns are how operators make inference systems boring—in the best way.
1. What is the primary operational goal when deploying a GPU inference service on Kubernetes in this chapter?
2. Which choice best reflects the chapter’s guidance on selecting a serving approach?
3. Why does the chapter highlight using startup probes in addition to other health checks for inference pods?
4. When exposing an inference service through an ingress layer, what configuration concern is specifically called out for long-running generation?
5. Which deployment approach best matches the chapter’s recommended strategy for safer inference releases?
As a sysadmin moving into GPU cluster operations, you’re no longer just keeping nodes “up.” You’re shaping how expensive, scarce accelerators are allocated so teams can ship reliable inference without stepping on each other. The difference between a cluster that feels predictable and one that feels chaotic is usually not the model code—it’s resource modeling, placement controls, and guardrails that prevent noisy-neighbor behavior.
GPU scheduling in Kubernetes is deceptively simple on the surface: request nvidia.com/gpu and the scheduler places your pod on a node with that many GPUs available. In practice, inference workloads couple GPU, CPU, memory, storage, and network. If you get those couplings wrong, you’ll see “mysterious” latency spikes, intermittent OOMs, or pods stuck in Pending even though “there are GPUs free.” This chapter gives you the operator’s toolkit: how to model resources, steer placement, enforce fairness, choose safe sharing strategies, and build runbooks that shorten incidents.
Your goal is the same one you’ve always had in operations: fair sharing under load, reliable execution under failure, and predictable performance under change. The GPU simply raises the stakes.
Practice note for Implement GPU-aware resource requests/limits and quality-of-service: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Control placement with taints, tolerations, affinity, and topology constraints: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Introduce quotas and priority to prevent noisy neighbor incidents: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Enable safe sharing strategies (MIG, time-slicing, or isolation): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Create runbooks for stuck scheduling and GPU resource leaks: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Kubernetes treats GPUs as an extended resource, typically exposed by the NVIDIA device plugin as nvidia.com/gpu. Unlike CPU and memory, GPUs are not overcommitted by default: requesting 1 GPU generally means exclusive assignment of a whole device to that container (unless you intentionally enable a sharing mechanism covered later). That exclusivity is good for predictability, but it can trick operators into under-modeling the rest of the pod.
Inference services usually need non-trivial CPU for request parsing, tokenization, post-processing, TLS, and telemetry. If you request a GPU but starve CPU, the pod will schedule fine and then fail your SLOs because the GPU sits idle waiting for CPU-side work. Similarly, memory requests matter even when “the model is on the GPU.” CPU RAM is used for queues, KV caches (depending on architecture), runtime buffers, and batching. Treat GPU requests as the anchor and size CPU/memory around it.
For Quality of Service (QoS), the practical rule is: set requests to what you need to hit steady-state SLOs, and set limits to cap worst-case behavior. For CPU, allowing some burst (limit > request) can be useful, but for memory you typically want request=limit to avoid node-level eviction surprises. GPU is usually request=limit (since it’s an integer device count). A common mistake is leaving CPU requests at tiny defaults (or none), which can place many GPU pods on the same node and create a CPU bottleneck that looks like “GPU latency.”
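The rule of thumb above, as a resources stanza (the core/memory numbers are this chapter’s example shape, not universal values):

```yaml
resources:
  requests:
    nvidia.com/gpu: 1
    cpu: "4"           # sized from measurement, never left at tiny defaults
    memory: 16Gi
  limits:
    nvidia.com/gpu: 1  # GPU: request = limit (integer device count)
    cpu: "6"           # modest CPU burst headroom (limit > request)
    memory: 16Gi       # memory: request = limit to avoid node-eviction surprises
```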
Operator outcome: you can justify resource requests with measurements and prevent QoS-related flakiness. You also gain a clear language for teams: “A 1-GPU inference pod is also a 4-core, 16GiB pod,” which makes capacity planning realistic.
GPU clusters become manageable when you stop thinking of “nodes” and start thinking of pools: groups of machines with the same GPU type, driver stack, and performance profile. A pool might be “L4 inference,” “A100 high-throughput,” or “T4 dev/test.” Node pools let you control cost, blast radius, and upgrade cadence (for example, rolling the CUDA driver only on one pool at a time).
Placement control in Kubernetes is a layered toolkit:
- Labels with nodeSelector or node affinity to select the hardware class (for example, gpu.nvidia.com/class=L4, node.kubernetes.io/instance-type=...).
- Taints and tolerations to keep unrelated workloads off GPU nodes (for example, nvidia.com/gpu=true:NoSchedule on GPU nodes).
- Topology spread constraints and anti-affinity so replicas don’t share a failure domain.

A strong default is: taint all GPU nodes and require inference pods to tolerate the taint. This prevents accidental scheduling of non-GPU workloads onto expensive nodes and reduces “surprise” GPU node pressure from background jobs. Then, use labels plus nodeSelector or node affinity to select the right GPU class. Keep the selection logic simple and explicit; complex affinity expressions are hard to debug during incidents.
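The taint-plus-label default looks like this; node name and the L4 class label follow the chapter’s example:

```yaml
# One-time node setup (kubectl shown; most provisioners can apply these at node boot):
#   kubectl taint nodes <node> nvidia.com/gpu=true:NoSchedule
#   kubectl label nodes <node> gpu.nvidia.com/class=L4
#
# Pod-spec side: tolerate the taint and select the GPU class explicitly.
spec:
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  nodeSelector:
    gpu.nvidia.com/class: L4
```

A pod without both the toleration and the selector either stays off GPU nodes entirely (CPU workloads) or lands on the wrong hardware class, which is exactly the failure mode the layering prevents.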
Common mistake: using only labels without taints. Labels alone don’t stop a generic CPU workload from landing on a GPU node if it fits CPU/memory. That seems harmless until CPU-heavy system jobs crowd out inference CPU, creating latency spikes while GPUs remain allocated but underfed.
Practical outcome: you can safely run mixed clusters where CPU-only tenants don’t “steal” GPU nodes, and GPU tenants land on the correct hardware class. This also sets you up for predictable upgrades: drain and rotate one labeled pool at a time without impacting unrelated tenants.
When inference gets serious, the scheduler decision “has a GPU available” is not enough. Topology matters: CPU sockets, NUMA nodes, PCIe lanes, and GPU interconnect (NVLink) can change throughput and tail latency. As an operator, you don’t need to memorize hardware diagrams, but you do need the instinct to ask: “Is the pod’s CPU close to its GPU, and are multi-GPU jobs using the right GPUs?”
NUMA effects show up when a pod’s CPU threads and memory allocations land on a different NUMA node than the GPU’s PCIe root. The symptom is counterintuitive: GPU utilization may appear fine, but end-to-end latency rises due to slower host-to-device transfers and memory access. For multi-GPU inference (tensor parallelism, large models, or high batch throughput), PCIe locality and NVLink topology affect collective communication and can dominate performance.
In practice, topology tuning is iterative: run a load test, observe p95/p99 latency, then correlate with node-level metrics (CPU steal, memory bandwidth, PCIe errors, NIC saturation). A common operator misstep is focusing only on GPU metrics (utilization, memory) and ignoring host-side bottlenecks. If your gateway, tokenizer, or batching thread is constrained, the GPU can be “busy” while the service still misses SLOs.
Practical outcome: you learn to treat inference as a full-system workload, not just a device allocation problem. This skill pays off when teams upgrade models and suddenly demand multi-GPU placement or tighter tail latency.
Multi-tenancy is where GPU clusters fail most often: one team’s experiment can starve another team’s production inference. Kubernetes gives you strong controls, but only if you actually use them. Start by treating namespaces as tenancy boundaries for policy and accounting: per-team namespaces with scoped RBAC, network policies, and secrets management.
Then add fairness controls that are meaningful for GPUs:
- ResourceQuotas capping nvidia.com/gpu, CPU, and memory per namespace. This prevents runaway scaling or “just one more replica” incidents that drain the pool.
- PriorityClasses that separate production inference from experiments, so the scheduler protects SLO-bearing workloads under pressure.

The engineering judgment is in choosing quotas and priorities that reflect reality. If you set quotas too low, teams will route around policy (multiple namespaces, shadow clusters). If you set them too high, you have no guardrail. A practical approach is to allocate a baseline quota per team plus a “burst” pool managed by a separate namespace or cluster autoscaler policy, with explicit approval for temporary increases.
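A per-team quota covering GPUs alongside CPU and memory might look like this; namespace and numbers are placeholders (note that extended resources like GPUs are capped via the requests. prefix):

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a
spec:
  hard:
    requests.nvidia.com/gpu: "8"   # hard ceiling on GPUs this namespace can hold
    requests.cpu: "64"
    requests.memory: 256Gi
```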
Priority and preemption are powerful but dangerous. Preemption can evict lower-priority pods to schedule higher-priority ones, which is great for protecting production SLOs. But if you allow preemption without disruption budgets or without clear communication, you’ll create confusing outages for lower-tier workloads. Define expectations: which jobs are preemptible, how they should checkpoint, and what “best effort” means in your org.
Practical outcome: you can prevent noisy neighbor incidents before they happen, explain capacity tradeoffs in policy terms, and maintain a cluster where teams trust scheduling outcomes instead of fighting them.
GPU sharing is tempting because it improves utilization, but it can destroy predictability if applied blindly. You have three broad strategies: hard partitioning, soft sharing, or isolation (no sharing). The correct choice depends on workload shape (latency vs throughput), tenant trust, and the blast radius you can tolerate.
MIG (Multi-Instance GPU) is hard partitioning available on certain NVIDIA GPUs (notably A100/H100 class). It splits one physical GPU into hardware-isolated slices with dedicated memory and compute resources. MIG is often the best option for multi-tenancy because it provides stronger isolation than simple sharing, and scheduling becomes “request a MIG slice” rather than “share a GPU.” Operationally, MIG adds complexity: you must configure MIG profiles on nodes and align Kubernetes resource names with what the device plugin advertises.
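Once MIG profiles are configured and the device plugin advertises them, a pod requests a slice instead of a whole device. The resource name depends on your plugin’s MIG strategy and chosen profile; this is one common shape, which you should verify against what your nodes actually advertise:

```yaml
resources:
  limits:
    nvidia.com/mig-1g.5gb: 1   # example profile resource name; check node allocatable first
```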
Time-slicing (or software-based sharing) allows multiple pods to share a GPU by time-multiplexing. It can raise aggregate throughput for bursty or dev workloads but can increase tail latency due to contention and context switching. It also increases the chance that one tenant’s behavior impacts another’s performance. Use time-slicing for non-SLO workloads, internal experimentation, or batch inference where latency variance is acceptable.
A common mistake is enabling sharing to “fix” capacity issues without adding tenant controls (quotas, priorities) or without updating runbooks. Sharing changes failure modes: a single bad deployment can now degrade multiple services on the same GPU rather than just consuming “its” device.
Practical outcome: you can choose a sharing method that matches the organization’s risk tolerance, communicate the tradeoffs to stakeholders, and avoid the trap of chasing utilization at the expense of reliability.
GPU scheduling incidents often look like “pods stuck in Pending” or “pod scheduled but no GPU available at runtime.” Your runbook should start with fast triage and then branch by symptom. The goal is to distinguish: (1) genuine lack of capacity, (2) placement constraints you created, (3) node-level GPU stack failures, and (4) bin-packing fragmentation.
Start with kubectl describe pod and read the scheduler events. Look for messages like “0/10 nodes available: insufficient nvidia.com/gpu” (capacity), “node(s) had taint … that the pod didn’t tolerate” (policy), or “didn’t match node affinity” (placement).

Fragmentation is a frequent “we have GPUs, why can’t we schedule?” problem. Example: a cluster has many nodes with 1 free GPU each, but a workload requests 2 GPUs in a single pod; it will remain pending. Similarly, a pod might need GPU plus high CPU/memory; a node may have a free GPU but not enough CPU requested, so the scheduler can’t place it. This is why resource modeling (Section 4.1) and pool design (Section 4.2) matter: they reduce odd-shaped requests that don’t fit.
Add a second runbook track for GPU resource leaks: cases where a pod terminates but the node appears to have GPUs “in use.” Often this is an accounting issue (stale pod objects, kubelet delays) or a hung process holding device files. Your steps: verify Kubernetes thinks the GPU is allocated (node allocatable vs allocated), check for orphaned processes on the node, and consider draining the node if it can’t recover cleanly. Document when a reboot is acceptable and how to do it safely for the pool.
Practical outcome: you can resolve scheduling stalls quickly, avoid guesswork, and translate symptoms into concrete fixes—whether that’s adjusting tolerations, repairing the GPU stack, or redesigning requests to reduce fragmentation.
1. A team reports pods stuck in Pending even though "there are GPUs free." Based on the chapter, what is a likely cause to check first?
2. Which set of controls is primarily used to steer *where* GPU workloads run for predictable placement?
3. What is the main purpose of introducing quotas and priority in a multi-tenant GPU cluster?
4. Which option best describes *safe sharing strategies* for expensive GPU accelerators in this chapter?
5. Why does the chapter recommend creating runbooks for stuck scheduling and GPU resource leaks?
Inference is where your Kubernetes and sysadmin instincts matter most: you are turning scarce GPU time into a user-visible service with measurable latency and controllable cost. “Fast” is not a single number. A model can have excellent tokens/second but terrible p95 latency under bursty traffic; it can also be low-latency but expensive due to poor GPU utilization. This chapter gives you a repeatable workflow: benchmark like you mean it, tune the runtime, apply model-level optimizations, remove system bottlenecks, scale safely, and then translate results into a cost/performance scorecard that supports change decisions.
As a GPU cluster operator, your job is not to “make graphs look good.” Your job is to meet an SLO (for example: p95 end-to-end latency under 800 ms for short prompts; or p99 under 4 s for long responses) while keeping throughput high and cost predictable. The practical loop looks like: define a workload profile, measure under controlled conditions, change one variable at a time, and record both performance and cost impact. If you can’t reproduce a benchmark, you can’t trust improvements.
Throughout the sections, you’ll see the same pattern repeated: establish a baseline; separate cold-start behavior from steady-state; watch queueing; validate resource headroom; and always track p50/p95/p99, not just averages. Your output should be a small set of dashboards and a written tuning log that another operator can follow, not a one-off “hero” tuning session.
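The p50/p95/p99 habit needs no heavy tooling. A sketch using sort and awk, assuming you’ve captured one per-request latency (in milliseconds) per line in a hypothetical latencies.ms file from your load test:

```shell
# Nearest-rank percentile approximation over a latency capture.
# "latencies.ms" is a hypothetical file: one millisecond value per line.
sort -n latencies.ms | awk '{a[NR]=$1}
  END {
    print "p50", a[int(NR*0.50)]
    print "p95", a[int(NR*0.95)]
    print "p99", a[int(NR*0.99)]
  }'
```

Run it against the same traffic profile before and after each change, and record the three numbers in your tuning log; averages alone will hide the tail regressions this chapter warns about.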
Practice note for Establish a repeatable benchmark methodology for inference services: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Tune runtime parameters: batching, concurrency, and caching: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Apply model-level optimizations: quantization basics and compilation awareness: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Right-size resources and reduce bottlenecks (CPU, memory, network, storage): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Create a cost/perf scorecard to guide change decisions: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Performance tuning starts with a benchmark you can rerun after every change. For inference services, your “workload” is more than requests per second (RPS). It includes input sizes, output sizes (max tokens), request mix (chat vs embeddings vs rerank), and concurrency patterns (steady load, spiky bursts, diurnal ramps). Start by writing down two to three representative scenarios, such as: (1) short prompts, short outputs; (2) long prompts, short outputs; (3) long prompts, long outputs. Tie each scenario to an SLO and a business intent (interactive chat vs background summarization).
Choose or create a dataset that matches production shape. Avoid benchmarking with a single fixed prompt: it hides tokenizer variance and cache behavior. For LLMs, record distributions: prompt tokens, completion tokens, and sampling parameters. For vision or speech, record typical input resolution/length. Save the dataset in version control or object storage with a content hash so you can prove comparability across runs.
Warmup is mandatory. GPUs, model runtimes, and kernels have “first-run” costs (graph compilation, CUDA context init, page faults). If you include those in your steady-state latency, you’ll chase the wrong problem. Run a warmup phase until metrics stabilize, then start measuring. Separate cold-start SLOs (pod startup + model load) from steady-state SLOs (request latency). In Kubernetes terms, measure: time-to-Ready, time-to-first-token (TTFT), and time-to-last-token (TTLT).
Finally, capture both service-side and client-side timing. Client-side includes network and gateway overhead; service-side isolates model execution and queue time. If you can’t separate “compute time” from “waiting in line,” you’ll misattribute bottlenecks and apply the wrong tuning knob.
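The measurement loop above can be sketched as a tiny harness. This is a minimal illustration, not a load-testing tool — `send_request` is a placeholder for your real client, and the percentile method is a simple nearest-rank calculation:

```python
import statistics
from typing import Callable

def run_benchmark(send_request: Callable[[], float],
                  warmup: int = 50, samples: int = 500) -> dict:
    """Run warmup requests (discarded), then measure steady-state latency.

    send_request is any callable that performs one request and returns
    its end-to-end latency in seconds (a stand-in for your real client).
    """
    for _ in range(warmup):        # absorb first-run costs: graph
        send_request()             # compilation, CUDA context init, caches
    latencies = sorted(send_request() for _ in range(samples))

    def pct(p: float) -> float:    # simple nearest-rank percentile
        idx = min(len(latencies) - 1, int(p / 100 * len(latencies)))
        return latencies[idx]

    return {"p50": pct(50), "p95": pct(95), "p99": pct(99),
            "mean": statistics.mean(latencies)}

# A synthetic client whose occasional slow requests show why averages lie:
fake = iter([0.10] * 50 + [0.10, 0.12, 0.11, 0.10, 0.50] * 100)
result = run_benchmark(lambda: next(fake))
# mean looks fine (~0.19 s) while p95 is 0.50 s
```

The synthetic data makes the chapter's point concrete: a tail of slow requests barely moves the mean but dominates p95.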
Most inference performance wins come from runtime configuration rather than Kubernetes changes. The core tradeoff is simple: you want higher GPU utilization (throughput) without creating excessive queueing (latency). Batching and concurrency are the two levers that control how requests share GPU time.
Batching combines multiple requests into a single GPU execution step. It improves throughput by amortizing overhead, but it can increase p95 latency because requests wait to form a batch. Use dynamic batching with a small max batch size and a short batch delay (for example, a few milliseconds) for interactive workloads. A common mistake is setting batch delay too high, which makes TTFT feel “stuck” even when tokens/sec looks great. Benchmark both TTFT and TTLT when adjusting batching.
Concurrency is how many in-flight requests the runtime accepts. Too low and you underutilize the GPU; too high and you create queueing and memory pressure (KV cache growth for LLMs). Many operators tune concurrency by watching two metrics: GPU utilization and request queue time. Increase concurrency until utilization is consistently high, then stop when p95 latency starts rising sharply—this knee is your practical operating point.
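Finding that knee can be mechanical once you have one benchmark run per concurrency level. A minimal sketch — the 50% jump threshold and the numbers are arbitrary illustrations, not standards:

```python
def find_operating_point(measurements, latency_slo):
    """measurements: list of (concurrency, p95_latency_s, throughput)
    tuples from separate benchmark runs, sorted by concurrency.
    Returns the highest concurrency whose p95 meets the SLO and whose
    marginal p95 increase is not 'sharp' (here: >50% vs the previous
    step -- an illustrative threshold, tune it for your workload)."""
    best = measurements[0]
    for prev, cur in zip(measurements, measurements[1:]):
        _, p95, _ = cur
        if p95 > latency_slo:
            break                 # past the SLO: stop
        if p95 > prev[1] * 1.5:
            break                 # the knee: latency rising sharply
        best = cur
    return best

# Hypothetical sweep results: (concurrency, p95 seconds, req/s)
runs = [(1, 0.12, 10), (2, 0.13, 19), (4, 0.15, 36),
        (8, 0.18, 64), (16, 0.45, 80), (32, 1.20, 85)]
best = find_operating_point(runs, latency_slo=0.8)
# best == (8, 0.18, 64): at 16, p95 jumps 2.5x for only 25% more throughput
```

The shape of the sweep matters more than the exact threshold: throughput gains flatten while tail latency climbs, and the operating point sits just before the climb.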
Max tokens (and related limits like max sequence length) are safety rails that also shape performance. Unbounded outputs can hijack capacity and destroy tail latency. Set sane defaults and enforce per-tenant limits if you run multi-tenant inference. For chat systems, consider separate pools: a low-latency pool with tight max tokens and a “long-form” pool with looser limits and different SLOs.
Speculative decoding (conceptually) uses a smaller “draft” model to propose tokens and a larger model to verify them, often improving perceived latency and throughput. Operationally, it introduces new knobs: draft model selection, acceptance rate, and extra memory footprint. Treat it like a runtime feature that changes your resource profile; benchmark it under your real prompt lengths, because acceptance rates vary with task type. Do not assume it’s always a win—under some distributions it adds overhead.
Caching can mean prompt caching, prefix caching, or KV cache reuse depending on runtime. It can dramatically reduce compute for repeated prefixes (system prompts, templates), but it increases memory usage and can complicate multi-tenant isolation. The operational judgment is to enable caching when request repetition is high and memory headroom exists; otherwise, disable it to protect tail latency and avoid OOM cascades.
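Whether repetition is high enough to justify caching can be estimated from request logs before touching any runtime knob. A rough sketch, using a fixed character prefix as a stand-in for token prefixes:

```python
from collections import Counter

def prefix_repeat_rate(prompts, prefix_chars=64):
    """Estimate how much prompt/prefix caching could help: the fraction
    of requests whose leading prefix has been seen before. Characters
    are a crude proxy for tokens, but fine for a first-pass estimate."""
    counts = Counter(p[:prefix_chars] for p in prompts)
    repeats = sum(c - 1 for c in counts.values())
    return repeats / len(prompts)

# Hypothetical request log: shared system-prompt template dominates.
logs = (["You are a helpful assistant. Summarize: doc A"] * 3
        + ["You are a helpful assistant. Summarize: doc B"] * 2
        + ["Translate to French: hello"])
rate = prefix_repeat_rate(logs, prefix_chars=40)
# ~0.67: two thirds of requests share an already-seen prefix
```

A high rate argues for enabling prefix caching (given memory headroom); a rate near zero means the cache is mostly overhead.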
Once runtime knobs are sensible, model-level and kernel-level optimizations are the next step. As an operator, you don’t need to invent new quantization methods, but you do need to understand what changes risk accuracy, what changes affect memory, and what changes require different deployment artifacts.
FP16/BF16 is usually the default for modern GPU inference because it reduces memory bandwidth and often speeds kernels while keeping accuracy close to FP32. The practical win is often higher throughput and lower memory use (larger effective batch/concurrency). Verify that your runtime and model weights are actually using the intended precision; it’s easy to think you’re on FP16 while falling back to FP32 due to unsupported ops.
INT8 quantization can yield significant speedups and memory savings, especially for transformer-heavy workloads, but it introduces calibration and potential accuracy regressions. The key tradeoff: better cost/perf versus quality risk and operational complexity. Treat INT8 as a controlled rollout: benchmark latency/throughput and run task-relevant quality checks (even lightweight ones) before full promotion. A common mistake is relying on generic perplexity metrics while your production task is summarization, extraction, or code completion—choose a quality proxy that matches the workload.
Quantization formats matter operationally. Post-training quantization is simpler to adopt; quantization-aware training can yield better quality but requires training infrastructure. Some runtimes support weight-only quantization (reduces weight memory) while compute remains higher precision; others quantize activations too. Ask: what’s the bottleneck—compute, memory, or bandwidth? Pick the method that addresses the real constraint.
Compilation awareness (TensorRT, ahead-of-time engines) changes the deployment lifecycle. A TensorRT engine can be faster than a generic runtime path, but it is sensitive to GPU architecture, driver/CUDA versions, and sometimes dynamic shapes. That means you must version and cache the compiled artifacts, and you may need a build step per GPU type. Operationally, build engines in CI or a controlled build job, store them in an artifact repository, and ensure your pods validate engine compatibility at startup. If you compile on startup, cold-start times can explode and break your readiness expectations.
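A startup compatibility gate can be as simple as comparing engine metadata against the node before marking the pod ready. The field names and checks below are illustrative, not a real TensorRT API; actual engines carry their own versioned metadata:

```python
from dataclasses import dataclass

@dataclass
class EngineMeta:
    gpu_arch: str       # e.g. "sm_90" -- hypothetical metadata fields
    cuda_version: str   # e.g. "12.4"
    model_hash: str     # hash of the weights the engine was built from

def validate_engine(meta: EngineMeta, node_arch: str, node_cuda: str,
                    expected_hash: str) -> list[str]:
    """Return a list of incompatibilities; empty means safe to serve.
    Run this in the startup path so a mismatched artifact fails fast
    instead of crashing mid-request or silently recompiling."""
    problems = []
    if meta.gpu_arch != node_arch:
        problems.append(f"engine built for {meta.gpu_arch}, node is {node_arch}")
    if meta.cuda_version.split(".")[0] != node_cuda.split(".")[0]:
        problems.append(f"CUDA major mismatch: {meta.cuda_version} vs {node_cuda}")
    if meta.model_hash != expected_hash:
        problems.append("model hash mismatch: wrong or stale artifact")
    return problems

meta = EngineMeta(gpu_arch="sm_90", cuda_version="12.4", model_hash="abc123")
issues = validate_engine(meta, node_arch="sm_80", node_cuda="12.2",
                         expected_hash="abc123")
# issues == ["engine built for sm_90, node is sm_80"]
```

Failing readiness on a non-empty list keeps a wrong-architecture engine from ever taking traffic, which is the cheap alternative to compiling on startup.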
Finally, keep a “known-good” baseline path (for example, FP16 without compilation) so you can roll back quickly if an optimization introduces instability. Performance wins are only wins if the service stays reliable.
When GPU utilization is low, the GPU is often not the problem. Inference services have CPU work (tokenization, request parsing, TLS, logging), memory pressure (KV cache, page cache), and network overhead (gateway hops, gRPC/HTTP2). Your sysadmin background shines here: treat the node like a system, not just a scheduler target.
CPU pinning and isolation: Tokenization and networking can become CPU-bound, especially at high RPS. Ensure your pods request enough CPU and consider pinning critical pods to dedicated cores on GPU nodes (where supported by your cluster policy). A common failure mode is running the GPU container with tiny CPU requests; Kubernetes then throttles it under contention, creating “mysterious” latency spikes while GPU sits idle.
IO and model loading: Cold starts are often dominated by pulling large images and loading multi-GB weights. Use local SSD or a fast network filesystem for model artifacts, and avoid repeated downloads by using node-level caching or an init container that validates presence. Measure time-to-Ready separately from request latency so you don’t confuse scaling issues with inference performance.
Networking: If you route through an API gateway and service mesh, you add hops and sometimes per-request overhead (mTLS, retries). For high-throughput internal services, gRPC can reduce overhead versus JSON/HTTP, but be consistent. Watch for packet drops, conntrack exhaustion, or overly aggressive timeouts that cause retries and amplify load. Always include gateway latency and upstream timeouts in your tracing so tail latency is attributable.
Request queueing: Queueing is the invisible killer of p95/p99. You may see stable compute time but rising end-to-end latency because requests pile up. Expose metrics for queue depth, time spent waiting, and active in-flight requests. If queue time rises, you can respond by lowering concurrency (to protect latency), adding replicas (to increase capacity), or changing routing (to protect interactive traffic). The mistake is to push concurrency higher “to use the GPU,” which can worsen tail latency and trigger OOM due to growing per-request state.
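Separating queue time from service time only requires three timestamps per request. A minimal sketch, assuming you record enqueue, execution-start, and finish times:

```python
def attribute_latency(records):
    """Each record: (enqueue_ts, start_ts, finish_ts) in seconds.
    Separates 'waiting in line' from 'compute' so that rising
    end-to-end latency points at the right tuning knob."""
    queue = [start - enq for enq, start, fin in records]
    service = [fin - start for enq, start, fin in records]
    n = len(records)
    return {"avg_queue": sum(queue) / n,
            "avg_service": sum(service) / n,
            "queue_fraction": sum(queue) / (sum(queue) + sum(service))}

# Stable compute time, growing backlog: end-to-end latency rises anyway.
records = [(0.0, 0.01, 0.21), (0.1, 0.25, 0.45), (0.2, 0.60, 0.80)]
stats = attribute_latency(records)
# avg_service stays at 0.20 s while queue time dominates the total
```

When `queue_fraction` climbs while `avg_service` is flat, the fix is capacity or admission control, not model tuning.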
Practical outcome: you should be able to explain low GPU utilization using a short list of bottlenecks (CPU throttle, network overhead, queueing, IO) and validate fixes with before/after traces and node-level metrics.
Autoscaling for GPU inference is different from stateless web apps because scale events are slow (image pull + model load) and GPUs are expensive. You scale to protect SLOs, not to chase perfect utilization. Use Horizontal Pod Autoscaler (HPA) when you can add replicas; consider Vertical Pod Autoscaler (VPA) for right-sizing CPU/memory requests, but be cautious with GPU workloads because restarts are disruptive.
HPA signals: CPU utilization is usually the wrong primary metric for GPU inference. Prefer custom metrics that correlate with user experience and saturation: request queue depth, in-flight requests, p95 latency, or GPU duty cycle (with care). Queue depth is often the most actionable: it rises early and directly indicates backlog. Configure HPA to react before p95 explodes, and validate with load tests that include bursts.
Scale limits and stabilization: Set a max replicas limit based on GPU capacity and a min replicas based on cold-start tolerance. Use stabilization windows to avoid thrashing when traffic fluctuates. A classic mistake is letting HPA scale to zero for large models, then discovering that the first user after idle waits minutes for a warm pod. For interactive services, keep a warm pool (min replicas > 0) or use a smaller “routing” deployment that can shed load gracefully.
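The core HPA calculation is simple enough to reason about by hand: desired = ceil(currentReplicas × currentMetric / targetMetric), clamped to the configured min/max. A sketch applying it to a queue-depth custom metric (stabilization windows and scale policies are omitted for brevity):

```python
import math

def desired_replicas(current: int, queue_depth_per_replica: float,
                     target_depth: float, min_r: int, max_r: int) -> int:
    """The HPA scaling formula applied to a queue-depth custom metric,
    clamped to the pool limits. min_r > 0 keeps a warm pool so the
    first user after idle does not wait minutes for a cold start."""
    desired = math.ceil(current * queue_depth_per_replica / target_depth)
    return max(min_r, min(max_r, desired))

# 4 replicas, each holding 12 queued requests, against a target of 4:
n = desired_replicas(current=4, queue_depth_per_replica=12.0,
                     target_depth=4.0, min_r=2, max_r=10)
# n == 10: the raw desired count is 12, clamped by the GPU-capacity ceiling
```

The clamp is the point: max replicas encodes GPU capacity, and hitting it should be a visible signal (alert or degrade), not a silent plateau.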
VPA for CPU/memory: Tokenization and networking need consistent CPU headroom; VPA can help discover realistic requests/limits and reduce throttling. Apply VPA in recommendation mode first, then roll changes intentionally. Avoid frequent evictions during peak hours.
Cluster-level constraints: Pods cannot scale beyond available GPUs. Pair HPA with Cluster Autoscaler (or a GPU-aware node provisioning system) and make sure node provisioning time is part of your SLO strategy. If adding nodes takes 10–15 minutes, HPA alone won’t save p95 during sudden spikes. In that case, plan reserved headroom, or implement admission control and graceful degradation (lower max tokens, faster model) under load.
The operator deliverable here is a scaling policy that is predictable: it should be obvious when and why the service adds replicas, what the upper bound is, and what happens when you hit it.
Performance tuning is incomplete without cost discipline. GPUs magnify waste: a small misconfiguration can burn thousands per month. FinOps for inference starts with a utilization target that matches your risk tolerance. For interactive workloads, you might accept 40–60% steady utilization to preserve headroom for bursts; for batch workloads, you might target 70–90% with looser latency SLOs.
Build a cost/perf scorecard: For every meaningful change (batching, quantization, new runtime, autoscaling tweak), record: p50/p95/p99 latency, throughput (requests/sec and tokens/sec), GPU utilization, GPU memory headroom, error rate, and cost per 1k requests (or per 1M tokens). Tie the scorecard to a specific node type and pricing model so the math is explicit. This turns tuning from opinion into evidence.
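The scorecard itself can be a small structure whose cost math is explicit. All numbers below are hypothetical; tie yours to your actual node type and pricing model:

```python
from dataclasses import dataclass

@dataclass
class Scorecard:
    change: str                # what was changed, in one line
    p95_ms: float
    p99_ms: float
    tokens_per_sec: float
    gpu_util_pct: float
    node_usd_per_hour: float   # pin the node type and pricing model

    @property
    def usd_per_million_tokens(self) -> float:
        """Unit cost of useful work, derived from measured throughput."""
        tokens_per_hour = self.tokens_per_sec * 3600
        return self.node_usd_per_hour / tokens_per_hour * 1_000_000

# Hypothetical before/after rows for one tuning change:
baseline = Scorecard("fp16 baseline", 320, 710, 2400, 55, 4.10)
candidate = Scorecard("int8 + batching", 290, 690, 3900, 78, 4.10)
# candidate wins on both latency and unit cost -- evidence, not opinion
```

A change that improves unit cost but pushes p99 past the SLO fails the scorecard; both columns have to hold for a promotion.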
Bin-packing and scheduling: Use GPU-aware scheduling to pack compatible workloads onto the same node when safe. Techniques include node labeling by GPU type, taints/tolerations for isolation, and affinity to keep latency-sensitive pods on less-contended nodes. If you use MIG or fractional GPU sharing, track contention carefully—bin-packing can improve cost but hurt tail latency if memory or PCIe bandwidth becomes shared bottlenecks.
Reserved capacity strategy: If your traffic baseline is stable, reserved instances/committed use can reduce unit cost significantly. Keep on-demand capacity for bursts and experimentation. The operational trick is to align reservations with “always-on” services and keep a smaller flexible pool for unpredictable demand. Review monthly utilization and resize reservations; over-reserving is just as wasteful as underutilizing GPUs.
Common cost mistakes: running too many warm replicas “just in case,” using oversized GPU types for small models, ignoring CPU/memory overhead that forces bigger nodes, and failing to cap max tokens (which can create unbounded cost per request). Your best lever is often policy: enforce limits, route long requests differently, and publish a clear SLO/cost contract to internal users.
Practical outcome: you can justify why a configuration is “best” not only because it’s fastest, but because it meets SLOs at the lowest cost per unit of useful work.
1. Why does the chapter argue that “fast” is not a single number for inference services?
2. Which workflow best matches the chapter’s recommended repeatable benchmarking loop?
3. When validating inference performance against an SLO, which metric approach does the chapter prioritize?
4. What is the purpose of separating cold-start behavior from steady-state measurements in benchmarks?
5. According to the chapter, what should the output of a tuning effort look like for another operator to follow?
Running GPU inference in production is less about “getting a model to respond” and more about continuously delivering predictable user experience under changing load, changing code, and changing infrastructure. As a sysadmin transitioning into a GPU cluster operator, your advantage is discipline: you already think in terms of uptime, change control, blast radius, and recovery. In this chapter you’ll translate that discipline into SLO-driven operations, observability that is actionable (not noisy), practical security controls, and incident response patterns specific to GPU inference.
The biggest mistake teams make in production inference is optimizing the wrong thing. They obsess over peak GPU utilization while users complain about tail latency; they add alerts on every metric and end up ignoring all of them; they treat security as a one-time checklist rather than an operational posture. Your goal is to build a system where you can answer, quickly and confidently: “Are users getting the experience we promised?” and “If not, what should we do next?”
We’ll structure operations around five realities: (1) inference is latency-sensitive and tail latency matters; (2) GPUs are scarce and saturation is common; (3) failures often present as performance degradation (not hard downtime); (4) the cluster is a security boundary as much as it is a scheduler; and (5) upgrades are inevitable—plan for them or they will plan for you. The sections below walk through concrete workflows you can apply immediately.
Practice note for Implement SLOs and dashboards that reflect user experience: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Instrument logs/metrics/traces and set actionable alerts: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Harden the cluster and the serving supply chain: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Run incident response for latency spikes, OOMs, and GPU faults: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Plan upgrades and lifecycle management without downtime surprises: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Start production operations by defining Service Level Objectives (SLOs) that reflect user experience. For inference, “availability” alone is insufficient; a 200 OK that arrives in 10 seconds is often a failure. A practical SLO set is: request success rate, latency (p50/p95/p99), and quality-of-service constraints (timeouts, max queue time). Choose objectives based on product needs, then map them to measurable signals at the API boundary (gateway or inference service).
Define an error budget: if your SLO is 99.9% success per 30 days, your budget is ~43 minutes of “badness.” Badness includes errors and slow responses beyond the latency SLO. This is where engineering judgment matters: decide what counts as an error for your users. Common approach: any request that exceeds a hard timeout or exceeds p99 latency target is a budget burn event. Tie the budget to change policy: when burn rate is high, freeze risky deploys and focus on reliability work.
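The budget arithmetic is worth having as code so burn-rate alerts are unambiguous. A minimal sketch — the "page on sustained burn well above 1" policy is common practice, not a fixed rule:

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of 'badness' allowed per window at a given success SLO."""
    return (1 - slo) * window_days * 24 * 60

def burn_rate(bad_minutes_so_far: float, elapsed_days: float,
              slo: float, window_days: int = 30) -> float:
    """Ratio of actual budget consumption to the even-spend pace.
    >1.0 means you are consuming budget faster than the window allows;
    sustained rates well above 1 are the usual paging condition."""
    budget = error_budget_minutes(slo, window_days)
    return (bad_minutes_so_far / budget) / (elapsed_days / window_days)

budget = error_budget_minutes(0.999)   # ~43.2 minutes per 30 days
rate = burn_rate(bad_minutes_so_far=10, elapsed_days=2, slo=0.999)
# rate ~3.5: at this pace the whole month's budget is gone in ~9 days
```

Tying deploy freezes to a concrete burn rate ("freeze when sustained rate > 2") removes the argument about whether reliability work can wait.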
Latency objectives should include tail latency. A typical set might be p95 < 300ms and p99 < 800ms for a small model, but you must calibrate to model size and batching strategy. Beware of averaging: mean latency can look fine while p99 is awful due to queueing. To catch this, measure both service time (model execution) and queue time (time waiting for a worker/GPU slot). Queue time is often the first indicator of saturation.
Common mistakes: setting SLOs on internal metrics (GPU utilization) rather than user-facing metrics; setting one global SLO for all endpoints (different models and tenants behave differently); and alerting on static thresholds without burn-rate context. The practical outcome you want is a small number of dashboards and alerts that tell you whether you are consuming error budget and why.
Once SLOs exist, build an observability stack that can explain SLO burn. A common, production-proven foundation in Kubernetes is Prometheus for metrics collection, Grafana for dashboards, and OpenTelemetry (OTel) for traces and metrics export. Logging should be structured (JSON), consistent across services, and correlated with traces.
Metrics: instrument at the gateway and the inference service. At minimum export request count, request duration histograms, response codes, and queue length. Prefer histograms over summaries so you can aggregate correctly across pods. Use labels carefully: label cardinality can melt Prometheus if you include user IDs, prompt text, or high-cardinality request attributes. Use stable labels like model name, route, status class, and tenant (if bounded).
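The reason to prefer histograms is that cumulative bucket counts from different pods aggregate by simple addition, while precomputed quantiles (summaries) cannot be merged. A pure-Python sketch of the Prometheus-style mechanics; the bucket boundaries are illustrative:

```python
BUCKETS = [0.1, 0.25, 0.5, 1.0, float("inf")]  # seconds ('le' boundaries)

def to_histogram(latencies):
    """Cumulative bucket counts, Prometheus-style (le = less-or-equal)."""
    return [sum(1 for x in latencies if x <= b) for b in BUCKETS]

def merge(*hists):
    """Histograms from different pods aggregate by addition --
    precomputed per-pod quantiles do not."""
    return [sum(col) for col in zip(*hists)]

def quantile(hist, q):
    """Estimate a quantile from cumulative counts (upper bucket bound)."""
    rank = q * hist[-1]
    for bound, count in zip(BUCKETS, hist):
        if count >= rank:
            return bound
    return BUCKETS[-1]

pod_a = to_histogram([0.05, 0.08, 0.2, 0.3])    # a fast pod
pod_b = to_histogram([0.4, 0.6, 0.9, 0.95])     # a slower pod
fleet_p95 = quantile(merge(pod_a, pod_b), 0.95)
# fleet_p95 == 1.0: the fleet-wide tail, computed from summed buckets
```

Averaging each pod's own p95 would have produced a meaningless number here; summing buckets first is the only correct aggregation.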
Tracing: use OTel auto-instrumentation where possible for HTTP/gRPC, then add custom spans for the parts that matter: admission/queueing, tokenization/preprocessing, model execution, postprocessing, and downstream calls (vector DB, feature store). For GPU inference, tracing helps you separate “model is slow” from “we are waiting for a worker.” Propagate trace IDs through the gateway and include them in logs so you can pivot from an alert to a specific slow request.
Common mistakes: collecting everything but using nothing; alerting on symptoms without a runbook; and forgetting to test alerts. Treat alerts as code: version them, review them, and verify they page you only for conditions requiring human intervention. The practical outcome is fast diagnosis: from a single latency alert you can determine whether the issue is load-driven queueing, a bad deployment, a downstream dependency, or infrastructure instability.
GPU inference adds a hardware dimension to operations. NVIDIA’s Data Center GPU Manager (DCGM) exposes metrics that let you distinguish “the model is busy” from “the GPU is unhealthy.” In Kubernetes, DCGM Exporter is commonly deployed as a DaemonSet, scraping per-GPU metrics into Prometheus. This is the basis for dashboards and alerts that catch GPU-specific failure modes before users feel them.
Focus on a small set of signals: GPU utilization (compute), memory used/total, memory bandwidth, temperature, power draw, and throttling reasons. A GPU can show high utilization while performance degrades due to thermal or power throttling. Similarly, memory pressure can cause fragmentation and OOMs at the framework level even when total memory appears “just below” capacity—watch allocation failures and framework logs alongside DCGM memory metrics.
Reliability signals: ECC error counts and Xid errors are critical. ECC correctable errors trending upward may predict a failing card; uncorrectable errors can crash workloads. Xid errors often correlate with driver issues, PCIe problems, or a dying GPU. Build alerts that page only when action is required (e.g., uncorrectable ECC, repeated Xid errors within a window), and create runbooks that include steps to cordon/drain the node, capture diagnostics, and quarantine hardware.
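A "repeated Xid errors within a window" page can be expressed as a small sliding-window counter. The threshold and window below are illustrative policy, not NVIDIA guidance:

```python
from collections import deque

class XidAlert:
    """Page only on repeated Xid errors inside a sliding window, not on
    a single transient event. Threshold/window values are illustrative;
    set them from your own fleet's failure history."""
    def __init__(self, threshold: int = 3, window_s: float = 600):
        self.threshold, self.window_s = threshold, window_s
        self.events = deque()

    def record(self, ts: float) -> bool:
        """Record one Xid error at timestamp ts; True means page now."""
        self.events.append(ts)
        while self.events and ts - self.events[0] > self.window_s:
            self.events.popleft()      # expire events outside the window
        return len(self.events) >= self.threshold

alert = XidAlert(threshold=3, window_s=600)
fired = [alert.record(t) for t in (0, 100, 1000, 1050, 1100)]
# fired == [False, False, False, False, True]: two early events expire,
# but three errors inside ten minutes trip the page
```

The same windowing shape works for correctable-ECC trend alerts; the point is that one event is diagnostics, a cluster of events is a page plus the cordon/drain runbook.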
Common mistakes: using GPU utilization as a success metric (it’s a cost metric); ignoring throttling flags; and failing to separate “node-level GPU problems” from “deployment-level load problems.” The practical outcome is a feedback loop: you can justify capacity adds, identify bad nodes quickly, and avoid chasing phantom application bugs when the GPU is faulting.
Production inference clusters are attractive targets: they host valuable models, credentials, and high-cost compute. Security fundamentals are operational controls that reduce blast radius and prevent common misconfigurations from becoming incidents. Start with identity and authorization: apply least-privilege RBAC. Create separate namespaces for system components, inference workloads, and shared services. Bind service accounts to narrowly scoped roles; avoid giving workload namespaces access to cluster-wide resources unless there is a clear need.
Network policies: default-deny ingress/egress at the namespace level, then explicitly allow traffic between gateway and inference pods, and from inference pods to required dependencies (object storage, vector DB). This prevents lateral movement and accidental data exfiltration. Remember that many ML runtimes try to download models or dependencies at startup; in production, prefer pre-baked images or controlled artifact repositories so egress can be constrained.
Secrets: never mount cloud credentials broadly. Use Kubernetes Secrets or an external secrets manager, but treat both as sensitive. Rotate regularly and restrict who can read them via RBAC. For inference, common secrets include model registry tokens, TLS private keys, and API gateway credentials. Encrypt secrets at rest (KMS) and ensure pods do not log secrets—structured logging helps here by enforcing field-level hygiene.
Common mistakes: using a single “admin” kubeconfig in automation; allowing unrestricted egress; and mounting Docker socket or privileged containers “just to make it work.” The practical outcome is that a compromised pod cannot trivially reach the control plane, other namespaces, or sensitive systems, and your cluster remains a controlled environment even under pressure during incidents.
Inference workloads rely heavily on third-party dependencies: CUDA base images, Python packages, model artifacts, and custom operators. Supply chain security turns this dependency graph from an implicit risk into an explicit, enforceable policy. The goal is not perfection; it’s preventing the most likely and most damaging compromises (typosquatting, outdated vulnerable layers, untrusted artifacts).
Implement three layers: scanning, signing, and provenance. First, scan container images in CI and again in the registry for known vulnerabilities (CVEs). Treat scan results as data: set thresholds (e.g., no critical CVEs in runtime images) and build an exception process for unavoidable findings with documented compensating controls. Second, sign images (e.g., Sigstore/cosign) so the cluster can verify that only approved build pipelines produce deployable artifacts. Third, generate SBOMs (Software Bill of Materials) so you can answer “where is log4j?”-style questions quickly across your fleet.
Provenance policies and admission control make this real: use a policy engine (e.g., Gatekeeper/Kyverno) to require signatures, enforce allowed registries, and block privileged settings. For model artifacts, apply the same thinking: store models in a controlled registry/bucket, version them, and restrict who can publish. If your inference pod downloads model weights at runtime, verify checksums and require TLS; better yet, promote models through environments and pin exact versions.
Common mistakes: scanning only application code but ignoring base images; allowing developers to bypass policy “temporarily” with no expiration; and treating models as data rather than as executable supply chain inputs. The practical outcome is faster, safer releases: when a vulnerability drops, you can identify affected workloads, rebuild confidently, and enforce that only verified artifacts reach production.
Day-2 operations is where GPU clusters succeed or fail: drivers change, Kubernetes versions advance, workloads grow, and hardware ages. Plan upgrades like you would for any critical service, with added care for GPU drivers, CUDA compatibility, and device plugins. Maintain an upgrade matrix: Kubernetes version, NVIDIA driver, container runtime, device plugin, and inference runtime versions that are known-good together. Test upgrades in a staging cluster that mirrors production node types and workloads, including canary inference traffic and synthetic load.
To avoid downtime surprises, use strategies that respect GPU scarcity. Configure PodDisruptionBudgets so you don’t evict too many inference replicas at once, and use node pools so you can roll nodes gradually (surge capacity helps). For model servers that take time to warm up, implement readiness gates that only flip once the model is loaded and a health-check inference succeeds. During upgrades, watch queue depth and p99 latency; this is often where hidden capacity tightness shows up.
Capacity planning should be SLO-driven: forecast based on peak RPS, target latency, and concurrency per GPU. Track headroom explicitly (e.g., “we run at 60% peak GPU memory and 70% peak queue utilization”) rather than hoping autoscaling saves you. Autoscaling for GPU is slower and more expensive; combine horizontal pod autoscaling (when possible) with node autoscaling and a clear procurement lead-time plan.
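The forecast in that sentence is Little's law: in-flight requests = arrival rate × latency. A sketch, assuming the concurrency-per-GPU figure comes from your own benchmarks:

```python
import math

def gpus_needed(peak_rps: float, avg_latency_s: float,
                concurrency_per_gpu: int, headroom: float = 0.7) -> int:
    """Little's law: in-flight requests = arrival rate x latency.
    Divide by per-GPU concurrency and a target utilization (headroom)
    to get a replica count. All inputs are measured, not guessed."""
    in_flight = peak_rps * avg_latency_s
    return math.ceil(in_flight / (concurrency_per_gpu * headroom))

# Hypothetical: 120 req/s peak, 0.8 s average latency, 16 concurrent
# requests per GPU at the chosen operating point, 70% target utilization.
n = gpus_needed(peak_rps=120, avg_latency_s=0.8, concurrency_per_gpu=16,
                headroom=0.7)
# 96 requests in flight / (16 * 0.7) ~ 8.6 -> 9 GPUs
```

The headroom factor is where "track headroom explicitly" becomes arithmetic: lowering it to 0.5 for bursty interactive traffic buys burst absorption at a visible, justifiable GPU cost.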
Common mistakes: upgrading drivers directly on live nodes without a drain/cordon workflow; lacking canaries for inference; and skipping postmortems because the service “came back.” The practical outcome is operational confidence: you can ship changes, absorb incidents like latency spikes or GPU faults, and improve reliability over time instead of relearning the same lessons.
1. In production GPU inference, which operational focus best reflects the chapter’s guidance on delivering predictable user experience?
2. What is the primary purpose of implementing SLOs and dashboards in this chapter’s operating model?
3. Which alerting approach aligns with the chapter’s definition of actionable observability (not noisy)?
4. How does the chapter suggest thinking about security for a production inference platform?
5. According to the chapter, why must incident response and upgrade planning account for failures differently in GPU inference environments?