AI Certifications & Exam Prep — Intermediate
Run GPU-accelerated AI on Kubernetes—fast, scalable, and cost-aware.
AI workloads stress Kubernetes in ways typical web apps do not: GPUs are scarce and expensive, scheduling constraints are stricter, scaling signals are different, and a single misconfigured request can burn budget fast. This book-style lab course teaches you the practical patterns used to run training jobs and inference services on GPU-enabled Kubernetes clusters—while staying exam-ready and cost-aware.
You’ll work through a coherent six-chapter progression: from building a GPU-capable lab environment, to enforcing correct placement and multi-tenant controls, to autoscaling, troubleshooting, and finally governance and cost guardrails. Each chapter is structured like a short technical book chapter with milestones and sub-sections so you can study sequentially or revisit specific topics when preparing for certification-style tasks.
By the end, you will have a repeatable approach for deploying GPU-backed workloads with clear resource sizing, predictable scheduling behavior, and observability that surfaces GPU bottlenecks quickly. You’ll also implement the policies and controls that reduce accidental spend and enforce team boundaries in shared clusters.
This course is designed for Kubernetes users who can already read YAML and operate basic workloads, and now need production-grade patterns for AI and GPU use cases. If you’re preparing for an AI platform, MLOps, or Kubernetes-adjacent certification that expects hands-on competence, the chapter milestones will feel like timed lab objectives.
Chapter 1 establishes the lab and validates GPU capability so every later exercise is grounded in a working environment. Chapter 2 focuses on the scheduling primitives that determine whether GPU pods land where they should—and why they sometimes don’t. Chapter 3 applies those primitives to real workload types: inference services and batch training jobs with storage, rollouts, and performance hygiene. Chapter 4 adds scaling: selecting the right signals and safely combining pod scaling with node scaling. Chapter 5 teaches you to interpret the system under pressure, using metrics, events, and capacity analysis to troubleshoot latency, OOMs, and fragmentation. Chapter 6 closes with governance and cost controls—then a capstone sequence that mirrors common certification lab tasks.
If you’re ready to build exam-ready Kubernetes AI skills through a structured, lab-first book format, you can Register free and start the first chapter. Want to compare options before committing? You can also browse all courses on Edu AI.
Senior Platform Engineer (Kubernetes, MLOps, FinOps)
Sofia Chen is a senior platform engineer specializing in Kubernetes platform design for ML and GPU workloads. She has led cost-optimization and autoscaling initiatives across multi-tenant clusters, integrating observability and policy controls to keep AI infrastructure reliable and auditable.
This lab course assumes you already know basic Kubernetes objects and can read YAML. The goal of Chapter 1 is to make your cluster “GPU-ready” in a way that is repeatable under exam-style time pressure: you will build a consistent toolchain, provision a GPU node pool, validate drivers, install the NVIDIA device plugin, and stand up enough observability to troubleshoot scheduling and performance issues without guesswork.
GPU enablement is not a single switch. A working setup requires alignment across layers: the cloud (or local) GPU hardware, the host OS drivers, the container runtime configuration, Kubernetes scheduling and admission behavior, and a plugin that advertises GPU resources to the kubelet. The most common failure mode is to validate only one layer (for example, “nvidia-smi works on the node”) and assume the rest will follow. In this chapter, you’ll verify each dependency in sequence and record a checklist you can re-run later.
Engineering judgment matters even in a lab: you’ll choose between local and cloud environments based on cost and reproducibility; you’ll select Kubernetes versions that match plugin support; and you’ll decide what “minimum viable observability” looks like for GPU workloads. By the end, you should be able to deploy a pod that requests nvidia.com/gpu and confirm it is scheduled correctly, visible to metrics, and ready for the later chapters on scheduling constraints, autoscaling patterns, multi-tenancy guardrails, and FinOps controls.
Practice note for Milestone 1 (Build the lab environment and toolchain): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone 2 (Provision a GPU node pool and validate drivers): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone 3 (Install NVIDIA device plugin and run a smoke test): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone 4 (Baseline observability for nodes, pods, and GPUs): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone 5 (Capture a reproducible lab checklist for exam-style tasks): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Your first design decision is where to run the lab. Local clusters (kind, k3d, MicroK8s) are great for learning Kubernetes objects, but GPU labs add constraints: you need a CUDA-capable GPU, matching drivers on the host, and a container runtime stack that can pass the device into containers. If you already have an NVIDIA GPU workstation, local can be fast and cheap. If you do not, cloud GPU nodes usually save time and reduce driver friction.
Cloud clusters (managed Kubernetes such as EKS/AKS/GKE) are the most exam-realistic because they resemble production environments: separate node pools, autoscaling, IAM policies, and standardized images. They also align naturally with later outcomes like cost guardrails and cost-aware scheduling. The tradeoff is spend. For labs, optimize for “short, repeatable sessions”: use a dedicated GPU node pool with small instance types, scale it to zero when not in use, and apply time-based shutdown automation if your platform supports it.
Milestone 1 (toolchain) starts here: choose one environment, then standardize your tools. At minimum install kubectl, helm, a terminal YAML editor, and a way to authenticate to the cluster. Keep a single “lab repo” with manifests and notes so you can reproduce tasks quickly—this is essential for exam-style performance.
Before installing anything GPU-related, confirm your Kubernetes baseline is compatible. The NVIDIA device plugin tracks Kubernetes feature changes (especially around device allocation, security contexts, and runtime behavior). As a rule, use a supported Kubernetes version within the “current minus a few” range recommended by the plugin’s documentation and your managed service. Avoid combining very old Kubernetes with new container runtimes or vice versa; GPU enablement is sensitive to version skew.
Next, identify your container runtime. Most clusters today use containerd. Docker Engine is less common for kubelets, and “Docker shim” is gone in modern Kubernetes. GPU workloads require the runtime to understand the NVIDIA container stack (or a compatible CDI configuration), otherwise containers will start but not see devices. Don’t postpone this check: many “device plugin installed but no GPUs appear” issues are runtime misconfiguration, not the plugin itself.
Milestone 2 (provision a GPU node pool) begins with prerequisites: pick a node image that is known to work with GPUs. Managed services often offer GPU-optimized images or add-ons. If you roll your own nodes, ensure the OS kernel version supports the driver version you plan to install. Also confirm node sizing: GPU nodes need adequate CPU and memory for data loading and preprocessing; starving the node will make GPU utilization look “bad” even when scheduling is correct.
Quick validation checklist:
- Run kubectl get nodes -o wide and ensure nodes are Ready.
- Confirm the container runtime with kubectl get nodes -o jsonpath='{.items[0].status.nodeInfo.containerRuntimeVersion}'.
- Label GPU nodes (e.g., nodepool=gpu) to support later scheduling with node affinity and taints.

Common mistake: mixing multiple runtimes or custom runtime configs across node pools. For a lab, standardize the GPU pool first, then expand complexity later when you study multi-tenancy and cost-aware scheduling.
The GPU hardware is invisible to Kubernetes until the host OS can drive it and containers can access it. Start with the node-level truth: SSH into a GPU node (or use your cloud provider’s session manager) and run nvidia-smi. If nvidia-smi fails, do not proceed to Kubernetes plugin steps; fix drivers first. This is Milestone 2’s “validate drivers” checkpoint.
Driver installation strategy depends on your platform. Many managed Kubernetes services provide an official GPU driver installer (as an add-on or a DaemonSet). That approach is generally safer than manual installs because it aligns kernel modules, driver versions, and reboot behavior. If you install drivers yourself, match the driver to the GPU generation and the CUDA compatibility you need. For labs, you don’t need the newest CUDA; you need a stable match.
Next is the container layer: NVIDIA Container Toolkit (or a CDI-based setup) configures the runtime to mount GPU devices and inject required libraries into containers. On containerd, this usually means configuring an NVIDIA runtime or enabling CDI and restarting containerd. A frequent pitfall is to install drivers and toolkit but forget the runtime restart, leaving pods unable to see GPUs until the node is recycled.
Finally, consider using a RuntimeClass to make GPU runtime selection explicit. In some environments you may define a runtime class (e.g., nvidia) pointing to an NVIDIA-aware runtime handler. This reduces ambiguity and makes your manifests clearer for exams and for later multi-tenant controls. If your environment uses a single runtime with CDI enabled, you may not need a runtime class—but you should still understand when it’s required.
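As a sketch, an explicit runtime class might look like the following. The handler name must match a runtime handler actually defined in your containerd configuration; “nvidia” is a common convention in NVIDIA-enabled environments, not a guarantee, so treat both the class name and the image tag as assumptions to verify against your own setup:

```yaml
# Hypothetical RuntimeClass making GPU runtime selection explicit.
# The handler value must correspond to a runtime entry in your containerd
# config (e.g. a [plugins...containerd.runtimes.nvidia] block).
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
---
# A pod opts in by naming the runtime class:
apiVersion: v1
kind: Pod
metadata:
  name: runtimeclass-check
spec:
  runtimeClassName: nvidia
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.1-base-ubuntu22.04   # illustrative tag
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
```

If your environment relies on CDI with a single default runtime instead, the RuntimeClass object may be unnecessary; the value of writing it out is that the manifest documents intent.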
Validation checkpoint: on the node, nvidia-smi shows the GPU and the driver version.

Milestone 3 is where Kubernetes becomes GPU-aware: the NVIDIA device plugin registers GPU resources with the kubelet so that the scheduler can place pods based on GPU requests. Without it, your nodes may have GPUs physically, but Kubernetes will not advertise nvidia.com/gpu, and pods requesting GPUs will remain Pending.
Deploy the plugin using the vendor-recommended manifest or Helm chart. In a lab, prefer the official installation path because it encodes tolerations, security contexts, and host mounts that change over time. Once installed, the plugin typically runs as a DaemonSet on GPU-capable nodes. If it schedules on non-GPU nodes, that’s not always harmful, but it can create noise; use node selectors or affinity to target your GPU pool labels.
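One way to keep the DaemonSet on GPU nodes is to pass a nodeSelector and tolerations through the chart’s values. A minimal sketch, assuming you installed via the official nvidia-device-plugin Helm chart, labeled your GPU pool nodepool=gpu, and tainted it gpu=true:NoSchedule (all three names are assumptions from this lab, and the value keys should be checked against the chart version you actually install):

```yaml
# values.yaml sketch for the NVIDIA device plugin Helm chart.
# Pins the DaemonSet to the labeled GPU pool and lets it tolerate
# the pool's taint so it can run there at all.
nodeSelector:
  nodepool: gpu
tolerations:
  - key: gpu
    operator: Equal
    value: "true"
    effect: NoSchedule
```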
Common pitfalls to recognize quickly:
- The plugin DaemonSet is Running, but nodes still advertise zero GPUs: recheck node-level nvidia-smi and the runtime config.

Also note the relationship to later scheduling outcomes: once GPUs are advertised as allocatable resources, Kubernetes will enforce GPU requests as integer resources. This is the foundation for cost control (don’t run GPU pods without requesting GPUs) and for safe multi-tenancy (quotas and priority classes depend on accurate resource accounting).
Milestone 3 ends only when you can prove, via Kubernetes, that GPUs are allocatable and usable from a pod. Start with cluster-level inspection. Run kubectl describe node <gpu-node> and look for Capacity and Allocatable entries for nvidia.com/gpu. If the resource isn’t listed, the device plugin isn’t registering devices (or is running on the wrong nodes). Also review kubectl get pods -n kube-system to ensure the plugin DaemonSet has a Ready pod on each GPU node.
Next, run a smoke test pod that requests a GPU. Keep the spec minimal and explicit because you’ll reuse it later when debugging scheduling behavior with taints/tolerations and node affinity. The key is the resource request/limit: GPUs are typically requested via resources.limits (and sometimes requests) as an integer. If you forget the GPU limit, the pod may schedule onto a GPU node but won’t be granted a device, leading to confusing runtime errors.
When the pod starts, execute nvidia-smi inside the container (or run a CUDA sample) to confirm device visibility. If nvidia-smi works in the container, you’ve validated the full chain: driver → runtime → device plugin → Kubernetes scheduling → container execution.
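A minimal smoke-test manifest along these lines (the pod name and image tag are illustrative; pick a CUDA base image whose version is compatible with your installed driver, and the toleration assumes the gpu=true:NoSchedule taint used elsewhere in this lab):

```yaml
# gpu-smoke-test.yaml: request one GPU, print nvidia-smi output, then exit.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  tolerations:
    - key: gpu               # assumed GPU-pool taint from this lab
      operator: Equal
      value: "true"
      effect: NoSchedule
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.1-base-ubuntu22.04   # illustrative tag
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1   # without this, no device is granted
```

Apply it with kubectl apply -f gpu-smoke-test.yaml; if the full chain works, kubectl logs gpu-smoke-test should show the familiar driver table.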
Useful verification commands:
- kubectl get nodes -o custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu to list allocatable GPUs per node.
- kubectl get pod -o wide to confirm the pod landed on the GPU node pool.
- kubectl describe pod to read Events for “Insufficient nvidia.com/gpu” or missing tolerations.

Milestone 5 begins here: capture the exact commands you used and the success criteria (what output proves GPUs are working). In exam scenarios, you’re graded on outcomes; your checklist should map “symptom → command → expected signal → fix.”
Milestone 4 establishes baseline observability so you can troubleshoot performance and cost signals later. For GPU workloads, “pod is Running” is not enough: you need to know whether the GPU is actually utilized, whether the node is CPU/memory bottlenecked, and whether your workload is throttled by I/O or networking. A minimal but effective stack combines node-level metrics, Kubernetes state metrics, and GPU-specific telemetry.
At the node level, node exporter provides CPU, memory, disk, and network metrics. Pair it with Kubernetes metrics sources (commonly kube-state-metrics and a metrics backend such as Prometheus) to understand pod scheduling and resource requests. For GPUs, use NVIDIA’s DCGM (Data Center GPU Manager) exporter to expose utilization, memory usage, temperature, power draw, and sometimes per-process signals. DCGM metrics are essential for cost control: an idle GPU with a Running pod is pure waste, and you want that visible immediately.
Dashboards are not decoration—they shorten incident and lab-debug cycles. Even a single dashboard with (1) GPU utilization and memory, (2) node CPU/memory pressure, and (3) pod counts by namespace helps you correlate “why is training slow?” with real constraints. In later chapters, these same metrics drive autoscaling decisions (HPA/VPA patterns) and multi-tenant protections (quotas and priority classes), so building the habit now pays off.
Close the chapter by updating your reproducible lab checklist: toolchain versions, cluster version, node pool labels/taints, driver/toolkit versions, device plugin install method, smoke test manifest, and the metric endpoints/dashboards you rely on. This checklist is your reset button when something breaks—and your time saver when you need to re-create the environment under exam constraints.
1. What is the main outcome Chapter 1 is aiming for by the end of the lab setup?
2. Why does the chapter emphasize verifying GPU dependencies in sequence instead of relying on a single check like “nvidia-smi works”?
3. Which component is responsible for advertising GPU resources to Kubernetes so the kubelet can schedule GPU requests?
4. Which scenario best reflects the “most common failure mode” described in Chapter 1?
5. What does Chapter 1 consider “minimum viable observability” for GPU readiness?
GPU scheduling is the moment your Kubernetes cluster stops being “a place to run containers” and becomes a dependable platform for AI training and inference. The scheduler has one job: pick a node that can run your pod. For AI, that decision must account for scarce accelerators, heterogeneous hardware, and cost-sensitive capacity. In this chapter you will build the mental model and the muscle memory to place GPU workloads correctly, keep GPU nodes protected, avoid noisy-neighbor failures, and troubleshoot Pending pods quickly.
The workflow you’ll repeat in real environments looks like this: (1) validate GPU discovery via the NVIDIA device plugin, (2) request GPU resources correctly so the scheduler can do its job, (3) enforce placement with labels/affinity and guard GPU nodes with taints/tolerations, (4) apply multi-tenant controls (quotas, limits, priority), and (5) debug scheduling failures using events and scheduler hints. Each milestone in this chapter maps to those steps, and each step directly affects reliability and spend: mis-specified requests waste expensive GPU time; weak placement rules mix inference and training; missing quotas allow a single team to consume the fleet.
Engineering judgment matters throughout. The “most strict” placement rule is not always the best; overly rigid policies strand capacity and drive autoscalers to add nodes unnecessarily. Conversely, being too permissive can silently put expensive models on the wrong GPUs, reduce throughput, or cause unpredictable eviction behavior. The goal is controlled flexibility: tell Kubernetes what must be true (hard constraints) and what would be nice (soft preferences), then verify with metrics and events.
Practice note for Milestone 1 (Schedule the first GPU pod with correct resource requests): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone 2 (Enforce placement using labels, affinity, and selectors): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone 3 (Protect GPU nodes with taints and tolerations): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone 4 (Prevent noisy neighbors with quotas and limit ranges): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone 5 (Resolve scheduling failures using events and logs): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Kubernetes schedules pods based on requests (what you need to run) and enforces limits (the maximum you’re allowed to consume) for CPU and memory. GPUs are different: they appear as extended resources exposed by a device plugin, most commonly nvidia.com/gpu. Extended resources are integer-only and are not overcommitted by Kubernetes; if you request 1 GPU, the scheduler must find a node with at least 1 allocatable GPU.
Milestone 1 is to run a first GPU pod that schedules predictably. Your pod spec must include a GPU request (and typically a matching limit). If you omit it, the pod may land on a CPU node and then fail at runtime when CUDA libraries can’t find a device. If you request GPUs but the device plugin is missing or misconfigured, the pod will remain Pending because no node advertises that extended resource.
A practical minimal container spec looks like: set resources.limits["nvidia.com/gpu"]: 1 (and optionally the same under requests). For GPUs, many teams set request=limit to make intent explicit and avoid confusion during reviews. Keep CPU/memory requests realistic as well; an AI job that requests 1 GPU but forgets CPU/memory might schedule onto a GPU node yet starve itself (or its neighbors) at runtime.
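Concretely, the container’s resources stanza might read as follows (the CPU and memory values are illustrative sizing, not recommendations; note that Kubernetes requires request and limit to be equal for extended resources like nvidia.com/gpu):

```yaml
resources:
  requests:
    cpu: "4"              # headroom for data loading and preprocessing
    memory: 16Gi
    nvidia.com/gpu: 1     # must equal the limit for extended resources
  limits:
    cpu: "8"
    memory: 16Gi
    nvidia.com/gpu: 1     # integer only; never overcommitted by Kubernetes
```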
Before you trust scheduling, validate node capacity: kubectl describe node <node> should show Capacity and Allocatable for nvidia.com/gpu. If it does not, fix the device plugin or node driver installation first; scheduling rules cannot compensate for missing resources.
Most organizations end up with a heterogeneous GPU fleet: different GPU models (T4, L4, A10, A100/H100), different memory sizes, and sometimes different drivers or CUDA capabilities. If you treat all GPU nodes as equivalent, you’ll get mismatches: a training job needing 80GB may land on a 16GB card; an inference service optimized for a specific architecture may lose performance. The fix starts with a clear labeling strategy (Milestone 2’s foundation).
Use labels to express stable, meaningful scheduling dimensions. Prefer a small vocabulary that survives node replacement and autoscaling. Examples include: gpu.nvidia.com/model=A100, gpu.nvidia.com/memory-gb=80, gpu.nvidia.com/mig-enabled=true, workload.gpu/tier=training, or node.kubernetes.io/instance-type from the cloud provider. Avoid labels that change frequently (like “currently free”)—that’s what metrics and schedulers are for.
A common mistake is relying only on nodeSelector with one label such as gpu=true. That puts everything on “any GPU,” which is rarely what you want once you introduce multiple GPU models or specialized node pools. Another mistake is encoding too much detail (driver versions, minor differences) into scheduling constraints, which can make pods unschedulable and trigger unnecessary scale-outs—directly increasing cost.
Practical outcome: with a consistent labeling strategy, you can steer jobs to the right GPU class, keep expensive nodes reserved for the workloads that need them, and give autoscalers a clean target when expanding a specific pool.
Affinity rules are how you express placement intent beyond basic selectors. For AI workloads, the key distinction is: node affinity places pods on nodes with certain labels (hardware/pool constraints), while pod affinity/anti-affinity places pods relative to other pods (co-locate or spread). Milestone 2 uses these tools to enforce placement without overconstraining the cluster.
Use requiredDuringSchedulingIgnoredDuringExecution (hard constraints) for must-have requirements: “must run on A100 nodes,” “must run where MIG is enabled,” or “must run in the inference pool.” Use preferredDuringSchedulingIgnoredDuringExecution (soft preferences) for optimizations: “prefer nodes in the same zone as my data cache,” or “prefer nodes with a particular GPU model but allow fallback.” Soft preferences reduce the chance of Pending pods and help control costs by using available capacity rather than scaling out immediately.
Pod anti-affinity is especially practical for inference reliability: you can spread replicas across nodes to avoid a single-node failure taking down the service. For example, anti-affinity against the same app label at the hostname topology key prevents co-locating replicas on one node. Pod affinity can be useful when you want co-location for performance (e.g., an inference service near a GPU-sidecar cache), but be careful: co-location requirements can create scheduling deadlocks when combined with GPU scarcity.
Engineering judgment: start with node affinity to ensure hardware correctness, then add pod anti-affinity only where availability requires it. Prefer “soft spread” unless you have enough capacity to guarantee hard spread.
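A sketch combining a hard node-affinity constraint, a soft preference, and a soft replica spread. The label keys and values follow this chapter’s examples and the app label is hypothetical; substitute whatever your fleet actually uses:

```yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:   # hard: must be an A100 node
      nodeSelectorTerms:
        - matchExpressions:
            - key: gpu.nvidia.com/model
              operator: In
              values: ["A100"]
    preferredDuringSchedulingIgnoredDuringExecution:  # soft: prefer 80GB cards
      - weight: 50
        preference:
          matchExpressions:
            - key: gpu.nvidia.com/memory-gb
              operator: In
              values: ["80"]
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:  # soft spread across nodes
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: llm-inference                      # hypothetical app label
          topologyKey: kubernetes.io/hostname
```

The soft spread degrades gracefully when capacity is tight, which is usually the right default on a scarce GPU fleet; switch to a required anti-affinity term only when you can guarantee enough nodes.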
Labels and affinity help your GPU workloads find the right nodes, but they do not stop non-GPU workloads from landing on GPU nodes. That’s where taints and tolerations come in (Milestone 3). A taint on a node repels pods unless the pod explicitly tolerates it. This is one of the strongest cost-control tools in Kubernetes: it prevents “accidental” scheduling of cheap CPU services onto expensive GPU machines.
A common pattern is tainting GPU node pools with something like gpu=true:NoSchedule. Then, only pods that truly need GPUs include a matching toleration. Combine this with GPU requests: toleration alone should not be enough to land on a GPU node; it should be paired with nvidia.com/gpu requests and node affinity/selector so you don’t create a general-purpose backdoor into the GPU pool.
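In manifest form, the pattern pairs a node taint (applied out-of-band via your node pool configuration, or with kubectl taint nodes) with a pod-side toleration plus an explicit GPU request. A sketch of the pod half, assuming the nodepool=gpu label and gpu=true:NoSchedule taint used in this lab and a hypothetical trainer image:

```yaml
# Pod-side half of the pattern: tolerate the GPU taint AND request a GPU.
# The toleration alone should never be the only gate into the pool.
spec:
  tolerations:
    - key: gpu
      operator: Equal
      value: "true"
      effect: NoSchedule
  nodeSelector:
    nodepool: gpu                         # assumed pool label
  containers:
    - name: trainer
      image: registry.example.com/trainer:latest   # hypothetical image
      resources:
        limits:
          nvidia.com/gpu: 1
```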
Common mistake: adding a toleration broadly via a shared Helm chart “just in case.” That defeats the purpose and can silently inflate spend. Another mistake is tainting GPU nodes but forgetting to add tolerations to system-level DaemonSets that must run everywhere (logging, monitoring). In practice, you either (1) ensure those DaemonSets tolerate the taint, or (2) keep them out of GPU pools intentionally and provide alternative telemetry, depending on your operational needs.
Practical outcome: GPU nodes become protected real estate. Only workloads that explicitly opt in—and meet the hardware constraints—consume them, which reduces accidental cost leakage and improves scheduling predictability.
Once multiple teams share a GPU cluster, the biggest operational risk is not “no GPUs exist,” but “someone took them all.” Kubernetes multi-tenancy controls—especially ResourceQuota and LimitRange—are the core of Milestone 4. They create predictable boundaries per namespace so one team can’t starve others, intentionally or accidentally.
ResourceQuota can cap total GPU consumption per namespace by limiting requests.nvidia.com/gpu (and CPU/memory). This is a direct guardrail for both fairness and cost control. For example, a research namespace might be capped at 8 GPUs while production inference is capped differently. Quotas also make scheduling failures faster to diagnose: instead of “cluster is full,” you get a clear “exceeded quota” signal.
LimitRange sets defaults and min/max per pod/container. This matters because AI manifests are often copied between projects; a missing request can lead to BestEffort pods competing unpredictably for CPU/memory, even if GPUs are requested. With a LimitRange, you can enforce that every container has CPU/memory requests and keep runaway settings in check.
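Sketches of both objects for a hypothetical team-research namespace (the caps and defaults are illustrative, not recommendations):

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: team-research        # hypothetical namespace
spec:
  hard:
    requests.nvidia.com/gpu: "8"  # total GPUs the namespace may request
    requests.cpu: "64"
    requests.memory: 256Gi
---
apiVersion: v1
kind: LimitRange
metadata:
  name: container-defaults
  namespace: team-research
spec:
  limits:
    - type: Container
      defaultRequest:             # applied when a manifest omits requests
        cpu: 500m
        memory: 1Gi
      default:                    # applied when a manifest omits limits
        cpu: "2"
        memory: 4Gi
```

With the quota in place, a pod that would push the namespace past 8 requested GPUs is rejected at admission with an explicit quota error rather than sitting Pending.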
Engineering judgment: quotas should reflect business priority and the cluster’s scaling model. If you rely on node autoscaling, quotas still matter—they prevent one namespace from triggering massive scale-outs. Pair quotas with PriorityClasses (covered later in the course outcomes) when you need “production wins” behavior under contention.
Milestone 5 is about speed: when a GPU pod is Pending, you should be able to identify the reason in minutes. Start with kubectl describe pod and read the Events section. The default scheduler explains what it tried and why it failed: insufficient GPUs, node taints not tolerated, node affinity mismatch, insufficient CPU/memory, or quota violations. Events are your primary “scheduler reasoning” output.
Typical failure patterns and fixes are repeatable:
- Node affinity or label mismatch: inspect node labels (kubectl get nodes --show-labels) and ensure your label keys/values are correct and consistently applied to the node group template.

If events are unclear, look at cluster-wide signals: kubectl get events -A for broader context, and review scheduler logs (managed Kubernetes often exposes them via control plane logging). Also check node status: a GPU node might be NotReady, cordoned, or missing the device plugin; in that case, you will not see allocatable GPUs even if the hardware exists.
A disciplined debugging habit prevents expensive downtime. Don’t “randomly tweak” affinity and tolerations until it schedules; instead, use the event message as the hypothesis, verify the underlying state (labels, taints, allocatable resources, quotas), apply the smallest change, and re-check events. Practical outcome: faster recovery, fewer accidental policy bypasses, and a scheduler configuration you can explain and defend during audits.
1. Which workflow best reflects the chapter’s recommended approach to reliably schedule AI GPU workloads in Kubernetes?
2. Why are correct GPU resource requests critical for the Kubernetes scheduler in AI clusters?
3. In the chapter’s framing, what is the primary purpose of using labels, selectors, and affinity for GPU workloads?
4. How do taints and tolerations help manage GPU nodes according to the chapter?
5. What is a key trade-off described in the chapter when choosing strict vs flexible placement rules for GPU workloads?
Once your cluster can advertise GPUs (via the NVIDIA device plugin) and you have a basic scheduling strategy, the next step is operational: running real AI workloads in a way that is fast, safe, and cost-aware. This chapter focuses on the day-to-day mechanics of packaging GPU inference services, running batch training jobs, optimizing startup (time-to-first-token), choosing storage paths that don’t starve the GPU, and rolling out changes without burning expensive capacity.
A practical mindset helps: GPUs are not “just another resource.” They are scarce, expensive, and often coupled to driver/library constraints. That means your Kubernetes objects need to encode intent clearly (resource requests, node selection, and lifecycle behavior), and your images and probes must be designed for CUDA realities. You will work through five milestones across the chapter: (1) package an inference service with GPU access, (2) run a batch training job with sensible retries, (3) optimize images and startup, (4) configure storage for throughput, and (5) apply rollout safety for GPU-backed deployments.
Along the way, keep an eye on the engineering trade-offs you’re making. For example, insisting on a single GPU model may improve determinism but increases scheduling delay; mounting a remote filesystem may simplify data access but can cut effective GPU utilization in half if throughput is poor. Kubernetes gives you the levers—your job is to apply them intentionally.
The rest of the chapter is organized by workload types, images/runtime, health checks, storage, rollout strategies, and performance hygiene (CPU/memory alongside GPUs). Treat each section as a checklist you can apply in your own manifests and pipelines.
Practice note for Milestone 1: Package an inference service with GPU access: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone 2: Run a batch training job and manage retries: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone 3: Optimize images and startup for faster GPU time-to-first-token: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone 4: Configure storage and data paths for throughput: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone 5: Apply rollout safety for GPU-backed deployments: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
AI on Kubernetes usually falls into two execution shapes: long-running inference services and finite batch jobs. Kubernetes gives you different controllers for each, and choosing the right one is the first cost-control decision because it determines restart behavior, rollout semantics, and how work is counted as “done.” For Milestone 1 (packaging an inference service), you typically use a Deployment because you want a stable endpoint, rolling updates, and replica management. A Deployment pairs well with a Service and an HPA when request volume is variable.
For Milestone 2 (batch training), use a Job when the unit of work is finite (train for N steps, produce artifacts, exit). Jobs track completions and support backoffLimit for retries. A common mistake is using a Deployment for training: the controller interprets “process exited” as failure and keeps restarting, wasting GPU time and potentially corrupting outputs. With Jobs, also decide whether a retry is safe: if your training writes checkpoints, retries can resume; if it writes in-place without transactional discipline, retries may produce inconsistent artifacts.
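As a sketch of that pattern, a minimal training Job might look like the following (the image, args, and PVC name are illustrative assumptions, not a prescribed setup):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: finetune-run-42          # version the name per run; keep previous outputs immutable
spec:
  backoffLimit: 3                # retries are only safe if training resumes from checkpoints
  template:
    spec:
      restartPolicy: Never       # let the Job controller own retries, not the kubelet
      containers:
      - name: trainer
        image: registry.example.com/trainer:v1.4.2   # hypothetical image
        args: ["--steps=5000", "--checkpoint-dir=/ckpt"]
        resources:
          limits:
            nvidia.com/gpu: 1    # extended resource: requests default to limits
        volumeMounts:
        - name: ckpt
          mountPath: /ckpt
      volumes:
      - name: ckpt
        persistentVolumeClaim:
          claimName: train-ckpt  # durable checkpoints make retries resumable
```

Note the restartPolicy: Never plus backoffLimit combination, which is what makes retry behavior explicit rather than implicit.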
Use a CronJob for scheduled batch tasks like nightly evaluation, embedding refresh, or periodic fine-tune runs. CronJobs can create many Jobs over time, so cost and quota discipline matter. Configure concurrencyPolicy (e.g., Forbid) to prevent overlapping runs that double GPU spend, and set startingDeadlineSeconds so missed schedules don’t spawn surprise catch-up jobs during peak hours.
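A CronJob sketch with those guardrails might look like this (schedule and image are assumptions to adapt):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-eval
spec:
  schedule: "0 2 * * *"            # 02:00 daily, off-peak
  concurrencyPolicy: Forbid        # never run two GPU evaluations at once
  startingDeadlineSeconds: 3600    # if missed by more than 1h, skip instead of catching up later
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      backoffLimit: 1
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: eval
            image: registry.example.com/eval:v2.0.0   # hypothetical image
            resources:
              limits:
                nvidia.com/gpu: 1
```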
Across all three types, encode GPU intent explicitly with resource requests (e.g., nvidia.com/gpu: 1) and placement rules (node affinity or tolerations) so the scheduler doesn’t guess. Treat “unschedulable due to GPU” as a first-class signal: it is usually either a genuine capacity issue or a constraint mismatch (wrong GPU type label, missing toleration, or requesting 2 GPUs on nodes that only have 1).
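One way to encode that intent in a pod template, assuming a taint key and image that you would replace with your cluster’s conventions:

```yaml
# Pod-template fragment: make GPU intent explicit
spec:
  tolerations:
  - key: nvidia.com/gpu          # assumed taint key on GPU nodes
    operator: Exists
    effect: NoSchedule
  containers:
  - name: server
    image: registry.example.com/infer:v3.1.0   # hypothetical image
    resources:
      limits:
        nvidia.com/gpu: 1        # request exactly what you need; a 2-GPU request
                                 # will never fit nodes that expose only 1
```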
Most GPU workload failures are image/runtime mismatches: the Pod lands on a GPU node, but the process cannot load CUDA libraries, sees no device, or crashes on import. For Milestone 1 and Milestone 2, aim for a repeatable image strategy. If you use NVIDIA CUDA base images, align the CUDA version with your framework build (PyTorch/TF) and with the driver compatibility guarantees of your environment. A frequent mistake is “it works on my laptop” with a different driver/CUDA combo than the cluster nodes.
In Kubernetes, your container generally doesn’t need to ship the driver, but it does need compatible user-space libraries. The NVIDIA device plugin and runtime integrate the device into the container, but your process still must be able to load the correct libcuda-compatible stack. Validate inside the container with lightweight checks (e.g., nvidia-smi if present, or framework-level device queries) during development—not as a production readiness probe that runs every few seconds.
Milestone 3 (optimize images and startup) is where engineering judgment pays off. Large images delay scheduling-to-ready time and waste GPU minutes while the node pulls layers. Use multi-stage builds, minimize OS packages, and cache Python wheels effectively. Pin dependencies to avoid surprise downloads at startup. If your model is large, decide whether to bake it into the image (fast startup, slower builds, larger pulls) or fetch at runtime (smaller image, slower cold starts, extra network). For GPU cost control, you usually want to avoid “GPU allocated while downloading 20 GB of model weights,” so prefer pre-staging weights on shared storage or using an initContainer that runs before the main container requests the GPU (for example by separating model download into a CPU-only Pod step or using a workflow engine).
Finally, treat GPU runtime knobs as configuration, not code: set environment variables for memory behavior or performance (where appropriate), and keep them consistent across environments. When performance differs across nodes, suspect hidden differences: GPU model, driver version, power settings, or CPU limits that starve the GPU feed pipeline.
Inference services are expensive to run and easy to mis-probe. A naive readiness probe that calls “/generate” can inadvertently allocate GPU memory, warm caches, or even trigger long computations. Worse, during rollouts it can amplify traffic and cause cascading failures. For Milestone 1, design health checks that reflect service correctness without burning GPU cycles. Typically you want three layers: (1) a fast liveness check to detect deadlocks, (2) a readiness check that confirms the model is loaded and the service can accept traffic, and (3) an optional startupProbe that protects slow model initialization from premature restarts.
On GPU workloads, the startup phase can be long: pulling the image, loading weights into CPU memory, transferring to GPU, compiling kernels, or initializing TensorRT. Use startupProbe with a generous failure threshold so Kubernetes doesn’t kill the container during expected warmup. Then let readiness become strict: only report ready after the model is loaded and you have verified a minimal forward pass (ideally CPU-only or a tiny GPU check that doesn’t allocate large buffers). A common mistake is reporting readiness as soon as the HTTP server binds a port; that causes traffic to arrive while the model is still loading, creating timeouts and retries that overload the node.
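A probe layout along those lines might look like the following sketch; the paths, port, and thresholds are illustrative and need tuning to your actual warmup time:

```yaml
containers:
- name: server
  image: registry.example.com/infer:v3.1.0   # hypothetical image
  startupProbe:                  # protects slow model load; roughly a 10-minute budget here
    httpGet: { path: /healthz, port: 8080 }
    periodSeconds: 10
    failureThreshold: 60
  readinessProbe:                # strict: true only after weights are loaded and a
    httpGet: { path: /readyz, port: 8080 }   # minimal forward pass has been verified
    periodSeconds: 5
    failureThreshold: 2
  livenessProbe:                 # cheap deadlock detector; never touches the GPU
    httpGet: { path: /healthz, port: 8080 }
    periodSeconds: 10
    failureThreshold: 3
```

Because liveness and readiness probes do not start until the startupProbe succeeds, the generous startup budget does not weaken steady-state checks.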
Implement back-pressure explicitly. If the model server has a queue, expose metrics and consider returning 429/503 when saturated rather than letting requests pile up. Your readiness endpoint can incorporate “can accept new requests” as a signal. This becomes crucial during rollouts (Milestone 5): if the new ReplicaSet is technically running but has not finished warming, you want it excluded from load balancing.
Finally, separate observability probes from user traffic. Use distinct paths like /healthz and /readyz, keep timeouts short, and avoid dependencies on external services that can flap. If you need deeper checks, run them on a slower cadence via background threads and have probes read cached status.
GPU utilization is often limited by data, not compute. Milestone 4 is about feeding the accelerator consistently by choosing the right storage pattern for each stage: model weights, datasets, checkpoints, and logs. Start by classifying data as read-mostly (weights, static datasets), write-heavy (checkpoints), or scratch (temporary shards, preprocessed batches). Each class maps naturally to different Kubernetes volumes.
Use PVCs when you need durability across Pod restarts or rescheduling, such as training checkpoints or shared model artifacts. The practical trade-off is latency and throughput: network-attached volumes vary widely. If training throughput is poor, measure I/O (read bandwidth, IOPS, and latency) and verify you are not bottlenecked on a single shared volume. A common mistake is putting both dataset reads and checkpoint writes onto the same slow PVC, causing periodic stalls that look like “GPU underutilization.”
Use ephemeral volumes (like emptyDir) for scratch space and caching. For example, you can stage frequently accessed dataset shards onto ephemeral disk at startup. The risk is that rescheduling loses the cache, so use this when you can tolerate cache rebuilds or when you have a warm pool of nodes. If nodes have fast local NVMe, ephemeral caching can dramatically reduce step time compared to reading from remote object storage.
Dataset staging is a deliberate pattern: download or copy data close to the compute before the GPU is engaged. Practically, this can be done via an initContainer (CPU-only) that pulls data to emptyDir, or via a separate pre-staging Job that populates a PVC. The key engineering judgment is cost: you want the “waiting on network” phase to happen without reserving the GPU whenever possible. When your workload requires the GPU for preprocessing (e.g., tokenization on GPU), ensure the staging step is still optimized and bounded.
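A sketch of the initContainer staging variant follows (image, paths, and PVC names are assumptions). Keep in mind that within a single Pod the GPU is reserved for the whole Pod lifetime, init phase included, so for long downloads a separate CPU-only pre-staging Job that fills a PVC is the cheaper design:

```yaml
spec:
  initContainers:
  - name: stage-data             # CPU-only staging step; runs before the trainer starts
    image: registry.example.com/tools:1.0     # hypothetical; needs your copy tooling
    command: ["sh", "-c", "cp -r /remote/shards/. /scratch/"]
    volumeMounts:
    - name: scratch
      mountPath: /scratch
    - name: remote
      mountPath: /remote
      readOnly: true
  containers:
  - name: trainer
    image: registry.example.com/trainer:v1.4.2   # hypothetical image
    volumeMounts:
    - name: scratch
      mountPath: /data           # dataloaders read from fast local scratch
    resources:
      limits:
        nvidia.com/gpu: 1
  volumes:
  - name: scratch
    emptyDir: {}                 # lost on reschedule; acceptable for rebuildable cache data
  - name: remote
    persistentVolumeClaim:
      claimName: dataset-ro      # hypothetical read-mostly dataset PVC
```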
For inference, weights are usually the biggest concern. If you mount weights from a shared store, validate concurrency behavior: dozens of replicas starting at once can stampede the storage backend. Rate-limit rollouts, or bake hot weights into the image for high-availability paths.
Milestone 5 is where GPU cost control and reliability collide. Updating an inference Deployment can temporarily double GPU usage (old and new replicas overlap) and can destabilize latency if new pods are cold. A default rolling update is rarely ideal without tuning. Set maxSurge and maxUnavailable intentionally: if GPUs are scarce, keep surge low to avoid pending pods and wasted scheduling churn; if availability is critical, allow limited surge but pair it with strict readiness so new pods only receive traffic when genuinely warm.
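The two postures can be expressed directly in the Deployment’s update strategy; the values below are a starting point, not a prescription:

```yaml
# Deployment fragment: conservative rollout for a GPU-scarce cluster
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 0          # never create a pod that may sit Pending on a missing GPU
      maxUnavailable: 1    # accept briefly reduced capacity instead
# If availability is critical and a spare GPU exists, invert the trade-off:
#   maxSurge: 1, maxUnavailable: 0, paired with strict readiness gating.
```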
Canary releases reduce blast radius. In Kubernetes you can approximate a canary by running a second Deployment with a small replica count and splitting traffic (via service mesh, ingress weighting, or separate services). The practical outcome is faster detection of model regressions (accuracy, latency, memory leaks) before you scale the new version. Common mistakes include canaries that share the same HPA signals and inadvertently scale up due to test traffic, or canaries that miss real traffic patterns and fail to expose tail-latency problems.
Rollback signals must be measurable. Use more than “pods are running.” Track request error rate, P95/P99 latency, GPU memory usage, and restart counts. If a model version slowly leaks GPU memory, it may pass readiness but fail after hours. Integrate metrics-based alerts that trigger a rollback workflow (manual or automated) and freeze further rollouts. Also consider PodDisruptionBudgets so node maintenance doesn’t evict too many GPU pods at once, which can cause a thundering herd of cold starts.
For batch training Jobs, “rollout” looks different: you version the image and configuration, and you control retries. If you change training code, do not rely on implicit restarts; create a new Job name/version and keep previous outputs immutable. This makes failures diagnosable and prevents accidental overwrites of checkpoints.
Requesting a GPU is necessary but not sufficient. Many “slow GPU” incidents are actually CPU starvation, memory pressure, or poor threading defaults. Treat CPU and memory as first-class alongside nvidia.com/gpu. For inference, too little CPU can bottleneck tokenization, request parsing, or streaming responses, leaving the GPU idle between kernels. For training, CPU limits can throttle dataloaders so the GPU waits on batches.
Start with explicit resource requests/limits for CPU and memory that match your concurrency model. If you use multiple workers for data loading, align num_workers with available CPU cores and avoid setting a CPU limit so low that Linux throttling undermines throughput. Memory sizing matters because model loading often spikes RSS before settling; if you set memory limits too tightly, you’ll see OOMKills during warmup, which is especially wasteful when the pod already reserved a GPU.
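For example, a resources stanza for a training pod with four dataloader workers might look like this; the numbers are placeholders to measure against, not recommendations:

```yaml
resources:
  requests:
    cpu: "4"               # enough cores for the dataloader workers (num_workers ~ 4 here)
    memory: 16Gi
    nvidia.com/gpu: 1
  limits:
    memory: 24Gi           # headroom for the RSS spike during model load
    nvidia.com/gpu: 1
    # deliberately no CPU limit: avoids CFS throttling that starves the GPU feed pipeline
```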
Use practical telemetry: GPU utilization, SM occupancy, GPU memory, CPU usage, and disk/network throughput. If GPU utilization is low but CPU is pegged, increase CPU requests or optimize preprocessing. If GPU memory is near the limit and performance degrades, reduce batch size or enable more memory-efficient kernels. If pods take a long time to become ready, revisit Milestone 3: image size, dependency downloads, and weight staging.
Finally, remember scheduling side-effects: if you request “1 GPU + lots of CPU,” you may reduce bin-packing and strand GPUs on nodes where CPU is exhausted. Conversely, requesting too little CPU can pack many pods onto a node and create contention. The practical outcome is a sizing loop: measure, adjust requests, and keep profiles per workload (small inference, large inference, training) so your cluster autoscaling and quotas remain predictable.
1. Why does Chapter 3 emphasize that GPUs are not “just another resource” when defining Kubernetes objects for AI workloads?
2. Which set of actions best matches the chapter’s goals for reducing wasted GPU minutes?
3. What trade-off does the chapter describe when insisting on a single GPU model for a workload?
4. According to the chapter, why can mounting a remote filesystem be risky for GPU utilization?
5. Which failure mode is Chapter 3 explicitly trying to help you avoid through image/runtime and probe design for GPU services?
Autoscaling is where GPU clusters either become a cost-efficient platform—or an expensive science project. Unlike general web workloads, AI inference and training behave in bursts, have hard resource constraints (GPU memory, model size), and frequently depend on external queues and batch schedulers. That means “scale on CPU” is usually wrong, “scale on request rate” is sometimes wrong, and “scale on GPU utilization” can be dangerously misleading if you don’t understand saturation vs throughput.
This chapter builds a practical mental model for scaling decisions, then walks through a lab-style progression: scale an inference service with HPA using custom metrics (Milestone 1), use VPA safely without thrash (Milestone 2), trigger node scale-out when GPU pods are pending (Milestone 3), reduce idle spend with scale-down and disruption controls (Milestone 4), and finally validate everything with load tests and dashboards (Milestone 5).
The goal is not just to “turn on autoscaling,” but to make it predictable: pods scale when demand increases, nodes scale when pods can’t schedule, and the whole system scales down without dropping in-flight requests or killing expensive model warmups. Along the way, you’ll learn where autoscalers fight each other, how to define guardrails, and which signals actually map to cost and user experience.
Practice note for Milestone 1: Scale an inference service with HPA using custom metrics: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone 2: Use VPA safely for AI workloads and avoid thrash: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone 3: Trigger node scale-out with pending GPU pods: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone 4: Reduce idle time with scale-down and disruption controls: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone 5: Validate autoscaling with load tests and dashboards: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Before configuring any autoscaler, separate three independent control loops: pod scaling, node scaling, and work queue dynamics. Pod scaling (HPA/VPA) adjusts the number of replicas or their resource requests. Node scaling (Cluster Autoscaler or similar) adjusts the number of nodes in a GPU node group. Queue dynamics (request queue, Kafka topic lag, SQS depth, Ray/Serve backlog) describe how work arrives and how quickly it can be processed.
A reliable mental model is: pods scale for throughput, nodes scale for capacity constraints, and queues scale for decoupling. If you only scale pods, but the cluster has no free GPUs, you will get pending pods and no additional throughput. If you only scale nodes, but your service is single-replica or limited by concurrency, you will pay for idle GPUs. If you only watch GPU utilization, you may scale too late because a saturated queue can exist while utilization looks “moderate” due to batching, throttling, or backpressure.
Common mistake: treating autoscaling as a single knob. In practice, you want HPA to add replicas when demand rises, then Cluster Autoscaler to add GPU nodes only when those replicas cannot schedule. This chapter’s milestones map to that layered approach: first scale an inference service at the pod level, then ensure nodes expand only when needed, then ensure the system contracts safely.
Horizontal Pod Autoscaler (HPA) is often taught with CPU utilization. For AI inference, CPU is frequently a poor proxy for demand: the hot path is GPU-bound, CPU stays low due to async request handling, and high CPU can appear during model load rather than steady-state serving. Scaling on CPU can therefore oscillate or scale too late, creating long queues and timeouts.
Milestone 1 focuses on scaling an inference service with HPA using custom metrics. The most defensible signals are usually “work backlog” and “service latency,” not raw device utilization. Practical options include: request queue depth (from your gateway or message queue), in-flight requests per pod, tokens/sec per replica, or p95 latency. GPU utilization can help, but only if you also understand batching and concurrency: high utilization with stable latency may be fine; moderate utilization with rising latency can indicate memory pressure, kernel launch overhead, or contention.
Implementation pattern: expose an application metric (Prometheus format) and use the Prometheus Adapter to map it into the Kubernetes custom metrics API, then configure HPA with a target value. For example, scale replicas to keep “inflight_requests” near a per-replica target, or scale to keep “queue_depth” under a threshold. Keep the math simple; unstable formulas make unstable scaling.
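Under those assumptions (a Deployment named infer, and an inflight_requests metric already mapped through the Prometheus Adapter), the HPA could be sketched as:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: infer-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: infer                        # hypothetical Deployment name
  minReplicas: 2                       # avoid cold starts at moderate traffic
  maxReplicas: 12                      # hard spend cap
  metrics:
  - type: Pods
    pods:
      metric:
        name: inflight_requests        # assumed app metric exposed via the adapter
      target:
        type: AverageValue
        averageValue: "8"              # target in-flight requests per replica
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # dampen flapping after bursts
```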
Set minReplicas high enough to avoid cold starts during moderate traffic, and use maxReplicas to cap spend. Common mistake: using GPU utilization alone as the target. A single replica can show 90% GPU utilization while still delivering poor p95 latency if requests are queued. Prefer signals that directly represent SLO impact (latency/backlog), then use GPU utilization as a diagnostic metric on dashboards.
Vertical Pod Autoscaler (VPA) is powerful for right-sizing, but it can be hazardous for AI workloads if you treat it like a general-purpose optimizer. For GPU-bound services, CPU/memory requests still matter for scheduling and QoS, but changing them frequently can trigger evictions and restarts—expensive when each restart re-downloads weights or warms caches. Milestone 2 is about using VPA safely and avoiding thrash.
VPA has three practical modes: Off, Initial, and Auto. In Off, VPA only produces recommendations; this is ideal for learning typical CPU/memory footprints without changing live pods. In Initial, VPA applies recommendations only at pod creation time; this avoids mid-flight evictions and is usually the safest starting point for inference deployments. In Auto, VPA can evict pods to apply new requests; this can be acceptable for stateless services with fast startup, but risky for large-model inference or training sidecars.
A practical right-sizing workflow is: run VPA in Off for several days, review recommendations, then codify them as explicit requests/limits in your Deployment or Helm values. If you enable Initial mode, keep a tight range via VPA policies and set a reasonable updateMode to prevent frequent shifts. For AI, prioritize stability over perfect packing; an extra 200m CPU request is cheaper than repeated model reloads.
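A recommendation-only VPA for that workflow might look like this sketch (target name and bounds are assumptions):

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: infer-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: infer                  # hypothetical Deployment name
  updatePolicy:
    updateMode: "Off"            # recommendations only; graduate to "Initial" once reviewed
  resourcePolicy:
    containerPolicies:
    - containerName: server
      controlledResources: ["cpu", "memory"]   # VPA never manages nvidia.com/gpu
      minAllowed:
        cpu: "1"
        memory: 4Gi
      maxAllowed:
        cpu: "8"
        memory: 32Gi
```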
GPU counts are requested as extended resources (nvidia.com/gpu) and are not managed by VPA. Treat GPU count as an explicit sizing decision. Common mistake: turning on VPA Auto for a model server and then wondering why latency spikes every hour. The cause is often eviction-driven restarts. For AI services, VPA is best as a measurement tool first, an initializer second, and an automatic evictor only when you have strong disruption tolerance.
Once HPA is adding replicas, the next question is whether the cluster has enough GPUs to schedule them. Milestone 3 focuses on triggering node scale-out with pending GPU pods. This is exactly what Cluster Autoscaler (CA) is designed to do: watch unschedulable pods and add nodes in the appropriate node group so that the scheduler can place them.
For GPU clusters, node groups are typically separated by GPU type (A10, L4, A100), pricing model (on-demand vs spot), and sometimes by tenancy. CA needs clear signals to pick the right group: node labels (e.g., gpu.nvidia.com/class=a10), taints (e.g., nvidia.com/gpu=true:NoSchedule), and matching tolerations/affinity in the pod spec. If your GPU workloads don’t tolerate the GPU taint, CA can scale nodes all day and your pods will still be unschedulable. If your pod requires a label that no node group can satisfy, CA will not help.
A practical pattern is: define a GPU node group with a taint that blocks non-GPU workloads, then ensure AI pods include tolerations and node affinity. Request GPUs explicitly via resources.requests["nvidia.com/gpu"]. When replicas increase and GPUs run out, pods become Pending with an unschedulable reason; CA detects this and scales the GPU node group.
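Putting those pieces together in a pod template, using the taint and label values from the examples above (they must match what your node groups actually carry):

```yaml
spec:
  tolerations:
  - key: nvidia.com/gpu
    operator: Equal
    value: "true"
    effect: NoSchedule             # matches the GPU node group's taint
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: gpu.nvidia.com/class
            operator: In
            values: ["a10", "l4"]  # flexibility across classes reduces time spent Pending
  containers:
  - name: server
    image: registry.example.com/infer:v3.1.0   # hypothetical image
    resources:
      requests:
        nvidia.com/gpu: 1
      limits:
        nvidia.com/gpu: 1
```

With this in place, a Pending replica that no existing node can satisfy is exactly the unschedulable signal CA needs to expand the matching node group.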
Common mistake: confusing a Pod that is Pending because of image pull failures (e.g., ImagePullBackOff) with one that is Pending because it is unschedulable. CA only reacts to unschedulable pods. Your troubleshooting should start with kubectl describe pod to confirm the scheduler’s reason includes insufficient GPU or unmatched affinity/taints.
Scaling up is only half the story; cost control depends on scaling down without breaking workloads. Milestone 4 is about reducing idle GPU time with scale-down and disruption controls. The challenge: GPU nodes are expensive, but AI pods are also “sticky” due to model warmup, long-running requests, and checkpointing.
Cluster Autoscaler scale-down works by identifying nodes that can be removed and evicting pods so they can reschedule elsewhere. If your pods cannot move (due to strict node affinity, missing tolerations on other nodes, or oversized requests), nodes will never be considered removable. If your pods can move but take too long to terminate, scale-down may be delayed or may cause dropped requests if termination isn’t graceful.
Start with PodDisruptionBudgets (PDBs) to prevent too many replicas being disrupted at once. For an inference Deployment, a PDB like “minAvailable: 90%” can ensure capacity remains during drains. Next, implement graceful termination: set terminationGracePeriodSeconds long enough to finish in-flight requests, and add a preStop hook that stops accepting new traffic (e.g., mark unready, drain connections) before the process exits. Ensure your Service/readiness probes remove the pod from load balancing quickly.
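The two controls together might be sketched like this (selector, grace period, and drain delay are assumptions; the draining mechanism shown relies on a hypothetical readiness handler that checks the marker file):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: infer-pdb
spec:
  minAvailable: "90%"
  selector:
    matchLabels:
      app: infer
---
# Pod-template fragment for graceful drain
spec:
  terminationGracePeriodSeconds: 120   # long enough to finish in-flight requests
  containers:
  - name: server
    image: registry.example.com/infer:v3.1.0   # hypothetical image
    lifecycle:
      preStop:
        exec:
          # Mark the pod as draining, then wait so the readiness probe flips to
          # unready and the endpoint is removed before SIGTERM reaches the process.
          command: ["sh", "-c", "touch /tmp/draining && sleep 20"]
```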
Common mistake: enabling aggressive scale-down while ignoring readiness/termination. The symptom is “autoscaling worked” but users see 5xx spikes during node removals. Treat disruption controls as part of autoscaling, not an optional add-on.
Milestone 5 ties everything together: validate autoscaling with load tests and dashboards. Autoscaling configurations are hypotheses; validation proves whether the cluster meets SLOs at acceptable cost. The key is to test the full chain: metric emission → metric adapter → HPA decisions → pod scheduling → node provisioning → readiness → traffic distribution → scale-down.
Use synthetic load that resembles real inference: include realistic request sizes, concurrency, and burstiness. If your service batches requests, test both steady load and spiky load to see how queues form. During the test, watch SLO signals such as p50/p95 latency, error rate, and queue depth. These are the metrics your users feel. Also watch platform metrics: replica count, Pending pods, node group size, GPU utilization, and time-to-ready for new replicas.
Validation is not only “it scales up.” You also need to confirm: (1) scale-up occurs early enough to prevent SLO violations, (2) node scale-out triggers only when genuinely needed (unschedulable pods), (3) scale-down happens after load drops without causing errors, and (4) the steady-state footprint matches budget expectations.
Common mistake: testing only scale-up and declaring success. In GPU clusters, the biggest savings often come from reliable scale-down and avoiding idle nodes. Your final acceptance criteria should include both performance under peak and cost behavior after the peak ends.
1. Why is “scale on CPU” usually a poor autoscaling signal for GPU-based AI workloads?
2. What is the key risk of scaling directly on GPU utilization without understanding saturation vs throughput?
3. What does the chapter describe as the predictable autoscaling sequence to aim for?
4. Which milestone focuses on scaling an inference service using HPA with a metric beyond default CPU-based signals?
5. What is the main purpose of validating autoscaling with load tests and dashboards (Milestone 5) in this chapter’s framing?
GPU workloads fail differently than CPU-only services: they can be “healthy” at the container level while silently underperforming due to low GPU occupancy, memory fragmentation, PCIe bottlenecks, or thermal throttling. That is why observability for AI on Kubernetes must be built around GPU-specific signals and a workflow that narrows from user-visible symptoms down to node and device realities.
This chapter turns troubleshooting into an engineering practice. You will build a GPU-focused checklist (Milestone 1), then use it to trace a latency issue from service to node to GPU (Milestone 2). You will learn to diagnose OOM, throttling, and memory fragmentation (Milestone 3), investigate scheduling hot spots and bin-packing gaps (Milestone 4), and finally produce an incident report that leads to real remediation (Milestone 5).
The goal is not to “collect all metrics.” The goal is to answer a small set of repeatable questions: Is the service meeting its latency and throughput targets? If not, is it compute-bound, memory-bound, I/O-bound, or scheduler-bound? Are we wasting GPUs due to fragmentation or policy? And what guardrails prevent recurrence while controlling cost?
Practice note for Milestone 1 (Build a GPU-focused troubleshooting checklist): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone 2 (Trace a latency issue from service to node to GPU): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone 3 (Diagnose OOM, throttling, and GPU memory fragmentation): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone 4 (Investigate scheduling hot spots and bin-packing gaps): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone 5 (Produce an incident report with actionable remediation): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Start with the signals that map directly to how GPUs deliver performance. For inference and training, the most actionable quartet is: utilization (SM occupancy), memory (allocated vs used), power draw, and thermals. A common mistake is to treat “GPU utilization” as a single truth. In practice you need at least two utilization views: overall device busy time and kernel-level or SM-level occupancy. A GPU can show high utilization while doing inefficient small kernels, or low utilization while waiting on CPU preprocessing or network.
GPU memory signals are equally nuanced. Differentiate (1) memory allocated by the framework (reserved pools), (2) memory actually used by tensors, and (3) free memory that is unusable due to fragmentation. This is where Milestone 3 begins: if a pod OOMs while “free” memory appears available, you may be facing allocator fragmentation or oversize batch spikes. Also watch memory bandwidth utilization and PCIe throughput when available; low SM usage plus high PCIe suggests the GPU is starved.
Power and thermals translate directly into throttling. A service may pass readiness probes yet regress in p95 latency because the GPU is power-capped by node settings or is thermal-throttling in a dense rack. In troubleshooting, look for a pattern: utilization steady but clocks reduced; power pinned at cap; temperature near threshold. These are not Kubernetes problems, but Kubernetes is where the symptom surfaces.
Milestone 1 is to turn these into a checklist you can run in minutes: “Is latency up? Is queueing up? Are GPUs busy? If not, what is blocking?” That checklist is your first defense against spending hours in the wrong subsystem.
You cannot troubleshoot what you cannot measure consistently. For GPU workloads on Kubernetes, a practical metrics pipeline is Prometheus for storage and query, DCGM exporter for GPU device metrics, and kube-state-metrics/cAdvisor for Kubernetes and container metrics. The engineering judgment is choosing a small set of high-signal metrics, sane scrape intervals, and alert thresholds that reflect SLOs rather than noise.
DCGM exporter exposes metrics such as GPU utilization, memory usage, power draw, temperature, and sometimes per-process stats depending on configuration. Pair that with node exporter (node CPU, disk, network) and kubelet/cAdvisor (container CPU throttling, memory working set). When teams skip the exporter and rely only on application logs, they end up guessing whether the GPU is saturated or idle.
Alerting should follow user impact and cost impact. User impact: high p95 latency, rising error rate, backlog growth. Cost impact: GPUs allocated but underutilized for sustained windows, or nodes sitting idle due to scheduling constraints. A practical alert is “GPU allocated (pods requesting nvidia.com/gpu) but SM utilization < 20% for 30 minutes,” which flags waste and also hints at pipeline bottlenecks. Another is “GPU temperature near throttle threshold for 10 minutes,” which predicts tail latency regressions.
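As a sketch, the two alerts above could be written as Prometheus rules against DCGM exporter metrics. Metric names such as DCGM_FI_DEV_GPU_UTIL and DCGM_FI_DEV_GPU_TEMP, and labels such as Hostname, vary by exporter version and configuration, so treat the names and thresholds here as assumptions to verify against your deployment:

```yaml
groups:
- name: gpu-cost-and-latency
  rules:
  - alert: GPUAllocatedButIdle
    # Device busy time below 20% for a sustained window; flags waste
    # on allocated GPUs and often hints at an upstream pipeline bottleneck.
    expr: avg_over_time(DCGM_FI_DEV_GPU_UTIL[30m]) < 20
    for: 30m
    labels:
      severity: warning
    annotations:
      summary: "GPU {{ $labels.gpu }} on {{ $labels.Hostname }} under 20% utilization for 30m"
  - alert: GPUNearThermalThrottle
    # Temperature near the throttle threshold predicts p95/p99 regressions.
    # 83C is a placeholder; use your hardware's documented slowdown temperature.
    expr: DCGM_FI_DEV_GPU_TEMP > 83
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "GPU {{ $labels.gpu }} on {{ $labels.Hostname }} near thermal throttle for 10m"
```

Both rules alert on symptoms tied to action (consolidate or fix the pipeline; relocate or de-densify), which is the calibration discipline described above.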
Common mistakes include overly aggressive scrape intervals that overload the control plane, and alerts on raw utilization without context. Calibrate with baselines: measure normal utilization for your model and batch size. Then tie alerts to symptoms (latency/queueing) and to actions (scale up, change batching, relocate pods).
Milestone 2 uses this pipeline to trace latency: start at service latency dashboards, correlate to queue depth, then to GPU busy time and power/thermals on the specific nodes running the pods. The win is speed: you move from “users report slowness” to “GPU is idle because CPU preprocessing is throttled” in a single dashboard pass.
Metrics tell you “what” is happening; logs and events tell you “why.” GPU troubleshooting in Kubernetes often requires reading three layers: Kubernetes events (scheduling, image pulls, pod lifecycle), kubelet logs (device plugin interactions, cgroup limits, OOM kills), and runtime logs (containerd/Docker, NVIDIA container runtime). The key is to follow time order and correlate with pod UID and node name.
Start with kubectl describe pod and events. For Milestone 4 (scheduling hot spots), events like 0/10 nodes are available: 10 Insufficient nvidia.com/gpu or node(s) had taint reveal whether your placement rules (taints/tolerations, node affinity) are too strict. If pods are pending while GPUs exist, you likely have fragmentation (e.g., many nodes each with 1 free GPU but pods request 2), or topology constraints that prevent packing.
For Milestone 3 (OOM and fragmentation), distinguish between host OOM and container OOM, and between CPU memory and GPU memory. Kubernetes OOMKilled events usually refer to CPU memory (cgroup limit). GPU memory OOM appears in application logs (CUDA OOM, framework allocator errors) and may not restart the pod unless the process exits. If the process survives but latency spikes, look for repeated allocation failures triggering fallback paths.
On the node, kubelet logs can show device plugin registration failures and allocation issues. Runtime logs can expose missing driver libraries or mismatched CUDA versions. A frequent pitfall is assuming “device plugin running” means “GPU usable.” Validate by checking that the container sees /dev/nvidia* devices and that nvidia-smi works inside the pod (or use a minimal validation container image).
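A minimal validation pod along these lines works well; the image tag is illustrative (any CUDA base image that ships nvidia-smi will do):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: check
    # Illustrative image; any image containing nvidia-smi is sufficient.
    image: nvidia/cuda:12.4.1-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
```

If this pod completes and its logs show the device table, the driver, container runtime, and device plugin chain is intact; if it stays Pending or crashes, its events and logs localize which layer failed.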
The practical outcome is repeatable root-cause isolation: you can explain whether a slowdown is scheduling delay, node pressure, container throttling, or GPU-side failure, instead of treating them as one bucket.
Inference performance problems often present as tail latency spikes rather than average regressions. Profiling must therefore include queueing and concurrency, not only GPU utilization. A model server can keep the GPU “busy” while requests pile up because batch formation is inefficient or because CPU-side tokenization is saturated. Conversely, latency can rise with low GPU usage if requests are serialized due to a lock, a low concurrency setting, or a single-threaded preprocessor.
Milestone 2 becomes concrete here: trace latency from the service layer (p95/p99) to queue depth (in the model server or ingress) to per-pod throughput. Then tie it to resource contention: container CPU throttling is a common culprit when CPU limits are set too low for preprocessing. Check for high container_cpu_cfs_throttled_seconds_total while GPU utilization remains low. If you see that pattern, raising CPU limits or moving preprocessing off-path can improve GPU occupancy and reduce cost per request.
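One way to codify the "throttled CPU, idle GPU" check is an alerting rule that pairs the two signals. This is a sketch: the exact label joins depend on your scrape configuration (here a node label on cAdvisor metrics is assumed, and the DCGM Hostname label is mapped onto it):

```yaml
groups:
- name: gpu-starved-by-cpu
  rules:
  - alert: GPUStarvedByCPUThrottling
    # Fires when a node shows heavy CFS throttling while its GPUs sit idle:
    # the classic signature of CPU-side preprocessing starving the accelerator.
    expr: |
      sum by (node) (rate(container_cpu_cfs_throttled_seconds_total[5m])) > 1
        and on (node)
      avg by (node) (
        label_replace(DCGM_FI_DEV_GPU_UTIL, "node", "$1", "Hostname", "(.+)")
      ) < 20
    for: 10m
    labels:
      severity: warning
```

When this fires, raising CPU limits or moving preprocessing off-path is usually a better fix than adding replicas.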
Tail latency is also sensitive to GPU thermal or power throttling (Section 5.1). If clocks drop under sustained load, average throughput may hold while p99 degrades. Another frequent pattern is memory pressure inside the framework: garbage collection or allocator compaction pauses that appear as periodic latency spikes. Collect request-level histograms in the application and align them with GPU metrics timestamps.
Common mistakes: optimizing batch size solely for throughput while violating latency SLOs, or scaling replicas without considering shared bottlenecks like a single upstream queue or a saturated node NIC. Practical profiling workflow: (1) fix a representative traffic shape, (2) measure per-replica max sustainable throughput, (3) observe queue growth onset, (4) validate headroom under failure (one replica down), and (5) set autoscaling targets based on queueing, not only CPU.
GPU capacity is expensive, so you must detect waste modes: fragmentation, bin-packing gaps, and saturation. Fragmentation happens when free GPUs exist but cannot satisfy pod shapes or placement rules. For example, requesting 2 GPUs per pod on a fleet of 4-GPU nodes can strand 1 GPU on many nodes if scheduling is not aligned with workload shapes. This is Milestone 4: identify where the cluster has “available capacity” that is not schedulable.
Start with a simple accounting table: for each node, total GPUs, allocated GPUs, free GPUs, and which pods hold them. Then layer in constraints: taints/tolerations, node affinity, topology spread, and priority classes. If many nodes show 1 free GPU but no pending pods can use 1 GPU, you have a shape mismatch. If many GPUs are allocated to low-utilization pods, you have consolidation opportunities (or you need MIG/time-slicing, if policy allows).
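With kube-state-metrics, the per-node accounting table can be approximated in PromQL recording rules. The resource label spelling nvidia_com_gpu is how kube-state-metrics typically sanitizes nvidia.com/gpu, but verify it against your version:

```yaml
groups:
- name: gpu-capacity-accounting
  rules:
  - record: node:gpu_allocatable:total
    expr: kube_node_status_allocatable{resource="nvidia_com_gpu"}
  - record: node:gpu_requested:total
    # Sum of GPU requests from pods scheduled onto each node.
    expr: |
      sum by (node) (
        kube_pod_container_resource_requests{resource="nvidia_com_gpu"}
      )
  - record: node:gpu_free:total
    # Free GPUs per node. Many nodes with small nonzero values plus
    # pending multi-GPU pods is the fragmentation signature.
    # Note: nodes running zero GPU pods drop out of this expression;
    # treat an absent series as fully free.
    expr: |
      kube_node_status_allocatable{resource="nvidia_com_gpu"}
        - on (node)
      sum by (node) (
        kube_pod_container_resource_requests{resource="nvidia_com_gpu"}
      )
```

Graphing free GPUs per node against pending pod shapes makes shape mismatches visible at a glance.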
Saturation analysis asks: are we out of GPUs, or out of something else? GPU workloads can be blocked by node CPU, memory, ephemeral storage, or network. If pods request GPUs but are CPU-starved, the GPU becomes underutilized while the node is saturated on CPU. This is a cost trap: you pay for GPUs to wait. Address it by right-sizing CPU/memory requests, separating preprocessing into CPU pools, or using node classes that balance CPU:GPU ratios.
Bin packing is partly policy. Affinity rules that spread pods evenly can reduce blast radius but increase fragmentation and cost. Conversely, packing tightly can improve utilization but raise risk. Use metrics to choose intentionally: if SLOs are strict, keep headroom; if cost is paramount, pack and rely on priority/preemption to protect critical services.
The practical outcome is that you can quantify “how many more requests can this cluster serve” and “how many GPUs are effectively wasted” with evidence, not intuition.
Observability is only valuable if it changes outcomes during incidents and prevents repeats. Reliability for GPU workloads means writing runbooks that match your troubleshooting checklist (Milestone 1) and practicing an incident workflow that ends with an actionable report (Milestone 5). Your runbook should be opinionated: which dashboards to open first, which kubectl commands to run, and what decisions are allowed (scale replicas, cordon node, roll back model, change batch settings).
Define SLOs that reflect user experience and GPU realities. For inference, include request success rate and tail latency (p95/p99). Add capacity SLOs such as “no more than X minutes of pending time for GPU pods at priority P1,” which catches scheduling failures early. Tie alerts to these SLOs, not to arbitrary utilization thresholds.
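A capacity-SLO alert of that shape might look like the following sketch. The priority_class label comes from kube_pod_info in kube-state-metrics; the class name "p1-critical" and the 10-minute window are placeholders to adapt:

```yaml
groups:
- name: gpu-capacity-slo
  rules:
  - alert: P1GPUPodPendingTooLong
    # High-priority pods stuck Pending beyond the SLO window usually mean
    # quota exhaustion, fragmentation, or a placement rule that cannot match.
    expr: |
      (kube_pod_status_phase{phase="Pending"} == 1)
        and on (namespace, pod)
      kube_pod_info{priority_class="p1-critical"}
    for: 10m
    labels:
      severity: critical
```

Because this alert is anchored to an SLO rather than a utilization threshold, it only pages when user-facing capacity is actually at risk.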
For postmortems, avoid the trap of “GPU was overloaded.” Instead, document the chain: trigger, detection, impact, contributing factors (e.g., CPU throttling starved GPU, fragmentation prevented scale-out, thermal throttling increased p99), and concrete remediations. Good remediations are specific and testable: adjust resource requests, modify affinity to reduce fragmentation, add node pools with different GPU shapes, improve canarying for new models, or add budget guardrails that block runaway replicas.
An incident report template that works well includes: timeline, scope, graphs (latency + queue + GPU util + node pressure), what was tried, what worked, and follow-ups with owners and due dates. Close the loop by updating the runbook and adding a regression test or alert so the same pattern is detected earlier next time.
The practical outcome is resilience and cost control: your team responds faster, wastes fewer GPU-hours, and can justify scaling decisions with data and SLO alignment.
Review questions:
1. Why must observability for GPU workloads go beyond container-level health checks?
2. What is the recommended troubleshooting workflow direction for a latency issue in this chapter?
3. Which set of categories best matches the chapter’s approach to classifying why latency/throughput targets are missed?
4. Which issue is specifically called out as a way GPUs can be wasted even if the service is running?
5. What is the primary purpose of producing an incident report in this chapter’s troubleshooting practice?
GPU-enabled Kubernetes clusters can burn budget faster than any other platform component because they concentrate high hourly rates, bursty training jobs, and “just in case” overprovisioning. In this chapter you will treat cost as a first-class SLO alongside reliability and performance. The goal is not merely to “spend less,” but to create predictable guardrails: teams can run experiments safely, cluster operators can enforce fair sharing, and finance stakeholders can understand where spend is coming from.
You will implement a layered control system. First, you apply hard guardrails (quotas, limits, priority, and preemption) to prevent runaway usage. Second, you enforce governance (RBAC, namespaces, admission control) so GPU access is deliberate and auditable. Third, you add cost visibility signals (labels, allocation dimensions, and reporting patterns) so chargeback/showback becomes possible. Fourth, you optimize with cost-aware scheduling and scaling strategies that respect GPU scarcity and latency requirements. Finally, you complete a timed capstone lab that mirrors certification-style tasks: build, validate, troubleshoot, and document your decisions.
Keep an exam mindset: every object should be explainable, reproducible, and verifiable via kubectl outputs. You are aiming for practical outcomes: fewer surprise bills, fewer scheduling dead-ends, and faster incident triage when GPU workloads fail to start or scale.
Practice note for Milestone 1 (Implement cost guardrails with quotas, limits, and priority): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone 2 (Enforce policy checks for GPU usage and namespaces): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone 3 (Add budget visibility and chargeback/showback signals): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone 4 (Optimize spend with scheduling and scaling strategies): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone 5 (Complete a timed capstone lab mirroring certification tasks): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
FinOps for AI on Kubernetes starts by identifying the cost drivers unique to GPU workloads: GPU node hourly price, idle time (nodes running with no pods using GPUs), inefficient bin-packing (fragmented GPUs or CPU/RAM), and oversized requests that force larger instances. Unlike CPU-only clusters, GPU cost is often dominated by node availability rather than actual GPU utilization. Your control points therefore span both Kubernetes resources and cloud infrastructure: who can request GPUs, how many they can request, how long they can run, and when nodes are allowed to scale out.
Milestone 1 focuses on guardrails you can enforce directly in Kubernetes. Use Namespaces to define billing and ownership boundaries (team, project, environment). Apply ResourceQuotas for requests.nvidia.com/gpu so a single namespace cannot consume the entire accelerator fleet (for extended resources such as GPUs, Kubernetes supports quota only through the requests. prefix). Pair quotas with LimitRanges to require defaults and cap per-pod CPU/memory requests, reducing accidental over-allocation that can strand GPUs due to memory pressure.
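A sketch of this guardrail pair for one team namespace (the namespace name and the numbers are illustrative, chosen to match the 8-GPU worst-case example below):

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: team-a               # illustrative team namespace
spec:
  hard:
    requests.nvidia.com/gpu: "4"  # worst-case GPU blast radius for this team
    requests.cpu: "32"
    requests.memory: 128Gi
---
apiVersion: v1
kind: LimitRange
metadata:
  name: workload-defaults
  namespace: team-a
spec:
  limits:
  - type: Container
    default:                      # applied when a container omits limits
      cpu: "4"
      memory: 16Gi
    defaultRequest:               # applied when a container omits requests
      cpu: "2"
      memory: 8Gi
```

With these in place, a pod that would push the namespace past 4 GPUs is rejected at admission with an explicit quota error, which is exactly the evidence the milestone asks you to produce.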
Engineering judgment: quotas should reflect both fairness and operational reality. If you have 8 GPUs total and want at least two teams to run concurrently, a quota of 4 GPUs per team creates a predictable worst case. Avoid setting quotas so low that teams bypass them with “temporary” namespaces; instead, combine quotas with governance controls in the next section. Common mistakes include forgetting that GPU requests are integer-only (you cannot request 0.5 GPU), setting CPU/memory limits too tightly (causing OOM kills), and using priority without a clear policy—leading to constant preemption churn and poor training throughput.
Practical outcome: after this milestone, you can answer “what is the maximum GPU blast radius per team?” and demonstrate it by trying (and failing) to schedule a pod that exceeds the namespace quota.
Cost controls fail if governance is weak. If anyone can create namespaces, bind cluster roles, or deploy to GPU nodes, then quotas and priorities become optional. Milestone 2 adds structural governance: RBAC defines who can do what, namespaces define where they can do it, and admission control acts as a policy enforcement point that can reject non-compliant workloads before they land on the cluster.
Start by defining standard namespaces per team or project and restricting namespace creation to platform operators. For each team namespace, grant a scoped Role/RoleBinding that allows typical operations (create Deployments, Jobs, Services, ConfigMaps) but not cluster-wide changes (nodes, CRDs, webhook configurations). A common operational pattern is to allow developers to manage workloads while reserving GPU node pool changes and admission policies for cluster admins.
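A scoped role of this shape covers the typical developer operations; the namespace and the identity group name are placeholders:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: workload-editor
  namespace: team-a
rules:
- apiGroups: [""]
  resources: ["pods", "pods/log", "services", "configmaps"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
- apiGroups: ["apps"]
  resources: ["deployments"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
- apiGroups: ["batch"]
  resources: ["jobs"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: team-a-workload-editors
  namespace: team-a
subjects:
- kind: Group
  name: team-a-developers         # placeholder identity-provider group
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: workload-editor
  apiGroup: rbac.authorization.k8s.io
```

Note what is deliberately absent: no verbs on nodes, CRDs, or webhook configurations, and no bind/escalate permissions.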
Prefer Role over ClusterRole whenever possible, and do not grant create on rolebindings unless a team genuinely needs to delegate access. Admission control is where governance becomes real-time enforcement. Even without custom policies, you should enable standard admission plugins (as managed by your distribution) and then layer dedicated policy engines (next section). Typical mistakes: granting broad permissions for convenience (“cluster-admin for the team”), allowing users to label nodes or modify taints (they can route themselves to GPUs), and relying on documentation rather than enforcement (“please don’t use GPUs for notebooks”).
Practical outcome: you can prove governance by attempting to deploy a GPU pod from a non-authorized namespace or user and observing an explicit authorization failure (RBAC) or a policy rejection (admission). This becomes essential evidence during audits and in certification-style troubleshooting.
Policy-as-code turns your GPU governance into versioned, testable artifacts. OPA Gatekeeper and Kyverno both integrate as admission controllers; the difference is authoring style (Rego vs YAML-like rules) and ecosystem. For exam readiness, focus on what policies accomplish: prevent accidental GPU consumption, enforce naming/labeling conventions for allocation, and require scheduling constraints that keep GPU workloads on approved node pools.
Milestone 2 continues here: enforce policy checks for GPU usage and namespaces. Common GPU-focused rules include: only specific namespaces may request nvidia.com/gpu; any pod requesting GPUs must set resources.limits["nvidia.com/gpu"] equal to requests (avoids “request 1, limit 0” misconfigurations); GPU pods must tolerate the GPU node taint and include node affinity to GPU-labeled nodes; and required labels like cost-center, team, and environment must be present.
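In Kyverno, the namespace restriction and required-label rules could be sketched as a single ClusterPolicy. The approved namespace list and label keys are assumptions, and the X() negation-anchor syntax should be checked against your Kyverno version:

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: gpu-guardrails
spec:
  validationFailureAction: Enforce   # run in Audit mode first, then enforce
  rules:
  - name: no-gpu-outside-approved-namespaces
    match:
      any:
      - resources:
          kinds: ["Pod"]
    exclude:
      any:
      - resources:
          # kube-system is exempted so system DaemonSets are not blocked.
          namespaces: ["kube-system", "ml-team-a", "ml-team-b"]
    validate:
      message: "nvidia.com/gpu may only be requested in approved namespaces."
      pattern:
        spec:
          containers:
          # X() asserts the key must be absent when resources/limits exist.
          - =(resources):
              =(limits):
                X(nvidia.com/gpu): "null"
  - name: require-ownership-labels
    match:
      any:
      - resources:
          kinds: ["Pod"]
    validate:
      message: "team and cost-center labels are required."
      pattern:
        metadata:
          labels:
            team: "?*"
            cost-center: "?*"
```

The rejection messages are the cheap feedback loop described below: the developer sees exactly which convention was violated at submit time.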
Engineering judgment: keep policies minimal and composable. Overly strict rules cause friction and bypass behavior (shadow clusters, direct cloud VMs). Start with deny rules that prevent the biggest failures: unbounded GPU use and missing ownership metadata. Then add progressive enforcement (“warn/audit” mode first, then “enforce”). Mistakes include writing policies that block system components (e.g., device plugin DaemonSets), forgetting to exempt kube-system, and enforcing affinity in a way that breaks portability across environments.
Practical outcome: when a developer submits a GPU Job without the required labels or in the wrong namespace, the cluster rejects it immediately with a human-readable message. That converts an expensive surprise into a quick, cheap feedback loop.
Once hard guardrails and governance are in place, you can optimize spend without sacrificing throughput. Milestone 4 is about making scheduling and scaling decisions that reduce idle GPU time and avoid expensive node types when cheaper options satisfy requirements. In Kubernetes, cost-aware scheduling is not a single feature; it is a set of patterns: taints/tolerations to isolate GPU nodes, node affinity to target the right accelerator class, and autoscaling policies that scale up only when justified by pending GPU pods.
Start with node pools by accelerator type (e.g., T4 vs A10 vs A100) and label nodes accordingly (e.g., gpu.nvidia.com/class=a10, gpu.nvidia.com/memory=24gb). For training jobs with flexible performance needs, prefer cheaper GPUs and allow fallbacks via preferredDuringSchedulingIgnoredDuringExecution affinity. For latency-critical inference, use strict affinity to a known class and pair it with a PriorityClass so inference pods preempt lower value training if the cluster is saturated.
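For a flexible training job, the soft-preference pattern looks like the following sketch. The taint key, label keys, and image are assumptions that mirror the illustrative labeling scheme above:

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: inference-critical
value: 100000
description: "Latency-critical inference; may preempt lower-value training."
---
# Fragment of a training pod: prefer cheap GPUs, fall back to other classes.
apiVersion: v1
kind: Pod
metadata:
  name: flexible-trainer
spec:
  tolerations:
  - key: nvidia.com/gpu            # assumed taint key on GPU nodes
    operator: Exists
    effect: NoSchedule
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
          - key: gpu.nvidia.com/class   # illustrative label from this section
            operator: In
            values: ["t4"]
  containers:
  - name: train
    image: registry.example.com/train:latest   # placeholder image
    resources:
      limits:
        nvidia.com/gpu: 1
```

Inference pods would instead use requiredDuringSchedulingIgnoredDuringExecution for their accelerator class and set priorityClassName: inference-critical.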
Common mistakes: over-requesting memory “just in case,” which forces larger (more expensive) instances; mixing incompatible workloads on the same GPU node without considering contention; and allowing node autoscaler to scale out for pods that are unschedulable due to missing tolerations/affinity—leading to wasted nodes. A practical troubleshooting workflow is: check pending pods, inspect events for “0/… nodes are available” reasons, verify tolerations and affinity, confirm the device plugin advertises GPUs, then confirm the autoscaler is seeing pending demand.
Practical outcome: your cluster scales GPUs up when real work arrives, packs them efficiently, and scales down when idle—while enforcing that only approved workloads can trigger that spend.
Milestone 3 adds budget visibility and showback/chargeback signals. Even strong controls are hard to sustain if teams cannot see the financial impact of their choices. In Kubernetes, cost allocation typically begins with consistent metadata and ends with reports that map resource consumption to owners. The key is to decide which dimensions you need (team, project, cost center, environment, model name) and enforce them as labels or annotations on namespaces and workloads.
Implement a labeling standard and make it enforceable (via the policies in Section 6.3). At minimum, label namespaces with team, cost-center, and environment. Then propagate or require equivalent labels on workloads to support granular views for shared namespaces. Add workload identifiers such as app, model, or experiment to separate long-running inference from short-lived training jobs.
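The namespace layer of that standard can be as small as this (names and the cost-center code are placeholders):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: ml-team-a                 # illustrative
  labels:
    team: ml-team-a
    cost-center: cc-1234          # placeholder cost-center code
    environment: production
```

Because these labels are enforced at admission, every downstream cost report can group reliably on them instead of parsing workload names.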
Engineering judgment: showback is often the first step—share dashboards and weekly reports before implementing internal chargeback. Focus on trends (idle GPU hours, cost per training run, cost per 1k inferences) rather than perfect precision. Mistakes include relying on pod names (mutable) instead of labels (stable intent), failing to label ephemeral Jobs (which then appear as “unallocated”), and ignoring shared overhead (device plugin, monitoring, system daemons). Your reporting should clearly separate “shared platform cost” from “team consumption.”
Practical outcome: you can produce a report that answers “which team spent the most GPU-hours this week and on which model or environment,” and you can justify it with enforced labels and auditable policies.
Milestone 5 is a timed capstone that mirrors certification tasks: implement controls, validate behavior, troubleshoot scheduling failures, and document outcomes. Treat this as an operational runbook exercise. The goal is not only to configure objects, but to prove the system works with observable evidence (events, policy denials, quota errors, and successful GPU scheduling).
Use kubectl describe pod and events to diagnose pending pods; check node labels, taints, tolerations, and affinity; and confirm GPU resources appear in kubectl describe node. Common capstone failure modes are predictable: policies blocking system namespaces, quotas applied to the wrong namespace, GPU pods missing tolerations, and the autoscaler scaling out for unschedulable pods due to affinity mistakes. Your timed strategy should be: implement one layer at a time, validate immediately, and only then add the next layer. If something breaks, roll back the last change and re-test; do not “pile on” changes and hope the cluster recovers.
Practical outcome: by the end of the capstone, you have a defensible GPU multi-tenant platform with enforced guardrails, visible allocation signals, and a repeatable troubleshooting workflow—exactly the operational posture expected in real environments and reflected in certification-style tasks.
Review questions:
1. Why does Chapter 6 emphasize treating cost as a first-class SLO alongside reliability and performance for GPU-enabled Kubernetes clusters?
2. Which set best represents the chapter’s first layer of controls designed to prevent runaway GPU usage?
3. What is the primary purpose of the governance layer described in Chapter 6?
4. How does Chapter 6 suggest enabling chargeback/showback for GPU spend within the cluster?
5. What does the chapter’s “exam mindset” for the timed capstone lab most strongly require?