
Kubernetes for AI Workloads Lab: GPU Scheduling & Cost Control

AI Certifications & Exam Prep — Intermediate

Run GPU-accelerated AI on Kubernetes—fast, scalable, and cost-aware.

Intermediate · kubernetes · gpu-scheduling · ai-workloads · autoscaling

Why this course exists

AI workloads stress Kubernetes in ways typical web apps do not: GPUs are scarce and expensive, scheduling constraints are stricter, scaling signals are different, and a single misconfigured request can burn budget fast. This book-style lab course teaches you the practical patterns used to run training jobs and inference services on GPU-enabled Kubernetes clusters—while staying exam-ready and cost-aware.

You’ll work through a coherent six-chapter progression: from building a GPU-capable lab environment, to enforcing correct placement and multi-tenant controls, to autoscaling, troubleshooting, and finally governance and cost guardrails. Each chapter is structured like a short technical book chapter with milestones and sub-sections so you can study sequentially or revisit specific topics when preparing for certification-style tasks.

What you’ll build (and be able to repeat)

By the end, you will have a repeatable approach for deploying GPU-backed workloads with clear resource sizing, predictable scheduling behavior, and observability that surfaces GPU bottlenecks quickly. You’ll also implement the policies and controls that reduce accidental spend and enforce team boundaries in shared clusters.

  • A GPU-ready Kubernetes setup with validated device plugin support
  • Scheduling rules that keep GPU nodes protected and correctly utilized
  • Autoscaling at the pod layer (HPA/VPA) and node layer (cluster autoscaling patterns)
  • Dashboards and alerts that explain “why it’s slow” in GPU terms
  • FinOps-oriented guardrails: quotas, priorities, policies, and cost attribution signals

Who this is for

This course is designed for Kubernetes users who can already read YAML and operate basic workloads, and now need production-grade patterns for AI and GPU use cases. If you’re preparing for an AI platform, MLOps, or Kubernetes-adjacent certification that expects hands-on competence, the chapter milestones will feel like timed lab objectives.

How the 6 chapters fit together

Chapter 1 establishes the lab and validates GPU capability so every later exercise is grounded in a working environment. Chapter 2 focuses on the scheduling primitives that determine whether GPU pods land where they should—and why they sometimes don’t. Chapter 3 applies those primitives to real workload types: inference services and batch training jobs with storage, rollouts, and performance hygiene. Chapter 4 adds scaling: selecting the right signals and safely combining pod scaling with node scaling. Chapter 5 teaches you to interpret the system under pressure, using metrics, events, and capacity analysis to troubleshoot latency, OOMs, and fragmentation. Chapter 6 closes with governance and cost controls—then a capstone sequence that mirrors common certification lab tasks.

Get started

If you’re ready to build exam-ready Kubernetes AI skills through a structured, lab-first book format, you can register for free and start the first chapter. Want to compare options before committing? You can also browse all courses on Edu AI.

What You Will Learn

  • Install and validate GPU support on Kubernetes using the NVIDIA device plugin
  • Design GPU scheduling with taints/tolerations, node affinity, and resource requests/limits
  • Apply autoscaling for AI workloads with HPA/VPA and node autoscaling patterns
  • Implement safe multi-tenancy with quotas, limits, priority classes, and preemption
  • Monitor GPU, node, and workload performance using metrics-driven troubleshooting
  • Control spend with FinOps guardrails, budget alerts, and cost-aware scheduling policies
  • Harden AI runtime security with RBAC, Pod Security, and image governance basics

Requirements

  • Working knowledge of containers and Kubernetes basics (pods, deployments, services)
  • Comfort using kubectl and reading YAML manifests
  • A local Kubernetes environment (kind/minikube) plus access to a GPU-enabled cluster (cloud or on-prem) recommended
  • Basic understanding of ML training/inference concepts (batch jobs vs services)

Chapter 1: Lab Setup for GPU-Ready Kubernetes

  • Milestone 1: Build the lab environment and toolchain
  • Milestone 2: Provision a GPU node pool and validate drivers
  • Milestone 3: Install NVIDIA device plugin and run a smoke test
  • Milestone 4: Baseline observability for nodes, pods, and GPUs
  • Milestone 5: Capture a reproducible lab checklist for exam-style tasks

Chapter 2: GPU Scheduling Fundamentals for AI Workloads

  • Milestone 1: Schedule the first GPU pod with correct resource requests
  • Milestone 2: Enforce placement using labels, affinity, and selectors
  • Milestone 3: Protect GPU nodes with taints and tolerations
  • Milestone 4: Prevent noisy neighbors with quotas and limit ranges
  • Milestone 5: Resolve scheduling failures using events and logs

Chapter 3: Running AI Jobs and Services on GPUs

  • Milestone 1: Package an inference service with GPU access
  • Milestone 2: Run a batch training job and manage retries
  • Milestone 3: Optimize images and startup for faster GPU time-to-first-token
  • Milestone 4: Configure storage and data paths for throughput
  • Milestone 5: Apply rollout safety for GPU-backed deployments

Chapter 4: Autoscaling Patterns for GPU Clusters

  • Milestone 1: Scale an inference service with HPA using custom metrics
  • Milestone 2: Use VPA safely for AI workloads and avoid thrash
  • Milestone 3: Trigger node scale-out with pending GPU pods
  • Milestone 4: Reduce idle time with scale-down and disruption controls
  • Milestone 5: Validate autoscaling with load tests and dashboards

Chapter 5: Observability and Troubleshooting for GPU Workloads

  • Milestone 1: Build a GPU-focused troubleshooting checklist
  • Milestone 2: Trace a latency issue from service to node to GPU
  • Milestone 3: Diagnose OOM, throttling, and GPU memory fragmentation
  • Milestone 4: Investigate scheduling hot spots and bin-packing gaps
  • Milestone 5: Produce an incident report with actionable remediation

Chapter 6: Cost Controls, Governance, and Exam-Style Capstone

  • Milestone 1: Implement cost guardrails with quotas, limits, and priority
  • Milestone 2: Enforce policy checks for GPU usage and namespaces
  • Milestone 3: Add budget visibility and chargeback/showback signals
  • Milestone 4: Optimize spend with scheduling and scaling strategies
  • Milestone 5: Complete a timed capstone lab mirroring certification tasks

Sofia Chen

Senior Platform Engineer (Kubernetes, MLOps, FinOps)

Sofia Chen is a senior platform engineer specializing in Kubernetes platform design for ML and GPU workloads. She has led cost-optimization and autoscaling initiatives across multi-tenant clusters, integrating observability and policy controls to keep AI infrastructure reliable and auditable.

Chapter 1: Lab Setup for GPU-Ready Kubernetes

This lab course assumes you already know basic Kubernetes objects and can read YAML. The goal of Chapter 1 is to make your cluster “GPU-ready” in a way that is repeatable under exam-style time pressure: you will build a consistent toolchain, provision a GPU node pool, validate drivers, install the NVIDIA device plugin, and stand up enough observability to troubleshoot scheduling and performance issues without guesswork.

GPU enablement is not a single switch. A working setup requires alignment across layers: the cloud (or local) GPU hardware, the host OS drivers, the container runtime configuration, Kubernetes scheduling and admission behavior, and a plugin that advertises GPU resources to the kubelet. The most common failure mode is to validate only one layer (for example, “nvidia-smi works on the node”) and assume the rest will follow. In this chapter, you’ll verify each dependency in sequence and record a checklist you can re-run later.

Engineering judgment matters even in a lab: you’ll choose between local and cloud environments based on cost and reproducibility; you’ll select Kubernetes versions that match plugin support; and you’ll decide what “minimum viable observability” looks like for GPU workloads. By the end, you should be able to deploy a pod that requests nvidia.com/gpu and confirm it is scheduled correctly, visible to metrics, and ready for the later chapters on scheduling constraints, autoscaling patterns, multi-tenancy guardrails, and FinOps controls.

Practice note (applies to every milestone above): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Sections in this chapter
Section 1.1: Choosing your lab: local vs cloud GPU clusters

Your first design decision is where to run the lab. Local clusters (kind, k3d, MicroK8s) are great for learning Kubernetes objects, but GPU labs add constraints: you need a CUDA-capable GPU, matching drivers on the host, and a container runtime stack that can pass the device into containers. If you already have an NVIDIA GPU workstation, local can be fast and cheap. If you do not, cloud GPU nodes usually save time and reduce driver friction.

Cloud clusters (managed Kubernetes such as EKS/AKS/GKE) are the most exam-realistic because they resemble production environments: separate node pools, autoscaling, IAM policies, and standardized images. They also align naturally with later outcomes like cost guardrails and cost-aware scheduling. The tradeoff is spend. For labs, optimize for “short, repeatable sessions”: use a dedicated GPU node pool with small instance types, scale it to zero when not in use, and apply time-based shutdown automation if your platform supports it.

  • Local: best when you already own the GPU and want offline repeatability. Watch for mismatched driver/CUDA versions and missing kernel headers.
  • Cloud: best when you want reliable provisioning and easy node pool lifecycle. Watch for quota limits (GPU scarcity) and hourly costs.
  • Hybrid: do control-plane work locally (manifests, YAML, Helm) but validate GPU execution in cloud. This keeps costs low while still testing real GPUs.

Milestone 1 (toolchain) starts here: choose one environment, then standardize your tools. At minimum install kubectl, helm, a terminal YAML editor, and a way to authenticate to the cluster. Keep a single “lab repo” with manifests and notes so you can reproduce tasks quickly—this is essential for exam-style performance.

Section 1.2: Kubernetes versions, runtimes, and GPU prerequisites

Before installing anything GPU-related, confirm your Kubernetes baseline is compatible. The NVIDIA device plugin tracks Kubernetes feature changes (especially around device allocation, security contexts, and runtime behavior). As a rule, use a supported Kubernetes version within the “current minus a few” range recommended by the plugin’s documentation and your managed service. Avoid combining very old Kubernetes with new container runtimes or vice versa; GPU enablement is sensitive to version skew.

Next, identify your container runtime. Most clusters today use containerd; Docker Engine is less common for kubelets, and the dockershim was removed in Kubernetes 1.24. GPU workloads require the runtime to understand the NVIDIA container stack (or a compatible CDI configuration); otherwise containers will start but not see devices. Don’t postpone this check: many “device plugin installed but no GPUs appear” issues are runtime misconfiguration, not the plugin itself.

Milestone 2 (provision a GPU node pool) begins with prerequisites: pick a node image that is known to work with GPUs. Managed services often offer GPU-optimized images or add-ons. If you roll your own nodes, ensure the OS kernel version supports the driver version you plan to install. Also confirm node sizing: GPU nodes need adequate CPU and memory for data loading and preprocessing; starving the node will make GPU utilization look “bad” even when scheduling is correct.

  • Check Kubernetes node status: kubectl get nodes -o wide and ensure nodes are Ready.
  • Confirm runtime: kubectl get nodes -o jsonpath='{.items[0].status.nodeInfo.containerRuntimeVersion}'.
  • Plan your GPU node pool labels now (e.g., nodepool=gpu) to support later scheduling with node affinity and taints.

Common mistake: mixing multiple runtimes or custom runtime configs across node pools. For a lab, standardize the GPU pool first, then expand complexity later when you study multi-tenancy and cost-aware scheduling.

Section 1.3: NVIDIA drivers, container toolkit, and runtime class

The GPU hardware is invisible to Kubernetes until the host OS can drive it and containers can access it. Start with the node-level truth: SSH into a GPU node (or use your cloud provider’s session manager) and run nvidia-smi. If nvidia-smi fails, do not proceed to Kubernetes plugin steps; fix drivers first. This is Milestone 2’s “validate drivers” checkpoint.

Driver installation strategy depends on your platform. Many managed Kubernetes services provide an official GPU driver installer (as an add-on or a DaemonSet). That approach is generally safer than manual installs because it aligns kernel modules, driver versions, and reboot behavior. If you install drivers yourself, match the driver to the GPU generation and the CUDA compatibility you need. For labs, you don’t need the newest CUDA, you need a stable match.

Next is the container layer: NVIDIA Container Toolkit (or a CDI-based setup) configures the runtime to mount GPU devices and inject required libraries into containers. On containerd, this usually means configuring an NVIDIA runtime or enabling CDI and restarting containerd. A frequent pitfall is to install drivers and toolkit but forget the runtime restart, leaving pods unable to see GPUs until the node is recycled.

Finally, consider using a RuntimeClass to make GPU runtime selection explicit. In some environments you may define a runtime class (e.g., nvidia) pointing to an NVIDIA-aware runtime handler. This reduces ambiguity and makes your manifests clearer for exams and for later multi-tenant controls. If your environment uses a single runtime with CDI enabled, you may not need a runtime class—but you should still understand when it’s required.
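Where a runtime class applies, the object itself is small. A minimal sketch, assuming your containerd configuration defines a runtime handler named nvidia (verify the handler name on your nodes before relying on it):

```yaml
# Illustrative RuntimeClass making the NVIDIA runtime handler explicit.
# The handler value must match a runtime configured in containerd on
# your GPU nodes; "nvidia" is a common but not universal name.
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
```

Pods then opt in with `runtimeClassName: nvidia` in their spec, which keeps GPU runtime selection visible in review and in exam answers.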

  • Validate driver: nvidia-smi shows GPU and driver version.
  • Validate runtime integration: run a simple CUDA container on the node if possible, or proceed to Kubernetes smoke tests in Section 1.5.
  • Record versions (driver, toolkit, runtime) in your lab checklist for reproducibility.
Section 1.4: Deploying the NVIDIA device plugin (and common pitfalls)

Milestone 3 is where Kubernetes becomes GPU-aware: the NVIDIA device plugin registers GPU resources with the kubelet so that schedulers can place pods based on GPU requests. Without it, your nodes may have GPUs physically, but Kubernetes will not advertise nvidia.com/gpu, and pods requesting GPUs will remain Pending.

Deploy the plugin using the vendor-recommended manifest or Helm chart. In a lab, prefer the official installation path because it encodes tolerations, security contexts, and host mounts that change over time. Once installed, the plugin typically runs as a DaemonSet on GPU-capable nodes. If it schedules on non-GPU nodes, that’s not always harmful, but it can create noise; use node selectors or affinity to target your GPU pool labels.
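As a sketch of that targeting, the scheduling-related fields you would overlay on the official DaemonSet look like the fragment below. The label key `nodepool=gpu` and the taint key are examples; match them to the labels and taints you actually applied to your GPU pool, and keep the rest of the vendor manifest intact.

```yaml
# Illustrative scheduling fields for the device plugin DaemonSet:
# run only on nodes labeled nodepool=gpu and tolerate the pool's taint.
# Overlay onto the official manifest; do not replace it wholesale.
spec:
  template:
    spec:
      nodeSelector:
        nodepool: gpu
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
```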

Common pitfalls to recognize quickly:

  • Plugin doesn’t start: often due to missing privileges, incompatible OS, or missing host paths. Check DaemonSet pod logs and events.
  • No GPU resource appears: drivers or runtime integration are incomplete; the plugin can run but find zero devices. Verify nvidia-smi and runtime config.
  • Conflicting plugins: installing multiple GPU-related DaemonSets (vendor AMI add-ons plus manual plugin) can lead to confusing results. Keep one authoritative setup.
  • Node taints: GPU node pools are often tainted to prevent accidental scheduling. If the plugin DaemonSet lacks tolerations, it won’t run on the GPU nodes.

Also note the relationship to later scheduling outcomes: once GPUs are advertised as allocatable resources, Kubernetes will enforce GPU requests as integer resources. This is the foundation for cost control (don’t run GPU pods without requesting GPUs) and for safe multi-tenancy (quotas and priority classes depend on accurate resource accounting).

Section 1.5: Verifying GPU resources via kubectl and test containers

Milestone 3 ends only when you can prove, via Kubernetes, that GPUs are allocatable and usable from a pod. Start with cluster-level inspection. Run kubectl describe node <gpu-node> and look for Capacity and Allocatable entries for nvidia.com/gpu. If the resource isn’t listed, the device plugin isn’t registering devices (or is running on the wrong nodes). Also review kubectl get pods -n kube-system to ensure the plugin DaemonSet has a Ready pod on each GPU node.

Next, run a smoke test pod that requests a GPU. Keep the spec minimal and explicit because you’ll reuse it later when debugging scheduling behavior with taints/tolerations and node affinity. The key is the resource request/limit: GPUs are typically requested via resources.limits (and sometimes requests) as an integer. If you forget the GPU limit, the pod may schedule onto a GPU node but won’t be granted a device, leading to confusing runtime errors.

When the pod starts, execute nvidia-smi inside the container (or run a CUDA sample) to confirm device visibility. If nvidia-smi works in the container, you have validated the full chain: driver → runtime → device plugin → Kubernetes scheduling → container execution.
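A minimal smoke-test manifest might look like this. The image tag is an example; pin one whose CUDA version is compatible with your node driver. The integer GPU limit is the line that makes the scheduler and the runtime do the right thing:

```yaml
# Minimal GPU smoke test: schedules onto a node advertising
# nvidia.com/gpu and runs nvidia-smi (injected by the NVIDIA runtime).
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda-check
      image: nvidia/cuda:12.4.1-base-ubuntu22.04  # example tag; match your driver
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
```

Keep this manifest in your lab repo; you will reuse it in Chapter 2 when debugging taints, tolerations, and affinity.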

  • Check resources: kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu' (quote the expression so the shell preserves the escaped dot in the resource name)
  • Check pod placement: kubectl get pod -o wide and confirm it landed on the GPU node pool.
  • Debug Pending pods: kubectl describe pod and read Events for “Insufficient nvidia.com/gpu” or missing tolerations.

Milestone 5 begins here: capture the exact commands you used and the success criteria (what output proves GPUs are working). In exam scenarios, you’re graded on outcomes; your checklist should map “symptom → command → expected signal → fix.”

Section 1.6: Metrics basics: node exporter, DCGM metrics, and dashboards

Milestone 4 establishes baseline observability so you can troubleshoot performance and cost signals later. For GPU workloads, “pod is Running” is not enough: you need to know whether the GPU is actually utilized, whether the node is CPU/memory bottlenecked, and whether your workload is throttled by I/O or networking. A minimal but effective stack combines node-level metrics, Kubernetes state metrics, and GPU-specific telemetry.

At the node level, node exporter provides CPU, memory, disk, and network metrics. Pair it with Kubernetes metrics sources (commonly kube-state-metrics and a metrics backend such as Prometheus) to understand pod scheduling and resource requests. For GPUs, use NVIDIA’s DCGM (Data Center GPU Manager) exporter to expose utilization, memory usage, temperature, power draw, and sometimes per-process signals. DCGM metrics are essential for cost control: an idle GPU with a Running pod is pure waste, and you want that visible immediately.

Dashboards are not decoration—they shorten incident and lab-debug cycles. Even a single dashboard with (1) GPU utilization and memory, (2) node CPU/memory pressure, and (3) pod counts by namespace helps you correlate “why is training slow?” with real constraints. In later chapters, these same metrics drive autoscaling decisions (HPA/VPA patterns) and multi-tenant protections (quotas and priority classes), so building the habit now pays off.

  • Baseline signals to capture: GPU utilization %, GPU memory used, GPU power draw, node CPU saturation, node memory available, pod restarts, Pending pods.
  • Common mistake: installing GPU metrics but forgetting RBAC or ServiceMonitor wiring, resulting in empty dashboards.
  • Practical outcome: you can answer “Is this slow because of GPU starvation, CPU preprocessing, or scheduling misplacement?” within minutes.
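If you run the Prometheus Operator, the missing "wiring" is often just a ServiceMonitor. The fragment below is a sketch under assumptions: the namespace, label selector, and port name are placeholders that must match your actual dcgm-exporter Service.

```yaml
# Assumes Prometheus Operator CRDs are installed. The app label,
# namespace names, and port name are placeholders; align them with
# your dcgm-exporter install or the dashboard will stay empty.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: dcgm-exporter
  namespaceSelector:
    matchNames: ["gpu-monitoring"]
  endpoints:
    - port: metrics
      interval: 30s
```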

Close the chapter by updating your reproducible lab checklist: toolchain versions, cluster version, node pool labels/taints, driver/toolkit versions, device plugin install method, smoke test manifest, and the metric endpoints/dashboards you rely on. This checklist is your reset button when something breaks—and your time saver when you need to re-create the environment under exam constraints.

Chapter milestones
  • Milestone 1: Build the lab environment and toolchain
  • Milestone 2: Provision a GPU node pool and validate drivers
  • Milestone 3: Install NVIDIA device plugin and run a smoke test
  • Milestone 4: Baseline observability for nodes, pods, and GPUs
  • Milestone 5: Capture a reproducible lab checklist for exam-style tasks
Chapter quiz

1. What is the main outcome Chapter 1 is aiming for by the end of the lab setup?

Correct answer: A repeatable, exam-ready GPU-enabled Kubernetes cluster where pods can request nvidia.com/gpu and be validated end-to-end
Chapter 1 focuses on making the cluster GPU-ready in a repeatable way, including scheduling a pod that requests nvidia.com/gpu and validating it with observability.

2. Why does the chapter emphasize verifying GPU dependencies in sequence instead of relying on a single check like “nvidia-smi works”?

Correct answer: Because GPU enablement requires alignment across hardware, drivers, runtime, Kubernetes behavior, and the device plugin
The chapter states GPU enablement is not a single switch and that validating only one layer is a common failure mode.

3. Which component is responsible for advertising GPU resources to Kubernetes so the kubelet can schedule GPU requests?

Correct answer: The NVIDIA device plugin
The chapter notes that a plugin is needed to advertise GPU resources to the kubelet, specifically the NVIDIA device plugin.

4. Which scenario best reflects the “most common failure mode” described in Chapter 1?

Correct answer: Confirming GPU drivers work on the node and assuming scheduling and runtime integration will automatically work
The chapter explicitly calls out validating only one layer (e.g., nvidia-smi on the node) and assuming the rest will follow.

5. What does Chapter 1 consider “minimum viable observability” for GPU readiness?

Correct answer: Enough visibility into nodes, pods, and GPUs to troubleshoot scheduling and performance issues without guesswork
The chapter says you will stand up enough observability to troubleshoot scheduling and performance issues and mentions baselining observability for nodes, pods, and GPUs.

Chapter 2: GPU Scheduling Fundamentals for AI Workloads

GPU scheduling is the moment your Kubernetes cluster stops being “a place to run containers” and becomes a dependable platform for AI training and inference. The scheduler has one job: pick a node that can run your pod. For AI, that decision must account for scarce accelerators, heterogeneous hardware, and cost-sensitive capacity. In this chapter you will build the mental model and the muscle memory to place GPU workloads correctly, keep GPU nodes protected, avoid noisy-neighbor failures, and troubleshoot Pending pods quickly.

The workflow you’ll repeat in real environments looks like this: (1) validate GPU discovery via the NVIDIA device plugin, (2) request GPU resources correctly so the scheduler can do its job, (3) enforce placement with labels/affinity and guard GPU nodes with taints/tolerations, (4) apply multi-tenant controls (quotas, limits, priority), and (5) debug scheduling failures using events and scheduler hints. Each milestone in this chapter maps to those steps, and each step directly affects reliability and spend: mis-specified requests waste expensive GPU time; weak placement rules mix inference and training; missing quotas allow a single team to consume the fleet.

  • Milestone 1: Schedule the first GPU pod with correct resource requests.
  • Milestone 2: Enforce placement using labels, affinity, and selectors.
  • Milestone 3: Protect GPU nodes with taints and tolerations.
  • Milestone 4: Prevent noisy neighbors with quotas and limit ranges.
  • Milestone 5: Resolve scheduling failures using events and logs.

Engineering judgment matters throughout. The “most strict” placement rule is not always the best; overly rigid policies strand capacity and drive autoscalers to add nodes unnecessarily. Conversely, being too permissive can silently put expensive models on the wrong GPUs, reduce throughput, or cause unpredictable eviction behavior. The goal is controlled flexibility: tell Kubernetes what must be true (hard constraints) and what would be nice (soft preferences), then verify with metrics and events.
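That hard-versus-soft split maps directly onto node affinity. A sketch, using example label keys (align them with your own labeling scheme from Chapter 1): the workload must land on the GPU pool, and prefers A100 nodes when any are free.

```yaml
# Pod-spec fragment: "controlled flexibility" in scheduling terms.
# required... = hard constraint (pod stays Pending if unmet);
# preferred... = soft preference (scored, never blocks scheduling).
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: nodepool
              operator: In
              values: ["gpu"]
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 50
        preference:
          matchExpressions:
            - key: gpu.nvidia.com/model  # example key, not a standard label
              operator: In
              values: ["A100"]
```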

Practice note (applies to every milestone above): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Sections in this chapter
Section 2.1: GPU resources in Kubernetes: requests, limits, and extended resources

Kubernetes schedules pods based on requests (what you need to run) and enforces limits (the maximum you’re allowed to consume) for CPU and memory. GPUs are different: they appear as extended resources exposed by a device plugin, most commonly nvidia.com/gpu. Extended resources are integer-only and are not overcommitted by Kubernetes; if you request 1 GPU, the scheduler must find a node with at least 1 allocatable GPU.

Milestone 1 is to run a first GPU pod that schedules predictably. Your pod spec must include a GPU request (and typically a matching limit). If you omit it, the pod may land on a CPU node and then fail at runtime when CUDA libraries can’t find a device. If you request GPUs but the device plugin is missing or misconfigured, the pod will remain Pending because no node advertises that extended resource.

A practical minimal container spec looks like: set resources.limits["nvidia.com/gpu"]: 1 (and optionally the same under requests). For GPUs, many teams set request=limit to make intent explicit and avoid confusion during reviews. Keep CPU/memory requests realistic as well; an AI job that requests 1 GPU but forgets CPU/memory might schedule onto a GPU node yet starve itself (or its neighbors) at runtime.
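In manifest form, that container resources stanza might read as follows. The CPU and memory sizes are illustrative; the structural point is the explicit, equal GPU request and limit alongside realistic CPU/memory for preprocessing:

```yaml
# Container resources for a single-GPU workload. Extended resources
# like nvidia.com/gpu cannot be overcommitted, so request equals limit;
# CPU/memory sizes below are examples, not recommendations.
resources:
  requests:
    cpu: "4"
    memory: 16Gi
    nvidia.com/gpu: 1
  limits:
    cpu: "8"
    memory: 16Gi
    nvidia.com/gpu: 1
```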

  • Common mistake: requesting fractional GPUs (e.g., 0.5). Extended resources require whole integers unless you adopt a separate GPU sharing mechanism (MIG profiles or vendor features) that exposes smaller allocatable units as distinct resources.
  • Nuance on “limit without request”: for extended resources like nvidia.com/gpu, specifying only the limit is valid—the request defaults to the limit, and if you set both they must be equal. Explicit request=limit remains the clearest convention for reviews.
  • Outcome: a pod that requests 1 GPU will only be scheduled when and where 1 GPU is actually available, preventing silent CPU fallback and reducing wasted debugging time.

Before you trust scheduling, validate node capacity: kubectl describe node <node> should show Capacity and Allocatable for nvidia.com/gpu. If it does not, fix the device plugin or node driver installation first; scheduling rules cannot compensate for missing resources.

Section 2.2: Node labeling strategy for heterogeneous GPU fleets

Most organizations end up with a heterogeneous GPU fleet: different GPU models (T4, L4, A10, A100/H100), different memory sizes, and sometimes different drivers or CUDA capabilities. If you treat all GPU nodes as equivalent, you’ll get mismatches: a training job needing 80GB may land on a 16GB card; an inference service optimized for a specific architecture may lose performance. The fix starts with a clear labeling strategy (Milestone 2’s foundation).

Use labels to express stable, meaningful scheduling dimensions. Prefer a small vocabulary that survives node replacement and autoscaling. Examples include: gpu.nvidia.com/model=A100, gpu.nvidia.com/memory-gb=80, gpu.nvidia.com/mig-enabled=true, workload.gpu/tier=training, or node.kubernetes.io/instance-type from the cloud provider. Avoid labels that change frequently (like “currently free”)—that’s what metrics and schedulers are for.
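A minimal sketch of how these labels get consumed, assuming the label taxonomy above (the keys are the chapter's examples, not standard Kubernetes labels; image and pod name are placeholders):

```yaml
# Hardware labels baked into the GPU node group template, e.g.:
#   gpu.nvidia.com/model: A100
#   gpu.nvidia.com/memory-gb: "80"
#   workload.gpu/tier: training
---
apiVersion: v1
kind: Pod
metadata:
  name: trainer                          # illustrative
spec:
  nodeSelector:
    gpu.nvidia.com/model: A100           # hardware label: what the node is
    workload.gpu/tier: training          # policy label: how you intend to use it
  containers:
    - name: train
      image: registry.example.com/train:latest  # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1
```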

  • Label at provisioning time: bake labels into node group templates (managed node groups, autoscaler node templates) so new nodes join with correct metadata.
  • Separate “hardware” vs “policy” labels: hardware labels describe what the node is; policy labels describe how you intend to use it (e.g., “inference-only”). Keeping them separate makes future changes safer.
  • Keep labels auditable: document label meanings and owners. A label taxonomy prevents two teams from using conflicting conventions.

A common mistake is relying only on nodeSelector with one label such as gpu=true. That puts everything on “any GPU,” which is rarely what you want once you introduce multiple GPU models or specialized node pools. Another mistake is encoding too much detail (driver versions, minor differences) into scheduling constraints, which can make pods unschedulable and trigger unnecessary scale-outs—directly increasing cost.

Practical outcome: with a consistent labeling strategy, you can steer jobs to the right GPU class, keep expensive nodes reserved for the workloads that need them, and give autoscalers a clean target when expanding a specific pool.

Section 2.3: Node affinity vs pod affinity/anti-affinity for AI placement

Affinity rules are how you express placement intent beyond basic selectors. For AI workloads, the key distinction is: node affinity places pods on nodes with certain labels (hardware/pool constraints), while pod affinity/anti-affinity places pods relative to other pods (co-locate or spread). Milestone 2 uses these tools to enforce placement without overconstraining the cluster.

Use requiredDuringSchedulingIgnoredDuringExecution (hard constraints) for must-have requirements: “must run on A100 nodes,” “must run where MIG is enabled,” or “must run in the inference pool.” Use preferredDuringSchedulingIgnoredDuringExecution (soft preferences) for optimizations: “prefer nodes in the same zone as my data cache,” or “prefer nodes with a particular GPU model but allow fallback.” Soft preferences reduce the chance of Pending pods and help control costs by using available capacity rather than scaling out immediately.
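Both styles can be combined in one pod-spec fragment; a sketch, reusing the illustrative label keys from the labeling discussion (the zone value is an assumption):

```yaml
affinity:
  nodeAffinity:
    # Hard constraint: must run on A100 nodes
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: gpu.nvidia.com/model
              operator: In
              values: ["A100"]
    # Soft preference: same zone as the data cache, with fallback allowed
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 50
        preference:
          matchExpressions:
            - key: topology.kubernetes.io/zone
              operator: In
              values: ["us-east-1a"]
```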

Pod anti-affinity is especially practical for inference reliability: you can spread replicas across nodes to avoid a single-node failure taking down the service. For example, anti-affinity against the same app label at the hostname topology key prevents co-locating replicas on one node. Pod affinity can be useful when you want co-location for performance (e.g., an inference service near a GPU-sidecar cache), but be careful: co-location requirements can create scheduling deadlocks when combined with GPU scarcity.
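A soft-spread anti-affinity fragment along these lines (the app label is illustrative) prefers one replica per node without blocking scheduling when the pool is small:

```yaml
affinity:
  podAntiAffinity:
    # Soft spread: prefer different nodes, but allow co-location
    # rather than leaving later replicas Pending
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: llm-inference            # illustrative app label
          topologyKey: kubernetes.io/hostname
```

Switching to requiredDuringSchedulingIgnoredDuringExecution here turns this into a hard spread, with the Pending-pod risk described below.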

  • Common mistake: using hard anti-affinity for many replicas in a small GPU pool, which makes the later replicas unschedulable.
  • Common mistake: mixing nodeSelector, required node affinity, and taints/tolerations in conflicting ways. If any hard constraint cannot be satisfied, the scheduler stops.
  • Outcome: AI pods land on the correct GPU class and spread appropriately for availability, while maintaining enough flexibility to avoid unnecessary scale-outs.

Engineering judgment: start with node affinity to ensure hardware correctness, then add pod anti-affinity only where availability requires it. Prefer “soft spread” unless you have enough capacity to guarantee hard spread.

Section 2.4: Taints/tolerations patterns for GPU-only nodes

Labels and affinity help your GPU workloads find the right nodes, but they do not stop non-GPU workloads from landing on GPU nodes. That’s where taints and tolerations come in (Milestone 3). A taint on a node repels pods unless the pod explicitly tolerates it. This is one of the strongest cost-control tools in Kubernetes: it prevents “accidental” scheduling of cheap CPU services onto expensive GPU machines.

A common pattern is tainting GPU node pools with something like gpu=true:NoSchedule. Then, only pods that truly need GPUs include a matching toleration. Combine this with GPU requests: toleration alone should not be enough to land on a GPU node; it should be paired with nvidia.com/gpu requests and node affinity/selector so you don’t create a general-purpose backdoor into the GPU pool.
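A sketch of the full pattern—taint, toleration, placement, and GPU request together (pod name, image, and the tier label value are illustrative):

```yaml
# Taint applied to the GPU node pool, e.g. via the node group template or:
#   kubectl taint nodes <node> gpu=true:NoSchedule
---
apiVersion: v1
kind: Pod
metadata:
  name: gpu-consumer                       # illustrative
spec:
  tolerations:
    - key: gpu
      operator: Equal
      value: "true"
      effect: NoSchedule
  nodeSelector:
    workload.gpu/tier: inference           # placement paired with the toleration
  containers:
    - name: serve
      image: registry.example.com/serve:latest  # placeholder
      resources:
        limits:
          nvidia.com/gpu: 1                # GPU request closes the backdoor
```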

  • NoSchedule: blocks new pods without toleration; existing pods keep running. Good default for GPU pools.
  • PreferNoSchedule: soft repulsion; use when you want GPU nodes as overflow capacity but still prefer to keep them clean.
  • NoExecute: evicts running pods that don’t tolerate; useful for maintenance or strict isolation, but be cautious with long training runs.

Common mistake: adding a toleration broadly via a shared Helm chart “just in case.” That defeats the purpose and can silently inflate spend. Another mistake is tainting GPU nodes but forgetting to add tolerations to system-level DaemonSets that must run everywhere (logging, monitoring). In practice, you either (1) ensure those DaemonSets tolerate the taint, or (2) keep them out of GPU pools intentionally and provide alternative telemetry, depending on your operational needs.

Practical outcome: GPU nodes become protected real estate. Only workloads that explicitly opt in—and meet the hardware constraints—consume them, which reduces accidental cost leakage and improves scheduling predictability.

Section 2.5: Resource quotas and limit ranges for multi-tenant AI

Once multiple teams share a GPU cluster, the biggest operational risk is not “no GPUs exist,” but “someone took them all.” Kubernetes multi-tenancy controls—especially ResourceQuota and LimitRange—are the core of Milestone 4. They create predictable boundaries per namespace so one team can’t starve others, intentionally or accidentally.

ResourceQuota can cap total GPU consumption per namespace by limiting requests.nvidia.com/gpu (and CPU/memory). This is a direct guardrail for both fairness and cost control. For example, a research namespace might be capped at 8 GPUs while production inference is capped differently. Quotas also make scheduling failures faster to diagnose: instead of “cluster is full,” you get a clear “exceeded quota” signal.
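The 8-GPU research cap above might look like this (namespace name and CPU/memory figures are illustrative):

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: research-gpu-quota
  namespace: research                  # illustrative namespace
spec:
  hard:
    requests.nvidia.com/gpu: "8"       # cap total GPUs requested in this namespace
    requests.cpu: "64"                 # pair GPU caps with CPU/memory caps
    requests.memory: 256Gi
```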

LimitRange sets defaults and min/max per pod/container. This matters because AI manifests are often copied between projects; a missing request can lead to BestEffort pods competing unpredictably for CPU/memory, even if GPUs are requested. With a LimitRange, you can enforce that every container has CPU/memory requests and keep runaway settings in check.
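A LimitRange sketch under those assumptions (all figures are placeholders to tune per namespace):

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: container-defaults
  namespace: research        # illustrative
spec:
  limits:
    - type: Container
      defaultRequest:        # applied when a manifest omits requests
        cpu: "1"
        memory: 2Gi
      default:               # applied when a manifest omits limits
        cpu: "4"
        memory: 8Gi
      max:                   # keeps runaway settings in check
        cpu: "16"
        memory: 64Gi
```

With defaults in place, copied manifests that forget CPU/memory requests no longer become BestEffort pods.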

  • Common mistake: setting quotas only on CPU/memory and forgetting GPU extended resources. The result is a “GPU free-for-all.”
  • Common mistake: using quotas without considering bursty workflows; too-tight caps can cause job queues to back up and encourage users to create multiple namespaces to bypass limits.
  • Outcome: predictable per-team GPU boundaries, fewer noisy-neighbor incidents, and clearer accountability for utilization and spend.

Engineering judgment: quotas should reflect business priority and the cluster’s scaling model. If you rely on node autoscaling, quotas still matter—they prevent one namespace from triggering massive scale-outs. Pair quotas with PriorityClasses (covered later in the course outcomes) when you need “production wins” behavior under contention.

Section 2.6: Debugging Pending pods: events, scheduler reasoning, and fixes

Milestone 5 is about speed: when a GPU pod is Pending, you should be able to identify the reason in minutes. Start with kubectl describe pod and read the Events section. The default scheduler explains what it tried and why it failed: insufficient GPUs, node taints not tolerated, node affinity mismatch, insufficient CPU/memory, or quota violations. Events are your primary “scheduler reasoning” output.

Typical failure patterns and fixes are repeatable:

  • “Insufficient nvidia.com/gpu”: all GPUs are allocated or the node pool is too small. Fix by freeing capacity, scaling the GPU node group, or relaxing hard constraints (e.g., allow another GPU model via preferred affinity).
  • “node(s) had taint … that the pod didn’t tolerate”: add the correct toleration (and confirm you also have GPU requests so the toleration doesn’t become a cost leak).
  • “didn’t match node affinity/selector”: validate node labels (kubectl get nodes --show-labels) and ensure your label keys/values are correct and consistently applied to the node group template.
  • “exceeded quota”: request an increase, move the workload to an appropriate namespace, or reduce parallelism.
  • “Insufficient cpu/memory/ephemeral-storage”: GPU nodes can still be CPU- or memory-bound. Adjust requests, pick a larger instance type, or separate CPU-heavy preprocessing from GPU steps.

If events are unclear, look at cluster-wide signals: kubectl get events -A for broader context, and review scheduler logs (managed Kubernetes often exposes them via control plane logging). Also check node status: a GPU node might be NotReady, cordoned, or missing the device plugin; in that case, you will not see allocatable GPUs even if the hardware exists.

A disciplined debugging habit prevents expensive downtime. Don’t “randomly tweak” affinity and tolerations until it schedules; instead, use the event message as the hypothesis, verify the underlying state (labels, taints, allocatable resources, quotas), apply the smallest change, and re-check events. Practical outcome: faster recovery, fewer accidental policy bypasses, and a scheduler configuration you can explain and defend during audits.

Chapter milestones
  • Milestone 1: Schedule the first GPU pod with correct resource requests
  • Milestone 2: Enforce placement using labels, affinity, and selectors
  • Milestone 3: Protect GPU nodes with taints and tolerations
  • Milestone 4: Prevent noisy neighbors with quotas and limit ranges
  • Milestone 5: Resolve scheduling failures using events and logs
Chapter quiz

1. Which workflow best reflects the chapter’s recommended approach to reliably schedule AI GPU workloads in Kubernetes?

Correct answer: Validate GPU discovery via the NVIDIA device plugin, request GPUs correctly, enforce placement and protection with labels/affinity and taints/tolerations, apply multi-tenant controls, then debug with events and scheduler hints
The chapter describes a repeatable sequence: discovery, correct requests, placement/protection, multi-tenant controls, then troubleshooting via events/logs.

2. Why are correct GPU resource requests critical for the Kubernetes scheduler in AI clusters?

Correct answer: They enable the scheduler to match scarce accelerators to pods and avoid wasting expensive GPU time
GPU requests are how you communicate required accelerator resources; mis-specified requests waste cost and reduce reliability.

3. In the chapter’s framing, what is the primary purpose of using labels, selectors, and affinity for GPU workloads?

Correct answer: To enforce placement rules so workloads land on appropriate nodes (e.g., matching hardware or separating workload types)
Placement tools (labels/selectors/affinity) are used to control where pods can or should run based on node characteristics and policy.

4. How do taints and tolerations help manage GPU nodes according to the chapter?

Correct answer: They protect GPU nodes by preventing unintended pods from scheduling there unless they explicitly tolerate the taint
Taints repel pods by default; tolerations allow only approved workloads onto protected nodes like GPU pools.

5. What is a key trade-off described in the chapter when choosing strict vs flexible placement rules for GPU workloads?

Correct answer: Overly strict rules can strand capacity and trigger unnecessary autoscaling, while overly permissive rules can place workloads on the wrong GPUs or cause unpredictable behavior
The chapter emphasizes controlled flexibility: hard constraints for must-haves, soft preferences for nice-to-haves, validated via metrics and events.

Chapter 3: Running AI Jobs and Services on GPUs

Once your cluster can advertise GPUs (via the NVIDIA device plugin) and you have a basic scheduling strategy, the next step is operational: running real AI workloads in a way that is fast, safe, and cost-aware. This chapter focuses on the day-to-day mechanics of packaging GPU inference services, running batch training jobs, optimizing startup (time-to-first-token), choosing storage paths that don’t starve the GPU, and rolling out changes without burning expensive capacity.

A practical mindset helps: GPUs are not “just another resource.” They are scarce, expensive, and often coupled to driver/library constraints. That means your Kubernetes objects need to encode intent clearly (resource requests, node selection, and lifecycle behavior), and your images and probes must be designed for CUDA realities. You will work through five milestones across the chapter: (1) package an inference service with GPU access, (2) run a batch training job with sensible retries, (3) optimize images and startup, (4) configure storage for throughput, and (5) apply rollout safety for GPU-backed deployments.

Along the way, keep an eye on the engineering trade-offs you’re making. For example, insisting on a single GPU model may improve determinism but increases scheduling delay; mounting a remote filesystem may simplify data access but can cut effective GPU utilization in half if throughput is poor. Kubernetes gives you the levers—your job is to apply them intentionally.

  • Goal: reliably schedule GPU workloads and keep them healthy.
  • Goal: reduce wasted GPU minutes by improving startup, data access, and rollouts.
  • Goal: avoid common production failures (CrashLoop due to missing libs, false readiness, slow cold starts, and I/O bottlenecks).

The rest of the chapter is organized by workload types, images/runtime, health checks, storage, rollout strategies, and performance hygiene (CPU/memory alongside GPUs). Treat each section as a checklist you can apply in your own manifests and pipelines.

Practice note for Milestone 1: Package an inference service with GPU access: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Milestone 2: Run a batch training job and manage retries: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Milestone 3: Optimize images and startup for faster GPU time-to-first-token: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Milestone 4: Configure storage and data paths for throughput: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Milestone 5: Apply rollout safety for GPU-backed deployments: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Section 3.1: Workload types: Deployment, Job, and CronJob for AI

AI on Kubernetes usually falls into two execution shapes: long-running inference services and finite batch jobs. Kubernetes gives you different controllers for each, and choosing the right one is the first cost-control decision because it determines restart behavior, rollout semantics, and how work is counted as “done.” For Milestone 1 (packaging an inference service), you typically use a Deployment because you want a stable endpoint, rolling updates, and replica management. A Deployment pairs well with a Service and an HPA when request volume is variable.

For Milestone 2 (batch training), use a Job when the unit of work is finite (train for N steps, produce artifacts, exit). Jobs track completions and support backoffLimit for retries. A common mistake is using a Deployment for training: the controller interprets “process exited” as failure and keeps restarting, wasting GPU time and potentially corrupting outputs. With Jobs, also decide whether a retry is safe: if your training writes checkpoints, retries can resume; if it writes in-place without transactional discipline, retries may produce inconsistent artifacts.
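A Job sketch that encodes these decisions (name, image, and the resume flag are illustrative—whether retries are safe depends on your checkpointing):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: finetune-v1                     # version the name per run
spec:
  backoffLimit: 2                       # retries only make sense if resume is safe
  template:
    spec:
      restartPolicy: Never              # let the Job controller own retries
      containers:
        - name: train
          image: registry.example.com/train:v1   # placeholder
          args: ["--resume-from", "/ckpt"]       # hypothetical resume flag
          resources:
            limits:
              nvidia.com/gpu: 1
```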

Use a CronJob for scheduled batch tasks like nightly evaluation, embedding refresh, or periodic fine-tune runs. CronJobs can create many Jobs over time, so cost and quota discipline matter. Configure concurrencyPolicy (e.g., Forbid) to prevent overlapping runs that double GPU spend, and set startingDeadlineSeconds so missed schedules don’t spawn surprise catch-up jobs during peak hours.
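A CronJob sketch with the two cost-discipline knobs from the paragraph above set explicitly (schedule, name, and image are illustrative):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-eval                    # illustrative
spec:
  schedule: "0 2 * * *"                 # 02:00 daily
  concurrencyPolicy: Forbid             # never double GPU spend with overlap
  startingDeadlineSeconds: 3600         # skip runs missed by more than an hour
  jobTemplate:
    spec:
      backoffLimit: 1
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: eval
              image: registry.example.com/eval:latest  # placeholder
              resources:
                limits:
                  nvidia.com/gpu: 1
```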

Across all three types, encode GPU intent explicitly with resource requests (e.g., nvidia.com/gpu: 1) and placement rules (node affinity or tolerations) so the scheduler doesn’t guess. Treat “unschedulable due to GPU” as a first-class signal: it is usually either a genuine capacity issue or a constraint mismatch (wrong GPU type label, missing toleration, or requesting 2 GPUs on nodes that only have 1).

Section 3.2: GPU-enabled container images and runtime considerations

Most GPU workload failures are image/runtime mismatches: the Pod lands on a GPU node, but the process cannot load CUDA libraries, sees no device, or crashes on import. For Milestone 1 and Milestone 2, aim for a repeatable image strategy. If you use NVIDIA CUDA base images, align the CUDA version with your framework build (PyTorch/TF) and with the driver compatibility guarantees of your environment. A frequent mistake is “it works on my laptop” with a different driver/CUDA combo than the cluster nodes.

In Kubernetes, your container generally doesn’t need to ship the driver, but it does need compatible user-space libraries. The NVIDIA device plugin and runtime integrate the device into the container, but your process still must be able to load the correct libcuda-compatible stack. Validate inside the container with lightweight checks (e.g., nvidia-smi if present, or framework-level device queries) during development—not as a production readiness probe that runs every few seconds.

Milestone 3 (optimize images and startup) is where engineering judgment pays off. Large images delay scheduling-to-ready time and waste GPU minutes while the node pulls layers. Use multi-stage builds, minimize OS packages, and cache Python wheels effectively. Pin dependencies to avoid surprise downloads at startup. If your model is large, decide whether to bake it into the image (fast startup, slower builds, larger pulls) or fetch at runtime (smaller image, slower cold starts, extra network). For GPU cost control, you usually want to avoid “GPU allocated while downloading 20 GB of model weights,” so prefer pre-staging weights on shared storage or using an initContainer that runs before the main container requests the GPU (for example by separating model download into a CPU-only Pod step or using a workflow engine).

Finally, treat GPU runtime knobs as configuration, not code: set environment variables for memory behavior or performance (where appropriate), and keep them consistent across environments. When performance differs across nodes, suspect hidden differences: GPU model, driver version, power settings, or CPU limits that starve the GPU feed pipeline.

Section 3.3: Health checks and readiness for GPU inference endpoints

Inference services are expensive to run and easy to mis-probe. A naive readiness probe that calls “/generate” can inadvertently allocate GPU memory, warm caches, or even trigger long computations. Worse, during rollouts it can amplify traffic and cause cascading failures. For Milestone 1, design health checks that reflect service correctness without burning GPU cycles. Typically you want three layers: (1) a fast liveness check to detect deadlocks, (2) a readiness check that confirms the model is loaded and the service can accept traffic, and (3) optional startupProbe to protect slow model initialization from premature restarts.

On GPU workloads, the startup phase can be long: pulling the image, loading weights into CPU memory, transferring to GPU, compiling kernels, or initializing TensorRT. Use startupProbe with a generous failure threshold so Kubernetes doesn’t kill the container during expected warmup. Then let readiness become strict: only report ready after the model is loaded and you have verified a minimal forward pass (ideally CPU-only or a tiny GPU check that doesn’t allocate large buffers). A common mistake is reporting readiness as soon as the HTTP server binds a port; that causes traffic to arrive while the model is still loading, creating timeouts and retries that overload the node.
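The three layers can be sketched as a container fragment (port, paths, and thresholds are assumptions to tune against your measured warmup time):

```yaml
startupProbe:                # protects slow model load from premature restarts
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
  failureThreshold: 60       # tolerates up to ~10 minutes of warmup
readinessProbe:              # strict: ready only after the model is loaded
  httpGet:
    path: /readyz
    port: 8080
  periodSeconds: 5
  timeoutSeconds: 2
livenessProbe:               # fast, cheap deadlock detection
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 15
```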

Implement back-pressure explicitly. If the model server has a queue, expose metrics and consider returning 429/503 when saturated rather than letting requests pile up. Your readiness endpoint can incorporate “can accept new requests” as a signal. This becomes crucial during rollouts (Milestone 5): if the new ReplicaSet is technically running but has not finished warming, you want it excluded from load balancing.

Finally, separate observability probes from user traffic. Use distinct paths like /healthz and /readyz, keep timeouts short, and avoid dependencies on external services that can flap. If you need deeper checks, run them on a slower cadence via background threads and have probes read cached status.

Section 3.4: Storage choices: PVCs, ephemeral volumes, and dataset staging

GPU utilization is often limited by data, not compute. Milestone 4 is about feeding the accelerator consistently by choosing the right storage pattern for each stage: model weights, datasets, checkpoints, and logs. Start by classifying data as read-mostly (weights, static datasets), write-heavy (checkpoints), or scratch (temporary shards, preprocessed batches). Each class maps naturally to different Kubernetes volumes.

Use PVCs when you need durability across Pod restarts or rescheduling, such as training checkpoints or shared model artifacts. The practical trade-off is latency and throughput: network-attached volumes vary widely. If training throughput is poor, measure I/O (read bandwidth, IOPS, and latency) and verify you are not bottlenecked on a single shared volume. A common mistake is putting both dataset reads and checkpoint writes onto the same slow PVC, causing periodic stalls that look like “GPU underutilization.”

Use ephemeral volumes (like emptyDir) for scratch space and caching. For example, you can stage frequently accessed dataset shards onto ephemeral disk at startup. The risk is that rescheduling loses the cache, so use this when you can tolerate cache rebuilds or when you have a warm pool of nodes. If nodes have fast local NVMe, ephemeral caching can dramatically reduce step time compared to reading from remote object storage.

Dataset staging is a deliberate pattern: download or copy data close to the compute before the GPU is engaged. Practically, this can be done via an initContainer (CPU-only) that pulls data to emptyDir, or via a separate pre-staging Job that populates a PVC. The key engineering judgment is cost: you want the “waiting on network” phase to happen without reserving the GPU whenever possible. When your workload requires GPU to preprocess (e.g., tokenization on GPU), ensure the staging step is still optimized and bounded.
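The initContainer variant might be sketched as below (images, paths, and the copy command are illustrative). Note that the node's GPU is still held for the pod while init containers run, so for strict cost control the fully separate pre-staging Job is the stronger option:

```yaml
spec:
  volumes:
    - name: scratch
      emptyDir: {}                      # node-local scratch; lost on reschedule
  initContainers:
    - name: stage-data                  # CPU-only step; bounds the download
      image: registry.example.com/tools:latest    # placeholder
      command: ["sh", "-c", "cp -r /remote/dataset /scratch/"]  # illustrative; /remote would be a mounted source
      volumeMounts:
        - name: scratch
          mountPath: /scratch
  containers:
    - name: train
      image: registry.example.com/train:latest    # placeholder
      volumeMounts:
        - name: scratch
          mountPath: /data              # training reads the staged copy
      resources:
        limits:
          nvidia.com/gpu: 1
```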

For inference, weights are usually the biggest concern. If you mount weights from a shared store, validate concurrency behavior: dozens of replicas starting at once can stampede the storage backend. Rate-limit rollouts, or bake hot weights into the image for high-availability paths.

Section 3.5: Update strategies: rolling updates, canaries, and rollback signals

Milestone 5 is where GPU cost control and reliability collide. Updating an inference Deployment can temporarily double GPU usage (old and new replicas overlap) and can destabilize latency if new pods are cold. A default rolling update is rarely ideal without tuning. Set maxSurge and maxUnavailable intentionally: if GPUs are scarce, keep surge low to avoid pending pods and wasted scheduling churn; if availability is critical, allow limited surge but pair it with strict readiness so new pods only receive traffic when genuinely warm.
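One way to encode the low-surge posture described above (name, image, and replica count are illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-inference                   # illustrative
spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1                       # at most one extra GPU pod during rollout
      maxUnavailable: 0                 # never drop below desired capacity
  selector:
    matchLabels:
      app: llm-inference
  template:
    metadata:
      labels:
        app: llm-inference
    spec:
      containers:
        - name: serve
          image: registry.example.com/serve:v2   # placeholder
          resources:
            limits:
              nvidia.com/gpu: 1
```

With scarce GPUs you might instead set maxSurge: 0 and maxUnavailable: 1 to avoid any overlap, trading a brief capacity dip for zero extra GPU spend.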

Canary releases reduce blast radius. In Kubernetes you can approximate a canary by running a second Deployment with a small replica count and splitting traffic (via service mesh, ingress weighting, or separate services). The practical outcome is faster detection of model regressions (accuracy, latency, memory leaks) before you scale the new version. Common mistakes include canaries that share the same HPA signals and inadvertently scale up due to test traffic, or canaries that miss real traffic patterns and fail to expose tail-latency problems.

Rollback signals must be measurable. Use more than “pods are running.” Track request error rate, P95/P99 latency, GPU memory usage, and restart counts. If a model version slowly leaks GPU memory, it may pass readiness but fail after hours. Integrate metrics-based alerts that trigger a rollback workflow (manual or automated) and freeze further rollouts. Also consider PodDisruptionBudgets so node maintenance doesn’t evict too many GPU pods at once, which can cause a thundering herd of cold starts.
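A PodDisruptionBudget sketch for the eviction concern above (name, selector, and threshold are illustrative):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: llm-inference-pdb               # illustrative
spec:
  minAvailable: 3                       # with 4 replicas, maintenance evicts at most one at a time
  selector:
    matchLabels:
      app: llm-inference
```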

For batch training Jobs, “rollout” looks different: you version the image and configuration, and you control retries. If you change training code, do not rely on implicit restarts; create a new Job name/version and keep previous outputs immutable. This makes failures diagnosable and prevents accidental overwrites of checkpoints.

Section 3.6: Performance hygiene: CPU/memory sizing alongside GPUs

Requesting a GPU is necessary but not sufficient. Many “slow GPU” incidents are actually CPU starvation, memory pressure, or poor threading defaults. Treat CPU and memory as first-class alongside nvidia.com/gpu. For inference, too little CPU can bottleneck tokenization, request parsing, or streaming responses, leaving the GPU idle between kernels. For training, CPU limits can throttle dataloaders so the GPU waits on batches.

Start with explicit resource requests/limits for CPU and memory that match your concurrency model. If you use multiple workers for data loading, align num_workers with available CPU cores and avoid setting a CPU limit so low that Linux throttling undermines throughput. Memory sizing matters because model loading often spikes RSS before settling; if you set memory limits too tightly, you’ll see OOMKills during warmup, which is especially wasteful when the pod already reserved a GPU.

Use practical telemetry: GPU utilization, SM occupancy, GPU memory, CPU usage, and disk/network throughput. If GPU utilization is low but CPU is pegged, increase CPU requests or optimize preprocessing. If GPU memory is near the limit and performance degrades, reduce batch size or enable more memory-efficient kernels. If pods take a long time to become ready, revisit Milestone 3: image size, dependency downloads, and weight staging.

Finally, remember scheduling side-effects: if you request “1 GPU + lots of CPU,” you may reduce bin-packing and strand GPUs on nodes where CPU is exhausted. Conversely, requesting too little CPU can pack many pods onto a node and create contention. The practical outcome is a sizing loop: measure, adjust requests, and keep profiles per workload (small inference, large inference, training) so your cluster autoscaling and quotas remain predictable.

Chapter milestones
  • Milestone 1: Package an inference service with GPU access
  • Milestone 2: Run a batch training job and manage retries
  • Milestone 3: Optimize images and startup for faster GPU time-to-first-token
  • Milestone 4: Configure storage and data paths for throughput
  • Milestone 5: Apply rollout safety for GPU-backed deployments
Chapter quiz

1. Why does Chapter 3 emphasize that GPUs are not “just another resource” when defining Kubernetes objects for AI workloads?

Correct answer: Because GPUs are scarce/expensive and tied to driver/library constraints, so manifests must clearly encode intent (requests, node selection, lifecycle behavior)
The chapter highlights GPU scarcity/cost and CUDA/driver constraints, which require explicit resource requests, placement, and lifecycle choices.

2. Which set of actions best matches the chapter’s goals for reducing wasted GPU minutes?

Show answer
Correct answer: Improve startup (time-to-first-token), optimize data access/storage paths, and use rollout safety to avoid burning capacity
The chapter targets reducing waste through faster startup, better I/O throughput, and safer rollouts to avoid consuming expensive GPU time unnecessarily.

3. What trade-off does the chapter describe when insisting on a single GPU model for a workload?

Show answer
Correct answer: It may improve determinism but increase scheduling delay due to fewer eligible nodes
Restricting to one GPU model can make results more consistent but reduces scheduling flexibility and can delay placement.

4. According to the chapter, why can mounting a remote filesystem be risky for GPU utilization?

Show answer
Correct answer: Poor throughput can starve the GPU and significantly reduce effective utilization
The chapter notes that slow or constrained I/O can cut effective GPU utilization dramatically by keeping the GPU waiting on data.

5. Which failure mode is Chapter 3 explicitly trying to help you avoid through image/runtime and probe design for GPU services?

Show answer
Correct answer: CrashLoop due to missing libraries and false readiness causing unhealthy rollouts
The chapter calls out common production failures like CrashLoops from missing libs and readiness probes that incorrectly mark GPU services as ready.

Chapter 4: Autoscaling Patterns for GPU Clusters

Autoscaling is where GPU clusters either become a cost-efficient platform or an expensive science project. Unlike general web workloads, AI inference and training behave in bursts, have hard resource constraints (GPU memory, model size), and frequently depend on external queues and batch schedulers. That means “scale on CPU” is usually wrong, “scale on request rate” is sometimes wrong, and “scale on GPU utilization” can be dangerously misleading if you don’t understand saturation vs throughput.

This chapter builds a practical mental model for scaling decisions, then walks through a lab-style progression: scale an inference service with HPA using custom metrics (Milestone 1), use VPA safely without thrash (Milestone 2), trigger node scale-out when GPU pods are pending (Milestone 3), reduce idle spend with scale-down and disruption controls (Milestone 4), and finally validate everything with load tests and dashboards (Milestone 5).

The goal is not just to “turn on autoscaling,” but to make it predictable: pods scale when demand increases, nodes scale when pods can’t schedule, and the whole system scales down without dropping in-flight requests or killing expensive model warmups. Along the way, you’ll learn where autoscalers fight each other, how to define guardrails, and which signals actually map to cost and user experience.

Practice note (applies to Milestones 1-5): for each milestone, document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Sections in this chapter
Section 4.1: Autoscaling mental model: pods vs nodes vs queues

Before configuring any autoscaler, separate three independent control loops: pod scaling, node scaling, and work queue dynamics. Pod scaling (HPA/VPA) adjusts the number of replicas or their resource requests. Node scaling (Cluster Autoscaler or similar) adjusts the number of nodes in a GPU node group. Queue dynamics (request queue, Kafka topic lag, SQS depth, Ray/Serve backlog) describe how work arrives and how quickly it can be processed.

A reliable mental model is: pods scale for throughput, nodes scale for capacity constraints, and queues buffer for decoupling. If you only scale pods, but the cluster has no free GPUs, you will get pending pods and no additional throughput. If you only scale nodes, but your service is single-replica or limited by concurrency, you will pay for idle GPUs. If you only watch GPU utilization, you may scale too late because a saturated queue can exist while utilization looks “moderate” due to batching, throttling, or backpressure.

  • Pod autoscaling questions: What metric best represents user demand? What is the stable “unit of work” per replica (QPS, tokens/sec, frames/sec)? How long does a new pod take to become useful (image pull + model load)?
  • Node autoscaling questions: What prevents scheduling (GPU requests, node affinity, taints/tolerations, extended resources)? How long does a GPU node take to boot and join? Are nodes preemptible/spot?
  • Queue questions: Is there buffering? What is acceptable queueing delay? Are you batching requests and how does that affect the metric?

Common mistake: treating autoscaling as a single knob. In practice, you want HPA to add replicas when demand rises, then Cluster Autoscaler to add GPU nodes only when those replicas cannot schedule. This chapter’s milestones map to that layered approach: first scale an inference service at the pod level, then ensure nodes expand only when needed, then ensure the system contracts safely.

Section 4.2: HPA for AI: CPU pitfalls and GPU/custom metric approaches

Horizontal Pod Autoscaler (HPA) is often taught with CPU utilization. For AI inference, CPU is frequently a poor proxy for demand: the hot path is GPU-bound, CPU stays low due to async request handling, and high CPU can appear during model load rather than steady-state serving. Scaling on CPU can therefore oscillate or scale too late, creating long queues and timeouts.

Milestone 1 focuses on scaling an inference service with HPA using custom metrics. The most defensible signals are usually “work backlog” and “service latency,” not raw device utilization. Practical options include: request queue depth (from your gateway or message queue), in-flight requests per pod, tokens/sec per replica, or p95 latency. GPU utilization can help, but only if you also understand batching and concurrency: high utilization with stable latency may be fine; moderate utilization with rising latency can indicate memory pressure, kernel launch overhead, or contention.

Implementation pattern: expose an application metric (Prometheus format) and use the Prometheus Adapter to map it into the Kubernetes custom metrics API, then configure HPA with a target value. For example, scale replicas to keep “inflight_requests” near a per-replica target, or scale to keep “queue_depth” under a threshold. Keep the math simple; unstable formulas make unstable scaling.
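
A minimal sketch of that pattern, assuming the Prometheus Adapter already maps an application-exposed inflight_requests metric into the custom metrics API; the object names and the per-replica target of 8 are illustrative placeholders:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-inference-hpa        # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-inference          # hypothetical Deployment
  minReplicas: 2                 # avoid cold starts at moderate traffic
  maxReplicas: 10                # cap spend
  metrics:
  - type: Pods
    pods:
      metric:
        name: inflight_requests  # assumed to be exposed via the adapter
      target:
        type: AverageValue
        averageValue: "8"        # target in-flight requests per replica
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0     # react quickly to demand
      policies:
      - type: Pods
        value: 4
        periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300   # conservative, avoids flapping
      policies:
      - type: Percent
        value: 25
        periodSeconds: 60
```

The asymmetric behavior section encodes the guardrail below: fast scale-up, slow scale-down.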

  • Guardrails: set minReplicas high enough to avoid cold starts during moderate traffic; use maxReplicas to cap spend.
  • Stabilization: tune HPA behavior with scale-up policies that react quickly and scale-down policies that are conservative to avoid flapping during bursty load.
  • Readiness matters: ensure pods only become Ready after model load completes; otherwise HPA adds replicas that accept traffic before they can serve it.

Common mistake: using GPU utilization alone as the target. A single replica can show 90% GPU utilization while still delivering poor p95 latency if requests are queued. Prefer signals that directly represent SLO impact (latency/backlog), then use GPU utilization as a diagnostic metric on dashboards.

Section 4.3: VPA modes (Off/Initial/Auto) and right-sizing strategy

Vertical Pod Autoscaler (VPA) is powerful for right-sizing, but it can be hazardous for AI workloads if you treat it like a general-purpose optimizer. For GPU-bound services, CPU/memory requests still matter for scheduling and QoS, but changing them frequently can trigger evictions and restarts—expensive when each restart re-downloads weights or warms caches. Milestone 2 is about using VPA safely and avoiding thrash.

VPA has three practical modes: Off, Initial, and Auto. In Off, VPA only produces recommendations; this is ideal for learning typical CPU/memory footprints without changing live pods. In Initial, VPA applies recommendations only at pod creation time; this avoids mid-flight evictions and is usually the safest starting point for inference deployments. In Auto, VPA can evict pods to apply new requests; this can be acceptable for stateless services with fast startup, but risky for large-model inference or training sidecars.

A practical right-sizing workflow is: run VPA in Off for several days, review recommendations, then codify them as explicit requests/limits in your Deployment or Helm values. If you enable Initial mode, keep a tight range via VPA policies and set a reasonable updateMode to prevent frequent shifts. For AI, prioritize stability over perfect packing; an extra 200m CPU request is cheaper than repeated model reloads.
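
That workflow might look like the following VPA object, shown here in Initial mode with bounded recommendations; the container name and allowed ranges are placeholders you would derive from the Off-mode observation period:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: llm-inference-vpa        # hypothetical name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-inference          # hypothetical Deployment
  updatePolicy:
    updateMode: "Initial"        # apply only at pod creation; use "Off"
                                 # first to collect recommendations
  resourcePolicy:
    containerPolicies:
    - containerName: server
      controlledResources: ["cpu", "memory"]  # never GPUs (see below)
      minAllowed:                # tight range prevents surprise shifts
        cpu: "2"
        memory: 8Gi
      maxAllowed:
        cpu: "8"
        memory: 32Gi
```

The min/max bounds act as the "tight range" guardrail: recommendations outside them are clamped rather than applied.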

  • Do not use VPA for GPUs: GPUs are requested as extended resources (e.g., nvidia.com/gpu) and are not managed by VPA. Treat GPU count as an explicit sizing decision.
  • Avoid HPA+VPA conflicts: HPA changes replica counts; VPA changes requests. If both react to the same signal (e.g., CPU), you can get feedback loops. Prefer HPA on custom backlog/latency metrics and VPA limited to CPU/memory recommendations.
  • Use resource limits carefully: memory limits that are too low will OOMKill the process mid-batch; for inference, it’s often safer to set memory requests close to expected usage and set limits with headroom or omit limits where policy allows.

Common mistake: turning on VPA Auto for a model server and then wondering why latency spikes every hour. The cause is often eviction-driven restarts. For AI services, VPA is best as a measurement tool first, an initializer second, and an automatic evictor only when you have strong disruption tolerance.

Section 4.4: Cluster autoscaler basics for GPU node groups

Once HPA is adding replicas, the next question is whether the cluster has enough GPUs to schedule them. Milestone 3 focuses on triggering node scale-out with pending GPU pods. This is exactly what Cluster Autoscaler (CA) is designed to do: watch unschedulable pods and add nodes in the appropriate node group so that the scheduler can place them.

For GPU clusters, node groups are typically separated by GPU type (A10, L4, A100), pricing model (on-demand vs spot), and sometimes by tenancy. CA needs clear signals to pick the right group: node labels (e.g., gpu.nvidia.com/class=a10), taints (e.g., nvidia.com/gpu=true:NoSchedule), and matching tolerations/affinity in the pod spec. If your GPU workloads don’t tolerate the GPU taint, CA can scale nodes all day and your pods will still be unschedulable. If your pod requires a label that no node group can satisfy, CA will not help.

A practical pattern is: define a GPU node group with a taint that blocks non-GPU workloads, then ensure AI pods include tolerations and node affinity. Request GPUs explicitly via resources.requests["nvidia.com/gpu"]. When replicas increase and GPUs run out, pods become Pending with an unschedulable reason; CA detects this and scales the GPU node group.
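
A sketch of that placement contract, assuming a GPU node group tainted with nvidia.com/gpu=true:NoSchedule and labeled with a hypothetical gpu.nvidia.com/class label as in the example above:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-inference            # hypothetical name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: llm-inference
  template:
    metadata:
      labels:
        app: llm-inference
    spec:
      tolerations:
      - key: nvidia.com/gpu      # must match the GPU node group taint,
        operator: Exists         # or CA can scale nodes the pod still
        effect: NoSchedule       # cannot land on
      nodeSelector:
        gpu.nvidia.com/class: a10  # hypothetical label; pins GPU model
      containers:
      - name: server
        image: registry.example.com/llm-inference:latest  # placeholder
        resources:
          requests:
            nvidia.com/gpu: 1    # explicit extended-resource request
          limits:
            nvidia.com/gpu: 1
```

When replicas exceed available GPUs, these pods go Pending with an "Insufficient nvidia.com/gpu" reason, which is the signal Cluster Autoscaler reacts to.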

  • Speed matters: GPU nodes can take minutes to provision. Counter this with a small warm pool (min size > 0) or faster images and model loading.
  • Bin packing: set CPU/memory requests realistically so pods pack efficiently around GPUs; over-requesting CPU can strand GPU capacity.
  • Cost controls: cap node group max size, use separate spot node groups with appropriate tolerations, and consider priority classes so critical inference pods land on on-demand first.

Common mistake: confusing “Pending” due to image pull/backoff with “Pending” due to unschedulable. CA only reacts to unschedulable pods. Your troubleshooting should start with kubectl describe pod to confirm the scheduler’s reason includes insufficient GPU or unmatched affinity/taints.

Section 4.5: Disruption controls: PDBs, graceful termination, and drain behavior

Scaling up is only half the story; cost control depends on scaling down without breaking workloads. Milestone 4 is about reducing idle GPU time with scale-down and disruption controls. The challenge: GPU nodes are expensive, but AI pods are also “sticky” due to model warmup, long-running requests, and checkpointing.

Cluster Autoscaler scale-down works by identifying nodes that can be removed and evicting pods so they can reschedule elsewhere. If your pods cannot move (due to strict node affinity, missing tolerations on other nodes, or oversized requests), nodes will never be considered removable. If your pods can move but take too long to terminate, scale-down may be delayed or may cause dropped requests if termination isn’t graceful.

Start with PodDisruptionBudgets (PDBs) to prevent too many replicas being disrupted at once. For an inference Deployment, a PDB like “minAvailable: 90%” can ensure capacity remains during drains. Next, implement graceful termination: set terminationGracePeriodSeconds long enough to finish in-flight requests, and add a preStop hook that stops accepting new traffic (e.g., mark unready, drain connections) before the process exits. Ensure your Service/readiness probes remove the pod from load balancing quickly.
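
The two pieces together might look like this; the PDB is a complete object, while the second document is a pod-template fragment (not standalone), and the preStop command is a hypothetical drain hook your server would need to honor:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: llm-inference-pdb        # hypothetical name
spec:
  minAvailable: 90%              # keep most replicas serving during drains
  selector:
    matchLabels:
      app: llm-inference
---
# Pod-template fragment: graceful-termination fields only
spec:
  terminationGracePeriodSeconds: 120   # long enough for in-flight requests
  containers:
  - name: server
    lifecycle:
      preStop:
        exec:
          # Hypothetical: fail readiness first so the Service stops
          # routing, then allow connections to drain before SIGTERM work
          command: ["sh", "-c", "touch /tmp/draining && sleep 20"]
```

The readiness probe would be configured to fail once /tmp/draining exists, removing the pod from load balancing before the process exits.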

  • Drain behavior: understand that node drain triggers pod eviction; if you rely on local NVMe caches for model weights, eviction may force re-downloads elsewhere—plan for that cost.
  • Don’t block forever: overly strict PDBs or long termination grace periods can prevent scale-down entirely. Choose values that match your SLO and traffic patterns.
  • Training jobs differ: for training, consider checkpoint frequency and job interruption handling; use priority and disruption policies so training is the first to be preempted when reclaiming GPUs.

Common mistake: enabling aggressive scale-down while ignoring readiness/termination. The symptom is “autoscaling worked” but users see 5xx spikes during node removals. Treat disruption controls as part of autoscaling, not an optional add-on.

Section 4.6: Testing autoscaling: synthetic load, SLO signals, and validation

Milestone 5 ties everything together: validate autoscaling with load tests and dashboards. Autoscaling configurations are hypotheses; validation proves whether the cluster meets SLOs at acceptable cost. The key is to test the full chain: metric emission → metric adapter → HPA decisions → pod scheduling → node provisioning → readiness → traffic distribution → scale-down.

Use synthetic load that resembles real inference: include realistic request sizes, concurrency, and burstiness. If your service batches requests, test both steady load and spiky load to see how queues form. During the test, watch SLO signals such as p50/p95 latency, error rate, and queue depth. These are the metrics your users feel. Also watch platform metrics: replica count, Pending pods, node group size, GPU utilization, and time-to-ready for new replicas.

Validation is not only “it scales up.” You also need to confirm: (1) scale-up occurs early enough to prevent SLO violations, (2) node scale-out triggers only when genuinely needed (unschedulable pods), (3) scale-down happens after load drops without causing errors, and (4) the steady-state footprint matches budget expectations.

  • Dashboards: create a single pane that overlays demand (QPS/queue depth), SLO (p95), and capacity (replicas/nodes/GPUs). Correlation is how you spot wrong scaling signals.
  • Event-based debugging: review HPA events and decisions, plus scheduler events for Pending pods. This often reveals misconfigured metric targets or missing tolerations.
  • Cost validation: compare GPU node-hours before/after changes. Autoscaling that improves latency but doubles spend may be unacceptable without tighter caps or better batching.

Common mistake: testing only scale-up and declaring success. In GPU clusters, the biggest savings often come from reliable scale-down and avoiding idle nodes. Your final acceptance criteria should include both performance under peak and cost behavior after the peak ends.

Chapter milestones
  • Milestone 1: Scale an inference service with HPA using custom metrics
  • Milestone 2: Use VPA safely for AI workloads and avoid thrash
  • Milestone 3: Trigger node scale-out with pending GPU pods
  • Milestone 4: Reduce idle time with scale-down and disruption controls
  • Milestone 5: Validate autoscaling with load tests and dashboards
Chapter quiz

1. Why is “scale on CPU” usually a poor autoscaling signal for GPU-based AI workloads?

Show answer
Correct answer: CPU usage often doesn’t reflect GPU memory constraints, model size limits, or bursty AI demand patterns
AI workloads are constrained by GPU resources and burst behavior; CPU can stay low while the system is saturated elsewhere.

2. What is the key risk of scaling directly on GPU utilization without understanding saturation vs throughput?

Show answer
Correct answer: It can trigger unsafe scaling decisions because utilization may look high even when throughput isn’t improving
GPU utilization can be misleading; high utilization doesn’t necessarily map to better throughput, leading to incorrect scaling.

3. What does the chapter describe as the predictable autoscaling sequence to aim for?

Show answer
Correct answer: Pods scale when demand increases, nodes scale when pods can’t schedule, and the system scales down safely without disrupting work
The goal is coordinated, predictable behavior: demand drives pod scaling, pending pods drive node scale-out, and scale-down preserves in-flight work and warmups.

4. Which milestone focuses on scaling an inference service using HPA with a metric beyond default CPU-based signals?

Show answer
Correct answer: Milestone 1: Scale an inference service with HPA using custom metrics
Milestone 1 explicitly targets HPA using custom metrics, which is emphasized as more appropriate than CPU for many AI workloads.

5. What is the main purpose of validating autoscaling with load tests and dashboards (Milestone 5) in this chapter’s framing?

Show answer
Correct answer: To confirm scaling behavior is predictable and aligns with cost and user experience under realistic demand
Validation is about observing real behavior under load so scaling decisions map to cost efficiency and user-facing performance.

Chapter 5: Observability and Troubleshooting for GPU Workloads

GPU workloads fail differently than CPU-only services: they can be “healthy” at the container level while silently underperforming due to low GPU occupancy, memory fragmentation, PCIe bottlenecks, or thermal throttling. That is why observability for AI on Kubernetes must be built around GPU-specific signals and a workflow that narrows from user-visible symptoms down to node and device realities.

This chapter turns troubleshooting into an engineering practice. You will build a GPU-focused checklist (Milestone 1), then use it to trace a latency issue from service to node to GPU (Milestone 2). You will learn to diagnose OOM, throttling, and memory fragmentation (Milestone 3), investigate scheduling hot spots and bin-packing gaps (Milestone 4), and finally produce an incident report that leads to real remediation (Milestone 5).

The goal is not to “collect all metrics.” The goal is to answer a small set of repeatable questions: Is the service meeting its latency and throughput targets? If not, is it compute-bound, memory-bound, I/O-bound, or scheduler-bound? Are we wasting GPUs due to fragmentation or policy? And what guardrails prevent recurrence while controlling cost?

Practice note (applies to Milestones 1-5): for each milestone, document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Sections in this chapter
Section 5.1: Key signals: GPU utilization, memory, power, and thermals

Start with the signals that map directly to how GPUs deliver performance. For inference and training, the most actionable quartet is: utilization (SM occupancy), memory (allocated vs used), power draw, and thermals. A common mistake is to treat “GPU utilization” as a single truth. In practice you need at least two utilization views: overall device busy time and kernel-level or SM-level occupancy. A GPU can show high utilization while doing inefficient small kernels, or low utilization while waiting on CPU preprocessing or network.

GPU memory signals are equally nuanced. Differentiate (1) memory allocated by the framework (reserved pools), (2) memory actually used by tensors, and (3) free memory that is unusable due to fragmentation. This is where Milestone 3 begins: if a pod OOMs while “free” memory appears available, you may be facing allocator fragmentation or oversize batch spikes. Also watch memory bandwidth utilization and PCIe throughput when available; low SM usage plus high PCIe suggests the GPU is starved.

Power and thermals translate directly into throttling. A service may pass readiness probes yet regress in p95 latency because the GPU is power-capped by node settings or is thermal-throttling in a dense rack. In troubleshooting, look for a pattern: utilization steady but clocks reduced; power pinned at cap; temperature near threshold. These are not Kubernetes problems, but Kubernetes is where the symptom surfaces.

  • Service layer: RPS, p50/p95/p99 latency, error rate, queue depth
  • Pod layer: CPU throttling, container memory RSS, restarts, request/limit alignment
  • Node layer: CPU steal, disk and network saturation, NUMA locality issues
  • GPU layer: SM utilization, memory used/allocated, power, thermals, ECC errors

Milestone 1 is to turn these into a checklist you can run in minutes: “Is latency up? Is queueing up? Are GPUs busy? If not, what is blocking?” That checklist is your first defense against spending hours in the wrong subsystem.

Section 5.2: Metrics pipelines: Prometheus, DCGM exporter, and alerts

You cannot troubleshoot what you cannot measure consistently. For GPU workloads on Kubernetes, a practical metrics pipeline is Prometheus for storage and query, DCGM exporter for GPU device metrics, and kube-state-metrics/cAdvisor for Kubernetes and container metrics. The engineering judgment is choosing a small set of high-signal metrics, sane scrape intervals, and alert thresholds that reflect SLOs rather than noise.

DCGM exporter exposes metrics such as GPU utilization, memory usage, power draw, temperature, and sometimes per-process stats depending on configuration. Pair that with node exporter (node CPU, disk, network) and kubelet/cAdvisor (container CPU throttling, memory working set). When teams skip the exporter and rely only on application logs, they end up guessing whether the GPU is saturated or idle.

Alerting should follow user impact and cost impact. User impact: high p95 latency, rising error rate, backlog growth. Cost impact: GPUs allocated but underutilized for sustained windows, or nodes sitting idle due to scheduling constraints. A practical alert is “GPU allocated (pods requesting nvidia.com/gpu) but SM utilization < 20% for 30 minutes,” which flags waste and also hints at pipeline bottlenecks. Another is “GPU temperature near throttle threshold for 10 minutes,” which predicts tail latency regressions.
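
Those two alerts can be sketched as a PrometheusRule, assuming the Prometheus Operator and dcgm-exporter are installed; DCGM_FI_DEV_GPU_UTIL and DCGM_FI_DEV_GPU_TEMP are standard dcgm-exporter metric names, but the thresholds here are illustrative and must be calibrated to your baseline and GPU model:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-cost-and-thermal-alerts   # hypothetical name
spec:
  groups:
  - name: gpu-signals
    rules:
    - alert: GPUAllocatedButIdle
      # Flags waste: device busy time low for a sustained window
      expr: avg by (Hostname, gpu) (DCGM_FI_DEV_GPU_UTIL) < 20
      for: 30m
      labels:
        severity: warning
      annotations:
        summary: "GPU utilization < 20% for 30m on an allocated device"
    - alert: GPUNearThermalThrottle
      # Predicts tail-latency regressions from thermal throttling;
      # 83C is a placeholder, throttle points vary by GPU model
      expr: DCGM_FI_DEV_GPU_TEMP > 83
      for: 10m
      labels:
        severity: warning
```

Scoping the idle alert to nodes where pods actually request nvidia.com/gpu requires joining with kube-state-metrics, which is left out of this sketch.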

Common mistakes include overly aggressive scrape intervals that overload the control plane, and alerts on raw utilization without context. Calibrate with baselines: measure normal utilization for your model and batch size. Then tie alerts to symptoms (latency/queueing) and to actions (scale up, change batching, relocate pods).

Milestone 2 uses this pipeline to trace latency: start at service latency dashboards, correlate to queue depth, then to GPU busy time and power/thermals on the specific nodes running the pods. The win is speed: you move from “users report slowness” to “GPU is idle because CPU preprocessing is throttled” in a single dashboard pass.

Section 5.3: Logs and events: kubelet, scheduler, and runtime debugging

Metrics tell you “what” is happening; logs and events tell you “why.” GPU troubleshooting in Kubernetes often requires reading three layers: Kubernetes events (scheduling, image pulls, pod lifecycle), kubelet logs (device plugin interactions, cgroup limits, OOM kills), and runtime logs (containerd/Docker, NVIDIA container runtime). The key is to follow time order and correlate with pod UID and node name.

Start with kubectl describe pod and events. For Milestone 4 (scheduling hot spots), events like “0/10 nodes are available: 10 Insufficient nvidia.com/gpu” or “node(s) had taint” reveal whether your placement rules (taints/tolerations, node affinity) are too strict. If pods are pending while GPUs exist, you likely have fragmentation (e.g., many nodes each with 1 free GPU but pods request 2), or topology constraints that prevent packing.

For Milestone 3 (OOM and fragmentation), distinguish between host OOM and container OOM, and between CPU memory and GPU memory. Kubernetes OOMKilled events usually refer to CPU memory (cgroup limit). GPU memory OOM appears in application logs (CUDA OOM, framework allocator errors) and may not restart the pod unless the process exits. If the process survives but latency spikes, look for repeated allocation failures triggering fallback paths.

On the node, kubelet logs can show device plugin registration failures and allocation issues. Runtime logs can expose missing driver libraries or mismatched CUDA versions. A frequent pitfall is assuming “device plugin running” means “GPU usable.” Validate by checking that the container sees /dev/nvidia* devices and that nvidia-smi works inside the pod (or use a minimal validation container image).
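
One way to run that validation is a minimal smoke-test pod that requests a GPU and runs nvidia-smi once (the binary is injected by the NVIDIA container runtime, not the image). The image tag and toleration are assumptions about your cluster setup:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  tolerations:
  - key: nvidia.com/gpu          # assumed GPU node group taint
    operator: Exists
    effect: NoSchedule
  containers:
  - name: cuda-check
    image: nvidia/cuda:12.4.1-base-ubuntu22.04  # any CUDA base image works
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
```

If the pod completes and its logs show the expected device, the driver, runtime, and device plugin chain is working; if it fails, the error usually points to the broken layer (scheduling, runtime class, or driver mismatch).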

  • Scheduler debugging: inspect pending pod events, check node labels/taints, verify resource requests match reality
  • Kubelet debugging: look for OOM kills, device allocation errors, volume mount timeouts
  • Runtime debugging: confirm NVIDIA runtime class, driver compatibility, and container library paths

The practical outcome is repeatable root-cause isolation: you can explain whether a slowdown is scheduling delay, node pressure, container throttling, or GPU-side failure, instead of treating them as one bucket.

Section 5.4: Profiling inference: throughput, queueing, and tail latency

Inference performance problems often present as tail latency spikes rather than average regressions. Profiling must therefore include queueing and concurrency, not only GPU utilization. A model server can keep the GPU “busy” while requests pile up because batch formation is inefficient or because CPU-side tokenization is saturated. Conversely, latency can rise with low GPU usage if requests are serialized due to a lock, a low concurrency setting, or a single-threaded preprocessor.

Milestone 2 becomes concrete here: trace latency from the service layer (p95/p99) to queue depth (in the model server or ingress) to per-pod throughput. Then tie it to resource contention: container CPU throttling is a common culprit when CPU limits are set too low for preprocessing. Check for high container_cpu_cfs_throttled_seconds_total while GPU utilization remains low. If you see that pattern, raising CPU limits or moving preprocessing off-path can improve GPU occupancy and reduce cost per request.
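The "throttled CPU, idle GPU" pattern can be encoded as an alert. A hedged sketch as a Prometheus rule: the metric names are standard cAdvisor and DCGM exporter series, but the namespace, thresholds, and label join depend on how your exporters attach pod labels, so treat the values as illustrative assumptions.

```yaml
# Sketch: fire when a GPU service is heavily CPU-throttled while the GPU idles.
# Thresholds (1 throttled s/s, 30% util) are illustrative, not recommendations.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-starved-by-cpu        # hypothetical name
spec:
  groups:
    - name: gpu-inference
      rules:
        - alert: CPUThrottlingStarvesGPU
          expr: |
            rate(container_cpu_cfs_throttled_seconds_total{namespace="ml-inference"}[5m]) > 1
            and on (pod)
            avg by (pod) (DCGM_FI_DEV_GPU_UTIL{namespace="ml-inference"}) < 30
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.pod }} is CPU-throttled while GPU utilization is low"
```

When this fires, the remediation discussed above applies: raise CPU limits or move preprocessing off-path before adding replicas.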

Tail latency is also sensitive to GPU thermal or power throttling (Section 5.1). If clocks drop under sustained load, average throughput may hold while p99 degrades. Another frequent pattern is memory pressure inside the framework: garbage collection or allocator compaction pauses that appear as periodic latency spikes. Collect request-level histograms in the application and align them with GPU metrics timestamps.

Common mistakes: optimizing batch size solely for throughput while violating latency SLOs, or scaling replicas without considering shared bottlenecks like a single upstream queue or a saturated node NIC. Practical profiling workflow: (1) fix a representative traffic shape, (2) measure per-replica max sustainable throughput, (3) observe queue growth onset, (4) validate headroom under failure (one replica down), and (5) set autoscaling targets based on queueing, not only CPU.

Section 5.5: Capacity analysis: fragmentation, bin packing, and saturation

GPU capacity is expensive, so you must detect waste modes: fragmentation, bin-packing gaps, and saturation. Fragmentation happens when free GPUs exist but cannot satisfy pod shapes or placement rules. For example, mixing 1-GPU and 2-GPU pods on a fleet of 4-GPU nodes can leave a single free GPU stranded on many nodes whenever the remaining pending pods all request 2. This is Milestone 4: identify where the cluster has "available capacity" that is not schedulable.

Start with a simple accounting table: for each node, total GPUs, allocated GPUs, free GPUs, and which pods hold them. Then layer in constraints: taints/tolerations, node affinity, topology spread, and priority classes. If many nodes show 1 free GPU but no pending pods can use 1 GPU, you have a shape mismatch. If many GPUs are allocated to low-utilization pods, you have consolidation opportunities (or you need MIG/time-slicing, if policy allows).
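The accounting table can be bootstrapped from the API without extra tooling. A sketch (the jsonpath keys are standard; output formatting is deliberately minimal, and the jq pipeline sums GPU requests across containers per pod):

```shell
# Per-node GPU allocatable count (escaped dot is required in custom-columns keys)
kubectl get nodes -o custom-columns='NODE:.metadata.name,GPU_ALLOCATABLE:.status.allocatable.nvidia\.com/gpu'

# Which pods hold GPUs, on which node, and how many each
kubectl get pods -A -o json | jq -r '
  .items[]
  | . as $p
  | [$p.spec.containers[].resources.requests["nvidia.com/gpu"] // "0"]
  | map(tonumber) | add as $gpus
  | select($gpus > 0)
  | [$p.spec.nodeName, $p.metadata.namespace, $p.metadata.name, ($gpus|tostring)]
  | @tsv'
```

Subtracting the second view from the first, per node, gives the "free GPUs" column; the constraint layers (taints, affinity, priority) still have to be checked by hand or by policy tooling.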

Saturation analysis asks: are we out of GPUs, or out of something else? GPU workloads can be blocked by node CPU, memory, ephemeral storage, or network. If pods request GPUs but are CPU-starved, the GPU becomes underutilized while the node is saturated on CPU. This is a cost trap: you pay for GPUs to wait. Address it by right-sizing CPU/memory requests, separating preprocessing into CPU pools, or using node classes that balance CPU:GPU ratios.

Bin packing is partly policy. Affinity rules that spread pods evenly can reduce blast radius but increase fragmentation and cost. Conversely, packing tightly can improve utilization but raise risk. Use metrics to choose intentionally: if SLOs are strict, keep headroom; if cost is paramount, pack and rely on priority/preemption to protect critical services.

  • Fragmentation symptom: pending pods + free GPUs present
  • Bin-packing gap symptom: many nodes partially used, few fully used
  • Saturation symptom: GPU utilization high + queueing high + scaling unable to add nodes

The practical outcome is that you can quantify “how many more requests can this cluster serve” and “how many GPUs are effectively wasted” with evidence, not intuition.

Section 5.6: Reliability practices: runbooks, postmortems, and SLOs

Observability is only valuable if it changes outcomes during incidents and prevents repeats. Reliability for GPU workloads means writing runbooks that match your troubleshooting checklist (Milestone 1) and practicing an incident workflow that ends with an actionable report (Milestone 5). Your runbook should be opinionated: which dashboards to open first, which kubectl commands to run, and what decisions are allowed (scale replicas, cordon node, roll back model, change batch settings).

Define SLOs that reflect user experience and GPU realities. For inference, include request success rate and tail latency (p95/p99). Add capacity SLOs such as “no more than X minutes of pending time for GPU pods at priority P1,” which catches scheduling failures early. Tie alerts to these SLOs, not to arbitrary utilization thresholds.
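The pending-time capacity SLO can also be expressed as an alert. A sketch using kube-state-metrics; the namespace pattern and duration are assumptions, and restricting to a specific priority class would require a further label join (omitted here for brevity):

```yaml
# Sketch: page when a GPU-namespace pod stays Pending beyond the SLO window.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-pending-slo           # hypothetical name
spec:
  groups:
    - name: gpu-capacity
      rules:
        - alert: GPUPodPendingTooLong
          expr: |
            kube_pod_status_phase{phase="Pending", namespace=~"ml-.*"} == 1
          for: 10m                # the "X minutes" from your SLO; tune per priority
          labels:
            severity: critical
          annotations:
            summary: "Pod {{ $labels.pod }} pending beyond SLO in {{ $labels.namespace }}"
```

Because the alert is tied to an SLO rather than a utilization threshold, it catches scheduling failures (fragmentation, missing tolerations, autoscaler gaps) regardless of which one is the cause.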

For postmortems, avoid the trap of “GPU was overloaded.” Instead, document the chain: trigger, detection, impact, contributing factors (e.g., CPU throttling starved GPU, fragmentation prevented scale-out, thermal throttling increased p99), and concrete remediations. Good remediations are specific and testable: adjust resource requests, modify affinity to reduce fragmentation, add node pools with different GPU shapes, improve canarying for new models, or add budget guardrails that block runaway replicas.

An incident report template that works well includes: timeline, scope, graphs (latency + queue + GPU util + node pressure), what was tried, what worked, and follow-ups with owners and due dates. Close the loop by updating the runbook and adding a regression test or alert so the same pattern is detected earlier next time.

The practical outcome is resilience and cost control: your team responds faster, wastes fewer GPU-hours, and can justify scaling decisions with data and SLO alignment.

Chapter milestones
  • Milestone 1: Build a GPU-focused troubleshooting checklist
  • Milestone 2: Trace a latency issue from service to node to GPU
  • Milestone 3: Diagnose OOM, throttling, and GPU memory fragmentation
  • Milestone 4: Investigate scheduling hot spots and bin-packing gaps
  • Milestone 5: Produce an incident report with actionable remediation
Chapter quiz

1. Why must observability for GPU workloads go beyond container-level health checks?

Correct answer: GPU workloads can appear healthy while underperforming due to GPU-specific issues like low occupancy, fragmentation, PCIe bottlenecks, or thermal throttling
The chapter highlights that GPU services may be 'healthy' at the container level yet slow because of device-level constraints invisible to basic health checks.

2. What is the recommended troubleshooting workflow direction for a latency issue in this chapter?

Correct answer: Start from user-visible symptoms, then narrow to service, node, and finally GPU/device realities
Milestone 2 emphasizes tracing latency from service to node to GPU, narrowing step-by-step from symptoms to device causes.

3. Which set of categories best matches the chapter’s approach to classifying why latency/throughput targets are missed?

Correct answer: Compute-bound, memory-bound, I/O-bound, or scheduler-bound
The chapter’s repeatable questions explicitly ask whether the bottleneck is compute, memory, I/O, or scheduling.

4. Which issue is specifically called out as a way GPUs can be wasted even if the service is running?

Correct answer: GPU memory fragmentation or policy choices that prevent efficient use
The chapter notes wasted GPUs due to fragmentation or policy, tying troubleshooting to cost control.

5. What is the primary purpose of producing an incident report in this chapter’s troubleshooting practice?

Correct answer: To document findings and lead to actionable remediation and guardrails that prevent recurrence while controlling cost
Milestone 5 focuses on an incident report that drives remediation and prevention guardrails, aligned with cost control.

Chapter 6: Cost Controls, Governance, and Exam-Style Capstone

GPU-enabled Kubernetes clusters can burn budget faster than any other platform component because they concentrate high hourly rates, bursty training jobs, and “just in case” overprovisioning. In this chapter you will treat cost as a first-class SLO alongside reliability and performance. The goal is not merely to “spend less,” but to create predictable guardrails: teams can run experiments safely, cluster operators can enforce fair sharing, and finance stakeholders can understand where spend is coming from.

You will implement a layered control system. First, you apply hard guardrails (quotas, limits, priority, and preemption) to prevent runaway usage. Second, you enforce governance (RBAC, namespaces, admission control) so GPU access is deliberate and auditable. Third, you add cost visibility signals (labels, allocation dimensions, and reporting patterns) so chargeback/showback becomes possible. Fourth, you optimize with cost-aware scheduling and scaling strategies that respect GPU scarcity and latency requirements. Finally, you complete a timed capstone lab that mirrors certification-style tasks: build, validate, troubleshoot, and document your decisions.

Keep an exam mindset: every object should be explainable, reproducible, and verifiable via kubectl outputs. You are aiming for practical outcomes: fewer surprise bills, fewer scheduling dead-ends, and faster incident triage when GPU workloads fail to start or scale.

Practice note (applies to every milestone in this chapter): whether you are implementing cost guardrails with quotas, limits, and priority; enforcing policy checks for GPU usage and namespaces; adding budget visibility and chargeback/showback signals; optimizing spend with scheduling and scaling strategies; or completing the timed capstone — document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Sections in this chapter
Section 6.1: FinOps for Kubernetes AI: cost drivers and control points

FinOps for AI on Kubernetes starts by identifying the cost drivers unique to GPU workloads: GPU node hourly price, idle time (nodes running with no pods using GPUs), inefficient bin-packing (fragmented GPUs or CPU/RAM), and oversized requests that force larger instances. Unlike CPU-only clusters, GPU cost is often dominated by node availability rather than actual GPU utilization. Your control points therefore span both Kubernetes resources and cloud infrastructure: who can request GPUs, how many they can request, how long they can run, and when nodes are allowed to scale out.

Milestone 1 focuses on guardrails you can enforce directly in Kubernetes. Use Namespaces to define billing and ownership boundaries (team, project, environment). Apply ResourceQuotas for requests.nvidia.com/gpu (and optionally limits.nvidia.com/gpu) so a single namespace cannot consume the entire accelerator fleet. Pair quotas with LimitRanges to require defaults and cap per-pod CPU/memory requests, reducing accidental over-allocation that can strand GPUs due to memory pressure.

  • ResourceQuota: caps total GPUs, CPU, memory, and object counts per namespace.
  • LimitRange: enforces per-container min/max and default requests/limits.
  • PriorityClass: ensures critical inference or platform jobs win scheduling over ad-hoc training runs.
  • Preemption: allows higher priority pods to reclaim resources when the cluster is saturated.
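The four guardrails above can be sketched as manifests for one team namespace. The names and sizes (ml-team-a, a 4-GPU cap) are illustrative assumptions matching the 8-GPU/two-team example below, not recommendations:

```yaml
# Sketch: hard guardrails for one team namespace.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: ml-team-a
spec:
  hard:
    requests.nvidia.com/gpu: "4"   # hard cap: the team's worst-case blast radius
    requests.cpu: "64"
    requests.memory: 256Gi
---
apiVersion: v1
kind: LimitRange
metadata:
  name: defaults
  namespace: ml-team-a
spec:
  limits:
    - type: Container
      default:                     # applied when a container omits limits
        cpu: "4"
        memory: 16Gi
      defaultRequest:              # applied when a container omits requests
        cpu: "2"
        memory: 8Gi
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: inference-critical
value: 100000
preemptionPolicy: PreemptLowerPriority
globalDefault: false
description: "Inference wins over ad-hoc training when GPUs are saturated"
```

To verify the blast radius, submit a pod requesting 5 GPUs in this namespace; the API server should reject it with an "exceeded quota" error, which is exactly the evidence the practical outcome below asks for.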

Engineering judgment: quotas should reflect both fairness and operational reality. If you have 8 GPUs total and want at least two teams to run concurrently, a quota of 4 GPUs per team creates a predictable worst case. Avoid setting quotas so low that teams bypass them with “temporary” namespaces; instead, combine quotas with governance controls in the next section. Common mistakes include forgetting that GPU requests are integer-only (you cannot request 0.5 GPU), setting CPU/memory limits too tightly (causing OOM kills), and using priority without a clear policy—leading to constant preemption churn and poor training throughput.

Practical outcome: after this milestone, you can answer “what is the maximum GPU blast radius per team?” and demonstrate it by trying (and failing) to schedule a pod that exceeds the namespace quota.

Section 6.2: Governance primitives: RBAC, namespaces, and admission control

Cost controls fail if governance is weak. If anyone can create namespaces, bind cluster roles, or deploy to GPU nodes, then quotas and priorities become optional. Milestone 2 adds structural governance: RBAC defines who can do what, namespaces define where they can do it, and admission control acts as a policy enforcement point that can reject non-compliant workloads before they land on the cluster.

Start by defining standard namespaces per team or project and restricting namespace creation to platform operators. For each team namespace, grant a scoped Role/RoleBinding that allows typical operations (create Deployments, Jobs, Services, ConfigMaps) but not cluster-wide changes (nodes, CRDs, webhook configurations). A common operational pattern is to allow developers to manage workloads while reserving GPU node pool changes and admission policies for cluster admins.
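A scoped workload-deployer role for one team namespace might look like the following sketch; the namespace and IdP group name are placeholders:

```yaml
# Sketch: developers can manage workloads in their namespace, nothing cluster-wide.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: workload-deployer
  namespace: ml-team-a
rules:
  - apiGroups: ["", "apps", "batch"]
    resources: ["deployments", "jobs", "services", "configmaps", "pods", "pods/log"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: ml-team-a-deployers
  namespace: ml-team-a
subjects:
  - kind: Group
    name: ml-team-a-devs           # placeholder identity-provider group
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: workload-deployer
  apiGroup: rbac.authorization.k8s.io
```

Note what is deliberately absent: no node, CRD, or webhook permissions, and no verbs on rolebindings, which closes the most common privilege-escalation paths listed below.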

  • Use least privilege: prefer Role over ClusterRole whenever possible.
  • Separate duties: “workload deployers” vs “policy maintainers” vs “node pool operators.”
  • Prevent privilege escalation: avoid granting create on rolebindings unless necessary.

Admission control is where governance becomes real-time enforcement. Even without custom policies, you should enable standard admission plugins (as managed by your distribution) and then layer dedicated policy engines (next section). Typical mistakes: granting broad permissions for convenience (“cluster-admin for the team”), allowing users to label nodes or modify taints (they can route themselves to GPUs), and relying on documentation rather than enforcement (“please don’t use GPUs for notebooks”).

Practical outcome: you can prove governance by attempting to deploy a GPU pod from a non-authorized namespace or user and observing an explicit authorization failure (RBAC) or a policy rejection (admission). This becomes essential evidence during audits and in certification-style troubleshooting.

Section 6.3: Policy as code with OPA Gatekeeper/Kyverno for GPU rules

Policy-as-code turns your GPU governance into versioned, testable artifacts. OPA Gatekeeper and Kyverno both integrate as admission controllers; the difference is authoring style (Rego vs YAML-like rules) and ecosystem. For exam readiness, focus on what policies accomplish: prevent accidental GPU consumption, enforce naming/labeling conventions for allocation, and require scheduling constraints that keep GPU workloads on approved node pools.

Milestone 2 continues here: enforce policy checks for GPU usage and namespaces. Common GPU-focused rules include: only specific namespaces may request nvidia.com/gpu; any pod requesting GPUs must set resources.limits["nvidia.com/gpu"] explicitly (Kubernetes already requires extended-resource requests to equal limits, so the policy's value is catching omissions early with a clear message); GPU pods must tolerate the GPU node taint and include node affinity to GPU-labeled nodes; and required labels like cost-center, team, and environment must be present.

  • Namespace allowlist: deny GPU requests outside approved namespaces.
  • Require limits: ensure GPU requests/limits are explicit and consistent.
  • Scheduling guardrail: require tolerations/affinity for GPU nodes to prevent accidental placement attempts on CPU nodes.
  • Mandatory labels: enforce allocation tags needed for reporting and showback.
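Two of these rules can be sketched as a single Kyverno ClusterPolicy. The namespace allowlist, label keys, and JMESPath expression are assumptions for illustration — validate the syntax against your Kyverno version before relying on it, and note it starts in Audit mode as recommended above:

```yaml
# Sketch: deny GPU requests outside approved namespaces; require ownership labels.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: gpu-guardrails              # hypothetical name
spec:
  validationFailureAction: Audit    # flip to Enforce once the audit log is clean
  rules:
    - name: gpu-namespace-allowlist
      match:
        any:
          - resources:
              kinds: ["Pod"]
      exclude:
        any:
          - resources:
              namespaces: ["kube-system", "gpu-operator"]  # don't block system pods
      preconditions:
        all:
          # Only evaluate pods that actually request GPUs (JMESPath filter).
          - key: "{{ request.object.spec.containers[?resources.requests.\"nvidia.com/gpu\"] | length(@) }}"
            operator: GreaterThan
            value: 0
      validate:
        message: "nvidia.com/gpu may only be requested in approved namespaces."
        deny:
          conditions:
            all:
              - key: "{{ request.namespace }}"
                operator: AnyNotIn
                value: ["ml-team-a", "ml-team-b"]   # assumed allowlist
    - name: require-ownership-labels
      match:
        any:
          - resources:
              kinds: ["Pod"]
      validate:
        message: "Pods must carry team and cost-center labels for allocation."
        pattern:
          metadata:
            labels:
              team: "?*"
              cost-center: "?*"
```

The exclude block is the guard against the pitfall mentioned above: without it, the policy can block the device plugin DaemonSet and other system components.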

Engineering judgment: keep policies minimal and composable. Overly strict rules cause friction and bypass behavior (shadow clusters, direct cloud VMs). Start with deny rules that prevent the biggest failures: unbounded GPU use and missing ownership metadata. Then add progressive enforcement (“warn/audit” mode first, then “enforce”). Mistakes include writing policies that block system components (e.g., device plugin DaemonSets), forgetting to exempt kube-system, and enforcing affinity in a way that breaks portability across environments.

Practical outcome: when a developer submits a GPU Job without the required labels or in the wrong namespace, the cluster rejects it immediately with a human-readable message. That converts an expensive surprise into a quick, cheap feedback loop.

Section 6.4: Cost-aware scheduling and instance selection strategies

Once hard guardrails and governance are in place, you can optimize spend without sacrificing throughput. Milestone 4 is about making scheduling and scaling decisions that reduce idle GPU time and avoid expensive node types when cheaper options satisfy requirements. In Kubernetes, cost-aware scheduling is not a single feature; it is a set of patterns: taints/tolerations to isolate GPU nodes, node affinity to target the right accelerator class, and autoscaling policies that scale up only when justified by pending GPU pods.

Start with node pools by accelerator type (e.g., T4 vs A10 vs A100) and label nodes accordingly (e.g., gpu.nvidia.com/class=a10, gpu.nvidia.com/memory=24gb). For training jobs with flexible performance needs, prefer cheaper GPUs and allow fallbacks via preferredDuringSchedulingIgnoredDuringExecution affinity. For latency-critical inference, use strict affinity to a known class and pair it with a PriorityClass so inference pods preempt lower value training if the cluster is saturated.
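A flexible training pod following this pattern might be sketched as below; the node-label convention, PriorityClass name, and image are assumptions from the surrounding text, not standard names:

```yaml
# Sketch: prefer the cheapest acceptable GPU class, fall back if unavailable.
apiVersion: v1
kind: Pod
metadata:
  name: train-job                    # illustrative
spec:
  priorityClassName: training-batch  # assumed class, lower value than inference
  tolerations:
    - key: nvidia.com/gpu            # matches the GPU node taint
      operator: Exists
      effect: NoSchedule
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100                # strongest preference: cheapest class
          preference:
            matchExpressions:
              - key: gpu.nvidia.com/class
                operator: In
                values: ["t4"]
        - weight: 10                 # fallback: mid-tier class
          preference:
            matchExpressions:
              - key: gpu.nvidia.com/class
                operator: In
                values: ["a10"]
  containers:
    - name: trainer
      image: registry.example.com/trainer:latest   # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1
```

Swapping preferred for requiredDuringSchedulingIgnoredDuringExecution turns this into the strict-affinity variant described for latency-critical inference.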

  • Bin-pack intentionally: request realistic CPU/memory so multiple GPU pods can share a node when appropriate.
  • Use node autoscaling patterns: scale GPU node groups from zero when there are pending GPU pods; scale down aggressively when idle.
  • Right-size with VPA carefully: VPA can help CPU/memory right-sizing but must not destabilize training jobs via frequent evictions.
  • HPA for inference: scale replicas based on latency/QPS (and optionally GPU utilization), but ensure quotas prevent runaway scale-outs.
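The HPA pattern in the last bullet can be sketched as follows. The custom metric name assumes a metrics adapter exposing per-pod request rate; the replica bounds and target are illustrative and should be capped in line with the namespace GPU quota:

```yaml
# Sketch: request-rate-driven HPA for a GPU inference Deployment.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-inference               # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-inference
  minReplicas: 2
  maxReplicas: 8                    # keep below quota / node-pool capacity
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second   # assumed adapter-exposed metric
        target:
          type: AverageValue
          averageValue: "50"
```

Because each replica holds a GPU, maxReplicas is itself a cost guardrail: the quota stops runaway scale-outs even if the metric misbehaves.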

Common mistakes: over-requesting memory “just in case,” which forces larger (more expensive) instances; mixing incompatible workloads on the same GPU node without considering contention; and allowing node autoscaler to scale out for pods that are unschedulable due to missing tolerations/affinity—leading to wasted nodes. A practical troubleshooting workflow is: check pending pods, inspect events for “0/… nodes are available” reasons, verify tolerations and affinity, confirm the device plugin advertises GPUs, then confirm the autoscaler is seeing pending demand.

Practical outcome: your cluster scales GPUs up when real work arrives, packs them efficiently, and scales down when idle—while enforcing that only approved workloads can trigger that spend.

Section 6.5: Chargeback/showback: labels, allocation, and reporting patterns

Milestone 3 adds budget visibility and showback/chargeback signals. Even strong controls are hard to sustain if teams cannot see the financial impact of their choices. In Kubernetes, cost allocation typically begins with consistent metadata and ends with reports that map resource consumption to owners. The key is to decide which dimensions you need (team, project, cost center, environment, model name) and enforce them as labels or annotations on namespaces and workloads.

Implement a labeling standard and make it enforceable (via the policies in Section 6.3). At minimum, label namespaces with team, cost-center, and environment. Then propagate or require equivalent labels on workloads to support granular views for shared namespaces. Add workload identifiers such as app, model, or experiment to separate long-running inference from short-lived training jobs.
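The namespace-level minimum can be sketched in a few lines; the label keys are the conventions proposed here, not a Kubernetes standard, and the values are placeholders:

```yaml
# Sketch: namespace as the primary billing boundary, with allocation labels.
apiVersion: v1
kind: Namespace
metadata:
  name: ml-team-a
  labels:
    team: team-a
    cost-center: cc-1234
    environment: prod
```

The policies from Section 6.3 are what make this standard enforceable rather than advisory: a namespace or workload missing these labels should be rejected or flagged at admission time.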

  • Namespace as the primary billing boundary: simplest and most reliable for allocation.
  • Workload labels for detail: useful when teams share namespaces or run many experiments.
  • Node pool attribution: label nodes by GPU class and pricing tier for cost breakdowns.
  • Budgets and alerts: set thresholds per team (outside Kubernetes) and tie alerts to namespace labels.

Engineering judgment: showback is often the first step—share dashboards and weekly reports before implementing internal chargeback. Focus on trends (idle GPU hours, cost per training run, cost per 1k inferences) rather than perfect precision. Mistakes include relying on pod names (mutable) instead of labels (stable intent), failing to label ephemeral Jobs (which then appear as “unallocated”), and ignoring shared overhead (device plugin, monitoring, system daemons). Your reporting should clearly separate “shared platform cost” from “team consumption.”

Practical outcome: you can produce a report that answers “which team spent the most GPU-hours this week and on which model or environment,” and you can justify it with enforced labels and auditable policies.

Section 6.6: Capstone checklist: build, validate, troubleshoot, and document

Milestone 5 is a timed capstone that mirrors certification tasks: implement controls, validate behavior, troubleshoot scheduling failures, and document outcomes. Treat this as an operational runbook exercise. The goal is not only to configure objects, but to prove the system works with observable evidence (events, policy denials, quota errors, and successful GPU scheduling).

  • Build: create team namespaces; apply ResourceQuotas/LimitRanges; define PriorityClasses; ensure GPU nodes are tainted/labeled; confirm device plugin is healthy.
  • Govern: apply RBAC so only approved users/namespaces can deploy GPU workloads; restrict namespace creation; ensure no broad privilege escalation paths exist.
  • Enforce: install and configure Gatekeeper or Kyverno; add policies for GPU allowlists, required labels, and scheduling constraints; start in audit mode if needed, then enforce.
  • Validate: run a GPU pod in an approved namespace (should schedule); attempt the same in a non-approved namespace (should be denied); exceed quota intentionally (should fail); verify priority/preemption by competing workloads.
  • Troubleshoot: use kubectl describe pod and events to diagnose pending pods; check node labels/taints, tolerations, and affinity; confirm GPU resources appear in kubectl describe node.
  • Document: record decisions (quota sizes, priority rationale, policy exceptions), and capture command outputs that prove compliance and cost guardrail behavior.
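The validate and troubleshoot steps translate into a short evidence-gathering sequence; every name below is a placeholder for your own objects:

```shell
# Validate: expected-success and expected-failure cases
kubectl apply -f gpu-pod.yaml -n ml-team-a     # approved namespace: should schedule
kubectl apply -f gpu-pod.yaml -n sandbox       # unapproved: expect policy denial
kubectl apply -f over-quota.yaml -n ml-team-a  # expect an "exceeded quota" error

# Capture the evidence for the documentation step
kubectl describe pod gpu-pod -n ml-team-a | sed -n '/Events:/,$p'
kubectl describe node gpu-node-1 | grep -A1 'nvidia.com/gpu'
kubectl get resourcequota -n ml-team-a -o wide
```

Saving these outputs alongside the manifests gives you the "explainable, reproducible, verifiable via kubectl" evidence the exam mindset calls for.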

Common capstone failure modes are predictable: policies blocking system namespaces, quotas applied to the wrong namespace, GPU pods missing tolerations, and autoscaler scaling out for unschedulable pods due to affinity mistakes. Your timed strategy should be: implement one layer at a time, validate immediately, and only then add the next layer. If something breaks, roll back the last change and re-test; do not “pile on” changes and hope the cluster recovers.

Practical outcome: by the end of the capstone, you have a defensible GPU multi-tenant platform with enforced guardrails, visible allocation signals, and a repeatable troubleshooting workflow—exactly the operational posture expected in real environments and reflected in certification-style tasks.

Chapter milestones
  • Milestone 1: Implement cost guardrails with quotas, limits, and priority
  • Milestone 2: Enforce policy checks for GPU usage and namespaces
  • Milestone 3: Add budget visibility and chargeback/showback signals
  • Milestone 4: Optimize spend with scheduling and scaling strategies
  • Milestone 5: Complete a timed capstone lab mirroring certification tasks
Chapter quiz

1. Why does Chapter 6 emphasize treating cost as a first-class SLO alongside reliability and performance for GPU-enabled Kubernetes clusters?

Correct answer: Because GPU clusters can rapidly accumulate spend due to high hourly rates, bursty jobs, and overprovisioning, requiring predictable guardrails
The chapter frames cost control as essential because GPUs are expensive and workloads can spike unpredictably, so guardrails are needed for predictable operations.

2. Which set best represents the chapter’s first layer of controls designed to prevent runaway GPU usage?

Correct answer: Quotas, limits, priority, and preemption
The first layer is hard guardrails: quotas/limits plus priority and preemption to control and constrain resource use.

3. What is the primary purpose of the governance layer described in Chapter 6?

Correct answer: To make GPU access deliberate and auditable using mechanisms like RBAC, namespaces, and admission control
Governance controls ensure access is intentional and traceable, supporting enforcement and auditability.

4. How does Chapter 6 suggest enabling chargeback/showback for GPU spend within the cluster?

Correct answer: By adding cost visibility signals such as labels, allocation dimensions, and reporting patterns
Chargeback/showback requires attribution, which the chapter ties to labeling and allocation/reporting signals.

5. What does the chapter’s “exam mindset” for the timed capstone lab most strongly require?

Correct answer: Every object and decision should be explainable, reproducible, and verifiable via kubectl outputs
The capstone mirrors certification tasks, so work must be demonstrable and verifiable with kubectl, with clear, reproducible configurations.