AI Platform Engineering & MLOps · Part XXV of 34
The GPU scheduling stack: queue admission, gang scheduling, and hardware abstraction in three layers
How Kueue, Volcano, and the NVIDIA GPU Operator compose into a three-layer stack — and where each layer fails when something goes wrong.
Three components appear repeatedly in GPU-capable Kubernetes platforms: a job-queue admission controller, a gang-aware batch scheduler, and a GPU operator. Each is documented in isolation. The documentation rarely makes the responsibility boundary between them explicit — which layer decides whether a job runs, which layer decides where, and which layer decides how many GPU units exist in the first place.
This article maps each layer to a single question, traces a job through all three, and identifies which layer fails first in the three most common production failure modes. It is Part 7, Article 25 of the AI Platform Engineering & MLOps series.
One sentence per layer
Before the detail, the boundary in one sentence each:
- Kueue asks: does this job fit within the team’s quota right now?
- Volcano asks: can all pods in this job be placed simultaneously?
- The GPU Operator answers: how many schedulable GPU units does each node advertise to the kubelet?
Layer 1 — Kueue: quota admission
Kueue is a Kubernetes-native job queueing system developed under SIG Scheduling, introduced to the Kubernetes ecosystem in October 2022. It does not replace the kube-scheduler — it sits in front of it, deciding whether a job is allowed to run at all before the scheduler ever sees the pods.
Four primitives
Kueue’s data model has four core primitives, documented in the Kueue concepts reference:
- ResourceFlavor — names a class of capacity (e.g. a node pool carrying A100 GPUs vs one carrying L40S GPUs). A ResourceFlavor maps to node labels and tolerations.
- ClusterQueue — the quota gate at cluster scope. A ClusterQueue owns a slice of the ResourceFlavor's capacity: e.g. team-a may use up to 8 nvidia.com/gpu units.
- LocalQueue — a namespace-scoped pointer into a ClusterQueue. Teams submit jobs to their namespace's LocalQueue; the LocalQueue forwards admission requests to the ClusterQueue.
- Workload — the admission object wrapping a Job, JobSet, PyTorchJob, or similar. Kueue holds the Workload in a Pending state (with the underlying pods suspended) until the ClusterQueue has capacity. When capacity is available, Kueue unsuspends the Workload and the scheduler sees the pods for the first time.
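The two primitives that sit either side of the ClusterQueue, ResourceFlavor and LocalQueue, can be sketched as follows. This is a minimal illustration rather than a reference configuration: the names gpu-a100 and team-a-queue match the configuration sketch later in this article, while the LocalQueue name team-a-local, the node label value, and the taint key are assumptions to adapt to your own node pools.

```yaml
# ResourceFlavor: names a class of capacity and maps it to node labels.
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: gpu-a100
spec:
  nodeLabels:
    # Assumed label value; check what GPU Feature Discovery sets on your nodes.
    nvidia.com/gpu.product: NVIDIA-A100-SXM4-80GB
  tolerations:
  # Assumes GPU nodes are tainted with this key.
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
---
# LocalQueue: the namespace-scoped entry point that teams submit to.
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: team-a-local          # hypothetical name
  namespace: team-a
spec:
  clusterQueue: team-a-queue  # forwards admission requests to this ClusterQueue
```

A job opts in to this queue by carrying the label kueue.x-k8s.io/queue-name: team-a-local.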
What Kueue does not do
Kueue does not place pods. It holds or releases a job; once released, placement is entirely the scheduler’s concern. This is stated explicitly in the ClusterQueue concepts documentation. The implication: if you need all pods of a distributed job to land simultaneously (gang scheduling), you need an additional mechanism at the placement layer, because Kueue’s release of a Workload does not guarantee atomic placement.
Layer 2 — Volcano: gang scheduling and placement
Volcano is a CNCF incubating batch scheduling project (CNCF projects page). It extends Kubernetes with a batch scheduling framework that adds two capabilities the default kube-scheduler does not provide: gang scheduling and topology-aware placement.
Why gang scheduling matters for distributed training
A distributed training job running PyTorch DDP or DeepSpeed requires all worker processes to start before any of them can begin the first communication round. Without gang scheduling, a partial allocation is possible: some workers start, others cannot schedule because the remaining nodes are occupied, and the running workers hold their GPU allocations while waiting for peers that never arrive. This is a starvation deadlock — documented in the Volcano project documentation. Volcano’s response is the PodGroup: all members of the group must be satisfiable before any pod is bound to a node.
Key primitives
- PodGroup — the atomic scheduling unit. The minMember field sets how many pods must be satisfiable before any are bound. Setting minMember lower than the job's actual worker count defeats the gang guarantee.
- Queue — a Volcano-level fair-share queue with configurable weights. Multiple teams can share a cluster-level capacity pool; the Queue CRD carries a capability ceiling and a reclaimable flag (allowing other teams to borrow idle quota).
- TopologyPolicy — routes workers to nodes that share NVLink or InfiniBand, reducing inter-node communication latency on multi-node training. The preferSingleSocket value is the practical default for most GPU node configurations. The preferSingleNUMANode value may cause jobs to wait indefinitely on nodes where GPUs span multiple NUMA domains (a common topology on multi-socket servers with more than four GPUs); verify your node NUMA topology with numactl --hardware before using this value in production.
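As a rough sketch of the Queue primitive, the fragment below declares a Volcano queue with a weight, a capability ceiling, and the reclaimable flag. The queue name matches the PodGroup in the configuration sketch later in this article; the weight and the ceiling of 8 GPUs are illustrative assumptions.

```yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: team-a-volcano-queue
spec:
  weight: 1             # relative fair-share weight against other queues (illustrative)
  reclaimable: true     # other queues may reclaim capacity this queue uses beyond its share
  capability:
    nvidia.com/gpu: 8   # hard ceiling on GPUs this queue can consume (illustrative)
```

If Kueue already owns quota for the cluster, it is usually simpler to keep the Volcano ceiling no tighter than the Kueue quota so that only one layer rejects work on quota grounds.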
How Kueue and Volcano compose
The two systems compose without conflict. Kueue manages quota — whether a job is allowed to consume resources at all. Volcano manages placement — whether all the pods of an admitted job can land simultaneously. The handoff point is the Workload unsuspend event: when Kueue determines quota is available and lifts the suspension, the pods become visible to the scheduler; if the scheduler is Volcano, it then applies PodGroup gang semantics before binding any pod.
Layer 3 — the GPU Operator: hardware abstraction
The NVIDIA GPU Operator is a Kubernetes operator that bundles the full software stack a GPU node needs: the driver, NVIDIA Container Toolkit, device plugin, Node Feature Discovery, GPU Feature Discovery, DCGM exporter for metrics, and MIG manager. One Helm release replaces a stack of host-level installs and independent DaemonSets.
What the device plugin does
The NVIDIA device plugin registers nvidia.com/gpu as an extended resource with the kubelet on each node. Once registered, pods can request it like any other resource in a container spec. The count exposed depends entirely on how the operator is configured: a node with 8 physical A100s might advertise 8 units (whole-GPU mode), 64 units (time-slicing with 8× replication per card), or a set of MIG slice units (e.g. 56 units under the 1g.10gb profile across 8 GPUs, at 7 slices per card).
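The 64-unit figure comes from a time-slicing configuration handed to the device plugin. A sketch of that configuration follows, assuming the GPU Operator runs in a namespace called gpu-operator; the ConfigMap name time-slicing-config and the key any are hypothetical names chosen for this example.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config   # hypothetical name
  namespace: gpu-operator     # assumed operator namespace
data:
  # "any" is an arbitrary key; the operator is told which key to use as the default.
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 8         # 8 physical GPUs x 8 replicas = 64 advertised units
```

The operator picks this up when devicePlugin.config.name points at the ConfigMap and devicePlugin.config.default names the key. Time-sliced replicas share one physical card without memory isolation between them, so 64 advertised units are still 8 GPUs of compute.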
DCGM exporter and observability
The bundled DCGM exporter emits GPU metrics directly into a Prometheus scrape endpoint. The metrics most useful for platform utilisation tracking are DCGM_FI_DEV_GPU_UTIL (SM utilisation percentage), DCGM_FI_DEV_FB_USED (framebuffer memory used), and DCGM_FI_PROF_PIPE_TENSOR_ACTIVE (tensor core active percentage — the most direct signal for whether a training job is doing useful compute). Without DCGM exporter, GPU utilisation is invisible to the platform; the first sign of a problem is usually a job that runs for far longer than expected.
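One way to turn these metrics into a signal is an alert on sustained low SM utilisation. The rule below is a sketch that assumes the Prometheus Operator is installed (it uses its PrometheusRule resource); the rule name, namespace, threshold, and window are illustrative, and the gpu and Hostname labels should be checked against what your exporter version actually emits.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-low-utilisation   # hypothetical name
  namespace: monitoring       # assumed namespace
spec:
  groups:
  - name: gpu.utilisation
    rules:
    - alert: GPULowUtilisation
      # Threshold and window are illustrative, not recommendations.
      expr: avg_over_time(DCGM_FI_DEV_GPU_UTIL[30m]) < 10
      for: 30m
      labels:
        severity: warning
      annotations:
        summary: "GPU {{ $labels.gpu }} on {{ $labels.Hostname }} averaged under 10% SM utilisation"
```

As written, the rule also fires for GPUs that nobody has requested; restricting the expression to series that carry a pod label is a reasonable refinement if your exporter attaches one.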
Driver installation strategy
The operator’s driver component is correct for bare-metal nodes where no driver is pre-installed. On managed Kubernetes distributions that ship a pre-installed GPU driver (a common pattern on managed cloud node pools), driver.enabled=false is the safe configuration — it allows the operator to provide the device plugin, DCGM, and MIG manager without conflicting with the already-loaded driver module. Mixing operator-managed and pre-installed drivers on the same node leaves the node in a NotReady state due to module-load conflicts.
A job flowing through all three layers
Following a PyTorchJob submission from a data science team makes the layer interactions concrete:
1. The team submits a PyTorchJob to their namespace. The Kueue webhook intercepts the creation and immediately suspends the job, so no pods are created yet. Kueue wraps the job in a Workload object and queues it against the team’s LocalQueue.
2. Kueue checks the team’s ClusterQueue. If the requested nvidia.com/gpu count fits within the remaining quota (including any borrowable capacity from a cohort), Kueue unsuspends the Workload.
3. The unsuspended pods are visible to the scheduler. If the cluster uses Volcano as its scheduler (via schedulerName: volcano in the pod spec), Volcano checks the PodGroup associated with the job. The PodGroup’s minMember value must be satisfiable (all workers placeable at the same time) before any pod is bound.
4. Once all pods are placeable, Volcano binds them atomically to GPU nodes. The kubelet on each target node consults the device plugin’s resource registry to allocate the exact GPU units (whole cards, MIG slices, or time-sliced replicas) to the container.
5. The NVIDIA Container Toolkit injects the allocated GPU devices into the container’s cgroup. The training process starts with access to the allocated GPU units; that access is exclusive for whole cards and MIG slices, and shared for time-sliced replicas.
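Steps 1 to 3 hinge on two fields of the submitted job: the Kueue queue label and the scheduler name. The sketch below shows where they sit, assuming the Kubeflow Training Operator v1 API with its Volcano gang-scheduling integration enabled. The job name, image, and LocalQueue name are hypothetical, and the four training processes are expressed as one master plus three workers.

```yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: ddp-train                              # hypothetical
  namespace: team-a
  labels:
    kueue.x-k8s.io/queue-name: team-a-local    # hypothetical LocalQueue; layer 1 admits through it
spec:
  runPolicy:
    schedulingPolicy:
      queue: team-a-volcano-queue              # Volcano queue for the generated PodGroup
      minAvailable: 4                          # gang size: 1 master + 3 workers
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          schedulerName: volcano               # layer 2 places the pod
          containers:
          - name: pytorch                      # container name the operator expects
            image: registry.example.com/train:latest   # placeholder image
            resources:
              limits:
                nvidia.com/gpu: 1              # layer 3 decides what one unit means
    Worker:
      replicas: 3
      restartPolicy: OnFailure
      template:
        spec:
          schedulerName: volcano
          containers:
          - name: pytorch
            image: registry.example.com/train:latest   # placeholder image
            resources:
              limits:
                nvidia.com/gpu: 1
```

The total GPU request (4) is what Kueue checks against quota, and minAvailable (4) is what Volcano must be able to satisfy in one pass.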
A stack overview (diagram)
The responsibility boundary at each layer:
- Layer 1, Kueue: quota admission
- Layer 2, Volcano: gang scheduling + placement
- Layer 3, NVIDIA GPU Operator: hardware abstraction
  - Device plugin (sets the nvidia.com/gpu count)
    - Whole-card mode: 8 GPUs → 8 units
    - Time-slicing: 8 GPUs × 8 replicas → 64 units
    - MIG Manager (A100/H100 hardware partitions)
    - MPS server (shared CUDA context)
  - DCGM exporter
    - Prometheus (utilisation metrics)
Three common failure modes: which layer fails first
Understanding the layer boundaries makes it faster to diagnose production problems. Each failure mode has a clear first-failure layer:
Failure mode 1: jobs queue indefinitely despite available nodes
First-failure layer: Kueue (quota admission)
A job visible to the cluster but not progressing is held at the Workload level. The diagnostic is to check the Workload object’s status conditions. If the condition is QuotaReserved: False with reason Pending, the ClusterQueue has no available quota. Node availability is irrelevant: Kueue will not release the job until the quota condition is met, regardless of how many GPU nodes are idle.
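For orientation, the relevant part of a held Workload looks roughly like the fragment below when the object is printed as YAML. Only the fields named in the paragraph above are shown; the message and timestamps that a real object carries are omitted rather than invented.

```yaml
# Excerpt of a held Workload's status (other fields omitted)
status:
  conditions:
  - type: QuotaReserved
    status: "False"
    reason: Pending
```

Once quota is reserved and the Workload is admitted, the same list carries an Admitted condition with status "True".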
Common causes: nominalQuota is set too low for the job’s resource request, or the borrowing cohort is exhausted.
Failure mode 2: training jobs stall after partial pod placement
First-failure layer: Volcano (gang scheduling) — or its absence
If some pods of a distributed job start and others remain Pending, a partial placement has occurred. This is the starvation deadlock pattern: the running pods are waiting for communication partners that are stuck behind other jobs in the queue. The running pods hold their GPU allocations; the pending pods cannot schedule because the nodes are occupied.
Common causes: the PodGroup is missing, or minMember is set below the job’s actual replica count. Without a valid PodGroup, Volcano schedules the job as individual pods with no gang guarantee. The fix: verify the PodGroup is created alongside the job and that minMember equals the total worker count.
Failure mode 3: GPU nodes show Ready but no GPUs are schedulable
First-failure layer: GPU Operator (hardware abstraction)
If nodes are Ready but kubectl describe node shows zero nvidia.com/gpu capacity, the device plugin has not registered the resource. Common causes: driver container did not start (check the nvidia-driver DaemonSet pods); device plugin DaemonSet pod is in CrashLoopBackOff; an operator driver installation conflicted with a pre-existing driver on the host (common when driver.enabled is left true on a node image that already ships a driver).
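For comparison, this is roughly what the healthy state looks like in the node object for an 8-GPU node in whole-GPU mode. The fragment is an excerpt with all other resources omitted; in the failing case the nvidia.com/gpu key is either absent or reports "0".

```yaml
# Excerpt of a node's status (cpu, memory and other resources omitted)
status:
  capacity:
    nvidia.com/gpu: "8"
  allocatable:
    nvidia.com/gpu: "8"
```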
Configuration sketch
The following fragments illustrate the three-layer wiring. They are illustrative — tune namespaces, resource counts, and GPU SKU labels to your environment.
kueue-cluster-queue.yaml
# Layer 1: Kueue (quota admission)
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: team-a-queue
spec:
  namespaceSelector: {}
  cohort: shared-gpu-pool
  resourceGroups:
  - coveredResources: ["nvidia.com/gpu"]
    flavors:
    - name: gpu-a100
      resources:
      - name: nvidia.com/gpu
        nominalQuota: "8"       # guaranteed share
        borrowingLimit: "4"     # may borrow up to 4 extra
  preemption:
    reclaimWithinCohort: Any    # reclaim guaranteed share if borrowed
volcano-podgroup.yaml
# Layer 2: Volcano (gang scheduling)
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: pytorchjob-workers
  namespace: team-a
spec:
  minMember: 4                  # ALL 4 workers must be placeable before any are bound
  minResources:
    nvidia.com/gpu: "4"
  queue: team-a-volcano-queue
  priorityClassName: gpu-training
gpu-operator-helm-values.yaml
# Layer 3: GPU Operator (hardware abstraction)
# Bare-metal / self-managed node example (driver.enabled=true)
driver:
  enabled: true                 # set false on nodes with a pre-installed driver
toolkit:
  enabled: true
devicePlugin:
  enabled: true
dcgmExporter:
  enabled: true
  serviceMonitor:
    enabled: true               # scrape by Prometheus Operator
migManager:
  enabled: true                 # A100/H100 MIG partitioning
nodeStatusExporter:
  enabled: true
# Node Feature Discovery must tolerate the GPU taint to label GPU nodes;
# a Kueue ResourceFlavor's nodeLabels can then match those labels.
# The chart configures the NFD subchart under the key "node-feature-discovery".
node-feature-discovery:
  worker:
    tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
Quota borrowing and preemption
Kueue’s cohort model allows multiple ClusterQueues to share a borrowing pool. A team with a nominalQuota of 8 GPUs can borrow additional capacity from another team’s idle allocation, up to a configured borrowingLimit. When the lending team submits a new job and needs its quota back, preemption reclaims it. This is the mechanism that makes a fixed GPU pool feel larger than its physical count: idle quota does not sit unused while other teams’ jobs wait.
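Borrowing only works when a second ClusterQueue sits in the same cohort. A minimal sketch of such a lender follows, assuming a hypothetical second team, team-b, with the same flavor and an 8-GPU guaranteed share.

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: team-b-queue            # hypothetical second team
spec:
  namespaceSelector: {}
  cohort: shared-gpu-pool       # same cohort as team-a-queue, so idle quota can be borrowed
  resourceGroups:
  - coveredResources: ["nvidia.com/gpu"]
    flavors:
    - name: gpu-a100
      resources:
      - name: nvidia.com/gpu
        nominalQuota: 8         # team-b's guaranteed share
  preemption:
    reclaimWithinCohort: Any    # evict borrowers when team-b needs its share back
```

With team-a-queue holding a nominalQuota of 8 and a borrowingLimit of 4, the cohort totals 16 GPUs: team-a can run on up to 12 of them while team-b leaves at least 4 idle, and falls back to its guaranteed 8 once team-b reclaims.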
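The Kueue metrics to monitor for this behaviour are kueue_pending_workloads (rising count indicates queue pressure), kueue_admitted_workloads_total (rate of successful admission), and kueue_cluster_queue_resource_usage read against kueue_cluster_queue_nominal_quota (current utilisation against nominalQuota per ClusterQueue; these per-resource metrics are optional and must be enabled in the Kueue configuration).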
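Queue pressure can be turned into an alert in the same way as GPU utilisation. The rule below is a sketch that assumes the Prometheus Operator scrapes Kueue's metrics endpoint; the rule name, namespace, and one-hour window are illustrative choices, and the cluster_queue label should be verified against the Kueue version you run.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kueue-queue-pressure    # hypothetical name
  namespace: monitoring         # assumed namespace
spec:
  groups:
  - name: kueue.admission
    rules:
    - alert: KueueWorkloadsPendingTooLong
      # Fires when a ClusterQueue has had pending Workloads continuously for an hour.
      expr: sum by (cluster_queue) (kueue_pending_workloads) > 0
      for: 1h
      labels:
        severity: warning
      annotations:
        summary: "ClusterQueue {{ $labels.cluster_queue }} has had pending workloads for over an hour"
```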
Alternatives to this stack composition
The three-layer stack (Kueue + Volcano + GPU Operator) is one composition. Others are viable depending on workload mix:
- Kueue + default scheduler + GPU Operator: appropriate for single-pod or small-scale jobs that do not require gang semantics. Lower operational overhead; loses topology-aware placement and gang guarantees.
- Kueue + coscheduling plugin + GPU Operator: the Kubernetes SIG Scheduling coscheduling plugin provides gang admission as a scheduler plugin rather than a separate scheduler. Less operationally heavy than Volcano; less mature for topology-aware placement.
- Apache YuniKorn + GPU Operator: an alternative batch scheduler with gang scheduling and capacity management. Broader workload support (Spark, Flink natively), less adoption in the Kubernetes MLOps ecosystem as of 2026.
- Standalone device plugin DaemonSet (no GPU Operator): possible but requires managing the driver, container toolkit, and DCGM exporter separately. Increases operational surface at the hardware abstraction layer.
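For the coscheduling alternative, the gang is declared with a PodGroup from the scheduler-plugins project rather than Volcano's. The sketch below assumes the coscheduling plugin is deployed and uses a hypothetical group name and timeout; the API group differs from Volcano's even though the kind has the same name.

```yaml
apiVersion: scheduling.x-k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: ddp-train               # hypothetical
  namespace: team-a
spec:
  minMember: 4                  # same gang size as the Volcano example in this article
  scheduleTimeoutSeconds: 300   # illustrative: how long to wait for the full gang
```

Pods join the group through a label such as scheduling.x-k8s.io/pod-group: ddp-train. The label key has changed between scheduler-plugins releases, so check the version you deploy.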
The profile below describes the canonical three-layer stack (Kueue + Volcano + GPU Operator) across six dimensions:
- Gang scheduling: Full gang guarantee via PodGroup + minMember. All pods bound atomically or none. Starvation deadlock impossible when minMember is set correctly.
- Topology-aware placement: TopologyPolicy routes workers to nodes sharing NVLink or InfiniBand. preferSingleSocket is the safe default. Reduces all-reduce latency on multi-node training.
- Quota + fairness: Kueue owns quota (nominalQuota + borrowingLimit + preemption). Volcano Queue provides fair-share ordering within the admitted set. Do not run both quota systems simultaneously.
- Operational load: Two separate controllers (Kueue, Volcano) plus GPU Operator. Higher initial setup. Well-documented compose pattern; both projects are CNCF-hosted.
- Maturity: Kueue: Kubernetes SIG Scheduling, GA in 1.30+. Volcano: CNCF Incubating since 2022, used in production AI clusters at scale.
- Best for: Multi-node distributed training (PyTorchJob, MPIJob) requiring gang semantics. Teams with topology-sensitive workloads. Shared clusters needing per-team quota isolation.
What comes next: the GPU-sharing decision tree
The stack described above treats each nvidia.com/gpu unit as a whole-card allocation by default. Whether to partition those cards — and which mechanism to use (time-slicing, MPS, MIG, or fractional GPU virtualization) — is a separate decision that belongs at the GPU Operator layer. That decision has a significant effect on the isolation and latency jitter profile seen by jobs. The next article in this series, the GPU-sharing decision tree (Article 27), provides a structured framework for that choice — starting from workload type (training vs inference) and hardware generation (MIG-capable vs not).
References
- [1] “Introducing Kueue.” Kubernetes Blog, SIG Scheduling, October 2022. kubernetes.io/blog
- [2] Kueue Concepts (ResourceFlavor, ClusterQueue, LocalQueue, Workload). Kueue project documentation, kubernetes-sigs/kueue. kueue.sigs.k8s.io
- [3] Kueue Workload concept. Kueue project documentation. kueue.sigs.k8s.io/docs/concepts/workload
- [4] Kueue ClusterQueue concept (admission, placement boundary). Kueue project documentation. kueue.sigs.k8s.io/docs/concepts/cluster_queue
- [5] “Cloud Native Batch System Volcano moves to the CNCF Incubator.” CNCF Blog, April 2022. cncf.io/blog
- [6] Volcano project page. CNCF (Cloud Native Computing Foundation). cncf.io/projects/volcano
- [7] Volcano documentation (PodGroup, Queue, gang scheduling mechanics). Volcano project. volcano.sh/en/docs
- [8] NVIDIA GPU Operator documentation (components: driver, toolkit, device plugin, DCGM exporter, MIG manager, NFD, GFD). NVIDIA Corporation. docs.nvidia.com/datacenter/cloud-native/gpu-operator
- [9] NVIDIA Kubernetes Device Plugin (nvidia.com/gpu extended resource registration). NVIDIA Corporation, GitHub. github.com/NVIDIA/k8s-device-plugin
- [10] DCGM Exporter documentation (GPU metrics for Prometheus: DCGM_FI_DEV_GPU_UTIL, DCGM_FI_PROF_PIPE_TENSOR_ACTIVE). NVIDIA Corporation. docs.nvidia.com/datacenter/dcgm