AI Platform Engineering & MLOps · Part XXVIII of 34
HAMi — fractional GPU on the GPUs you actually have
HAMi virtualises GPU memory and compute at the CUDA-call layer — hard memory caps, soft compute throttling, no MIG-capable silicon required.
Most Kubernetes GPU clusters carry a mix of GPU generations: T4s bought two years ago, A10Gs added last year, possibly a pair of A100s set aside for large training runs. MIG provides hardware-level isolation, but only on A100 and H100. Time-slicing requires no extra tooling, but gives every tenant zero memory protection: one container’s out-of-memory error resets the entire physical device for every co-located workload. The gap between those two options is where most real-world inference clusters sit.
HAMi (Heterogeneous AI Computing Virtualisation Middleware) fills that gap. It is a CNCF Sandbox project — accepted 21 August 2024 — that virtualises GPU resources in any Kubernetes cluster without patching the kernel or requiring specialised silicon features. [1] It works on the T4, L4, A10G, V100, and A100 equally, because its enforcement mechanism lives in userspace, inside the container.
This article covers the four-component architecture, the precise isolation contract (what is a hard guarantee and what is not), how to stack quota-aware queuing on top, what observability looks like, and when HAMi is the right tool versus when hardware MIG partitioning is worth the administrative overhead.
The four-component architecture
HAMi’s official architecture documentation describes three sequential phases — Admission, Scheduling, Allocation — implemented by four discrete components: [2]
1. Mutating admission webhook (hami-webhook). Intercepts pod creation before the pod reaches the scheduler. Validates fractional resource requests, applies cluster-level defaults, and rewrites the pod’s schedulerName field to hami-scheduler so that subsequent placement is handled by the HAMi extender, not the default scheduler alone.
2. Scheduler extender (hami-scheduler). A sidecar to kube-scheduler that implements custom Filter, Score, and Bind verbs. During Filter it removes nodes that lack sufficient unallocated GPU memory to satisfy the request. During Bind it writes per-container allocation metadata (memory limit and compute cap) back to the node’s device-plugin allocation record.
3. Device plugin DaemonSet (hami-device-plugin). A fork of the upstream NVIDIA device plugin. Enumerates physical GPUs on each node, registers fractional resource names with kubelet (nvidia.com/gpumem, nvidia.com/gpucores, nvidia.com/gpumem-percentage), and places allocation metadata into the container’s environment at launch time.
4. HAMi-core (libvgpu.so). The in-container CUDA interception library. Injected at pod startup as a volume mount via the webhook. Its README describes the mechanism explicitly: it hijacks the API-call boundary between libcudart.so (CUDA Runtime) and libcuda.so (CUDA Driver) via LD_PRELOAD. [3]
The flow through these components for a workload pod:
Pod creation
└─▶ [1] hami-webhook (MutatingAdmissionWebhook)
      • validates resource requests
      • sets schedulerName: hami-scheduler
      • injects libvgpu.so volume mount
└─▶ [2] hami-scheduler (ExtenderConfig sidecar)
      • Filter: remove nodes with insufficient gpumem
      • Score: binpack or spread
      • Bind: write allocation record to device plugin
└─▶ [3] hami-device-plugin (DaemonSet on GPU node)
      • presents nvidia.com/gpumem etc. to kubelet
      • passes CUDA_DEVICE_MEMORY_LIMIT to container env
└─▶ [4] libvgpu.so (inside container, via LD_PRELOAD)
      • overrides cuMemAlloc / cudaMalloc
      • enforces CUDA_DEVICE_MEMORY_LIMIT (hard)
      • throttles SM submission for CUDA_DEVICE_SM_LIMIT (soft)

The isolation contract: what is guaranteed and what is not
The practical value of HAMi, and the reason it requires careful placement decisions, depends entirely on understanding where the isolation boundary is hard and where it is approximate.
Memory cap: hard limit
libvgpu.so overrides cuMemAlloc, cudaMalloc, and related allocation calls. Any allocation that would push a container’s cumulative device-memory usage above CUDA_DEVICE_MEMORY_LIMIT returns an out-of-memory error to the requesting application — not to neighbouring containers on the same physical card. [3] The per-container enforcement is the key distinction from time-slicing: with time-slicing, a CUDA OOM in any virtual GPU resets the entire physical device for all co-located workloads.
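The per-container nature of the cap can be illustrated with a toy accounting model. This is a minimal Python sketch, not HAMi-core’s implementation (which is a native library intercepting CUDA calls); the class `CappedAllocator`, the exception `DeviceOutOfMemory`, and the tenant names are hypothetical and exist only to show that a rejected allocation fails inside the requesting container while a neighbour’s accounting is untouched.

```python
class DeviceOutOfMemory(Exception):
    """Stands in for the CUDA out-of-memory error returned to the caller."""


class CappedAllocator:
    """Toy model of one container's device-memory accounting (hypothetical)."""

    def __init__(self, limit_mib: int) -> None:
        self.limit_mib = limit_mib  # plays the role of CUDA_DEVICE_MEMORY_LIMIT
        self.used_mib = 0

    def alloc(self, size_mib: int) -> None:
        # Reject the request if the cumulative total would exceed the cap.
        if self.used_mib + size_mib > self.limit_mib:
            raise DeviceOutOfMemory(
                f"request of {size_mib} MiB exceeds cap of {self.limit_mib} MiB"
            )
        self.used_mib += size_mib

    def free(self, size_mib: int) -> None:
        self.used_mib = max(0, self.used_mib - size_mib)


# Two containers sharing one physical card, each with its own 4 GiB cap.
tenant_a = CappedAllocator(limit_mib=4096)
tenant_b = CappedAllocator(limit_mib=4096)

tenant_b.alloc(3000)
tenant_a.alloc(4000)
try:
    tenant_a.alloc(200)  # would bring tenant A to 4200 MiB, above its cap
    a_rejected = False
except DeviceOutOfMemory:
    a_rejected = True

print(a_rejected, tenant_a.used_mib, tenant_b.used_mib)
```

The failed request changes nothing for `tenant_b`, which is the property that distinguishes this contract from unprotected sharing.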
One caveat: the limit is enforced only on allocation calls that libvgpu.so actually intercepts. A workload that reaches the GPU through a path the preloaded library does not hook may not have the limit enforced. In practice, mainstream frameworks such as PyTorch, TensorFlow, and JAX allocate through standard CUDA allocation calls of the kind named above, so this edge case rarely surfaces.
Compute cap: soft limit
The compute limit — expressed via nvidia.com/gpucores in the pod spec and enforced inside the container as CUDA_DEVICE_SM_LIMIT — is not a hardware partition. Community documentation of HAMi-core’s SM-partitioning mechanism describes it as feedback-based time-slicing: HAMi monitors per-container GPU utilisation and injects cudaDeviceSynchronize() plus sleep cycles to throttle kernel submissions when a container exceeds its core budget. The documented deviation is ±5–10%. [4]
The practical implication: an adversarial or simply compute-hungry workload can exceed its nominal compute cap temporarily, because the throttle operates by rate-limiting kernel submissions rather than by a hardware boundary. This makes HAMi unsuitable for co-locating latency-sensitive inference with sustained batch training on the same physical GPU — training will saturate the streaming multiprocessors faster than the soft throttle can respond.
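The temporary overshoot can be illustrated with a generic feedback loop. This is a minimal Python sketch, assuming a simple proportional correction applied once per measurement window; it is not HAMi-core’s algorithm, and `simulate_soft_cap`, `gain`, and the window model are hypothetical choices made for illustration.

```python
from typing import List


def simulate_soft_cap(target: float, windows: int, gain: float = 0.5) -> List[float]:
    """Return observed utilisation per window for a workload that always wants 100%.

    target: nominal compute cap as a fraction (0.25 means 25% of SM time).
    gain:   how strongly each window's error adjusts the injected sleep.
    """
    sleep_fraction = 0.0  # no throttling has been applied before the first window
    observed = []
    for _ in range(windows):
        # The workload fills every moment it is not forced to sleep.
        utilisation = 1.0 - sleep_fraction
        observed.append(utilisation)
        # Feedback step: sleep more when above the target, less when below.
        sleep_fraction += gain * (utilisation - target)
        sleep_fraction = min(max(sleep_fraction, 0.0), 1.0)
    return observed


trace = simulate_soft_cap(target=0.25, windows=12)
print([round(u, 3) for u in trace])
```

In this model the first window runs completely unthrottled, and utilisation only approaches the 25% target over later windows. A throttle of this shape reacts after the fact, which is why a sustained batch job can crowd out a latency-sensitive neighbour before the cap takes hold.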
Fault isolation: none
A CUDA fault — illegal memory access, driver-level error — in one container can reset the GPU context for all containers sharing that physical device. This is the same behaviour as MPS and time-slicing. Only hardware MIG partitions provide true fault isolation. On shared multi-tenant inference nodes this is an accepted risk; on nodes where a single misbehaving tenant must not affect others, MIG remains the appropriate mechanism.
The table below shows the isolation profile of each mechanism side by side. HAMi fills the gap between unprotected time-slicing and hardware-level MIG, specifically for GPU generations that do not support MIG.
| Mechanism | Memory | Compute | Fault | Hardware |
|---|---|---|---|---|
| Time-slicing | None | None | None | Any NVIDIA |
| MPS | None | None | None | Any NVIDIA |
| HAMi | Hard (CUDA API) | Soft | None | Any NVIDIA |
| MIG | Hardware | Hardware | Hardware | A100 / H100 only |
HAMi suits multi-tenant inference on non-MIG GPU generations (T4, L4, A10G, V100), where it provides memory isolation without MIG-capable silicon. Its compute cap is soft (±5–10% deviation), and fault isolation is absent: a CUDA fault still resets the whole physical device.
Source: HAMi architecture docs and community documentation on SM-partitioning. See article references [1]–[4].
Resource names and a worked pod spec
HAMi’s device plugin registers four resource names per GPU node. The full set is documented in the HAMi configuration reference: [5]
- nvidia.com/gpu: integer virtual GPU slot count (how many fractional slices the pod needs)
- nvidia.com/gpumem: device-memory hard cap per slot, in MiB
- nvidia.com/gpucores: compute soft cap per slot, in percent (0–100)
- nvidia.com/gpumem-percentage: alternative to gpumem, expressed as a percentage of total card memory
A workload running a 7B-parameter int4 LLM that needs approximately 4 GiB on a 16 GiB card:
resources:
  limits:
    nvidia.com/gpu: "1"          # one virtual GPU slot
    nvidia.com/gpumem: "4096"    # 4 GiB hard memory cap
    nvidia.com/gpucores: "25"    # 25 % SM compute soft-limit

With those settings, four such pods can share a single 16 GiB card: each gets 4 GiB with a hard ceiling, and each is nominally throttled to 25% of the streaming multiprocessors. A fifth pod requesting 4096 MiB will not be scheduled onto that card: the scheduler extender’s Filter phase removes the card from consideration because its remaining unallocated memory falls below the request.
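The placement arithmetic for this example can be checked with a few lines of Python. This is a sketch of the memory comparison only, under the assumption that the filter reduces to "free memory must be at least the request"; the function `fits` and the constant `CARD_MIB` are hypothetical names, not part of HAMi.

```python
from typing import List

CARD_MIB = 16384  # one 16 GiB card, expressed in MiB


def fits(allocated_mib: List[int], request_mib: int, card_mib: int = CARD_MIB) -> bool:
    """Return True when the card still has enough unallocated memory for the request."""
    free_mib = card_mib - sum(allocated_mib)
    return free_mib >= request_mib


placed: List[int] = []   # gpumem of each pod already placed on the card
decisions: List[bool] = []
for _ in range(5):       # five identical pods, each requesting 4096 MiB
    ok = fits(placed, 4096)
    decisions.append(ok)
    if ok:
        placed.append(4096)

print(decisions)    # [True, True, True, True, False]
print(sum(placed))  # 16384
```

The fifth request fails the comparison because the four earlier pods have already reserved the full 16384 MiB.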
Deploying HAMi via Helm
HAMi ships a Helm chart. The steps below apply to any CNCF-conformant Kubernetes distribution; the only distribution-specific element is how the scheduler extender configuration reaches the kube-scheduler binary. The official deployment guide covers additional options.
Step 1 — label GPU nodes
Nodes without the gpu=on label are excluded from HAMi’s scheduler extender filter. Label every node that carries physical GPUs:
kubectl label nodes <gpu-node-1> gpu=on
kubectl label nodes <gpu-node-2> gpu=on

Step 2 — register the scheduler extender
The scheduler extender is wired via a KubeSchedulerConfiguration file passed to the kube-scheduler binary at startup. On distributions that manage the scheduler through static pod manifests (kubeadm-based), add an --config flag to the scheduler’s command args and mount the config file. On distributions that manage the scheduler as part of a cluster-level config (such as distributions that use /etc/kubernetes/scheduler-config.yaml conventions), edit that file directly. Consult your distribution’s scheduler configuration documentation for the precise path.
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
extenders:
  - urlPrefix: "https://127.0.0.1:443"
    filterVerb: "filter"
    bindVerb: "bind"
    weight: 1
    nodeCacheCapable: true
    managedResources:
      - name: "nvidia.com/gpu"
        ignoredByScheduler: true

ignoredByScheduler: true on nvidia.com/gpu is required. Without it the default scheduler attempts to match the resource against node allocatable, and the HAMi extender’s Bind step will fail because the default scheduler has already consumed the resource slot.

Step 3 — install via Helm
helm repo add hami-charts https://project-hami.github.io/HAMi/
helm repo update
helm upgrade --install hami hami-charts/hami \
  --namespace kube-system \
  --set scheduler.extender.urlPrefix="https://127.0.0.1:443" \
  --set scheduler.nodepolicies="binpack" \
  --set scheduler.gpupolicies="binpack" \
  --version "2.7.0"

The binpack policy packs pods onto the fewest GPU cards before spilling to additional nodes or cards. It is the right default when requests are small relative to card capacity and the goal is to keep as many cards fully idle as possible. Switch to spread when you need latency headroom between tenants on separate physical cards.
Stacking quota-aware queuing on top of HAMi
HAMi handles within-node fractional allocation. It does not, by itself, prevent a single team from submitting enough pods to exhaust the GPU pool for the rest of the cluster. Kueue’s official guide for using HAMi with Kueue describes the integration: HAMi’s resource names are first-class Kueue resources, and a ResourceTransformation object instructs Kueue to compute aggregate quota consumption by multiplying per-slot values by the vGPU count. [6]
For example: a pod requesting nvidia.com/gpu: 2 and nvidia.com/gpumem: 1024 contributes 2048 MiB to the team’s total-gpumem quota counter. This prevents a team from issuing many small-memory pods that individually fit per-pod quota limits but collectively exhaust the physical pool.
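The multiplication is simple enough to check by hand, but a short Python sketch makes the cumulative effect visible. It assumes only the rule stated above (per-slot value times vGPU count); `quota_contribution`, `admitted_count`, and the 32768 MiB quota figure are illustrative names and values, not Kueue APIs.

```python
from typing import Dict, List


def quota_contribution(vgpu_count: int, per_slot_value: int) -> int:
    """Aggregate consumption charged to the team: per-slot value times slot count."""
    return vgpu_count * per_slot_value


def admitted_count(pods: List[Dict[str, int]], gpumem_quota_mib: int) -> int:
    """Count pods admitted in order until the next one would exceed the quota."""
    used = 0
    admitted = 0
    for pod in pods:
        cost = quota_contribution(pod["gpu"], pod["gpumem"])
        if used + cost > gpumem_quota_mib:
            break
        used += cost
        admitted += 1
    return admitted


# The example from the text: 2 slots at 1024 MiB each.
example_cost = quota_contribution(2, 1024)

# Twenty small pods (1 slot, 2048 MiB each) against a 32768 MiB team quota.
small_pods = [{"gpu": 1, "gpumem": 2048} for _ in range(20)]
admitted = admitted_count(small_pods, gpumem_quota_mib=32768)

print(example_cost)  # 2048
print(admitted)      # 16
```

Each small pod looks harmless on its own, yet the running total stops admission at the sixteenth, which is the behaviour the aggregate counter is there to provide.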
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: team-inference
spec:
  namespaceSelector: {}
  resourceGroups:
    - coveredResources:
        - nvidia.com/gpu
        - nvidia.com/gpumem
        - nvidia.com/gpucores
      flavors:
        - name: hami-gpu
          resources:
            - name: nvidia.com/gpu
              nominalQuota: "8"       # 8 virtual GPU slots
            - name: nvidia.com/gpumem
              nominalQuota: "32768"   # 32 GiB total across the team
            - name: nvidia.com/gpucores
              nominalQuota: "200"     # 200 % SM (2 full cards nominal)

The admission order matters: the quota-aware admission webhook runs before the HAMi mutating webhook in the Kubernetes admission chain. The queue evaluates whether the team has quota; if admitted, HAMi’s webhook then rewrites the scheduler name and injects the libvgpu.so volume mount. If the queue suspends the workload (quota exhausted), the pod never reaches HAMi. Align the ClusterQueue’s nominalQuota for gpumem with the actual sum of physical card memory on your GPU nodes; setting it higher creates a queue that admits workloads that the extender will then fail to place.
Observability
HAMi’s Helm chart does not bundle a GPU hardware exporter. Wire in the NVIDIA DCGM exporter separately: it reads hardware performance counters from the GPU’s PMU and is unaware of HAMi’s virtual allocation boundaries, so it surfaces physical ground truth regardless of whether HAMi is managing the allocation. HAMi-specific per-allocation telemetry is exposed from the hami-scheduler pod’s /metrics endpoint on port 8080. HAMi’s dashboard documentation lists Grafana dashboard IDs 22043 and 21833 for visualising this data. [7]
The two-layer scrape strategy gives the full picture:
- DCGM exporter: DCGM_FI_DEV_GPU_UTIL (physical compute utilisation), DCGM_FI_DEV_FB_USED / FB_FREE (physical frame-buffer consumption), DCGM_FI_PROF_PIPE_TENSOR_ACTIVE (tensor core activity, the primary training efficiency signal).
- HAMi scheduler metrics endpoint: per-container allocated memory and core-utilisation accounting. Comparing the HAMi accounting view against the DCGM physical view surfaces cases where a container's actual consumption diverges from its declared request.
Alert on DCGM_FI_DEV_FB_USED approaching the sum of all active containers’ gpumem limits. This signals that the HAMi accounting and the hardware reality are converging and a new pod may fail placement even if the extender believes capacity is available.
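The alert condition reduces to a ratio check. The Python sketch below shows that comparison in isolation; it assumes both values are available in MiB, and the function name `fb_near_limits` and the 0.9 threshold are hypothetical choices rather than anything HAMi or DCGM prescribes.

```python
from typing import List


def fb_near_limits(fb_used_mib: float, gpumem_limits_mib: List[int],
                   threshold: float = 0.9) -> bool:
    """Return True when physical frame-buffer use reaches `threshold` of the summed caps.

    fb_used_mib:        physical usage, e.g. the DCGM_FI_DEV_FB_USED reading for one card.
    gpumem_limits_mib:  gpumem limits of the containers currently on that card.
    """
    total_limit = sum(gpumem_limits_mib)
    if total_limit == 0:
        return False  # nothing is allocated on the card, so there is nothing to compare
    return fb_used_mib / total_limit >= threshold


limits = [4096, 4096, 4096, 4096]          # four pods capped at 4 GiB each
quiet = fb_near_limits(9000, limits)       # about 55% of the summed caps
busy = fb_near_limits(15000, limits)       # about 92% of the summed caps

print(quiet, busy)  # False True
```

In a real deployment the same comparison would normally live in an alerting rule over the scraped metrics; the sketch only fixes the arithmetic.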
Operational diagnostics
Pod stuck in Pending
Common causes: GPU node not labelled gpu=on; scheduler configuration file not reloaded after a scheduler restart; nvidia.com/gpu not listed in managedResources with ignoredByScheduler: true (the default scheduler intercepts the Bind).
# Inspect extender filter decisions
kubectl logs -n kube-system -l app=hami-scheduler --tail=100
# Check scheduler events on the stuck pod
kubectl describe pod <pending-pod> | grep -A 10 "Events:"

Memory limit not being enforced
If a container exceeds its declared gpumem without returning an OOM error, libvgpu.so was not injected. Verify:
# CUDA_DEVICE_MEMORY_LIMIT must be present in the container environment
kubectl exec -it <pod> -- env | grep CUDA_DEVICE_MEMORY_LIMIT
# If absent: check that the hami-webhook pod is healthy
kubectl get pods -n kube-system -l app=hami-webhook
# Verify the webhook's namespaceSelector includes the pod's namespace
kubectl get mutatingwebhookconfigurations hami-webhook -o yaml | grep -A5 namespaceSelector

Comparison with time-slicing, MPS, and MIG
HAMi occupies a specific position in the GPU-sharing mechanism space. The table below summarises the four mechanisms available on NVIDIA hardware:
| Mechanism | Memory isolation | Compute isolation | Fault isolation | Hardware req |
|---|---|---|---|---|
| Time-slicing | None | None | None | Any NVIDIA |
| MPS | None | None (shared ctx) | None | Any NVIDIA |
| HAMi | Hard (CUDA API) | Soft (rate-limit) | None | Any NVIDIA |
| MIG | Hardware | Hardware | Hardware | A100/H100 only |
HAMi is the appropriate mechanism when: the node carries a GPU generation that does not support MIG (T4, L4, A10G, V100); the workload mix is inference-dominated with known, stable memory footprints; and memory isolation between tenants is required but hardware-level fault isolation is acceptable to trade away.
When MIG is available, prefer MIG
On A100 and H100 nodes, hardware MIG partitioning provides memory isolation, compute isolation, and fault isolation that are each enforced at the silicon level, none of which can be achieved through userspace library interception. If your cluster carries MIG-capable GPUs and the administrative cost of managing MIG profiles is acceptable, MIG is the stronger mechanism for multi-tenant inference.
HAMi’s value proposition is specifically on the non-MIG GPU generations that constitute the majority of most existing clusters. It bridges the isolation gap between the unprotected sharing of time-slicing and the hardware partitioning of MIG, for the GPU inventory that cannot use MIG at all.
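The guidance in this section can be condensed into a small selection function. This Python sketch is one reading of the comparison table, not an official decision procedure; `pick_mechanism` and its parameters are hypothetical, and the "dedicated GPU" branch is an inference from the table (no sharing mechanism on non-MIG hardware offers fault isolation), not a recommendation stated in the HAMi documentation.

```python
def pick_mechanism(mig_capable: bool,
                   needs_memory_isolation: bool,
                   needs_fault_isolation: bool) -> str:
    """Map hardware capability and isolation requirements to a sharing mechanism."""
    if mig_capable:
        # When MIG is available, the article's advice is to prefer it.
        return "MIG"
    if needs_fault_isolation:
        # Assumption: without MIG, only an unshared card isolates faults.
        return "dedicated GPU"
    if needs_memory_isolation:
        return "HAMi"
    return "time-slicing"


t4_inference = pick_mechanism(mig_capable=False,
                              needs_memory_isolation=True,
                              needs_fault_isolation=False)
a100_tenants = pick_mechanism(mig_capable=True,
                              needs_memory_isolation=True,
                              needs_fault_isolation=True)

print(t4_inference, a100_tenants)  # HAMi MIG
```

The function ignores the administrative cost of MIG profiles, which remains a judgment call for each cluster.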
HAMi’s CNCF Sandbox status (accepted August 2024, with an Incubation application filed July 2025) reflects an active but maturing community. Evaluate its graduation trajectory against your own production standards before committing a critical-path inference fleet to it.
References
- [1] CNCF Sandbox — HAMi proposal, Issue #97. github.com/cncf/sandbox/issues/97. Status: Done (accepted 2024-08-21). CNCF, 2024.
- [2] Project HAMi — Architecture. project-hami.io/docs/core-concepts/architecture/. Project HAMi, 2024.
- [3] Project-HAMi/HAMi-core. github.com/Project-HAMi/HAMi-core. README: “Hijacking the API-call between CUDA-Runtime(libcudart.so) and CUDA-Driver(libcuda.so).” Project HAMi, 2024.
- [4] theriseunion.com — HAMi Compute Partitioning Mechanism: Feedback-Based Time-Slicing. theriseunion.com/en/blog/HAMi-QA-SM-Partitioning.html. Community documentation, 2024.
- [5] Project HAMi — Configuration reference. github.com/Project-HAMi/HAMi/blob/master/docs/config.md. Project HAMi, 2024.
- [6] Kueue — Using HAMi. kueue.sigs.k8s.io/docs/tasks/run/using_hami/. Kubernetes SIGs, 2024.
- [7] Project HAMi — dashboard.md. github.com/Project-HAMi/HAMi/blob/master/docs/dashboard.md. Grafana dashboard 21833: grafana.com/grafana/dashboards/21833. Project HAMi, 2024.
- [8] CNCF Blog — Exploring cloud native projects in sandbox: 13 arrivals from 2024 H2. cncf.io/blog/2025/08/11/.... Corroborates August 21, 2024 acceptance date. CNCF, 2025.