AI Platform Engineering & MLOps · Part XXVI of 34
Why share a GPU? The economics, the mechanics, the four mechanisms
GPU idle time is expensive. This article makes the economic case for sharing, then maps four mechanisms — time-slicing, MPS, MIG, and HAMi — against the isolation axes that determine which one fits which workload.
A GPU that runs a training job for eight hours and then sits idle for sixteen is not a training asset — it is an expensive space heater. Industry surveys put average GPU utilisation in enterprise environments at roughly 5 percent [1], and a 2025 large-scale HPC study found that 37 percent of jobs never exceeded 15 percent GPU memory utilisation across their entire run [2]. These are not outliers. They are the normal consequence of allocating whole GPUs to single workloads in an environment where workloads rarely fill the hardware they claim.
GPU sharing is the engineering response to that waste. This article makes the economic case for sharing, then introduces the four primary mechanisms — time-slicing, CUDA Multi-Process Service (MPS), Multi-Instance GPU (MIG), and HAMi fractional GPU — and maps each against the four operational axes that determine which mechanism fits which workload.
The utilisation problem and its cost
GPU compute is priced by the hour whether or not a workload is using the silicon. On-prem, the same logic applies in amortised form: every idle GPU-hour represents depreciation consumed without corresponding output. When teams over-provision — reserving a full GPU for a workload that occupies 10–20 percent of its compute and memory — the remainder of the card is unavailable to other work, even if it is sitting unused.
The waste compounds in serving workloads. A 2026 analysis of LLM serving traces from three production environments found that execution-idle intervals — periods when a GPU is allocated to a serving replica but processing no requests — accounted for 7 to 65 percent of total energy consumption, depending on load patterns [3]. That ceiling of 65 percent is not a pathological edge case; it reflects a lightly-loaded serving deployment where the model is resident in VRAM but the request queue is near-empty.
NVIDIA’s own operations team quantified a version of this problem internally: before deploying continuous idle-job monitoring, approximately 5.5 percent of allocated GPU-hours were being consumed by jobs that had stalled or completed but not released their allocation. After deploying automated reaping, that figure dropped to below 1 percent [4]. The point is not the specific percentage; it is that idle detection alone recovered meaningful capacity without procuring additional hardware.
Sharing is not the only lever — better autoscaling, scale-to-zero inference, and queue-aware scheduling each address a slice of the problem. But sharing addresses the structural mismatch between GPU granularity (the smallest allocatable unit is typically one whole GPU) and workload demand (which varies continuously and rarely fills a whole GPU). Sharing subdivides the allocation unit, letting two or more workloads claim the capacity that one workload leaves on the table.
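The arithmetic behind subdividing the allocation unit can be sketched in a few lines. The model below is a simplification under stated assumptions: the GPU's hourly price is split evenly across tenants, the `overhead` parameter is a hypothetical placeholder for whatever cost a sharing mechanism adds, and the names `SharingScenario`, `cost_per_workload_hour`, and `gpu_hours_recovered_per_day` are illustrative rather than taken from any tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SharingScenario:
    gpus: int                  # whole GPUs currently allocated, one per workload
    price_per_gpu_hour: float  # hourly price of one whole GPU
    tenants_per_gpu: int       # workloads packed onto each GPU after sharing
    overhead: float = 0.0      # assumed fractional cost of sharing (0.05 = 5%)


def cost_per_workload_hour(s: SharingScenario) -> float:
    """Hourly cost attributed to one workload when a GPU is split evenly."""
    return s.price_per_gpu_hour / s.tenants_per_gpu * (1.0 + s.overhead)


def gpu_hours_recovered_per_day(s: SharingScenario) -> float:
    """GPU-hours freed per day when the same workloads are packed onto fewer GPUs."""
    gpus_still_needed = s.gpus / s.tenants_per_gpu
    return (s.gpus - gpus_still_needed) * 24.0


if __name__ == "__main__":
    scenario = SharingScenario(gpus=10, price_per_gpu_hour=3.50, tenants_per_gpu=4)
    print(f"cost per workload-hour: ${cost_per_workload_hour(scenario):.3f}")
    print(f"GPU-hours recovered per day: {gpu_hours_recovered_per_day(scenario):.0f}")
```

For example, with ten GPUs at $3.50 per hour and four tenants per GPU, this model gives $0.875 per workload-hour and 180 GPU-hours recovered per day before any sharing overhead is applied.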
[Interactive element: GPU Economics Calculator. It shows cost per workload and recovered idle GPU-hours as tenants per GPU, per-tenant utilisation, and sharing mechanism are adjusted, for a 10-GPU cluster of A100s at $3.50/hr.]
Four mechanisms at a glance
The four mechanisms differ in where they divide the GPU — in time, in software, in hardware — and those differences determine what isolation guarantees each can offer.
- Time-slicing — The simplest form of sharing: the GPU driver serialises access across multiple CUDA contexts. No hardware changes required. High latency jitter. Suitable for batch and notebook workloads.
- MPS — Routes multiple CUDA client processes through a single server, enabling genuine concurrent kernel execution. Low-to-medium jitter. Suitable for trusted co-tenancy within a team. No fault isolation.
- MIG — Hardware partitioning on A100/H100/Blackwell. Up to seven fully isolated instances per GPU. The only mechanism with hardware-enforced memory, compute, and fault isolation.
- HAMi — Software fractional GPU via LD_PRELOAD API interception. Works on any CUDA-capable GPU. CNCF Sandbox project. Software isolation only — not equivalent to MIG.
Time-slicing
Time-slicing is the simplest form of sharing: the GPU driver serialises access across multiple CUDA contexts, switching between them on a configurable time quantum. Each context sees the GPU as if it owns it exclusively, but only for its time slice. No hardware changes are required; the mechanism is implemented in the NVIDIA kernel driver and exposed through the device plugin configuration in a Kubernetes environment.
The misleading marketing claim
Time-slicing is often described as enabling true concurrent GPU use. It does not. Context switching introduces measurable jitter. Red Hat’s published analysis notes that time-slicing serialises access and is unsuitable for latency-sensitive serving workloads [5]. The NVIDIA developer forums document additional context-switch overhead at the CUDA level [6]. For batch training or offline inference with relaxed latency budgets, time-slicing is adequate. For an online inference endpoint with a P99 SLA in the tens of milliseconds, it is not.
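A toy model makes the jitter argument concrete. It assumes a strict round-robin schedule in which every co-tenant uses its full quantum; the real driver scheduler is more involved, and the 2 ms quantum and the function names are illustrative assumptions, not driver defaults.

```python
from typing import Dict


def wait_for_slice(arrival_ms: float, tenants: int, quantum_ms: float,
                   tenant: int = 0) -> float:
    """Time a request waits before its own context next holds the GPU.

    Assumes tenant k owns [k*q, (k+1)*q) in every cycle of length tenants*q.
    """
    cycle = tenants * quantum_ms
    phase = (arrival_ms - tenant * quantum_ms) % cycle
    # Inside our own slice: run immediately. Otherwise wait for the next cycle.
    return 0.0 if phase < quantum_ms else cycle - phase


def wait_profile(tenants: int, quantum_ms: float, samples: int = 1000) -> Dict[str, float]:
    """Best and worst wait over arrivals spread evenly across one cycle."""
    cycle = tenants * quantum_ms
    waits = [wait_for_slice(i * cycle / samples, tenants, quantum_ms)
             for i in range(samples)]
    return {"min_ms": min(waits), "max_ms": max(waits)}


if __name__ == "__main__":
    print(wait_profile(tenants=4, quantum_ms=2.0))
```

Under these assumptions, with four tenants and a 2 ms quantum a request can wait up to 6 ms (three foreign quanta) before its context runs at all, while a request that arrives during its own slice does not wait. That spread, rather than the average, is what damages a P99 target.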
Isolation properties
Memory isolation: none. Contexts share physical memory, differentiated only by virtual address spaces. Compute isolation: none. A context that saturates the GPU takes its full time quantum regardless of whether co-tenants need compute urgently. Fault isolation: partial. A GPU-reset event triggered by one context typically clears all contexts on that GPU. Workloads in other time-slices are exposed to each other’s crash behaviour, but not to each other’s in-flight data. Latency jitter: high.
When to use it
Low-utilisation batch workloads, notebook and experimentation environments, and offline inference with relaxed latency budgets. Time-slicing is the appropriate choice when the cluster includes older GPU hardware (T4, V100) that cannot use MIG, and when workload latency requirements are not strict.
CUDA Multi-Process Service (MPS)
MPS routes multiple CUDA client processes through a single server process. The MPS server submits work from all clients to the GPU concurrently, so kernel execution and memory-copy operations from different clients can genuinely overlap. On Volta-class hardware and later, each client gets a separate GPU address space, eliminating the address-space sharing risk of earlier MPS implementations. Clients share SM scheduling resources, meaning one misbehaving client can affect the throughput of others — there is no hard compute quota [7].
Throughput gains
The throughput gains from eliminating serialisation are real. Databricks reported meaningful throughput improvements when deploying MPS for small-LLM inference serving [8]. A University of North Texas study measured 0–147 percent throughput improvement depending on workload mix and concurrency level [9]. The range is wide because the gain is proportional to the idle SM cycles that concurrent kernels can backfill — a workload that already saturates the GPU sees little benefit.
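The link between gain and idle SM cycles can be illustrated with an idealised upper bound. The sketch assumes each client alone keeps the SMs busy for a fraction `solo_utilisation` of the time, that serialised sharing leaves aggregate utilisation at that same fraction, and that concurrent execution backfills idle cycles perfectly up to saturation. Real workloads interfere with each other, so measured gains will be lower; the function name is hypothetical.

```python
def ideal_backfill_gain(solo_utilisation: float, clients: int) -> float:
    """Upper bound on fractional throughput gain from concurrent execution.

    Returns 1.0 for a 100 percent gain and 0.0 when there is nothing to backfill.
    """
    if not 0.0 < solo_utilisation <= 1.0:
        raise ValueError("solo_utilisation must be in (0, 1]")
    if clients < 1:
        raise ValueError("clients must be at least 1")
    serialised = solo_utilisation                      # one client active at a time
    concurrent = min(1.0, solo_utilisation * clients)  # perfect backfill, capped at 100%
    return concurrent / serialised - 1.0


if __name__ == "__main__":
    for util in (0.25, 0.5, 1.0):
        print(util, ideal_backfill_gain(util, clients=2))
```

Two clients at 25 percent solo utilisation give a bound of 1.0 (a 100 percent gain); a client that already saturates the GPU gives 0.0 however many peers are added.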
The misleading claim
MPS is sometimes described as providing isolation equivalent to separate GPU instances. It does not. A fault in one MPS client process can corrupt shared state and terminate all clients sharing that MPS server, so MPS provides no fault isolation. It is appropriate for trusted, co-owned workloads (multiple replicas of the same inference service, or co-scheduled jobs from a single team) and not for multi-tenant environments where workloads must be fault-isolated from each other [7].
Isolation properties
Memory isolation: partial. On Volta+, separate address spaces per client — but all clients share one MPS server process, and a server crash terminates all of them. Compute isolation: none. Fault isolation: none. Latency jitter: low-to-medium, from shared SM scheduling.
Multi-Instance GPU (MIG)
MIG is a hardware partitioning capability available on NVIDIA A100, H100, and Blackwell-class GPUs. It partitions a single physical GPU into up to seven fully isolated instances. Isolation is spatial and enforced in silicon: each instance’s streaming multiprocessors have separate and exclusive paths through the on-chip crossbar ports, L2 cache banks, memory controllers, and DRAM address buses [10][11]. A workload running in one MIG instance cannot observe the memory or execution of a workload in another, even if they share physical hardware.
Hardware-enforced multi-tenant isolation
This makes MIG the only mechanism in this set that provides hardware-enforced multi-tenant isolation. The others rely on software or driver-level boundaries. The consequence is that MIG is appropriate for regulated environments where workloads from different security domains must run on shared hardware without cross-contamination risk. Under MIG, KV cache data held in VRAM for one tenant’s inference requests cannot be read by another tenant’s workload. This guarantee is meaningful in healthcare, financial, and government deployments.
The misleading claim
MIG is sometimes presented as flexible, dynamic partitioning that can be resized on demand without disruption. In practice, changing MIG geometry requires destroying existing instances (which terminates any workloads running in them) and recreating the partition layout. This is an operational event, not a live resize. Teams that need to change their MIG configuration between training and serving phases need a planned maintenance window or a node-pool design that keeps the partition geometry fixed per pool.
Hardware gate
MIG is also hardware-gated. It is not available on T4, L4, or V100 GPUs. Clusters with mixed GPU generations require a mechanism-per-pool strategy: MIG on Ampere/Hopper nodes, time-slicing or HAMi on older hardware.
Isolation properties
Memory isolation: hardware. Compute isolation: hardware. Fault isolation: hardware. Latency jitter: negligible — due to hardware-isolated L2 cache and memory controller paths, workloads in different MIG instances do not contend for memory bandwidth, which is the dominant source of jitter under concurrent inference load.
HAMi fractional GPU
HAMi (Heterogeneous AI Computing Virtualization Middleware) is a CNCF Sandbox project, accepted in August 2024 [12]. It implements fractional GPU allocation through a software virtualisation layer: a shared library (libvgpu.so) is injected into container processes via LD_PRELOAD and intercepts CUDA driver and NVML API calls before they reach the hardware. The interception layer enforces per-container GPU memory and compute limits set by the scheduler, without requiring any hardware-level partitioning support.
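The interception pattern can be shown with a conceptual analogue. The sketch below is not HAMi's code (libvgpu.so is a native library and its internals are not described here); it only illustrates the idea of a shim that sits in front of the real allocator and refuses requests beyond a limit. `MemoryQuotaShim`, `QuotaExceeded`, and the allocator callables are hypothetical names.

```python
from typing import Callable, Dict


class QuotaExceeded(MemoryError):
    """Raised when an allocation would push a container past its limit."""


class MemoryQuotaShim:
    """Wraps an allocator and rejects requests beyond a per-container limit."""

    def __init__(self, real_alloc: Callable[[int], int],
                 real_free: Callable[[int], None], limit_bytes: int) -> None:
        self._real_alloc = real_alloc
        self._real_free = real_free
        self._limit = limit_bytes
        self._sizes: Dict[int, int] = {}  # handle -> bytes currently held

    @property
    def used_bytes(self) -> int:
        return sum(self._sizes.values())

    def alloc(self, nbytes: int) -> int:
        # The check happens before the real allocator is ever called.
        if self.used_bytes + nbytes > self._limit:
            raise QuotaExceeded(f"{nbytes} bytes requested, "
                                f"{self._limit - self.used_bytes} available")
        handle = self._real_alloc(nbytes)
        self._sizes[handle] = nbytes
        return handle

    def free(self, handle: int) -> None:
        self._real_free(handle)
        self._sizes.pop(handle, None)
```

The sketch also shows why this style of enforcement is cooperative: a caller holding a direct reference to `real_alloc` bypasses the limit entirely.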
Broadest hardware compatibility
This makes HAMi the most broadly compatible mechanism in this set. Because it operates at the user-space API layer rather than in hardware, it works on any CUDA-capable GPU — including those that do not support MIG. An L4 or T4 node pool that cannot be partitioned in hardware can still host fractional GPU workloads under HAMi.
The misleading claim
Because HAMi intercepts at the API level, it is sometimes described as providing isolation equivalent to MIG. It does not. The isolation is enforced in software, not hardware. A workload that bypasses or subverts the LD_PRELOAD injection — for example, through a statically linked CUDA binary or a privileged container that replaces the library — can escape the limits. HAMi is appropriate for trusted, cooperative workloads where strict hardware-enforced isolation between untrusted tenants is not required.
Isolation properties
Memory isolation: software. Compute isolation: software. Fault isolation: partial; if the intercept library itself faults, all workloads sharing that library instance may be affected. Latency jitter: low-to-medium. The overhead of HAMi’s LD_PRELOAD interception layer means every CUDA memory-allocation call passes through libvgpu.so before reaching the driver, adding a small but non-zero per-call cost. Unlike MIG, there is no hardware context switch; unlike time-slicing, execution is not serialised. The jitter is therefore bounded by API-call overhead rather than context-switch latency, which keeps it lower than time-slicing but measurably above MIG’s hardware-isolated baseline.
The isolation matrix: four mechanisms × four axes
The four axes below are the operational properties that determine which mechanism fits which workload context. The table is a decision aid, not a ranking — there is no universally superior mechanism.
Memory isolation — whether one workload's VRAM is inaccessible to another at the hardware or driver level.
Compute isolation — whether one workload's SM utilisation is bounded and cannot starve another's execution.
Fault isolation — whether a crash or OOM in one workload terminates other workloads sharing the same GPU.
Latency jitter — whether co-located workloads introduce unpredictable latency variability into each other's inference path.
| Mechanism | Memory iso. | Compute iso. | Fault iso. | Latency jitter |
|---|---|---|---|---|
| Time-slicing | None | None | Partial | High |
| MPS | Partial* | None | None | Low–Medium |
| MIG | Hardware | Hardware | Hardware | Negligible |
| HAMi | Software | Software | Partial | Low–Medium |
* MPS on Volta+ provides separate address spaces per client, but all clients share one MPS server process — a server crash terminates all clients.
- Time-slicing fault isolation is marked Partial because a GPU-reset event triggered by one context will typically clear all contexts on that GPU. Workloads in other time-slices are therefore exposed to each other's crash behaviour, but not to each other's in-flight data (contexts do not share address space).
- HAMi fault isolation is Partial for the same reason plus an additional caveat: if the intercept library itself faults, all workloads sharing that library instance may be affected.
- MIG's Negligible latency-jitter rating reflects the hardware-isolated L2 cache and memory controller paths. In practice, workloads in different MIG instances do not contend for memory bandwidth, which is the dominant source of jitter under concurrent inference load.
- HAMi's Low–Medium latency-jitter rating reflects the overhead of its LD_PRELOAD interception layer: every CUDA memory-allocation call passes through libvgpu.so before reaching the driver, adding a small but non-zero per-call cost. Unlike MIG, there is no hardware context switch; unlike time-slicing, execution is not serialised.
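The matrix can also be used as a filter. The sketch below encodes the table as data and returns the mechanisms that meet a set of minimum requirements. The ordering of isolation levels (none, partial, software, hardware) is an assumption introduced here for comparison purposes, and `candidates` is a hypothetical helper, not part of any scheduler.

```python
from typing import Dict, List

# Assumed ordering, weakest to strongest; the article does not rank these itself.
ISOLATION_LEVELS = ["none", "partial", "software", "hardware"]
JITTER_LEVELS = ["negligible", "low-medium", "high"]  # best to worst

MATRIX: Dict[str, Dict[str, str]] = {
    "time-slicing": {"memory": "none", "compute": "none",
                     "fault": "partial", "jitter": "high"},
    "mps": {"memory": "partial", "compute": "none",
            "fault": "none", "jitter": "low-medium"},
    "mig": {"memory": "hardware", "compute": "hardware",
            "fault": "hardware", "jitter": "negligible"},
    "hami": {"memory": "software", "compute": "software",
             "fault": "partial", "jitter": "low-medium"},
}


def candidates(min_memory: str = "none", min_compute: str = "none",
               min_fault: str = "none", max_jitter: str = "high") -> List[str]:
    """Mechanisms whose matrix row meets every stated requirement."""
    def at_least(actual: str, required: str) -> bool:
        return ISOLATION_LEVELS.index(actual) >= ISOLATION_LEVELS.index(required)

    return [
        name for name, row in MATRIX.items()
        if at_least(row["memory"], min_memory)
        and at_least(row["compute"], min_compute)
        and at_least(row["fault"], min_fault)
        and JITTER_LEVELS.index(row["jitter"]) <= JITTER_LEVELS.index(max_jitter)
    ]


if __name__ == "__main__":
    print(candidates(min_memory="hardware"))
    print(candidates(min_fault="partial", max_jitter="low-medium"))
```

A filter like this narrows the options; it does not choose between them, which is consistent with treating the table as a decision aid rather than a ranking.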
Hardware availability gates the choice
Not every mechanism is available on every GPU. MIG requires Ampere (A100), Hopper (H100, H200), or Blackwell-class hardware [11]. Time-slicing and MPS are available on any CUDA-capable GPU. HAMi’s LD_PRELOAD architecture places no hardware requirement beyond CUDA support, though its effective memory enforcement depends on the GPU driver’s NVML implementation.
In a cluster with mixed GPU generations — A100 nodes alongside T4 or L4 nodes, for example — the practical approach is to fix the sharing mechanism per node pool rather than per workload. MIG on A100/H100 pools, HAMi or time-slicing on older pools. This avoids per-node mechanism negotiation and makes scheduling predicates simpler: a workload requiring hardware-isolated VRAM targets the MIG pool; a workload tolerating software isolation targets the HAMi pool.
| GPU | Time-slicing | MPS | MIG | HAMi |
|---|---|---|---|---|
| T4 | ✓ | ✓ | ✗ | ✓ |
| V100 | ✓ | ✓ (Volta) | ✗ | ✓ |
| A10 / L4 | ✓ | ✓ | ✗ | ✓ |
| A100 | ✓ | ✓ | ✓ | ✓ |
| H100 / H200 | ✓ | ✓ | ✓ | ✓ |
| Blackwell (B200) | ✓ | ✓ | ✓ | ✓ |
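The mechanism-per-pool approach can be sketched as a small routing function. This is one possible policy under stated assumptions: each pool gets exactly one mechanism, MIG where the table marks it as supported and HAMi elsewhere (time-slicing would be an equally valid choice for older pools). The pool names and the helpers `pool_mechanism` and `eligible_pools` are hypothetical.

```python
from typing import Dict, List

# GPU models marked as MIG-capable in the table above.
MIG_CAPABLE = {"A100", "H100", "H200", "B200"}


def pool_mechanism(gpu_model: str) -> str:
    """One fixed mechanism per pool: MIG where supported, HAMi elsewhere."""
    return "mig" if gpu_model in MIG_CAPABLE else "hami"


def eligible_pools(pools: Dict[str, str], needs_hardware_isolation: bool) -> List[str]:
    """Pools a workload may target, given a mapping of pool name to GPU model."""
    wanted = "mig" if needs_hardware_isolation else "hami"
    return [name for name, model in pools.items()
            if pool_mechanism(model) == wanted]


if __name__ == "__main__":
    pools = {"a100-pool": "A100", "t4-pool": "T4", "l4-pool": "L4"}
    print(eligible_pools(pools, needs_hardware_isolation=True))
    print(eligible_pools(pools, needs_hardware_isolation=False))
```

Keeping the decision at pool level means the scheduling predicate is a lookup on node-pool identity rather than a negotiation on each node.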
Matching mechanism to workload
Three workload archetypes dominate production AI platforms:
- Large distributed training runs — these typically benefit least from sharing. A well-utilised training job that fills a node’s GPUs is already efficient. Sharing here is counterproductive: it adds scheduling complexity without recovering significant idle capacity. Gang scheduling (ensuring all GPUs in a distributed job start together) matters more than partitioning. See the preceding article in this series on queue-aware scheduling for the relevant patterns.
- Online inference serving — typical VRAM utilisation for a serving replica depends heavily on model size and batch configuration. Many small-to-medium model serving deployments use 20–60 percent of a GPU’s VRAM, leaving substantial capacity unused. MPS (for trusted co-tenancy) or MIG (for isolated multi-tenancy) both apply here. Time-slicing is generally inappropriate because context-switch jitter degrades P99 latency.
- Notebook and experimentation workloads — these are the primary source of idle-GPU waste. A notebook that allocates a GPU on launch and then sits idle for hours while the user reads documentation is a textbook over-provisioning case. Time-slicing or HAMi fractional allocation constrains the impact of any one notebook on shared capacity. Fault isolation matters less here because notebook workloads are interactive and ephemeral.
What this series covers next
This article has introduced the mechanisms and framed the problem. The next article in this series (article 27) works through a structured decision tree that routes a specific workload and cluster context to the appropriate mechanism, including the trap branches: the configurations that look correct but fail under production conditions.
References
- [1] “5% GPU utilization: The $401 billion AI infrastructure problem enterprises can’t keep ignoring.” VentureBeat, April 2026. venturebeat.com
- [2] “Analyzing GPU Utilization in HPC Workloads: Insights from Large-Scale Systems.” ACM PEARC 2025. dl.acm.org
- [3] “The Energy Cost of Execution-Idle in GPU Clusters.” arXiv:2604.04745, April 2026. arxiv.org/abs/2604.04745
- [4] “Making GPU Clusters More Efficient with NVIDIA Data Center Monitoring Tools.” NVIDIA Technical Blog. developer.nvidia.com
- [5] “Sharing is caring: How to make the most of your GPUs (part 1 — time-slicing).” Red Hat Blog. redhat.com
- [6] NVIDIA Developer Forums: “CUDA context switching overhead of current GPU.” forums.developer.nvidia.com
- [7] NVIDIA CUDA Multi-Process Service (MPS) Documentation (official). docs.nvidia.com/deploy/mps
- [8] “Scaling Small LLMs with NVIDIA MPS.” Databricks Engineering Blog. databricks.com
- [9] “Granularity- and Interference-Aware GPU Sharing with MPS.” University of North Texas CSRL. engineering.unt.edu
- [10] “Getting the Most Out of the NVIDIA A100 GPU with Multi-Instance GPU.” NVIDIA Technical Blog. developer.nvidia.com
- [11] NVIDIA MIG User Guide (official). docs.nvidia.com/datacenter/tesla/mig-user-guide
- [12] CNCF HAMi Project Page (accepted August 2024). cncf.io/projects/hami