Jessie A Ellis
Mar 25, 2026 17:19
New NVIDIA benchmarks show Multi-Instance GPU partitioning achieving 1.00 req/s per GPU versus 0.76 for time-slicing in production AI workloads.
NVIDIA has released benchmark data showing its Multi-Instance GPU (MIG) technology delivers 33% higher throughput efficiency than software-based time-slicing for AI inference workloads, a finding that could reshape how enterprises allocate compute resources for production AI deployments.
The tests, conducted on NVIDIA A100 Tensor Core GPUs in a Kubernetes environment, showed MIG achieving roughly 1.00 requests per second per GPU compared with 0.76 req/s for time-slicing configurations. Both approaches maintained 100% success rates with no failures during testing.
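A quick back-of-envelope check of the headline figure, using the rounded per-GPU numbers quoted above (the reported 33% presumably reflects less-rounded raw measurements):

```python
# Rounded per-GPU throughput figures reported in the benchmark
mig_rps = 1.00          # requests/s per GPU with MIG partitioning
timeslice_rps = 0.76    # requests/s per GPU with time-slicing

gain_pct = (mig_rps / timeslice_rps - 1) * 100
print(f"MIG throughput advantage: {gain_pct:.1f}%")  # roughly a third more
```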
The GPU Fragmentation Problem
Most production AI pipelines suffer from a mismatch between model requirements and hardware allocation. Lightweight models for automatic speech recognition (ASR) or text-to-speech (TTS) may need only 10 GB of VRAM but occupy an entire GPU under standard Kubernetes scheduling. NVIDIA's data shows GPU compute utilization often hovers between 0% and 10% for these supporting models.
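The mismatch is visible in an ordinary pod spec: Kubernetes GPU requests are whole-device granularity, so a model that needs ~10 GB still claims a full card. A minimal sketch (pod name and image are hypothetical placeholders):

```yaml
# Hypothetical TTS inference pod: under default scheduling it claims
# an entire A100, even though the model only needs ~10 GB of VRAM.
apiVersion: v1
kind: Pod
metadata:
  name: tts-server
spec:
  containers:
  - name: tts
    image: example.com/tts-server:latest   # placeholder image
    resources:
      limits:
        nvidia.com/gpu: 1   # whole-GPU granularity: no way to request "10 GB"
```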
The company tested three configurations using a voice-to-voice AI pipeline: a baseline with dedicated GPUs for each model; time-slicing, where ASR and TTS share a GPU through software scheduling; and MIG, where hardware physically partitions the GPU into isolated instances with dedicated memory and streaming multiprocessors.
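For reference, carving an A100 into MIG instances is done with `nvidia-smi`; a sketch assuming GPU 0 and the small `1g.5gb` profile of an A100 40GB (available profile names vary by GPU model and memory size):

```shell
# Enable MIG mode on GPU 0 (takes effect after a GPU reset on most systems)
sudo nvidia-smi -i 0 -mig 1

# List the GPU instance profiles this card supports
nvidia-smi mig -lgip

# Create two 1g.5gb GPU instances and their compute instances (-C)
sudo nvidia-smi mig -i 0 -cgi 1g.5gb,1g.5gb -C

# Verify the resulting instances
nvidia-smi mig -lgi
```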
Hardware Isolation Wins on Throughput
Under heavy load with 50 concurrent users over 375 seconds of sustained interaction, MIG's hardware partitioning eliminated resource contention entirely. Time-slicing showed faster individual task completion for bursty workloads, with 144.7 ms mean TTS latency versus MIG's 168.2 ms, but that 23.5 ms difference becomes negligible when the LLM bottleneck accounts for roughly 9 seconds of total processing time.
The critical advantage: MIG's fault isolation prevents a memory overflow in one process from crashing others sharing the card. Time-slicing's shared execution context means a fatal error propagates across all processes, potentially triggering a GPU reset.
Production Implications
NVIDIA recommends MIG as the default for production environments prioritizing throughput and reliability, while time-slicing suits development, CI/CD pipelines, and proof-of-concept work where minimizing hardware footprint matters more than peak performance.
For organizations running mixed AI workloads, consolidating supporting models onto partitioned GPUs frees entire cards for LLM instances, the actual compute bottleneck in most generative AI applications. The company has published implementation guides and YAML manifests for Kubernetes deployments through its NIM Operator framework.
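Once the NVIDIA device plugin advertises MIG slices as extended resources, a pod can request a slice instead of a whole card. A sketch assuming an A100 80GB exposing `1g.10gb` slices under the plugin's mixed strategy (pod name and image are hypothetical):

```yaml
# Hypothetical ASR pod scheduled onto a MIG slice rather than a full GPU,
# leaving the remaining slices (and whole cards) free for LLM serving.
apiVersion: v1
kind: Pod
metadata:
  name: asr-server
spec:
  containers:
  - name: asr
    image: example.com/asr-server:latest   # placeholder image
    resources:
      limits:
        nvidia.com/mig-1g.10gb: 1   # one MIG slice instead of nvidia.com/gpu: 1
```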
Image source: Shutterstock

