Felix Pinkston
Apr 09, 2026 17:23
NVIDIA's Slinky project enables running Slurm clusters on Kubernetes, and is already deployed on 8,000+ GPU systems for large-scale AI training infrastructure.
NVIDIA has launched Slinky, an open-source project that bridges the gap between Slurm, the job scheduler running over 65% of TOP500 supercomputers, and Kubernetes, the dominant platform for managing GPU infrastructure at scale. The company already runs Slinky in production across clusters with more than 8,000 GPUs.
The technical problem here is real: organizations have years invested in Slurm job scripts, fair-share policies, and accounting workflows. But Kubernetes has become the standard for managing GPU infrastructure. Running two separate environments creates operational headaches that compound at scale.
How Slinky Actually Works
Slinky's slurm-operator represents each Slurm component (scheduling, accounting, compute workers, API access) as a Kubernetes Custom Resource Definition. You define a Slurm cluster using Custom Resources, and Slinky spins up containerized Slurm daemons in their own pods.
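To make the Custom Resource model concrete, here is a minimal sketch of what defining a cluster this way looks like. The API group, kind names, image path, and fields below are assumptions for illustration, not Slinky's exact published schema:

```yaml
# Illustrative sketch only: the API group, kinds, and fields are
# assumptions based on the description above, not Slinky's verbatim CRDs.
apiVersion: slinky.slurm.net/v1alpha1
kind: Controller              # the Slurm control plane (slurmctld) as a CR
metadata:
  name: slurm
  namespace: slurm
spec:
  clusterName: slurm
---
apiVersion: slinky.slurm.net/v1alpha1
kind: NodeSet                 # a pool of containerized compute workers (slurmd)
metadata:
  name: gpu-workers
  namespace: slurm
spec:
  replicas: 4                 # one worker pod per node
  template:
    spec:
      containers:
        - name: slurmd
          image: ghcr.io/slinkyproject/slurmd:latest   # image path assumed
          resources:
            limits:
              nvidia.com/gpu: 8
```

Applying manifests like these is what lets the operator reconcile the declared cluster into running pods; consult the SlinkyProject repository for the actual CRD reference.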
The high-availability story matters for production deployments. Slinky handles control plane HA through pod regeneration rather than Slurm's native mechanism. Configuration changes propagate automatically with zero scheduler downtime. Workers can autoscale based on cluster metrics, and on scale-in, Slinky fully drains nodes before terminating pods, letting running workloads complete first.
For NVIDIA's GB200 NVL72 architecture, where GPUs communicate across nodes via multi-node NVLink, Slinky enables ComputeDomains that dynamically manage high-bandwidth GPU-to-GPU connectivity. Distributed training jobs receive full NVLink bandwidth across node boundaries.
Production Results at NVIDIA
NVIDIA reports that GPU communication benchmarks (NCCL all-reduce and all-gather) match non-containerized Slurm deployments, with no measurable impact from the Kubernetes layer. New clusters reportedly go from zero to running jobs in hours using Helm charts.
The operational wins compound at scale: Prometheus scrapes Slurm metrics alongside standard Kubernetes metrics. When health checks flag an unhealthy node, the state syncs automatically between systems. Rolling updates proceed while training jobs continue on remaining capacity.
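Because the Slurm daemons run as ordinary pods, picking up their metrics needs nothing beyond standard Prometheus service discovery. The sketch below is a generic illustration; the namespace, port name, and any exporter details are assumptions, not Slinky's documented defaults:

```yaml
# Illustrative Prometheus scrape config using Kubernetes pod discovery.
# The "slurm" namespace and "metrics" port name are assumed for this example.
scrape_configs:
  - job_name: slurm
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods in the Slurm namespace...
      - source_labels: [__meta_kubernetes_namespace]
        regex: slurm
        action: keep
      # ...that expose a port named "metrics".
      - source_labels: [__meta_kubernetes_pod_container_port_name]
        regex: metrics
        action: keep
```

The point is that Slurm observability rides on the same discovery and relabeling machinery as everything else in the cluster, rather than a parallel monitoring stack.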
One constraint worth noting: Slinky currently assumes one worker pod per node. If you're running only single-node Slurm jobs, this over-provisions relative to what you need.
What’s New in v1.1.0
The recently released slurm-operator v1.1.0 adds dynamic topology support: worker pods now register with topology based on their Kubernetes node, enabling topology-aware scheduling as pods move. DaemonSet-style scaling ties pods to their nodeSelector, simplifying operations for clusters where every GPU node should run a Slurm worker.
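In spirit, the DaemonSet-style mode replaces a fixed replica count with "one worker on every node matching a selector." A hedged sketch, with field names assumed from the release description rather than the exact v1.1.0 schema:

```yaml
# Illustrative only: kind and field names are assumptions, not the
# exact v1.1.0 API. Shows the intent of DaemonSet-style scaling.
apiVersion: slinky.slurm.net/v1alpha1
kind: NodeSet
metadata:
  name: gpu-workers
spec:
  # No fixed replica count: the worker set follows the nodeSelector,
  # so every matching node gets exactly one Slurm worker pod.
  template:
    spec:
      nodeSelector:
        nvidia.com/gpu.present: "true"   # label assumed for this example
```

As GPU nodes join or leave the Kubernetes cluster, the worker set tracks them automatically, which is the operational simplification the release notes describe.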
The roadmap includes graceful cluster upgrades, planned outage workflows, and configuration rollback. For AI infrastructure teams weighing build-versus-integrate decisions, Slinky represents a credible option that did not exist a year ago. The code is available on GitHub under the SlinkyProject organization.
Image source: Shutterstock

