Felix Pinkston
Jan 21, 2026 21:57
NVIDIA simplifies GPU development with a CUB single-call API in CUDA 13.1, eliminating repetitive two-phase memory allocation code without performance loss.
NVIDIA has shipped a significant quality-of-life upgrade for GPU developers with CUDA 13.1, introducing a single-call API for the CUB template library that eliminates the clunky two-phase memory allocation pattern developers have worked around for years.
The change addresses a long-standing pain point. CUB, the C++ template library powering high-performance GPU primitives like scans, sorts, and histograms, previously required developers to call each function twice: once to calculate the required memory, then again to actually run the algorithm. This meant every CUB operation involved a verbose dance of memory estimation, allocation, and execution.
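The classic two-phase pattern looks roughly like this sketch (device pointers `d_in`/`d_out` and `num_items` are assumed to be set up elsewhere):

```cpp
#include <cub/cub.cuh>

void sum_two_phase(const int* d_in, int* d_out, int num_items)
{
    void*  d_temp_storage     = nullptr;
    size_t temp_storage_bytes = 0;

    // Phase 1: with a null temp pointer, CUB only reports how many
    // bytes of scratch space the reduction will need.
    cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes,
                           d_in, d_out, num_items);

    // The caller allocates the scratch space...
    cudaMalloc(&d_temp_storage, temp_storage_bytes);

    // Phase 2: the same call again, which now actually runs.
    cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes,
                           d_in, d_out, num_items);

    cudaFree(d_temp_storage);
}
```

Every CUB algorithm call followed this shape, which is exactly the boilerplate production codebases have been hiding behind macros.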
PyTorch’s codebase tells the story. The framework wraps CUB calls in macros specifically to hide this two-step invocation, a workaround common across production codebases. Macros obscure control flow and complicate debugging, a trade-off teams accepted because the alternative was worse.
Zero Overhead, Less Code
The new API cuts straight to the point. What previously required explicit memory allocation now fits in a single line, with CUB handling temporary storage internally. NVIDIA’s benchmarks show the streamlined interface introduces zero performance overhead compared to the manual approach; memory allocation still happens, just under the hood, via asynchronous allocation embedded within the device primitives.
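Based on the pattern NVIDIA describes, the single-call form simply drops the temp-storage parameters; the sketch below assumes the CUDA 13.1 overload mirrors the classic argument order, so treat the exact signature as illustrative:

```cpp
#include <cub/cub.cuh>

void sum_single_call(const int* d_in, int* d_out, int num_items)
{
    // One call: CUB sizes, allocates, and frees its temporary storage
    // internally, using stream-ordered (asynchronous) allocation, so
    // the caller never touches scratch memory.
    cub::DeviceReduce::Sum(d_in, d_out, num_items);
}
```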
Critically, the old two-phase API remains available. Developers who need fine-grained control over memory, such as reusing allocations across multiple operations or sharing them between algorithms, can continue using the existing pattern. But for the majority of use cases, the single-call approach should become the default.
The Environment Argument
Beyond simplifying basic calls, CUDA 13.1 introduces an extensible “env” argument that consolidates execution configuration. Developers can now combine custom CUDA streams, memory resources, determinism requirements, and tuning policies through a single type-safe object rather than juggling multiple function parameters.
Memory resources, a new utility for allocation and deallocation, can be passed through this environment argument. NVIDIA provides default resources, but developers can substitute their own custom implementations or use CCCL-provided options like device memory pools.
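The simplest environment is just a stream. The sketch below assumes the single-call overload accepts a trailing env argument as the article describes, and uses libcu++’s `cuda::stream_ref` wrapper; richer environments bundling memory resources, determinism requirements, and tuning policies would be passed in the same position:

```cpp
#include <cub/cub.cuh>
#include <cuda/stream_ref>

void sum_on_stream(const int* d_in, int* d_out, int num_items,
                   cudaStream_t stream)
{
    // The trailing argument acts as the execution environment; here it
    // only carries the stream the reduction should be enqueued on.
    cub::DeviceReduce::Sum(d_in, d_out, num_items,
                           cuda::stream_ref{stream});
}
```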
Currently, the environment interface supports core algorithms including DeviceReduce operations (Reduce, Sum, Min, Max, ArgMin, ArgMax) and DeviceScan operations (ExclusiveSum, ExclusiveScan). NVIDIA is tracking additional algorithm support via its CCCL GitHub repository.
Practical Implications
For teams maintaining GPU-accelerated applications, this update means less wrapper code and cleaner integration. The CUB library already serves as a foundational component of NVIDIA’s CUDA Core Compute Libraries, and simplifying its API reduces friction for developers building custom CUDA kernels.
The timing aligns with a broader industry move toward more accessible GPU programming. As AI workloads drive demand for optimized GPU code, lowering the barriers to using high-performance primitives matters.
CUDA 13.1 is available now through NVIDIA’s developer portal. Teams currently using macro wrappers around CUB calls should evaluate migrating to the native single-call API; it delivers the same abstraction without the debugging headaches.
Image source: Shutterstock