Timothy Morano
Apr 01, 2026 19:17
Together AI’s kernel research team delivers major GPU optimization breakthroughs, cutting inference latency from 281ms to 77ms for enterprise AI deployments.
The team behind FlashAttention has quietly become one of the most consequential groups in AI infrastructure. Together AI’s kernel research unit, now about 15 engineers strong, is solving a problem most people don’t even know exists: the massive performance gap between AI models and the hardware running them.
Their latest win? Taking a voice AI company’s time-to-first-token from 281ms down to 77ms, a 3.6x improvement that translated to 7.2x better unit economics.
The Hidden Bottleneck
Here’s what most AI discourse misses: having great models and expensive GPUs doesn’t guarantee performance. The bottleneck sits in between, in the kernel layer that translates mathematical operations into actual silicon instructions.
“The gap between what researchers design and what actually runs fast on hardware is huge,” explains Dan Fu, who leads a parallel research lab at UCSD. Get kernels right and you unlock the hardware’s full potential. Get them wrong and your expensive GPUs sit partially idle.
For companies building AI-native products, this isn’t academic. When inference costs run 2x higher than necessary, or when latency breaks the user experience, kernel optimization becomes existential.
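One way to make the kernel-layer cost concrete is a back-of-envelope memory-traffic model: unfused kernels round-trip every intermediate result through HBM, while a fused kernel reads inputs and writes the output once. The numbers and helper names below are my own illustration, not from the article:

```python
# Back-of-envelope HBM traffic for y = relu(a * x + b) over n fp16 elements.
# Assumption (illustrative, not from the article): each unfused kernel
# reads its inputs from HBM and writes its result back, while a fused
# kernel reads a, x, b once and writes y once.

BYTES = 2  # fp16 element size

def unfused_traffic(n: int) -> int:
    mul  = 2 * n * BYTES + n * BYTES   # read a, x; write tmp1
    add  = 2 * n * BYTES + n * BYTES   # read tmp1, b; write tmp2
    relu = n * BYTES + n * BYTES       # read tmp2; write y
    return mul + add + relu

def fused_traffic(n: int) -> int:
    return 3 * n * BYTES + n * BYTES   # read a, x, b once; write y once

n = 1 << 20
print(unfused_traffic(n) / fused_traffic(n))  # → 2.0
```

Even for this tiny three-op chain, fusion halves the bytes moved; on a bandwidth-bound GPU, that translates directly into wall-clock time.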
One Week Versus One Year
The team’s capabilities showed clearly when NVIDIA’s Blackwell GPUs arrived in March 2025. NVIDIA had spent a year with dozens of engineers optimizing kernels for the new architecture. Together AI had a week.
Their secret weapon: ThunderKittens, a library developed with Stanford researchers that reduces kernel code from 1,000+ lines of CUDA to roughly 100-200 lines. The abstraction layer is built around NVIDIA’s tensor cores, the specialized matrix multiplication units on modern GPUs.
Within seven days of hardware access, the team had some of the fastest FP4 and FP8 GEMM kernels available for Blackwell, achieving up to 2x speedups over cuBLAS on H100s.
Real-World Impact
The voice AI case study illustrates what this means in production. The customer had a hard constraint: time-to-first-64-tokens above roughly 100ms breaks conversational flow. Their B200 deployment was hitting 281ms.
Together’s team hand-optimized a “Megakernel” implementation, running an entire model in a single kernel and targeting the HBM bandwidth ceiling of NVIDIA H100s. Results on Llama-3.2-1B: 77ms. On Qwen 2.5 1.5B: 127ms, down from 292ms.
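A rough calculation shows why the HBM bandwidth ceiling is the right target here: at low batch sizes, every decoded token must stream the full weight set from HBM, so bandwidth sets a hard latency floor. The figures below are approximate public specs I am assuming (H100 HBM3 ≈ 3.35 TB/s, Llama-3.2-1B ≈ 1.24B FP16 parameters), not numbers from the article:

```python
# Bandwidth-bound latency floor for sequential token generation.
# Assumed figures (approximate public specs, not from the article):
HBM_BANDWIDTH = 3.35e12   # bytes/sec, roughly H100 HBM3
PARAMS = 1.24e9           # roughly Llama-3.2-1B parameter count
BYTES_PER_PARAM = 2       # FP16 weights

weight_bytes = PARAMS * BYTES_PER_PARAM
per_token_floor = weight_bytes / HBM_BANDWIDTH       # seconds per token
first_64_floor_ms = 64 * per_token_floor * 1e3

print(f"{per_token_floor * 1e3:.2f} ms/token floor")        # ~0.74 ms
print(f"{first_64_floor_ms:.0f} ms floor for 64 tokens")    # ~47 ms
```

Under these assumptions the physics allow roughly 47ms for 64 tokens; the gap between that floor and a multi-kernel implementation's 281ms is mostly per-layer kernel-launch and synchronization overhead, which is exactly what collapsing the model into one kernel removes.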
The approach traces back to FlashAttention’s original insight. That Memorial Day 2022 paper proved the AI establishment wrong about attention being fully optimized. By applying database systems principles (data locality, memory hierarchies) to transformer attention, the team achieved 2-3x speedups where previous sparsity methods had shown only 10% real gains.
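The core trick behind that speedup, computing attention in tiles with a running ("online") softmax so the full N×N score matrix never touches slow memory, can be sketched in NumPy. This is a minimal single-head illustration of the idea, not the team's CUDA implementation:

```python
import numpy as np

def naive_attention(Q, K, V):
    # Materializes the full N x N score matrix (what FlashAttention avoids).
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    return (P / P.sum(axis=-1, keepdims=True)) @ V

def tiled_attention(Q, K, V, block=32):
    # Streams K/V in blocks, keeping a running row-max and normalizer,
    # so only block-sized score tiles ever exist at once.
    n, d = Q.shape
    out = np.zeros_like(Q)
    m = np.full(n, -np.inf)      # running row max
    l = np.zeros(n)              # running softmax denominator
    for j in range(0, K.shape[0], block):
        S = Q @ K[j:j + block].T / np.sqrt(d)     # (n, block) score tile
        m_new = np.maximum(m, S.max(axis=-1))
        scale = np.exp(m - m_new)                 # rescale previous state
        P = np.exp(S - m_new[:, None])
        l = l * scale + P.sum(axis=-1)
        out = out * scale[:, None] + P @ V[j:j + block]
        m = m_new
    return out / l[:, None]

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((128, 64)) for _ in range(3))
print(np.allclose(naive_attention(Q, K, V), tiled_attention(Q, K, V)))  # True
```

The tiled version produces identical output while its working set per step is O(n × block) instead of O(n²), which is what lets the real kernel keep the computation in fast on-chip SRAM.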
Academic-Industry Pipeline
The team operates through an unusual model. Dan Fu runs his UCSD lab on higher-risk fundamental research. Together AI co-founder Tri Dao is at Princeton. Simran Arora is at Caltech. Ideas get de-risked in academia, then productionized at Together AI. PhD students join the company. Interns work on longer-term research in academic labs.
This produces engineers who bridge theory and production, people who, as Fu puts it, “lose sleep over memory access patterns” and “find beauty in data flow diagrams.”
The work isn’t glamorous. There are no announcements when a kernel optimization lands. Just faster training times, lower costs, higher throughput. But these margins determine whether AI-native products feel instant or sluggish, whether unit economics work or don’t, and whether companies scale to millions of users or plateau at thousands.
For enterprise AI deployments where every millisecond matters, and every percentage point of efficiency translates to significant cost savings, this invisible infrastructure layer may be where the real competitive advantage lies.
Image source: Shutterstock

