Alvin Lang
Jan 22, 2026 23:03
NVIDIA’s FlashAttention-4 achieves 71% hardware efficiency on Blackwell chips, delivering a 3.6x speedup over FA2 for AI training workloads.
NVIDIA has launched FlashAttention-4, the latest optimization for transformer neural networks. It squeezes 1,605 TFLOPS out of the Blackwell architecture, capturing 71% of the hardware’s theoretical maximum performance.
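As a quick sanity check on those two figures, the peak throughput they imply can be worked out directly. This is only back-of-the-envelope arithmetic on the numbers quoted above; the function name is hypothetical, and because the 71% figure is rounded, the result is approximate rather than an official specification.

```python
def implied_peak_tflops(achieved_tflops, utilization):
    """Peak throughput implied by an achieved rate and a utilization fraction."""
    return achieved_tflops / utilization


# 1,605 TFLOPS at 71% utilization implies a peak of roughly 2,260 TFLOPS.
peak = implied_peak_tflops(1605, 0.71)
print(f"Implied theoretical peak: {peak:.0f} TFLOPS")
```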
The announcement matters for anyone watching AI infrastructure investments. As large language models push toward longer context windows, the attention mechanism’s quadratic memory complexity becomes a brutal bottleneck. FlashAttention-4 attacks this problem directly, and the benchmark numbers suggest meaningful gains for production AI workloads.
What the Numbers Show
On the B200 GPU, FA4 delivers a 3.6x speedup over FlashAttention-2 on forward passes at a sequence length of 32,768. Backward pass performance is 3.15x faster than FA2 under the same conditions. Against existing frameworks, FA4 posts a 1.3x improvement over cuDNN and 2.4x over Triton Inference Server implementations.
The memory efficiency gains are equally significant. Standard attention scales at O(N²) with sequence length, meaning that doubling your context window quadruples memory requirements. FA4 brings this down to O(N) through tiling and incremental softmax normalization. NVIDIA claims 20x lower memory usage compared to PyTorch baselines.
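To make the tiling and incremental softmax idea concrete, here is a minimal pure-Python sketch for a single query row. It is not NVIDIA's kernel: the function names, the tile size, and the 2-byte element size are illustrative assumptions, and the usual 1/sqrt(d) score scaling is omitted for brevity. The point is only that the tiled version keeps one tile of scores alive at a time, yet produces the same result as the version that materializes every score.

```python
import math
import random


def score_matrix_bytes(n, bytes_per_element=2):
    """Memory for a full N x N score matrix (2-byte elements assumed)."""
    return n * n * bytes_per_element


def naive_attention_row(q, keys, values):
    """Reference: materialize all N scores for one query, then softmax."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def tiled_attention_row(q, keys, values, tile=4):
    """Same result, but only one tile of scores is held at any time."""
    dim = len(values[0])
    running_max = float("-inf")  # largest score seen so far
    running_sum = 0.0            # softmax denominator under running_max
    acc = [0.0] * dim            # unnormalized weighted sum of values
    for start in range(0, len(keys), tile):
        k_tile = keys[start:start + tile]
        v_tile = values[start:start + tile]
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in k_tile]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated under the previous maximum.
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_tile):
            w = math.exp(s - new_max)
            running_sum += w
            for d in range(dim):
                acc[d] += w * v[d]
        running_max = new_max
    return [a / running_sum for a in acc]


# Small random example: both paths should agree to floating-point tolerance.
rng = random.Random(0)
q = [rng.uniform(-1, 1) for _ in range(3)]
keys = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(10)]
values = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(10)]
reference = naive_attention_row(q, keys, values)
tiled = tiled_attention_row(q, keys, values, tile=4)
print(max(abs(a - b) for a, b in zip(reference, tiled)))
```

The helper score_matrix_bytes shows the quadratic side of the comparison: going from 32,768 to 65,536 tokens multiplies the full score matrix by four, while the tiled loop's working set depends only on the tile size and the head dimension.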
Hardware-Software Co-Design
FA4 was built specifically for Blackwell’s quirks. The architecture presents an asymmetric scaling problem: compute power roughly doubles while memory bandwidth doesn’t keep pace. Traditional approaches leave tensor cores sitting idle while waiting for data.
The solution leverages Blackwell’s dedicated Tensor Memory (TMEM), which provides 256 KB of on-chip memory per streaming multiprocessor. By storing intermediate calculations directly in TMEM instead of shared memory, FA4 sidesteps the bandwidth bottleneck that would otherwise throttle the faster compute units.
Larger tile sizes (up to 128×128) and deeper pipelines keep the hardware busy. The backward pass, typically the slower half of training, benefits from bypassing register accumulation entirely.
Production Integration
Major inference frameworks including SGLang and vLLM already support FA4 prefill operations. NVIDIA has incorporated these techniques into cuDNN 9.14, making the optimizations accessible to developers without custom kernel work.
For AI companies burning through compute budgets, the efficiency gains translate directly to cost savings. A 3x+ speedup on training passes means either faster iteration cycles or the ability to train larger models within existing infrastructure constraints.
The broader trend here: as transformer models grow, algorithmic efficiency at the kernel level becomes as important as raw hardware capability. FlashAttention-4 represents the current frontier of that optimization work.

