Sawtooth Wavefront Reordering: Enhanced CuTile FlashAttention on NVIDIA GB10
Published 22 Jan 2026 in cs.PF, cs.AI, cs.LG, and cs.OS | (2601.16032v1)
Abstract: High-performance attention kernels are essential for LLMs. This paper presents an analysis of the memory behavior of CuTile-based Flash Attention and a technique to improve its cache performance. In particular, our analysis on the NVIDIA GB10 (Grace Blackwell) identifies the main cause of L2 cache misses. Leveraging this insight, we introduce a new programming technique called Sawtooth Wavefront Reordering that reduces L2 misses. We validate it in both CUDA and CuTile, observing a 50% or greater reduction in L2 misses and up to a 60% increase in throughput on GB10.
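The abstract does not spell out the reordering itself, but the name suggests remapping the tile (thread-block) visitation order so that consecutive wavefronts touch neighboring tiles instead of jumping across the grid. A minimal sketch of one plausible "sawtooth" (serpentine) tile ordering, purely illustrative and not taken from the paper:

```python
# Hypothetical illustration of a sawtooth/serpentine tile ordering.
# The exact mapping used by the paper's Sawtooth Wavefront Reordering
# is not described in this abstract; this sketch only shows the general
# idea of keeping consecutive tiles in neighboring columns so that data
# cached in L2 by one wave is likely still resident for the next.

def sawtooth_order(rows: int, cols: int) -> list[tuple[int, int]]:
    """Return (row, col) tile coordinates in serpentine order:
    even rows left-to-right, odd rows right-to-left."""
    order = []
    for r in range(rows):
        cs = range(cols) if r % 2 == 0 else range(cols - 1, -1, -1)
        for c in cs:
            order.append((r, c))
    return order

# A plain row-major launch visits (1, 0) right after (0, cols-1): the
# column index jumps from cols-1 back to 0, so tiles cached for the end
# of one row are useless at the start of the next. The serpentine order
# reverses direction on each row, so adjacent tiles in the schedule are
# adjacent in the grid.
```

For example, `sawtooth_order(2, 3)` yields `(0,0), (0,1), (0,2), (1,2), (1,1), (1,0)`: the transition between rows reuses column 2 rather than jumping back to column 0.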