Sawtooth Wavefront Reordering on GPUs
- Sawtooth wavefront reordering is an algorithmic technique that alternates tile scan directions to cut L2 cache reuse distances in GPU kernels.
- It achieves up to a 2× reduction in non-compulsory L2 misses, significantly enhancing throughput in transformer inference workloads like FlashAttention.
- Optimal performance requires balancing tile sizes and thread block configurations to maximize cache reuse while managing shared memory constraints.
Sawtooth wavefront reordering is an algorithmic and scheduling technique developed to mitigate L2 cache miss rates in memory-bound GPU kernels with strong wavefront synchrony, particularly in large-tiled attention mechanisms such as FlashAttention operating on NVIDIA GB10 ("Grace Blackwell") GPUs. By modifying the access pattern of streaming data tiles during parallel execution, sawtooth wavefront reordering substantially reduces non-compulsory L2 misses and increases throughput, especially in transformer inference workloads with long sequence lengths (Zhu et al., 22 Jan 2026).
1. Architectural and Algorithmic Context
The NVIDIA GB10 GPU integrates a large monolithic L2 cache (24 MiB) shared across its streaming multiprocessors (SMs), each with a private L1 region ("L1Tex" or scratchpad). In high-performance attention kernels such as split-Q FlashAttention, the memory access pattern consists of sequential scans over key (K) and value (V) tiles for each query (Q) tile. For large sequence lengths (S ≳ 10⁵), the working set of K+V tiles exceeds L2 capacity, and both L1 and L2 act as pass-throughs for global (HBM/LPDDR5X) accesses, with the L1 hit rate measured at <0.03% for FlashAttention (Zhu et al., 22 Jan 2026). This context makes improving L2 cache reuse central to scaling transformer throughput.
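The capacity argument above can be checked with a short back-of-the-envelope calculation. This is a minimal sketch: the head dimension of 64 and the 2-byte (fp16/bf16) element size are illustrative assumptions, not values taken from the cited work, and `kv_working_set_bytes` is a hypothetical helper name.

```python
L2_BYTES = 24 * 2**20  # 24 MiB L2 capacity, as stated above


def kv_working_set_bytes(seq_len, head_dim, bytes_per_elem):
    """Bytes needed to hold all K rows and all V rows of one attention head."""
    return 2 * seq_len * head_dim * bytes_per_elem


# head_dim=64 and bytes_per_elem=2 are assumptions chosen for illustration.
for seq_len in (10_000, 100_000, 1_000_000):
    ws = kv_working_set_bytes(seq_len, head_dim=64, bytes_per_elem=2)
    print(f"S={seq_len:>9}: K+V = {ws / 2**20:8.1f} MiB, exceeds L2: {ws > L2_BYTES}")
```

Under these assumptions the K+V footprint of a single head crosses the 24 MiB mark at roughly S = 10⁵, which is consistent with the threshold quoted in the text.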
2. L2 Miss Phenomena in Cyclic Tile Scans
Standard cyclic tile scans, in which each Cooperative Thread Array (CTA) processes Q blocks and iterates through K, V tiles in identical forward order, induce high L2 miss rates due to large reuse distances and synchronized wavefronts. The reuse distance ($d$) between two uses of the same cache line scales with the K/V tile count $T_c$, such that:

$$d_{\text{cyclic}} \propto T_c$$
As all CTAs progress synchronously, their simultaneous requests evict useful cache lines from L2, so nearly all accesses become non-compulsory misses once the per-CTA working set (in bytes) exceeds the effective per-SM L2 budget. Empirically, the L2 hit rate depends on the number of active SMs (Zhu et al., 22 Jan 2026). This synchrony also exacerbates L2 bank conflicts.
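To make the reuse-distance argument concrete, the sketch below measures it at tile granularity for a single CTA's cyclic scan. The distance is counted here as the number of distinct other tiles touched between two accesses to the same tile (a stack-distance style definition chosen for illustration); `reuse_distances` and `cyclic_trace` are hypothetical helper names.

```python
def reuse_distances(trace):
    """For every re-access, count the distinct tiles touched since the
    previous access to the same tile."""
    last_seen = {}
    distances = []
    for pos, tile in enumerate(trace):
        if tile in last_seen:
            between = trace[last_seen[tile] + 1 : pos]
            distances.append(len(set(between)))
        last_seen[tile] = pos
    return distances


def cyclic_trace(num_q_tiles, num_kv_tiles):
    """K/V tile indices visited when every Q tile scans forward."""
    return [j for _ in range(num_q_tiles) for j in range(num_kv_tiles)]


d = reuse_distances(cyclic_trace(num_q_tiles=3, num_kv_tiles=8))
print(min(d), max(d))
```

In a cyclic scan every reuse is separated by all of the other K/V tiles, so the distance grows linearly with the tile count and no tile is revisited while it is still recent.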
3. Sawtooth Wavefront Reordering Mechanism
Sawtooth wavefront reordering modifies the access trajectory across K and V tiles: it alternates the scan direction for each neighboring Q tile (per CTA). On even Q tiles, the inner scan runs forward; on odd Q tiles, backward. This "sawtooth" pattern (Editor's term) spatially and temporally interleaves accesses, reducing contention and, crucially, halving the average reuse distance $d$ relative to a cyclic scan:

$$d_{\text{sawtooth}} \approx \tfrac{1}{2}\, d_{\text{cyclic}}$$

Consequently, the upper bound on the non-compulsory miss probability decreases. With the reuse distance halved, the resulting miss probability is reduced by up to 2×, directly improving effective memory bandwidth and kernel throughput (Zhu et al., 22 Jan 2026).
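The effect on misses can be illustrated with a toy model. The sketch below assumes a fully associative LRU cache that holds a fixed number of K/V tiles and a single CTA processing its Q tiles one after another; real L2 behavior (set associativity, many concurrent CTAs, sector granularity) is more complex, and `scan_order` and `count_misses` are hypothetical names.

```python
from collections import OrderedDict


def scan_order(q_index, num_kv_tiles, sawtooth):
    """Forward scan, except odd Q tiles scan backward when sawtooth is on."""
    if sawtooth and q_index % 2 == 1:
        return range(num_kv_tiles - 1, -1, -1)
    return range(num_kv_tiles)


def count_misses(num_q_tiles, num_kv_tiles, cache_tiles, sawtooth):
    """Count misses in a toy LRU cache that holds `cache_tiles` K/V tiles."""
    cache = OrderedDict()  # keys ordered from least to most recently used
    misses = 0
    for q in range(num_q_tiles):
        for kv in scan_order(q, num_kv_tiles, sawtooth):
            if kv in cache:
                cache.move_to_end(kv)  # hit: refresh recency
            else:
                misses += 1
                cache[kv] = True
                if len(cache) > cache_tiles:
                    cache.popitem(last=False)  # evict least recently used
    return misses


compulsory = 8  # each of the 8 K/V tiles must be fetched at least once
for sawtooth in (False, True):
    total = count_misses(num_q_tiles=4, num_kv_tiles=8, cache_tiles=4, sawtooth=sawtooth)
    print("sawtooth" if sawtooth else "cyclic", total, total - compulsory)
```

With a cache that holds half of the K/V tiles, the cyclic scan misses on every access (32 in total), whereas the sawtooth scan finds the most recently loaded tiles still resident after each reversal (20 in total). After subtracting the 8 compulsory misses this is 24 versus 12, matching the "up to 2×" bound in this idealized setting.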
Key Implementation
Below is the schematic pseudocode for applying sawtooth reordering within each CTA:
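```
// i_local: Q tile index within the CTA; T_c: number of K/V tiles
bool forward = (i_local & 1) == 0;   // even Q tiles scan forward, odd ones backward
int j0   = forward ? 0   : T_c - 1;  // first K/V tile visited
int j1   = forward ? T_c : -1;       // one-past-the-end sentinel
int step = forward ? 1   : -1;
for (int j = j0; j != j1; j += step) {
    load K_j, V_j;                   // pseudocode: stage tile j on chip
    compute_attention(q, K_j, V_j);
}
```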
In CuTile, the transformation uses "parfor 2" and reversed range idioms to generate the desired alternation:
```python
tile_seq = range(0, T_c) if i % 2 == 0 else range(T_c - 1, -1, -1)
for kv_tile in tile_seq:
    compute_tile(kv_tile)
```
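Reversing the scan direction is only safe if the attention result does not depend on the order in which K/V tiles are visited. The sketch below checks this for one query row, assuming the kernel accumulates with the usual streaming-softmax recurrence (running maximum, rescaled denominator and accumulator). It is a pure-Python illustration, not the CuTile kernel, and `online_attention` is a hypothetical name.

```python
import math
import random


def online_attention(q, keys, values, tile_order, tile_size):
    """Streaming-softmax attention for one query row, visiting K/V tiles
    in the order given by tile_order."""
    running_max = -math.inf
    denom = 0.0
    acc = [0.0] * len(values[0])
    for t in tile_order:
        for idx in range(t * tile_size, (t + 1) * tile_size):
            score = sum(a * b for a, b in zip(q, keys[idx]))
            new_max = max(running_max, score)
            scale = math.exp(running_max - new_max)  # rescale earlier partial sums
            weight = math.exp(score - new_max)
            denom = denom * scale + weight
            acc = [a * scale + weight * v for a, v in zip(acc, values[idx])]
            running_max = new_max
    return [a / denom for a in acc]


random.seed(0)
T_c, tile_size, dim = 4, 8, 16
q = [random.gauss(0, 1) for _ in range(dim)]
keys = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(T_c * tile_size)]
values = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(T_c * tile_size)]

forward = online_attention(q, keys, values, range(T_c), tile_size)
backward = online_attention(q, keys, values, range(T_c - 1, -1, -1), tile_size)
print(max(abs(a - b) for a, b in zip(forward, backward)))
```

The two outputs agree up to floating-point rounding, which is why the scan direction can be chosen purely for cache behavior.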
4. Quantitative Impact on Memory Behavior and Throughput
Systematic evaluation on GB10 with split-Q FlashAttention (batch=8) demonstrates the impact:
| Variant | L2 Misses (×10⁶) | Throughput Non-Causal (TFLOPS) | Throughput Causal (TFLOPS) |
|---|---|---|---|
| Cyclic-Static | 370 | 61.0 | 41.0 |
| Sawtooth-Static | 120 | 69.0 (+13%) | 66.0 (+61%) |
| Cyclic-Tile | 350 | 62.5 | 42.0 |
| Sawtooth-Tile | 115 | 70.8 (+13%) | 65.6 (+56%) |
Measured with Nsight Compute counters, sawtooth reordering achieves a 50%–67% reduction in L2 sector misses. In the table above, throughput improves by up to 61% under a causal mask (Sawtooth-Static) and by a consistent ≈13% for non-causal patterns (Zhu et al., 22 Jan 2026).
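The relative changes can be recomputed directly from the table. The snippet below uses only the figures listed above; `pct_change` is a hypothetical helper.

```python
# (L2 misses in millions, non-causal TFLOPS, causal TFLOPS), copied from the table
rows = {
    "Cyclic-Static": (370, 61.0, 41.0),
    "Sawtooth-Static": (120, 69.0, 66.0),
    "Cyclic-Tile": (350, 62.5, 42.0),
    "Sawtooth-Tile": (115, 70.8, 65.6),
}


def pct_change(new, old):
    """Relative change of `new` with respect to `old`, in percent."""
    return 100.0 * (new - old) / old


summary = {}
for sched in ("Static", "Tile"):
    base = rows[f"Cyclic-{sched}"]
    saw = rows[f"Sawtooth-{sched}"]
    summary[sched] = (
        -pct_change(saw[0], base[0]),  # L2 miss reduction
        pct_change(saw[1], base[1]),   # non-causal speedup
        pct_change(saw[2], base[2]),   # causal speedup
    )
    print(sched, [round(x, 1) for x in summary[sched]])
```

For both scheduling variants the miss reduction in this table comes out at roughly two thirds, and the causal speedup is several times larger than the non-causal one.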
5. Implementation Tuning and Trade-offs
Optimal deployment of sawtooth wavefront reordering requires balancing tile size and thread block layout:
- A smaller tile size improves L2 cache reuse but increases arithmetic and scheduling overhead.
- A larger tile size may exhaust shared memory or the register file, forcing kernel splitting and diminishing both reuse and performance.
- Thread block occupancy, typically 128 threads per CTA arranged in an 8×16 layout, is tuned to match the tile size and maximize SM pipeline occupancy.
For best results, tile sizing and sawtooth directionality should correspond to on-chip SRAM and L2 cache availability, as well as workload synchrony.
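The shared-memory side of this trade-off can be estimated before launching anything. The sketch below assumes one Q tile plus one K and one V tile are staged on chip at a time, with no double buffering or padding. The 96 KiB budget, the head dimension of 128, and the helper names are illustrative assumptions rather than GB10 or CuTile specifics.

```python
def tile_bytes(block_q, block_kv, head_dim, bytes_per_elem):
    """Bytes to stage one Q tile plus one K tile and one V tile on chip."""
    return (block_q + 2 * block_kv) * head_dim * bytes_per_elem


def largest_square_tile(budget_bytes, head_dim, bytes_per_elem,
                        candidates=(16, 32, 64, 128, 256)):
    """Largest candidate tile edge whose Q/K/V staging fits the budget."""
    fitting = [b for b in candidates
               if tile_bytes(b, b, head_dim, bytes_per_elem) <= budget_bytes]
    return max(fitting) if fitting else None


budget = 96 * 1024  # hypothetical per-CTA shared-memory budget, not a GB10 figure
print(largest_square_tile(budget, head_dim=128, bytes_per_elem=2))
```

A real kernel would also need to account for register pressure, the output accumulator, and any multi-stage pipelining, each of which lowers the largest tile that fits.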
6. Generalization, Applicability, and Limitations
Sawtooth wavefront reordering is applicable beyond GB10. Its benefits extend to other NVIDIA GPU generations (e.g., A100, H100) with similar coordinated SM wavefront execution and symmetric memory hierarchies. It is also effective in non-attention streaming kernels such as matrix multiplication and convolution, given sufficiently regular CTA progress (Zhu et al., 22 Jan 2026).
Limitations include:
- Reduced benefit when CTA runtime variability disrupts inter-CTA synchrony.
- In CuTile, very large tiles may trigger kernel splitting (due to register/shared-memory limits), potentially interfering with intended access ordering.
- The approach is less effective if the per-CTA tile working set is significantly larger than the cache or in architectures with non-uniform memory access domains.
Future Directions
Potential avenues for enhancement include run-time adaptation of the sawtooth pattern based on cache miss counters, dynamic tile sizing, and analytic integration with cache-reuse models to predict optimal scan orders at compile time.
7. Relationship to GB10 Microarchitecture and Broader Implications
Sawtooth wavefront reordering leverages the structural features and constraints of the GB10 memory system. With an L2 hit latency of ≈358 cycles and a DRAM access latency of ≈876 cycles, optimizing for L2 cache reuse is critical for high-performance ML workloads (Jarmusch et al., 14 Jul 2025). The technique interacts with higher-level kernel tuning, such as tile decomposition and memory staging, as discussed in microbenchmark analyses of Blackwell's tensor core and memory subsystems (Jarmusch et al., 1 Dec 2025).
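A simple two-level latency model shows why the L2 hit rate matters at these latencies. The sketch assumes that every access is served either by L2 at 358 cycles or by DRAM at 876 cycles and ignores overlap, queuing, and bandwidth limits. The hit rates in the loop are hypothetical inputs, not measurements.

```python
L2_HIT_CYCLES = 358  # approximate L2 hit latency quoted above
DRAM_CYCLES = 876    # approximate DRAM latency quoted above


def expected_latency(l2_hit_rate):
    """Average cycles per access in a two-level model (no overlap assumed)."""
    return l2_hit_rate * L2_HIT_CYCLES + (1.0 - l2_hit_rate) * DRAM_CYCLES


# Hypothetical hit rates, chosen only to show the trend.
for hit_rate in (0.0, 0.25, 0.5, 0.75):
    print(hit_rate, expected_latency(hit_rate))
```

Because GPUs hide much of this latency through parallelism, the model is best read as a trend indicator: each additional point of L2 hit rate removes a fixed share of the gap between the two latencies.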
A plausible implication is that as LLM and foundation model inference pushes sequence lengths higher, systematic scan-pattern reordering becomes indispensable in maximizing effective on-chip memory bandwidth and cache utility, especially given the trend toward larger, shared L2 architectures.
References
- Sawtooth Wavefront Reordering: Enhanced CuTile FlashAttention on NVIDIA GB10 (Zhu et al., 22 Jan 2026)
- Microbenchmarking NVIDIA's Blackwell Architecture: An In-depth Architectural Analysis (Jarmusch et al., 1 Dec 2025)
- Dissecting the NVIDIA Blackwell Architecture with Microbenchmarks (Jarmusch et al., 14 Jul 2025)