FlashAttention-3 Baseline Benchmarks
- FlashAttention-3 Baseline is a set of reference implementations used to benchmark GPU attention mechanisms, comparing performance, memory throughput, and numerical accuracy.
- Baseline variants differ in memory access and synchronization, with standard attention materializing full matrices while FlashAttention-2/3 fuse operations to minimize costly HBM transfers.
- Empirical results on the H100 show FlashAttention-3 achieving up to 67% Tensor-Core utilization and significantly lower RMSE compared to FP16/BF16 and FP8 baselines.
FlashAttention-3 Baseline refers to the reference attention mechanisms and implementations against which the FlashAttention-3 algorithm is benchmarked. These baselines provide a framework for evaluating performance, hardware utilization, and numerical accuracy on modern GPUs such as the NVIDIA H100. They span standard ("naïve") attention, FlashAttention-2, and vendor-supplied or custom FP8 quantized attention implementations, each with distinct memory access patterns, synchronization behavior, and precision handling, as detailed for comparison in (Shah et al., 2024).
1. Standard and FlashAttention-2 Baseline Implementations
The standard (“naïve”) attention baseline (FP16/BF16) computes the score matrix $S = QK^\top$ with a single GEMM and materializes the complete $N \times N$ matrix in high-bandwidth memory (HBM). A dedicated kernel then computes the row-wise softmax, storing the normalized matrix $P = \mathrm{softmax}(S)$ in HBM, followed by a second GEMM to obtain the output $O = PV$. This approach requires synchronous, stepwise execution and incurs two full memory passes, writing and reading both the S and P matrices.
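As a concrete illustration of this materialize-softmax-multiply workflow (a minimal NumPy sketch, not the benchmarked CUDA kernels; shapes and the $1/\sqrt{d}$ scaling follow the usual attention convention):

```python
import numpy as np

def standard_attention(Q, K, V):
    """Naive attention baseline: S and P are fully materialized,
    mirroring the two extra HBM round trips of the standard kernels."""
    d = Q.shape[-1]
    # GEMM 1: score matrix S = Q K^T / sqrt(d), written out in full (N x N).
    S = Q @ K.T / np.sqrt(d)
    # Separate softmax pass: read S, write the normalized matrix P.
    S_max = S.max(axis=-1, keepdims=True)          # subtract row max for stability
    P = np.exp(S - S_max)
    P /= P.sum(axis=-1, keepdims=True)
    # GEMM 2: read P again to form the output O = P V.
    return P @ V

# Tiny usage example with random inputs.
rng = np.random.default_rng(0)
N, d = 512, 128
Q, K, V = (rng.standard_normal((N, d), dtype=np.float32) for _ in range(3))
O = standard_attention(Q, K, V)
```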
In contrast, FlashAttention-2 (FP16/BF16) tiles the computation across blocks, fusing both GEMMs and the local softmax into a single CUDA kernel and using a circular buffer in shared memory to eliminate HBM reads/writes for S and P. Producer warps synchronously copy Q, K, V blocks into shared memory, while consumer warps complete computations in strict lock-step, synchronizing intra-block via barriers. No overlap occurs between softmax and GEMM across different blocks, and hardware utilization on the H100 is limited to roughly 35% of peak Tensor-Core throughput.
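The block-tiled, online-softmax idea behind FlashAttention-2 can be sketched in NumPy as follows; this is a numerical illustration only (block size, kernel fusion, and shared-memory buffering are properties of the real CUDA kernel, not of this sketch):

```python
import numpy as np

def flash_attention_reference(Q, K, V, block=128):
    """Tiled attention with an online softmax: K/V are visited block by block,
    and only running statistics (row max m, normalizer l, partial output acc)
    are kept, so the full S and P matrices are never formed."""
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    m = np.full((N, 1), -np.inf)        # running row-wise max of scores
    l = np.zeros((N, 1))                # running softmax normalizer
    acc = np.zeros((N, d))              # running (unnormalized) output
    for j in range(0, N, block):
        Kj, Vj = K[j:j + block], V[j:j + block]
        Sj = (Q @ Kj.T) * scale         # scores for this K/V block only
        m_new = np.maximum(m, Sj.max(axis=-1, keepdims=True))
        alpha = np.exp(m - m_new)       # rescale previously accumulated statistics
        Pj = np.exp(Sj - m_new)
        l = alpha * l + Pj.sum(axis=-1, keepdims=True)
        acc = alpha * acc + Pj @ Vj
        m = m_new
    return acc / l
```

Up to floating-point rounding this matches the fully materialized reference, which is exactly the equivalence that the RMSE comparisons in Section 4 quantify at lower precision.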
For FP8 attention, the baseline relies primarily on vendor-library routines (e.g., cuDNN 9), applying per-tensor quantization with a single scale factor per tensor for Q, K, and V. GEMMs are mixed precision, FP8 inputs with FP32 accumulators, and the output is typically cast to FP16 for the softmax, following the same materialize-softmax-multiply workflow. These FP8 baselines exhibit increased sensitivity to outliers and do not implement block-wise quantization or incoherent processing.
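A NumPy sketch of this vendor-style per-tensor FP8 flow is given below. NumPy has no FP8 dtype, so an int8-style uniform grid stands in for the FP8 cast, and the scale rule (global absolute maximum) is the straightforward per-tensor choice the text describes; both are assumptions of the illustration:

```python
import numpy as np

def quantize_per_tensor(X, levels=127):
    """Per-tensor quantization as in the FP8 baseline: one scale for the whole
    tensor, taken from its global absolute maximum. A single outlier therefore
    inflates the scale and coarsens the grid for every other entry."""
    scale = np.abs(X).max() / levels
    Xq = np.clip(np.round(X / scale), -levels, levels).astype(np.float32)
    return Xq, np.float32(scale)

def fp8_baseline_attention(Q, K, V):
    """Mixed-precision flow of the vendor-style FP8 baseline: quantized inputs,
    FP32 accumulation in both GEMMs, softmax evaluated after an FP16 cast."""
    d = Q.shape[-1]
    (Qq, sq), (Kq, sk), (Vq, sv) = (quantize_per_tensor(X) for X in (Q, K, V))
    S = (Qq @ Kq.T) * (sq * sk / np.sqrt(d))   # "FP8" GEMM with FP32 accumulators
    S16 = S.astype(np.float16)                 # output cast to FP16 for the softmax
    P = np.exp(S16 - S16.max(axis=-1, keepdims=True)).astype(np.float32)
    P /= P.sum(axis=-1, keepdims=True)
    return (P @ Vq) * sv                       # second GEMM, then rescale by V's factor
```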
2. Key Algorithmic and Practical Differences
Baseline methods have sharply divergent approaches to computation and data movement:
- Memory Traffic: Standard attention incurs the highest cost, streaming the full $N \times N$ score and probability matrices (S and P) to and from HBM. FlashAttention variants avoid materializing these matrices in HBM, transferring only the necessary blocks to shared memory.
- Synchronization Model: The standard baseline is strictly sequential. FlashAttention-2 uses intra-block barriers but remains synchronous at the block level, with no computation/data movement overlap. FP8 baselines operate analogously to standard attention with no pipelining or asynchrony.
- Precision and Layout: Standard and FlashAttention-2 operate in FP16/BF16, with accumulations in FP32 and softmax evaluated in FP32 for numerical stability. FP8 baselines quantize entire tensors, increasing error susceptibility due to quantization of outliers and imposing layout constraints (k-major tiling on FP8 Tensor Cores).
- Kernel Fusion and Pipelining: Baseline methods, including FlashAttention-2, do not utilize streaming asynchronous copies (TMA), warp specialization, pipelined GEMM-softmax overlap, or block quantization, all of which are introduced in FlashAttention-3; a sketch contrasting per-tensor and block-wise quantization follows this list.
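To make the contrast between per-tensor and block-wise scaling concrete, the sketch below assigns one scale per row block; the block size, scale rule, and the int8-style grid standing in for FP8 are assumptions of the illustration, not details of the benchmarked kernels:

```python
import numpy as np

def quantize_per_block(X, block=64):
    """Block-wise quantization: each row block gets its own scale, so a single
    outlier only degrades resolution inside its own block. An 8-bit integer
    grid stands in for FP8 here, since NumPy has no FP8 dtype."""
    Xq = np.empty_like(X)
    Xhat = np.empty_like(X)
    for start in range(0, X.shape[0], block):
        blk = X[start:start + block]
        scale = np.abs(blk).max() / 127.0
        q = np.clip(np.round(blk / scale), -127, 127)
        Xq[start:start + block] = q
        Xhat[start:start + block] = q * scale      # dequantized reconstruction
    return Xq, Xhat

rng = np.random.default_rng(1)
X = rng.standard_normal((1024, 128))
X[0, 0] = 100.0                                    # heavy-tailed outlier
# With a per-tensor scale (block = full height) the outlier sets the grid for
# every entry; with small blocks only the first block pays for it.
_, Xhat_tensor = quantize_per_block(X, block=X.shape[0])
_, Xhat_block = quantize_per_block(X, block=64)
print(np.sqrt(np.mean((X - Xhat_tensor) ** 2)),    # larger reconstruction RMSE
      np.sqrt(np.mean((X - Xhat_block) ** 2)))     # smaller reconstruction RMSE
```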
3. Empirical Benchmarking on H100
Empirical performance is benchmarked using 16,384 tokens per iteration and reported as an average over at least 30 runs:
| | Standard Attention | FlashAttention-2 | FlashAttention-3 |
|---|---|---|---|
| Precision | FP16/BF16 | FP16/BF16 | FP16/BF16 |
| Head Dim | 128 | 128 | 128 |
| Seq. Length | 8,448 | 8,448 | 8,448 |
| Throughput | 50 TFLOPs/s (5%) | 370 TFLOPs/s (37%) | 661 TFLOPs/s (67%) |
| Latency | 31.2 ms | 4.02 ms | 3.54 ms |
| Speedup over FA-2 | – | – | 1.17× |
| RMSE (fwd, vs FP64) | reference | 1.7× lower | 1.7× lower |
Against the FP8 baseline, FlashAttention-3 achieves the following speedup and error reduction:
| | cuDNN FP8 | FA-3 FP8 |
|---|---|---|
| Head Dim | 256 | 256 |
| Seq. Length | 8,448 | 8,448 |
| Throughput | 640 TFLOPs/s (32%) | 1,180 TFLOPs/s (60%) |
| Speedup | – | 1.84× |
| RMSE (fwd, vs FP64) | reference | 2.6× lower |
Overall, FlashAttention-3 benchmarks against these baselines demonstrate substantial improvements in throughput and numerical error, approaching 75% utilization of H100 Tensor Cores in FP16 (67% in the configuration tabulated above) and reaching 60% in FP8, compared with 37% and 32% for the best respective baselines.
4. Numerical Error Profiles
Baseline accuracy is assessed by the root-mean-squared error (RMSE) of the forward output against an FP64 reference under a challenging heavy-tailed input distribution.
- FP16/BF16 Standard Attention: serves as the half-precision reference point.
- FP16/BF16 FlashAttention-2/3: 1.7× lower RMSE than the standard FP16/BF16 baseline.
- FP8 Baseline (per-tensor scaling): substantially higher RMSE than the FP16/BF16 variants, reflecting its sensitivity to outliers.
- FP8 FlashAttention-3: 2.6× lower RMSE than the FP8 baseline.
Ablation studies indicate the critical role of block-wise quantization and incoherent processing: removing either component markedly increases the FP8 RMSE relative to the full FlashAttention-3 recipe. A sketch of the RMSE evaluation itself is given below.
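As an illustration of how such error profiles can be measured (the exact input distribution, outlier rate, and outlier scale used by Shah et al. are assumptions of this sketch), the following code builds a heavy-tailed input, runs attention at reduced precision, and reports RMSE against an FP64 reference:

```python
import numpy as np

def attention(Q, K, V):
    """Reference attention used for both the FP64 baseline and the
    reduced-precision run (the dtype is controlled by the inputs)."""
    S = (Q @ K.T) / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    P /= P.sum(axis=-1, keepdims=True)
    return P @ V

def heavy_tailed(shape, outlier_frac=1e-3, outlier_scale=100.0, seed=0):
    """Standard-normal entries with a small fraction of large-magnitude
    outliers; the fraction and scale are illustrative assumptions."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal(shape)
    mask = rng.random(shape) < outlier_frac
    return np.where(mask, X * outlier_scale, X)

N, d = 2048, 128
Q, K, V = (heavy_tailed((N, d), seed=s) for s in range(3))
ref = attention(Q, K, V)                                     # FP64 reference
out = attention(*(x.astype(np.float16) for x in (Q, K, V)))  # low-precision run
rmse = np.sqrt(np.mean((out.astype(np.float64) - ref) ** 2))
print(f"RMSE vs FP64 reference: {rmse:.3e}")
```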
5. Theoretical Cost and Resource Utilization
Arithmetic cost per forward pass is identical across the standard and FlashAttention variants: $\mathrm{FLOPs}_{\text{fwd}} \approx 4\,h\,N^{2}\,d$ (two GEMMs, each contributing $2\,h\,N^{2}\,d$), where $h$ is the number of heads, $N$ is the sequence length, and $d$ is the per-head hidden dimension. The backward pass costs approximately $2.5\times$ the forward pass, accounting for its five GEMMs.
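For quick sanity checks, these counts can be wrapped in a small helper; the batch size and head count in the example are illustrative assumptions, not parameters reported for the benchmarks:

```python
def attention_flops(n_heads, seq_len, head_dim, batch=1):
    """Forward/backward FLOP counts for one attention layer, following the
    two-GEMM forward (Q K^T and P V) and five-GEMM backward accounting."""
    fwd = 4 * batch * n_heads * seq_len ** 2 * head_dim   # 2 GEMMs x 2*N^2*d each
    bwd = 2.5 * fwd                                       # 5 GEMMs vs. 2
    return fwd, bwd

# Example with the tabulated head dim 128 and seq. length 8,448 (16 heads assumed).
fwd, bwd = attention_flops(n_heads=16, seq_len=8448, head_dim=128)
print(f"forward ≈ {fwd / 1e12:.2f} TFLOPs, backward ≈ {bwd / 1e12:.2f} TFLOPs")
```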
Memory access patterns diverge sharply:
- Standard attention: $\Theta(N^{2})$ global loads/stores per head, due to the materialization of S and P in HBM.
- FlashAttention-2/3: only block loads for Q, K, V; S and P are never materialized globally.
Expected throughput under ideal conditions is governed by a roofline-style bound,
$$\mathrm{Throughput} \;\approx\; \min\!\left(f_{\mathrm{TC}}\,P_{\mathrm{TC}},\; f_{\mathrm{mem}}\,B_{\mathrm{mem}}\cdot I\right),$$
where $f_{\mathrm{TC}}$ is the Tensor-Core bound fraction, $f_{\mathrm{mem}}$ is the memory-bandwidth bound fraction, $P_{\mathrm{TC}}$ denotes peak Tensor-Core TFLOPs/s, $B_{\mathrm{mem}}$ is peak memory bandwidth, and $I$ is the kernel's arithmetic intensity (FLOPs per byte of HBM traffic). FlashAttention-3 increases $f_{\mathrm{TC}}$ from ∼0.35 (baseline) to ∼0.75 (FP16) and ∼0.60 (FP8), while HBM traffic remains minimized through kernel fusion.
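A small helper makes this bound concrete; the H100 SXM peak figures used below (989 TFLOPs/s FP16 Tensor-Core throughput, 3.35 TB/s HBM3 bandwidth) are public spec-sheet numbers used for illustration, and the arithmetic-intensity argument is an assumption of the sketch:

```python
def roofline_throughput(f_tc, f_mem, p_tc_tflops, b_mem_tbps, intensity_flops_per_byte):
    """Roofline-style throughput estimate (TFLOPs/s): the kernel is limited either
    by the achieved fraction of peak Tensor-Core throughput or by the achieved
    fraction of peak HBM bandwidth times its arithmetic intensity."""
    compute_bound = f_tc * p_tc_tflops
    memory_bound = f_mem * b_mem_tbps * intensity_flops_per_byte  # TB/s * FLOP/B = TFLOP/s
    return min(compute_bound, memory_bound)

# Attention at long sequence length is strongly compute-bound, so the Tensor-Core
# term dominates; raising f_tc from ~0.35 to ~0.67 roughly doubles throughput.
print(roofline_throughput(0.35, 0.8, 989, 3.35, intensity_flops_per_byte=500))  # ~346
print(roofline_throughput(0.67, 0.8, 989, 3.35, intensity_flops_per_byte=500))  # ~663
```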
6. Summary Comparison and Significance
A consolidated summary makes explicit the performance gap between baseline and FlashAttention-3 approaches:
| | Standard | FA-2 | FA-3 |
|---|---|---|---|
| Precision | FP16/BF16 | FP16/BF16 | FP16/BF16 |
| Throughput | 50 TF/s | 370 TF/s | 661 TF/s |
| Utilization | 5% | 37% | 67% |
| Latency (ms) | 31.2 | 4.02 | 3.54 |
| RMSE (vs FP64) | reference | 1.7× lower | 1.7× lower |
| | cuDNN FP8 | FA-3 FP8 |
|---|---|---|
| Throughput | 640 TF/s | 1,180 TF/s |
| Utilization | 32% | 60% |
| RMSE (vs FP64) | reference | 2.6× lower |
In summary, baseline implementations set the operational context for quantifying advances in GPU attention kernel design, including memory access optimization, hardware utilization, and low-precision arithmetic. FlashAttention-3 benchmarks show that warp-specialized asynchrony, pipelined kernel fusion, and block quantization improve on these baselines across all key dimensions, raising utilization from approximately 35% (FP16) and 32% (FP8) to as much as 75% and 60%, respectively, and yielding significant reductions in quantization error (Shah et al., 2024).