
FlashAttention-3 Baseline Benchmarks

Updated 20 December 2025
  • FlashAttention-3 Baseline is a set of reference implementations used to benchmark GPU attention mechanisms, comparing performance, memory throughput, and numerical accuracy.
  • Baseline variants differ in memory access and synchronization, with standard attention materializing full matrices while FlashAttention-2/3 fuse operations to minimize costly HBM transfers.
  • Empirical results on the H100 show FlashAttention-3 reaching up to 67% Tensor-Core utilization in FP16 and achieving significantly lower RMSE than both the FP16/BF16 and FP8 baselines.

FlashAttention-3 Baseline refers to the reference attention mechanisms and implementations against which the FlashAttention-3 algorithm is benchmarked. These baselines provide a framework for evaluating performance, hardware utilization, and numerical accuracy on modern GPUs such as the NVIDIA H100. They span standard (“naïve”) attention, FlashAttention-2, and vendor-supplied or custom FP8 quantized attention implementations, each with characteristic memory access patterns, synchronization behavior, and precision handling, as detailed for comparison in (Shah et al., 2024).

1. Standard and FlashAttention-2 Baseline Implementations

The standard (“naïve”) attention baseline (FP16/BF16) computes the score matrix $S = \alpha QK^\top$ with a single GEMM operation and materializes the complete $N \times N$ matrix in high-bandwidth memory (HBM). A dedicated kernel then computes the row-wise softmax, storing the normalized matrix $P$ in HBM, followed by a second GEMM to obtain $O = PV$. This approach requires synchronous, stepwise execution and incurs two full $N \times N$ memory passes, writing and reading both the $S$ and $P$ matrices.
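
As an illustrative sketch (a NumPy stand-in for the GPU kernels; shapes and seed are arbitrary), the baseline's two-GEMM workflow with a materialized score matrix looks like:

```python
import numpy as np

def naive_attention(Q, K, V):
    """Standard baseline: GEMM -> softmax -> GEMM, with the full
    N x N score matrix S and probability matrix P held in memory
    (the analogue of the two HBM round-trips on a GPU)."""
    d = Q.shape[-1]
    S = (Q @ K.T) / np.sqrt(d)              # first GEMM: N x N scores
    S -= S.max(axis=-1, keepdims=True)      # stabilize the softmax
    P = np.exp(S)
    P /= P.sum(axis=-1, keepdims=True)      # normalized N x N matrix
    return P @ V                            # second GEMM

rng = np.random.default_rng(0)
N, d = 64, 16
Q, K, V = rng.standard_normal((3, N, d))
O = naive_attention(Q, K, V)
```

Because each output row is a convex combination of the rows of $V$, the output stays within the column-wise range of $V$; the point of the sketch is that `S` and `P` are full $N \times N$ arrays, which is exactly what the fused kernels avoid.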

In contrast, FlashAttention-2 (FP16/BF16) tiles the computation across blocks, fusing both GEMMs and the local softmax into a single CUDA kernel and using a circular buffer in shared memory to eliminate HBM reads/writes for $S$ and $P$. Producer warps synchronously copy $Q$, $K$, $V$ blocks into shared memory, while consumer warps complete computations in strict lock-step, synchronizing within a block via barriers. No overlap occurs between softmax and GEMM across different blocks. Hardware utilization on the H100 is limited, achieving ~35% of peak Tensor-Core throughput.
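
The tiling idea can be sketched with an online (running) softmax in NumPy; the block size and shapes below are illustrative, and real kernels operate on shared-memory tiles with FP32 accumulators:

```python
import numpy as np

def flash_like_attention(Q, K, V, block=16):
    """Tiled attention with an online softmax: score and probability
    tiles exist only block-by-block, never as full N x N matrices."""
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros_like(Q)
    m = np.full(N, -np.inf)            # running row-wise max
    l = np.zeros(N)                    # running softmax denominator
    for j in range(0, N, block):
        Kb, Vb = K[j:j + block], V[j:j + block]
        S = (Q @ Kb.T) * scale         # N x block tile only
        m_new = np.maximum(m, S.max(axis=1))
        alpha = np.exp(m - m_new)      # rescale previously accumulated state
        P = np.exp(S - m_new[:, None])
        l = l * alpha + P.sum(axis=1)
        O = O * alpha[:, None] + P @ Vb
        m = m_new
    return O / l[:, None]

rng = np.random.default_rng(0)
N, d = 64, 16
Q, K, V = rng.standard_normal((3, N, d))
O_tiled = flash_like_attention(Q, K, V)

# reference: full-matrix softmax attention, for comparison
S_full = (Q @ K.T) / np.sqrt(d)
P_full = np.exp(S_full - S_full.max(axis=1, keepdims=True))
O_ref = (P_full / P_full.sum(axis=1, keepdims=True)) @ V
```

The rescaling by `alpha` is what lets the kernel process key/value blocks one at a time while producing output identical to the full-matrix softmax.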

For FP8 attention, the baseline relies primarily on vendor-library routines (e.g., cuDNN 9), applying per-tensor quantization with a single scale factor each for $Q$, $K$, $V$. GEMMs are mixed precision (FP8 inputs with FP32 accumulators), and the output is typically cast to FP16 for the softmax, following the same two-step (materialize, softmax, multiply) workflow. These FP8 baselines exhibit increased sensitivity to outliers and implement neither block-wise quantization nor incoherent processing.
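
The outlier sensitivity of per-tensor scaling can be sketched as follows; a uniform integer grid is used as a crude stand-in for the actual FP8 e4m3 format, and the sizes and outlier magnitude are illustrative:

```python
import numpy as np

def fake_quant_per_tensor(x, levels=256):
    """One scale for the whole tensor, as in the per-tensor FP8 baseline.
    (Uniform grid here, NOT the real FP8 format.)"""
    scale = np.abs(x).max() / (levels // 2 - 1)
    return np.round(x / scale) * scale

rng = np.random.default_rng(1)
x = rng.standard_normal(4096)         # well-behaved activations
x_outlier = x.copy()
x_outlier[0] = 100.0                  # one heavy-tailed outlier

# the single outlier inflates the one shared scale by more than an
# order of magnitude, coarsening the grid for every other element
err_clean = np.abs(fake_quant_per_tensor(x) - x).mean()
err_dirty = np.abs(fake_quant_per_tensor(x_outlier)[1:] - x_outlier[1:]).mean()
```

With one scale per tensor, a single extreme value degrades the representable resolution of all remaining entries, which is the failure mode block-wise quantization addresses.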

2. Key Algorithmic and Practical Differences

Baseline methods have sharply divergent approaches to computation and data movement:

  • Memory Traffic: Standard attention incurs the highest cost, streaming $2N^2$ elements for both $S$ and $P$. FlashAttention variants avoid materializing these matrices in HBM, transferring only the necessary blocks to shared memory.
  • Synchronization Model: The standard baseline is strictly sequential. FlashAttention-2 uses intra-block barriers but remains synchronous at the block level, with no computation/data movement overlap. FP8 baselines operate analogously to standard attention with no pipelining or asynchrony.
  • Precision and Layout: Standard and FlashAttention-2 operate in FP16/BF16, with accumulations in FP32 and softmax evaluated in FP32 for numerical stability. FP8 baselines quantize entire tensors, increasing error susceptibility due to quantization of outliers and imposing layout constraints (k-major tiling on FP8 Tensor Cores).
  • Kernel Fusion and Pipelining: Baseline methods, including FlashAttention-2, do not utilize streaming asynchronous copies (TMA), warp specialization, pipelined GEMM-softmax overlap, or block quantization, all of which are introduced in FlashAttention-3.

3. Empirical Benchmarking on H100

Empirical performance is benchmarked using 16,384 tokens per iteration and reported as an average over at least 30 runs:

| | Standard Attention | FlashAttention-2 | FlashAttention-3 |
|---|---|---|---|
| Precision | FP16/BF16 | FP16/BF16 | FP16/BF16 |
| Head Dim | 128 | 128 | 128 |
| Seq. Length | 8,448 | 8,448 | 8,448 |
| Throughput | 50 TFLOPs/s (5%) | 370 TFLOPs/s (37%) | 661 TFLOPs/s (67%) |
| Latency | 31.2 ms | 4.02 ms | 3.54 ms |
| Speedup over FA-2 | | | 1.17× |
| RMSE (FP16) | $3.2 \times 10^{-4}$ | $1.9 \times 10^{-4}$ | $1.9 \times 10^{-4}$ |

For the FP8 baseline, FlashAttention-3 achieves the following speedup and error reduction:

| | cuDNN FP8 | FA-3 FP8 |
|---|---|---|
| Head Dim | 256 | 256 |
| Seq. Length | 8,448 | 8,448 |
| Throughput | 640 TFLOPs/s (32%) | 1,180 TFLOPs/s (60%) |
| Speedup | | 1.84× |
| RMSE (FP8) | $2.4 \times 10^{-2}$ | $9.1 \times 10^{-3}$ |

Overall, benchmarking FlashAttention-3 against these baselines demonstrates substantial improvements in throughput and numerical error, reaching 67% utilization of H100 Tensor Cores in the benchmarked FP16 setting (up to approximately 75% reported) and 60% in FP8, compared to 35% and 32% for the respective best baselines.

4. Numerical Error Profiles

Baseline accuracy is assessed by the root-mean-squared error (RMSE) of the forward output $O$ against an FP64 reference under a challenging heavy-tailed input, $X \sim \mathcal{N}(0,1) + \mathcal{N}(0,100)\cdot \mathrm{Bernoulli}(0.001)$.
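
A sketch of this evaluation setup (shapes and the FP32-vs-FP64 comparison are illustrative, not the paper's exact harness):

```python
import numpy as np

rng = np.random.default_rng(0)

def heavy_tailed(shape, rng):
    """X ~ N(0,1) + N(0,100) * Bernoulli(0.001): unit-variance noise plus
    rare large outliers (variance 100, i.e. std 10) at rate 0.1%."""
    return (rng.standard_normal(shape)
            + 10.0 * rng.standard_normal(shape) * (rng.random(shape) < 1e-3))

def rmse(a, ref):
    return float(np.sqrt(np.mean((a - ref) ** 2)))

X = heavy_tailed((1024, 64), rng)
outlier_frac = np.mean(np.abs(X) > 5)   # only a small fraction of entries spike
# e.g. measure the error a lower-precision copy incurs against FP64:
err = rmse(X.astype(np.float32).astype(np.float64), X)
```

The rare but large spikes are what stress low-precision formats: most entries are well-scaled, but the scale of the worst entry dominates quantization behavior.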

  • FP16/BF16 Standard Attention: RMSE = $3.2 \times 10^{-4}$.
  • FP16/BF16 FlashAttention-2/3: RMSE = $1.9 \times 10^{-4}$ (1.7× lower).
  • FP8 Baseline: RMSE = $2.4 \times 10^{-2}$.
  • FP8 FlashAttention-3: RMSE = $9.1 \times 10^{-3}$ (2.6× lower).

Ablation studies indicate the critical role of block-wise quantization and incoherent processing: removing block quantization yields an RMSE of $9.3 \times 10^{-3}$, and omitting incoherent processing increases RMSE to $2.4 \times 10^{-2}$.
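
The benefit of block-wise scaling can be sketched as follows; the row-block granularity, uniform integer grid, and outlier placement are all illustrative assumptions, not the FA-3 kernel's actual scheme:

```python
import numpy as np

def fake_quant(x, scale, levels=256):
    half = levels // 2
    return np.clip(np.round(x / scale), -half, half - 1) * scale

def quant_per_tensor(x):
    """Single scale for the whole tensor (the baseline behavior)."""
    return fake_quant(x, np.abs(x).max() / 127)

def quant_per_block(x, block=64):
    """One scale per block of rows: an outlier only coarsens the grid
    inside its own block, leaving the rest of the tensor unaffected."""
    out = np.empty_like(x)
    for i in range(0, x.shape[0], block):
        xb = x[i:i + block]
        out[i:i + block] = fake_quant(xb, np.abs(xb).max() / 127)
    return out

rng = np.random.default_rng(0)
X = rng.standard_normal((1024, 64))
X[0, 0] = 100.0                       # single heavy-tailed outlier

def rmse(a, ref):
    return float(np.sqrt(np.mean((a - ref) ** 2)))

e_tensor = rmse(quant_per_tensor(X), X)
e_block = rmse(quant_per_block(X), X)
```

Confining the outlier's influence to its own block leaves most of the tensor quantized on a fine grid, which is the mechanism behind the RMSE gap reported in the ablation.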

5. Theoretical Cost and Resource Utilization

Arithmetic cost per forward pass is identical across standard and FlashAttention variants: $\text{FLOPs}_{\text{forward}} = 4hN^2d$, where $h$ is the number of heads, $N$ is the sequence length, and $d$ is the per-head hidden dimension. For the backward pass, the cost is approximately $2.5\times$ the forward cost, accounting for five GEMMs.
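
Plugging in the benchmarked setting ($d = 128$, $N = 8{,}448$) with an assumed head count of $h = 16$ (not stated in the tables above) gives a concrete sense of scale:

```python
def attn_flops_forward(h, N, d):
    """Two N x N GEMMs (QK^T and PV) per head, 2 * N^2 * d FLOPs each."""
    return 4 * h * N * N * d

def attn_flops_backward(h, N, d):
    """~2.5x the forward cost: five GEMMs in the backward pass vs two."""
    return 2.5 * attn_flops_forward(h, N, d)

# h=16 heads is an illustrative assumption; N and d match the benchmark
fwd = attn_flops_forward(16, 8448, 128)   # roughly 5.8e11 FLOPs
```

At this scale, a kernel sustaining 661 TFLOPs/s finishes the forward arithmetic in about a millisecond, consistent with the latency figures in Section 3.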

Memory access patterns diverge sharply:

  • Standard attention: $\sim 4N^2 + 3Nd$ global loads/stores due to materialization of $S$ and $P$ in HBM.
  • FlashAttention-2/3: only $(N/B_r) + (N/B_c)$ block loads for $Q$, $K$, $V$; $S$ and $P$ are never materialized globally.
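
As a back-of-envelope sketch of the dominant term (per head, FP16, ignoring the $Q$, $K$, $V$ streaming that both approaches share):

```python
def materialized_traffic_elements(N):
    """Element accesses the standard baseline spends on S and P alone:
    each N x N matrix is written once and read once (4 * N^2 total).
    Tiled FlashAttention kernels avoid this term entirely."""
    return 4 * N * N

N = 8448
avoided_bytes = materialized_traffic_elements(N) * 2   # FP16: 2 bytes/element
# ~0.57 GB of avoidable HBM traffic per head per forward pass
```

Multiplied across heads and iterations, this avoided $4N^2$ term is what makes the standard baseline memory-bound at long sequence lengths.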

Expected throughput under ideal conditions is governed by

$$T_{\mathrm{attn}} \approx \frac{4N^2dh}{\rho \cdot P_{\mathrm{TC}} + \beta \cdot B_{\mathrm{mem}}}$$

where $\rho$ is the Tensor-Core-bound fraction, $\beta$ is the memory-bandwidth-bound fraction, $P_{\mathrm{TC}}$ denotes peak Tensor-Core TFLOPs/s, and $B_{\mathrm{mem}}$ is peak memory bandwidth. FlashAttention-3 increases $\rho$ from ~0.35 (baseline) to ~0.75 (FP16) and ~0.60 (FP8), while $\beta$ remains minimized through kernel fusion.
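
A minimal sketch of this model, read as a time estimate; the ~989 TFLOPs/s FP16 Tensor-Core peak and $h = 16$ are assumed figures, not values from the tables above:

```python
def attn_time_estimate(N, d, h, rho, P_tc, beta=0.0, B_mem=3.35e12):
    """Roofline-style model from the text:
    T_attn ~ 4 * N^2 * d * h / (rho * P_TC + beta * B_mem)."""
    return 4 * N * N * d * h / (rho * P_tc + beta * B_mem)

P_tc = 989e12   # assumed H100 SXM FP16 Tensor-Core peak, TFLOPs/s
t_fa2 = attn_time_estimate(8448, 128, 16, rho=0.35, P_tc=P_tc)
t_fa3 = attn_time_estimate(8448, 128, 16, rho=0.75, P_tc=P_tc)
# with beta = 0, raising rho from 0.35 to 0.75 shortens the estimated
# time by exactly the ratio 0.75 / 0.35, i.e. about 2.1x
```

With the memory term fully fused away ($\beta \approx 0$), the model predicts speedups proportional to the utilization gain, which is roughly what the measured FP16 latencies show.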

6. Summary Comparison and Significance

A consolidated summary makes explicit the performance gap between baseline and FlashAttention-3 approaches:

| | Standard | FA-2 | FA-3 |
|---|---|---|---|
| Precision | FP16/BF16 | FP16/BF16 | FP16/BF16 |
| Throughput | 50 TF/s | 370 TF/s | 661 TF/s |
| Utilization | 5% | 37% | 67% |
| Latency (ms) | 31.2 | 4.02 | 3.54 |
| RMSE | $3.2 \times 10^{-4}$ | $1.9 \times 10^{-4}$ | $1.9 \times 10^{-4}$ |

| | cuDNN FP8 | FA-3 FP8 |
|---|---|---|
| Throughput | 640 TF/s | 1,180 TF/s |
| Utilization | 32% | 60% |
| RMSE | $2.4 \times 10^{-2}$ | $9.1 \times 10^{-3}$ |

In summary, baseline implementations set the operational context for quantifying advances in GPU attention kernel design, including memory access optimization, hardware utilization, and low-precision arithmetic. The FlashAttention-3 benchmarks show that warp-specialized asynchrony, pipelined kernel fusion, and block quantization yield improvements over these baselines across all key dimensions, raising utilization from approximately 35% (FP16) and 32% (FP8) to 75% and 60%, respectively, and delivering significant reductions in quantization error (Shah et al., 2024).
