FlashAttention-3 Baseline Benchmarks
- FlashAttention-3 Baseline is a set of reference implementations used to benchmark GPU attention mechanisms, comparing performance, memory throughput, and numerical accuracy.
- Baseline variants differ in memory access and synchronization, with standard attention materializing full matrices while FlashAttention-2/3 fuse operations to minimize costly HBM transfers.
- Empirical results on the H100 show FlashAttention-3 achieving up to 67% Tensor-Core utilization and significantly lower RMSE compared to FP16/BF16 and FP8 baselines.
FlashAttention-3 Baseline refers to the reference attention mechanisms and implementations against which the FlashAttention-3 algorithm is benchmarked. These baselines provide a framework for evaluating performance, hardware utilization, and numerical accuracy on modern GPUs such as the NVIDIA H100. They span standard ("naïve") attention, FlashAttention-2, and vendor-supplied or custom FP8 quantized attention implementations, each with distinct memory access patterns, synchronization behavior, and precision handling, as detailed for comparison in (Shah et al., 2024).
1. Standard and FlashAttention-2 Baseline Implementations
The standard (“naïve”) attention baseline (FP16/BF16) computes the score matrix $S = QK^\top$ with a single GEMM and materializes the complete $N \times N$ matrix in high-bandwidth memory (HBM). A dedicated kernel then computes the row-wise softmax, storing the normalized matrix $P = \mathrm{softmax}(S)$ in HBM, followed by a second GEMM to obtain the output $O = PV$. This approach requires synchronous, stepwise execution and incurs two full memory passes, writing and reading both the S and P matrices.
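As a concrete illustration of this materialize-softmax-multiply workflow (a minimal NumPy sketch, not the benchmarked CUDA kernels; shapes and the $1/\sqrt{d}$ scaling follow the usual attention convention):

```python
import numpy as np

def standard_attention(Q, K, V):
    """Naive attention baseline: S and P are fully materialized,
    mirroring the two extra HBM round trips of the standard kernels."""
    d = Q.shape[-1]
    # GEMM 1: score matrix S = Q K^T / sqrt(d), written out in full (N x N).
    S = Q @ K.T / np.sqrt(d)
    # Separate softmax pass: read S, write the normalized matrix P.
    S_max = S.max(axis=-1, keepdims=True)          # subtract row max for stability
    P = np.exp(S - S_max)
    P /= P.sum(axis=-1, keepdims=True)
    # GEMM 2: read P again to form the output O = P V.
    return P @ V

# Tiny usage example with random inputs.
rng = np.random.default_rng(0)
N, d = 512, 128
Q, K, V = (rng.standard_normal((N, d), dtype=np.float32) for _ in range(3))
O = standard_attention(Q, K, V)
```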
In contrast, FlashAttention-2 (FP16/BF16) tiles the computation across blocks, fusing both GEMMs and the local softmax into a single CUDA kernel and using a circular buffer in shared memory to eliminate HBM reads/writes for S and P. Producer warps synchronously copy Q, K, V blocks into shared memory, while consumer warps complete computations in strict lock-step, synchronizing intra-block via barriers. No overlap occurs between softmax and GEMM across different blocks, and hardware utilization on the H100 is limited to roughly 35% of peak Tensor-Core throughput.
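The block-tiled, online-softmax idea behind FlashAttention-2 can be sketched in NumPy as follows; this is a numerical illustration only (block size, kernel fusion, and shared-memory buffering are properties of the real CUDA kernel, not of this sketch):

```python
import numpy as np

def flash_attention_reference(Q, K, V, block=128):
    """Tiled attention with an online softmax: K/V are visited block by block,
    and only running statistics (row max m, normalizer l, partial output acc)
    are kept, so the full S and P matrices are never formed."""
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    m = np.full((N, 1), -np.inf)        # running row-wise max of scores
    l = np.zeros((N, 1))                # running softmax normalizer
    acc = np.zeros((N, d))              # running (unnormalized) output
    for j in range(0, N, block):
        Kj, Vj = K[j:j + block], V[j:j + block]
        Sj = (Q @ Kj.T) * scale         # scores for this K/V block only
        m_new = np.maximum(m, Sj.max(axis=-1, keepdims=True))
        alpha = np.exp(m - m_new)       # rescale previously accumulated statistics
        Pj = np.exp(Sj - m_new)
        l = alpha * l + Pj.sum(axis=-1, keepdims=True)
        acc = alpha * acc + Pj @ Vj
        m = m_new
    return acc / l
```

Up to floating-point rounding this matches the fully materialized reference, which is exactly the equivalence that the RMSE comparisons in Section 4 quantify at lower precision.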
For FP8 attention, the baseline relies primarily on vendor-library routines (e.g., cuDNN 9), applying per-tensor quantization with a single scale factor per tensor for Q, K, and V. GEMMs are mixed precision, FP8 inputs with FP32 accumulators, and the output is typically cast to FP16 for the softmax, following the same materialize-softmax-multiply workflow. These FP8 baselines exhibit increased sensitivity to outliers and do not implement block-wise quantization or incoherent processing.
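A NumPy sketch of this vendor-style per-tensor FP8 flow is given below. NumPy has no FP8 dtype, so an int8-style uniform grid stands in for the FP8 cast, and the scale rule (global absolute maximum) is the straightforward per-tensor choice the text describes; both are assumptions of the illustration:

```python
import numpy as np

def quantize_per_tensor(X, levels=127):
    """Per-tensor quantization as in the FP8 baseline: one scale for the whole
    tensor, taken from its global absolute maximum. A single outlier therefore
    inflates the scale and coarsens the grid for every other entry."""
    scale = np.abs(X).max() / levels
    Xq = np.clip(np.round(X / scale), -levels, levels).astype(np.float32)
    return Xq, np.float32(scale)

def fp8_baseline_attention(Q, K, V):
    """Mixed-precision flow of the vendor-style FP8 baseline: quantized inputs,
    FP32 accumulation in both GEMMs, softmax evaluated after an FP16 cast."""
    d = Q.shape[-1]
    (Qq, sq), (Kq, sk), (Vq, sv) = (quantize_per_tensor(X) for X in (Q, K, V))
    S = (Qq @ Kq.T) * (sq * sk / np.sqrt(d))   # "FP8" GEMM with FP32 accumulators
    S16 = S.astype(np.float16)                 # output cast to FP16 for the softmax
    P = np.exp(S16 - S16.max(axis=-1, keepdims=True)).astype(np.float32)
    P /= P.sum(axis=-1, keepdims=True)
    return (P @ Vq) * sv                       # second GEMM, then rescale by V's factor
```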
2. Key Algorithmic and Practical Differences
Baseline methods have sharply divergent approaches to computation and data movement:
- Memory Traffic: Standard attention incurs the highest cost, streaming the full $N \times N$ score and probability matrices (S and P) to and from HBM. FlashAttention variants avoid materializing these matrices in HBM, transferring only the necessary blocks to shared memory.
- Synchronization Model: The standard baseline is strictly sequential. FlashAttention-2 uses intra-block barriers but remains synchronous at the block level, with no computation/data movement overlap. FP8 baselines operate analogously to standard attention with no pipelining or asynchrony.
- Precision and Layout: Standard and FlashAttention-2 operate in FP16/BF16, with accumulations in FP32 and softmax evaluated in FP32 for numerical stability. FP8 baselines quantize entire tensors, increasing error susceptibility due to quantization of outliers and imposing layout constraints (k-major tiling on FP8 Tensor Cores).
- Kernel Fusion and Pipelining: Baseline methods, including FlashAttention-2, do not utilize streaming asynchronous copies (TMA), warp specialization, pipelined GEMM-softmax overlap, or block quantization, all of which are introduced in FlashAttention-3; a sketch contrasting per-tensor and block-wise quantization follows this list.
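To make the contrast between per-tensor and block-wise scaling concrete, the sketch below assigns one scale per row block; the block size, scale rule, and the int8-style grid standing in for FP8 are assumptions of the illustration, not details of the benchmarked kernels:

```python
import numpy as np

def quantize_per_block(X, block=64):
    """Block-wise quantization: each row block gets its own scale, so a single
    outlier only degrades resolution inside its own block. An 8-bit integer
    grid stands in for FP8 here, since NumPy has no FP8 dtype."""
    Xq = np.empty_like(X)
    Xhat = np.empty_like(X)
    for start in range(0, X.shape[0], block):
        blk = X[start:start + block]
        scale = np.abs(blk).max() / 127.0
        q = np.clip(np.round(blk / scale), -127, 127)
        Xq[start:start + block] = q
        Xhat[start:start + block] = q * scale      # dequantized reconstruction
    return Xq, Xhat

rng = np.random.default_rng(1)
X = rng.standard_normal((1024, 128))
X[0, 0] = 100.0                                    # heavy-tailed outlier
# With a per-tensor scale (block = full height) the outlier sets the grid for
# every entry; with small blocks only the first block pays for it.
_, Xhat_tensor = quantize_per_block(X, block=X.shape[0])
_, Xhat_block = quantize_per_block(X, block=64)
print(np.sqrt(np.mean((X - Xhat_tensor) ** 2)),    # larger reconstruction RMSE
      np.sqrt(np.mean((X - Xhat_block) ** 2)))     # smaller reconstruction RMSE
```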
3. Empirical Benchmarking on H100
Empirical performance is benchmarked using 16,384 tokens per iteration and reported as an average over at least 30 runs:
| | Standard Attention | FlashAttention-2 | FlashAttention-3 |
|---|---|---|---|
| Precision | FP16/BF16 | FP16/BF16 | FP16/BF16 |
| Head Dim | 128 | 128 | 128 |
| Seq. Length | 8,448 | 8,448 | 8,448 |
| Throughput | 50 TFLOPs/s (5%) | 370 TFLOPs/s (37%) | 661 TFLOPs/s (67%) |
| Latency | 31.2 ms | 4.02 ms | 3.54 ms |
| Speedup over FA-2 | – | – | 1.17× |
| RMSE (fwd, vs FP64) | reference | 1.7× lower | 1.7× lower |
Against the FP8 baseline, FlashAttention-3 achieves the following speedup and error reduction:
| | cuDNN FP8 | FA-3 FP8 |
|---|---|---|
| Head Dim | 256 | 256 |
| Seq. Length | 8,448 | 8,448 |
| Throughput | 640 TFLOPs/s (32%) | 1,180 TFLOPs/s (60%) |
| Speedup | – | 1.84× |
| RMSE (fwd, vs FP64) | reference | 2.6× lower |
Overall, FlashAttention-3 benchmarks against these baselines demonstrate substantial improvements in throughput and numerical error, approaching 75% utilization of H100 Tensor Cores in FP16 (67% in the configuration tabulated above) and reaching 60% in FP8, compared with 37% and 32% for the best respective baselines.
4. Numerical Error Profiles
Baseline accuracy is assessed by the root-mean-squared error (RMSE) of the forward output against an FP64 reference under a challenging heavy-tailed input distribution.
- FP16/BF16 Standard Attention: serves as the half-precision reference point.
- FP16/BF16 FlashAttention-2/3: 1.7× lower RMSE than the standard FP16/BF16 baseline.
- FP8 Baseline (per-tensor scaling): substantially higher RMSE than the FP16/BF16 variants, reflecting its sensitivity to outliers.
- FP8 FlashAttention-3: 2.6× lower RMSE than the FP8 baseline.
Ablation studies indicate the critical role of block-wise quantization and incoherent processing: removing either component markedly increases the FP8 RMSE relative to the full FlashAttention-3 recipe. A sketch of the RMSE evaluation itself is given below.
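As an illustration of how such error profiles can be measured (the exact input distribution, outlier rate, and outlier scale used by Shah et al. are assumptions of this sketch), the following code builds a heavy-tailed input, runs attention at reduced precision, and reports RMSE against an FP64 reference:

```python
import numpy as np

def attention(Q, K, V):
    """Reference attention used for both the FP64 baseline and the
    reduced-precision run (the dtype is controlled by the inputs)."""
    S = (Q @ K.T) / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    P /= P.sum(axis=-1, keepdims=True)
    return P @ V

def heavy_tailed(shape, outlier_frac=1e-3, outlier_scale=100.0, seed=0):
    """Standard-normal entries with a small fraction of large-magnitude
    outliers; the fraction and scale are illustrative assumptions."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal(shape)
    mask = rng.random(shape) < outlier_frac
    return np.where(mask, X * outlier_scale, X)

N, d = 2048, 128
Q, K, V = (heavy_tailed((N, d), seed=s) for s in range(3))
ref = attention(Q, K, V)                                     # FP64 reference
out = attention(*(x.astype(np.float16) for x in (Q, K, V)))  # low-precision run
rmse = np.sqrt(np.mean((out.astype(np.float64) - ref) ** 2))
print(f"RMSE vs FP64 reference: {rmse:.3e}")
```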
5. Theoretical Cost and Resource Utilization
Arithmetic cost per forward pass is identical across the standard and FlashAttention variants: $\mathrm{FLOPs}_{\text{fwd}} \approx 4\,h\,N^{2}\,d$ (two GEMMs, each contributing $2\,h\,N^{2}\,d$), where $h$ is the number of heads, $N$ is the sequence length, and $d$ is the per-head hidden dimension. The backward pass costs approximately $2.5\times$ the forward pass, accounting for its five GEMMs.
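For quick sanity checks, these counts can be wrapped in a small helper; the batch size and head count in the example are illustrative assumptions, not parameters reported for the benchmarks:

```python
def attention_flops(n_heads, seq_len, head_dim, batch=1):
    """Forward/backward FLOP counts for one attention layer, following the
    two-GEMM forward (Q K^T and P V) and five-GEMM backward accounting."""
    fwd = 4 * batch * n_heads * seq_len ** 2 * head_dim   # 2 GEMMs x 2*N^2*d each
    bwd = 2.5 * fwd                                       # 5 GEMMs vs. 2
    return fwd, bwd

# Example with the tabulated head dim 128 and seq. length 8,448 (16 heads assumed).
fwd, bwd = attention_flops(n_heads=16, seq_len=8448, head_dim=128)
print(f"forward ≈ {fwd / 1e12:.2f} TFLOPs, backward ≈ {bwd / 1e12:.2f} TFLOPs")
```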
Memory access patterns diverge sharply:
- Standard attention: $\Theta(N^{2})$ global loads/stores per head, due to the materialization of S and P in HBM.
- FlashAttention-2/3: only block loads for Q, K, V; S and P are never materialized globally.
Expected throughput under ideal conditions is governed by a roofline-style bound,
$$\mathrm{Throughput} \;\approx\; \min\!\left(f_{\mathrm{TC}}\,P_{\mathrm{TC}},\; f_{\mathrm{mem}}\,B_{\mathrm{mem}}\cdot I\right),$$
where $f_{\mathrm{TC}}$ is the Tensor-Core bound fraction, $f_{\mathrm{mem}}$ is the memory-bandwidth bound fraction, $P_{\mathrm{TC}}$ denotes peak Tensor-Core TFLOPs/s, $B_{\mathrm{mem}}$ is peak memory bandwidth, and $I$ is the kernel's arithmetic intensity (FLOPs per byte of HBM traffic). FlashAttention-3 increases $f_{\mathrm{TC}}$ from ∼0.35 (baseline) to ∼0.75 (FP16) and ∼0.60 (FP8), while HBM traffic remains minimized through kernel fusion.
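A small helper makes this bound concrete; the H100 SXM peak figures used below (989 TFLOPs/s FP16 Tensor-Core throughput, 3.35 TB/s HBM3 bandwidth) are public spec-sheet numbers used for illustration, and the arithmetic-intensity argument is an assumption of the sketch:

```python
def roofline_throughput(f_tc, f_mem, p_tc_tflops, b_mem_tbps, intensity_flops_per_byte):
    """Roofline-style throughput estimate (TFLOPs/s): the kernel is limited either
    by the achieved fraction of peak Tensor-Core throughput or by the achieved
    fraction of peak HBM bandwidth times its arithmetic intensity."""
    compute_bound = f_tc * p_tc_tflops
    memory_bound = f_mem * b_mem_tbps * intensity_flops_per_byte  # TB/s * FLOP/B = TFLOP/s
    return min(compute_bound, memory_bound)

# Attention at long sequence length is strongly compute-bound, so the Tensor-Core
# term dominates; raising f_tc from ~0.35 to ~0.67 roughly doubles throughput.
print(roofline_throughput(0.35, 0.8, 989, 3.35, intensity_flops_per_byte=500))  # ~346
print(roofline_throughput(0.67, 0.8, 989, 3.35, intensity_flops_per_byte=500))  # ~663
```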
6. Summary Comparison and Significance
A consolidated summary makes explicit the performance gap between baseline and FlashAttention-3 approaches:
| | Standard | FA-2 | FA-3 |
|---|---|---|---|
| Precision | FP16/BF16 | FP16/BF16 | FP16/BF16 |
| Throughput | 50 TF/s | 370 TF/s | 661 TF/s |
| Utilization | 5% | 37% | 67% |
| Latency (ms) | 31.2 | 4.02 | 3.54 |
| RMSE (vs FP64) | reference | 1.7× lower | 1.7× lower |
| | cuDNN FP8 | FA-3 FP8 |
|---|---|---|
| Throughput | 640 TF/s | 1,180 TF/s |
| Utilization | 32% | 60% |
| RMSE (vs FP64) | reference | 2.6× lower |
In summary, baseline implementations set the operational context for quantifying advances in GPU attention kernel design, including memory access optimization, hardware utilization, and low-precision arithmetic. FlashAttention-3 benchmarks show that warp-specialized asynchrony, pipelined kernel fusion, and block quantization improve on these baselines across all key dimensions, raising utilization from approximately 35% (FP16) and 32% (FP8) to as much as 75% and 60%, respectively, and yielding significant reductions in quantization error (Shah et al., 2024).