SnapMLA: Efficient Long-Context MLA Decoding via Hardware-Aware FP8 Quantized Pipelining

Published 11 Feb 2026 in cs.LG and cs.CL | (2602.10718v1)

Abstract: While FP8 attention has shown substantial promise in innovations like FlashAttention-3, its integration into the decoding phase of the DeepSeek Multi-head Latent Attention (MLA) architecture presents notable challenges. These challenges include numerical heterogeneity arising from the decoupling of positional embeddings, misalignment of quantization scales in FP8 PV GEMM, and the need for optimized system-level support. In this paper, we introduce SnapMLA, an FP8 MLA decoding framework optimized to improve long-context efficiency through the following hardware-aware algorithm-kernel co-optimization techniques: (i) RoPE-Aware Per-Token KV Quantization, where the RoPE part is maintained in high precision, motivated by our comprehensive analysis of the heterogeneous quantization sensitivity inherent to the MLA KV cache. Furthermore, per-token granularity is employed to align with the autoregressive decoding process and maintain quantization accuracy. (ii) Quantized PV Computation Pipeline Reconstruction, which resolves the misalignment of quantization scale in FP8 PV computation stemming from the shared KV structure of the MLA KV cache. (iii) End-to-End Dataflow Optimization, where we establish an efficient data read-and-write workflow using specialized kernels, ensuring efficient data flow and performance gains. Extensive experiments on state-of-the-art MLA LLMs show that SnapMLA achieves up to a 1.91x improvement in throughput, with negligible risk of performance degradation in challenging long-context tasks, including mathematical reasoning and code generation benchmarks. Code is available at https://github.com/meituan-longcat/SGLang-FluentLLM.

Summary

The paper presents a RoPE-aware per-token quantization method that minimizes numerical error and maintains high fidelity in deep layers.
The paper develops a quantized PV computation pipeline using scale fusion to streamline dequantization and exploit hardware tiling for efficient memory access.
The paper implements end-to-end dataflow optimization that delivers up to a 1.91× throughput speedup with negligible accuracy loss across various LLM benchmarks.

SnapMLA: Hardware-Aware FP8 Quantized Pipelining for Efficient Long-Context MLA Decoding

Introduction and Motivation

Efficient inference of LLMs with extended context lengths poses substantial computational and memory challenges, particularly for attention mechanisms that rely on large Key-Value (KV) caches. Integrating FP8 attention into the decoding phase of the DeepSeek Multi-head Latent Attention (MLA) architecture is impeded by three obstacles: numerical heterogeneity caused by the decoupled positional embeddings, misaligned quantization scales in the FP8 PV GEMM, and the need for optimized system-level support. SnapMLA is a hardware-aware, algorithm-kernel co-optimized framework that targets these obstacles with three techniques: (a) RoPE-Aware Per-Token KV Quantization; (b) Quantized PV Computation Pipeline Reconstruction; and (c) End-to-End Dataflow Optimization. Together they yield up to a 1.91× throughput improvement with negligible performance degradation on long-context benchmarks, including mathematical reasoning and code generation.

Technical Innovations

RoPE-Aware Per-Token KV Quantization

MLA's KV cache comprises a compressed latent content vector and a decoupled Rotary Position Embedding (RoPE) component. Empirical analysis shows a marked divergence in their dynamic ranges: RoPE values exhibit outlier tails with magnitudes of up to $10^3$, while content values are tightly centered around zero. Uniform FP8 quantization is therefore inadequate: it significantly amplifies the MSE of the RoPE part, especially in deeper layers.

SnapMLA implements RoPE-Aware Per-Token Quantization, restricting FP8 quantization to the content dimension and preserving RoPE in higher precision (BF16). Quantization granularity is set to per-token, facilitating immediate quantization in autoregressive decoding, eliminating buffer management complexity typical of block-wise approaches, and maintaining compatibility with contemporary inference frameworks.
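The idea above can be sketched in NumPy. This is a minimal simulation, not the paper's kernel: NumPy has no FP8 or BF16 dtype, so `fake_fp8_e4m3` (a hypothetical helper) rounds float32 values onto the E4M3 grid, float32 stands in for BF16, and the cache layout (content first, RoPE in the last `rope_dim` columns) is an assumption made for illustration.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite value of the FP8 E4M3 format


def fake_fp8_e4m3(x):
    """Round values to the nearest E4M3-representable value (simulation only)."""
    x = np.clip(np.asarray(x, dtype=np.float32), -FP8_E4M3_MAX, FP8_E4M3_MAX)
    _, e = np.frexp(x)                    # x = m * 2**e with 0.5 <= |m| < 1
    exp = np.maximum(e - 1, -6)           # clamp to the minimum normal exponent
    quantum = np.ldexp(np.ones_like(x), exp - 3)  # grid spacing for 3 mantissa bits
    return (np.round(x / quantum) * quantum).astype(np.float32)


def quantize_mla_kv(kv, rope_dim):
    """kv: (tokens, content_dim + rope_dim). Returns (content_q, scale, rope)."""
    content, rope = kv[:, :-rope_dim], kv[:, -rope_dim:]
    amax = np.abs(content).max(axis=1, keepdims=True)
    scale = np.maximum(amax, 1e-12) / FP8_E4M3_MAX  # one scale per token
    content_q = fake_fp8_e4m3(content / scale)      # only the content is quantized
    return content_q, scale, rope                   # RoPE is passed through untouched


def dequantize_mla_kv(content_q, scale, rope):
    """Rebuild the cache rows from quantized content, per-token scales, and RoPE."""
    return np.concatenate([content_q * scale, rope], axis=1)
```

Because each token carries its own scale, a newly decoded token can be quantized as soon as it is produced, without waiting for a block of neighbours to fill up.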

Figure 1: Numerical values analysis.

Figure 2: Layer-wise numerical fidelity analysis (context length = 32k).

Quantized PV Computation Pipeline Reconstruction

Executing the PV GEMM in FP8 on NVIDIA Hopper Tensor Cores requires a k-major contiguous memory layout for $V$. However, because MLA’s $V$ is read from the shared latent KV cache, it inherits that cache's per-token scales. These scales vary along the GEMM's reduction (token) dimension, so they cannot be applied as a single factor after accumulation, which breaks conventional dequantization pipelines.

SnapMLA resolves this via Scale Fusion, integrating $S_V$ directly into the attention probability matrix $P$. This is followed by block-wise dynamic quantization and implicit dequantization fused into the softmax computation. The result is a scale-homogeneous pipeline, robust to dynamic range extension induced by fusion, and compatible with hardware tiling and efficient memory access patterns.
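The algebra behind Scale Fusion is the identity $P\,(\mathrm{diag}(S_V)\,V_q) = (P\,\mathrm{diag}(S_V))\,V_q$: the per-token scales move from the rows of $V$ onto the columns of $P$. The NumPy sketch below illustrates this under stated assumptions. The function name, the block size, and the per-row block scale `s_p` are hypothetical, the FP8 cast of each block is deliberately omitted so the identity stays exact, and the real kernel fuses these steps with softmax rather than running them as a separate loop.

```python
import numpy as np

FP8_E4M3_MAX = 448.0


def pv_with_scale_fusion(p, v_q, s_v, block=4):
    """p: (queries, tokens) probabilities; v_q: (tokens, dim) quantized V;
    s_v: (tokens,) per-token scales of V. Returns P @ (diag(s_v) @ v_q)."""
    p_fused = p * s_v[None, :]  # fold S_V into P along the reduction axis
    out = np.zeros((p.shape[0], v_q.shape[1]), dtype=np.float32)
    for start in range(0, p.shape[1], block):
        blk = p_fused[:, start:start + block]
        # Block-wise dynamic scale: one value per query row and key block.
        amax = np.abs(blk).max(axis=1, keepdims=True)
        s_p = np.maximum(amax, 1e-12) / FP8_E4M3_MAX
        blk_q = blk / s_p  # an FP8 cast would happen here (omitted in this sketch)
        # Dequantize by rescaling the partial product, not the operands.
        out += s_p * (blk_q @ v_q[start:start + block])
    return out
```

After fusion, the only scale left in the inner product is constant within each block of the reduction dimension, so it can be applied once per partial product.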

Figure 3: Overview of the scale fusion pipeline in SnapMLA.

End-to-End Dataflow Optimization

SnapMLA features hardware-optimized compute-memory fusion, cache-aware tiling (aligned with HBM and SMEM layouts), and zero-overhead data layout transformation. These include fused kernels for token preparation and cache management, and efficient in-kernel transpositions using Hopper’s asynchronous execution. The holistic pipeline minimizes kernel launches, memory traffic, and layout adaptation overhead, achieving near-peak hardware utilization.

Empirical Evaluation

Benchmark Accuracy

Comparative evaluation across DeepSeek-V3.1 and LongCat-Flash-Thinking demonstrates SnapMLA achieves comparable or superior accuracy to FlashMLA, even under full FP8 quantization. Notably, performance in general QA, reasoning, and coding benchmarks remains stable, with only minor losses in mathematical reasoning tasks. This substantiates the preservation of accuracy through sensitivity-aware mixed-precision quantization and granular scale management.

Numerical Fidelity

Layer-wise analysis indicates that quantizing RoPE introduces substantial errors and “error explosion” as depth increases. Coarse-grained configurations (per-tensor/block) fail to capture token-level outlier activations. SnapMLA’s per-token, RoPE-aware quantization delivers the lowest error rates, closely matching BF16 ground truth across all metrics.
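The granularity effect can be reproduced on synthetic data. The sketch below is an illustration only: the data are random, the outlier token is scaled by an arbitrarily chosen factor of $10^5$ so that underflow is clearly visible, and `fake_fp8_e4m3` is a hypothetical helper that simulates E4M3 rounding in float32. It compares the error on the ordinary tokens when one scale is shared by the whole tensor against one scale per token.

```python
import numpy as np

FP8_E4M3_MAX = 448.0


def fake_fp8_e4m3(x):
    """Round values to the nearest E4M3-representable value (simulation only)."""
    x = np.clip(np.asarray(x, dtype=np.float32), -FP8_E4M3_MAX, FP8_E4M3_MAX)
    _, e = np.frexp(x)
    quantum = np.ldexp(np.ones_like(x), np.maximum(e - 1, -6) - 3)
    return (np.round(x / quantum) * quantum).astype(np.float32)


def roundtrip(x, axis):
    """Quantize then dequantize with one scale per slice (axis=None: per-tensor)."""
    amax = np.abs(x).max(axis=axis, keepdims=True)
    scale = np.maximum(amax, 1e-12) / FP8_E4M3_MAX
    return fake_fp8_e4m3(x / scale) * scale


rng = np.random.default_rng(0)
x = rng.standard_normal((8, 32)).astype(np.float32)
x[0] *= 1e5  # synthetic outlier token (assumed magnitude, for illustration)

ordinary = slice(1, None)  # measure error on the non-outlier tokens only
mse_tensor = np.mean((roundtrip(x, None) - x)[ordinary] ** 2)
mse_token = np.mean((roundtrip(x, 1) - x)[ordinary] ** 2)
print(f"per-tensor MSE: {mse_tensor:.4g}, per-token MSE: {mse_token:.4g}")
```

With a shared scale, the outlier token sets the range and the ordinary tokens are pushed into the subnormal region of FP8, where many of them round to zero. Per-token scales keep every token inside the normal range.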

System Throughput and Kernel Performance

SnapMLA delivers a 1.91× speedup under DP8/TP1 parallelism, enabling substantially larger batch sizes due to reduced memory footprint. Kernel performance closely tracks the effective theoretical TFLOPS peak, indicating negligible overhead from pipeline reconstruction and layout transformations.
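The link between memory footprint and batch size can be made concrete with back-of-the-envelope arithmetic. The numbers below are assumptions, not figures from the paper: a 512-wide latent content vector and a 64-wide RoPE part (as in publicly released DeepSeek MLA configurations), and one FP32 scale stored per token.

```python
def kv_bytes_per_token(content_dim, rope_dim, content_bytes, rope_bytes, scale_bytes=0):
    """Bytes of KV cache per token per layer for a given precision mix."""
    return content_dim * content_bytes + rope_dim * rope_bytes + scale_bytes


CONTENT_DIM, ROPE_DIM = 512, 64  # assumed MLA latent and RoPE widths

# Baseline: everything in BF16 (2 bytes per element).
bf16_bytes = kv_bytes_per_token(CONTENT_DIM, ROPE_DIM, 2, 2)

# Mixed: FP8 content (1 byte), BF16 RoPE (2 bytes), one FP32 scale per token (assumed).
mixed_bytes = kv_bytes_per_token(CONTENT_DIM, ROPE_DIM, 1, 2, scale_bytes=4)

ratio = bf16_bytes / mixed_bytes
print(bf16_bytes, mixed_bytes, round(ratio, 2))  # 1152 644 1.79
```

Under these assumptions the same cache budget holds roughly 1.8 times as many tokens, which is the headroom that allows larger batches. The measured end-to-end speedup also depends on kernel efficiency and is not derived from this ratio.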

Figure 4: End-to-end decoding throughput comparison.

Figure 5: Kernel-level compute performance (TFLOPS).

Kernel-level and configuration sensitivity analyses further show that throughput rises with the number of heads and with multi-token prediction (MTP), stabilizing at 85% of peak for $H \geq 64$, and that SnapMLA consistently outperforms FlashMLA.

Figure 6: Kernel performance across different input configurations.

Quantization Granularity and Hardware Scaling

SnapMLA exploits FP8’s dynamic range and the Hopper architecture's new Tensor Cores, systematically optimizing for per-token granularity and hardware-aligned layouts. This demonstrates a significant advancement in low-bit quantization methodologies for LLMs, leveraging hardware features to maximize resource efficiency and scalability without sacrificing accuracy.

Figure 7: Illustration of various quantization granularities.

Implications and Future Directions

SnapMLA’s architectural innovations enable practical deployment of efficient, long-context LLMs in hardware-constrained environments by simultaneously reducing memory requirements and achieving compute-bound throughput scaling. The implications for distributed inference and large-scale deployment are substantial: SnapMLA enables higher concurrency, larger batch sizes, and lower operational costs at production scale. The insights into activation outlier mitigation and mixed-precision domain alignment may extend to other compressive architectures, suggesting broad applicability.

Potential future directions include further reducing quantization error through adaptive outlier detection, generalizing mixed-precision pipelines, extending algorithm-kernel co-design to emerging GPU architectures (e.g., Blackwell), and exploring the limits of context extrapolation in increasingly long-context settings under fine-grained quantization policies.

Conclusion

SnapMLA presents a hardware-aware FP8 decoding pipeline for MLA LLMs that overcomes numerical, quantization, and architectural bottlenecks. Its evaluation shows up to a 1.91× throughput improvement with negligible accuracy loss across demanding benchmarks, validating its utility for efficient long-context LLM inference and paving the way for future hardware-software co-optimization in AI systems.
