Grouped-Query Attention with Shared K/V Projections
- The paper introduces Grouped-Query Attention with shared K/V projections, significantly reducing memory and compute overhead compared to traditional multi-head attention.
- It details rigorous group formation and adaptive routing strategies, including static, dynamic, and low-rank variations, to optimize performance.
- Empirical results demonstrate throughput improvements up to 538.7× with minimal accuracy loss for efficient transformer inference and training.
Grouped-Query Attention (GQA) with shared key/value (K/V) projections—alternatively termed Grouped-Head Attention—constitutes a class of attention mechanisms for Transformers in which multiple query heads are bundled into groups, each group sharing key and value projections. This design substantially reduces the memory and compute costs associated with the conventional multi-head attention (MHA) paradigm, primarily by lowering the number of distinct K/V projections and corresponding cache entries. Modern variants further optimize group formation via algorithmic or data-driven strategies, dynamically route tokens to per-group experts, introduce low-rank or latent-space compression, and exploit hardware-aware architecture. GQA with shared K/V projections underpins a range of advanced architectures for efficient autoregressive LLM inference and training.
1. Mathematical Formulation and Variants
Standard GQA partitions the $H$ attention heads into $G$ groups, indexing each group as $g \in \{1, \dots, G\}$, each of size $H/G$. For input $X \in \mathbb{R}^{n \times d}$:
- Queries: $Q_h = X W_Q^{(h)}$ for $h = 1, \dots, H$.
- Keys/Values (shared): $K_g = X W_K^{(g)}$, $V_g = X W_V^{(g)}$, each in $\mathbb{R}^{n \times d_k}$.
- Every head $h$ in group $g$ computes $\mathrm{Attn}(Q_h, K_g, V_g) = \mathrm{softmax}\!\left(Q_h K_g^{\top} / \sqrt{d_k}\right) V_g$.
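The grouped computation above can be sketched in a few lines of NumPy; the shapes, names, and the looped (rather than fused) evaluation are illustrative only, not taken from any cited implementation:

```python
import numpy as np

def gqa_attention(X, W_Q, W_K, W_V, n_groups):
    """Grouped-query attention: H query heads share G < H K/V projections.

    X:   (n, d)      token representations
    W_Q: (H, d, d_k) per-head query projections
    W_K: (G, d, d_k) per-group key projections (shared within a group)
    W_V: (G, d, d_k) per-group value projections
    """
    H, _, d_k = W_Q.shape
    heads_per_group = H // n_groups
    outputs = []
    for h in range(H):
        g = h // heads_per_group          # head h belongs to group g
        Q = X @ W_Q[h]                    # (n, d_k)
        K = X @ W_K[g]                    # shared key for the whole group
        V = X @ W_V[g]                    # shared value for the whole group
        scores = Q @ K.T / np.sqrt(d_k)   # (n, n)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        outputs.append(weights @ V)       # (n, d_k)
    return np.concatenate(outputs, axis=-1)  # (n, H * d_k)
```

For $G = H$ this reduces to standard MHA, and for $G = 1$ to multi-query attention.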
Grouped-Tied Attention (GTA) further ties the K/V projections within each group ($W_K^{(g)} = W_V^{(g)}$, so a single cached tensor serves as both key and value) and applies rotary positional encoding (RoPE) selectively, nearly halving memory relative to GQA, yet retaining expressiveness by rotating only a low-dimensional subspace (Zadouri et al., 27 May 2025).
Dynamic GQA variants, such as Mixture-of-Experts Shared Group Attention (mixSGA) and MoSKA, incorporate data-dependent token routing and allow tokens to select between multiple group sizes or shared K/V contexts based on learned or computed importance (Song et al., 16 Jun 2025, Rhee et al., 8 Nov 2025).
Latent/Low-Rank GQA—notably in LRKV (O'Neill et al., 16 Jan 2026) and the CCA/CCGQA family (Figliolia et al., 6 Oct 2025)—compress the shared K/V representations into a lower-dimensional latent or low-rank space, with per-head or per-token decoding for additional memory and FLOP savings.
2. Implementation Algorithms and Grouping Strategies
Static Grouping
The canonical GQA approach partitions heads evenly by index, with each group mean- or SVD-averaging the K/V parameters (Yu et al., 2024). Mean-pooling can be formalized as $W_K^{(g)} = \frac{1}{|\mathcal{G}_g|} \sum_{h \in \mathcal{G}_g} W_K^{(h)}$ (and analogously for $W_V^{(g)}$), where $\mathcal{G}_g$ is the set of heads assigned to group $g$.
A rigorous SVD-based approach optimizes K/V group projections to minimize cache reconstruction error on calibration data, and can accommodate RoPE by compressing the actual cached keys after positional modulation (Yu et al., 2024).
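A minimal sketch of the mean-pooling conversion, assuming per-head projection weights stacked along the first axis (names and layout here are hypothetical):

```python
import numpy as np

def mean_pool_kv(W_heads, n_groups):
    """Convert per-head K (or V) projections to grouped ones by mean-pooling.

    W_heads: (H, d, d_k) -- one projection matrix per original MHA head.
    Returns: (G, d, d_k) -- one shared projection per group of H/G heads.
    """
    H = W_heads.shape[0]
    assert H % n_groups == 0, "heads must divide evenly into groups"
    heads_per_group = H // n_groups
    # Reshape to (G, H/G, d, d_k) and average over each group's heads.
    grouped = W_heads.reshape(n_groups, heads_per_group, *W_heads.shape[1:])
    return grouped.mean(axis=1)
```

The SVD-based variant replaces this plain average with a truncated factorization fitted to minimize cache reconstruction error on calibration data.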
Activation-Informed and Adaptive Grouping
- AsymGQA leverages calibration activations, measuring pairwise head similarity via activation statistics, and clusters heads accordingly—potentially with asymmetric group sizes (Chen et al., 2024).
- QCQA applies multi-objective (accuracy–memory) evolutionary search, using a “weight-sharing error” metric as a proxy for quality, supporting both equal- and arbitrary-cardinality groups (Joshi et al., 2024).
- Key-driven and Dynamic GQA (KDGQA, DGQA) dynamically assign queries to groups based on key norms, updated per batch or via EMA, for vision models (Khan et al., 2024).
- Mixture-of-expert GQA (mixSGA) learns a per-token router over several pre-set group sizes, relying on an auxiliary loss for training–inference consistency and global weight sharing to avoid excessive parameter cost (Song et al., 16 Jun 2025).
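To make the key-driven idea concrete, here is a toy sketch of norm-based head-to-group assignment; the papers' exact per-batch/EMA update rules are not reproduced, and all names are illustrative:

```python
import numpy as np

def key_norm_grouping(W_K_heads, n_groups):
    """Toy key-driven grouping: rank heads by the norm of their key
    projections and assign contiguous ranks to the same group, so that
    heads with similar key magnitudes end up sharing K/V.

    W_K_heads: (H, d, d_k) per-head key projections.
    Returns:   (H,) group id for each head.
    """
    norms = np.linalg.norm(W_K_heads.reshape(W_K_heads.shape[0], -1), axis=1)
    order = np.argsort(norms)                 # heads sorted by key norm
    heads_per_group = len(order) // n_groups
    assignment = np.empty(len(order), dtype=int)
    for g in range(n_groups):
        assignment[order[g * heads_per_group:(g + 1) * heads_per_group]] = g
    return assignment
```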
Chunked and Sparse Shared Attention
MoSKA applies chunk-based partitioning of extremely long shared context, routing queries to relevant chunks via learned or heuristic chunk embeddings, then batches all selected requests for high-arithmetic-intensity batched GEMMs (Rhee et al., 8 Nov 2025).
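A heavily simplified sketch of the routing step, assuming precomputed per-request query summaries and per-chunk embeddings (both hypothetical names); the learned router and the batched-GEMM execution of the real system are omitted:

```python
import numpy as np

def route_to_chunks(query_summaries, chunk_embeddings, top_k):
    """Toy chunk router: score each request's query summary against every
    chunk embedding and keep the top_k highest-scoring chunks per request.

    query_summaries:  (R, e) one summary vector per request.
    chunk_embeddings: (C, e) one embedding per shared-context chunk.
    Returns:          (R, top_k) selected chunk indices per request.
    """
    scores = query_summaries @ chunk_embeddings.T        # (R, C)
    # Negate so argsort yields highest-scoring chunks first.
    return np.argsort(-scores, axis=1)[:, :top_k]
```

The selected (request, chunk) pairs would then be batched so the shared K/V of each chunk is read once and multiplied against many queries at high arithmetic intensity.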
3. Computational Complexity and Hardware-Efficiency
Cost Reductions
| Variant | Distinct K/V Heads | KV Cache (per token) | Projection Parameters | KV Load per Token |
|---|---|---|---|---|
| MHA | $H$ | $2 H d_k$ | $2 H d d_k$ | $2 H d_k$ |
| GQA ($G$ groups) | $G$ | $2 G d_k$ | $2 G d d_k$ | $2 G d_k$ |
| GTA | $G$ (tied K/V) | $\approx G d_k$ | $\approx G d d_k$ | $\approx G d_k$ |
| CCGQA (latent dim $d_c$) | $G$ (compressed) | $d_c$, with $d_c < 2 G d_k$ | $2 G d d_k$ plus latent projections | $d_c$ |
- Arithmetic intensity improves $G$-fold over MHA in GQA, and $2G$-fold in GTA (Zadouri et al., 27 May 2025, Rhee et al., 8 Nov 2025).
- Prefill FLOPs remain unchanged for static GQA, but can be lowered by latent compression (CCA/CCGQA), whose attention cost scales with the compressed latent dimension rather than the full head dimension (Figliolia et al., 6 Oct 2025).
- Cache memory: GQA achieves linear reduction in K/V cache with respect to ; advanced compression/low-rank approaches yield sublinear or multiplicative further savings (Figliolia et al., 6 Oct 2025, O'Neill et al., 16 Jan 2026).
- Batchwise serving with shared attention (MoSKA) transforms per-request, memory-bound GEMVs into compute-bound batched GEMMs, enabling large throughput increases for high-sharing workloads (Rhee et al., 8 Nov 2025).
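The cache-size arithmetic behind these reductions can be checked directly; the configuration below (32 layers, $d_k = 128$, fp16 cache) is illustrative only:

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_k, seq_len,
                   bytes_per_elem=2, tied=False):
    """Per-request KV-cache size in bytes. Untied K and V contribute a
    factor of 2; tying them (approximately, as in GTA) halves it."""
    factor = 1 if tied else 2
    return factor * n_layers * n_kv_heads * d_k * seq_len * bytes_per_elem

# 32-layer model, head dim 128, 4096-token context, fp16 cache.
mha = kv_cache_bytes(32, 32, 128, 4096)            # H = 32 KV heads (MHA)
gqa = kv_cache_bytes(32, 8, 128, 4096)             # G = 8 groups (GQA)
gta = kv_cache_bytes(32, 8, 128, 4096, tied=True)  # tied K/V (GTA-like)
print(mha // 2**20, gqa // 2**20, gta // 2**20)    # MiB: 2048 512 256
```

The linear $1/G$ cache reduction of GQA and the further halving from tying K/V fall straight out of this arithmetic.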
Parallelism and Practical Optimizations
- GQA and GTA scale identically across tensor-parallel devices; the number of K/V groups bounds the tensor-parallel degree achievable without duplicating K/V heads.
- Advanced paging, chunking, and custom fused kernels (e.g., in Opt-GPTQ) reduce fragmentation and maximize hardware utilization (Kong et al., 5 May 2025).
- MoSKA disaggregates inference across unique-attention and shared-attention nodes, pushing shared K/V serving into high-FLOP, batched compute pools (Rhee et al., 8 Nov 2025).
4. Empirical Results and Comparative Analyses
Multiple evaluations confirm that GQA with shared K/V achieves substantial memory and compute gains, typically with minor (or no) perplexity/accuracy penalties if group formation and head compression are carefully managed:
- LoRA-finetuned GQA with half/quarter heads on Llama2-7/13B: drops in zero-shot PPL and accuracy are modest (larger for aggressive 75% head removal), with corresponding throughput gains (Yu et al., 2024).
- Activation-informed (AsymGQA) and evolutionary (QCQA) groupings yield average accuracy improvements over naïve even grouping at constant cache/memory, on MMLU for AsymGQA and without any fine-tuning for QCQA (Chen et al., 2024, Joshi et al., 2024).
- MoSKA achieves end-to-end throughput speedups of up to 538.7× (full disaggregated system) over FlashAttention for concurrent requests with a highly shared $16$M-token context (Rhee et al., 8 Nov 2025).
- CCA/CCGQA: substantial KV-cache compression with no drop in MoE model performance; prefill and backward latency both reduced relative to MHA (Figliolia et al., 6 Oct 2025).
- mixSGA: at a fixed KV budget, higher ROUGE-L than GQA and perplexity lower by $2$–$4$ on standard benchmarks (Song et al., 16 Jun 2025).
- GTA: matches or exceeds GQA quality with roughly half the KV cache, with comparable zero-shot accuracy on FineWeb-Edu at $1,152$ vs. $2,048$ bytes per token per layer (Zadouri et al., 27 May 2025).
5. Design Considerations and Limitations
Trade-Offs and Tuning
- Larger group sizes (fewer KV heads): higher memory/computation savings, but an increased risk of quality degradation due to under-diversified heads (Chen et al., 2024).
- Activation- or quality-aware grouping (AsymGQA, QCQA): preserves performance at high compression rates by clustering heads with similar statistics (Chen et al., 2024, Joshi et al., 2024).
- Dynamic/statistical routing is crucial when token or group importance skews heavily across context or tasks (mixSGA, MoSKA) (Song et al., 16 Jun 2025, Rhee et al., 8 Nov 2025).
- Latency/throughput is a function of batch size, group size, and chunk size in shared-K/V architectures; tuning the chunk size is recommended to maximize speedup with minimal extra latency (Rhee et al., 8 Nov 2025).
Limitations
- Extreme grouping (down to a single shared K/V head, as in multi-query attention) risks notable loss in fine-grained attention diversity and potential collapse of attention patterns, particularly in model layers highly sensitive to context (Chen et al., 2024, O'Neill et al., 16 Jan 2026).
- MoSKA and similar approaches rely on high instance or request sharing—benefits diminish as shared context fraction decreases (Rhee et al., 8 Nov 2025).
- For structured tasks or non-text modalities, static or even grouping may underperform without careful adaptation (e.g. vision transformers show gains from DGQA/PGQA (Khan et al., 2024)).
6. Extensions, Hybrids, and Research Directions
- MoE-driven and token-wise dynamic GQA (mixSGA, MoSKA) continue to evolve, mixing adaptive grouping, sparse expert selection, and weight sharing to approach the memory–quality Pareto front (Song et al., 16 Jun 2025, Rhee et al., 8 Nov 2025).
- Low-rank and latent GQA (LRKV, CCGQA) provide a tunable spectrum from full sharing to per-head expressivity, leveraging latent or SVD-based representations for further compression with minimal loss (O'Neill et al., 16 Jan 2026, Figliolia et al., 6 Oct 2025).
- Chunked/shared-KV batching frameworks (MoSKA) exemplify infrastructural co-design, pairing algorithmic attention optimization with physical/node-level disaggregation (Rhee et al., 8 Nov 2025).
- Quality-guided head grouping (QCQA) demonstrates the use of multi-objective search and cheap surrogate objectives (e.g., weight-sharing error) as scalable alternatives to full retraining for grouping optimization (Joshi et al., 2024).
7. Selected Comparative Table
| Method | Parameter Overhead | Cache Compression | Sample Accuracy Gain/Drop | Notes |
|---|---|---|---|---|
| Static GQA | None (parameters reduced) | $1/G$ | Small loss @ 50% heads (LLaMA2-7B) | Strong baseline (Yu et al., 2024) |
| AsymGQA/QCQA | +search cost | $1/G$ (same as GQA) | Gains over even grouping (MMLU; QCQA vs. GQA, no FT) | Grouping informed by activations |
| Opt-GQA w/ GPTQ | +paging + quant | $1/G$ | Higher tokens/sec, memory cut | Hardware/page/fused-kernel optimized |
| MoSKA (shared KV) | +infra + router | Shared across requests | Large throughput gains at high sharing | Requires high shared-context fraction |
| mixSGA | +router | Adaptive per-token | Higher ROUGE-L, $2$–$4$ lower PPL | Token-wise expert/group routing |
| CCGQA/CCA | +down/up-proj., conv | Beyond $1/G$ (latent) | Lossless up to tested compression (MoE) | Latent and grouped hybrid |
| LRKV | +low-rank factors | Tunable | Fewer tokens to reach target BPB | Interpolates MHA ↔ GQA |
All claims supported by: (Rhee et al., 8 Nov 2025, Song et al., 16 Jun 2025, Joshi et al., 2024, O'Neill et al., 16 Jan 2026, Yu et al., 2024, Chen et al., 2024, Zadouri et al., 27 May 2025, Kong et al., 5 May 2025, Sun et al., 15 Jun 2025, Figliolia et al., 6 Oct 2025, Khan et al., 2024).