Grouped-Query Attention with Shared K/V Projections

Updated 8 February 2026
  • The paper introduces Grouped-Query Attention with shared K/V projections, significantly reducing memory and compute overhead compared to traditional multi-head attention.
  • It details rigorous group formation and adaptive routing strategies, including static, dynamic, and low-rank variations, to optimize performance.
  • Empirical results demonstrate throughput improvements up to 538.7× with minimal accuracy loss for efficient transformer inference and training.

Grouped-Query Attention (GQA) with shared key/value (K/V) projections—alternatively termed Grouped-Head Attention—constitutes a class of attention mechanisms for Transformers in which multiple query heads are bundled into groups, each group sharing a single key and value projection. This design substantially reduces the memory and compute costs of the conventional multi-head attention (MHA) paradigm, primarily by lowering the number of distinct K/V projections and corresponding cache entries. Modern variants further optimize group formation via algorithmic or data-driven strategies, dynamically route tokens to per-group experts, introduce low-rank or latent-space compression, and exploit hardware-aware architectural co-design. GQA with shared K/V projections underpins a range of advanced architectures for efficient autoregressive LLM inference and training.

1. Mathematical Formulation and Variants

Standard GQA partitions $H$ attention heads into $G < H$ groups, indexed $g = 1, \dots, G$, each of size $s = H/G$. For input $X \in \mathbb{R}^{N \times D}$:

  • Queries: $Q_i = X W^Q_i \in \mathbb{R}^{N \times d_h}$ for $i = 1, \dots, H$.
  • Keys/Values (shared): $K^{(g)} = X W^K_g$, $V^{(g)} = X W^V_g$, each in $\mathbb{R}^{N \times d_h}$.
  • Every head $i$ in group $g(i)$ computes

$$\mathrm{Attention}_i = \mathrm{softmax}\!\left(\frac{Q_i \bigl(K^{(g(i))}\bigr)^{\top}}{\sqrt{d_h}}\right) V^{(g(i))}$$
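The formulation above can be sketched directly in NumPy. The contiguous head-to-group assignment $g(i) = \lfloor i/s \rfloor$ used here is one common convention; all names and shapes are illustrative, not a reference implementation:

```python
import numpy as np

def gqa_attention(X, Wq, Wk, Wv, H, G):
    """Grouped-query attention with shared K/V projections (NumPy sketch).

    X  : (N, D) token representations
    Wq : (H, D, d_h) per-head query projections
    Wk : (G, D, d_h) per-group key projections (shared within a group)
    Wv : (G, D, d_h) per-group value projections
    Returns (H, N, d_h) per-head attention outputs.
    """
    d_h = Wq.shape[-1]
    s = H // G  # heads per group
    outputs = []
    for i in range(H):
        g = i // s                 # group index g(i): contiguous grouping
        Q = X @ Wq[i]              # (N, d_h)
        K = X @ Wk[g]              # (N, d_h), shared by the s heads of group g
        V = X @ Wv[g]              # (N, d_h)
        scores = Q @ K.T / np.sqrt(d_h)               # (N, N)
        scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
        P = np.exp(scores)
        P /= P.sum(axis=-1, keepdims=True)            # row-wise softmax
        outputs.append(P @ V)
    return np.stack(outputs)

# Tiny example: H=4 query heads sharing G=2 K/V projections.
rng = np.random.default_rng(0)
N, D, d_h, H, G = 5, 8, 4, 4, 2
X = rng.standard_normal((N, D))
out = gqa_attention(X, rng.standard_normal((H, D, d_h)),
                    rng.standard_normal((G, D, d_h)),
                    rng.standard_normal((G, D, d_h)), H, G)
print(out.shape)  # (4, 5, 4)
```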

Grouped-Tied Attention (GTA) further ties the K/V projections within each group ($K = V$) and applies rotary positional encoding (RoPE) selectively, nearly halving memory relative to GQA while retaining expressiveness by rotating only a low-dimensional subspace (Zadouri et al., 27 May 2025).

Dynamic GQA variants, such as Mixture-of-Experts Shared Group Attention (mixSGA) and MoSKA, incorporate data-dependent token routing and allow tokens to select between multiple group sizes or shared K/V contexts based on learned or computed importance (Song et al., 16 Jun 2025, Rhee et al., 8 Nov 2025).

Latent/Low-Rank GQA—notably in LRKV (O'Neill et al., 16 Jan 2026) and the CCA/CCGQA family (Figliolia et al., 6 Oct 2025)—compress the shared K/V representations into a lower-dimensional latent or low-rank space, with per-head or per-token decoding for additional memory and FLOP savings.
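The latent-compression idea can be illustrated with a minimal sketch: encode the K/V stream once per token into a rank-$r$ latent, cache only the latent, and decode per group at attention time. The names, shapes, and decoding scheme below are assumptions for illustration, not the exact LRKV or CCGQA formulation:

```python
import numpy as np

# Illustrative low-rank shared-K/V sketch (hypothetical layout, not the exact
# LRKV/CCGQA method): only the rank-r latent per token is cached.
rng = np.random.default_rng(1)
N, D, d_h, G, r = 6, 16, 8, 2, 4   # r << D controls the cache footprint

X = rng.standard_normal((N, D))
W_down = rng.standard_normal((D, r)) / np.sqrt(D)       # shared encoder
W_up_K = rng.standard_normal((G, r, d_h)) / np.sqrt(r)  # per-group K decoders
W_up_V = rng.standard_normal((G, r, d_h)) / np.sqrt(r)  # per-group V decoders

latent = X @ W_down     # (N, r): all the cache must store per token
K = latent @ W_up_K     # (G, N, d_h), decoded on the fly at attention time
V = latent @ W_up_V     # (G, N, d_h)

# Cache cost per token drops from 2*G*d_h floats (GQA) to r floats.
print(latent.shape, K.shape)
```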

2. Implementation Algorithms and Grouping Strategies

Static Grouping

The canonical GQA approach partitions heads evenly by index, with each group mean- or SVD-averaging the K/V parameters (Yu et al., 2024). This can be formalized as:

$$W^K_g = \mathrm{Avg}_{i \in G_g}\bigl(W^K_i\bigr), \qquad W^V_g = \mathrm{Avg}_{i \in G_g}\bigl(W^V_i\bigr)$$

A rigorous SVD-based approach optimizes K/V group projections to minimize cache reconstruction error on calibration data, and can accommodate RoPE by compressing the actual cached keys after positional modulation (Yu et al., 2024).
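The mean-pooling step from MHA checkpoints can be sketched as follows (the SVD-based refinement and RoPE handling are omitted; array layout is an assumption):

```python
import numpy as np

def mean_pool_kv_heads(W_heads, G):
    """Convert per-head K (or V) projections to per-group ones by averaging.

    W_heads : (H, D, d_h) per-head projection matrices of a trained MHA model
    Returns (G, D, d_h): each group is the mean of its H/G contiguous heads,
    as in the uptraining recipe described above.
    """
    H = W_heads.shape[0]
    assert H % G == 0, "group count must divide head count"
    s = H // G
    # Reshape to (G, s, D, d_h) and average over the s heads in each group.
    return W_heads.reshape(G, s, *W_heads.shape[1:]).mean(axis=1)

rng = np.random.default_rng(2)
W_K = rng.standard_normal((8, 16, 4))   # H=8 heads
W_Kg = mean_pool_kv_heads(W_K, G=2)     # -> (2, 16, 4)
# Group 0 is exactly the mean of heads 0..3:
assert np.allclose(W_Kg[0], W_K[:4].mean(axis=0))
```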

Activation-Informed and Adaptive Grouping

  • AsymGQA leverages calibration activations, measuring pairwise head similarity via activation statistics, and clusters heads accordingly—potentially with asymmetric group sizes (Chen et al., 2024).
  • QCQA applies multi-objective (accuracy–memory) evolutionary search, using a “weight-sharing error” metric as a proxy for quality, supporting both equal- and arbitrary-cardinality groups (Joshi et al., 2024).
  • Key-driven and Dynamic GQA (KDGQA, DGQA) dynamically assign queries to groups based on key norms, updated per batch or via EMA, for vision models (Khan et al., 2024).
  • Mixture-of-expert GQA (mixSGA) learns a per-token router over several pre-set group sizes, relying on an auxiliary loss for training–inference consistency and global weight sharing to avoid excessive parameter cost (Song et al., 16 Jun 2025).
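The activation-informed idea behind AsymGQA can be conveyed with a deliberately simplified greedy stand-in: measure pairwise head similarity on calibration activations and group the most similar heads. The actual paper's clustering and asymmetric group sizing are more involved than this sketch:

```python
import numpy as np

def similarity_grouping(head_acts, G):
    """Greedily group heads whose calibration activations are most similar.

    head_acts : (H, F) flattened calibration activations per head
    Returns a list of G lists of head indices (equal-sized groups here;
    AsymGQA itself also allows asymmetric sizes).
    """
    H = head_acts.shape[0]
    norm = head_acts / np.linalg.norm(head_acts, axis=1, keepdims=True)
    sim = norm @ norm.T                      # (H, H) cosine similarity
    remaining = list(range(H))
    s = H // G
    groups = []
    while remaining:
        seed = remaining.pop(0)              # next unassigned head as seed
        # attach its s-1 most similar remaining heads
        partners = sorted(remaining, key=lambda j: -sim[seed, j])[:s - 1]
        for p in partners:
            remaining.remove(p)
        groups.append([seed] + partners)
    return groups

rng = np.random.default_rng(3)
acts = rng.standard_normal((8, 32))
groups = similarity_grouping(acts, G=4)
# Every head lands in exactly one group:
assert sorted(h for g in groups for h in g) == list(range(8))
```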

Chunked and Sparse Shared Attention

MoSKA applies chunk-based partitioning of extremely long shared context, routing queries to relevant chunks via learned or heuristic chunk embeddings, then batches all selected requests for high-arithmetic-intensity batched GEMMs (Rhee et al., 8 Nov 2025).
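A hypothetical sketch of the routing step: each request's query summary scores the chunk embeddings, and only the top-k chunks are attended. The real MoSKA system uses learned routers and node-level batching; everything named below is illustrative:

```python
import numpy as np

def route_queries_to_chunks(Q, chunk_embs, top_k):
    """Route each query to its top-k most relevant shared-context chunks.

    Q          : (M, d) query summaries, one per request
    chunk_embs : (C, d) one embedding per chunk of the shared context
    Returns (M, top_k) selected chunk indices per request.
    """
    scores = Q @ chunk_embs.T                 # (M, C) relevance scores
    # argsort descending, keep the top_k chunks per query
    return np.argsort(-scores, axis=1)[:, :top_k]

rng = np.random.default_rng(4)
M, C, d, k = 4, 10, 8, 3
sel = route_queries_to_chunks(rng.standard_normal((M, d)),
                              rng.standard_normal((C, d)), top_k=k)
# Requests selecting the same chunk can then be batched into one GEMM
# against that chunk's shared K/V, raising arithmetic intensity.
print(sel.shape)  # (4, 3)
```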

3. Computational Complexity and Hardware-Efficiency

Cost Reductions

| Variant | Distinct K/V Heads | KV Cache Size | Projection Parameters | KV Load per Token |
|---|---|---|---|---|
| MHA | $H$ | $2NHd_h$ | $3Hd\,d_h$ | $2Hd_h$ |
| GQA (group $G$) | $G$ | $2NGd_h$ | $Hd\,d_h + 2Gd\,d_h$ | $2Gd_h$ |
| GTA | $G$ (tied K/V) | $(G+0.5)\,d_h$ | $Hd\,d_h + Gd\,d_h$ | $(G+0.5)\,d_h$ |
| CCGQA ($C_2$) | $G' = H/C_2$ | $2NG'd_h$ | $O(d^2/C_2)$ | $2G'd_h$ |
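The per-token KV-load column can be reproduced with a small calculator (formulas taken from the table above; CCGQA is omitted since it follows the GQA formula with $G'$):

```python
# KV-cache floats per token per layer, following the table's formulas.
# d_h = head dim, H = query heads, G = KV groups; GTA stores tied K/V plus a
# small untied RoPE slice, giving the (G + 0.5) * d_h figure.
def kv_floats_per_token(variant, H, G, d_h):
    if variant == "MHA":
        return 2 * H * d_h       # separate K and V for every head
    if variant == "GQA":
        return 2 * G * d_h       # one K and one V per group
    if variant == "GTA":
        return (G + 0.5) * d_h   # tied K=V per group + half-width extra
    raise ValueError(variant)

H, G, d_h = 32, 8, 128
for v in ("MHA", "GQA", "GTA"):
    print(v, kv_floats_per_token(v, H, G, d_h))
# GQA cuts the cache by H/G = 4x vs MHA; GTA nearly halves it again.
```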

Parallelism and Practical Optimizations

  • GQA and GTA scale identically across tensor-parallel devices; group size $G$ limits the duplication-free device count.
  • Advanced paging, chunking, and custom fused kernels (e.g., in Opt-GPTQ) reduce fragmentation and maximize hardware utilization (Kong et al., 5 May 2025).
  • MoSKA disaggregates inference across unique-attention and shared-attention nodes, pushing shared K/V serving into high-FLOP, batched compute pools (Rhee et al., 8 Nov 2025).

4. Empirical Results and Comparative Analyses

Multiple evaluations confirm that GQA with shared K/V achieves substantial memory and compute gains, typically with minor (or no) perplexity/accuracy penalties if group formation and head compression are carefully managed:

  • LoRA-finetuned GQA with half/quarter heads on Llama2-7/13B: drops in zero-shot PPL and accuracy are $<1\%$ (up to $2\%$ for aggressive 75% head removal); throughput gains are $\approx 50\%$–$170\%$ (Yu et al., 2024).
  • Activation-informed (AsymGQA) and evolutionary (QCQA) groupings yield up to $7.5\%$ (MMLU, AsymGQA) and $20\%$ (QCQA, no fine-tuning) average accuracy improvements over naïve even grouping at constant cache/memory (Chen et al., 2024, Joshi et al., 2024).
  • MoSKA achieves end-to-end throughput speedups of up to $538.7\times$ (full disaggregated system) over FlashAttention for $M = 256$ concurrent requests with a highly shared 16M-token context (Rhee et al., 8 Nov 2025).
  • CCA/CCGQA: $8\times$ KV-cache compression with no drop in MoE model performance; prefill latency reduced by $1.7\times$ and the backward pass by $1.3\times$ compared to MHA (Figliolia et al., 6 Oct 2025).
  • mixSGA: at a $50\%$ KV budget, $+2.3$ ROUGE-L over GQA and 2–4 lower perplexity on standard benchmarks (Song et al., 16 Jun 2025).
  • GTA: matches or exceeds GQA quality with roughly half the KV cache, reaching $60.2\%$ zero-shot accuracy on FineWeb-Edu with 1,152 vs. 2,048 bytes per token per layer (Zadouri et al., 27 May 2025).

5. Design Considerations and Limitations

Trade-Offs and Tuning

  • Larger group sizes $G$ (fewer KV heads) yield higher memory/compute savings but an increasing risk of quality degradation from under-diversified heads (Chen et al., 2024).
  • Activation- or quality-aware grouping (AsymGQA, QCQA): preserves performance at high compression rates by clustering heads with similar statistics (Chen et al., 2024, Joshi et al., 2024).
  • Dynamic/statistical routing is crucial when token or group importance skews heavily across context or tasks (mixSGA, MoSKA) (Song et al., 16 Jun 2025, Rhee et al., 8 Nov 2025).
  • Latency/throughput is a function of batch/group size and chunk size in shared-K/V architectures; $M \approx 64$–$256$ is recommended to maximize speedup with minimal extra latency (Rhee et al., 8 Nov 2025).

Limitations

  • Extreme grouping ($G \ll H$) risks notable loss in fine-grained attention diversity and potential collapse of attention patterns, particularly in model layers highly sensitive to context (Chen et al., 2024, O'Neill et al., 16 Jan 2026).
  • MoSKA and similar approaches rely on high instance or request sharing—benefits diminish as shared context fraction decreases (Rhee et al., 8 Nov 2025).
  • For structured tasks or non-text modalities, static or even grouping may underperform without careful adaptation (e.g. vision transformers show gains from DGQA/PGQA (Khan et al., 2024)).

6. Extensions, Hybrids, and Research Directions

  • MoE-driven and token-wise dynamic GQA (mixSGA, MoSKA) continue to evolve, mixing adaptive grouping, sparse expert selection, and weight sharing to approach the memory–quality Pareto front (Song et al., 16 Jun 2025, Rhee et al., 8 Nov 2025).
  • Low-rank and latent GQA (LRKV, CCGQA) provide a tunable spectrum from full sharing to per-head expressivity, leveraging latent or SVD-based representations for further compression with minimal loss (O'Neill et al., 16 Jan 2026, Figliolia et al., 6 Oct 2025).
  • Chunked/shared-KV batching frameworks (MoSKA) exemplify infrastructural co-design, pairing algorithmic attention optimization with physical/node-level disaggregation (Rhee et al., 8 Nov 2025).
  • Quality-guided head grouping (QCQA) demonstrates the use of multi-objective search and cheap surrogate objectives (e.g., weight-sharing error) as scalable alternatives to full retraining for grouping optimization (Joshi et al., 2024).

7. Selected Comparative Table

| Method | Parameter Overhead | Cache Compression | Sample Accuracy Gain/Drop | Notes |
|---|---|---|---|---|
| Static GQA | $O(2Gd^2/H)$ | $1/G$ | $<1$–$2\%$ loss @ 50% heads (LLaMA2-7B) | Strong baseline (Yu et al., 2024) |
| AsymGQA/QCQA | +search, $O(1)$ | $1/G$ (same) | $+7.5\%$ (MMLU), $+20\%$ (QCQA vs GQA, no FT) | Grouping informed by activations |
| Opt-GQA w/ GPTQ | +paging + quant | $1/G$ | $+2.6\%$ tokens/sec, $75\%$ mem cut | Hardware/page/fused kernel optimized |
| MoSKA (Shared KV) | infra + router $O(1)$ | N/A | $538.7\times$ throughput at $M=256$ | Requires high shared-context fraction |
| mixSGA | +router, $O(DE)$ | Adaptive per-token | $+2.3$ ROUGE-L, $2$–$4$ lower PPL | Token-wise expert/group routing |
| CCGQA/CCA | +down/up-proj, conv | $1/C_2$ | Lossless up to $8\times$ compression (MoE) | Latent and grouped hybrid |
| LRKV | +low-rank $O(Hdr)$ | $1-\phi(r/H)$ (tunable) | $18$–$30\%$ fewer tokens to reach target BPB | Interpolates MHA ↔ GQA |

All claims supported by: (Rhee et al., 8 Nov 2025, Song et al., 16 Jun 2025, Joshi et al., 2024, O'Neill et al., 16 Jan 2026, Yu et al., 2024, Chen et al., 2024, Zadouri et al., 27 May 2025, Kong et al., 5 May 2025, Sun et al., 15 Jun 2025, Figliolia et al., 6 Oct 2025, Khan et al., 2024).
