
Grouped Head Attention (GHA)

Updated 18 February 2026
  • Grouped Head Attention (GHA) is an architectural technique that reduces memory and computation by grouping attention heads to share key and value projections.
  • By mean-pooling within groups, GHA achieves 2–4× reductions in KV cache size while adaptive methods like QCQA and AsymGQA can boost accuracy by up to 20% in large models.
  • Recent innovations in GHA integrate data-driven grouping, weighted aggregation, and hardware optimizations, making it applicable across language, vision, and speech tasks.

Grouped Head Attention (GHA) refers to a family of architectural techniques for reducing the computational and memory complexity of attention mechanisms—especially multi-head attention (MHA)—by partitioning attention heads into groups that share or compress key (K) and value (V) projections. Originally introduced as Grouped-Query Attention (GQA), and later generalized across settings and modalities, GHA exploits the empirical redundancy among head-specific attention computations to achieve substantial efficiency gains in large-scale models, particularly LLMs, vision transformers (ViT), and efficient speech encoders. Recent methodological innovations further enhance the trade-off between quality and efficiency, including data- or activation-informed grouping, weighted grouping, and multi-objective search.

1. Formal Definitions and Core Mechanism

Let $X \in \mathbb{R}^{B \times T \times D}$ denote the input to an attention layer, where $B$ is batch size, $T$ is sequence length, and $D$ is hidden dimension. In standard MHA with $H$ heads, each head $i$ computes $Q_i = X W^Q_i$, $K_i = X W^K_i$, $V_i = X W^V_i$ (with $W^{Q/K/V}_i \in \mathbb{R}^{D \times d}$, $d = D/H$), yielding per-head attention output:

$$A_i = \mathrm{softmax}\left(\frac{Q_i K_i^T}{\sqrt{d}}\right) V_i$$

GHA partitions the $H$ query heads into $P$ (sometimes denoted $G$) groups $\{\mathcal{G}_0, \dots, \mathcal{G}_{P-1}\}$. Within each group $\mathcal{G}_j$, the corresponding key and value heads are mean-pooled:

$$K_{\mathcal{G}_j} = \frac{1}{|\mathcal{G}_j|} \sum_{i \in \mathcal{G}_j} K_i, \qquad V_{\mathcal{G}_j} = \frac{1}{|\mathcal{G}_j|} \sum_{i \in \mathcal{G}_j} V_i$$

The forward pass for head $i \in \mathcal{G}_j$ employs these merged projections:

$$\tilde{A}_i = \mathrm{softmax}\left(\frac{Q_i K_{\mathcal{G}_j}^T}{\sqrt{d}}\right) V_{\mathcal{G}_j}$$

This reduces the number of key/value caches and corresponding parameter storage from $H$ to $P$ per layer. The canonical case of Multi-Query Attention (MQA) sets $P = 1$, i.e., all heads share the same $(K, V)$, but this incurs a major quality loss.
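A minimal NumPy sketch of the mechanism above, using toy dimensions and uniform consecutive groups (function and variable names are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def grouped_head_attention(X, Wq, Wk, Wv, num_groups):
    """Toy GHA forward pass: H query heads share mean-pooled K/V per group.

    X: (T, D) input; Wq, Wk, Wv: (H, D, d) per-head projections.
    Groups are consecutive and of equal size (H must divide evenly).
    """
    H, D, d = Wq.shape
    group_size = H // num_groups
    Q = np.einsum('td,hde->hte', X, Wq)  # (H, T, d)
    K = np.einsum('td,hde->hte', X, Wk)
    V = np.einsum('td,hde->hte', X, Wv)
    # Mean-pool K and V within each group of consecutive heads -> (P, T, d).
    Kg = K.reshape(num_groups, group_size, *K.shape[1:]).mean(axis=1)
    Vg = V.reshape(num_groups, group_size, *V.shape[1:]).mean(axis=1)
    outputs = []
    for i in range(H):
        j = i // group_size                    # group index for head i
        scores = Q[i] @ Kg[j].T / np.sqrt(d)   # (T, T) attention logits
        outputs.append(softmax(scores) @ Vg[j])
    return np.stack(outputs)                   # (H, T, d)
```

With `num_groups == H` the group size is 1 and the function reduces to standard MHA; with `num_groups == 1` it is MQA.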

2. Parameter Complexity and Efficiency Analysis

Under GHA/GQA, the per-layer memory required for key/value caching is reduced from $2BTHd$ (MHA) to $2BTPd$, i.e., a normalized cache size of $P/H$. The computational cost of multi-head dot products, particularly the quadratic cost of attention scoring, is similarly reduced in variants that group not only $K/V$ but also operate on temporally or feature-grouped tokens (as in grouped MHSA for speech encoders).

A comparison table:

| Technique | KV Projections | KV Cache Size | Attention Scoring FLOPs | Reduction |
|---|---|---|---|---|
| MHA | $H$ | $2BTHd$ | $2HdT^2$ | None |
| GQA/GHA | $P < H$ | $2BTPd$ | $2HdT^2$ | $\times(P/H)$ if grouping scoring |
| MQA | $P = 1$ | $2BTd$ | $2HdT^2$ | Max |
| Grouped MHSA | see (Burchi et al., 2021) | see above | $O(T^2 D / g)$ | $\times g$ |

Empirically, setting $P$ to $H/2$ or $H/4$ achieves a 2–4× reduction in cache and parameter size, with only a modest decrease in model quality (quality varies by modality and downstream task) (Joshi et al., 2024, Khan et al., 2024, Chen et al., 12 Mar 2025).
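The cache arithmetic is easy to verify directly. The sketch below uses an illustrative 7B-class configuration (32 layers, 32 query heads, head dimension 128, fp16) that is not drawn from any specific cited model:

```python
def kv_cache_bytes(batch, seq_len, layers, kv_heads, head_dim, bytes_per_elem=2):
    """Total KV cache size in bytes: 2 * B * T * P * d per layer (K and V),
    times the number of layers; fp16 (2 bytes/element) by default."""
    return 2 * batch * seq_len * layers * kv_heads * head_dim * bytes_per_elem

# Illustrative 7B-class config: 32 layers, 32 heads, head_dim 128, 4k context.
mha = kv_cache_bytes(batch=1, seq_len=4096, layers=32, kv_heads=32, head_dim=128)
gqa = kv_cache_bytes(batch=1, seq_len=4096, layers=32, kv_heads=8, head_dim=128)
print(mha / 2**30, gqa / 2**30, mha / gqa)  # 2.0 GiB, 0.5 GiB, ratio H/P = 4.0
```

With $P = H/4$ the cache shrinks by exactly $H/P = 4\times$, matching the $P/H$ normalized size above.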

3. Data-Driven and Adaptive Grouping Strategies

While classical GQA forms groups uniformly and statically (i.e., consecutive or evenly partitioned), several recent approaches employ data- or activation-driven grouping:

  • Quality- and Capacity-Aware Grouped Query Attention (QCQA): QCQA employs a two-stage NSGA-II multi-objective evolutionary search using a lightweight Weight-Sharing Error (WSE) proxy to find per-layer, per-head groupings that best trade off KV cache size and downstream accuracy. Arbitrary (uneven) grouping (QCQA-AC) outperforms both uniform grouping and equal-cardinality (QCQA-EC), yielding up to 20% absolute gain in accuracy at fixed cache size compared to standard GQA on LLaMA-2 7B (Joshi et al., 2024).
  • Activation-Informed Grouping (AsymGQA): AsymGQA performs a search over a similarity metric computed from head activations on a held-out set, forming groups of heads with similar functional behavior. Asymmetric (variable-sized) grouping further improves accuracy over uniform grouping, particularly at small group sizes, with up to a 7.5% gain on the MMLU task over naive grouping for LLaMA-2-7B (Chen et al., 2024).
  • Key-Distribution-Driven Grouping: Methods like KDGQA and DGQA allocate query heads to groups based on the $L_2$-norm of the key projections, dynamically adapting group assignments during training based on running statistics. On ViT-L, DGQA realizes up to 8% accuracy improvement on Tiny ImageNet over static GQA (Khan et al., 2024).
  • Weighted Grouped-Query Attention (WGQA): WGQA introduces learnable per-head weights for aggregating the key/value projections within a group, with the weights learned during finetuning. After folding the learned weights into the projections, no runtime cost is incurred, and up to a 0.53% improvement over mean-pooling GQA is observed as model size increases (Chinnakonduru et al., 2024).
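As an illustration of activation-informed grouping in the spirit of AsymGQA, the sketch below greedily groups heads by cosine similarity of their activation statistics. The greedy procedure, the fixed target group size, and the statistic used are simplifications for exposition, not the published search:

```python
import numpy as np

def similarity_groups(head_acts, num_groups):
    """Greedy grouping of heads by cosine similarity of activation statistics.

    head_acts: (H, N) per-head activation features gathered on a held-out set.
    Returns a list of groups (lists of head indices) covering all H heads.
    Simplified illustration only; real methods search uneven group sizes.
    """
    H = head_acts.shape[0]
    normed = head_acts / np.linalg.norm(head_acts, axis=1, keepdims=True)
    sim = normed @ normed.T            # (H, H) cosine similarity matrix
    unassigned = set(range(H))
    groups = []
    while unassigned:
        seed = min(unassigned)         # lowest-index free head seeds a group
        unassigned.discard(seed)
        size = max(1, round(H / num_groups))
        # Attach the most similar free heads to the seed.
        partners = sorted(unassigned, key=lambda h: -sim[seed, h])[:size - 1]
        for h in partners:
            unassigned.discard(h)
        groups.append([seed] + partners)
    return groups
```

Once groups are chosen, their K/V heads can be mean-pooled exactly as in the uniform case; only the assignment changes.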

4. Empirical Results and Trade-offs

Key empirical findings across modalities and recent works include:

  • Quality-Efficiency Trade-off: Reducing the number of KV groups yields substantial memory and parameter savings but incurs an accuracy drop that is recoverable through optimal group assignment, fine-tuning, or adaptive grouping. For LLaMA-2 7B, QCQA-AC achieves 44.3% average accuracy at a normalized cache size of 0.5, compared to 24.3% using GQA, and after fine-tuning improves absolute accuracy by 10.55% at the same cache size (Joshi et al., 2024).
  • Hardware and Throughput: For speech recognition with Efficient Conformer (group size $g_1 = 3$), grouped MHSA yields the same word error rates as full MHSA but accelerates inference by 27% and training by 35% (Burchi et al., 2021). On LLMs, Opt-GQA custom kernels (paged KV memory, fused operations) provide additional memory and latency benefits for deployment on GPUs or DCUs (Kong et al., 5 May 2025).
  • Optimal Configuration: Analytical recipes grounded in empirical loss–head-count scaling laws show that, given a target loss and context length, it is often optimal for long-context LLMs to aggressively reduce the number of query and KV heads while slightly increasing model size. This approach reduces inference FLOPs and KV memory cost by roughly 50% (or more) with no loss in language modeling quality, as demonstrated in cost-optimal GQA for Llama-3 (Chen et al., 12 Mar 2025).
  • Expressivity Limitations: Overzealous grouping ($P$ too small) severely degrades model quality, especially on complex tasks or with inadequate finetuning. Adaptive and activation-informed methods mitigate this loss.

5. Methodological Variants and Generalizations

Grouped Head Attention is not limited to mean-pooling or uniform grouping, and has generalized across several axes:

  • Non-Uniform and Dynamic Grouping: Arbitrary group sizes (Joshi et al., 2024, Chen et al., 2024), data-adaptive allocation (Khan et al., 2024), and per-layer heterogeneity enable more fine-grained balancing of efficiency and expressivity.
  • Learned Weighted Aggregation: Introduction of per-head or per-dimension weights for key/value aggregation (Chinnakonduru et al., 2024).
  • Latent Value Decoding and Gating: Grouped-head latent attention mechanisms, e.g., GTA, introduce a nonlinear decoder over compressed latent values and sigmoid gating to restore head diversity while minimizing memory (Sun et al., 15 Jun 2025).
  • Parameter Compression via Shared Projections: Collaborative multi-head attention (Cordonnier et al., 2020) employs group-shared $W_Q$, $W_K$ with per-head value projections and mixing vectors, enabling 2–4× savings in key/query parameters with negligible accuracy loss.
  • Auxiliary Supervision and Pruning: Self-supervised group-regularized training with post-hoc voting-based pruning achieves both parameter compression and boosted task performance, outperforming uniform-pruned or "lite" baselines (Ni et al., 2023).
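The learned weighted aggregation above relies on linearity: since $K_{\mathcal{G}} = \sum_i w_i (X W_i^K) = X \sum_i w_i W_i^K$, the weights can be folded into a single merged projection offline, which is why no runtime cost remains after finetuning. A minimal NumPy sketch (names and shapes are illustrative):

```python
import numpy as np

def fold_weighted_group(W_heads, weights):
    """Fold learned per-head aggregation weights into one merged projection.

    W_heads: (group_size, D, d) per-head key (or value) projection matrices.
    weights: (group_size,) learned aggregation weights for the group.
    Returns the merged (D, d) projection: sum_i w_i * W_i.
    """
    return np.einsum('g,gde->de', weights, W_heads)
```

Because the fold happens once at export time, inference uses a single projection per group, exactly as in mean-pooled GQA (mean-pooling is the special case $w_i = 1/|\mathcal{G}|$).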

6. Practical Implementation and Deployment Considerations

For deployment in large-scale and long-context settings, best practices derived from the literature include:

  • Recipe for Application: Extract per-head K, V matrices, conduct grouping search (optionally activation-based), select per-layer groupings according to hardware constraints and quality targets, then optionally fine-tune briefly to recover accuracy (Joshi et al., 2024, Chen et al., 2024).
  • Hardware Optimization: Paged memory management and custom kernels (with shared memory, warp-level parallelism, fused softmax and bias) lower memory fragmentation and improve utilization. Such techniques, as used in Opt-GQA, are compatible with dynamic batching and large-scale multitenant inference (Kong et al., 5 May 2025).
  • Scaling Laws: Empirical scaling of loss with number of attention heads follows a "power + constant" law, enabling accurate extrapolation and principled configuration selection for new contexts or model sizes (Chen et al., 12 Mar 2025).
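The "power + constant" scaling form $L(H) \approx a\,H^{-b} + c$ can be fit with ordinary least squares in log space once the constant is fixed. Below is an illustrative grid-search fit on synthetic data; this is a generic curve-fitting sketch, not the procedure or data of the cited work:

```python
import numpy as np

def fit_power_plus_constant(heads, losses, c_grid=None):
    """Fit L(H) ~ a * H**(-b) + c by grid search over the constant c,
    with (a, b) obtained via least squares on log(L - c) vs log(H)."""
    heads = np.asarray(heads, dtype=float)
    losses = np.asarray(losses, dtype=float)
    if c_grid is None:
        # c must stay below min(losses) so the log is defined.
        c_grid = np.linspace(0.0, losses.min() * 0.999, 200)
    best = None
    for c in c_grid:
        slope, intercept = np.polyfit(np.log(heads), np.log(losses - c), 1)
        pred = np.exp(intercept) * heads**slope + c
        err = np.mean((pred - losses) ** 2)
        if best is None or err < best[0]:
            best = (err, np.exp(intercept), -slope, c)
    _, a, b, c = best
    return a, b, c
```

Given fitted $(a, b, c)$, one can extrapolate the loss for a reduced head count and trade it against the $P/H$ cache saving when choosing a configuration.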

7. Limitations, Open Questions, and Future Research

  • Expressivity and Uniqueness: Very small numbers of KV groups degrade complex task performance, with nonlinear value decoding and adaptive grouping partially mitigating the trade-off (Sun et al., 15 Jun 2025).
  • Dynamic and Modal-General Grouping: Future work includes learning dynamic group assignments, integrating grouping with sparse attention or multimodal setups, and hardware–software co-design for even lower kernel launch overhead (Khan et al., 2024, Sun et al., 15 Jun 2025).
  • Optimality of Assignments: Despite progress, there is no closed-form solution for optimal grouping in all tasks; most approaches rely on search heuristics or proxies (e.g., WSE, activation similarity). Evolving data distributions or deployment domains may require online adaptation.
  • Metric Limitations: Evaluation remains tethered to imperfect proxies (e.g., BLEU, ROUGE, accuracy), and scaling gains observed may not always extend unaltered to very large or cross-modal systems (Chinnakonduru et al., 2024).

Grouped Head Attention and its variants (GQA, QCQA, AsymGQA, WGQA, GTA, etc.) establish a new paradigm of quality- and efficiency-aware architectural compression for attention in large models, enabling flexible, hardware-friendly, and scalable solutions for modern machine learning workloads (Joshi et al., 2024, Chen et al., 2024, Khan et al., 2024, Chen et al., 12 Mar 2025, Kong et al., 5 May 2025, Sun et al., 15 Jun 2025).
