Value-State Gated Attention (VGA)
- Value-State Gated Attention (VGA) is a transformer modification that uses dynamic, sigmoid-based gating on value vectors to decouple attention allocation from token information flow.
- It introduces a learnable, data-dependent gate that suppresses irrelevant tokens without collapsing value norms, thereby mitigating pathological attention behaviors such as attention sinks and drains.
- Empirical results demonstrate that VGA reduces extreme activation norms and improves model stability and quantization performance in large-scale language models.
Value-State Gated Attention (VGA) is an architectural modification to transformer attention mechanisms, designed to address pathological behaviors such as attention sinks and value-state drains in LLMs. VGA introduces a learnable, data-dependent gating function computed from value vectors, creating a dedicated "no-op" path that decouples attention allocation from token information flow without collapsing value norms. This mechanism targets improved stability, quantization robustness, and interpretability while preserving the inductive biases of softmax attention.
1. Motivation: Extreme-Token Phenomena and No-Op Attention
Large transformer models exhibit "extreme-token phenomena," primarily attention sinks and value-state drains, collectively sustained through a mutual reinforcement cycle. An attention sink arises when one or a few tokens (often with no semantic importance) attract nearly all attention mass, driven by the necessity for softmax normalization to allocate the full budget of attention weights even in "no-op" scenarios. Optimizers then force the value vector norms of these sink tokens towards zero—a phenomenon termed value-state drain—to cancel their effect on the layer output. The near-zero value further reinforces the sink's attractiveness, perpetuating degraded accuracy, unstable value activations, loss of interpretability (as high attention ceases to equate to importance), and post-training quantization failures (Bu et al., 10 Oct 2025).
Standard attention offers no clean mechanism for "no-op" behavior. Suppressing a token's contribution requires collapsing its value vector, entangling attention allocation with semantic flow. VGA introduces a dedicated regulator that dynamically gates each value vector $v_j$ based on the value state itself, allowing a token to be suppressed functionally even when it receives high attention. This architectural decoupling is critical for stability and interpretability, particularly in large-scale and quantized models (Bu et al., 10 Oct 2025, Qiu et al., 10 May 2025).
2. Architectural Formulation: Gating Mechanism and Integration
Let $Q, K, V \in \mathbb{R}^{n \times d}$ denote the query, key, and value projections of the input sequence $X$. The vanilla attention output for query $i$ is:

$$o_i = \sum_j a_{ij}\, v_j, \qquad a_{ij} = \operatorname{softmax}_j\!\left(\frac{q_i^\top k_j}{\sqrt{d}}\right)$$

VGA augments this formulation by introducing a gating function over the value vectors:

$$g_j = \sigma(W_\theta v_j + b_\theta)$$

where $W_\theta$ is a small trainable projection (typically $W_\theta \in \mathbb{R}^{d \times d}$), $b_\theta$ is a bias, and $\sigma$ is an element-wise sigmoid. The attention output becomes:

$$o_i = \sum_j a_{ij}\,(g_j \odot v_j)$$
The gate $g_j \in (0,1)^d$ (often reduced to a scalar per token in practice) can "close" ($g_j \to 0$) to suppress the information flow from token $j$ independently of the attention mass, or "open" ($g_j \to 1$) to permit transmission.
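Concretely, the gate computation and gated output can be sketched in a few lines of NumPy (dimensions and initialization scale are illustrative, not taken from the papers):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def vga_attention(Q, K, V, W_theta, b_theta):
    """o_i = sum_j a_ij * (g_j * v_j), with gates g_j = sigmoid(W_theta v_j + b_theta)."""
    d = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d))                      # (n, n) attention weights
    G = 1.0 / (1.0 + np.exp(-(V @ W_theta.T + b_theta)))   # (n, d) gates in (0, 1)
    return A @ (G * V)                                      # mix the gated values

rng = np.random.default_rng(0)
n, d = 4, 8
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
W_theta = 0.02 * rng.standard_normal((d, d))   # small init keeps gates near 0.5
b_theta = np.zeros(d)
out = vga_attention(Q, K, V, W_theta, b_theta)
```

Driving a row of `G` toward zero removes that token's contribution regardless of how much attention mass it receives, which is exactly the decoupling the formulation above describes.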
Several gating variants exist, but empirical and theoretical analysis demonstrates the superiority of gating on value-states (post-SDPA) over input-state or key/query-based gating for mitigating pathological behaviors and decoupling information flow (Bu et al., 10 Oct 2025, Qiu et al., 10 May 2025).
3. Gradient Dynamics and Theoretical Rationale
In standard attention, heavy attention mass on a token ($a_{ij} \to 1$) directs all gradient flow into $v_j$, driving its norm towards zero. With VGA, product-rule differentiation of the gated output $g(v_j) \odot v_j$ with respect to $v_j$ yields two gradient pathways:

$$\frac{\partial\, (g_j \odot v_j)}{\partial v_j} = \operatorname{diag}(g_j) + \operatorname{diag}(v_j)\,\operatorname{diag}\!\big(g_j \odot (1 - g_j)\big)\, W_\theta$$

The first, the "content path," scales the gradient by $g_j$; the second, a self-regulatory path, is mediated through the sigmoid derivative $g_j \odot (1 - g_j)$ of the value gate. When the gate closes ($g_j \to 0$), both terms vanish, so large attention no longer enforces value-norm collapse. This effectively breaks the mutual reinforcement that leads to pathological sinks. By contrast, input-state-gated attention (where $g_j$ is computed from the input state $x_j$) does not provide reactive feedback; its gradients grow unbounded if attention mass accumulates, failing to proactively counteract pathologies (Bu et al., 10 Oct 2025).
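This two-pathway structure can be checked numerically. The following NumPy sketch (toy dimensions, not from the papers) compares the analytic product-rule Jacobian of $g(v) \odot v$ against finite differences and confirms that both pathways vanish as the gate closes:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_value(v, W, b):
    """f(v) = g(v) * v with the gate g(v) = sigmoid(W v + b)."""
    return sigmoid(W @ v + b) * v

def jacobian_analytic(v, W, b):
    """Content path diag(g) plus self-regulatory path diag(v) diag(g(1-g)) W."""
    g = sigmoid(W @ v + b)
    return np.diag(g) + np.diag(v) @ np.diag(g * (1 - g)) @ W

def jacobian_fd(v, W, b, eps=1e-6):
    """Central finite differences, column by column."""
    d = len(v)
    J = np.zeros((d, d))
    for k in range(d):
        e = np.zeros(d)
        e[k] = eps
        J[:, k] = (gated_value(v + e, W, b) - gated_value(v - e, W, b)) / (2 * eps)
    return J

rng = np.random.default_rng(1)
d = 6
v = rng.standard_normal(d)
W = 0.1 * rng.standard_normal((d, d))
b_open = np.zeros(d)           # gates near 0.5: both pathways active
b_closed = -10.0 * np.ones(d)  # gates near 0: gradient flow into v is cut off

J_open = jacobian_analytic(v, W, b_open)
J_closed = jacobian_analytic(v, W, b_closed)
```

The norm of `J_closed` is orders of magnitude smaller than that of `J_open`, illustrating the mechanism that protects a sink token's value norm once its gate has shut.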
4. Empirical Performance: Metrics and Benchmarks
Extensive empirical validation demonstrates that VGA consistently mitigates attention sinks and stabilizes activation statistics. In controlled synthetic tasks (e.g., Bigram-Backcopy), vanilla transformers develop dominant attention sinks and value-state drains, while VGA maintains diffuse attention and normalized value norms at no loss in accuracy (Bu et al., 10 Oct 2025).
In large-scale LLMs (BERT, OPT-125m, GPT-2 124M), the following improvements are observed:
| Model | Method | PPL | Max I+O Norm | Avg. Kurtosis |
|---|---|---|---|---|
| BERT | Vanilla | 4.55 | 905 | 3,251 |
| BERT | IGA | 4.53 | 47 | 94 |
| BERT | VGA | 4.52 | 37.7 | 83.8 |
| OPT-125m | Vanilla | 15.96 | 1.04e3 | 2,182 |
| OPT-125m | IGA | 15.65 | 0.51e3 | 101.6 |
| OPT-125m | VGA | 15.49 | 0.47e3 | 16.8 |
| GPT-2 | Vanilla | 17.03 | 240.3 | 1.6e4 |
| GPT-2 | IGA | 16.61 | 15.8 | 36.0 |
| GPT-2 | VGA | 16.53 | 12.3 | 35.1 |
VGA achieves lower perplexity, shrinks extreme activation ranges (key for quantization), and reduces activation kurtosis. In 8-bit post-training quantization (BERT/OPT), vanilla models suffer a drastic PPL blow-up, whereas VGA limits the post-quantization PPL increase on both BERT and OPT, achieving the lowest activation outliers and the most quantization-friendly statistics (Bu et al., 10 Oct 2025).
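The outlier diagnostics reported above (maximum activation norm, channel kurtosis) are straightforward to compute for any activation tensor. A minimal NumPy sketch using synthetic data with one injected "sink" token (data and function name are hypothetical):

```python
import numpy as np

def activation_stats(acts):
    """Max per-token activation norm and mean per-channel kurtosis."""
    norms = np.linalg.norm(acts, axis=-1)
    centered = acts - acts.mean(axis=0)
    var = centered.var(axis=0) + 1e-12
    kurtosis = (centered ** 4).mean(axis=0) / var ** 2   # Gaussian reference: ~3
    return norms.max(), kurtosis.mean()

rng = np.random.default_rng(2)
gaussian = rng.standard_normal((1024, 64))   # well-behaved activations
sink = gaussian.copy()
sink[0] *= 50.0                              # one extreme "sink" token

max_g, kurt_g = activation_stats(gaussian)
max_s, kurt_s = activation_stats(sink)
```

A single extreme token inflates both statistics sharply, which is why shrinking activation ranges matters so much for fitting activations onto an 8-bit quantization grid.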
In even larger models (15B MoE, 1.7B dense) trained on trillions of tokens, head-specific sigmoid gating after SDPA (the VGA pattern) yields the strongest accuracy, sparsity, and long-context scaling: perplexity improvements (–0.265), +2 MMLU points, and enhanced long-context robustness. Mean head-wise gate values as low as 0.12 indicate effective sparsity induction; maximum per-head attention fractions drop from 46.7% to 4.8% (baseline vs. VGA), directly addressing attention sink formation (Qiu et al., 10 May 2025).
5. Relation to Other Gated-Attention Variants
Input-State Gated Attention (IGA) computes gates from the pre-attention input state $x_j$ rather than from $v_j$, making the gating function static with respect to the changing value state. Unlike VGA, IGA provides a "predictive" gate rather than a reactive, dynamically synchronized one. Analysis indicates IGA reduces outliers somewhat but does not consistently decouple the reciprocal feedback between attention scores and value-state updates. VGA's value-dependent gating is unique in directly breaking this loop while efficiently transmitting or suppressing token effects (Bu et al., 10 Oct 2025).
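The static-versus-reactive distinction is easy to see in isolation: when a value state drains toward zero, a value-computed gate responds while an input-computed gate is unchanged. A toy NumPy illustration (the gate names are hypothetical):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(3)
d = 8
W = 0.5 * rng.standard_normal((d, d))
b = np.zeros(d)
x = rng.standard_normal(d)   # pre-attention input state
v = rng.standard_normal(d)   # value state, which changes during training

iga_gate = lambda x, v: sigmoid(W @ x + b)   # IGA: static w.r.t. the value state
vga_gate = lambda x, v: sigmoid(W @ v + b)   # VGA: reacts to the value state

v_drained = 0.01 * v   # simulate a value-state drain
```

After the drain, `vga_gate(x, v_drained)` moves back toward the neutral 0.5 while `iga_gate` is oblivious, which is the reactive feedback the paragraph above describes.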
Gated attention has been explored in a broad context, including LSTMs, Highway Networks, and state space models, but gating after SDPA with head/element-wise sigmoid gating (the VGA approach) is empirically optimal for eliminating attention sinks, maximizing non-linearity and sparsity, and improving scaling and stability, as established by comparisons across 30 gating-augmented architectures (Qiu et al., 10 May 2025).
6. Implementation Considerations and Overhead
VGA introduces per-head gating weights $W_\theta$ and biases $b_\theta$, a parameter cost that scales with the number of heads and the model width. For a 1.7B-parameter model, this amounts to about 33 million parameters (roughly 2%). Wall-time overhead is minimal in optimized code (Qiu et al., 10 May 2025). In practice, element-wise (dimension-wise) gates perform best, but head-wise scalar gates also provide significant improvements.
Gates are initialized to produce $g_j \approx 0.5$ (e.g., small random $W_\theta$, $b_\theta = 0$), providing symmetric exploration at initialization. The self-regulatory gradient term $g_j \odot (1 - g_j)$ peaks at $g_j = 0.5$, yielding strong gradients during gate transitions but vanishing near gate saturation (0 or 1), conferring stability and preventing oscillations (Bu et al., 10 Oct 2025).
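Both initialization properties can be verified directly (a NumPy sketch with illustrative dimensions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(4)
d = 16
W_theta = 0.01 * rng.standard_normal((d, d))  # small random weights
b_theta = np.zeros(d)                          # zero bias
v = rng.standard_normal(d)

g = sigmoid(W_theta @ v + b_theta)  # gates start near 0.5
self_reg = g * (1 - g)              # self-regulatory factor, maximal 0.25 at g = 0.5
```

Near saturation ($g \to 0$ or $g \to 1$) the factor $g(1-g)$ vanishes, so a gate that has committed stops adjusting, the stability property noted above.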
Separation of attention scores from effective information flow enables explicit discrimination between "semantic importance" (high attention $a_{ij}$, high gate $g_j$) and "intentional suppression" (high $a_{ij}$, low $g_j$), enhancing interpretability.
Integration into transformer blocks requires only a small modification after SDPA, expressible in a few lines of PyTorch-style code. The mechanism adds negligible complexity relative to the core Q/K/V projections and can be adopted in dense or Mixture-of-Experts architectures (Qiu et al., 10 May 2025).
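A minimal self-contained sketch of this integration point, written in NumPy with per-head gates computed from the value states (class and parameter names are illustrative, not the authors' implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class VGAAttention:
    """Multi-head attention with per-head value-state gating."""

    def __init__(self, d_model, n_heads, rng):
        self.h, self.dh = n_heads, d_model // n_heads
        s = 1.0 / np.sqrt(d_model)
        self.Wq, self.Wk, self.Wv, self.Wo = (
            s * rng.standard_normal((d_model, d_model)) for _ in range(4))
        # VGA's extra cost: one small gate projection and bias per head.
        self.Wg = 0.01 * rng.standard_normal((n_heads, self.dh, self.dh))
        self.bg = np.zeros((n_heads, self.dh))

    def __call__(self, x):  # x: (n, d_model)
        n = x.shape[0]
        split = lambda t: t.reshape(n, self.h, self.dh).transpose(1, 0, 2)
        q, k, v = split(x @ self.Wq), split(x @ self.Wk), split(x @ self.Wv)
        a = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(self.dh))       # (h, n, n)
        g = sigmoid(np.einsum('hij,hnj->hni', self.Wg, v) + self.bg[:, None, :])
        o = a @ (g * v)            # gate each value state, then mix by attention
        return o.transpose(1, 0, 2).reshape(n, -1) @ self.Wo

rng = np.random.default_rng(5)
attn = VGAAttention(d_model=32, n_heads=4, rng=rng)
y = attn(rng.standard_normal((10, 32)))
```

Relative to vanilla multi-head attention, the only changes are the `Wg`/`bg` parameters and the `g * v` line, reflecting how little the mechanism adds beyond the core Q/K/V projections.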
7. Impact, Applications, and Outlook
VGA provides a theoretically grounded, lightweight, and robust solution to extreme-token pathologies in transformer attention, with documented benefits for model interpretability, training and quantization stability, and downstream performance. Experimental evidence demonstrates consistent reductions in attention sinks, value-state drains, and activation extremes, with qualitative and quantitative improvements in language modelling and transfer tasks (Bu et al., 10 Oct 2025, Qiu et al., 10 May 2025).
A plausible implication is that VGA will inform future transformer design, particularly for settings requiring stable quantized deployment, robust interpretability, or operation over long contexts. The release of open-source code and models by researchers such as Qiu et al. is likely to accelerate broader adoption and further benchmarking. While other forms of gating remain actively explored, the value-state-centric, post-SDPA VGA formulation presently sets the standard for targeted mitigation of pathological attention behaviors.