
Sink-Attention Gating Modules

Updated 1 February 2026
  • Sink-attention gating modules are architectural interventions that redistribute attention weights to prevent pathological sinks in sequence models.
  • They utilize techniques like softmax relaxation, data-dependent gating, and structural masking to achieve sparser and more stable attention distributions.
  • Empirical results reveal significant improvements in model performance, including lower perplexity, enhanced quantization robustness, and better long-context handling.

Sink-attention gating modules are architectural and algorithmic interventions within sequence modeling frameworks—especially Transformers and Structured State Space Models (SSMs)—designed to identify, control, or eliminate the phenomenon of “attention sinks.” Attention sinks occur when the softmax-normalized attention mechanism, due to its sum-to-one constraint, allocates a disproportionate amount of attention to specific tokens (typically initial tokens in language or [CLS] tokens in vision), often irrespective of their semantic relevance. This can degrade model expressivity, lead to unstable activations, impede long-context generalization, and hinder interpretability. Sink-attention gating modules represent a diverse set of approaches—including gating, normalization relaxation, and structural masking—that regularize or redistribute excess attention mass, yielding improved model quality, sparsity, quantization robustness, and stability across modalities and architectures.

1. Phenomenology and Origins of Attention Sinks

Attention sinks emerge from the softmax normalization’s requirement that the sum of attention weights for each query equals one. When no semantically strong key exists, softmax “dumps” the probability mass on certain positions, most often the first token(s) in LLMs or the [CLS] token in vision transformers. In LLMs, this is systemic: the initial tokens lie in the causal receptive field of all subsequent queries, so the model learns to route background or “waste” attention mass onto them during training. Occlusion and ablation studies confirm that it is absolute position, not content, that determines sink status—for example, perplexity is largely recovered by inserting meaningless tokens at the initial positions after removal (Xiao et al., 2023).
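The forced allocation is easy to see numerically; a minimal numpy sketch of plain softmax, for illustration only:

```python
import numpy as np

def softmax(scores):
    """Standard softmax over a vector of attention scores."""
    w = np.exp(scores - scores.max())  # shift for numerical stability
    return w / w.sum()

# Even when every key is equally irrelevant (all scores strongly
# negative), the sum-to-one constraint still distributes a full unit of
# attention mass -- this leftover mass is what lands on sink tokens.
weights = softmax(np.full(4, -30.0))
```

Each of the four keys receives 0.25 despite none being relevant; the relaxations discussed below (Elastic-Softmax, Softpick) remove exactly this constraint.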

In vision transformers, the [CLS] token acts as an attention sink, causing attention mass to centralize disproportionately, as shown by heatmaps and quantitative diagnostics (e.g., attention allocated to [CLS] exceeds patch average by more than an order of magnitude at all layers) (Feng et al., 9 Apr 2025). Empirical studies reveal that attention sinks also occur at later positions or at formatting/punctuation tokens, not just the initial position, in LLMs (Yu et al., 2024).

2. Architectural Realizations and Mathematical Formulation

Several classes of sink-attention gating modules have been developed to mitigate or eliminate attention sinks. The following table summarizes principal methods and their mathematical character:

| Module Type | Key Mechanism | Normalization/Activation |
|---|---|---|
| Softmax relaxation | Elastic-Softmax, Softpick | Weights may sum to < 1; ReLU clamp |
| Post-SDPA gating | Sigmoid gate on value state | σ(W_g X) or σ(W_g V), elementwise |
| Structural masking | Patch/[CLS] separation | Mask [CLS] out of patch self-attention |
| Cache design | Retain sinks in KV cache | Explicit sink KV cache + rolling window |
Softmax relaxation modules, such as Elastic-Softmax (“Lazy Attention”) (Fu et al., 1 Jan 2026) and Softpick (Zuhri et al., 29 Apr 2025), modify the standard attention normalization. For Elastic-Softmax, given the softmax-normalized weights $w_{ij}^{(h)}$ for query $i$ and head $h$, the output $\alpha_{ij}^{(h)}$ is computed as:

$$\alpha_{ij}^{(h)} = \operatorname{ReLU}\left(w_{ij}^{(h)} + \frac{\tau^{(h)}}{i}\right),$$

where $\tau^{(h)}$ is a learnable head-specific offset. This mechanism enables the model to allocate zero mass to all tokens if none are relevant, thus explicitly avoiding forced dumps onto sink positions.
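A minimal numpy sketch of this rule (the 1-based position index and function signature are assumptions for illustration):

```python
import numpy as np

def elastic_softmax(scores, tau, query_pos):
    """Elastic-Softmax sketch: ordinary softmax weights plus a learnable
    per-head offset tau / i, clamped at zero by ReLU.  tau is typically
    negative (initialized to -1 per the text); query_pos is the 1-based
    query position i."""
    w = np.exp(scores - scores.max())  # shifted for numerical stability
    w = w / w.sum()
    return np.maximum(w + tau / query_pos, 0.0)
```

With four equal scores and tau = -1, the first query position zeroes every weight (total mass 0), while position 10 keeps 0.15 per key: the model can attend to nothing when nothing is relevant.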

Softpick (Zuhri et al., 29 Apr 2025) replaces softmax with a rectified form:

$$\mathrm{Softpick}(x)_i = \frac{\operatorname{ReLU}(e^{x_i} - 1)}{\sum_j |e^{x_j} - 1|}.$$

This outputs strictly sparse attention weights and no longer enforces a sum-to-one constraint.
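A direct numpy transcription of the formula (a naive sketch; production kernels would handle overflow for large scores):

```python
import numpy as np

def softpick(x):
    """Softpick sketch: ReLU-rectified numerator over an absolute-value
    denominator.  Weights are exactly zero wherever x_i <= 0, and the
    weights need not sum to one.  Naive version: exp() can overflow for
    large scores."""
    e = np.exp(x) - 1.0
    num = np.maximum(e, 0.0)
    den = np.abs(e).sum()
    return num / max(den, 1e-12)  # guard the all-zero-score edge case
```

For scores [0, 1, -1] only the middle key receives mass (about 0.73), and the total stays below one: sparsity and a relaxed normalization in a single operation.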

Gating-based modules such as Value-State Gated Attention (VGA) (Bu et al., 10 Oct 2025), sigmoid post-SDPA gating (Qiu et al., 10 May 2025), and “catch, tag, release” gating (Zhang et al., 2 Feb 2025) compute a data-dependent gate per value vector (either from the value itself or the query/key sequence). For VGA:

$$g_j = \sigma(W_g^\top V_j + b_g), \qquad z_i = \sum_j \alpha_{ij}\,(g_j V_j),$$

where $g_j$ modulates the contribution of each value vector $V_j$, severing the pathological reinforcement between attention sinks and value-state collapse.
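This computation can be sketched with a scalar gate per value vector (shapes are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def value_gated_attention(alpha, V, W_g, b_g=0.0):
    """Value-State Gated Attention sketch: each value vector V_j is scaled
    by a data-dependent sigmoid gate computed from V_j itself, then mixed
    by the (already-normalized) attention weights.
    alpha: (n_q, n_k); V: (n_k, d); W_g: (d,) giving one scalar gate per value."""
    g = sigmoid(V @ W_g + b_g)          # (n_k,) gate per value vector
    return alpha @ (V * g[:, None])     # z_i = sum_j alpha_ij (g_j V_j)
```

Because the gate is computed from the value state rather than the residual-stream input, a near-zero gate lets a head contribute nothing even when softmax forces attention mass onto it.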

Structural masking, as in EDIT (Feng et al., 9 Apr 2025), physically separates patch token self-attention from [CLS] token communication, moving the gating role to a layer-aligned encoder-decoder cross-attention pathway. This prevents patches from acting as accidental attention sinks.

In efficient streaming LLMs (e.g., StreamingLLM (Xiao et al., 2023)), sink-attention gating is achieved by always retaining a fixed number $S$ of initial KV pairs alongside a sliding window of length $W$, and optionally by introducing a dedicated learned sink token or a mathematically equivalent “softmax-off-by-one” denominator augmentation.
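The cache policy reduces to simple index bookkeeping; a sketch (function name and signature are illustrative, not the library's API):

```python
def streaming_cache_positions(t, num_sink, window):
    """StreamingLLM-style KV retention sketch: always keep the first
    `num_sink` positions (the attention sinks) plus a rolling window of
    the most recent `window` positions, out of t tokens seen so far."""
    sinks = list(range(min(num_sink, t)))
    recent = list(range(max(num_sink, t - window), t))
    return sinks + recent
```

With S = 4 sinks and a window of 3 over 10 tokens, positions [0, 1, 2, 3, 7, 8, 9] are retained and everything in between is evicted, which is what keeps perplexity flat past the window limit.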

3. Empirical Effects, Performance, and Model Properties

Sink-attention gating modules have been demonstrated to yield multiple desirable effects across models and tasks:

  • Elimination or drastic reduction of sink rates: Baselines report first-token sink rates of 46–63%, which are reduced to ∼4% with sigmoid gating (Qiu et al., 10 May 2025) and to effectively 0% with Softpick (Zuhri et al., 29 Apr 2025) and Elastic-Softmax (Fu et al., 1 Jan 2026).
  • True attention sparsity: Elastic-Softmax and Softpick consistently produce sparse attention matrices, with ~47–60% of attention weights being exactly zero (Zuhri et al., 29 Apr 2025, Fu et al., 1 Jan 2026).
  • Long-context generalization: With sink-aware cache retention, StreamingLLM matches full-recomputation baselines in perplexity on streams of up to 4M tokens, where standard window attention collapses (Xiao et al., 2023); Lazy Attention (Elastic-Softmax) and sigmoid gating modules likewise maintain strong long-context extrapolation (Qiu et al., 10 May 2025, Fu et al., 1 Jan 2026).
  • Stability and scaling: Inducing nonlinearity and gating in attention projections increases training stability, enables the use of larger learning rates, and allows deeper networks without divergence (Qiu et al., 10 May 2025).
  • Quantization and activation outlier control: Rectified/elastic gating or value-gated modules suppress activation outliers and kurtosis (e.g., softmax kurtosis ≈33510 vs. Softpick ≈340), improving both quantization fidelity and post-training INT8/2–4 bit performance (Bu et al., 10 Oct 2025, Zuhri et al., 29 Apr 2025).

Key quantitative results include:

  • StreamingLLM: window-attention perplexity jumps to ~5,000 beyond the window limit, while StreamingLLM (with $S=4$) maintains ∼5–6 (Xiao et al., 2023).
  • VGA: after INT8 post-training quantization, vanilla BERT's perplexity jumps to 789, while VGA remains at 4.64 (Bu et al., 10 Oct 2025).
  • EDIT: ImageNet-1K Top-1 accuracy increases by 0.1–0.5%, and segmentation mIoU increases by up to 2.0 points over DeiT3 (Feng et al., 9 Apr 2025).

4. Algorithmic Design, Integration, and Practical Variants

Sink-attention gating modules can be integrated at various locations in the network pipeline, with design trade-offs including parameter cost, computational overhead, and composability. Salient design patterns include:

  • Post-attention gating: A layer- or head-specific sigmoid gate applied after SDPA and before output projection; elementwise (per-token and per-feature) gating yields higher expressivity but increased parameters (Qiu et al., 10 May 2025).
  • Softmax replacement: Elastic-Softmax and Softpick require only small kernel modifications—one offset addition, ReLU, and altered normalization in fused GPU kernels; negligible parameter or runtime overhead (Zuhri et al., 29 Apr 2025, Fu et al., 1 Jan 2026).
  • Structural modules: In GFSSM (Meng et al., 2024), attention-sink gating is realized via parallel "sink state" skip connections, aiding SSMs in mitigating vanishing/exploding state recurrences and carrying block-level anchor information. The cost is $O(QTn)$ for small block size $Q$.
  • Stream management: In StreamingLLM, a fixed number of initial tokens are always kept in cache, enabling compatibility of pretrained sliding-window LLMs with infinite-length streams (Xiao et al., 2023).
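The post-attention gating pattern from the list above can be sketched in a few lines (the elementwise variant; shapes and the choice of gating on the layer input X are illustrative assumptions):

```python
import numpy as np

def post_sdpa_gate(sdpa_out, X, W_g, b_g=0.0):
    """Post-SDPA elementwise gating sketch: a sigmoid gate computed from
    the layer input X rescales the attention output per token and per
    feature, before the output projection.
    sdpa_out, X: (n, d); W_g: (d, d) for the full elementwise gate."""
    g = 1.0 / (1.0 + np.exp(-(X @ W_g + b_g)))
    return g * sdpa_out
```

A head- or layer-wise scalar gate is the cheaper end of the same trade-off: fewer parameters, less expressivity.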

Hyperparameters for gating modules (e.g., gate rank $r$ in low-rank sink gates (Zhang et al., 2 Feb 2025), offset $\tau^{(h)}$ in Elastic-Softmax) have been empirically tuned: $r = 4$–$8$ gives a negligible performance drop, and $\tau^{(h)}$ is initialized to $-1$.
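A low-rank sink gate of rank r can be sketched as a factorized sigmoid gate (the factorization and shapes are assumptions for illustration):

```python
import numpy as np

def low_rank_gate(X, A, B, b=0.0):
    """Low-rank gate sketch: the full gate matrix W_g is replaced by a
    rank-r factorization A @ B, keeping parameter cost at 2*d*r instead
    of d*d (r = 4-8 suffices per the text).
    X: (n, d); A: (d, r); B: (r, d)."""
    z = X @ A @ B + b
    return 1.0 / (1.0 + np.exp(-z))
```

This is the same parameter-efficiency trick as LoRA-style adapters, which is why such gates compose naturally with low-rank fine-tuning.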

Training-free approaches, such as ACT (Yu et al., 2024), detect sinks in softmax attention matrices at inference time, reduce their strength, and redistribute the mass, yielding accuracy improvements of up to +7.30% (Llama-30B on HellaSwag); architectural gating modules generalize this principle into fully trainable components.
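The detect-and-redistribute idea can be illustrated with a toy routine (a sketch in the spirit of ACT, not the published algorithm; the threshold and scaling heuristics here are invented for illustration):

```python
import numpy as np

def calibrate_sinks(attn, sink_thresh=0.3, keep=0.5):
    """Training-free sink calibration sketch: columns whose mean attention
    exceeds `sink_thresh` are treated as sinks, their weights are scaled
    by `keep`, and the removed mass is redistributed proportionally over
    the non-sink keys, so each row still sums to one.
    attn: (n_q, n_k) row-stochastic attention matrix."""
    col_mean = attn.mean(axis=0)
    sink = col_mean > sink_thresh
    out = attn.copy()
    removed = out[:, sink].sum(axis=1) * (1.0 - keep)   # mass taken per query
    out[:, sink] *= keep
    nonsink_mass = out[:, ~sink].sum(axis=1, keepdims=True)
    out[:, ~sink] += out[:, ~sink] / np.maximum(nonsink_mass, 1e-12) * removed[:, None]
    return out
```

Because the operation is a post-hoc transform of the attention matrix, it needs no retraining; the trainable gating modules above fold the same effect into the forward pass.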

5. Theoretical Analysis and Connection to Outlier Features

Sink-attention gating modules are closely linked to the emergence and regulation of outlier features—dimensions in the value vector space that carry exceptionally large magnitudes due to sink-head copying. In the formal “catch, tag, release” mechanism (Zhang et al., 2 Feb 2025), an attention sink enables subdivision of the sequence (catch), injects an outlier tag (tag), and then triggers selective release in downstream layers. Explicit gating blocks can learn to detect and control these cycles, and low-rank parameterization is sufficient for effective gating.

Gradient-level analysis shows that gating value vectors by their own content, as in VGA (Bu et al., 10 Oct 2025), can break the pathological reinforcement loop between attention sinks and value-state collapse more effectively than gating on the input $X$ or key projections. In the VGA framework, gradients with respect to $V_j$ vanish when both attention mass and gate output go to zero. Thus, the model can avoid value-state collapse and associated no-op heads.

Lazy Attention (Fu et al., 1 Jan 2026) and Softpick (Zuhri et al., 29 Apr 2025) can be interpreted as enforcing an explicit “kill switch” that the model can invoke to suppress attention on all tokens, thereby mitigating underload-induced sinks at the normalization level.

6. Application Domains and Extensions

Sink-attention gating modules are instantiated in various modalities and architectures:

  • Autoregressive language modeling: Windowed streaming, retrieval-augmented LLMs, quantization-friendly LMs, and ultra-long-context models (Xiao et al., 2023, Fu et al., 1 Jan 2026, Zuhri et al., 29 Apr 2025).
  • Vision Transformers: Encoder-decoder separation for [CLS] sink suppression, interpretable feature pooling, semantic segmentation, and transfer learning (Feng et al., 9 Apr 2025).
  • Structured State Space Models: Block prompt sink-gating in Grouped FIR SSMs (GFSSM) with linear complexity (Meng et al., 2024).
  • Compression and LoRA-style fine-tuning: Low-rank sink gates yield strong parameter efficiency in restoring or enhancing few-shot performance (Zhang et al., 2 Feb 2025).
  • Inference-only sink suppression: ACT and its learnable module variants can be used to recalibrate and suppress harmful sinks without retraining (Yu et al., 2024).

Further, the inclusion of explicit sink tokens or register vectors during pretraining (e.g., learnable <SINK> tokens) induces the model to centralize residual attention mass on a dedicated vector, reducing the number of cached tokens required at inference and enabling true infinite-length extrapolation (Xiao et al., 2023).

In summary, sink-attention gating modules constitute a principled solution to pathological attention allocation in global and streaming sequence models. They couple theoretical rigor, algorithmic simplicity, and broad empirical impact, serving as a crucial mechanism for sparsity, stability, and generalization in state-of-the-art Transformer and SSM architectures.
