BOS Sink Phenomenon in Transformer Models
- BOS Sink Phenomenon is a pattern in transformers where the beginning-of-sequence token attracts 50–80% of self-attention across layers.
- It arises from geometric and statistical factors that lead to representational collapse, redundancy, and controlled token mixing.
- Mitigation and exploitation strategies include regularization, orthogonal token selection, and head pruning to enhance model efficiency and safeguard against adversarial risks.
The BOS sink phenomenon—often called the "attention sink" or simply "sink"—is a pervasive and theoretically profound pattern in modern sequence models employing attention mechanisms, particularly large transformer-based LLMs. In the canonical case, the beginning-of-sequence (BOS) token, which typically carries negligible semantic content, consistently attracts a disproportionate fraction of self-attention from other tokens across layers and heads. This behavior is rooted in fundamental architectural, statistical, and geometric principles, and has significant consequences for model efficiency, compression, representational dynamics, and even security.
1. Defining the BOS Sink Phenomenon
In transformer models, multi-head self-attention routes each query token's representation across all keys in the input sequence. Empirically, starting from the shallow layers and persisting or accelerating through depth, the BOS token (usually position 0 or 1) acts as a universal "sink": most tokens send a large portion of their attention mass there, often exceeding 50–80% by mid to late layers (Shin et al., 5 Jul 2025, Sok et al., 11 Jan 2026). Formally, for a row-stochastic attention map $A^{(l,h)} \in [0,1]^{n \times n}$ at layer $l$ and head $h$, the BOS sink behavior can be characterized as

$$s^{(l,h)} = \frac{1}{n} \sum_{i=1}^{n} A^{(l,h)}_{i,\,\mathrm{BOS}},$$

with the layer-wise average

$$s^{(l)} = \frac{1}{H} \sum_{h=1}^{H} s^{(l,h)}.$$
This "sink score" is robust across architectures, context lengths, model scales, optimization regimes, and even input randomness (Gu et al., 2024). Removal, masking, or disturbance of the BOS token typically degrades performance (Ruscio et al., 4 Aug 2025).
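As a concrete illustration, the per-head sink score can be computed directly from a row-stochastic attention map. This is a minimal NumPy sketch; the function name and toy matrix are illustrative, not taken from any of the cited papers:

```python
import numpy as np

def sink_score(attn: np.ndarray, bos_idx: int = 0) -> float:
    """Per-head sink score: mean attention mass that all query
    positions send to the BOS key column.

    attn: (n, n) row-stochastic attention map for one head.
    """
    return float(attn[:, bos_idx].mean())

# Toy 4-token attention map where every query dumps most mass on BOS.
attn = np.array([
    [1.00, 0.00, 0.00, 0.00],
    [0.80, 0.20, 0.00, 0.00],
    [0.70, 0.10, 0.20, 0.00],
    [0.60, 0.10, 0.10, 0.20],
])
print(sink_score(attn))  # ≈ 0.775
```

A score near 1 indicates a near-total sink head; a uniform map over $n$ tokens would score $1/n$.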
2. Geometric and Representational Foundations
The sink token is not merely an attention-matrix artifact, but is deeply tied to the geometry of deep representation learning. Cosine similarity analysis between layer-normalized hidden states shows that, as depth increases, all tokens' representations monotonically approach the BOS direction $h_{\mathrm{BOS}}^{(l)}$, which itself remains nearly fixed across depth:
- $\cos\big(h_i^{(l)}, h_{\mathrm{BOS}}^{(l)}\big)$ increases from 0.1–0.3 to 0.6–0.8 through the stack.
- $\cos\big(h_{\mathrm{BOS}}^{(l)}, h_{\mathrm{BOS}}^{(l+1)}\big)$ remains near 1 across layers $l$.
This implies a representational "collapse" or attractor phenomenon, with all token embeddings spiraling toward the static BOS vector. The geometric explanation ties the sink to the establishment of a reference frame: in high-dimensional space, transformers naturally assign a stable axis or anchor—embodied by the BOS token—around which other representations are organized (Ruscio et al., 4 Aug 2025). The softmax operation, by enforcing the probability-simplex constraint, further promotes sparsity and centralization of attention on such anchors.
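The alignment measurement above can be sketched in a few lines of NumPy; the synthetic hidden states below are constructed to be close to a fixed BOS direction, mimicking a deep layer (all names and toy data are illustrative):

```python
import numpy as np

def cos_to_bos(hidden: np.ndarray) -> np.ndarray:
    """Cosine similarity of each token's hidden state to the BOS
    hidden state within one layer. hidden: (n_tokens, d), row 0 = BOS."""
    h = hidden / np.linalg.norm(hidden, axis=-1, keepdims=True)
    return h @ h[0]

# Synthetic "deep layer": non-BOS tokens drawn near the BOS direction.
rng = np.random.default_rng(0)
bos = rng.normal(size=64)
tokens = np.stack(
    [bos] + [0.8 * bos + 0.2 * rng.normal(size=64) for _ in range(5)]
)
sims = cos_to_bos(tokens)
print(sims[0])   # 1.0 (BOS against itself)
print(sims[1:])  # high alignment for the collapsed tokens
```

In a real model, `hidden` would be the layer-normalized hidden states of one layer; tracking `sims` across layers reproduces the monotone rise described above.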
3. Functional Consequences: Redundancy, Compression, and Over-Mixing Control
The BOS sink effect has several core functional implications:
- Redundancy: Heads (and even entire layers) with high BOS sink scores serve as "dumping grounds" for attention mass but contribute little to functional routing or mixing, becoming superfluous in downstream computation (Sok et al., 11 Jan 2026). Pruning heads or layers ranked by sink score $s^{(l,h)}$ preserves predictive accuracy substantially better than magnitude- or activation-based criteria, especially in deeper layers.
- Over-Mixing Mitigation: Unchecked global attention promotes rank and representational collapse (over-mixing), destroying distinctions among tokens and degrading model expressivity. The BOS sink acts as a throttle, providing a "controlled no-op": heads saturated on BOS pass previous representations unchanged, slowing down mixing and preserving diversity (Barbero et al., 3 Apr 2025).
- Information Partitioning: The degree of orthogonality to the BOS direction can be leveraged to identify tokens carrying genuinely novel information versus those collapsing toward redundancy, a principle exploited in orthogonality-based token selection for inference speedup (Shin et al., 5 Jul 2025).
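The information-partitioning idea can be sketched as a simple selection rule: rank tokens by how orthogonal their hidden states are to the BOS direction and keep the most orthogonal ones. This is an illustrative reimplementation of the principle, not the OrthoRank code, and `select_informative_tokens` is a hypothetical name:

```python
import numpy as np

def select_informative_tokens(hidden: np.ndarray, k: int,
                              bos_idx: int = 0) -> np.ndarray:
    """Keep the k tokens whose hidden states are most orthogonal to
    the BOS hidden state (illustrative sketch, not the paper's code).

    hidden: (n_tokens, d) hidden states of one layer.
    Returns indices of the k least BOS-aligned tokens.
    """
    h = hidden / np.linalg.norm(hidden, axis=-1, keepdims=True)
    align = np.abs(h @ h[bos_idx])       # |cosine| to BOS, in [0, 1]
    order = np.argsort(align)            # most orthogonal first
    return order[order != bos_idx][:k]   # never select BOS itself

rng = np.random.default_rng(1)
bos = rng.normal(size=16)
collapsed = 0.95 * bos + 0.05 * rng.normal(size=16)  # redundant token
novel = rng.normal(size=16)                          # carries new info
hidden = np.stack([bos, collapsed, novel])
print(select_informative_tokens(hidden, k=1))  # [2] (the novel token)
```

Tokens that have collapsed toward the BOS direction score high on `align` and are dropped first, matching the intuition that they carry little novel information.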
4. Quantification, Induction, and Universality
Quantitative metrics are well specified for sink detection:
- Per-head Sink Score: $s^{(l,h)} = \frac{1}{n} \sum_{i=1}^{n} A^{(l,h)}_{i,\,\mathrm{BOS}}$, the mean attention mass that queries send to the BOS key.
- Model-level Sink Rate: the fraction of heads whose sink score exceeds a fixed threshold. Sink rates exceeding 30–40% are ubiquitous in autoregressive models beyond the smallest scales (routinely 70–99%) (Gu et al., 2024).
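A threshold-based sink rate can be sketched as follows; the threshold `eps = 0.3` is an illustrative assumption, not necessarily the value used by Gu et al.:

```python
import numpy as np

def sink_rate(attn_maps: np.ndarray, eps: float = 0.3,
              bos_idx: int = 0) -> float:
    """Fraction of heads whose mean attention to the BOS key exceeds
    eps (sketch of a threshold-based sink-rate metric).

    attn_maps: (n_heads, n, n) row-stochastic attention maps.
    """
    scores = attn_maps[:, :, bos_idx].mean(axis=1)  # per-head sink score
    return float((scores > eps).mean())

# Two heads: one a strong BOS sink, one diffuse.
sinky = np.full((4, 4), 0.05); sinky[:, 0] = 0.85
diffuse = np.full((4, 4), 0.25)
print(sink_rate(np.stack([sinky, diffuse])))  # 0.5
```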
Emergence: Sink behavior develops rapidly once sufficient data, optimization, and weight decay are applied during pretraining, saturating at large scale. The effect is stable across input domains, model scales, and even under strong variations in token content.
Cause: The underlying mechanism is the combination of softmax normalization and key/query parameterization. If any key $k_j$ yields logits $q_i \cdot k_j$ that consistently dominate, the softmax concentrates attention mass on that position, leading to a global "attention bias register." Removing softmax normalization—e.g., by using sigmoid attention without L1 normalization—completely abolishes the sink effect, even in large models, confirming it is not an inherent necessity but a byproduct of attention competition (Gu et al., 2024).
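The role of the softmax's simplex constraint can be seen in a toy comparison: with one modestly dominant logit, softmax forces the remaining keys to compete for the leftover probability mass, while an element-wise sigmoid scores each key independently and creates no sink column (a sketch, not any paper's implementation):

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# One query scored against 4 keys; key 0 (BOS-like) has a larger logit.
logits = np.array([4.0, 1.0, 1.0, 1.0])

soft = softmax(logits)               # competitive: mass must sum to 1
sig = 1.0 / (1.0 + np.exp(-logits))  # independent gates: no competition

print(soft.round(3))  # BOS-like key grabs ~87% of the mass
print(sig.round(3))   # every key keeps its own independent weight
```

Under softmax, the dominant key drains attention from all others; under sigmoid, the non-dominant keys retain weights above 0.7 because no normalization forces them to compete.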
5. Mitigation, Exploitation, and Model Design
A range of strategies are available, either to suppress, utilize, or redistribute the BOS sink effect.
- Mitigation via Regularization:
- Penalizing large attention mass on the BOS token during training suppresses emergent sinks (Shang et al., 19 Oct 2025).
- Dropout or randomization of BOS embeddings prevents the model from over-relying on a fixed anchor.
- Modifying positional encoding (e.g., NTK-aware scaling, ALiBi) distributes attention mass over multiple reference points, softening the centralized sink (Ruscio et al., 4 Aug 2025).
- Utilization:
- The stability and universality of the BOS vector make it an effective global representational readout or coordinate anchor for downstream tasks.
- Orthogonality to BOS—measured via the cosine similarity between a token's hidden state and the BOS direction—serves as a criterion for dynamic token selection (OrthoRank), enhancing inference efficiency by routing computation only through informative tokens (Shin et al., 5 Jul 2025).
- Exploitation for Compression:
- Reliable pruning of high-sink heads or layers identifies structurally redundant elements without major loss in perplexity or downstream accuracy, outperforming magnitude- or activation-based methods (Sok et al., 11 Jan 2026).
- Security and Adversarial Risks:
- Backdoor attacks targeting model unlearning leverage the BOS sink as a transmission gateway, with prefix triggers amplifying and propagating backdoor signals more efficiently due to the sink's "global amplifier" role. Mitigation requires attention sink auditing for adversarial risk (Shang et al., 19 Oct 2025).
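The regularization route listed under mitigation above can be sketched as an auxiliary loss on the BOS attention column; the quadratic form and weight `lam` are illustrative assumptions, not the exact objective of Shang et al.:

```python
import numpy as np

def bos_sink_penalty(attn_maps: np.ndarray, lam: float = 0.1,
                     bos_idx: int = 0) -> float:
    """Illustrative auxiliary loss: mean squared attention mass sent
    to the BOS column, scaled by lam. Added to the task loss during
    training to discourage emergent sinks (sketch only).

    attn_maps: (n_heads, n, n) row-stochastic attention maps.
    """
    bos_mass = attn_maps[:, :, bos_idx]
    return float(lam * np.mean(bos_mass ** 2))

# A uniform attention pattern incurs only a small baseline penalty.
uniform = np.full((1, 4, 4), 0.25)
print(bos_sink_penalty(uniform))  # 0.00625
```

In a training loop this term would be computed on the attention maps of each layer and summed into the total loss; sink-saturated heads contribute quadratically more than diffuse ones.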
6. Extension Beyond LLMs
BOS sink and related attention-sink phenomena extend to audio-visual transformers and other multimodal architectures. In audio-visual speech recognition, the BOS token and other intermediate tokens (e.g., prompt or modality markers) emerge as sinks, attracting not only attention but also massive activations in hidden-state features, amplified by the MLP pathway. These sinks align closely in hidden space (cosine similarity around 0.9), underscoring the correspondence between attention sinks and representational collapse (Anand et al., 26 Oct 2025). Decorrelation objectives that penalize cosine similarity to BOS suppress sink behavior and yield measurable improvements in downstream metrics such as word error rate.
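Such a decorrelation objective could be sketched as a penalty on the cosine similarity between non-BOS hidden states and the BOS hidden state; the squared-cosine form below is an illustrative assumption, not the exact objective of Anand et al.:

```python
import numpy as np

def bos_decorrelation_loss(hidden: np.ndarray, bos_idx: int = 0) -> float:
    """Sketch of a decorrelation objective: mean squared cosine
    similarity between non-BOS hidden states and the BOS hidden state.

    hidden: (n_tokens, d) hidden states of one layer.
    """
    h = hidden / np.linalg.norm(hidden, axis=-1, keepdims=True)
    cos = np.delete(h @ h[bos_idx], bos_idx)
    return float(np.mean(cos ** 2))

# Mutually orthogonal token states incur zero penalty.
eye = np.eye(3)
print(bos_decorrelation_loss(eye))  # 0.0
```

Minimizing this term pushes non-BOS representations away from the BOS direction, counteracting the representational collapse described above.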
7. Summary Table: Sink-Related Metrics and Implications
| Metric / Phenomenon | Formal Definition / Quantification | Implication |
|---|---|---|
| Sink score | $s^{(l,h)} = \frac{1}{n} \sum_{i} A^{(l,h)}_{i,\,\mathrm{BOS}}$ | Quantifies degree to which a head/layer is a BOS sink; enables pruning (Sok et al., 11 Jan 2026) |
| Cosine alignment to BOS | $\cos\big(h_i^{(l)}, h_{\mathrm{BOS}}^{(l)}\big)$ across depth | Orthogonality indicates token information value (Shin et al., 5 Jul 2025) |
| Representational collapse | Monotonic rise of alignment to the BOS direction through the stack | Warning sign of over-mixing; BOS sink acts as mitigation (Barbero et al., 3 Apr 2025) |
| Massive activations | Outlier-magnitude hidden-state features at sink tokens | Feature amplification in sinks; pointer to representational dominance (Anand et al., 26 Oct 2025) |
| Sink mitigation loss | Regularizer penalizing attention mass on the BOS token | Mitigation of BOS sink via targeted regularization (Shang et al., 19 Oct 2025) |
References
- (Shin et al., 5 Jul 2025) OrthoRank: Token Selection via Sink Token Orthogonality for Efficient LLM Inference
- (Sok et al., 11 Jan 2026) Garbage Attention in LLMs: BOS Sink Heads and Sink-aware Pruning
- (Barbero et al., 3 Apr 2025) Why do LLMs attend to the first token?
- (Ruscio et al., 4 Aug 2025) What are you sinking? A geometric approach on attention sink
- (Gu et al., 2024) When Attention Sink Emerges in LLMs: An Empirical View
- (Anand et al., 26 Oct 2025) Mitigating Attention Sinks and Massive Activations in Audio-Visual Speech Recognition with LLMs
- (Shang et al., 19 Oct 2025) Forgetting to Forget: Attention Sink as A Gateway for Backdooring LLM Unlearning