
Rank Collapse in Self-Attention Models

Updated 31 January 2026
  • Rank collapse occurs when token-wise representations converge to a rank-one subspace, drastically reducing model expressivity.
  • Spectral analysis reveals rapid decay of heterogeneity measures and singular values across layers, highlighting the impact of architectural choices.
  • Architectural remedies such as residual connections, scaling, and spectral regularization are essential to mitigate collapse and maintain robust gradient flow.

Rank collapse in self-attention denotes the phenomenon in which the token-wise representations produced by successive attention layers become increasingly uniform, ultimately converging toward a rank-one subspace in feature space. In this regime, the ability of the model to distinguish between different tokens diminishes, leading to reduced expressivity, impaired gradient propagation, and severe bottlenecks for training deep or wide transformer stacks. This effect is analytically characterized by the exponential or doubly-exponential decay of "heterogeneity" measures—such as the Frobenius residual to the token mean, average inter-token angle, or the second singular value—of the representation matrix across layers. Though initially identified for pure self-attention architectures, rank collapse persists under a variety of masking, normalization, and skip-connection schemes and has been rigorously connected to architectural choices, initialization, eigenspectrum of query-key matrices, and context length. A comprehensive understanding of rank collapse yields design principles for transformer variants and related sequence models.

1. Mathematical Formulation and Convergence Rates

A single self-attention layer transforms a sequence matrix $X \in \mathbb{R}^{N \times d}$ according to

$$A(X) = \operatorname{softmax}\!\left(\frac{X W_Q (X W_K)^T}{\sqrt{d}} \right) V, \qquad X^{(t+1)} = A(X^{(t)}).$$

Rank collapse is formally measured via the Frobenius-norm residual to the token mean,

$$\mu(X) = \left\| X - \frac{1_N 1_N^T}{N} X \right\|_F,$$

which vanishes iff all rows of $X$ are identical (rank-one). Alternatively, collapse is characterized by $\sigma_2(X) \to 0$, where $\sigma_2$ is the second largest singular value. For fully bidirectional, positive self-attention matrices $A^{(t)}$, ergodicity guarantees exponential contraction: $\mu(X^{(t)}) \le C (1-\epsilon^N)^{t/N}$ for some $\epsilon > 0$ determined by bounds on $A^{(t)}$, and similarly for the singular-value decay (Wu et al., 2024).

In pure multi-head self-attention networks with no skip connections or MLP, stacking $L$ layers yields doubly-exponential contraction of the residual:

$$\|\operatorname{res}(X^L)\|_{1,\infty} \le \left( \frac{4\gamma\beta}{\sqrt{n}} \right)^{(3^L-1)/2} \|\operatorname{res}(X^0)\|_{1,\infty}^{3^L},$$

where $\beta$ aggregates norms of weight matrices and $\gamma$ controls the fluctuation of attention entries (Dong et al., 2021). This strong inductive bias toward token uniformity is confirmed by empirical ablations and analytic path decompositions.
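The contraction is easy to observe numerically. The sketch below (an illustration, not code from the cited papers) iterates a pure self-attention layer with small weight norms, the regime covered by the bound above, and tracks the residual $\mu(X)$ across depth:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention_layer(X, Wq, Wk, Wv):
    d = X.shape[1]
    A = softmax(X @ Wq @ (X @ Wk).T / np.sqrt(d))  # row-stochastic map
    return A @ X @ Wv

def mu(X):
    # Frobenius residual to the token mean; zero iff all rows coincide
    return np.linalg.norm(X - X.mean(axis=0, keepdims=True))

rng = np.random.default_rng(0)
N, d, L = 16, 32, 12
X = rng.standard_normal((N, d))
Wq = rng.standard_normal((d, d)) / np.sqrt(d)
Wk = rng.standard_normal((d, d)) / np.sqrt(d)
Wv = 0.5 * rng.standard_normal((d, d)) / np.sqrt(d)  # small norms (small beta)

residuals = [mu(X)]
for _ in range(L):
    X = attention_layer(X, Wq, Wk, Wv)
    residuals.append(mu(X))
print(residuals[0], residuals[-1])  # residual shrinks sharply with depth
```

The absence of skip connections and MLPs is what makes the decay so fast; Section 4 revisits the same iteration with residuals added back.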

2. Geometric and Spectral Mechanisms

Softmax attention matrices are inherently low-rank for realistic query/key distributions, and their singular spectra exhibit rapid decay that intensifies with layer depth. At initialization, random matrix theory applies to the attention map $A \in \mathbb{R}^{T \times T}$, which is row-stochastic with a dominant singular outlier $s_1 = 1$ arising from the Perron-Frobenius theorem. The remaining spectrum forms a quarter-circular bulk with edge $s_2 = 2\sigma_A/\sqrt{T}$:

  • In depth: repeated application of $A$ projects everything onto the top singular direction.
  • In width: as the context length $T \to \infty$, the effective rank collapses: $\operatorname{sr}(\Sigma_\ell) \to 1$ for the representation covariance matrices $\Sigma_\ell$ (Saada et al., 2024).

This spectral gap elucidates not only rank collapse in depth but also the newly characterized width-induced collapse, where increasing context obliterates signal diversity among tokens, further compounding vanishing and exploding gradients.
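The outlier-plus-bulk structure can be checked directly on a random row-stochastic map (a minimal sketch, not tied to any particular trained model):

```python
import numpy as np

rng = np.random.default_rng(1)
T = 512
logits = rng.standard_normal((T, T))
A = np.exp(logits - logits.max(axis=1, keepdims=True))
A /= A.sum(axis=1, keepdims=True)        # row-stochastic softmax map

s = np.linalg.svd(A, compute_uv=False)
print(f"s1 = {s[0]:.3f}, s2 = {s[1]:.3f}")  # outlier near 1, small bulk edge
```

Increasing `T` shrinks the bulk edge like $1/\sqrt{T}$ while the outlier stays pinned near 1, which is the spectral gap driving both depth- and width-induced collapse.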

3. Factors Modulating Collapse: Masks, Normalization, Initialization

Attention masks strongly modulate collapse rates. Sparse, local, or causal masks yield a masking graph $G$ of diameter $r$, leading to

$$\mu(X^{(t)}) \le C (1-\epsilon^r)^{t/r},$$

with larger $r$ (local attention) slowing collapse relative to global attention ($r = 1$, which collapses fastest) (Wu et al., 2024). LayerNorm applied post-attention does not generically prevent collapse; under orthogonal value matrices and open-hemisphere initializations, rank collapse to the unit sphere proceeds at exponential rates. Nevertheless, value-matrix choices can lead to nontrivial equilibrium configurations, supporting exact rank-$k$ attractors for $k > 1$, and certain counterexamples explicitly prevent collapse even at minimal sequence sizes.
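The diameter effect can be illustrated with a deliberately simplified model: uniform attention over the mask (a stand-in for the softmax map, chosen so the comparison is deterministic). Global attention averages all tokens in one step, while a banded local mask mixes information slowly:

```python
import numpy as np

def averaging_step(X, mask):
    A = mask / mask.sum(axis=1, keepdims=True)  # uniform attention over mask
    return A @ X

def mu(X):
    return np.linalg.norm(X - X.mean(axis=0, keepdims=True))

rng = np.random.default_rng(2)
N, d, steps = 32, 16, 10
X0 = rng.standard_normal((N, d))
idx = np.arange(N)
masks = {
    "global": np.ones((N, N)),                                    # diameter 1
    "local": (np.abs(idx[:, None] - idx[None, :]) <= 2).astype(float),
}
final = {}
for name, mask in masks.items():
    X = X0.copy()
    for _ in range(steps):
        X = averaging_step(X, mask)
    final[name] = mu(X)
    print(name, final[name])
```

Global averaging drives the residual to machine precision immediately, while the local band (larger graph diameter) retains a substantial residual after the same number of steps, mirroring the $t/r$ exponent in the bound.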

Initialization scale—the variance of query/key matrices—governs both condensation and rank collapse. Small initial weights prolong an initial condensation regime (outer parameter alignment), after which key and query matrices are driven toward low-rank limits via linearized gradient flow. The two-stage analysis predicts transitions in empirical training curves and shows that tailored regularizers (e.g., orthogonality penalties, dropout) and architectural features (multi-head diversity) can modulate collapse (Chen et al., 8 Oct 2025).

4. Architectural Remedies: Residuals, Scaling, and Eigenspectrum Regularization

Residual connections prevent the doubly-exponential collapse of pure attention, as the model always retains the possibility to follow a length-zero path that preserves higher-rank content (Dong et al., 2021). However, Alman & Song demonstrate that skip-connections alone do not suffice: if all weight norms are small, the network still undergoes "layer collapse," reducing a deep transformer to a shallow equivalent (2505.16284). Only sufficiently large weights can maintain depth-dependent expressivity.

Lambda-skip connections instantiate a layerwise update

$$\tilde Y^{(k)} = \lambda X^{(k)} + O^{(k)}, \qquad Y^{(k)} = D^{(k)}\!\left(\tilde Y^{(k)}\right),$$

where $D^{(k)}$ is row-wise LayerNorm. Analytic conditions on $\lambda$ guarantee that the residual norm $\mu(Y^{(K)})$ remains above $a^K$ times that of the input, precluding collapse for appropriately chosen $\lambda$ (Joseph et al., 2024). Empirical validation on pretrained language and state-space models shows $\lambda$-controlled stabilization of token-similarity statistics, with gating and parameterized residuals essential for robust depth scaling.
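A minimal sketch of the update, with the attention sublayer $O^{(k)}$ idealized as global averaging (its fully collapsed limit, the worst case for rank preservation), shows the role of $\lambda$:

```python
import numpy as np

def layernorm(Y, eps=1e-5):
    m = Y.mean(axis=-1, keepdims=True)
    v = Y.var(axis=-1, keepdims=True)
    return (Y - m) / np.sqrt(v + eps)

def lambda_skip_layer(X, lam):
    # Attention sublayer idealized as global averaging (worst case)
    O = np.repeat(X.mean(axis=0, keepdims=True), X.shape[0], axis=0)
    return layernorm(lam * X + O)   # Y = D(lambda * X + O)

def mu(X):
    return np.linalg.norm(X - X.mean(axis=0, keepdims=True))

rng = np.random.default_rng(3)
X0 = rng.standard_normal((16, 32))
res = {}
for lam in (0.0, 2.0):
    X = X0.copy()
    for _ in range(20):
        X = lambda_skip_layer(X, lam)
    res[lam] = mu(X)
    print(f"lambda = {lam}: mu(Y) = {res[lam]:.6f}")
```

With $\lambda = 0$ the residual vanishes after a single layer; with $\lambda = 2$ it stays bounded below by a geometric factor of the input residual (here each layer shrinks it by at most $2/3$), consistent with the $a^K$ lower bound.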

Spectral regularization, notably the LocAteR loss

$$L = J + \kappa_1 \lvert\operatorname{tr}(W)\rvert + \kappa_2 \left(\operatorname{tr}(W) - 1\right)^2,$$

can shrink the eigenspectrum variance of the query-key matrix, simultaneously preventing rank and entropy collapse and enforcing attention localization (Bao et al., 2024). This reconciles disparate failure modes and yields strong empirical improvements in expressivity and trainability.
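A hedged sketch of the penalty term, assuming $W$ denotes the combined query-key matrix $W_Q W_K^T$ and $J$ the task loss (the exact definitions follow the cited paper):

```python
import numpy as np

def locater_penalty(Wq, Wk, k1=0.01, k2=0.01):
    # Trace-based regularizer on the combined query-key matrix (assumed form)
    W = Wq @ Wk.T
    tr = np.trace(W)
    return k1 * abs(tr) + k2 * (tr - 1.0) ** 2

rng = np.random.default_rng(4)
d = 8
Wq = rng.standard_normal((d, d))
Wk = rng.standard_normal((d, d))
penalty = locater_penalty(Wq, Wk)
print(penalty)
```

In training, this penalty would simply be added to the task loss $J$ and differentiated through $W_Q$ and $W_K$ by the autodiff framework in use.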

5. Impact of Context Length, Scaling, and Embedding Bottlenecks

As the context length $n$ increases, rank collapse occurs in width: attention scores flatten unless the logits are rescaled by a critical factor $\beta_n \asymp \log n$. Below the threshold, rank collapse is instantaneous; above it, attention becomes the identity and cross-token mixing is lost. Only at the critical scaling do sparse, content-adaptive patterns persist (Chen et al., 7 Oct 2025). This phase transition underlies recent practical recommendations, e.g., Qwen, SSMax, and SWAN-GPT.
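A toy comparison of plain versus $\log n$-scaled softmax over a long context (the exact SSMax-style formulation may differ; this only illustrates the flattening effect and its cure):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(5)
n = 4096
logits = rng.standard_normal(n)

p_plain = softmax(logits)               # flattens toward uniform as n grows
p_scaled = softmax(np.log(n) * logits)  # beta_n ~ log n keeps attention sharp
print(p_plain.max(), p_scaled.max())
```

Without rescaling, the peak attention weight decays toward $1/n$; with the $\log n$ factor, a non-vanishing fraction of the mass stays on the strongest key.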

Embedding rank bottlenecks also induce effective collapse. If the vocabulary size or embedding rank $r$ satisfies $r < d_x$ (the model width), self-attention matrices and representations inherit the low rank, leading to expressivity loss beyond $r$; depth favors expressivity over width in these regimes. This phenomenon explains architectural preferences across domains: NLP (large $V$) supports wide, shallow models, while vision (small $V$) and bioinformatics require deep, narrow configurations (Wies et al., 2021).
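The bottleneck is a plain linear-algebra fact, sketched below with an artificially rank-limited embedding table (sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
V, r, d_x, N = 50, 8, 64, 32        # vocab size, embedding rank, width, length

# Rank-r embedding table: every token vector lies in an r-dim subspace
E = rng.standard_normal((V, r)) @ rng.standard_normal((r, d_x))
tokens = rng.integers(0, V, size=N)
X = E[tokens]                        # sequence matrix inherits rank <= r

print(np.linalg.matrix_rank(X))      # at most r, regardless of d_x
```

Any linear map applied to `X` (queries, keys, values) preserves this bound, so width beyond $r$ adds no representational rank; only depth (through nonlinearities) can recover expressivity.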

6. Alternate Views: Kernel-SVD, Diffusion, and Efficient Architectures

Self-attention may be interpreted as a kernel machine, and its empirical low-rank property motivates efficient approximations. Linformer leverages random projections to compress the $n \times n$ attention map to rank $O(\log n)$, reflecting the spectral decay seen in canonical architectures and enabling hardware-efficient implementations (Wang et al., 2020). Primal-Attention explicitly maximizes projected variances, enforces sharper singular-value decay via asymmetric kernel-SVD regularization, and achieves higher end-task accuracy with empirical constraints on smaller singular modes (Chen et al., 2023).
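A minimal sketch of the Linformer idea, projecting keys and values along the length dimension with random matrices $E, F$ so the attention map is $n \times k$ rather than $n \times n$ (projection shapes and scaling are illustrative simplifications of the published method):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def linformer_attention(Q, K, V, k, rng):
    n, d = Q.shape
    E = rng.standard_normal((k, n)) / np.sqrt(k)   # key projection
    F = rng.standard_normal((k, n)) / np.sqrt(k)   # value projection
    A = softmax(Q @ (E @ K).T / np.sqrt(d))        # n x k attention map
    return A @ (F @ V)                             # n x d output

rng = np.random.default_rng(6)
n, d, k = 256, 32, 16
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out = linformer_attention(Q, K, V, k, rng)
print(out.shape)
```

The attention matrix now has rank at most $k$, so cost drops from $O(n^2)$ to $O(nk)$; the cited empirical spectra justify choosing $k = O(\log n)$ with little accuracy loss.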

A recently unified viewpoint treats the global self-attention update in transformers as a degenerate diffusion on the token-feature sphere, converging toward a Dirac measure at rate $O(1/L)$. This continuous-time PDE model further predicts effective-rank trajectories and demonstrates that periodic token merging slows collapse both analytically and empirically, motivating interventions at the dynamics level (Li et al., 25 Dec 2025).

7. Implications for Transformer and Sequence Model Design

The universality of rank collapse implies that pure self-attention networks suffer from severe bottlenecks in both depth and width. Practical guidance includes:

  • Always employ nontrivial residual (skip) connections, but ensure weight magnitudes are sufficient to avoid layer collapse (2505.16284).
  • Use LayerNorm and robust value-matrix choices to expand the set of expressible equilibria; with adversarial selection, rank-$k$ attractors can be induced for arbitrary $k$ (Wu et al., 2024).
  • Regularize the QK eigenspectrum to maximize trace and minimize variance, reconciling localization with entropy and rank requirements (Bao et al., 2024).
  • Scale attention logits by $\log n$ for long-context transformers to maintain gradient flow and avoid collapse (Chen et al., 7 Oct 2025).
  • Design embedding and value dimensions with attention to vocabulary rank bottlenecks, deepening rather than widening when necessary (Wies et al., 2021).
  • For hardware or memory-efficient variants, exploit the low-rank nature of attention spectra as in Linformer, but avoid excessive collapse via dynamic regularization and architectural safeguards (Wang et al., 2020, Chen et al., 2023).

Contemporary research emphasizes spectral diagnostics, dynamical systems perspectives, and context- or domain-adaptive remedies as central to mitigating rank collapse and preserving model expressivity in deep self-attention-based architectures.
