
Rank, Head-Channel Non-Identifiability, and Symmetry Breaking: A Precise Analysis of Representational Collapse in Transformers

Published 26 Apr 2026 in cs.LG, cs.CL, and stat.ML | (2604.23681v1)

Abstract: A widely cited result by Dong et al. (2021) showed that Transformers built from self-attention alone, without skip connections or feed-forward layers, suffer from rapid rank collapse: all token representations converge to a single direction. The proposed remedy was the MLP. We show that this picture, while correct in the regime studied by Dong, is incomplete in ways that matter for architectural understanding. Three results are established. First, layer normalisation is precisely affine-rank-neutral: it preserves the affine rank of the token representation set exactly. The widespread claim that LN "plays no role" is imprecise; the correct statement is sharper. Second, residual connections generically obstruct rank collapse in real Transformers such as BERT-base, in a measure-theoretic sense, without contribution from the MLP. The MLP's irreplaceable function is different: generating feature directions outside the linear span of the original token embeddings, which no stack of attention layers can produce. Third, a phenomenon distinct from rank collapse is identified: head-channel non-identifiability. After multi-head attention sums per-head outputs through the output projection, individual contributions cannot be canonically attributed to a specific head; n(H-1)d_k degrees of freedom per layer remain ambiguous when recovering a single head from the mixed signal. The MLP cannot remedy this because it acts on the post-summation signal. A constructive partial remedy is proposed: a position-gated output projection (PG-OP) at parameter overhead below 1.6% of the standard output projection. The four collapse phenomena identified in the literature -- rank collapse in depth, in width, head-channel non-identifiability, and entropy collapse -- are unified under a symmetry-breaking framework, each corresponding to a distinct symmetry of the Transformer's forward pass.

Authors (1)

Summary

  • The paper rigorously decomposes collapse mechanisms to show that residual connections obstruct exponential rank collapse while layer normalisation remains affine-rank neutral.
  • The paper introduces head-channel non-identifiability, where summing head outputs renders individual contributions irrecoverable due to unbroken permutation-gauge symmetry.
  • The paper proposes the Position-Gated Output Projection (PG-OP) as a remedy to break head-permutation symmetry and restore distinctive head contributions.

Precise Analysis of Representational Collapse Mechanisms in Transformers

Introduction

This work provides a detailed theoretical analysis of the phenomena collectively known as representational collapse in Transformer architectures. The study critically evaluates prior findings, particularly the widely cited result by Dong et al. (2021), which identified rapid rank collapse in pure self-attention Transformers without skip connections or feed-forward networks. The author rigorously decomposes the mechanisms underlying representational collapse and introduces new concepts, most notably head-channel non-identifiability. The findings clarify the respective roles of architectural elements such as residual connections, layer normalisation, the MLP, and the attention output projection. The analysis is mathematical, making extensive use of algebraic-geometric and group-theoretic arguments to formalise both the symmetries that cause collapse phenomena and the interventions that break them.

Summary of Main Results

This paper establishes three major results:

  1. Layer Normalisation is Affine-Rank Neutral: Under mild non-degeneracy conditions, layer normalisation preserves the affine rank of the token representation matrix exactly and the matrix rank in the zero-bias case. Contrary to prior assertions that it "plays no role," the result is sharper: LN is precisely transparent with respect to representational rank.
  2. Residual Connections Obstruct Rank Collapse: For deployed architectures (e.g., BERT, GPT), residual connections generically prevent exponential rank collapse, even in the absence of the MLP. The primary function of the MLP is re-characterised: it is necessary not for anti-collapse, but for generating nonlinear feature directions outside the affine hull of input representations.
  3. Head-Channel Non-Identifiability: The author introduces and characterises the phenomenon where, after summing head outputs via the output projection, the contributions of individual heads become irrecoverable from the mixture. This form of collapse is mathematically distinct from classical rank collapse and cannot be remedied by the MLP, which operates after the mixture has been formed.

These findings have explicit architectural corollaries—chiefly, that the prevalent choice of feed-forward width (typically four times the model dimension) is not justified by rank-preserving considerations, but rather by the interplay of head specialisation, rank contraction, and the non-linear complexity of the given task.

Layer Normalisation and Rank Preservation

The paper's Rank-Neutrality Theorem establishes that under mild conditions (i.e., non-constant, non-degenerate representations and strictly positive scale parameters), layer normalisation acts as an affine-rank-preserving map. Through a stepwise algebraic analysis, the author demonstrates that mean subtraction, row-wise variance scaling, and application of learnable affine parameters introduce no decrease in affine rank. The one exception is the degenerate case where the vector of per-token standard deviations falls into the column space of the centred representation, which is a measure-zero event in practical scenarios.

The theorem clarifies ambiguity in the literature regarding LN's influence on collapse dynamics; it neither prevents nor catalyses rank collapse—it is transparent with respect to representational rank. As argued, the relevant measure is not the matrix rank but the affine rank of the row set, as mean subtraction always projects onto the orthogonal complement of the all-ones vector.
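The rank-neutrality claim can be checked numerically on a small example. The sketch below is illustrative only and is not taken from the paper: it assumes that "affine rank" means the dimension of the affine hull of the token rows (computed as the rank of the row differences), uses a synthetic low-rank input, and picks arbitrary sizes `n`, `d`, `r`. The helper names `layer_norm` and `affine_rank` are hypothetical.

```python
import numpy as np

def layer_norm(X, gamma, beta, eps=1e-5):
    """Standard per-token layer normalisation over the feature axis."""
    mu = X.mean(axis=1, keepdims=True)
    var = X.var(axis=1, keepdims=True)
    return (X - mu) / np.sqrt(var + eps) * gamma + beta

def affine_rank(X, tol=1e-8):
    """Assumed definition: dimension of the affine hull of the rows,
    i.e. the rank of the differences between each row and the first row."""
    return int(np.linalg.matrix_rank(X - X[0], tol=tol))

rng = np.random.default_rng(0)
n, d, r = 32, 16, 4                      # tokens, features, latent rank (illustrative)
X = rng.normal(size=(n, r)) @ rng.normal(size=(r, d))  # rows lie in an r-dim subspace
gamma = rng.uniform(0.5, 1.5, size=d)    # strictly positive scale parameters
beta = rng.normal(size=d)                # arbitrary bias

Y = layer_norm(X, gamma, beta)
rank_before = affine_rank(X)
rank_after = affine_rank(Y)

# Zero-bias case: the text states that the ordinary matrix rank is preserved too.
matrix_rank_zero_bias = int(np.linalg.matrix_rank(layer_norm(X, gamma, 0.0), tol=1e-8))

print(rank_before, rank_after, matrix_rank_zero_bias)
```

The input is deliberately low-rank (`r` well below `d`), so that any rank change caused by normalisation would be visible rather than masked by the ambient dimension.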

Residual Connections and the Role of the MLP

The primary obstruction to rank collapse in practical Transformers is the residual connection, as shown via the triangle (subadditivity) inequality for matrix rank and algebraic-geometric arguments concerning analytic subsets and their measure-theoretic properties. For almost every input with sub-full rank, adding an independent transformation (the residual update) increases the rank. The formal proof uses the properties of determinants and analytic maps, and the key conclusion is that the scenario leading to persistent low rank (as in Dong et al.'s result) is structurally non-generic.

The MLP, often claimed to be necessary for anti-collapse, is shown to be sufficient but not necessary in the presence of residuals. Instead, its irreplaceable role is the introduction of nonlinearity, enabling the model to reach directions of the representation space that compositions of bilinear attention and residual updates alone cannot attain, because those compositions are constrained to be locally affine. This observation is substantiated via a first-order Taylor expansion of the attention-residual map, which shows that the map is locally affine.
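The contrast between an attention-only stack with and without skip connections can be reproduced in a few lines. This is a minimal sketch under stated assumptions, not the paper's experiment: single-head attention, random Gaussian weights, no layer normalisation and no MLP, and a Frobenius-norm variant of the residual used by Dong et al. (the distance of the token matrix from a matrix with identical rows). All names and sizes are illustrative.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # subtract the row max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def relative_residual(X):
    """||X - 1 mean(X)|| / ||X||: zero exactly when all token rows coincide."""
    return np.linalg.norm(X - X.mean(axis=0, keepdims=True)) / np.linalg.norm(X)

def run_stack(X, layers, skip):
    """Apply single-head self-attention layers, optionally with a skip connection."""
    for Wq, Wk, Wv in layers:
        A = softmax((X @ Wq) @ (X @ Wk).T / np.sqrt(Wq.shape[1]))
        update = A @ X @ Wv
        X = X + update if skip else update
    return X

rng = np.random.default_rng(0)
n, d, depth = 16, 32, 12                   # tokens, width, layers (illustrative)
X0 = rng.normal(size=(n, d))
layers = [
    tuple(rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3))
    for _ in range(depth)
]

res_no_skip = relative_residual(run_stack(X0, layers, skip=False))
res_skip = relative_residual(run_stack(X0, layers, skip=True))
print(f"no skip: {res_no_skip:.2e}   with skip: {res_skip:.2e}")
```

Both runs share the same weights, so any difference in the final residual comes from the skip connection alone; no MLP is involved in either run.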

Rank Contraction from Head Specialisation

The work gives a granular analysis of how head specialisation leads not to shrunken per-head output subspaces, but to their increased alignment, driving contraction of the total rank after the multi-head summation ("rank contraction"). Strong empirical and algebraic arguments are provided showing that the dimension of individual head output subspaces is architecturally fixed (as min(rank(X), d_k)), but subspace alignment reduces their sum. The notion of directional specialisation is formalised and conjectured to induce subspace alignment, based on measurements of principal angles.

Importantly, the expansion role of the MLP is now understood as restoring directions lost due to subspace alignment, not individual subspace contraction. This reframes the purpose of setting the feed-forward width: the minimal necessary width is a function of both the rank contraction induced by alignment and the nonlinear complexity of the target function. In practical terms, it motivates re-examination of the heuristic d_ffn = 4d, particularly in regimes of minimal nonlinearity or limited head specialisation.
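The distinction between per-head dimension and the dimension of the summed subspaces can be illustrated with synthetic subspaces. The sketch below is a toy construction, not the paper's measurement: "aligned" heads are modelled by drawing every head's subspace from inside one common low-dimensional subspace (the `shared` parameter is an assumption introduced here), and principal angles are computed from the singular values of the product of orthonormal bases.

```python
import numpy as np

def orthonormal_basis(M):
    """Orthonormal basis for the column space of a full-column-rank matrix."""
    Q, _ = np.linalg.qr(M)
    return Q

def principal_angles(U, V):
    """Principal angles (radians) between the column spaces of orthonormal U and V."""
    s = np.linalg.svd(U.T @ V, compute_uv=False)
    return np.arccos(np.clip(s, -1.0, 1.0))

def summary(bases):
    per_head = [int(np.linalg.matrix_rank(B)) for B in bases]
    total = int(np.linalg.matrix_rank(np.hstack(bases)))  # dimension of the sum of subspaces
    return per_head, total

rng = np.random.default_rng(0)
d, H, d_k, shared = 64, 4, 8, 12   # illustrative sizes; shared < H * d_k forces overlap

# Unaligned: each head's output subspace is an independent random d_k-dim subspace.
unaligned = [orthonormal_basis(rng.normal(size=(d, d_k))) for _ in range(H)]

# Aligned: every head's subspace lies inside one common `shared`-dim subspace.
S = orthonormal_basis(rng.normal(size=(d, shared)))
aligned = [orthonormal_basis(S @ rng.normal(size=(shared, d_k))) for _ in range(H)]

per_head_u, total_u = summary(unaligned)
per_head_a, total_a = summary(aligned)
angles_u = principal_angles(unaligned[0], unaligned[1])
angles_a = principal_angles(aligned[0], aligned[1])
print(per_head_u, total_u)
print(per_head_a, total_a)
```

In both configurations every head keeps dimension `d_k`; only the dimension of the sum differs, and the aligned pair shows principal angles at (numerically) zero where the subspaces intersect.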

Head-Channel Non-Identifiability

A fundamentally distinct form of representational collapse is head-channel non-identifiability, rigorously defined and proven in this paper. The summation of head outputs via the output projection is shown to make the assignment of output components to their generating heads irrecoverable, even in principle, for downstream components including the MLP. The ambiguity is quantified: for BERT-scale models, n(H-1)d_k degrees of freedom per sequence per layer remain latent in the combined representation.
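To give a sense of scale, the count n(H-1)d_k can be evaluated for a concrete configuration. The values below are the standard BERT-base settings (12 heads of width 64, maximum sequence length 512); the function name `head_channel_dof` is hypothetical, and the comparison with the total per-head channel size n·H·d_k is simple bookkeeping rather than a result from the paper.

```python
def head_channel_dof(n, H, d_k):
    """Ambiguous degrees of freedom per sequence per layer: n * (H - 1) * d_k."""
    return n * (H - 1) * d_k

# BERT-base: H = 12 heads of width d_k = 64; n = 512 is its maximum sequence length.
n, H, d_k = 512, 12, 64
dof = head_channel_dof(n, H, d_k)
total_channels = n * H * d_k            # entries across all per-head outputs

print(dof)                              # 360448
print(dof / total_channels)             # fraction (H - 1) / H of the per-head entries
```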

The analysis demonstrates that this phenomenon is a consequence not of standard rank collapse but of an unbroken permutation-gauge symmetry (the invariance of the output to coordinated invertible transformations across heads and corresponding permutations). This symmetry is broken neither by the MLP nor by standard attention mechanisms.
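The invariance described above can be verified directly. The sketch below is an illustration under assumptions, not the paper's construction: it applies an invertible matrix `G` per head to the value projection, compensates with its inverse in that head's slice of the output projection, and then permutes the heads. The function `mha` and all shapes are chosen here for the example.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mha(X, Wq, Wk, Wv, Wo):
    """Multi-head attention. Wq, Wk, Wv: (H, d, d_k); Wo: (H, d_k, d)."""
    d_k = Wq.shape[-1]
    A = softmax((X @ Wq) @ np.swapaxes(X @ Wk, -1, -2) / np.sqrt(d_k))  # (H, n, n)
    heads = A @ (X @ Wv)                                                # (H, n, d_k)
    return (heads @ Wo).sum(axis=0), heads                              # (n, d), per-head channels

rng = np.random.default_rng(0)
n, d, H, d_k = 10, 32, 4, 8
X = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(scale=d ** -0.5, size=(H, d, d_k)) for _ in range(3))
Wo = rng.normal(scale=d_k ** -0.5, size=(H, d_k, d))

# Gauge transformation: a well-conditioned invertible matrix per head.
G = np.eye(d_k) + 0.1 * rng.normal(size=(H, d_k, d_k))
Wv_g = Wv @ G
Wo_g = np.linalg.inv(G) @ Wo

# Relabel the heads with a random permutation as well.
perm = rng.permutation(H)

out, heads = mha(X, Wq, Wk, Wv, Wo)
out_g, heads_g = mha(X, Wq[perm], Wk[perm], Wv_g[perm], Wo_g[perm])

print(np.allclose(out, out_g))       # summed output is unchanged
print(np.allclose(heads, heads_g))   # per-head channels are not
```

The summed output is identical while the per-head channels differ, so nothing downstream of the summation, the MLP included, can tell the two parameterisations apart.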

A constructive design remedy, Position-Gated Output Projection (PG-OP), is proposed: by allowing the output projection's weighting of each head to be dependent on token position and content, the architecture partially breaks the head-permutation symmetry, reducing the ambiguity in source attribution.
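The summary does not spell out how PG-OP is parameterised, so the following is only one hypothetical reading: a learnable scalar gate per (position, head) pair that rescales each head's output before the shared output projection. It gates on position only and ignores the content dependence mentioned above. The names `gated_output_projection` and `gate_overhead` are invented for this sketch, and the overhead figure it computes belongs to this toy parameterisation, not to the paper.

```python
import numpy as np

def gated_output_projection(heads, Wo, gate):
    """heads: (H, n, d_k); Wo: (H, d_k, d); gate: (n, H) per-position, per-head scalars.
    Hypothetical form: scale head h at position i by gate[i, h], then project and sum."""
    gated = heads * gate.T[:, :, None]
    return (gated @ Wo).sum(axis=0)          # (n, d)

def gate_overhead(n_positions, H, d_model):
    """Gate parameters relative to a dense d_model x d_model output projection."""
    return (n_positions * H) / (d_model * d_model)

rng = np.random.default_rng(0)
H, n, d_k, d = 4, 10, 8, 32
heads = rng.normal(size=(H, n, d_k))
Wo = rng.normal(size=(H, d_k, d))

# With all gates equal to one, the sketch reduces to the standard output projection.
standard = (heads @ Wo).sum(axis=0)
ungated = gated_output_projection(heads, Wo, np.ones((n, H)))

# Position-dependent gates weight the same head differently at different positions.
gate = rng.uniform(0.5, 1.5, size=(n, H))
gated = gated_output_projection(heads, Wo, gate)

overhead = gate_overhead(n_positions=512, H=12, d_model=768)   # BERT-base sizes
print(np.allclose(standard, ungated), overhead)
```

For BERT-base sizes this toy gate adds 512 × 12 parameters against 768 × 768 in the output projection, roughly 1% extra; whether that matches the paper's accounting is not something the summary allows one to confirm.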

Unifying Algebraic Framework

The author formalises a symmetry-breaking perspective, characterising representational collapse as arising from different unbroken symmetries of the Transformer forward map:

  • Rank collapse in depth: Row-averaging symmetry, broken by residual connections.
  • Rank collapse in width: Spectral concentration symmetry, controlled by weight magnitude.
  • Head-channel non-identifiability: Head permutation and gauge symmetry, addressed by interventions like PG-OP.
  • Entropy collapse: Softmax temperature symmetry, potentially broken by regularisation.

From this vantage, architectural design reduces to the explicit breaking of symmetries that are incompatible with the target task or representational goal.
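Of the four entries, entropy collapse is the easiest to visualise numerically: the entropy of a softmax row depends on the overall scale (inverse temperature) of its logits. The sketch below only illustrates that dependence on synthetic logits; how the paper formalises the associated symmetry is not reproduced here, and all names are illustrative.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def entropy(p):
    p = p[p > 0]                       # treat 0 * log 0 as 0
    return float(-(p * np.log(p)).sum())

rng = np.random.default_rng(0)
logits = rng.normal(size=64)           # one row of attention logits (synthetic)
scales = [0.0, 0.5, 1.0, 2.0, 4.0, 8.0, 16.0]
entropies = [entropy(softmax(s * logits)) for s in scales]

for s, h in zip(scales, entropies):
    print(f"scale {s:5.1f}   entropy {h:.4f}")
```

At scale zero the row is uniform and the entropy equals log(64); as the scale grows, the entropy falls monotonically towards zero, which is the regime the entropy-collapse entry refers to.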

Empirical Validation

The theoretical claims are supported by targeted experiments on BERT architectures. These confirm: (i) LN's neutrality with respect to rank (across practical numerical settings and parameter configurations); (ii) the inevitability of exponential rank collapse in the absence of residual connections, and its obstruction when they are present; (iii) the invariance of MHA under gauge transformations that embody the head-permutation symmetry; and (iv) the absence of direct correlation between scalar head specialisation and contraction of total output rank, as predicted.

Implications and Future Directions

The findings have both theoretical and practical implications:

  • Architectural choices: The justification for large feed-forward widths should be grounded in task complexity and the contraction-expansion interplay, not in naive rank restoration requirements.
  • Component interventions: Preventing representational collapse demands interventions tailored to the specific unbroken symmetry responsible; for head-channel non-identifiability, pre-summation remedies are required rather than increased MLP capacity.
  • Unified design methodology: Transformer design can adopt an algebraic approach, mapping desiderata to symmetry-breaking interventions.

Open questions articulated include: empirical quantification of principal angle alignment, direct testing of PG-OP's effect on real tasks, and the dynamics of representation rank through training.

Conclusion

This work refines and extends the foundational understanding of representational collapse in Transformers. Through careful analysis and algebraic formalism, it establishes the precise mechanisms by which various architectural elements interact with, preserve, or break critical symmetries—delineating when and how representational collapse, in its several mathematically distinct forms, is either inevitable or avoidable. The introduction and quantification of head-channel non-identifiability substantially advance the theoretical toolkit for Transformer design and interpretation. The proposed symmetry-breaking framework and the PG-OP intervention represent actionable insights with immediate relevance for the future development of efficient and expressive Transformer variants.


Reference: "Rank, Head-Channel Non-Identifiability, and Symmetry Breaking: A Precise Analysis of Representational Collapse in Transformers" (2604.23681)
