
Inlet Rank Collapse in Neural Networks

Updated 9 February 2026
  • Inlet Rank Collapse is a phenomenon where early neural network layers fail to propagate full input rank, creating a bottleneck that limits expressivity.
  • It arises in various architectures—from transformers to implicit neural representations—leading to vanishing gradients and reduced training efficiency.
  • Remedies such as batch normalization, rank-expanding initialization, and λ-skip connections help mitigate collapse and improve model performance.

Inlet Rank Collapse refers to the degeneracy of representations or gradients at early or initial layers in deep neural architectures, especially those processing low-dimensional inputs within high-dimensional hidden spaces. This effect manifests as a rank deficiency at the model “inlet,” creating a structural bottleneck that limits expressivity, impedes optimization, or induces vanishing gradients. The phenomenon is central in contemporary understanding of transformers, self-attention networks, implicit neural representations, deep linear and nonlinear networks, graph neural networks, and sequence models—with both theoretical and practical ramifications for training speed, expressivity, and design of architectural remedies.

1. Formal Definitions and General Phenomenology

Inlet Rank Collapse denotes the failure of early network representations, often at the very first hidden layer, to propagate sufficient rank from the input manifold to the latent space. For a model with input $X \in \mathbb{R}^{n \times p}$, inlet embedding $\phi: \mathbb{R}^p \to \mathbb{R}^D$, and first-layer output $Z_0 = \phi(X)$, inlet rank collapse means

$$\mathrm{rank}(Z_0) = R_0 \ll D.$$

The strict form, $R_0 = p \ll D$, implies that the high-dimensional hidden space is populated only by a low-dimensional submanifold, fundamentally limiting the degrees of freedom available for subsequent non-linear representations (Zheng et al., 2 Feb 2026).
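The bottleneck is easy to observe numerically. The sketch below (an illustration, not code from the cited paper) builds a random linear inlet embedding for $p = 2$ input coordinates and checks that the first-layer pre-activations span at most a $p$-dimensional subspace of the $D$-dimensional hidden space:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, D = 1000, 2, 256                 # samples, input dim, hidden width
X = rng.normal(size=(n, p))            # low-dimensional input coordinates
W = rng.normal(size=(p, D)) / np.sqrt(p)
Z0 = X @ W                             # first-layer pre-activations

# rank(Z0) <= p, no matter how large the hidden width D is
print(np.linalg.matrix_rank(Z0))       # 2
```

Because $Z_0 = XW$ is a product with a rank-$p$ factor, no choice of $W$ can raise its rank above $p$; only a nonlinear or structural intervention at the inlet can.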

Rank Collapse (transformers and general deep nets) occurs when the matrix of token or embedding representations, $X^{(l)}$ or $Y^{(k)}$, collapses to (approximately) rank one. Formally, for self-attention networks, Dong, Cordonnier & Loukas define an "$\epsilon$-rank collapse" if

$$\| X^{(l)} - 1_n y^* \|_\infty \leq \epsilon \| X^{(l)} \|_\infty,$$

i.e., the output is within $\epsilon$ (entrywise max norm) of a rank-1 matrix for all inputs (2505.16284).

Layer Collapse is a stronger property: an $L$-layer model is said to have $\delta$-layer collapse if it can be uniformly approximated (entrywise infinity norm) by some architecturally shallower (e.g., single-layer) network, up to scale $\delta$, for all possible inputs. In this regime, the depth of the model offers no additional expressive power (2505.16284).

The Token-Spread Metric $\mu(Y) = \|Y - \mathbf{1} \gamma_Y\|_F$ measures deviation from the rank-1 equilibrium (where all token embeddings are identical) and serves as a key diagnostic for collapse in sequence models and transformers (Joseph et al., 2024).

2. Mechanistic Origins Across Architectures

Transformers and Self-Attention

In transformer networks, stacking many self-attention layers (with or without skip connections) leads to rapid rank collapse with depth. The phenomenon arises because each self-attention operation with small weights or near-uniform attention forces the representation matrix towards its dominant singular vector, progressively shrinking all "directions" except one (Noci et al., 2022, Saada et al., 2024). When the model is initialized with small weights ($\|Q\|_{op}, \|K\|_{op}, \|V\|_{op} \ll 1$), the contraction towards rank-1 compounds across layers; even skip connections only slow, rather than avert, the collapse (2505.16284).
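A minimal simulation makes the small-weight mechanism concrete. This sketch (illustrative; single-head attention without skip connections, parameters not from any cited paper) stacks a few layers with operator norms $\ll 1$ and tracks the stable rank of the token matrix:

```python
import numpy as np

def stable_rank(M):
    s = np.linalg.svd(M, compute_uv=False)
    return (s ** 2).sum() / s[0] ** 2

def self_attention(X, Q, K, V):
    logits = (X @ Q) @ (X @ K).T / np.sqrt(X.shape[1])
    A = np.exp(logits - logits.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)      # row-wise softmax
    return A @ (X @ V)

rng = np.random.default_rng(0)
n, d, eps = 32, 16, 0.05                   # small-weight regime
X = rng.normal(size=(n, d))
r0 = stable_rank(X)
for _ in range(5):
    Q, K, V = (eps * rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
    X = self_attention(X, Q, K, V)         # no skip connection
rL = stable_rank(X)
print(r0, rL)                              # high at the input, near 1 after depth
```

With tiny $Q, K$ the softmax is nearly uniform, so every layer averages the token rows towards their mean, which is exactly the rank-1 contraction described above.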

Implicit Neural Representations (INRs)

INRs, which use MLPs to model high-dimensional signals from low-dimensional coordinates, are especially susceptible to inlet rank collapse since the first hidden layer receives only $p$-dimensional inputs and thus cannot populate the full hidden width $D$ unless structural remedies are used (Zheng et al., 2 Feb 2026). The bottleneck implies that the Neural Tangent Kernel (NTK) for the full model is at most rank $p$, irrespective of width or depth, strictly limiting expressivity and trainability.

Deep Linear and Nonlinear Networks

In deep linear or ReLU networks without appropriate normalization or architectural regularization (e.g., BatchNorm, skip connections), products of randomly initialized weight matrices quickly suppress all but the top singular mode, as predicted by random matrix theory. This leads to an exponential decay of the "stable rank" with depth, culminating in almost one-dimensional representations. BatchNorm provably enforces an $\Omega(\sqrt{d})$ lower bound on the layerwise stable rank, blocking collapse (Daneshmand et al., 2020).
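The decay mechanism can be sketched in a few lines (illustrative only; depth, width, and initialization scale are arbitrary choices, and BatchNorm's remedy is not implemented here):

```python
import numpy as np

def stable_rank(M):
    s = np.linalg.svd(M, compute_uv=False)
    return (s ** 2).sum() / s[0] ** 2

rng = np.random.default_rng(0)
d, n_samples, L = 8, 100, 32
H = rng.normal(size=(d, n_samples))        # activations: d features x samples
r0 = stable_rank(H)
for _ in range(L):
    W = rng.normal(size=(d, d)) / np.sqrt(d)
    H = W @ H                              # unnormalized linear layer
rL = stable_rank(H)
print(r0, rL)                              # stable rank decays toward 1 with depth
```

The product of random matrices has well-separated Lyapunov exponents, so the top singular direction dominates exponentially fast in depth.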

Graph Neural Networks and Aggregator Spectra

For deep GNNs, iterative application of aggregators $A$ and feature transforms $W^{(l)}$ leads, by spectral dominance, to the representations aligning with the principal eigenspace of the aggregation matrix. Unless the spectrum of $A$ possesses sufficient multiplicity, the representations almost surely converge to the span of a single eigenvector (classical over-smoothing), yielding rank collapse even as feature transforms vary (Roth et al., 2023).
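The spectral-dominance argument can be checked on a toy random graph. The sketch below (illustrative; graph density, depth, and the row-normalized aggregator are arbitrary choices, not from the cited paper) shows that node features collapse to roughly rank 1 even though a fresh random transform is applied at every layer:

```python
import numpy as np

def stable_rank(M):
    s = np.linalg.svd(M, compute_uv=False)
    return (s ** 2).sum() / s[0] ** 2

rng = np.random.default_rng(0)
n, d, depth = 20, 8, 50
adj = (rng.uniform(size=(n, n)) < 0.3).astype(float)
adj = np.maximum(adj, adj.T)               # make the graph undirected
np.fill_diagonal(adj, 1.0)                 # add self-loops
A = adj / adj.sum(axis=1, keepdims=True)   # row-normalized aggregator
H = rng.normal(size=(n, d))                # initial node features
r0 = stable_rank(H)
for _ in range(depth):
    W = rng.normal(size=(d, d)) / np.sqrt(d)
    H = A @ H @ W                          # aggregate, then transform
rL = stable_rank(H)
print(r0, rL)                              # collapses to ~1 despite varying W
```

Since $A$ is row-stochastic with a simple dominant eigenvalue, repeated aggregation drives every feature column into the span of its principal eigenvector; the right-multiplication by $W$ cannot undo this.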

3. Mathematical Characterization and Theoretical Guarantees

Precise Collapse Criteria

  • Inlet Rank in INRs (Zheng et al., 2 Feb 2026): For a coordinate-based MLP with input dimension $p$ and hidden width $D$, $\mathrm{rank}(Z_0) \leq p$ at the first hidden layer; subsequent layers cannot increase this rank. Thus, any subspace orthogonal to the input span (of dimension $D - p$) is untrainable and unresponsive, confining NTK rank and network capacity.
  • Layer-wise Rank Decay in Deep Nets (Daneshmand et al., 2020): The stable rank of activations, $r(X^{(\ell)})$, decays exponentially with depth in unnormalized networks, but with BatchNorm:

$$\lim_{L \to \infty} \frac{1}{L} \sum_{\ell=1}^L r(H_\ell) = \Omega(\sqrt{d}).$$

  • Collapse Under Attention with Small Weights (2505.16284): For all per-head matrices with operator norm $\leq \varepsilon$,

$$\| \mathrm{Res}(\mathrm{SAtt}(X)) \|_\infty \leq (e^\varepsilon - 1) \| \mathrm{Res}(X) \|_\infty.$$

Therefore, per-layer residuals contract towards rank-1; across $L$ layers, the model can be uniformly approximated by a single-layer attention network with approximation error $O(\varepsilon)$.

Two-Stage Training Dynamics in Transformers

A recent analysis via linearized gradient flow identifies two phases (Chen et al., 8 Oct 2025):

  1. Condensation (Stage I): Outer parameters (value/FFN weights) rapidly align along single data-driven directions, while the attention parameters remain inert.
  2. Activation/collapse (Stage II): The key-query matrices $Q, K$ engage, but gradient flow drives them to align with the leading singular vectors of a fixed matrix $A$. The normalized $Q, K$ converge to rank-1, so attention becomes degenerate, restricting the model's adaptive capacity.

4. Practical Implications: Expressivity, Trainability, and Remedies

Inlet and rank collapse fundamentally limit the ability of deep architectures to learn complex, high-rank functions:

  • Expressivity loss: The effective model capacity is bottlenecked by inlet rank. E.g., an INR with input coordinates in $\mathbb{R}^2$ and first-layer hidden width $D \gg 2$ is, without rank-restoring measures, fundamentally no more expressive than a two-neuron network (Zheng et al., 2 Feb 2026).
  • Gradient pathologies: In collapsed regimes, gradients with respect to some parameter groups vanish. In transformers, rank-one representations of tokens nullify the gradients for queries and keys (as the correlation matrix $H^{(\ell)\,T} H^{(\ell)} - n \bar{x} \bar{x}^T = 0$), precluding effective learning of attention (Noci et al., 2022).
  • Runtime/Expressivity tradeoff: Large weights preserve high-rank, multi-layer expressivity at the expense of computational cost (quadratic-time attention). Small weights enable fast, low-rank approximations but induce representational collapse (2505.16284).
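The gradient pathology for queries and keys can be verified directly: when all token rows are identical (rank-1 representations), the attention logits are constant, the softmax is uniform, and the layer output no longer depends on $Q$ or $K$ at all, so their gradients are exactly zero. A minimal check (illustrative single-head attention, arbitrary dimensions):

```python
import numpy as np

def attn(X, Q, K, V):
    logits = (X @ Q) @ (X @ K).T / np.sqrt(X.shape[1])
    A = np.exp(logits - logits.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)          # row-wise softmax
    return A @ (X @ V)

rng = np.random.default_rng(0)
n, d = 8, 4
X = np.tile(rng.normal(size=(1, d)), (n, 1))   # rank-1: all tokens identical
Q, K, V = (rng.normal(size=(d, d)) for _ in range(3))
base = attn(X, Q, K, V)
pert = attn(X, Q + 0.5 * rng.normal(size=(d, d)), K, V)
print(np.allclose(base, pert))                 # output is independent of Q
```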

Proven Remedies and Their Theoretical Basis:

  • Positional Encoding, SIREN, and BatchNorm: All act by inflating inlet rank so that the functional basis after the first layer is rich, enabling the network to leverage its full hidden capacity (Zheng et al., 2 Feb 2026).
  • Rank-Expanding Initialization: By carefully constructing weights and biases in the first layer, full rank in Z0Z_0 can be achieved, removing the bottleneck (Zheng et al., 2 Feb 2026).
  • λ-Skip Connections: Parameterizing the strength of residual pathways allows tuning of the “token-spread metric” to prevent collapse in various sequence model architectures (transformers, SSMs), with a proven spectral norm lower bound required for prevention (Joseph et al., 2024).
  • Sum-of-Kronecker-Products in GNNs: Using multiple parallel aggregators per layer, each contributing an independent dominant direction, entirely blocks rank collapse in node representations (Roth et al., 2023).
  • Spectral Outlier Removal in Attention: Subtracting the rank-1 uniform component from the Markov (softmax) attention matrix eliminates the macroscopic spectral gap, dramatically increasing effective width rank and stabilizing gradients (Saada et al., 2024).
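To see how an inlet remedy works mechanically, the sketch below uses classic Fourier positional features (one common PE recipe; the specific frequency schedule is an assumption, not taken from the cited papers) and compares the matrix rank of raw $\mathbb{R}^2$ coordinates against their encoded version:

```python
import numpy as np

def fourier_features(X, num_freqs=8):
    # gamma(x) = [sin(2^k pi x), cos(2^k pi x)] applied per coordinate
    feats = []
    for k in range(num_freqs):
        feats.append(np.sin(2.0 ** k * np.pi * X))
        feats.append(np.cos(2.0 ** k * np.pi * X))
    return np.concatenate(feats, axis=1)

rng = np.random.default_rng(0)
n, p = 1000, 2
X = rng.uniform(-1.0, 1.0, size=(n, p))        # raw 2-D coordinates
r_raw = np.linalg.matrix_rank(X)
r_pe = np.linalg.matrix_rank(fourier_features(X))
print(r_raw, r_pe)                             # 2 versus 32: inlet rank inflated
```

The nonlinear encoding maps the $p$-dimensional input onto $2 \cdot 8 \cdot p = 32$ linearly independent feature directions, so the first hidden layer can now populate a much larger subspace of the hidden width.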

5. Empirical Evidence and Experimental Validation

Evidence for inlet rank collapse and its mitigation arises from multiple domains:

  • INRs: ReLU MLPs with default initialization achieve substantially lower PSNR and IoU on signal reconstruction tasks than the same architectures with PE, SIREN, BN, or rank-expanding initialization. These metrics correlate strongly with measured layerwise singular value spectra and NTK rank (Zheng et al., 2 Feb 2026).
  • Deep Linear/ReLU Nets: BN universally rescues rank: the stable rank remains high even at depth 32 in wide MLPs, in contrast with the collapse to rank 1 observed for unnormalized nets (Daneshmand et al., 2020).
  • Transformers: Controlling residual strength, using LayerNorm, and architectural modifications such as learnable residuals (λ-skip) can maintain high token diversity and avert exponential decay of the token-spread metric, as demonstrated in Mamba-2 and other large models (Joseph et al., 2024).
  • GNNs and SKP: Synthetic and real-data experiments show that only SKP-type architectures can propagate more than one feature direction beyond moderate depth ($l = 8$), aligning theoretical predictions with empirical loss curves (Roth et al., 2023).
  • Spectral Outlier Removal: Empirically, removing the uniform eigenvector in attention matrices restores the stable rank of layerwise token covariances to $O(T)$ (where $T$ is the sequence length), matching the theoretical analysis, and avoids gradient blow-up with width (Saada et al., 2024).
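The spectral-outlier effect is easy to reproduce on a synthetic softmax matrix. In this sketch (illustrative only; the logit scale and size are arbitrary choices), the row-stochastic attention matrix is dominated by its rank-1 uniform component, and subtracting that component reveals a much higher-rank residual spectrum:

```python
import numpy as np

def stable_rank(M):
    s = np.linalg.svd(M, compute_uv=False)
    return (s ** 2).sum() / s[0] ** 2

rng = np.random.default_rng(0)
n = 64
logits = 0.1 * rng.normal(size=(n, n))            # weak attention logits
A = np.exp(logits)
A /= A.sum(axis=1, keepdims=True)                 # row-stochastic softmax matrix
A_centered = A - np.ones((n, n)) / n              # subtract rank-1 uniform part
r_A = stable_rank(A)
r_C = stable_rank(A_centered)
print(r_A, r_C)                                   # ~1 before, much larger after
```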

6. Design Principles and Recommendations

  • Monitor effective rank: Tracking metrics such as stable rank, soft rank, or token-spread during training provides an early warning of collapse.
  • Use rank-restoring components at the inlet: PE, SIREN at only the first hidden layer, or tailored initializations suffice to populate the latent space in coordinate-based models (Zheng et al., 2 Feb 2026).
  • Integrate λ-skip connections: Parameterize skip strengths and treat them as learnable; combining with LayerNorm and gating is recommended (Joseph et al., 2024).
  • Avoid architectures with single mode amplification: GNNs using single Kronecker product updates are highly susceptible and should instead employ multiple parallel aggregators (Roth et al., 2023).
  • Utilize BatchNorm or pseudo-whitening: Not only does BN stabilize training by controlling activation covariances, but it provably blocks rank collapse at scale (Daneshmand et al., 2020).
  • Spectral modifications in attention: Removing or damping the uniform eigenmode in softmax attention layers resolves width collapse and improves gradient conditioning in transformers (Saada et al., 2024).
  • Balance expressivity and efficiency: Recognize the computational trade-off: only models with sufficiently large weights (avoiding the low-rank regime) achieve maximal expressivity, at the cost of increased runtime complexity (2505.16284).
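The monitoring recommendation above is cheap to implement. A minimal helper (a sketch; the entropy-based definition of "soft rank" is one common choice among several) computes both diagnostics from the singular spectrum of an activation matrix:

```python
import numpy as np

def effective_ranks(X):
    # stable rank and an entropy-based "soft" rank of the singular spectrum
    s = np.linalg.svd(X, compute_uv=False)
    stable = (s ** 2).sum() / s[0] ** 2
    q = s ** 2 / (s ** 2).sum()
    soft = np.exp(-(q * np.log(q + 1e-12)).sum())  # exp of spectral entropy
    return stable, soft

X = np.random.default_rng(0).normal(size=(128, 64))  # e.g. a batch of activations
stable, soft = effective_ranks(X)
print(stable, soft)
```

Logging these two numbers per layer during training gives the early-warning signal: a sustained drift towards 1 flags incipient collapse before loss curves degrade.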

7. Interconnections, Open Problems, and Outlook

Inlet rank collapse arises from deep phenomena in linear algebra, random matrix theory, and the geometry of neural signal propagation. While specific technical remedies have been identified and justified across architectural domains, open questions remain regarding the optimality of scaling rules, universality classes of collapse for stochastic optimizers, and application-specific requirements for avoiding representational degeneracy. Furthermore, as transformer-based and INR models scale to ever-larger depths and widths, practical constraints on initialization, normalization, and kernel design will continue to challenge both theorists and practitioners to reconcile trade-offs between efficiency, stability, and expressivity.


Key References:

  • "Only Large Weights (And Not Skip Connections) Can Prevent the Perils of Rank Collapse" (2505.16284)
  • "The Inlet Rank Collapse in Implicit Neural Representations: Diagnosis and Unified Remedy" (Zheng et al., 2 Feb 2026)
  • "Lambda-Skip Connections: the architectural component that prevents Rank Collapse" (Joseph et al., 2024)
  • "Mind the Gap: a Spectral Analysis of Rank Collapse and Signal Propagation in Attention Layers" (Saada et al., 2024)
  • "Batch Normalization Provably Avoids Rank Collapse for Randomly Initialised Deep Networks" (Daneshmand et al., 2020)
  • "Rank Collapse Causes Over-Smoothing and Over-Correlation in Graph Neural Networks" (Roth et al., 2023)
  • "Signal Propagation in Transformers: Theoretical Perspectives and the Role of Rank Collapse" (Noci et al., 2022)
  • "From Condensation to Rank Collapse: A Two-Stage Analysis of Transformer Training Dynamics" (Chen et al., 8 Oct 2025)
  • "Neural Rank Collapse: Weight Decay and Small Within-Class Variability Yield Low-Rank Bias" (Zangrando et al., 2024)
