Emergent Low-Rank Training Dynamics in MLPs with Smooth Activations

Published 5 Feb 2026 in cs.LG and cs.AI | (2602.06208v1)

Abstract: Recent empirical evidence has demonstrated that the training dynamics of large-scale deep neural networks occur within low-dimensional subspaces. While this has inspired new research into low-rank training, compression, and adaptation, theoretical justification for these dynamics in nonlinear networks remains limited. To address this gap, this paper analyzes the learning dynamics of multi-layer perceptrons (MLPs) under gradient descent (GD). We demonstrate that the weight dynamics concentrate within invariant low-dimensional subspaces throughout training. Theoretically, we precisely characterize these invariant subspaces for two-layer networks with smooth nonlinear activations, providing insight into their emergence. Experimentally, we validate that this phenomenon extends beyond our theoretical assumptions. Leveraging these insights, we empirically show there exists a low-rank MLP parameterization that, when initialized within the appropriate subspaces, matches the classification performance of fully-parameterized counterparts on a variety of classification tasks.


Summary

The paper demonstrates that MLP training with smooth activations confines weight updates to a fixed low-dimensional subspace, with an effective rank bounded by 2K.
The study employs perturbation theory to show that most parameter update energy (>95%) remains aligned with the initial gradient's principal subspace, ensuring minimal drift.
Practical experiments confirm that using initialization-derived low-rank projections achieves comparable performance to fully-parameterized models on benchmarks like Fashion MNIST and CIFAR-10.


Overview and Motivation

The study systematically analyzes the emergence of low-rank structure in the weight dynamics of multi-layer perceptrons (MLPs) with smooth (i.e., twice differentiable) activation functions during gradient-based training. Motivated by empirical reports of parameter updates and optimization trajectories being highly concentrated in low-dimensional subspaces, the authors provide the first rigorous characterization of this phenomenon in nonlinear feedforward networks. The primary focus is on settings where the output dimensionality $K$ is significantly smaller than the input or hidden dimensions (i.e., $K \ll d, m$), a regime of practical importance for multiclass classification tasks.

Initial investigations reveal marked qualitative differences between networks with smooth activations (e.g., ELU, GELU, SiLU) versus nonsmooth variants (e.g., ReLU, Leaky-ReLU). For smooth activations, the evolution of most singular vectors of the weight matrices is slow and largely confined to a fixed, small-dimensional subspace over the course of training; in contrast, for nonsmooth activations, the parameter dynamics are spread over a larger dimensional space.
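One concrete way to see the smooth versus nonsmooth distinction is to compare the one-sided first derivatives of each activation at zero. The sketch below is illustrative only and is not taken from the paper; the helper name `derivative_jump_at_zero`, the step size `h`, and the choice of ELU with `alpha=1` are assumptions made for this example.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def leaky_relu(z, slope=0.01):
    return np.where(z > 0, z, slope * z)

def elu(z, alpha=1.0):
    # alpha = 1 is assumed here; it makes the first derivative continuous at 0
    return np.where(z > 0, z, alpha * np.expm1(np.minimum(z, 0.0)))

def silu(z):
    return z / (1.0 + np.exp(-z))

def derivative_jump_at_zero(phi, h=1e-6):
    """|right derivative - left derivative| at z = 0, via one-sided finite differences."""
    zero = np.array(0.0)
    right = (phi(np.array(h)) - phi(zero)) / h
    left = (phi(zero) - phi(np.array(-h))) / h
    return float(abs(right - left))

activations = {"ReLU": relu, "Leaky-ReLU": leaky_relu, "ELU": elu, "SiLU": silu}
jumps = {name: derivative_jump_at_zero(phi) for name, phi in activations.items()}
for name, jump in jumps.items():
    print(f"{name:10s} derivative jump at 0: {jump:.2e}")
```

ReLU and Leaky-ReLU show a jump close to 1 in the first derivative, while ELU and SiLU show essentially none.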

Figure 1: The middle singular subspace of the first-layer weight matrix in the ELU network evolves noticeably slower than that in the ReLU network, and the corresponding singular values remain closer to their initialization.

Theoretical Analysis: Two-Layer Networks with Smooth Activations

A central theoretical result comprehensively characterizes the low-rank nature of training-induced weight dynamics in the first-layer weights of two-layer MLPs with smooth activations and small output dimension. The authors prove that, under the assumptions of whitened input data, fixed output layer, and gradient descent (GD) training on squared-error loss, each GD update in the weight space is effectively contained within a fixed, low-dimensional subspace. This subspace, which is identified by the principal singular vectors of the gradient at initialization, is entirely determined by the data, initialization, activation smoothness, and the fixed output weights.

More formally:

The effective rank of the parameter change over training is at most $2K$, with all significant updates occurring in a matrix subblock spanned by vectors derived from the initialization gradients and weights.
The remainder of the parameter space exhibits only $\mathcal{O}(\epsilon)$-bounded perturbations for initialization scale $\epsilon$, and subsequent GD updates produce only exponentially decaying changes within those perturbation directions as a function of training time.

Figure 2: Under the exact theoretical setting, $\widetilde{\bm W}_{1,1}(t)$ accounts for almost all of the change in $\widetilde{\bm W}_1(t)$, confirming that updates predominantly occur in the predicted low-dimensional subspace.
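The theoretical setting above (whitened inputs, fixed output layer, squared-error loss, small initialization) can be sketched numerically to see why the initial gradient has roughly $K$ dominant singular directions. This is a minimal illustration under stated assumptions, not the authors' code: the $\tfrac{1}{2}$ factor in the loss, the ELU activation, the random data, and all names and dimensions (`first_layer_gradient`, `d`, `m`, `N`, `eps`) are choices made for this example.

```python
import numpy as np

def elu(z):
    return np.where(z > 0, z, np.expm1(np.minimum(z, 0.0)))

def elu_prime(z):
    return np.where(z > 0, 1.0, np.exp(np.minimum(z, 0.0)))

def first_layer_gradient(W1, W2, X, Y):
    """Gradient of 0.5 * ||W2 @ elu(W1 @ X) - Y||_F^2 with respect to W1 (W2 held fixed)."""
    Z = W1 @ X                      # pre-activations, shape (m, N)
    R = W2 @ elu(Z) - Y             # residual, shape (K, N)
    return ((W2.T @ R) * elu_prime(Z)) @ X.T

rng = np.random.default_rng(0)
d, m, K, N, eps = 40, 60, 3, 200, 1e-3   # hypothetical sizes with K << d, m

# Whitened inputs (assumed convention): rows of X are orthonormal, so X @ X.T = I_d
X = np.linalg.qr(rng.standard_normal((N, d)))[0].T
Y = rng.standard_normal((K, N))
W1 = eps * rng.standard_normal((m, d))        # small-scale initialization
W2 = rng.standard_normal((K, m)) / np.sqrt(m) # fixed output layer

G0 = first_layer_gradient(W1, W2, X, Y)
U, s, Vt = np.linalg.svd(G0, full_matrices=False)
U_K, V_K = U[:, :K], Vt[:K].T                 # top-K left/right singular subspaces
top_k_energy = float(np.sum(s[:K] ** 2) / np.sum(s ** 2))
print(f"fraction of gradient energy in top-{K} singular directions: {top_k_energy:.6f}")
```

With a small `eps`, the activation derivative is close to 1 everywhere, so the gradient is close to `W2.T @ R @ X.T`, which has rank at most $K$ because `W2` has only $K$ rows.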

The analysis leverages perturbation theory (Wedin’s sin-theta theorem) to rigorously bound the drift of these crucial subspaces and establishes that low-rank weight evolution persists so long as the initialization scale and learning rate are not overly large.
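Sin-theta bounds of this kind control the principal angles between two subspaces. As a rough illustration of the quantity being bounded, the sketch below computes the largest sine of the principal angles between two subspaces given orthonormal bases; the function name `sin_theta_distance` and the test matrices are assumptions for this example, not notation from the paper.

```python
import numpy as np

def sin_theta_distance(U_a, U_b):
    """Largest sine of the principal angles between span(U_a) and span(U_b).

    Both inputs must have orthonormal columns and the same shape.
    """
    # Singular values of U_a.T @ U_b are the cosines of the principal angles
    cosines = np.linalg.svd(U_a.T @ U_b, compute_uv=False)
    cosines = np.clip(cosines, 0.0, 1.0)
    return float(np.sqrt(1.0 - cosines.min() ** 2))

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((50, 6)))
U_a, U_b = Q[:, :3], Q[:, 3:]          # two mutually orthogonal 3-dimensional subspaces

same = sin_theta_distance(U_a, U_a)    # identical subspaces
orthogonal = sin_theta_distance(U_a, U_b)
print(same, orthogonal)
```

A value near 0 means the subspace has barely drifted; a value near 1 means at least one direction has rotated out of the original subspace entirely.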

Extension Beyond Theory: Deep Networks and Practical Optimizers

The study empirically verifies that the theoretical phenomenon is robust well beyond the narrow two-layer, squared-error setting:

For deep MLPs trained with SGD (with momentum), Adam, and on cross-entropy loss (with or without input whitening), pronounced low-dimensional concentration of weight evolution is consistently observed for all layers.
Quantitative measurement of subspace drift shows minimal deviation from the initialization-aligned low-rank structure, confirming the phenomenon’s generic character under both idealized and practical training regimes.

Figure 3: Training deep MLPs with SGD or Adam on unwhitened inputs with cross-entropy loss maintains the previously observed low-rank dynamics, especially for smooth activations.

Empirically, the key role of smoothness is further validated: when nonsmooth activations are used, a much larger fraction of the weight parameter space is involved in the learning dynamics, and the invariant subspaces are no longer present.

Low-Rank Parameterization: From Observations to Efficient Architectures

Building on these findings, the authors propose a constructive procedure for synthesizing low-rank MLP parameterizations that match the accuracy of their fully-parameterized counterparts on real-world tasks. The method involves:

Projecting both the input and top-layer features into the aligned low-rank subspaces given by the singular vectors of the initialization gradient.
Training only small internal weight matrices; the alignments are fixed at initialization.

Critically, the success of this approach depends sharply on using the correct, initialization-derived subspaces; random orthogonal choices for these projections consistently lead to poor convergence and local minima, affirming that the emergent low-rank structure is highly instance-specific and not random.
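The general shape of such a parameterization can be sketched as a layer whose weight is factored as $U S V^\top$, with $U$ and $V$ fixed and only the small $r \times r$ matrix $S$ trainable. This is a hedged illustration, not the paper's exact construction: the function `low_rank_forward`, the SiLU activation, the dimensions, and the random placeholder bases are all assumptions. In the method described above, `U` and `V` would instead come from the singular vectors of the initialization gradient.

```python
import numpy as np

def silu(z):
    return z / (1.0 + np.exp(-z))

def low_rank_forward(x, U, S, V, phi=silu):
    """Compute phi(U @ S @ V.T @ x) without forming the full m x d weight matrix."""
    return phi(U @ (S @ (V.T @ x)))

rng = np.random.default_rng(0)
d, m, K = 784, 512, 10      # hypothetical input width, hidden width, number of classes
r = 2 * K                   # internal width

# Placeholder orthonormal bases, used here only to show shapes and parameter counts
U = np.linalg.qr(rng.standard_normal((m, r)))[0]   # fixed, shape (m, r)
V = np.linalg.qr(rng.standard_normal((d, r)))[0]   # fixed, shape (d, r)
S = 0.01 * rng.standard_normal((r, r))             # the only trainable block

x = rng.standard_normal((d, 5))                    # batch of 5 inputs as columns
out = low_rank_forward(x, U, S, V)

full_params, low_rank_params = m * d, r * r
print(out.shape, full_params, low_rank_params)
```

In this toy configuration the trainable block has 400 entries, compared with 401,408 for a dense layer of the same input and output widths.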

Experiments on Fashion MNIST and CIFAR-10 (using VGG-16 with a low-rank classifier MLP head) demonstrate that, when the construction is followed, test loss and accuracy learning curves for the low-rank MLP are virtually indistinguishable from those of the full model—not only under full fine-tuning but also under classifier-only settings.

Figure 4: The properly-initialized low-rank MLP achieves nearly-identical test loss and accuracy as the fully-parameterized MLP; with random subspace initialization, it is trapped in a suboptimal local minimum.

Figure 5: On CIFAR-10 with VGG-16, the classification performance of the low-rank MLP head with proper initialization closely matches that of the fully-parameterized counterpart. Random subspace initialization yields no training progress.

Technical Implications and Insights

Strong numerical outcomes:

Across training runs in multiple settings, at least 95%, and in some cases more than 99%, of parameter update energy is confined to the identified low-rank subspaces.
Low-rank MLPs constructed as described need only $r = 2K$ internal width to replicate full-model performance, with diminishing returns for $r > 4K$.
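One plausible way to compute an "update energy" fraction of this kind is to project the weight change onto a pair of left and right subspaces and compare squared Frobenius norms. The sketch below assumes that definition; the paper's exact metric may differ, and the function name `subspace_energy_fraction` and the synthetic update are hypothetical.

```python
import numpy as np

def subspace_energy_fraction(delta_W, U, V):
    """Fraction of ||delta_W||_F^2 captured by the projection U @ U.T @ delta_W @ V @ V.T.

    U (m x r) and V (d x r) must have orthonormal columns.
    """
    core = U.T @ delta_W @ V
    return float(np.sum(core ** 2) / np.sum(delta_W ** 2))

rng = np.random.default_rng(0)
m, d, r = 30, 20, 4
U = np.linalg.qr(rng.standard_normal((m, r)))[0]
V = np.linalg.qr(rng.standard_normal((d, r)))[0]

# Synthetic weight change: mostly inside the subspaces, plus a small full-rank perturbation
delta_W = U @ rng.standard_normal((r, r)) @ V.T + 1e-3 * rng.standard_normal((m, d))
fraction = subspace_energy_fraction(delta_W, U, V)
print(f"energy inside subspace: {fraction:.6f}")
```

Applied to `W(t) - W(0)` at several checkpoints, a metric like this gives a simple curve for how concentrated the updates remain over training.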

Sharp, nontrivial claim:

The critical subspace is not data-universal nor random but is both data- and initialization-specific, governed by the top $K$ singular vectors of the initial gradient.

Counterintuitive result: Despite the non-convexity and compositional nature of deep nonlinear architectures, most updates are not distributed throughout parameter space but remain “trapped” in a small affine subspace determined at initialization.

Implications and Future Directions

This work both formalizes and validates the conjecture that overparameterized neural network training proceeds in (and can be restricted to) very low-dimensional subspaces, so long as smooth activation dynamics dominate—an observation with wide ramifications:

Optimization and generalization: Explicit low-rank parameterization can yield massive reductions in training and storage costs, without impacting accuracy—especially in problems with small output dimension.
Network compression and adaptation: Insights support and explain the empirical success of LoRA-type approaches, and suggest direct, theoretically-justified design of compressed architectures for warm-started training.
Initialization protocols: The findings reinforce the criticality of initialization, as alignment with the emergent subspaces is necessary for optimal learning trajectories.
Limits for nonsmooth activations: The lack of analogous low-rank dynamics for ReLU/Leaky-ReLU suggests differing implicit regularization and representation capacity properties with practical implications for architecture selection.

Future prospects include extending these findings to attention-based architectures (Transformers), structured data scenarios, or task transfer settings, where the low-rank phenomenon may underpin scalable adaptation to new problems.

Conclusion

The results show that, in the two-layer setting analyzed theoretically and in the broader settings tested empirically, the evolution of weights in MLPs with smooth activations is largely confined to invariant, low-dimensional, initialization-dependent subspaces. This provides both a theoretical foundation and a practical prescription for directly exploiting low-rank structure in neural network training, with implications for model efficiency, transfer learning, and network design.

Reference:

"Emergent Low-Rank Training Dynamics in MLPs with Smooth Activations" (2602.06208)


Explain it Like I'm 14

Overview of the paper

This paper looks at how certain neural networks learn. The authors find that, during training, most of the changes in the network’s weights happen in just a few directions, not across all possible directions. In simple terms: even though a network can have a huge number of settings to change, it mostly learns within a small, low-dimensional “space.” They explain why this happens for specific kinds of networks and show evidence that it happens more broadly. They also use this idea to build smaller, more efficient networks that can perform as well as bigger ones.

Key questions the paper asks

The paper focuses on three easy-to-understand questions:

Do neural networks really learn mostly in a few directions (a low-dimensional space) while training?
Why does this happen, especially when the network uses smooth activation functions (like ELU or GELU)?
Can we use this to design smaller “low-rank” networks that work just as well as full-sized ones?

How the researchers studied the problem (methods)

Think of training a neural network like trying to reach a target by taking small steps downhill (this is “gradient descent”). Each step changes the network’s weights. The authors study where these steps go.

They do two things:

Theory (math analysis) on a simple network:

They analyze a two-layer MLP (a basic neural network) with smooth activation functions. “Smooth” here means the function changes gently and doesn’t have sharp corners, like ELU or GELU; ReLU is not smooth.
They consider data with many input features (like pixels) but only a few outputs (like 10 classes in classification). This is common in real tasks: the input dimension is large, but the number of classes is small.
They use gradient descent and show that most of the learning steps stay inside specific small “subspaces” (sets of directions) that are determined at the start (by the initialization and the first gradient).
To make the math clean, they fix the second layer during training and use a squared-error loss (which is like measuring how far predictions are from the target with a simple distance). They also assume “whitened” inputs (a common trick that treats different features fairly). Later, they test more realistic settings too.
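For readers who want to see what "whitened" means in practice, here is a small sketch. It assumes the convention that the data matrix `X` has one feature per row and that whitening means `X @ X.T` equals the identity; the paper's exact normalization may differ, and the function name `whiten` is made up for this example.

```python
import numpy as np

def whiten(X):
    """Return X_w with X_w @ X_w.T = I, assuming X (d x N) has full row rank and d <= N."""
    # With X = U diag(s) Vt, the whitened data (X X^T)^(-1/2) X equals U @ Vt
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ Vt

rng = np.random.default_rng(0)
mixing = rng.standard_normal((5, 5))
X = mixing @ rng.standard_normal((5, 50))   # 5 correlated features, 50 samples
X_w = whiten(X)
print(np.round(X_w @ X_w.T, 3))             # close to the 5 x 5 identity matrix
```

After whitening, no feature is bigger or more correlated than any other, which is what "treats different features fairly" refers to.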

Everyday analogy: Imagine you’re exploring a giant city but almost all the interesting places you need to visit are along a couple of main roads. Even though there are many side streets, your trip mostly stays on those main roads. Here, the “main roads” are the low-dimensional subspaces.

Experiments (tests on computers):

They train deeper MLPs (more layers), with different smooth activations.
They try more realistic training: cross-entropy loss (common for classification), minibatch SGD with momentum, and Adam optimizer.
They also use unwhitened data (no special preprocessing).
They track how the weight updates move and measure whether the changes stick to a small set of directions.

They use tools like singular values and singular subspaces (from linear algebra) to measure “how many strong directions” the updates use. You can think of singular values as telling you how much the network changes along each direction; most change being concentrated in the top few means the process is “low-rank.”
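As a small illustration of counting "strong directions," the sketch below asks how many singular directions are needed to hold 95% of a weight change's energy. The function name `directions_needed`, the 95% threshold, and the made-up weight change are assumptions for this example.

```python
import numpy as np

def directions_needed(delta_W, energy=0.95):
    """Smallest number of singular directions holding `energy` of the squared Frobenius norm."""
    s = np.linalg.svd(delta_W, compute_uv=False)
    cumulative = np.cumsum(s ** 2) / np.sum(s ** 2)
    count = int(np.searchsorted(cumulative, energy)) + 1
    return min(count, len(s))

rng = np.random.default_rng(0)
left = np.linalg.qr(rng.standard_normal((20, 2)))[0]
right = np.linalg.qr(rng.standard_normal((15, 2)))[0]

# A weight change with two strong directions (sizes 10 and 5) plus tiny noise everywhere else
delta_W = left @ np.diag([10.0, 5.0]) @ right.T + 1e-3 * rng.standard_normal((20, 15))
n_directions = directions_needed(delta_W)
print(n_directions)
```

Even though this matrix could in principle change along 15 directions, two of them account for nearly all of the change.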

What they found (main results)

Theoretical insight: For two-layer networks with smooth activations, the weight updates concentrate in fixed low-dimensional subspaces across training. These subspaces are essentially set at the beginning (based on the initial weights and their first gradient). The number of important directions is tied to the number of output classes K: outside roughly 2K directions, the weights barely change.
Smooth activations encourage low-rank learning: In experiments, networks with smooth activations (ELU, GELU, SiLU) change their weights mainly along a few directions throughout training. Nonsmooth ones (like ReLU) spread changes more widely. In other words, smooth activations make learning more focused.
This behavior holds in deeper networks: Even with more layers, the early layers’ weight updates mostly happen in the corresponding small subspaces defined by the initialization, showing the phenomenon isn’t just for simple networks.
It still mostly holds with practical training: Using cross-entropy loss, minibatch SGD with momentum, and Adam (and without whitening the data), the networks still show mostly low-rank training dynamics. The effect is a bit weaker with Adam, but it’s still there.
A useful trick: low-rank MLPs: Based on these observations, the authors build low-rank versions of MLPs—basically, networks whose weights are constrained to those few important directions. If they initialize them correctly (inside the right subspaces), these smaller networks can reach almost the same classification accuracy as full-sized networks on tasks like Fashion-MNIST and CIFAR-10.

Why this is important:

It explains (theoretically and empirically) why many modern “low-rank” training methods (like LoRA) work so well: because the network naturally learns in low-dimensional spaces.

Why this matters (implications and impact)

Faster, cheaper training: If most learning happens in a small set of directions, we can focus updates there. That means fewer parameters to train, less memory usage, and faster training—especially useful for big models.
Better fine-tuning: Low-rank adaptation methods (like LoRA for LLMs) become more justified. This paper offers a reason “under the hood” for their success.
Model compression: You can keep performance while making models smaller. This helps deploy neural networks on phones or embedded devices.
Understanding neural learning: It gives a clearer picture of how neural networks learn: despite having many parameters, they often use only a few meaningful directions during training, especially with smooth activations and small output dimension.

In short: This paper shows that neural networks with smooth activation functions tend to learn in a few main directions. That insight helps us design smarter, smaller, and more efficient models without sacrificing accuracy.


Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a concise list of concrete gaps and open questions that remain unresolved and could guide future research:

Scope of theory limited to two-layer MLPs with smooth activations
- No formal guarantees for deeper networks; the “small-update subspace” is defined heuristically via recursive initialization but lacks proof of invariance or bounds analogous to the two-layer case.
- No analysis of how depth, width, and layer shapes affect subspace dimensions, stability, or alignment dynamics.
Dependence on whitened inputs and small output dimension
- Theory assumes whitened data and K < d/2; empirical results suggest approximate behavior with unwhitened data, but no conditions are provided for when subspace invariance persists without whitening.
- Theoretical characterization when K is not small (e.g., regression or multi-label settings) is absent; thresholds or additional structural conditions on X that preserve low-rank dynamics are not identified.
Fixed second layer assumption
- The analysis freezes W2; it remains open whether similar low-rank update subspaces exist when both layers are jointly trained (including the interplay of W2 updates with the first-layer subspace).
- Impact of the rank, conditioning, and initialization of W2 on subspace emergence is not studied.
Smoothness assumptions on the activation
- The theory depends on φ being smooth (bounded φ′, φ″), but no necessary/sufficient conditions are established for low-rank dynamics (e.g., explicit dependence of bounds on β, μ).
- What modifications (e.g., smoothed ReLU, softplus) or architectural tricks restore low-rank dynamics for nonsmooth activations are not analyzed.
Optimizer and loss function mismatch between theory and practice
- Theory covers full-batch GD on squared-error; empirical observations show approximate phenomena under cross-entropy, SGD with momentum, and Adam, but there are no theoretical results for these settings.
- The effect of stochasticity (minibatching), optimizer-induced momentum/variance, and adaptive learning rates on subspace alignment and rank is not quantified.
Technical assumptions on gradient singular values and spectral gaps
- The theory requires tail singular values of ∇W1L(t) to be small and a spectral gap to apply Wedin’s theorem; conditions under which these gaps emerge (and how they depend on data, initialization scale ε, and step size η) are not derived.
- Tight rates and constants (currently expressed as Θ(·), O(·) terms) for alignment A(t), gradient decay, and update bounds are not specified, limiting actionable guarantees.
Initialization dependence and sensitivity
- Subspaces depend on initialization (both W1(0) and ∇W1L(0)); the sensitivity of the subspace to initialization distribution, scale ε, and random seed is not quantified.
- No guidance on robust subspace discovery when ε is large or when standard width-scaled initialization schemes (e.g., Kaiming/Glorot) are used.
Missing components commonly used in practice
- Bias parameters, normalization layers (BatchNorm/LayerNorm), residual connections, dropout, and weight decay are excluded; their impact on subspace emergence and rank is unknown.
- Extension to architectures beyond MLPs (CNNs, Transformers, attention mechanisms) is not analyzed.
Subspace identification and practical exploitation
- While U and V are defined theoretically from initialization and the initial gradient, the paper does not provide a practical algorithm for estimating these invariant subspaces in realistic settings (with unwhitened data, stochastic training, and changing loss landscapes).
- Online tracking or adaptive refinement of the subspaces during training is not explored.
Low-rank parameterization design
- The proposed low-rank MLP parameterization lacks a principled choice of rank r (vs. the theoretically motivated 2K) and does not analyze trade-offs between rank, accuracy, and stability.
- Initialization “within the appropriate subspaces” is crucial, but the method to construct these initializations for deeper networks (and under practical constraints) is not specified.
Generalization, robustness, and scalability
- The empirical validation primarily considers synthetic data and classification on Fashion-MNIST and CIFAR-10; broader evaluation on diverse, large-scale, and distribution-shifted datasets is missing.
- The effect of label noise, class imbalance, and data augmentation on subspace emergence and low-rank dynamics is not studied.
- Computational savings (memory, throughput) and training-time benefits of the low-rank parameterization are not quantified.
Interaction with modern fine-tuning and compression methods
- The connection to LoRA and subspace-projected training is mostly conceptual; it remains open how the identified subspaces align with adapter directions, and whether combining these approaches yields provable or empirical gains.
- Whether the emergent subspaces are stable across tasks/domains (supporting parameter-efficient transfer) is not investigated.
Role of the output head and label structure
- The theory leverages small K and the structure of Y; the dependence on label encoding (e.g., soft labels, multi-hot) and output-layer design (shared heads, multi-task heads) is not examined.
- The origin of the 2K dimension in the “active” subspace (vs. K) is used but not deeply justified under variations in loss, label structure, and biases.
Convergence and critical points
- No characterization of the specific stationary points reached under GD and how their geometry relates to subspace invariance and rank.
- The possibility of multiple basins with different subspaces, and transitions between them during training (particularly with stochastic optimizers), remains unexplored.
Measurement choices and metrics
- The primary metric (normalized squared distance of block components from initialization) captures relative contribution but does not isolate causal mechanisms; more direct measures (e.g., projection of updates onto estimated invariant subspaces over time) could provide sharper evidence.
- How subspace alignment A(t) behaves across layers and training phases (early/mid/late) is reported qualitatively but not quantitatively characterized with explicit rates or thresholds.
Data model assumptions
- Theoretical reliance on whitened X sidesteps data correlations; conditions under which preconditioning (data whitening or feature normalization) can recover theory-like behavior are not derived.
- The sensitivity of the phenomenon to N (number of samples), d (input dimension), and m (width) is not mapped out (e.g., finite-sample vs. asymptotic regimes).
Reproducibility and implementation details
- Code is “to be released”; until then, reproducibility of the empirical subspace construction and low-rank parameterization remains limited, especially for the deeper network subspace definitions.

These gaps suggest concrete avenues: proving subspace invariance in deeper networks, extending theory to stochastic optimizers and cross-entropy loss, relaxing data and activation assumptions, developing practical subspace estimation algorithms, and evaluating scalability and robustness across modern architectures and large-scale datasets.


Practical Applications

Immediate Applications

The following applications can be deployed now by leveraging the paper’s finding that MLP training dynamics with smooth activations concentrate within low-dimensional, initialization-dependent subspaces, and by using the proposed low-rank parameterization when initialized in those subspaces.

Industry — training cost and energy reduction for tabular and vision MLPs
- Use case: Reduce compute, memory bandwidth, and wall-clock time for training classification MLPs (e.g., fraud detection, churn prediction, credit scoring, industrial defect classification) by constraining updates to the empirically observed low-dimensional subspaces and/or adopting the low-rank parameterization initialized with subspace bases derived from the initial gradient.
- Workflow: Select smooth activations (ELU/GELU/SiLU), compute top-K gradient singular subspaces at initialization, construct layerwise “small-update” subspaces, initialize low-rank factors accordingly, train with SGD/GD; monitor subspace alignment over time.
- Tools/products: “Rank-2K MLP” training module; optimizer state projection add-ons (in the spirit of GaLore/LDAdam) for MLPs that respect the invariant subspaces; subspace-aware LoRA-like adapters for MLPs.
- Sector: Software, energy, finance, manufacturing.
- Assumptions/dependencies: Small output dimension relative to input; smooth activations; ability to estimate gradient SVD efficiently; effectiveness demonstrated on Fashion MNIST, CIFAR-10, and synthetic data; Adam shows weaker alignment than SGD.
Industry — on-device and edge inference via low-rank models
- Use case: Deploy compressed, low-rank MLPs for mobile/embedded classification (keyboard next-word prediction for small vocabularies, image filters, on-device anomaly detection in IoT) with minimal accuracy loss when initialized in empirically identified subspaces.
- Workflow: Distill or retrain full models into subspace-initialized low-rank MLPs; validate accuracy; ship compressed artifacts.
- Tools/products: Edge-ready low-rank MLP export pipelines; integration with mobile inference runtimes (Core ML, NNAPI).
- Sector: Robotics, consumer devices, IoT.
- Assumptions/dependencies: Small K tasks; accurate subspace estimation at init; data may be unwhitened but empirical results indicate the phenomenon approximately holds.
MLOps — training health monitoring and auto-tuning
- Use case: Monitor principal-angle alignment between gradient singular subspaces and their initialization to detect training instabilities, trigger early stopping, or adjust learning rates when subspace misalignment rises faster than gradient magnitude decays.
- Workflow: Add SVD-based probes to training loops; compute alignment metrics; define thresholds for alerts/actions.
- Tools/products: “Gradient Subspace Monitor” dashboards; callbacks for PyTorch/TF that adapt optimizer hyperparameters based on alignment metrics.
- Sector: Software, cloud ML platforms.
- Assumptions/dependencies: Overhead of SVD computation (mitigate with randomized SVD); effectiveness higher with smooth activations and small init scales.
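
The randomized SVD mitigation mentioned above could look roughly like the following range-finder sketch for estimating the top-k left singular subspace of a gradient matrix. It is an illustration under assumptions, not a prescribed implementation: the function name `randomized_top_subspace`, the `oversample` default, and the synthetic near-low-rank matrix are hypothetical.

```python
import numpy as np

def randomized_top_subspace(G, k, oversample=10, seed=0):
    """Approximate the top-k left singular vectors of G with a randomized range finder."""
    rng = np.random.default_rng(seed)
    # Sketch the column space of G with a random Gaussian test matrix
    sketch = G @ rng.standard_normal((G.shape[1], k + oversample))
    Q, _ = np.linalg.qr(sketch)                 # orthonormal basis for the sketched range
    # An exact SVD of the much smaller matrix Q.T @ G recovers the leading directions
    U_small, _, _ = np.linalg.svd(Q.T @ G, full_matrices=False)
    return Q @ U_small[:, :k]

rng = np.random.default_rng(1)
G = rng.standard_normal((200, 3)) @ rng.standard_normal((3, 100))  # rank-3 stand-in gradient
G += 1e-6 * rng.standard_normal(G.shape)                           # small full-rank noise

U_exact = np.linalg.svd(G, full_matrices=False)[0][:, :3]
U_approx = randomized_top_subspace(G, k=3)

# Cosines of the principal angles between the exact and approximate subspaces
alignment = np.linalg.svd(U_exact.T @ U_approx, compute_uv=False)
print(alignment)
```

The alignment values can then be logged per layer at a chosen interval and compared against a threshold to trigger alerts.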
Academia — reproducible benchmarking and curricula
- Use case: Establish benchmarks for subspace-aware low-rank training on common datasets (Fashion MNIST, CIFAR-10, tabular datasets) comparing smooth vs. nonsmooth activations, optimizers, and initialization scales.
- Workflow: Release code to compute subspaces, apply low-rank parameterization, and evaluate performance vs. full models across optimizers/losses.
- Tools/products: Open-source “Subspace-Init” library; teaching materials demonstrating emergent low-rank dynamics in MLPs.
- Sector: Education, research.
- Assumptions/dependencies: Availability of code; small K (classification) focus; two-layer theoretical basis extended empirically to deeper MLPs.
Policy and sustainability — immediate carbon impact reporting for MLP training
- Use case: Incorporate low-rank training practices into “Green AI” reporting to reduce energy use in model development pipelines, encouraging smooth activations and subspace-aware updates where applicable.
- Workflow: Add low-rank training checklists to procurement and internal ML governance; document energy savings tied to subspace-aware methods.
- Tools/products: Sustainability guidelines; audit templates referencing subspace-based training.
- Sector: Energy, public policy, enterprise governance.
- Assumptions/dependencies: Evidence strongest for MLPs with small output dimension; limited guarantees for large-output tasks or complex architectures.
Daily life — faster updates to personalization models
- Use case: Speed up periodic retraining of small personalization models (recommendations for small catalogs, notification prioritization) on consumer devices or edge servers using low-rank subspace-initialized MLPs.
- Workflow: Periodic fine-tuning with subspace-aware adapters; lightweight update packages.
- Tools/products: On-device updater using low-rank fine-tuning adapters.
- Sector: Consumer apps, e-commerce.
- Assumptions/dependencies: Task fits small K classification; subspace estimation at initialization; robustness across non-whitened data confirmed empirically.
Robotics — real-time control/classification
- Use case: Employ low-rank MLPs in control or perception pipelines where small class counts (e.g., obstacle vs. no obstacle; mode selection) enable efficient, stable training and inference with smooth activations.
- Workflow: Use subspace-initialized low-rank MLPs; monitor alignment to ensure reliability; deploy on constrained hardware.
- Tools/products: Robotics middleware plugin for low-rank MLP inference.
- Sector: Robotics, autonomous systems.
- Assumptions/dependencies: Small output dimension; the approach is strongest for classification tasks.
Healthcare — tabular risk stratification and triage
- Use case: Train low-rank, smooth-activation MLPs for small-class clinical classification (e.g., triage levels), reducing compute cost and potentially enabling privacy-preserving on-premise training.
- Workflow: Subspace-initialized low-rank training on hospital data; monitor subspace alignment as a stability signal.
- Tools/products: EHR-integrated low-rank MLP trainer.
- Sector: Healthcare.
- Assumptions/dependencies: Regulatory and privacy constraints; data may be non-whitened; empirical robustness indicates practicality.

Long-Term Applications

The following applications require further research, scaling, or development to generalize, automate, or extend the approach beyond current scope.

Subspace-preserving optimizers for nonlinear networks
- Vision: Design optimizers that explicitly constrain updates to invariant low-dimensional subspaces identified at initialization, with convergence guarantees in nonlinear settings.
- Products: “Subspace-Aware SGD/Adam” with theoretical guarantees; rank-constrained gradient flow variants.
- Dependencies: Stronger theory beyond two-layer networks; efficient and scalable subspace tracking; robustness under data shifts.
Automated low-rank architecture search and dynamic rank scheduling
- Vision: AutoML tools that infer per-layer ranks and construct low-rank MLPs with initial subspaces tailored to downstream tasks; dynamically adjust rank during training based on alignment and performance.
- Products: Auto-rank selection modules; “Rank Scheduler” for training pipelines.
- Dependencies: Fast, reliable subspace estimation; validation across diverse datasets and large-scale models.
Extension to CNNs, Transformers, and multimodal/foundation models
- Vision: Generalize invariant subspace identification and low-rank update dynamics to more complex architectures (attention layers, convolutional kernels), enabling subspace-aware fine-tuning for foundation models.
- Products: Subspace-aware adapters for Transformers/CNNs; cross-modal subspace monitors.
- Dependencies: New theory capturing nonlinearity and architectural specifics; empirical validation at scale; handling large output spaces.
Communication-efficient federated and distributed learning
- Vision: Use low-rank gradient/update representations aligned to invariant subspaces to dramatically reduce communication overhead in federated learning and distributed training.
- Products: Federated “low-rank update codecs”; subspace-synchronized training protocols.
- Dependencies: Secure, accurate subspace sharing; privacy guarantees; resilience to heterogeneous client data and non-iid distributions.
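
As a rough illustration of where the communication savings would come from, the sketch below encodes an update as two rank-r factors via a truncated SVD and counts the floats that would be transmitted. The names `encode_low_rank` and `decode_low_rank`, the layer sizes, and the choice r = 6 (that is, 2K for K = 3) are assumptions for illustration; a real protocol would also need the privacy and heterogeneity safeguards listed above.

```python
import numpy as np

def encode_low_rank(update, r):
    """Compress an (m, d) update into factors A (m, r) and B (r, d)."""
    U, S, Vt = np.linalg.svd(update, full_matrices=False)
    A = U[:, :r] * S[:r]     # fold singular values into the left factor
    B = Vt[:r]
    return A, B

def decode_low_rank(A, B):
    """Reconstruct the (m, d) update from its factors."""
    return A @ B

rng = np.random.default_rng(0)
m, d, r = 256, 128, 6
# Stand-in client update that is exactly rank r.
true_update = rng.standard_normal((m, r)) @ rng.standard_normal((r, d))

A, B = encode_low_rank(true_update, r)
dense_floats = m * d                  # floats sent without compression
payload_floats = A.size + B.size      # 256*6 + 6*128 = 2304 floats
compression = dense_floats / payload_floats
rel_error = (np.linalg.norm(decode_low_rank(A, B) - true_update)
             / np.linalg.norm(true_update))
print(f"compression x{compression:.1f}, relative error {rel_error:.2e}")
```

The payload scales as r(m + d) instead of md, so the saving grows with layer width as long as the effective rank stays small.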
Privacy-preserving training via low-rank updates
- Vision: Explore whether low-rank, subspace-constrained updates reduce privacy leakage and improve differential privacy budgets while maintaining utility.
- Products: Privacy-optimized subspace training toolkits.
- Dependencies: Formal privacy analyses; empirical studies across sensitive domains.
Explainability, safety, and certification via subspace dynamics
- Vision: Use persistent subspace structures and principal-angle trajectories as interpretable signals for model behavior, stability, and failure modes; inform certification processes for safety-critical applications.
- Products: “Subspace Explainability” modules; audit tools for regulatory compliance.
- Dependencies: Mapping subspace metrics to human-understandable narratives; standardized reporting.
Hardware and kernel support for low-rank operations
- Vision: Specialized accelerators and BLAS kernels optimized for rank-constrained matrix multiplies and fast subspace tracking (e.g., randomized SVD), improving throughput and energy efficiency.
- Products: GPU/TPU kernel libraries; embedded DSP support for low-rank ops.
- Dependencies: Hardware vendor partnerships; workload characterization; widespread software integration.
Policy — standards for energy-efficient AI training
- Vision: Codify best practices (smooth activations, subspace-aware training, low-rank parameterization) into national/international standards and procurement requirements to reduce AI energy consumption.
- Products: Compliance frameworks; energy labels recognizing subspace-aware training.
- Dependencies: Broad evidence across architectures/tasks; stakeholder consensus; governance mechanisms.
Education — curricula on subspace-aware learning
- Vision: Integrate invariant subspace dynamics and low-rank training into ML education, building practitioner awareness of computational and sustainability benefits.
- Products: Course modules; interactive labs.
- Dependencies: Mature teaching materials; accessible tooling.
Cross-domain small-K analytics platforms
- Vision: Turn the insight “output dimension governs effective update rank” into platforms tailored for small-class classification across sectors (healthcare triage, industrial QA, public safety alerts), offering low-rank-first solutions before scaling up to full-rank models.
- Products: Turnkey small-K low-rank ML services.
- Dependencies: Domain adoption; integration with existing data pipelines; accuracy guarantees for critical use cases.

Global assumptions and dependencies affecting feasibility

The strongest theoretical guarantees are for two-layer MLPs with fixed second layer, whitened inputs, squared-error loss, small output dimension K relative to input dimension, smooth activations, small initialization scale, and carefully chosen step sizes.
Empirically, the phenomenon extends to deeper MLPs, cross-entropy loss, SGD with momentum, Adam (to a lesser extent), and unwhitened data, but performance and subspace alignment can vary with optimizer and activation choice.
Practical deployment requires efficient subspace estimation (e.g., randomized SVD), scalable monitoring, and careful hyperparameter selection; benefits diminish as output dimension grows or in highly nonstationary tasks.
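
For the subspace-estimation step, a generic randomized SVD (Gaussian range finder with a few power iterations) is one inexpensive option. The sketch below is a textbook-style version, not code from the paper; the function name `randomized_svd`, its parameters `oversample` and `n_power_iter`, and the test matrix are assumptions for illustration.

```python
import numpy as np

def randomized_svd(M, rank, oversample=10, n_power_iter=2, seed=0):
    """Approximate the top-`rank` singular triplets of M (m x d)."""
    rng = np.random.default_rng(seed)
    k = min(rank + oversample, min(M.shape))
    Omega = rng.standard_normal((M.shape[1], k))    # random test matrix
    Q, _ = np.linalg.qr(M @ Omega)                  # basis for the sampled range
    for _ in range(n_power_iter):                   # sharpen the spectrum
        Q, _ = np.linalg.qr(M.T @ Q)
        Q, _ = np.linalg.qr(M @ Q)
    B = Q.T @ M                                     # small (k x d) problem
    Ub, S, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ Ub
    return U[:, :rank], S[:rank], Vt[:rank]

rng = np.random.default_rng(1)
m, d, r = 200, 100, 6
# Stand-in "gradient": rank-r signal plus a small dense perturbation.
M = rng.standard_normal((m, r)) @ rng.standard_normal((r, d))
M += 1e-6 * rng.standard_normal((m, d))

U, S, Vt = randomized_svd(M, rank=r)
S_exact = np.linalg.svd(M, compute_uv=False)[:r]
print("max relative singular-value error:", np.max(np.abs(S - S_exact) / S_exact))
```

The expensive factorizations act only on matrices with k = rank + oversample columns, which is what makes this attractive when the target rank is on the order of the output dimension.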


Glossary

Adam: An adaptive stochastic optimization algorithm that adjusts learning rates using first and second moments of gradients. "We also show this phenomenon approximately holds for networks trained using SGD with momentum and Adam."
Cross-entropy loss: A classification loss that measures the divergence between predicted probability distributions and target labels. "all on cross-entropy loss"
ELU: Exponential Linear Unit, a smooth activation function that reduces vanishing gradients and can improve learning dynamics. "In the $ELU$ network's first layer, the middle $d - 2K$ singular subspace evolves very slowly throughout training, especially compared to that of the $ReLU$ network."
Frobenius norm: A matrix norm equal to the square root of the sum of the squares of all entries; the Euclidean norm of the matrix viewed as a vector. "We use $\sigma_i(\bm X)$ , $\| \bm X \|_F$ , $\| \bm X \|_1$ , and $\| \bm X \|_{\max}$ to respectively denote the $i^{th}$ singular value, Frobenius norm, matrix-$1$ norm, and maximum magnitude element."
GELU: Gaussian Error Linear Unit, a smooth activation defined as x·Φ(x), where Φ is the standard Gaussian cumulative distribution function. "three had smooth activations ( $ELU$ , $GELU$ , $SiLU$ )"
Gradient descent (GD): An iterative optimization method that updates parameters in the direction of the negative gradient of the loss. "multi-layer perceptrons (MLPs) under gradient descent (GD)."
Invariant low-dimensional subspaces: Fixed subspaces of comparatively small dimension in which the parameter updates concentrate during training. "the weight dynamics concentrate within invariant low-dimensional subspaces throughout training."
Leaky-ReLU: A ReLU variant that allows a small, non-zero gradient for negative inputs to mitigate dead neurons. "( $ReLU$ , $Leaky-ReLU$ , $\mathrm{Randomized-}ReLU$ )"
Low-rank adaptation (LoRA): A fine-tuning technique that adds trainable low-rank adapters to frozen model weights to reduce training cost. "low-rank adaptation (LoRA) \cite{hu2022lora} has recently emerged as a popular fine-tuning technique for LLMs by adding a low-rank adapter to frozen pre-trained weights."
Low-rank parameterization: Representing model weights or layers with low-rank structures to reduce dimensionality and computation while retaining performance. "there exists a low-rank MLP parameterization"
Matrix-1 norm: A matrix norm defined as the maximum absolute column sum. "We use $\sigma_i(\bm X)$ , $\| \bm X \|_F$ , $\| \bm X \|_1$ , and $\| \bm X \|_{\max}$ to respectively denote the $i^{th}$ singular value, Frobenius norm, matrix-$1$ norm, and maximum magnitude element."
Neural collapse: A phenomenon during late-stage training in classification where within-class features converge to their class means and class means form a simple geometric structure. "studied neural collapse \citep{papyan2020prevalence} in this exact setting."
Orthonormal basis: A set of vectors that are mutually orthogonal and each of unit length, spanning a subspace. "Let $\bm U_1, \bm U_2 \in \mathbb{R}^{d \times r}$ be orthonormal bases of two $r$-dimensional subspaces of $\mathbb{R}^d$."
Orthogonal complement: The set of all vectors orthogonal to a given subspace. "and $\mathcal{R}^\perp\left( \bm X \right)$ its orthogonal complement."
Principal angles (between subspaces): Angles that quantify the alignment between two subspaces; smaller angles indicate greater alignment. "the $i^{th}$ principal angle $\theta_i$ between $\bm U_1$ and $\bm U_2$ is defined as such:"
ReLU: Rectified Linear Unit, a piecewise linear activation function defined as max(0, x). "compared to that of the $ReLU$ network."
Semi-orthogonal matrix: A (generally rectangular) matrix with orthonormal columns or rows but not necessarily square. "assuming $\bm W_l(0)$ is initialized as an $\epsilon$ -scaled semi-orthogonal matrix."
SGD with momentum: A stochastic optimization method that accelerates gradient descent by accumulating a velocity vector in directions of consistent descent. "networks trained using SGD with momentum and Adam."
SiLU: Sigmoid Linear Unit (also known as Swish), a smooth activation defined as x·sigmoid(x). "three had smooth activations ( $ELU$ , $GELU$ , $SiLU$ )"
Singular subspace: The subspace spanned by the singular vectors corresponding to selected singular values (e.g., top-K) of a matrix. "Let $\bm L_{1, 1}(t)$ and $\bm R_{1, 1}(t)$ denote top-$K$ left and right singular subspaces of $\nabla_{\bm W_1} \mathcal{L}\left( \bm W_1(t) \right)$"
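Stable rank: A continuous proxy for matrix rank, commonly defined as the ratio of the squared Frobenius norm to the squared spectral norm, $\| \bm X \|_F^2 / \sigma_1(\bm X)^2$; it indicates the effective dimensionality of a matrix. "GD converges to lower stable rank solutions at smaller initialization scales."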
Wedin's Sin Theorem: A perturbation result that bounds changes in subspace angles (sines of principal angles) under matrix perturbations. "allows us to use Wedin's Sin Theorem \citep{wedin1972perturbation} to upper bound the change in singular subspace alignment."
Whitened data: Data transformed to have zero mean and identity covariance matrix, removing correlations and scaling differences. "The data $\bm X \in \mathbb{R}^{d \times N}$ is whitened,"
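
Two of the quantities defined above, principal angles and stable rank, are straightforward to compute. The sketch below uses the standard definitions (cosines of the principal angles are the singular values of $\bm U_1^\top \bm U_2$; stable rank is squared Frobenius norm over squared spectral norm), which are assumed here to match the paper's usage. The function names are illustrative.

```python
import numpy as np

def principal_angles(U1, U2):
    """Principal angles (radians, ascending) between span(U1) and span(U2).

    U1 and U2 are (d, r) matrices with orthonormal columns.
    """
    cosines = np.linalg.svd(U1.T @ U2, compute_uv=False)
    return np.arccos(np.clip(cosines, -1.0, 1.0))

def stable_rank(M):
    """Squared Frobenius norm divided by squared spectral norm."""
    s = np.linalg.svd(M, compute_uv=False)
    return float(np.sum(s ** 2) / s[0] ** 2)

I3 = np.eye(3)
U1 = I3[:, [0, 1]]      # span{e1, e2}
U2 = I3[:, [0, 2]]      # span{e1, e3}: shares e1, differs in the other direction
angles = principal_angles(U1, U2)

sr = stable_rank(np.diag([3.0, 1.0, 1.0]))   # (9 + 1 + 1) / 9
print("principal angles:", angles)
print("stable rank:", sr)
```

Tracking `principal_angles` between the subspace found at initialization and the one recomputed later in training is one simple way to implement the alignment monitoring mentioned in the applications above.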


Open Problems

Approximate linearity of two-layer MLP training dynamics under small initialization

