Multi-Normalized Gradient Descent (MNGD)
- Multi-Normalized Gradient Descent (MNGD) is an optimization framework that normalizes gradients using multiple norms to balance objective scales and mitigate bias.
- It employs iterative alternating projections to approximate NP-hard multi-norm projections, enabling scalable LLM training and efficient Pareto optimization.
- Practical applications span LLM pretraining, multiobjective recommender systems, and constrained optimization, offering memory efficiency and convergence guarantees.
Multi-Normalized Gradient Descent (MNGD) encompasses a family of optimization algorithms that employ normalization of gradients with respect to multiple norms or objectives before aggregation into a search direction. The core strategy is to mitigate the bias introduced by disparate gradient magnitudes, scales, or objectives—permitting scalable, stateless, and Pareto-efficient learning in large-scale or multi-objective settings.
1. Conceptual Foundations and Problem Settings
MNGD was introduced as a general framework for gradient-based optimization where gradient normalization occurs with respect to multiple metrics. Applications span LLM pretraining (Scetbon et al., 10 Feb 2025), nonlinear inequality-constrained optimization (Chen et al., 2019), and multi-objective optimization (MOP) typical in recommender systems or vector-valued loss landscapes (Milojkovic et al., 2019, Yang, 2024).
In LLM training, adaptive optimizers like Adam incur prohibitive memory cost due to auxiliary state variables. MNGD targets “stateless” optimization by exclusively operating on the current instantaneous gradient, transforming it using projections under multiple norm constraints to approximate the effect of adaptive normalization. In multi-objective optimization, the challenge is the imbalance and scale discrepancy between individual objective gradients, addressed in MNGD by normalizing and efficiently combining them to approach Pareto stationarity.
2. Algorithmic Formulations
2.1 Multi-Norm Projection for LLMs
Let $\theta \in \mathbb{R}^{m \times n}$ parametrize the model and $\mathcal{L}(\theta)$ be the sample loss. The MNGD step is constructed by:
- Compute the raw gradient: $g_t = \nabla \mathcal{L}(\theta_t)$.
- Define $k$ norm constraints $\|\cdot\|_{(1)}, \dots, \|\cdot\|_{(k)}$ on $g_t$.
- Obtain the multi-norm projected gradient via
$$\tilde{g}_t = \arg\min_{g} \|g - g_t\|_2^2 \quad \text{s.t.} \quad \|g\|_{(i)} = c_i, \ i = 1, \dots, k.$$
Direct solution is NP-hard. An efficient approximation uses $T$ rounds of alternating norm projections: for $r = 1, \dots, T$ and $i = 1, \dots, k$, set $g \leftarrow P_i(g)$, where each $P_i$ is a closed-form norm projection (e.g., scaled $\ell_2$ or spectral norm normalization).
Update: $\theta_{t+1} = \theta_t - \eta \tilde{g}_t$ (Scetbon et al., 10 Feb 2025).
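A minimal sketch of this step, assuming two illustrative norm constraints (per-row and per-column $\ell_2$ normalization, in the spirit of SinkGD-style variants); the helper names `project_row_norm`, `project_col_norm`, and `mngd_step` are hypothetical, not from the cited paper:

```python
import numpy as np

def project_row_norm(g, target=1.0):
    """Closed-form projection: rescale each row of g to the target l2 norm."""
    norms = np.linalg.norm(g, axis=1, keepdims=True)
    return target * g / (norms + 1e-12)

def project_col_norm(g, target=1.0):
    """Closed-form projection: rescale each column of g to the target l2 norm."""
    norms = np.linalg.norm(g, axis=0, keepdims=True)
    return target * g / (norms + 1e-12)

def mngd_step(theta, grad, lr=1e-2, rounds=5):
    """One MNGD update: approximate the multi-norm projection of the raw
    gradient by alternating the two closed-form projections, then descend."""
    g = grad
    for _ in range(rounds):
        g = project_row_norm(g)
        g = project_col_norm(g)
    return theta - lr * g
```

Because each projection is closed-form, one step costs only a few passes over the gradient matrix and keeps no optimizer state beyond the parameters themselves.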
2.2 Multi-Objective and Inequality-Constrained Formulations
For multi-objective settings, with objectives $f_1, \dots, f_m$, the normalized gradients are defined via scaling:
$$\hat{g}_i = \frac{\nabla f_i(\theta)}{f_i(\theta_0) - f_i^{*}},$$
where $f_i(\theta_0) - f_i^{*}$ is the centered initial loss for objective $i$ (Milojkovic et al., 2019). The common descent direction solves:
$$\min_{\alpha \in \Delta_m} \Big\| \sum_{i=1}^{m} \alpha_i \hat{g}_i \Big\|_2^2,$$
with $\Delta_m = \{\alpha \in \mathbb{R}^m : \alpha_i \ge 0, \ \textstyle\sum_i \alpha_i = 1\}$ the probability simplex. The update: $\theta_{t+1} = \theta_t - \eta \sum_i \alpha_i^{*} \hat{g}_i$.
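For $m = 2$ objectives the min-norm problem over the simplex has a closed-form solution; the sketch below implements it under that assumption (`common_descent_direction` is an illustrative name, not from the cited papers):

```python
import numpy as np

def common_descent_direction(g1, g2):
    """Min-norm convex combination of two (already normalized) objective
    gradients: the closed-form MGDA-style solution for m = 2 objectives."""
    diff = g1 - g2
    denom = diff @ diff
    if denom < 1e-12:                      # gradients coincide
        return g1
    # minimize ||alpha*g1 + (1-alpha)*g2||^2 over alpha in [0, 1]
    alpha = np.clip((g2 - g1) @ g2 / denom, 0.0, 1.0)
    return alpha * g1 + (1.0 - alpha) * g2
```

When the two gradients point in opposite directions, the minimizer is the zero vector, which is exactly the Pareto-stationarity condition of Section 3.2.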
In inequality-constrained programs (Chen et al., 2019), MNGD constructs the search direction as:
$$d = -\left( \frac{\nabla f(x)}{\|\nabla f(x)\|_2} + \frac{\nabla B(x)}{\|\nabla B(x)\|_2} \right),$$
where $B(x) = -\sum_j \log(-c_j(x))$ is the log-barrier for the constraints $c_j(x) \le 0$.
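Under this (reconstructed) formulation, the direction can be sketched as follows; `barrier_normalized_direction` is a hypothetical helper illustrating the unit-norm scaling of the objective and barrier gradients, not code from Chen et al.:

```python
import numpy as np

def barrier_normalized_direction(grad_f, grad_B):
    """Descent direction combining the objective gradient and the log-barrier
    gradient, each scaled to unit l2 norm so neither term dominates by scale."""
    nf = np.linalg.norm(grad_f)
    nb = np.linalg.norm(grad_B)
    return -(grad_f / max(nf, 1e-12) + grad_B / max(nb, 1e-12))
```

For a single constraint $c(x) = x_1 - 1 \le 0$, $\nabla B(x) = (1/(1 - x_1),\, 0)$ blows up near the boundary, but the normalization caps its contribution to a unit vector, which is what lets the trajectory track the central path rather than stall at the barrier.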
3. Fixed-Point Analysis and Convergence Guarantees
3.1 Alternating Projection Fixed Point
For the multi-norm projection in LLM training, under the assumption that each norm projection $P_i$ preserves the $\ell_2$-norm and is idempotent, the alternating sequence of projections converges to a common fixed point up to arbitrary precision. Specifically, for $k = 2$, the sequence
$$g^{(r+1)} = (P_2 \circ P_1)\big(g^{(r)}\big), \qquad g^{(0)} = g_t,$$
converges to $g^{\star}$, where $P_1(g^{\star}) = P_2(g^{\star}) = g^{\star}$ (Scetbon et al., 10 Feb 2025). For $k > 2$, the analysis extends similarly.
Linear convergence in the Hilbert metric is established for the SinkGD variant, tied to the convergence of the Sinkhorn normalization (Scetbon et al., 10 Feb 2025).
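The fixed-point behavior is easy to observe numerically: for a positive matrix, the distance between successive iterates of the composed row/column normalization should decay geometrically. The helper names below (`composed_projection`, `residuals`) are illustrative, not from the cited paper:

```python
import numpy as np

def composed_projection(g):
    """One round of row- then column-wise l2 normalization."""
    g = g / (np.linalg.norm(g, axis=1, keepdims=True) + 1e-12)
    return g / (np.linalg.norm(g, axis=0, keepdims=True) + 1e-12)

def residuals(g, rounds=30):
    """Distances between successive iterates of the alternating projection;
    linear convergence shows up as geometric decay of this sequence."""
    out = []
    for _ in range(rounds):
        nxt = composed_projection(g)
        out.append(np.linalg.norm(nxt - g))
        g = nxt
    return out
```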
3.2 Multi-Objective Pareto Stationarity
Deterministic MNGD for multi-objective optimization guarantees convergence of iterates to Pareto-stationary points, i.e., the convex hull of the objective gradients at a limit point $\theta^{*}$ contains zero (Milojkovic et al., 2019, Yang, 2024):
$$0 \in \operatorname{conv}\{\nabla f_1(\theta^{*}), \dots, \nabla f_m(\theta^{*})\}.$$
Stochastic variants maintain this property almost surely under diminishing learning rates.
3.3 Constrained Optimization and Barrier Central Path
For inequality-constrained programs, MNGD trajectories converge in finite time to the boundary of the feasible set, yielding -KKT solutions, and under a "relative convexity" metric, exhibit local convergence toward second-order KKT points (Chen et al., 2019).
4. Key Algorithmic Instances
| Variant | Normalization Mechanism | Application Domain |
|---|---|---|
| SWAN | Row-norm + whitening (Schatten-∞) | LLM training (stateless) |
| SinkGD | Alternating row/column | LLM training (stateless) |
| GBBN | Regularized $\ell_2$ gradient normalization | Multi-objective unconstrained |
| MGDRec | Objective-scale normalization | Recommender systems |
| (Chen et al., 2019) | Barrier and objective norm | Constrained optimization |
SinkGD, an instance of MNGD with row- and column-wise normalization, achieves linear-time and linear-memory complexity for LLM-scale training (Scetbon et al., 10 Feb 2025). SWAN performs row normalization followed by whitening, corresponding to Schatten-∞ normalization, but requires more expensive matrix operations. The GBBN algorithm combines gradient normalization (dividing each objective gradient by its norm plus a small $\epsilon$ for stability), Barzilai–Borwein step sizes, and a nonmonotone Armijo line search (Yang, 2024).
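A hedged sketch of GBBN's two main ingredients, the regularized normalization and the Barzilai–Borwein (BB1) step size; the function names and the fallback step are assumptions for illustration, not taken from Yang (2024):

```python
import numpy as np

def gbbn_direction(grads, eps=1e-8):
    """Average of objective gradients, each divided by its norm plus eps
    (the regularized normalization that keeps small gradients stable)."""
    return sum(g / (np.linalg.norm(g) + eps) for g in grads) / len(grads)

def bb_step_size(theta, theta_prev, grad, grad_prev, fallback=1e-3):
    """Barzilai-Borwein (BB1) step size from successive iterates/gradients."""
    s = theta - theta_prev
    y = grad - grad_prev
    denom = s @ y
    if abs(denom) < 1e-12:       # curvature estimate degenerate: use fallback
        return fallback
    return (s @ s) / denom
```

In the full method this step size would then be safeguarded by the nonmonotone Armijo line search before the update is accepted.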
5. Empirical Performance and Practical Implications
5.1 LLM Training
Experiments on pre-training LLaMA models (up to 1.3B parameters) with SinkGD demonstrated token-efficiency speed-ups (2.4×–2.8×) and 3× effective throughput versus Adam, with memory footprint reduced to parameter size only (e.g., 2.98GB vs. Adam's ~7.5GB for 1.3B) (Scetbon et al., 10 Feb 2025). Test perplexity matched or surpassed Adam, SWAN, and recent memory-efficient baselines.
5.2 Multiobjective Optimization
In multiobjective recommender systems, gradient normalization enabled simultaneous improvement across uncorrelated objectives (e.g., recall and content quality). Pareto-tradeoff plots show that only normalized MNGD and related strategies approach or dominate the empirical Pareto front (Milojkovic et al., 2019). For unconstrained numerical test suites, GBBN consistently achieved denser Pareto fronts and faster convergence than established Barzilai–Borwein variants, especially under moderate regularization (Yang, 2024).
5.3 Constrained Optimization
On benchmark constrained optimization problems and industrial frame-structure optimization, MNGD achieved steady reduction in objective while maintaining constraint satisfaction, efficiently tracking the normalized central path implied by the log-barrier interior-point methods (Chen et al., 2019).
6. Theoretical and Methodological Connections
MNGD generalizes classical steepest descent by introducing multi-norm constraints or normalization—subsuming adaptive optimization (Adam) when combined with per-coordinate moment scaling, equivalence to multiprojection synthesis (as in SWAN/SinkGD), and connections to interior-point flows via normalized barrier-gradients (Scetbon et al., 10 Feb 2025, Chen et al., 2019).
In multiobjective and vectorized loss settings, normalization enforces balance among gradients of different (possibly incommensurate) objectives, a necessity for scalable, practical Pareto optimization. Alternating projection fixed-point theory, dual-quadratic programming, and nonmonotone line-search regimes form the central theoretical apparatus for convergence and step-size control (Yang, 2024).
7. Limitations and Future Directions
While MNGD eliminates memory-intensive optimizer state and stabilizes multi-objective training, some limitations remain. The multi-norm projection step is generally NP-hard and practical schemes involve iterative alternating projections, whose efficiency depends on the selected norms. For deep model training, highly structured matrix normalization (e.g., whitening in SWAN) incurs significant computation unless simplified as in SinkGD (Scetbon et al., 10 Feb 2025). In multi-objective and constrained settings, the quality of Pareto-stationarity and final convergence is sensitive to normalization constants, aggregation schemes, and problem-specific geometry (Milojkovic et al., 2019, Yang, 2024). Further work could explore adaptive selection of norms, efficient high-rank projections, and extension to broader classes of nonconvex or stochastic objectives.