
Multi-Normalized Gradient Descent (MNGD)

Updated 19 February 2026
  • Multi-Normalized Gradient Descent (MNGD) is an optimization framework that normalizes gradients using multiple norms to balance objective scales and mitigate bias.
  • It employs iterative alternating projections to approximate NP-hard multi-norm projections, enabling scalable LLM training and efficient Pareto optimization.
  • Practical applications span LLM pretraining, multiobjective recommender systems, and constrained optimization, offering memory efficiency and convergence guarantees.

Multi-Normalized Gradient Descent (MNGD) encompasses a family of optimization algorithms that employ normalization of gradients with respect to multiple norms or objectives before aggregation into a search direction. The core strategy is to mitigate the bias introduced by disparate gradient magnitudes, scales, or objectives—permitting scalable, stateless, and Pareto-efficient learning in large-scale or multi-objective settings.

1. Conceptual Foundations and Problem Settings

MNGD was introduced as a general framework for gradient-based optimization in which gradients are normalized with respect to multiple metrics. Applications span LLM pretraining (Scetbon et al., 10 Feb 2025), nonlinear inequality-constrained optimization (Chen et al., 2019), and multi-objective optimization (MOP), as is typical in recommender systems or vector-valued loss landscapes (Milojkovic et al., 2019, Yang, 2024).

In LLM training, adaptive optimizers like Adam incur prohibitive memory cost due to auxiliary state variables. MNGD targets “stateless” optimization by exclusively operating on the current instantaneous gradient, transforming it using projections under multiple norm constraints to approximate the effect of adaptive normalization. In multi-objective optimization, the challenge is the imbalance and scale discrepancy between individual objective gradients, addressed in MNGD by normalizing and efficiently combining them to approach Pareto stationarity.

2. Algorithmic Formulations

2.1 Multi-Norm Projection for LLMs

Let $\Theta\subset\mathbb{R}^d$ parametrize the model and $\ell(\theta,x)$ be the sample loss. The MNGD step is constructed by:

  1. Compute the raw gradient: $\nabla_t = \nabla_\theta\ell(\theta_t, x^{(t)})\in\mathbb{R}^d$.
  2. Define $K$ norm constraints $g_1,\ldots,g_K$ on $\mathbb{R}^d$.
  3. Obtain the multi-norm projected gradient via

$$\hat{G}_t \in \arg\max_{z\in\mathbb{R}^d} \langle\nabla_t, z\rangle \quad \text{s.t.} \quad \forall i\in[K],\ g_i(z)=1$$

Solving this exactly is NP-hard. An efficient approximation uses $L$ rounds of alternating norm projections: for $l=1,\ldots,L$ and $i=1,\ldots,K$, set $x \leftarrow \mathcal{P}_{g_i}(x)$, where each $\mathcal{P}_{g_i}$ is a closed-form norm projection (e.g., scaled $\ell_p$ or spectral-norm normalization).

Update: $\theta_{t+1} = \theta_t - \eta_t \hat{G}_t$ (Scetbon et al., 10 Feb 2025).
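As a concrete sketch, the step above can be approximated with alternating closed-form projections. The example below uses row-wise and column-wise $\ell_2$ normalization as the two norm constraints (the SinkGD choice); the function names and the round count `n_rounds` are illustrative, not taken from the paper.

```python
import numpy as np

def project_rows_l2(G, eps=1e-8):
    """Closed-form projection: rescale each row of G to unit l2 norm."""
    return G / (np.linalg.norm(G, axis=1, keepdims=True) + eps)

def project_cols_l2(G, eps=1e-8):
    """Closed-form projection: rescale each column of G to unit l2 norm."""
    return G / (np.linalg.norm(G, axis=0, keepdims=True) + eps)

def multi_norm_project(G, n_rounds=5):
    """Approximate the (NP-hard) multi-norm projection with L rounds of
    alternating norm projections; here K=2: rows, then columns."""
    X = G.copy()
    for _ in range(n_rounds):
        X = project_cols_l2(project_rows_l2(X))
    return X

def mngd_step(theta, grad, lr):
    """One stateless MNGD update: only the instantaneous gradient is used,
    so no optimizer state beyond the parameters is stored."""
    return theta - lr * multi_norm_project(grad)
```

Because the step consumes only the current gradient, the optimizer's memory footprint is the parameter tensor itself, which is the source of the memory savings discussed in Section 5.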

2.2 Multi-Objective and Inequality-Constrained Formulations

For multi-objective settings, with objectives $f_1,\ldots,f_m:\mathbb{R}^D\to\mathbb{R}$, the normalized gradients $\hat{g}_i$ are defined via scaling:

$$\hat{g}_i(\theta) = \nabla_\theta f_i(\theta)/\ell_i^0$$

where $\ell_i^0$ is the centered initial loss for $f_i$ (Milojkovic et al., 2019). The common descent direction $v$ solves:

$$\min_{\alpha\in\Delta} \left\| \sum_{i=1}^m \alpha_i \hat{g}_i \right\|^2$$

with $\Delta=\{\alpha\in\mathbb{R}^m \mid \alpha_i\geq 0,\ \sum_i\alpha_i=1\}$. The update is $\theta_{t+1} = \theta_t - \eta\, v(\theta_t)$.
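For $m=2$ the simplex-constrained QP above has a well-known closed-form solution (as in MGDA-style methods). The sketch below is illustrative, with helper names of my own choosing:

```python
import numpy as np

def min_norm_direction_2(g1, g2):
    """Closed form of min_{a in [0,1]} ||a*g1 + (1-a)*g2||^2, the m=2
    case of the simplex QP that defines the common direction v."""
    diff = g1 - g2
    denom = diff @ diff
    if denom < 1e-12:                  # gradients (nearly) identical
        return 0.5 * (g1 + g2)
    a = np.clip((g2 - g1) @ g2 / denom, 0.0, 1.0)
    return a * g1 + (1.0 - a) * g2

def mo_mngd_step(theta, grads, initial_losses, lr):
    """Normalize each objective gradient by its initial loss, then step
    along the resulting common descent direction."""
    g1, g2 = (g / l0 for g, l0 in zip(grads, initial_losses))
    return theta - lr * min_norm_direction_2(g1, g2)
```

Note that for directly conflicting gradients of equal magnitude the direction collapses to zero, which is exactly the Pareto-stationarity condition of Section 3.2.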

In inequality-constrained programs (Chen et al., 2019), MNGD constructs the search direction as:

$$d_k = -\frac{\nabla f(x_k)}{\|\nabla f(x_k)\|} - \zeta\,\frac{\nabla\Phi(x_k)}{\|\nabla\Phi(x_k)\|},$$

where $\Phi(x) = -\sum_{i=1}^m \log(-g_i(x))$ is the log-barrier and $\zeta\in[0,1)$.
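A minimal sketch of assembling this search direction; the constraint representation (a list of `(g_i, grad_g_i)` callables) and the function names are assumptions for illustration:

```python
import numpy as np

def log_barrier_grad(x, constraints):
    """Gradient of Phi(x) = -sum_i log(-g_i(x)); `constraints` is a list
    of (g_i, grad_g_i) callables, and all g_i(x) must be strictly
    negative (x strictly feasible)."""
    return sum(dg(x) / (-g(x)) for g, dg in constraints)

def barrier_direction(grad_f, grad_phi, zeta=0.5, eps=1e-12):
    """d_k = -grad_f/||grad_f|| - zeta * grad_phi/||grad_phi||,
    with zeta in [0, 1) weighting the normalized barrier term."""
    return (-grad_f / (np.linalg.norm(grad_f) + eps)
            - zeta * grad_phi / (np.linalg.norm(grad_phi) + eps))
```

Because both terms are normalized, $d_k$ has magnitude at most $1+\zeta$ regardless of how steep the objective or barrier becomes near the boundary.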

3. Fixed-Point Analysis and Convergence Guarantees

3.1 Alternating Projection Fixed Point

For the multi-norm projection in LLM training, under the assumption that each norm projection $\mathcal{P}_{g_i}$ preserves the $\ell_2$-norm and is idempotent, the alternating sequence of projections converges to a common fixed point up to arbitrary precision. Specifically, for $K=2$, the sequence

$$x_{2n+1} = \mathcal{P}_{g_1}(x_{2n}), \qquad x_{2n+2} = \mathcal{P}_{g_2}(x_{2n+1})$$

converges to $x^*\in\mathcal{F}$, where $\mathcal{F} = \{x : \mathcal{P}_{g_1}(x) = \mathcal{P}_{g_2}(x) = x\}$ (Scetbon et al., 10 Feb 2025). For $K>2$, the analysis extends similarly.
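This fixed-point behavior can be checked numerically. The snippet below iterates the $K=2$ scheme with row- and column-wise $\ell_2$ normalization on a square matrix (so that both constraints are simultaneously satisfiable) and verifies that each projection leaves the limit essentially unchanged; it is a sanity check of my own, not part of the cited analysis.

```python
import numpy as np

def row_norm(X):
    """P_{g1}: rescale rows of X to unit l2 norm."""
    return X / np.linalg.norm(X, axis=1, keepdims=True)

def col_norm(X):
    """P_{g2}: rescale columns of X to unit l2 norm."""
    return X / np.linalg.norm(X, axis=0, keepdims=True)

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 6))     # square, so both norm constraints are compatible
for _ in range(500):            # x_{2n+1} = P_{g1}(x_{2n}), x_{2n+2} = P_{g2}(x_{2n+1})
    X = col_norm(row_norm(X))

# at a common fixed point, each projection leaves X (almost) unchanged
assert np.allclose(row_norm(X), X, atol=1e-5)
assert np.allclose(col_norm(X), X, atol=1e-5)
```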

Linear convergence in the Hilbert metric is established for the SinkGD variant, tied to the convergence of the Sinkhorn normalization (Scetbon et al., 10 Feb 2025).

3.2 Multi-Objective Pareto Stationarity

Deterministic MNGD for multi-objective optimization guarantees convergence of iterates to Pareto-stationary points, i.e., points $\theta^*$ at which the convex hull of the objective gradients contains zero (Milojkovic et al., 2019, Yang, 2024):

$$0 \in \mathrm{conv}\{\nabla f_1(\theta^*),\ldots,\nabla f_m(\theta^*)\}$$

Stochastic variants maintain this property almost surely under diminishing learning rates.

3.3 Constrained Optimization and Barrier Central Path

For inequality-constrained programs, MNGD trajectories reach the boundary of the feasible set in finite $O(1/(1-\zeta))$ time, yielding $\epsilon$-KKT solutions; under a "relative convexity" metric, they exhibit local convergence toward second-order KKT points (Chen et al., 2019).

4. Key Algorithmic Instances

| Variant | Normalization Mechanism | Application Domain |
| --- | --- | --- |
| SWAN | Row-norm + whitening (Schatten-∞) | LLM training (stateless) |
| SinkGD | Alternating row/column $\ell_2$ | LLM training (stateless) |
| GBBN | Regularized $\ell_2$ gradient norm | Multi-objective unconstrained |
| MGDRec | Objective-scale normalization | Recommender systems |
| (Chen et al., 2019) | Barrier and objective norm | Constrained optimization |

SinkGD, an instance of MNGD with row- and column-wise $\ell_2$ normalization, achieves linear-time and linear-memory complexity for LLM-scale training (Scetbon et al., 10 Feb 2025). SWAN performs row normalization followed by whitening, corresponding to Schatten-∞ normalization, but requires more expensive matrix operations. The GBBN algorithm incorporates gradient normalization (with an $\ell_2$-plus-$\eta$ denominator for stability), Barzilai–Borwein step sizes, and a nonmonotone Armijo line search (Yang, 2024).
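A minimal sketch of two GBBN-style ingredients, the regularized $\ell_2$ denominator and a Barzilai–Borwein (BB1) step size. For brevity the normalized gradients are simply averaged here, whereas GBBN proper combines them through the simplex QP of Section 2.2 and couples the step with a nonmonotone Armijo line search; the function names are my own.

```python
import numpy as np

def normalized_average_direction(grads, eta=1e-3):
    """Regularized normalization: divide each objective gradient by
    (||g||_2 + eta) so the direction stays bounded as a gradient
    vanishes; plain averaging stands in for GBBN's QP aggregation."""
    return sum(g / (np.linalg.norm(g) + eta) for g in grads) / len(grads)

def bb1_step(theta, theta_prev, grad, grad_prev, eps=1e-12):
    """Barzilai-Borwein (BB1) step size <s,s>/|<s,y>| computed from
    successive iterates s = theta - theta_prev and gradients
    y = grad - grad_prev."""
    s = theta - theta_prev
    y = grad - grad_prev
    return (s @ s) / (abs(s @ y) + eps)
```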

5. Empirical Performance and Practical Implications

5.1 LLM Training

Experiments on pre-training LLaMA models (up to 1.3B parameters) with SinkGD demonstrated token-efficiency speed-ups (2.4×–2.8×) and 3× effective throughput versus Adam, with memory footprint reduced to parameter size only (e.g., 2.98GB vs. Adam's ~7.5GB for 1.3B) (Scetbon et al., 10 Feb 2025). Test perplexity matched or surpassed Adam, SWAN, and recent memory-efficient baselines.

5.2 Multiobjective Optimization

In multiobjective recommender systems, gradient normalization enabled simultaneous improvement across uncorrelated objectives (e.g., recall and content quality). Pareto-tradeoff plots show that only normalized MNGD and related strategies approach or dominate the empirical Pareto front (Milojkovic et al., 2019). For unconstrained numerical test suites, GBBN consistently achieved denser Pareto fronts and faster convergence than established Barzilai–Borwein variants, especially under moderate regularization (Yang, 2024).

5.3 Constrained Optimization

On benchmark constrained optimization problems and industrial frame-structure optimization, MNGD achieved steady reduction in objective while maintaining constraint satisfaction, efficiently tracking the normalized central path implied by the log-barrier interior-point methods (Chen et al., 2019).

6. Theoretical and Methodological Connections

MNGD generalizes classical steepest descent by introducing multi-norm constraints or normalization. It subsumes adaptive optimization (Adam) when combined with per-coordinate moment scaling, admits a multi-projection synthesis (as in SWAN/SinkGD), and connects to interior-point flows via normalized barrier gradients (Scetbon et al., 10 Feb 2025, Chen et al., 2019).

In multiobjective and vectorized loss settings, normalization enforces balance among gradients of different (possibly incommensurate) objectives, a necessity for scalable, practical Pareto optimization. Alternating projection fixed-point theory, dual-quadratic programming, and nonmonotone line-search regimes form the central theoretical apparatus for convergence and step-size control (Yang, 2024).

7. Limitations and Future Directions

While MNGD eliminates memory-intensive optimizer state and stabilizes multi-objective training, some limitations remain. The multi-norm projection step is generally NP-hard and practical schemes involve iterative alternating projections, whose efficiency depends on the selected norms. For deep model training, highly structured matrix normalization (e.g., whitening in SWAN) incurs significant computation unless simplified as in SinkGD (Scetbon et al., 10 Feb 2025). In multi-objective and constrained settings, the quality of Pareto-stationarity and final convergence is sensitive to normalization constants, aggregation schemes, and problem-specific geometry (Milojkovic et al., 2019, Yang, 2024). Further work could explore adaptive selection of norms, efficient high-rank projections, and extension to broader classes of nonconvex or stochastic objectives.
