Deriving Hyperparameter Scaling Laws via Modern Optimization Theory

Published 16 Mar 2026 in cs.LG | (2603.15958v1)

Abstract: Hyperparameter transfer has become an important component of modern large-scale training recipes. Existing methods, such as muP, primarily focus on transfer between model sizes, with transfer across batch sizes and training horizons often relying on empirical scaling rules informed by insights from timescale preservation, quadratic proxies, and continuous-time approximations. We study hyperparameter scaling laws for modern first-order optimizers through the lens of recent convergence bounds for methods based on the Linear Minimization Oracle (LMO), a framework that includes normalized SGD, signSGD (approximating Adam), and Muon. Treating bounds in recent literature as a proxy and minimizing them across different tuning regimes yields closed-form power-law schedules for learning rate, momentum, and batch size as functions of the iteration or token budget. Our analysis, holding model size fixed, recovers most insights and observations from the literature under a unified and principled perspective, with clear directions open for future research. Our results draw particular attention to the interaction between momentum and batch-size scaling, suggesting that optimal performance may be achieved with several scaling strategies.


Summary

The paper presents explicit power-law scaling laws derived from non-asymptotic convergence bounds within the LMO framework.
It details how the optimal learning rate, batch size, and momentum settings scale with the token budget to achieve a $T^{-1/4}$ convergence rate.
Empirical validation on transformer pretraining underlines the practical implications for hyperparameter transfer in compute-constrained environments.

Explicit Hyperparameter Scaling Laws from Modern Optimization Theory

Introduction

The paper "Deriving Hyperparameter Scaling Laws via Modern Optimization Theory" (2603.15958) gives a theoretical treatment of hyperparameter scaling in large-scale deep learning optimization. It focuses on how learning rate, momentum, and batch size should scale with the training horizon and token budget when the model architecture is held fixed. The analysis uses recent non-asymptotic convergence bounds within the Linear Minimization Oracle (LMO) framework, which covers normalized SGD, signSGD (as an approximation of Adam), and Muon. By minimizing these bounds in several tuning regimes, the authors derive explicit power-law schedules and clarify which empirical, previously heuristic scaling rules can be justified directly from optimization theory. The result is guidance for hyperparameter transfer in compute-constrained, high-cost training environments.

Theoretical Framework

The central contribution is a unified approach for extracting hyperparameter scaling laws from convergence bounds for stochastic optimizers with normalized updates and momentum. Because the LMO framework covers normalized SGD, signSGD, and Muon, the analysis applies to the optimizer families used in modern LLM pretraining pipelines, with signSGD serving as the theoretical stand-in for Adam.

The core non-convex bound has the form:

$\min_{1\le k\le K} \mathbb{E}[\|\nabla f(x^k)\|_\star] \leq \frac{\Delta_0}{\eta K} + \frac{2\rho\sigma}{\alpha \sqrt{b} K} + 2\rho\sigma\sqrt{\frac{\alpha}{b}} + \frac{7L\eta}{2} + \frac{2L\eta}{\alpha},$

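where $\eta$ is the learning rate, $b$ the batch size, $K$ the number of iterations, and $\alpha$ the normalized momentum parameter ($\alpha = 1-\beta$ in terms of the usual momentum coefficient $\beta$). Among the constants, $L$ is the smoothness constant, $\sigma$ the gradient-noise level, $\rho$ a norm-equivalence constant, and $\Delta_0$ the initial suboptimality. The proxy combines a deterministic optimization term, a momentum burn-in term, a stochastic noise floor, and smoothness-induced error terms. Because LMO-based optimizers normalize the update and couple it to momentum, batch size interacts with the training horizon in a way that classical SGD analysis does not capture, which is what produces a non-trivial optimal batch size.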
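As a concrete reference, the proxy can be evaluated numerically. The sketch below assumes the reading of the bound in which the square root covers only $\alpha/b$ and the two smoothness terms enter linearly in $\eta$, which is the reading consistent with the exponents quoted in this summary. The function name and the default constants (all set to 1) are placeholders, not values from the paper.

```python
import math

def lmo_proxy_bound(eta, alpha, b, K, delta0=1.0, L=1.0, sigma=1.0, rho=1.0):
    """Evaluate the proxy bound term by term (constants are placeholder values)."""
    optimization = delta0 / (eta * K)                          # deterministic optimization
    burn_in = 2.0 * rho * sigma / (alpha * math.sqrt(b) * K)   # momentum burn-in
    noise_floor = 2.0 * rho * sigma * math.sqrt(alpha / b)     # stochastic noise floor
    smoothness = 3.5 * L * eta + 2.0 * L * eta / alpha         # smoothness-induced error
    return optimization + burn_in + noise_floor + smoothness


# Hypothetical settings, chosen only to exercise the function.
print(lmo_proxy_bound(eta=0.1, alpha=0.1, b=4, K=100))
```

At fixed `eta`, `alpha`, and `K`, a larger `b` lowers the burn-in and noise-floor terms and leaves the other two unchanged.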

Scaling Law Derivations

Fixed Momentum Regime

With momentum fixed, minimizing the proxy bound leads to several concrete predictions:

For fixed iteration count $K$ , the optimal learning rate scales as $\eta^\star \propto K^{-1/2}$ , with the minimized bound improving with batch size.
For a fixed token budget $T = bK$, the optimal learning rate scales as $\eta^\star \propto b^{1/2} T^{-1/2}$, and the jointly optimal batch size and learning rate obey $b^\star \propto T^{1/2}, \, \eta^\star \propto T^{-1/4}$. Performance, as captured by the gradient-norm proxy, decays as $T^{-1/4}$. Crucially, the optimal batch size exceeds $1$ only once the token budget passes a momentum-dependent threshold; for SGD, by contrast, no such non-trivial batch-size scaling appears.
Figure 1: Verification of the fixed-momentum theorem, illustrating the fixed-momentum scaling laws: optimal performance, batch size, and learning rate as functions of the token budget $T$.
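The fixed-momentum exponents follow from a short calculation. The sketch below assumes the smoothness terms enter the bound linearly in $\eta$ and introduces the shorthand $C_\alpha = L(7/2 + 2/\alpha)$, which is not notation from the paper.

```latex
% Step 1: fixed K. Only two terms depend on \eta:
\frac{\Delta_0}{\eta K} + C_\alpha \eta
\;\Rightarrow\;
\eta^\star = \sqrt{\frac{\Delta_0}{C_\alpha K}} \propto K^{-1/2}.

% Step 2: fixed token budget T = bK, so K = T/b:
\eta^\star = \sqrt{\frac{\Delta_0\, b}{C_\alpha T}} \propto b^{1/2} T^{-1/2},
\qquad
\text{bound} = 2\sqrt{\frac{\Delta_0 C_\alpha\, b}{T}}
  + \frac{2\rho\sigma\sqrt{b}}{\alpha T}
  + 2\rho\sigma\sqrt{\frac{\alpha}{b}}.

% Step 3: balance the b^{1/2} T^{-1/2} term against the b^{-1/2} noise floor:
b^\star = \rho\sigma\sqrt{\frac{\alpha T}{\Delta_0 C_\alpha}} \propto T^{1/2},
\qquad
\eta^\star \propto T^{-1/4},
\qquad
\text{bound} \propto T^{-1/4}.
```

Under this leading-order approximation, $b^\star$ exceeds $1$ only when $T > \Delta_0 C_\alpha / (\rho^2\sigma^2\alpha)$, which is one way to see the momentum-dependent threshold. The burn-in term is of order $T^{-3/4}$ at $b \propto T^{1/2}$ and is neglected in Step 3.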

Fixed Batch Size and Momentum Tuning

By holding batch size fixed and minimizing the proxy with respect to momentum and learning rate, the optimal schedules become $\eta^\star \propto b^{1/2} \alpha^{1/2} T^{-1/2}$ and $\alpha^\star \propto b T^{-1/2}$ . Upon further minimization, the achievable rate remains $T^{-1/4}$ .

Figure 2: Numerical verification of the theorem for the fixed-batch-size, momentum-tuned regime.
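A leading-order version of this regime can be written in closed form. The sketch below assumes small $\alpha$, so that the $2L\eta/\alpha$ term dominates the smoothness error, and it ignores the burn-in term. The function name and the unit default constants are illustrative, not taken from the paper.

```python
import math

def tune_lr_and_momentum(b, T, delta0=1.0, L=1.0, sigma=1.0, rho=1.0):
    """Leading-order minimizers of the proxy at fixed batch size b and token budget T.

    Assumes alpha is small (2*L*eta/alpha dominates 7*L*eta/2) and drops burn-in.
    """
    # Minimizing over eta first leaves
    #   2*sqrt(2*L*delta0*b/(alpha*T)) + 2*rho*sigma*sqrt(alpha/b).
    # Setting the derivative in alpha/b to zero gives
    #   alpha/b = sqrt(2*L*delta0/T) / (rho*sigma).
    alpha = b * math.sqrt(2.0 * L * delta0 / T) / (rho * sigma)
    alpha = min(alpha, 1.0)  # alpha = 1 - beta cannot exceed 1
    # eta* = sqrt(delta0 / (C*K)) with C ~ 2L/alpha and K = T/b.
    eta = math.sqrt(delta0 * b * alpha / (2.0 * L * T))
    return eta, alpha


eta_a, alpha_a = tune_lr_and_momentum(b=8, T=1e8)
eta_b, alpha_b = tune_lr_and_momentum(b=8, T=16e8)
print(alpha_a / alpha_b)  # 16x more tokens: alpha shrinks by a factor sqrt(16)
print(eta_a / eta_b)      # and eta shrinks by a factor 16**0.75
```

Substituting the tuned $\alpha$ into the learning-rate formula gives $\eta \propto b\,T^{-3/4}$, so doubling `b` at fixed `T` doubles both outputs as long as `alpha` stays below 1.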

Joint Tuning of All Hyperparameters

Minimizing the bound with respect to batch size, learning rate, and normalized momentum yields the asymptotics:

$b_T^\star \propto T^{1/6}$, $\eta_T^\star \propto T^{-7/12}$, $\alpha_T^\star \propto T^{-1/3}$.
The convergence rate with respect to token budget remains $T^{-1/4}$ regardless of the chosen scaling regime, up to constant factors.

Figure 4: Numerical verification of the joint-tuning theorem: asymptotic behavior when batch size, learning rate, and momentum are tuned together.
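One way to recover these exponents is to balance the two lower-order terms once the leading-order relations $\alpha \propto b\,T^{-1/2}$ and $\eta \propto b\,T^{-3/4}$ are substituted. The sketch below does this bookkeeping in exact rational arithmetic. It is a reconstruction under those assumptions, not the paper's proof, and the function name is illustrative.

```python
from fractions import Fraction as F

def joint_tuning_exponents():
    """Return (phi, eta_exp, alpha_exp) for b ~ T**phi, eta ~ T**eta_exp, alpha ~ T**alpha_exp."""
    # With b ~ T**phi, alpha ~ b*T**(-1/2) and eta ~ b*T**(-3/4):
    #   burn-in term   1/(alpha*sqrt(b)*K)  ~ T**(-phi/2 - 1/2)   (using K = T/b)
    #   residual smoothness term  L*eta     ~ T**(phi - 3/4)
    # Balancing the two: phi - 3/4 = -phi/2 - 1/2, i.e. (3/2)*phi = 1/4.
    phi = (F(3, 4) - F(1, 2)) / (F(1) + F(1, 2))
    eta_exp = phi - F(3, 4)
    alpha_exp = phi - F(1, 2)
    return phi, eta_exp, alpha_exp


print(joint_tuning_exponents())
# (Fraction(1, 6), Fraction(-7, 12), Fraction(-1, 3))
```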

The analysis reveals a flat suboptimality landscape in batch size over broad regimes, provided momentum and learning rate are retuned for each batch size. Only the lower-order terms (burn-in and smoothness) select a specific batch growth law, so many alternative schedules are near-optimal. The paper formalizes this as a family of batch-size scaling options, for example $b \propto T^{\phi}$ with $\phi \le 1/2$, that attain the same $T^{-1/4}$ rate up to constant factors.

Figure 6: Performance contours over batch size and token budget $T$ , showing near-optimal batch-size schedules as a function of token count.
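The flatness in batch size can be seen from simple exponent bookkeeping. The sketch below assumes $b \propto T^{\phi}$ with momentum and learning rate retuned as $\alpha \propto b\,T^{-1/2}$ and $\eta \propto b\,T^{-3/4}$, and reports the slowest-decaying term of the proxy. The helper names are hypothetical.

```python
from fractions import Fraction as F

def term_exponents(phi):
    """Exponent of T for each group of proxy terms when b ~ T**phi.

    Only meaningful for 0 <= phi <= 1/2; beyond that alpha ~ T**(phi - 1/2)
    would grow past 1 and the retuning rules no longer apply.
    """
    phi = F(phi)
    return {
        "leading": F(-1, 4),                    # optimization + noise floor after retuning
        "burn_in": -phi / 2 - F(1, 2),          # sqrt(b) / (alpha * T)
        "smoothness_residual": phi - F(3, 4),   # (7/2) * L * eta
    }

def overall_rate_exponent(phi):
    """The bound decays like T raised to the largest of the term exponents."""
    return max(term_exponents(phi).values())


for phi in (F(0), F(1, 6), F(1, 4), F(1, 2)):
    print(phi, overall_rate_exponent(phi))
```

Every growth exponent in the loop gives the same overall rate; the choice of $\phi$ only changes how quickly the lower-order terms decay.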

Figure 8: Iso-performance contours as a function of batch size and iteration count, highlighting the hyperbolic tradeoff between batch and step count.

Empirical and Qualitative Verification

The theoretical predictions are corroborated by both numerical proxy minimization and qualitative experiments on transformer training with actual optimizers (PlainLM implementation), which reveal that optimal learning rate increases with batch size but decreases with longer horizons, consistent with the derived laws.

Figure 9: Experimental results on LLM pretraining, confirming the predicted relationships among batch size, learning rate, and token budget.

Comparison to SGD and Observed Deviations

A significant theoretical divergence is established between normalized-momentum methods and vanilla SGD. For SGD, classical proxy analysis under token-limited budgets yields trivial batch-size dependence after learning rate tuning, with all non-trivial batch effects eliminated. In contrast, the LMO framework for Adam-like optimizers encodes a genuine, horizon-dependent optimal batch size, reconciling with empirical observations in LLM scaling.

The paper also treats a variety of empirical anomalies and reports from recent large-scale studies. Instances where learning rate appears to increase with token budget (with batch size scaling) are shown to arise naturally from protocol path-conditioning and hyperparameter constraints, rather than from deviations in the theoretical schedule. Furthermore, the impact of heavy-tailed noise and the choice of norm in defining variance is discussed, emphasizing circumstances under which the predicted exponents and scaling behaviors might differ from idealized settings.

Budget Transfer and Practical Implications

The framework yields direct "budget transfer" laws: transfer of tuned hyperparameters from a small (pilot) run to a larger training horizon should follow precise power laws as derived from the bound minimizations (e.g., for fixed batch size and momentum, learning rate should scale as $\eta_1 = \eta_0 (T_0/T_1)^{1/2}$ when moving from $T_0$ to $T_1$ tokens). These schedules depend critically on which parameters are constrained (hardware caps or protocol constraints), and the analysis characterizes when tuning momentum or batch size becomes necessary to avoid performance plateaus due to non-vanishing noise floors.
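The fixed-momentum transfer rule, combined with the square-root batch rule, can be sketched as a small helper. The function name and the example values are illustrative assumptions; only the exponents come from the rules stated in this summary.

```python
def transfer_lr(eta0, tokens0, tokens1, batch0=1.0, batch1=1.0):
    """Rescale a tuned learning rate using eta ~ b**0.5 * T**-0.5 (momentum held fixed)."""
    if min(eta0, tokens0, tokens1, batch0, batch1) <= 0:
        raise ValueError("all arguments must be positive")
    return eta0 * (batch1 / batch0) ** 0.5 * (tokens0 / tokens1) ** 0.5


# Pilot run tuned at 1e9 tokens; the target run uses 4e9 tokens at the same batch size.
print(transfer_lr(3e-4, 1e9, 4e9))  # the learning rate is halved
```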

Limitations and Directions for Future Research

Key caveats are explicitly addressed: all results assume constant learning rates, fixed model size, and implicit statistical generalization via the population objective. Theoretical predictions are contingent on finite-variance noise and idealized initialization and may not capture the entirety of empirical scaling observed in large-scale LLM protocols—especially when generalization or protocol idiosyncrasies dominate. Notably, model size scaling, initialization mismatch, weight decay, learning rate annealing, and warmup are not included in the baseline analysis. The question of reconciling remaining discrepancies between empirical and theoretical scaling, particularly as both batch size and token budget increase, is highlighted as an open frontier.

Conclusion

The study presents a systematic and optimization-theoretic derivation of hyperparameter scaling laws for first-order optimizers in deep learning under fixed model architectures. The analysis clarifies when batch size, learning rate, and momentum should be tuned and how these parameters interact with compute budgets, iteration counts, and noise regimes. By recovering and explaining many observed empirical rules via explicit proxy minimization, the work supplies a principled framework for hyperparameter transfer and sheds light on outstanding mismatches between theory and practice. Extensions to account for model scaling, regularization, scheduling, and heavy-tailed noise remain crucial for further advances in scaling law theory and its integration into LLM training protocols.

Explain it Like I'm 14

Deriving Hyperparameter Scaling Laws via Modern Optimization Theory — Explained Simply

What is this paper about?

This paper looks for simple, reliable rules to choose three important training settings (called hyperparameters) when training neural networks:

learning rate (how big a step the model takes while learning),
batch size (how many examples it looks at at once),
momentum (how much it keeps “remembering” past directions).

The authors focus on modern optimizers that behave like Adam or Muon (they use directions that are normalized or sign-based), and they ask: as we change the total amount of training (the “token budget”), how should these hyperparameters change to keep training efficient?

What questions are the authors trying to answer?

In friendly terms:

If we train longer or on more data, how should we adjust the learning rate, batch size, and momentum?
Can we write down simple “power laws” (like “scale by the square root”) that tell us what to do?
Do these rules match what people have seen in practice with big LLMs?

How did they approach the problem?

Think of training as driving a car toward a destination:

learning rate is how hard you press the gas pedal,
batch size is how many clues you use at once to decide your direction,
momentum is like cruise control that smooths sudden changes.

Instead of testing every possible setting, the authors use a math tool called a “performance bound.” You can think of it like a safety ceiling: it guarantees training won’t be worse than a certain score. They then:

Treat this bound as a “proxy score” for how well training will go.
Minimize this score with respect to learning rate, batch size, and momentum under different conditions (fixed steps, fixed batch size, fixed momentum, or a fixed total token budget).
From this, they derive simple scaling rules—how each hyperparameter should grow or shrink as the training budget changes.

Technically, they study a family of optimizers called LMO-based methods (including normalized SGD, signSGD, and Muon). These methods adjust direction using normalized or sign-based updates, which are a good proxy for how Adam-like optimizers behave. They use recent theoretical results that bound how fast such methods converge, then optimize those bounds to get the scaling laws.

What did they find, and why does it matter?

Here are the core takeaways, translated into practical, easy-to-remember rules. “Token budget” ( $T$ ) means the total number of training examples processed (tokens = batch size × steps).

Square-root rule for batch size and learning rate (with fixed momentum)
- If you multiply your batch size by $\kappa$ , the best learning rate should multiply by about $\sqrt{\kappa}$ .
- If you multiply your token budget by $\kappa$ but keep batch size fixed, the best learning rate should divide by about $\sqrt{\kappa}$ .
- Why this matters: It matches what many practitioners already do and gives a clean theory behind it.
There is a non-trivial “best” batch size when tokens are fixed (for Adam-like/LMO methods)
- For a fixed total token budget, there is an optimal batch size greater than 1 once training is long enough. Choosing too small or too large a batch size can be suboptimal.
- This is different from vanilla SGD, where the batch size doesn’t have a unique “best” for a fixed token budget (once you tune the learning rate).
If your batch size can’t grow (hardware limit), tune momentum with training length
- With a fixed batch size, if you don’t adjust momentum, you can get “stuck” with a noise floor (progress stops improving).
- The fix: as you train longer, gradually increase the momentum (i.e., make it closer to 1). This removes the floor and restores good progress.
Jointly tuning everything gives one “compute-optimal” recipe—but it’s not the only good one
- When tuning learning rate, batch size, and momentum together for large token budgets, the math suggests:
  - batch size grows slowly like $T^{1/6}$ ,
  - learning rate shrinks like $T^{-7/12}$ ,
  - momentum approaches 1 like $1 - T^{-1/3}$ ,
  - overall training “difficulty” drops like $T^{-1/4}$ .
- But here’s the punchline: many different batch-size growth patterns (as long as they don’t grow too fast) are nearly as good if you retune learning rate and momentum appropriately. So there isn’t just one “magic” schedule.
Practical understanding of momentum and batch size
- As batch size increases (for the same total tokens), the optimal momentum should slightly decrease.
- As training gets longer at a fixed batch size, the optimal momentum should increase (get closer to 1).
- This gives a principled way to adjust momentum—something many training recipes historically leave fixed.
Limits and scope
- These results hold with fixed model size, constant learning rate (no schedule), and certain standard assumptions about noise in gradients.
- The paper focuses on optimization (how fast you reach a good solution), not directly on generalization (how well you perform on new data).

Simple rules of thumb you can remember

Use these with Adam-like or normalized/sign-based optimizers:

If you double batch size and keep tokens fixed: multiply learning rate by about √2.
If you double tokens and keep batch size fixed: divide learning rate by about √2.
If your batch size can’t increase: push momentum closer to 1 as you train longer.
If you tune all three, many scaling strategies are almost equally good—don’t feel locked into a single formula.
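To make the momentum rule concrete, here is a small sketch. It assumes the capped-batch rule from the paper's analysis, $1-\beta \propto b\,T^{-1/2}$. The function name and the starting value of 0.9 are just examples.

```python
def retune_beta(beta0, tokens0, tokens1, batch0=1.0, batch1=1.0):
    """Rescale momentum beta using (1 - beta) ~ b * T**-0.5."""
    alpha0 = 1.0 - beta0
    alpha1 = alpha0 * (batch1 / batch0) * (tokens0 / tokens1) ** 0.5
    alpha1 = min(alpha1, 1.0)  # keep beta from going negative
    return 1.0 - alpha1


# Training 4x longer at the same batch size halves 1 - beta: 0.9 becomes about 0.95.
print(retune_beta(0.9, 1e9, 4e9))
```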

Why is this important?

Training large models is extremely expensive. Having clear, theory-backed rules for how to adjust hyperparameters when you change batch size or how long you train can:

reduce trial-and-error,
improve training stability and speed,
help “transfer” good settings from small runs to bigger ones,
save time and compute.

What’s the bigger picture?

This work builds a bridge between practical training tricks and modern optimization theory. It explains why widely used rules like the square-root learning-rate scaling make sense, shows where they do and don’t apply, and offers new guidance—especially about momentum—when batch size is limited. It also opens the door to future work that adds common training features like learning-rate schedules, warmup, weight decay, and changing model size.

In short: the paper turns folk wisdom about hyperparameters into clear, mathematical guidance that can make big-model training more predictable and efficient.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a single consolidated list of unresolved issues that are missing, uncertain, or left unexplored in the paper; each item is phrased to be concrete and actionable for future research.

Bound tightness and constants: Quantify how tight the LMO-based nonconvex bound is for modern LLM training regimes; estimate $C_1, C_2, C_3$ , $L$ , $\sigma$ , and $\rho$ from real runs and assess whether the predicted $T^{-1/4}$ rate and exponents hold beyond asymptotics.
Pre-asymptotic (“burn‑in”) regime: Characterize the finite‑ $K$ , finite‑ $T$ regime where the burn‑in term is not negligible; derive thresholds for the “critical token budget” where $b^\star>1$ becomes optimal as a function of $\alpha$ , $C_i$ , and validate empirically.
Noise model validity: Test the assumption $E\|g_b-\nabla f\|_2^2\propto \sigma^2/b$ under:
- correlated minibatches and repeated data passes,
- heavy‑tailed gradient noise,
- gradient clipping and mixed precision,
- anisotropic or norm‑mismatched noise relevant to LMO norms.
- Derive revised scaling laws when variance deviates from $b^{-1}$ .
Time‑varying curvature and noise: Incorporate nonstationary $L(t)$ and $\sigma(t)$ along training; derive phase‑dependent scalings and stagewise schedules when $L,\sigma$ evolve significantly.
Mapping to Adam and other adaptive methods: Replace the signSGD proxy with a theory that explicitly models Adam’s second moment ( $\beta_2$ ), $\epsilon$ , and decoupled weight decay; derive batch/token/momentum scaling for $(\beta_1,\beta_2)$ and compare to the $\beta_2^\kappa$ heuristic.
Norm choice and architecture dependence: Analyze how the choice of norm (e.g., $\ell_\infty$ vs spectral) and the norm‑equivalence constant $\rho$ depend on layer shapes, parameterization, and architecture; quantify how this shifts constants and possibly exponents.
Inexact LMO implementations: Model the gap between idealized LMO updates and practical approximations in Muon/Scion (e.g., approximate spectral norms, per‑layer constraints); determine how inexactness alters optimal scalings.
Learning‑rate schedules and warmup: Extend the analysis beyond constant $\eta$ to scheduled learning rates (cosine/step/linear warmup), and derive how schedule parameters should scale with $T$ , $b$ , and $\alpha$ .
Weight decay and regularization: Incorporate decoupled weight decay (and other regularizers) into the bound and re‑derive scaling laws for $(\eta,b,\alpha)$ ; quantify interactions with generalization.
Generalization and finite‑sample effects: Move beyond the population‑risk oracle to account for finite datasets, multiple epochs, and test loss; determine how optimization‑optimal schedules trade off with generalization‑optimal schedules.
Reconciling LR–token trends: Provide a formal, path‑dependent analysis explaining when $\eta^\star(T)$ can empirically increase with $T$ under joint $b(T)$ growth, and design experiments that isolate path effects from true compute‑optimal behavior.
Joint scaling with model size: The current analysis fixes model size $N$ ; derive how $(\eta^\star,b^\star,\alpha^\star)$ scale jointly with $N$ under $\mu$ P and alternative parameterizations, including how $L(N)$ and $\sigma(N)$ evolve.
Compute/wall‑clock optimality: Incorporate systems constraints (per‑step latency, communication, memory, pipeline bubbles) to translate token‑optimal schedules into wall‑clock/energy‑optimal schedules; re‑derive batch versus step tradeoffs.
SGD versus LMO in practice: Provide controlled empirical comparisons that verify the paper’s claim that SGD lacks a nontrivial token‑optimal $b$ after tuning $\eta$ ; delineate regimes where LMO advantages are practically significant.
Loss connection and landscape assumptions: Replace star‑convex assumptions with weaker or empirically validated conditions connecting $\|\nabla f\|_\star$ reductions to loss/perplexity improvements; calibrate the gradient‑norm proxy against loss curves.
Robustness to gradient clipping and safety constraints: Determine how clipping (ubiquitous in LLM training) modifies the bound terms and optimal exponents; derive clipping‑aware scaling rules.
Layerwise/parameterwise schedules: Investigate whether layerwise norms and per‑layer hyperparameters (common in Muon/AdamW) imply different optimal exponents across layers; propose practical layerwise scaling recipes.
Hardware‑capped batch sizes: Provide explicit momentum and LR schedules that recover $T^{-1/4}$ when $b\le b_{\max}$ , quantify constants, and test stability (e.g., how fast $\alpha$ must decrease and numerical issues when $\alpha\to 0$ ).
Stability and numerical constraints: Among the many near‑optimal batch growth laws (e.g., $b\propto T^{\phi}$ for $\phi\le 1/2$ ), establish constraints that prevent unstable $\eta$ or $\alpha$ (e.g., overly small $\alpha$ or large $\eta$ ), and propose selection criteria beyond asymptotic optimality.
Sensitivity to initialization and pretraining protocols: Analyze how different initializations, pretraining curricula, and data mixtures affect $C_1=\Delta_0$ , $\sigma$ , and the crossover to the variance‑dominated regime; derive initialization‑aware scaling.
Calibration on real LLMs: Move beyond small‑scale demonstrations to multi‑billion‑parameter LLMs; fit $C_i$ , validate exponents for Muon, sign‑based Adam approximations, and AdamW, and release reproducible benchmarks.
Formalizing near‑optimality classes: Prove a characterization of “equivalence classes” of batch/momentum/LR schedules that achieve $T^{-1/4}$ up to constant factors, and identify practically preferable members based on throughput, memory, and generalization.
Data non‑IID and sequence effects: Model temporal correlations in token streams, document‑level batching, and dataset non‑IIDness; re‑derive variance scaling and resulting hyperparameter exponents under realistic data pipelines.
Extension to mixed‑objective training: Study how auxiliary losses (e.g., RLHF, contrastive objectives, auxiliary pretraining tasks) alter effective $L$ , $\sigma$ , and the optimal scaling laws when objectives are interleaved.
Linking critical batch size to theory: Connect empirical notions of critical batch size to the bound terms (optimization vs variance vs trust‑region terms) and provide predictive formulas for the critical/efficient batch size as training progresses.
Explicit constants and crossover analysis: Provide closed‑form expressions for constants in $\eta_T^\star$ , $b_T^\star$ , $\alpha_T^\star$ and compute crossover points where different terms dominate; turn asymptotic rules into deployable knobs for practitioners.
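Checking predicted exponents against measurements usually comes down to a log-log fit. The sketch below fits $y \approx c\,x^{p}$ by ordinary least squares on logarithms. The data in the usage example are synthetic, generated from an exact $T^{-1/2}$ law purely to exercise the code, and are not results from the paper.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of log(y) = log(c) + p*log(x); returns (p, c)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two paired points")
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    sxx = sum((u - mx) ** 2 for u in lx)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    sxy = sum((u - mx) * (v - my) for u, v in zip(lx, ly))
    p = sxy / sxx                 # fitted exponent
    c = math.exp(my - p * mx)     # fitted prefactor
    return p, c


# Synthetic sweep: best learning rate at each token budget, generated as 3 * T**-0.5.
budgets = [1e8, 4e8, 1.6e9, 6.4e9]
best_lrs = [3.0 * t ** -0.5 for t in budgets]
p, c = fit_power_law(budgets, best_lrs)
print(round(p, 6), round(c, 6))
```

With real sweeps, the fitted `p` would be compared against the predicted exponent, while `c` absorbs the unknown constants.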

Practical Applications

Practical Applications of “Deriving Hyperparameter Scaling Laws via Modern Optimization Theory”

This paper offers closed-form, theoretically grounded scaling rules for learning rate, momentum, and batch size as functions of training iterations or token budget, within a Linear Minimization Oracle (LMO) optimizer family that includes normalized SGD, signSGD (approximating Adam), and Muon. Below, we distill real-world applications organized by deployment horizon, with sector links, potential tools/workflows, and feasibility assumptions.

Immediate Applications

These can be deployed now using existing optimizers (AdamW, signSGD variants, Muon/Scion) and common training stacks (PyTorch, JAX, DeepSpeed, Megatron, FSDP/ZeRO).

Compute-aware training recipes for fixed model size
- Use case: Given a token budget $T$ and batch size $b$ , set learning rate and momentum using the paper’s closed-form rules to reach near-optimal optimization efficiency (expected gradient norm) without extensive grid search.
- Action:
  - Fixed momentum (typical default β): set learning rate $\eta \propto b^{1/2}\,T^{-1/2}$.
  - Hardware-limited batch size: remove the noise floor by tuning momentum with $\alpha \equiv 1-\beta \propto b\,T^{-1/2}$ and $\eta \propto b^{1/2}\alpha^{1/2} T^{-1/2}$ (substituting this $\alpha$ gives $\eta \propto b\,T^{-3/4}$).
  - Joint tuning (if permitted): the asymptotically optimal schedule is $b \propto T^{1/6}$, $\alpha \propto T^{-1/3}$, $\eta \propto T^{-7/12}$; however, many other batch growth laws that grow no faster than $T^{1/2}$ are near-optimal when momentum and learning rate are retuned accordingly.
- Sectors/workflows:
  - Software/AI infrastructure: pretraining and continued pretraining of LLMs and vision models.
  - Cloud/enterprise MLOps: cost-optimized training pipelines under fixed budgets.
- Assumptions/dependencies: constant learning rate (no decay), fixed model size, unbiased gradient oracle with variance $\propto 1/b$ , large-horizon regime for some results, LMO-like update behavior (e.g., Adam’s sign dynamics). Generalization effects are not modeled.
Hyperparameter transfer across token budgets
- Use case: Transfer a known-good configuration to a new training duration without re-running large sweeps.
- Action:
  - If $b$ changes by factor $\kappa$, scale $\eta \leftarrow \eta\,\kappa^{1/2}$ at fixed $T$.
  - If $T$ changes by factor $\kappa$ (at fixed $b$ and momentum), scale $\eta \leftarrow \eta\,\kappa^{-1/2}$.
  - If $b$ is capped, increase momentum (β→1) over longer horizons using $\alpha \propto b\,T^{-1/2}$.
- Sectors/workflows:
  - Academia/benchmarks: reproducible scaling across different run lengths.
  - Industry: extending training runs when budgets allow.
- Assumptions/dependencies: same as above; transfer presumes similar data, initialization, and optimizer family.
Hardware-constrained training optimization
- Use case: Maintain efficiency when GPU memory caps batch size (edge devices, small clusters, robotics platforms).
- Action: At fixed $b=b_{\max}$ , tune momentum with $\,\alpha \propto b_{\max}\,T^{-1/2}\,$ and learning rate with $\eta \propto b_{\max}\,T^{-3/4}$ to avoid a variance-driven performance floor.
- Sectors/workflows:
  - Robotics/edge computing: on-device or near-device fine-tuning.
  - Healthcare (privacy-preserving/federated): client-specific $b$ limits.
- Assumptions/dependencies: stochastic gradients whose variance decreases as $1/b$; LMO-style optimizer behavior; stable momentum implementation on device.
AutoML/HPO acceleration via principled priors
- Use case: Cut hyperparameter search time/cost by seeding searches with scaling-law-derived priors and/or constraining search manifolds.
- Action: Plug $\eta$ – $b$ – $\alpha$ power laws into Bayesian optimization priors or population-based training schedules; prioritize tuning constants over exponents.
- Sectors/workflows:
  - AutoML platforms; internal HPO services in AI labs.
- Assumptions/dependencies: optimizer resembles normalized/LMO family (e.g., Adam’s sign dynamics); constants ( $\rho,\sigma, L,\Delta_0$ ) unknown but can be absorbed into tuned prefactors.
Training orchestration and capacity planning
- Use case: Translate compute or token budgets into planned batch/step/momentum schedules to meet target optimization performance.
- Action: Use $T=bK$ and the hyperbolic step–batch trade-off to plan minimal steps $K$ and minimal batch $b$ to hit a fixed target risk; adopt momentum scaling when batch growth is infeasible.
- Sectors/workflows:
  - Cloud providers; enterprise MLOps; finance/FP&A for AI spend planning.
- Assumptions/dependencies: performance proxy is the expected gradient norm bound; ignores generalization and pipeline overheads.
Defaults and plugins for training stacks
- Use case: Provide better “out-of-the-box” defaults that adapt to user budgets.
- Action:
  - Implement a “scaling-law scheduler” module for PyTorch, JAX, DeepSpeed, FSDP/ZeRO, Megatron that sets $(\eta, \beta, b)$ from $T$ and hardware limits.
  - Provide adapters for AdamW (using signSGD-with-momentum approximation) and Muon/Scion.
- Sectors/workflows:
  - Open-source frameworks; internal tooling in labs.
- Assumptions/dependencies: clean API access to momentum parameters (e.g., β1/β2 in AdamW); reliable token counters and batch controllers.
Federated and multi-client training robustness
- Use case: Normalize client-side hyperparameters across heterogeneous batch sizes and training durations.
- Action: Scale each client's $\eta$ as $b^{1/2}\,T^{-1/2}$ in its own batch size and training duration; if clients are small-batch-limited, increase momentum (β per client) as per $\alpha \propto b\,T^{-1/2}$.
- Sectors/workflows:
  - Healthcare, finance, mobile/edge federated learning.
- Assumptions/dependencies: client gradient noise obeys similar variance scaling; coordination server can distribute per-client hyperparameters.
Fine-tuning and RLHF/online updates
- Use case: Short- to mid-horizon fine-tunes where $T$ is known (or capped) and $b$ is small.
- Action: Start from $\eta \propto b^{1/2} T^{-1/2}$ ; if convergence stalls due to variance, raise β towards 1 with $\alpha \propto b\,T^{-1/2}$ .
- Sectors/workflows:
  - Software product teams fine-tuning LLMs; content moderation, RLHF stages.
- Assumptions/dependencies: stationary objective approximation over the short horizon; stability under higher momentum.
Reporting and reproducibility standards in academic experiments
- Use case: Present results that transfer across compute regimes.
- Action: Report $\eta$ and β scaled to $T$ and $b$ using the paper’s rules; include batch–step trade-off contours to clarify regime.
- Sectors/workflows:
  - Academia; open benchmarks (e.g., Pythia-like suites).
- Assumptions/dependencies: consistent token accounting; same optimizer family across comparisons.
Sustainability and cost-reduction guidelines
- Use case: Reduce overtraining inefficiencies that inflate energy use and cloud spend.
- Action: Pre-flight calculators that predict performance vs. $(T,b,\beta)$ using the proxy; stop wasting budget in suboptimal batch/step regions; tune momentum when batch increases are capped.
- Sectors/workflows:
  - Policy/ESG reporting for AI; enterprise sustainability.
- Assumptions/dependencies: proxy correlates with final task performance in small-epoch training; ignores generalization gap.
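A budget-aware scheduler of the kind described in the list above could start from something like the following. It is a sketch under stated assumptions: the batch size grows as $T^{\phi}$ (default $\phi = 1/6$) until it hits a hardware cap, and momentum and learning rate are then retuned with $\alpha \propto b\,T^{-1/2}$ and $\eta \propto b\,T^{-3/4}$. The `RunConfig` class, the `scale_config` function, and all numeric values are hypothetical and do not correspond to any existing library API.

```python
from dataclasses import dataclass

@dataclass
class RunConfig:
    tokens: float   # token budget T
    batch: float    # batch size b (round to a hardware-friendly integer in practice)
    lr: float       # learning rate eta
    beta: float     # momentum coefficient; alpha = 1 - beta

    @property
    def steps(self):
        return self.tokens / self.batch   # K = T / b


def scale_config(ref, tokens, batch_cap, batch_exp=1 / 6):
    """Scale a tuned reference config to a new token budget.

    Assumes 0 <= batch_exp <= 1/2 and the asymptotic relations named in the lead-in.
    """
    t_ratio = tokens / ref.tokens
    batch = min(ref.batch * t_ratio ** batch_exp, batch_cap)
    b_ratio = batch / ref.batch
    alpha = min((1.0 - ref.beta) * b_ratio * t_ratio ** -0.5, 1.0)
    lr = ref.lr * b_ratio * t_ratio ** -0.75
    return RunConfig(tokens=tokens, batch=batch, lr=lr, beta=1.0 - alpha)


pilot = RunConfig(tokens=1e9, batch=256, lr=3e-4, beta=0.9)
print(scale_config(pilot, tokens=64e9, batch_cap=4096))  # batch allowed to grow
print(scale_config(pilot, tokens=64e9, batch_cap=256))   # batch pinned at the cap
```

With `batch_exp=1/6` and no binding cap, the resulting schedule reproduces $\eta \propto T^{-7/12}$ and $\alpha \propto T^{-1/3}$; when the cap binds, it reduces to the fixed-batch rules.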

Long-Term Applications

These require further empirical validation at extreme scales, integration with additional training features (weight decay, warmup, LR decay), or extension beyond fixed model size.

Unified scaling across model size, token budget, and batch (integration with μP)
- Vision: Combine these token/batch/momentum laws with model-size transfer (μP) to produce end-to-end recipes for scaling models and training durations simultaneously.
- Sectors/workflows: frontier LLM/Multimodal labs; foundation model pretraining.
- Dependencies: theory for joint $(N,T,b,\alpha,\eta)$ scaling; interaction with parameterization, initialization, and curvature.
Online, feedback-driven controllers for $(\eta,\beta,b)$
- Vision: Closed-loop schedulers that estimate noise/curvature on the fly and adjust momentum and batch size to stay near the $T^{-1/4}$ frontier.
- Sectors/workflows: autonomous training systems; automated lab environments.
- Dependencies: robust online variance/curvature estimators; stability under dynamic $b$ and β changes; integration with pipeline latencies.
Generalization-aware scaling laws
- Vision: Extend the optimization proxy to include generalization behavior (e.g., flatness-sensitive terms, heavy-tailed noise, non- $b^{-1/2}$ scaling) to better match observed learning-rate trends when both $b$ and $T$ grow.
- Sectors/workflows: safety-critical domains (healthcare diagnostics, finance risk models); curriculum learning design.
- Dependencies: new bounds capturing data- and model-dependent generalization; empirical calibration.
Co-design with warmup, weight decay, and LR decay
- Vision: Incorporate common training heuristics into the theoretical framework to produce complete schedules (warmup length ∝ budget, decay rates tied to $b$ and α).
- Sectors/workflows: standardized training stacks for industry benchmarks.
- Dependencies: revised convergence bounds that include scheduling and regularization effects.
Hardware–optimizer co-optimization
- Vision: Schedule batch growth subject to memory/throughput curves while adjusting momentum as per theory to meet target accuracy under latency/energy constraints.
- Sectors/workflows: cloud providers, accelerator vendors; energy/edge systems.
- Dependencies: accurate systems models (throughput vs. batch), coordination with gradient accumulation and mixed precision.
Cross-paradigm extensions (RL, sequence modeling beyond LMOs)
- Vision: Adapt proxy-based scaling to RL, off-policy updates, and non-LMO optimizers (e.g., Shampoo), bridging theory to broader training regimes.
- Sectors/workflows: robotics, autonomous driving, recommendation systems.
- Dependencies: bounds in nonstationary or correlated-sample settings; alternative norm geometries.
Standards and policy for compute-efficient training
- Vision: Industry norms that encourage budget-aware hyperparameter scaling, reported alongside emissions estimates and compute footprints.
- Sectors/workflows: regulators, standards bodies, cloud consumption reporting.
- Dependencies: consensus on proxies/metrics; alignment with carbon accounting methodologies.
Educational and diagnostic tools
- Vision: Interactive simulators that forecast performance versus $(T,b,\eta,\beta)$ , helping students/practitioners internalize scaling behavior; run “what-if” budget scenarios.
- Sectors/workflows: education platforms; internal training at AI organizations.
- Dependencies: user-friendly implementations of the proxy with documented caveats.
Optimizer design leveraging LMO insights
- Vision: New Adam-like methods that explicitly manage normalization and momentum–batch coupling to preserve near-optimal $T^{-1/4}$ scaling under practical constraints.
- Sectors/workflows: optimizer libraries; high-throughput training products.
- Dependencies: theory–practice bridging for non-Euclidean norms; stability with mixed-precision and sharded training.

Cross-cutting assumptions and caveats to feasibility

The core proxy optimizes expected gradient norm; it does not directly account for generalization. Empirically, small-epoch improvements often correlate with downstream performance, but this can break in some regimes.
Results assume: constant learning rates (no decay), fixed model size, unbiased stochastic gradients with variance scaling $\sigma^2/b$, Lipschitz-smooth objectives, and large-horizon approximations for some derivations ($\alpha^{3/2}K \gg 1$).
For AdamW, the signSGD-with-momentum approximation underlies the mapping; care is needed with $\beta_1$/$\beta_2$ choices. The momentum scaling insight suggests decreasing $\beta$ as batch increases and increasing $\beta$ as $T$ grows at fixed small $b$.
Deviations such as heavy-tailed gradients, non-$1/\sqrt{b}$ noise, or norm mismatch can shift exponents; the paper outlines these as open areas for refinement.
Constants ($\rho,\sigma,L,\Delta_0$) affect prefactors; in practice, treat power-law exponents as guides and fit constants on small pilots.
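Fitting constants on small pilots, as suggested in the last caveat, can be done with an ordinary least-squares fit in log-log space. The following is a minimal sketch using only the standard library; the function names are hypothetical and the pilot data in the usage example is synthetic, generated from an exact power law purely for illustration.

```python
import math


def fit_power_law(xs, ys):
    """Fit y ≈ c * x**p by least squares on (log x, log y); return (c, p)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    sxx = sum((u - mx) ** 2 for u in lx)
    sxy = sum((u - mx) * (v - my) for u, v in zip(lx, ly))
    p = sxy / sxx
    c = math.exp(my - p * mx)
    return c, p


def fit_prefactor(xs, ys, exponent):
    """Keep the exponent fixed at its theoretical value and fit only c.

    In log space this is the mean residual, i.e. the geometric mean of
    y / x**exponent.
    """
    logs = [math.log(y) - exponent * math.log(x) for x, y in zip(xs, ys)]
    return math.exp(sum(logs) / len(logs))


if __name__ == "__main__":
    # Synthetic pilot sweep: best learning rate found at three token budgets.
    budgets = [1e8, 1e9, 1e10]
    best_eta = [2.0 * T ** -0.5 for T in budgets]
    print(fit_power_law(budgets, best_eta))        # free exponent
    print(fit_prefactor(budgets, best_eta, -0.5))  # exponent fixed by theory
```

Comparing the freely fitted exponent with the theoretical one is a quick check on whether a given setup sits in the regime the bounds describe before the fitted prefactor is extrapolated.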


Glossary

Adam: A popular adaptive first-order optimizer that uses estimates of first and second moments of gradients to scale updates. "signSGD (approximating Adam)"
Adaptive methods: Optimizers that adjust learning rates based on gradient statistics across parameters during training. "for adaptive methods, at a fixed token budget, scaling $b \mapsto \kappa b$ requires $\eta \mapsto \kappa^{1/2} \eta$ "
Bias-dominated regime: A training phase where optimization error (bias) dominates over stochastic noise (variance), often early in training. "in the bias-dominated, early-training regime, the critical batch size is $b=1$ ;"
Convergence bounds: Theoretical upper bounds on optimization error or gradient norms that describe how fast an algorithm approaches optimality. "through the lens of recent convergence bounds for methods based on the Linear Minimization Oracle (LMO)"
Continuous-time approximations: Analyses that model discrete optimization algorithms with continuous-time differential equations. "continuous-time approximations."
Critical batch size: The batch size beyond which further increases yield diminishing returns in step-efficiency. "Several works instead discuss the notion of critical batch size"
Dual norm: For a given norm, the associated norm on the dual space used to measure gradients or forces; satisfies Hölder's inequality. "Let $\|\cdot\|$ be any norm, with dual norm $\|\cdot\|_\star$ "
Euclidean geometry: Optimization under the standard 2-norm geometry, as opposed to non-Euclidean norms. "extend beyond convex settings and Euclidean geometry"
FLOPs: Floating point operations, a measure of computational cost. "training budgets of $5 \times 10^{26}$ FLOPs"
Gradient noise variance: The variance of the stochastic gradient estimator, often decreasing with larger batch sizes. "the gradient noise variance $E \|g_b - \nabla f\|^2_2$ is upper bounded by a constant $\sigma^2/b$ "
Heavy tails: Distributions with heavier-than-Gaussian tails, which can affect noise behavior and optimization dynamics. "non- $b^{-1/2}$ noise scaling, heavy tails, or norm-mismatched variance for LMO methods"
Hyperbolic relation: An inverse relationship (hyperbola-like) between two variables, here step count and batch size, for meeting a target performance. "the classical hyperbolic relations between batch size and step count"
Hyperparameter scaling laws: Predictive rules describing how optimal hyperparameters change with model, data, or compute scale. "We study hyperparameter scaling laws for modern first-order optimizers"
Hyperparameter transfer: Techniques for mapping tuned hyperparameters from one training regime (e.g., size) to another. "Hyperparameter transfer has become an important component of modern large-scale training recipes."
Large-horizon regime: The setting of many optimization steps where asymptotic approximations become valid. "In the large-horizon regime $\alpha^{3/2}K\gg 1$ "
Learning rate transfer: Transferring a learning-rate choice across different model sizes or training regimes. "enabling learning rate transfer across model sizes."
Linear Minimization Oracle (LMO): A framework that updates in the direction minimizing a linearization under a norm constraint; includes normalized and sign-based updates. "methods based on the Linear Minimization Oracle (LMO), a framework that includes normalized SGD, signSGD (approximating Adam), and Muon."
Lipschitz gradients: A smoothness condition where gradients do not change too rapidly, bounded by a Lipschitz constant. "the loss $f$ has $L$ -Lipschitz gradients, with respect to the general $\|\cdot\|$ norm"
Momentum burn-in: The initial phase where moving averages (momentum) have not yet stabilized, introducing transient error terms. "momentum “burn-in” / averaging term $2\rho\sigma/(\alpha \sqrt{b} K)$;"
Muon: An optimizer using orthogonalized (spectral-norm-based) updates to improve training stability. "spectral norm: Muon (orthogonalized update)"
Norm equivalence constant: A factor bounding one norm by another in finite-dimensional spaces, used to relate analysis across norms. "where $\rho \geq 1$ is a norm equivalence constant"
Norm-based optimizers: Methods whose update directions or constraints are defined by norms other than the Euclidean norm. "captures a family of norm-based optimizers directly relevant to modern practice"
Normalized SGD: A variant of SGD that normalizes the update direction by its norm, often coupled with momentum. "Euclidean ($\|\cdot\|=\|\cdot\|_2$): normalized SGD with momentum"
Normalized steepest descent: Updates taken along the steepest descent direction normalized in a chosen norm, as instantiated by LMO. "normalized steepest descent (LMO) methods, not vanilla SGD."
Noise floor: A lower bound on achievable error due to stochastic gradient noise that persists even as iterations increase. "a noise floor term $2\rho\sigma\sqrt{\frac{\alpha}{b}}$"
Operator-norm updates: Updates constrained or scaled by an operator norm (e.g., spectral norm) at the layer or matrix level. "Muon/Scion-style operator-norm updates"
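Orthogonalized update: An update in which the gradient or momentum matrix is replaced by an (approximately) orthogonal matrix with the same singular vectors, i.e. its singular values are set to 1; this is the steepest-descent direction under the spectral norm. "Muon (orthogonalized update)"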
Power-law schedules: Hyperparameter schedules that scale as a power of training budget or steps. "yields closed-form power-law schedules for learning rate, momentum, and batch size"
Proxy objective: A tractable objective (often a bound) optimized in place of the true, harder-to-evaluate training objective. "Define the following proxy (the right-hand side of [the nonconvex LMO bound] up to constants):"
SDE (stochastic differential equation): Continuous-time models used to approximate and analyze stochastic optimization dynamics. "stochastic differential equation (SDE) approximations"
SignSGD: An optimizer that uses only the sign of gradient components (with or without momentum). "signSGD with momentum"
Speed-matching flows: SDE-based analyses aligning the “speeds” (time scales) of different dynamics to optimize convergence. "theoretical insights around speed-matching flows in the SDE literature"
Square-root learning rate scaling: The rule that optimal learning rate scales with the square root of batch size (or inversely with training horizon’s square root). "we recover the well-known square-root learning rate scaling with batch size"
Star-convex: A relaxation of convexity in which the convexity inequality is only required along segments joining an arbitrary point $x$ to a minimizer $x^\star$, i.e. $f(t x^\star + (1-t)x) \le t f(x^\star) + (1-t) f(x)$ for $t \in [0,1]$. "For the star-convex case (Kovalev, 2025), the gradient norm can be lower bounded"
Token budget: The total number of training data tokens processed, often equal to batch size times number of steps. "token budget $T \coloneqq bK$ ."
Trust-region error terms: Errors arising from approximating the objective within a region where the model is trusted to be accurate. "smoothness/trust-region error terms proportional to $\eta$ "
Unbiased stochastic gradient oracle: An assumption that the stochastic gradient estimator has expectation equal to the true gradient. "assumes access to an unbiased stochastic gradient oracle for the population objective."
Unconstrained Stochastic Conditional Gradient method: Another name for Frank–Wolfe-type methods applied without explicit constraints, related here to LMO. "Also known as Unconstrained Stochastic Conditional Gradient method"
Warmup: A training schedule that begins with a smaller learning rate and gradually increases it before holding or decaying. "incorporating weight decay, learning-rate scheduling, and warmup into our analysis."
Weight decay: A regularization technique that penalizes large weights, implemented either as an L2 penalty added to the loss or, as in AdamW, as a decoupled shrinkage of the weights at each step. "incorporating weight decay, learning-rate scheduling, and warmup into our analysis."
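The glossary entries for LMO, normalized SGD, SignSGD, and Muon all describe the same operation under different norms: pick the unit-norm direction $d$ that minimizes $\langle g, d\rangle$. A minimal NumPy sketch of the three cases is given below, assuming the standard correspondences ($\ell_2$ ball for normalized SGD, $\ell_\infty$ ball for signSGD, spectral-norm ball for Muon). The function names are hypothetical, momentum is omitted, and practical Muon implementations approximate the orthogonalization iteratively rather than with an exact SVD.

```python
import numpy as np


def lmo_euclidean(g):
    """argmin over the unit l2 ball of <g, d>: the normalized SGD direction.

    Undefined for g == 0 (division by zero); callers should guard that case.
    """
    return -g / np.linalg.norm(g)


def lmo_sign(g):
    """argmin over the unit l_inf ball of <g, d>: the signSGD direction."""
    return -np.sign(g)


def lmo_spectral(G):
    """argmin over the unit spectral-norm ball of <G, D>: -U V^T.

    U and V^T come from the reduced SVD of G, so all singular values of the
    update equal 1 (an orthogonalized update).
    """
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return -U @ Vt


if __name__ == "__main__":
    g = np.array([3.0, -4.0])
    # In each case <g, d> equals minus the dual norm of g.
    print(np.dot(g, lmo_euclidean(g)))  # minus the l2 norm of g
    print(np.dot(g, lmo_sign(g)))       # minus the l1 norm of g
```

In each case the inner product $\langle g, d\rangle$ equals $-\|g\|_\star$, the negative dual norm: $\ell_2$ for the Euclidean ball, $\ell_1$ for the $\ell_\infty$ ball, and the nuclear norm for the spectral ball.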

Authors (6)