Benchmarking Optimizers for Large Language Model Pretraining

Published 1 Sep 2025 in cs.LG | (2509.01440v1)

Abstract: The recent development of LLMs has been accompanied by an effervescence of novel ideas and methods to better optimize the loss of deep learning models. Claims from those methods are myriad: from faster convergence to removing reliance on certain hyperparameters. However, the diverse experimental protocols used to validate these claims make direct comparisons between methods challenging. This study presents a comprehensive evaluation of recent optimization techniques across standardized LLM pretraining scenarios, systematically varying model size, batch size, and training duration. Through careful tuning of each method, we provide guidance to practitioners on which optimizer is best suited for each scenario. For researchers, our work highlights promising directions for future optimization research. Finally, by releasing our code and making all experiments fully reproducible, we hope our efforts can help the development and rigorous benchmarking of future methods.

Summary

The paper presents a systematic, large-scale benchmark of 11 optimizers for LLM pretraining, identifying methods like AdEMAMix and MARS as potent alternatives to AdamW.
The paper employs rigorous hyperparameter tuning and controlled ablations on weight decay, learning rate schedules, and warmup durations to assess optimizer performance.
The paper demonstrates that insights from dense model experiments transfer to MoE architectures, offering practical guidance for tuning optimizers in different LLM setups.

Benchmarking Optimizers for LLM Pretraining: A Comprehensive Evaluation

Introduction

This work presents a systematic, large-scale empirical study of optimization algorithms for LLM pretraining, addressing the lack of standardized, controlled benchmarks in the field. The authors evaluate 11 optimizers (AdamW, ADOPT, AdEMAMix, MARS, SOAP, Muon and its D-Muon variant, Signum, Lion, Sophia, Prodigy, and SF-AdamW) across a range of model sizes (124M–720M parameters), batch sizes, and training durations, with careful hyperparameter tuning and compute accounting. The study also includes extensive ablations on critical training hyperparameters (e.g., weight decay, learning rate schedules, warmup, initialization, gradient clipping), and extends the analysis to Mixture-of-Experts (MoE) architectures. The codebase and all experimental configurations are open-sourced for reproducibility.

Experimental Design and Methodology

The benchmark is constructed around Llama-like transformer architectures with modern components (SwiGLU, RMSNorm, RoPE, weight tying), trained on a 100B-token subset of FineWeb. Four dense model sizes (124M, 210M, 583M, 720M) and a 520M MoE variant are considered. Batch sizes are varied from 16K to 2M tokens, and training durations are chosen to span both below and above the Chinchilla-optimal regime. All optimizers are tuned per model/batch/horizon, with grid searches over learning rate, betas, weight decay, warmup, gradient clipping, and scheduler parameters. Compute cost is normalized across methods.
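To make the scale of these runs concrete, the sketch below estimates a token budget and step count per model size. It assumes the common rule of thumb of roughly 20 training tokens per parameter for the Chinchilla-optimal regime and a sequence length of 512; the names `chinchilla_tokens` and `num_steps` are illustrative and the benchmark's exact horizons may differ.

```python
# Rule-of-thumb token budget and step count for a dense model.
# ASSUMPTION: ~20 training tokens per parameter as the "Chinchilla-optimal"
# budget; the benchmark's exact training horizons may differ.
TOKENS_PER_PARAM = 20
SEQ_LEN = 512  # tokens per sequence


def chinchilla_tokens(n_params: int) -> int:
    """Approximate compute-optimal token budget for a model with n_params."""
    return TOKENS_PER_PARAM * n_params


def num_steps(total_tokens: int, sequences_per_batch: int, seq_len: int = SEQ_LEN) -> int:
    """Optimizer steps needed to consume total_tokens (rounded up)."""
    tokens_per_step = sequences_per_batch * seq_len
    return -(-total_tokens // tokens_per_step)  # ceiling division


if __name__ == "__main__":
    for n_params in (124_000_000, 210_000_000, 583_000_000, 720_000_000):
        budget = chinchilla_tokens(n_params)
        steps = num_steps(budget, sequences_per_batch=256)
        print(f"{n_params / 1e6:.0f}M params: {budget / 1e9:.2f}B tokens, {steps} steps")
```

A batch of 32 sequences of 512 tokens corresponds to the 16K-token lower end of the batch-size range; larger batches reduce the step count proportionally for the same token budget.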

The optimizer suite covers:

Adam-like: AdamW, ADOPT, AdEMAMix
Sign-based: Lion, Signum
Second-order/Preconditioned: Muon, D-Muon, SOAP, Sophia
Schedule-free/Parameter-free: SF-AdamW, Prodigy
Variance-reduced: MARS

Ablations are performed for each optimizer on weight decay, learning rate schedule (cosine, WSD, linear), warmup, and learning rate decay endpoint. The study also tracks wall-clock time and gradient norm dynamics.
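An ablation of this kind can be organized as a one-factor-at-a-time sweep around a tuned base configuration. The sketch below is a hypothetical illustration: the `BASE` and `ABLATIONS` values echo ranges mentioned in this summary (the warmup step counts are invented for the example), and they are not the authors' actual search space.

```python
# HYPOTHETICAL base configuration and ablation values, for illustration only.
BASE = {
    "weight_decay": 0.1,
    "schedule": "cosine",
    "final_lr_ratio": 0.01,
    "warmup_steps": 2000,
}
ABLATIONS = {
    "weight_decay": [0.0, 0.1, 0.5],
    "schedule": ["cosine", "wsd", "linear"],
    "final_lr_ratio": [0.1, 0.01],
    "warmup_steps": [2000, 8000],
}


def one_factor_at_a_time(base, ablations):
    """Return the base run plus runs that change exactly one hyperparameter."""
    runs = [dict(base)]
    for key, values in ablations.items():
        for value in values:
            if value != base[key]:
                runs.append({**base, key: value})
    return runs


runs = one_factor_at_a_time(BASE, ABLATIONS)
for run in runs:
    print(run)
```

Varying one factor at a time keeps the number of runs linear in the number of tested values, at the cost of not capturing interactions between hyperparameters.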

Main Results

Optimizer Rankings Across Scales

For small models (124M) and small batch sizes, AdamW remains competitive, but AdEMAMix, D-Muon, SOAP, and Prodigy can outperform it, especially in short runs. As batch size increases, sign-based methods (Signum, Lion) and MARS benefit substantially, closing the gap with or surpassing AdamW. For large models (720M) and large batches, AdEMAMix and MARS consistently dominate, with a significant margin over AdamW and other methods.

Figure 1: Ranking of optimizers for 720M Llama-based models. AdEMAMix and MARS achieve the lowest final validation loss, outperforming AdamW and other baselines.

Training Dynamics and Batch Size Effects

Short training runs favor optimizers with aggressive weight decay and fast adaptation (e.g., D-Muon, AdEMAMix). As training duration increases, AdamW narrows the gap, but AdEMAMix remains superior, especially when betas are re-tuned for longer horizons. Increasing batch size disproportionately benefits sign-based and variance-reduced methods, with MARS, Prodigy, Lion, and Signum matching or exceeding AdamW at large batch sizes.

Figure 2: Comparing optimizers for training a 124M parameter LLM. Signum, MARS, Lion, and Prodigy benefit from increased batch size, outperforming AdamW in long runs.

Weight Decay and Learning Rate Decay

Ablations reveal that large weight decay (e.g., 0.5) is optimal for short training, but moderate decay (0.1) is best for long runs. Omitting weight decay is consistently suboptimal. For learning rate schedules, decaying to 0.01× (or lower) of the maximum learning rate is critical; the common practice of decaying only to 0.1× is suboptimal across all tested schedulers.
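The decay endpoint can be made explicit as a parameter of the schedule. The following is a minimal sketch of a cosine schedule with linear warmup whose final learning rate is `final_ratio` times the maximum; the function name and signature are assumptions for illustration, not the benchmark's API.

```python
import math


def cosine_lr(step, total_steps, max_lr, warmup_steps=0, final_ratio=0.01):
    """Cosine decay from max_lr down to final_ratio * max_lr, after linear warmup."""
    if warmup_steps > 0 and step < warmup_steps:
        # Linear warmup: reaches max_lr on the last warmup step.
        return max_lr * (step + 1) / warmup_steps
    min_lr = final_ratio * max_lr
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for ratio in (0.1, 0.01):
        end = cosine_lr(10_000, 10_000, max_lr=1e-3, warmup_steps=500, final_ratio=ratio)
        print(f"final_ratio={ratio}: last-step lr = {end:.2e}")
```

Setting `final_ratio=0.1` reproduces the common default the ablation found suboptimal, while `final_ratio=0.01` corresponds to the deeper decay recommended above.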

Figure 3: Larger weight decay achieves significantly better results when training on fewer tokens. For long training, moderate decay is optimal.

Warmup and Scheduler Interactions

Warmup duration is optimizer-dependent. SF-AdamW, Sophia, Signum, and Lion benefit from longer warmup, while AdamW and AdEMAMix prefer shorter warmup. Cosine scheduling is generally optimal, but WSD is preferred by Muon, and linear scheduling can be competitive for sign-based methods.
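For comparison with cosine decay, a warmup-stable-decay (WSD) schedule holds the learning rate constant for most of training and decays only in a final cooldown phase. The sketch below assumes a linear cooldown over the last 20% of steps; both the cooldown shape and the `decay_fraction` default are illustrative assumptions, and the benchmark may use a different cooldown.

```python
def wsd_lr(step, total_steps, max_lr, warmup_steps,
           decay_fraction=0.2, final_ratio=0.01):
    """Warmup-stable-decay: linear warmup, constant plateau, linear cooldown."""
    decay_steps = int(decay_fraction * total_steps)
    decay_start = total_steps - decay_steps
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    if step < decay_start:
        return max_lr  # stable phase
    # ASSUMPTION: linear cooldown to final_ratio * max_lr.
    progress = min(1.0, (step - decay_start) / max(1, decay_steps))
    return max_lr * (1.0 - (1.0 - final_ratio) * progress)


if __name__ == "__main__":
    for step in (0, 99, 500, 800, 900, 1000):
        print(step, f"{wsd_lr(step, 1000, 1e-3, warmup_steps=100):.2e}")
```

Because the plateau does not depend on the total step count, a WSD run can be extended or cooled down from an intermediate checkpoint, which is one commonly cited reason for preferring it over cosine.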

Figure 4: Warmup ablation. Sign-based optimizers and SF-AdamW benefit from increased warmup duration.

Hyperparameter Sensitivity and Transfer

Learning rate and beta parameters require careful tuning, especially for long training runs and large models. The optimal betas for AdamW-like methods increase with training duration (e.g., β₂=0.9999 for long runs). Prodigy’s effective learning rate closely tracks AdamW’s, suggesting its utility as a proxy for learning rate tuning.
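One common intuition for why larger betas suit longer runs (offered here as background, not as a claim from the paper) is that an exponential moving average with decay β effectively averages over roughly 1/(1−β) recent steps. The helper names below are illustrative.

```python
import math


def ema_window(beta):
    """Approximate number of recent steps an EMA with decay `beta` averages over."""
    return 1.0 / (1.0 - beta)


def ema_half_life(beta):
    """Number of steps after which a gradient's weight in the EMA has halved."""
    return math.log(0.5) / math.log(beta)


if __name__ == "__main__":
    for beta2 in (0.95, 0.999, 0.9999):
        print(f"beta2={beta2}: window ~{ema_window(beta2):.0f} steps, "
              f"half-life ~{ema_half_life(beta2):.0f} steps")
```

Under this view, moving from β₂=0.999 to β₂=0.9999 lengthens the averaging window by about a factor of ten, which only pays off when the run is long enough to fill that window.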

Figure 5: Re-tuning beta parameters matters significantly for longer training. Increasing β₃ for AdEMAMix is crucial for long runs.
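To show where β₃ enters, here is a simplified single-step sketch of an AdEMAMix-style update on plain Python lists. It follows the published update rule as best understood (a fast EMA with bias correction, a slow EMA governed by β₃ and mixed in with weight α, and an Adam-style second moment), but it omits the warmup schedulers for α and β₃ and is not the benchmark's implementation; the default values are illustrative assumptions.

```python
import math


def init_state(n):
    """Zero-initialized optimizer state for n scalar parameters."""
    return {"t": 0, "m1": [0.0] * n, "m2": [0.0] * n, "v": [0.0] * n}


def ademamix_step(theta, grad, state, lr=1e-3, beta1=0.9, beta2=0.999,
                  beta3=0.9999, alpha=5.0, eps=1e-8, weight_decay=0.0):
    """One simplified AdEMAMix-style step (no alpha/beta3 schedulers)."""
    state["t"] += 1
    t = state["t"]
    new_theta = []
    for i, (p, g) in enumerate(zip(theta, grad)):
        state["m1"][i] = beta1 * state["m1"][i] + (1 - beta1) * g      # fast EMA
        state["m2"][i] = beta3 * state["m2"][i] + (1 - beta3) * g      # slow EMA
        state["v"][i] = beta2 * state["v"][i] + (1 - beta2) * g * g    # second moment
        m1_hat = state["m1"][i] / (1 - beta1 ** t)   # bias-corrected fast EMA
        v_hat = state["v"][i] / (1 - beta2 ** t)     # bias-corrected second moment
        update = (m1_hat + alpha * state["m2"][i]) / (math.sqrt(v_hat) + eps)
        new_theta.append(p - lr * (update + weight_decay * p))  # decoupled decay
    return new_theta


if __name__ == "__main__":
    theta, state = [1.0, -2.0], init_state(2)
    theta = ademamix_step(theta, [0.5, -0.1], state)
    print(theta)
```

With `alpha=0` the slow EMA drops out and the step reduces to an AdamW-style update, which makes the role of the extra momentum term easy to isolate.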

MoE Architectures

Optimizer rankings and best practices transfer smoothly from dense to MoE models. SOAP and AdEMAMix remain strong, with Prodigy and AdamW also competitive.

Figure 6: Ranking optimizers for 520M MoE models with 256×512 batch size. Rankings mirror those for dense models.

Wall-Clock Performance

All optimizers except SOAP exhibit similar wall-clock time scaling with model size. SOAP incurs a significant slowdown due to preconditioner computations.

Implementation and Practical Guidance

The study provides detailed pseudocode and PyTorch implementations for all optimizers, including correct decoupled weight decay for sign-based methods (Signum, Lion). For Muon and D-Muon, the application of weight decay to all parameter groups is essential for robust performance. For AdEMAMix, slow EMA and α-scheduling are critical for stability and scaling. For Prodigy and SF-AdamW, bias correction and careful β₂ tuning are necessary.
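The distinction between decoupled and coupled weight decay is easiest to see for a sign-based method. Below is a minimal sketch of a Signum-style step with decoupled weight decay on plain Python lists; the function names and default values are assumptions for illustration and do not reproduce the authors' PyTorch code.

```python
def sign(x):
    """Return -1, 0, or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)


def signum_step(theta, grad, momentum, lr=1e-3, beta=0.9, weight_decay=0.1):
    """One Signum-style step with decoupled weight decay.

    The decay shrinks the parameter directly and is NOT passed through sign(),
    so its magnitude stays proportional to lr * weight_decay * parameter.
    """
    new_theta, new_momentum = [], []
    for p, g, m in zip(theta, grad, momentum):
        m = beta * m + (1 - beta) * g          # momentum buffer (EMA of gradients)
        p = p * (1 - lr * weight_decay)        # decoupled weight decay
        p = p - lr * sign(m)                   # sign-of-momentum update
        new_theta.append(p)
        new_momentum.append(m)
    return new_theta, new_momentum


if __name__ == "__main__":
    theta, momentum = [2.0, -1.0], [0.0, 0.0]
    theta, momentum = signum_step(theta, [0.5, -3.0], momentum, lr=0.1)
    print(theta, momentum)
```

If the decay term were instead added to the gradient before taking the sign, it could only flip or reinforce the update direction and would lose its magnitude, which is one plausible reason the decoupled form matters for these optimizers.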

Key practical recommendations:

Always tune weight decay and learning rate decay endpoint; default values from popular codebases are often suboptimal.
Re-tune betas for longer training runs; higher β₂ is generally better for AdamW-like methods.
Use large batch sizes to unlock the potential of sign-based and variance-reduced optimizers.
For schedule-free optimizers, tune warmup and betas carefully; do not disable gradient clipping for SF-AdamW.
For Muon/D-Muon, ensure weight decay is applied to all parameter groups.
For SOAP, be aware of wall-clock overhead at large model sizes.
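The gradient clipping mentioned in the recommendations above is often implemented as clipping by global norm. The sketch below assumes that variant and a threshold of 1.0; the summary does not state which clipping type or threshold the benchmark uses, so both are assumptions.

```python
import math


def clip_by_global_norm(grads, max_norm=1.0, eps=1e-6):
    """Rescale all gradients so their joint L2 norm does not exceed max_norm.

    `grads` is a list of per-parameter gradient lists. Returns the clipped
    gradients and the norm measured before clipping.
    """
    total_norm = math.sqrt(sum(g * g for tensor in grads for g in tensor))
    scale = min(1.0, max_norm / (total_norm + eps))
    clipped = [[g * scale for g in tensor] for tensor in grads]
    return clipped, total_norm


if __name__ == "__main__":
    clipped, norm = clip_by_global_norm([[3.0], [4.0]], max_norm=1.0)
    print(norm, clipped)
```

Scaling every gradient by the same factor preserves the update direction, which is why this form is usually preferred over clipping each value independently.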

Theoretical and Empirical Implications

The results challenge the default reliance on AdamW for LLM pretraining, demonstrating that newer optimizers (AdEMAMix, MARS, D-Muon) can achieve better loss and scaling, especially when hyperparameters are properly tuned. The study also highlights the optimizer-dependence of best practices for weight decay, learning rate scheduling, and warmup, and exposes the sensitivity of some methods (e.g., Sophia) to training horizon and batch size.

The findings suggest that optimizer selection and tuning remain a critical, under-explored axis for improving LLM pretraining efficiency and final model quality. The transferability of results to MoE architectures and the open-sourcing of the benchmark framework provide a foundation for future research and industry adoption.

Limitations and Future Directions

The benchmark is limited to models up to 720M parameters and does not directly evaluate downstream task performance, though validation loss is generally predictive of it. Some optimizers (e.g., Shampoo, Adan, Scion) and memory-efficient variants are not included. The study does not address distributed training or sharding framework compatibility, which may affect practical deployment at larger scales.

Future work should extend the benchmark to trillion-parameter models, include downstream evaluation, and explore optimizer performance in distributed and low-precision settings. The development of unified benchmarks for memory- and communication-efficient optimizers is also a priority.

Conclusion

This work establishes a rigorous, reproducible benchmark for optimizer selection in LLM pretraining, providing actionable insights for both practitioners and researchers. The results demonstrate that with careful tuning, several modern optimizers can outperform AdamW, especially at scale and with large batch sizes. The open-source codebase and comprehensive ablations set a new standard for optimizer evaluation in the LLM community.

Explain it Like I'm 14

Explaining “Benchmarking Optimizers for LLM Pretraining”

What is this paper about?

This paper compares many different “optimizers” (the rules that tell a neural network how to adjust itself while learning) to find out which ones work best for training LLMs. The authors run fair, careful tests on models of different sizes, with different batch sizes, and for different training lengths, and they share all their code so others can repeat the tests.

What questions are the authors trying to answer?

Which optimizer is best for training LLMs, depending on model size, batch size, and how long you train?
Do new optimizers actually beat the classic AdamW optimizer in real, fair tests?
Which training choices (like warmup, weight decay, or learning rate schedules) make the biggest difference?
How should practitioners pick and tune an optimizer in practice?

How did they test this? (In simple terms)

Think of training a model like teaching someone to guess the next word in a sentence. An optimizer is the study strategy they use to improve with each guess.
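Here is a tiny toy example of that idea, using the simplest possible optimizer (plain gradient descent) on a made-up problem: find the number `w` that makes `(w - 3)²` as small as possible. This is only an illustration and is far simpler than any optimizer tested in the paper.

```python
def loss(w):
    """How wrong the current guess is (zero when w equals 3)."""
    return (w - 3.0) ** 2


def grad(w):
    """Slope of the loss: tells us which direction makes the loss bigger."""
    return 2.0 * (w - 3.0)


def train(w=0.0, lr=0.1, steps=50):
    """Repeatedly nudge w a small step against the slope."""
    for _ in range(steps):
        w = w - lr * grad(w)
    return w


if __name__ == "__main__":
    w = train()
    print(f"learned w = {w:.4f}, loss = {loss(w):.6f}")
```

The step size `lr` is the "learning rate": too small and learning is slow, too large and the guesses overshoot. The optimizers in the paper are smarter versions of this same loop.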

Here’s what the authors did:

They trained Llama-style LLMs of different sizes (from 124 million to 720 million parameters, plus a Mixture-of-Experts model).
The models learned on a large, cleaned text dataset (FineWeb), predicting the next token (piece of text). Lower “validation loss” means better predictions.
They tried 11 optimizers, including:
- AdamW (the standard)
- AdEMAMix, ADOPT, SOAP, MARS (newer methods)
- Lion and Signum (sign-based optimizers)
- Muon and D-Muon (a variant of Muon that applies weight decay correctly)
- Prodigy and SF-AdamW (aim to need fewer manual settings)
- Sophia (a second-order-style method)
They carefully tuned each optimizer’s settings (like learning rate and momentum) near a common training length (based on “Chinchilla” scaling, a rule-of-thumb for how long to train a model), then checked how well those settings held up when training longer.
They also ran “ablations,” which are focused tests where you change one thing at a time:
- Weight decay (a way to keep the model’s numbers from getting too large, like tidying your room regularly so clutter doesn’t pile up)
- Warmup (start with small learning steps, then speed up)
- Learning rate schedules (how the step size shrinks over time: cosine, linear, or WSD)
- Learning rate sensitivity (how picky a method is about the step size)
- Other details like gradient clipping and initialization

What did they find, and why does it matter?

Here are the main takeaways, written plainly:

AdEMAMix is a star performer
- It consistently ranks among the best across many settings and training lengths.
- For big models and long runs, it often wins.
Batch size changes who the winners are
- With small batches: sign-based methods (Lion, Signum) and MARS don’t do great; Sophia can even become unstable.
- With large batches: Lion, Signum, MARS, and Prodigy improve a lot and can beat AdamW in shorter runs. AdEMAMix still stays strong.
- At 720M parameters with 1M-token batches: AdEMAMix and MARS are top performers.
D-Muon > Muon (because of proper weight decay)
- A version of Muon called D-Muon, which applies weight decay correctly, performs clearly better than the original Muon.
Weight decay really matters
- For short training: using a larger weight decay (around 0.5) can noticeably improve results.
- For long training: a moderate weight decay (around 0.1) works best.
- No weight decay is usually a bad idea; performance gets worse, especially as training goes longer.
- Using “decoupled” weight decay (the modern, correct way) is important, especially for some optimizers like Signum.
Warmup is optimizer-dependent
- A longer warmup helps certain methods (Signum, Sophia, SF-AdamW), and can even make Lion surpass AdamW in some cases.
- Other methods do fine with shorter, standard warmups.
Learning rate schedules: cosine usually wins
- Cosine scheduling tended to work best across optimizers.
- A few exceptions exist (e.g., Muon sometimes liked WSD), but cosine was the safest choice overall.
Learning rate sensitivity
- Many optimizers had a clear “sweet spot” for learning rate found in short runs that still worked well in longer runs.
- Sign-based methods and Sophia often diverged (crashed) if the learning rate was set too high.
- MARS was unusually stable across a wide range of learning rates.
They trained 2,900 models and spent about 30,000 GPU hours
- That’s a lot—so the comparisons are broad and careful.
- All code and configs are released for reproducibility.

What does this mean going forward?

For people training LLMs:
- If you want a strong and reliable choice, try AdEMAMix.
- If you use very large batches, consider Lion, Signum, MARS, or Prodigy—they can become very competitive.
- Don’t forget weight decay: use about 0.5 for short runs, and about 0.1 for long runs. Avoid zero.
- Prefer cosine learning rate schedules unless you have a strong reason not to.
- If you are considering Muon, prefer D-Muon (it fixes important weight decay behavior).
- Tune warmup length based on the optimizer; longer warmup can help some methods a lot.
For researchers:
- Batch size has a big effect on which optimizer looks best—future papers should report results across batch sizes.
- Stability issues (like Sophia’s divergence at small batches/long runs) are important to address.
- Weight decay design (and correct implementation) strongly shapes outcomes—don’t overlook it.
- The open-source benchmark provides a fair playground to test new ideas and make solid comparisons.

Overall, this paper gives a clear, up-to-date map of which optimizers work best for LLM pretraining in different situations, and it offers practical advice backed by large, careful experiments.

Knowledge Gaps

Below is a single, consolidated list of concrete knowledge gaps, limitations, and open questions that remain unresolved by the paper and could guide future research:

Scaling to frontier LLMs: Results are limited to dense models up to 720M parameters and a single 520M MoE; it remains unknown whether optimizer rankings and hyperparameter recommendations hold for 1–70B+ parameter models and modern large MoEs (varying experts, capacity factor, routing, load-balancing).
Sequence length generality: All experiments use sequence length 512; the impact of longer contexts (2k–32k+), curriculum of context length, and positional encoding choices on optimizer stability and performance is not evaluated.
Dataset and domain coverage: Benchmarks focus on a FineWeb subset with GPT-2 tokenization; transferability to other corpora (e.g., The Pile, C4, multilingual/web/code mixtures), tokenizers (SentencePiece/BPE sizes), and data quality regimes (dedup levels, contamination) is unknown.
Downstream performance correlation: The study centers on validation loss; how optimizer choice affects downstream zero-shot/few-shot tasks, calibration, and robustness—and how well loss improvements predict downstream gains across optimizers—remains open.
Statistical robustness: Variance across random seeds, data orders, and initialization seeds is not reported; ranking stability under multi-seed evaluation and confidence intervals is unquantified.
Wall-clock and systems metrics: There is no systematic comparison of step-time, throughput, memory footprint, and communication overhead (e.g., SOAP preconditioning cost, Muon orthogonalization steps, D-Muon communication efficiency). Compute-to-quality Pareto analyses are missing.
Tuning budget fairness: The equivalence of search budgets and search spaces across optimizers is unclear; ranking sensitivity to tuning budget, search strategy (manual vs. automated), and hyperparameter priors is not analyzed.
Hyperparameter transferability: While tuned near Chinchilla-optimal and reused, a systematic study of hyperparameter transfer across model sizes, batch sizes, datasets, and training horizons is missing; guidelines or scaling laws (e.g., for betas) are not formalized.
Very long-horizon training: Results extend to 16.8–33.6B tokens for 124M and limited horizons for larger models; optimizer ranking at substantially longer horizons (compute-optimal for modern scaling laws) is unknown.
Learning-rate schedules: Only cosine, linear, and WSD are studied; effects of other schedules (e.g., OneCycle, step, exponential, sqrt decay, warm restarts) and end-LR ratios beyond 0.01× are not explored.
Warmup design space: Warmup experiments are linear and mostly at 124M; the benefits of alternative warmup shapes (cosine/exponential), per-parameter warmup, or warmup scheduling at larger scales remain untested.
Weight decay policy: Although decoupled weight decay is ablated, the paper does not assess layer-wise WD, parameter-group exclusions (e.g., norms/bias/embeddings), WD schedules, or optimizer-specific WD formulations; downstream and calibration effects of large WD are unclear.
Gradient clipping specifics: The type (global vs per-parameter/layer), thresholds, and adaptive strategies for clipping—and their optimizer-specific impacts—are not systematically mapped.
Precision and numerics: The interaction between optimizer choice and numerical formats (bf16/fp16/fp8), dynamic loss scaling, and kernel implementations is not evaluated; ranking stability under mixed precision is unknown.
MoE-specific behaviors: Only one MoE configuration is tested; optimizer effects on router dynamics (entropy, load balance), capacity factors, expert counts, and expert dropout are not studied.
Schedule-free/continual training: Schedule-free optimizers are not evaluated in true continual pretraining scenarios (non-decaying LR, non-stationary data); their practical advantages remain unquantified.
Broader optimizer coverage: Popular large-batch or memory-efficient methods (e.g., LAMB, AdaFactor, Adagrad variants, 8-bit optimizers) are absent; it is unknown how they compare under the standardized setup.
Robustness and failure modes: While some divergences (e.g., Sophia) are noted, there is no systematic “stability map” of failure regions (lr, betas, batch, warmup, clip) per optimizer or standardized mitigation strategies.
Interpretability of differences: No mechanistic analysis (e.g., gradient-noise scale, curvature/Hessian spectra, sharpness, update direction statistics) is provided to explain why certain optimizers (e.g., AdEMAMix) outperform others across regimes.
Architecture sensitivity: Results are for a specific Llama-like, RMSNorm, SwiGLU, RoPE setup; sensitivity to architectural variants (LayerNorm placement, activation functions, attention kernels, normalization types) is not explored.
Data ordering and curriculum: The influence of data shuffling, token repetition, curriculum (by length/difficulty), and epoching strategies on optimizer performance is not studied.
Objective variations: The conclusion that z-loss “has little impact” is based on limited ablation; other objectives (label smoothing, entropy regularization, auxiliary losses) and their optimizer interactions are largely unexplored.
Per-parameter-group settings: Beyond a Signum WD fix, there is no thorough exploration of per-layer/per-group LR, momentum, or WD (e.g., embedding vs attention/MLP vs norms) and their effect on optimizer rankings.
Compute accounting granularity: Although compute cost is “accounted,” the paper does not provide standardized, reproducible wall-clock budgets or FLOP-normalized comparisons that jointly consider speed and quality across hardware/software stacks.
Reproducibility across frameworks: Results are shown in one codebase; cross-framework replication (Megatron/DeepSpeed/OLMo/LLaMA recipes) to test sensitivity to fused kernels and implementation details is missing.
Generalization across tasks: Only autoregressive LM pretraining is considered; transfer to masked LM, seq2seq, instruction tuning, or RLHF remains untested.
