
Practical Efficiency of Muon for Pretraining

Published 4 May 2025 in cs.LG and stat.ML (arXiv:2505.02222v4)

Abstract: We demonstrate that Muon, the simplest instantiation of a second-order optimizer, explicitly expands the Pareto frontier over AdamW on the compute-time tradeoff. We find that Muon is more effective than AdamW in retaining data efficiency at large batch sizes, far beyond the so-called critical batch size, while remaining computationally efficient, thus enabling more economical training. We study the combination of Muon and the maximal update parameterization (muP) for efficient hyperparameter transfer and present a simple telescoping algorithm that accounts for all sources of error in muP while introducing only a modest overhead in resources. We validate our findings through extensive experiments with model sizes up to four billion parameters and ablations on the data distribution and architecture.

Summary

  • The paper demonstrates that Muon, the simplest instantiation of a second-order optimizer, expands the Pareto frontier over AdamW on the compute-time tradeoff while improving data efficiency.
  • Muon employs matrix-structured steepest descent with Newton-Schulz iteration and Nesterov momentum, integrated with muP scaling for robust hyperparameter transfer.
  • Experiments on transformer models show that Muon achieves target loss faster and allocates resources more effectively, validated by telescoping strategies across model widths.


Introduction

The paper "Practical Efficiency of Muon for Pretraining" presents Muon as an efficient second-order optimizer that expands the Pareto frontier over AdamW by retaining data efficiency at large batch sizes without sacrificing computational efficiency. While AdamW has dominated the landscape thanks to its data efficiency and marginal compute overhead, Muon challenges this status quo with superior performance on the compute-time tradeoff. Characterizing how this tradeoff shifts under different optimizers lets practitioners allocate resources more effectively (Figure 1).

Figure 1: Muon expands the Pareto frontier over AdamW on the compute-time tradeoff, maintaining data efficiency at large batch sizes.

Muon Improves the Compute-Time Tradeoff

Review of Muon

Muon employs matrix-structured steepest descent with spectral norm regularization:

$$O_t = \operatorname*{arg\,min}_{O \in \mathbb{R}^{m \times n} :\; \|O\|_2 \leq 1} \operatorname{tr}\!\left(G_t^\top O\right)$$

where $G_t = U \Sigma V^\top$ is the singular value decomposition (SVD) of the gradient, so the minimizer is the (negative) orthogonal factor $O_t = -UV^\top$. Muon approximates this factor with a few Newton-Schulz iterations rather than computing the SVD, and combines it with Nesterov momentum, learning-rate scaling, and weight decay. At each step, given the gradient $G_t$ and the momentum buffer, it produces the updated weights $W_{t+1}$.
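A minimal NumPy sketch of this update (illustrative, not the authors' implementation; the quintic Newton-Schulz coefficients follow the widely used open-source Muon code, and the hyperparameter values are placeholders):

```python
import numpy as np

def newton_schulz(G, steps=5):
    """Approximate the orthogonal factor U V^T of G's SVD with a quintic
    Newton-Schulz iteration, avoiding an explicit (expensive) SVD."""
    a, b, c = 3.4445, -4.7750, 2.0315   # coefficients from the open-source Muon code
    X = G / (np.linalg.norm(G) + 1e-7)  # Frobenius normalization so the iteration converges
    transpose = G.shape[0] > G.shape[1]
    if transpose:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transpose else X

def muon_step(W, grad, momentum, lr=0.02, beta=0.95, weight_decay=0.0):
    """One Muon update: Nesterov momentum, orthogonalized step, decoupled weight decay."""
    momentum = beta * momentum + grad
    update = newton_schulz(grad + beta * momentum)  # Nesterov-style lookahead
    W = W * (1.0 - lr * weight_decay) - lr * update
    return W, momentum
```

Full implementations also rescale the orthogonalized step by the layer's shape and typically run the iteration in low precision on the accelerator; both details are omitted here.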

Experimental Setup

The study employs modern decoder-only transformer models trained on a mixture of high-compression text and code data from DCLM and Python to validate Muon's performance. The authors conduct extensive experiments across batch sizes, achieving close to 50% model FLOPs utilization (MFU) on TPU v5p chips. Hyperparameter tuning reveals that Muon reaches a target loss significantly faster than AdamW in wall-clock time (Figure 2).

Figure 2: Muon expands the Pareto frontier over AdamW at various loss thresholds on Python and DCLM datasets.

Characterization of Compute-Time Tradeoff

On the compute-time plane, each optimizer is represented by iso-loss curves that plot total training time against the number of devices. Muon demonstrates greater flexibility in resource allocation than AdamW by maintaining data efficiency at large batch sizes, confirmed by a non-decreasing token ratio even in the large-batch regime (Figure 3).

Figure 3: Relative data efficiency of Muon over AdamW at varying batch sizes across different model sizes.

Choosing Hyperparameters for Muon

The Maximal Update Parameterization (muP)

Muon is compatible with muP scaling, which enables predictable hyperparameter transfer across model scales by regulating initialization and learning-rate scaling. This lets hyperparameters calibrated on smaller models transfer efficiently to larger ones, with remaining estimation errors handled by a telescoping algorithm (Figure 4).


Figure 4: Token ratios to loss portray Muon's practical advantages over AdamW for large batch training.

Telescoping Strategy for Hyperparameter Transfer

The telescoping protocol keeps hyperparameter tuning efficient by systematically reducing the number of sweep points as width increases. Because the optimum drifts only slowly with model width, this bounds tuning overhead while retaining access to near-optimal hyperparameters. The protocol was validated on a family of transformer models of increasing size, demonstrating efficient hyperparameter transfer to pretraining with Muon (Figure 5).

Figure 5: Telescoping algorithm visualization applied to weight decay and learning rate across model widths.
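A toy sketch of the telescoping protocol for a single hyperparameter (the function names, grids, and loss model below are illustrative, not the paper's exact procedure):

```python
import numpy as np

def telescoping_sweep(train_and_eval, base_width=256, doublings=4):
    """Illustrative telescoping search over the learning rate.

    train_and_eval(width, lr) -> validation loss; assumed to use a
    muP-style parameterization, so the optimum drifts only slightly
    as width doubles."""
    grid = np.geomspace(1e-3, 1e-1, 9)  # wide initial sweep at the smallest width
    best = None
    for step in range(doublings + 1):
        width = base_width * 2 ** step
        losses = [train_and_eval(width, lr) for lr in grid]
        best = float(grid[int(np.argmin(losses))])
        # Narrow and re-center the grid around the current optimum;
        # fewer sweep points per stage keeps total overhead modest.
        grid = np.geomspace(best / 2, best * 2, max(3, len(grid) - 2))
    return best
```

Because muP keeps the optimum roughly stable as width doubles, each stage can afford a narrower grid, so the total tuning cost stays a small multiple of the final training run.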

Comparative studies of AdamW and emerging second-order methods such as Shampoo and Muon highlight Muon's advantages. These findings are consistent with prior work on optimizer batch-size dynamics, which Muon extends with greater compute-time flexibility, demonstrated by its data efficiency and reliable hyperparameter transfer (Figure 6).


Figure 6: Comparison of the best 1B model training runs under AdamW and Muon demonstrating wall-time efficiency.

Conclusion

Muon emerges as a robust alternative for large-scale pretraining, delivering superior compute-time tradeoffs and more flexible resource allocation than AdamW. This advantage persists at large batch sizes thanks to Muon's efficient token consumption. Hyperparameter tuning with muP further enhances Muon's applicability by ensuring scaling efficiency, establishing Muon as a practical replacement for AdamW in extensive pretraining tasks. With broad empirical validation across multiple setups, Muon stands out as a practically efficient optimizer for distributed neural network training.

Overall, Muon optimizes resource strategies more effectively, rendering it a preferred choice for practitioners focusing on scale-efficient pretraining within the landscape of modern compute infrastructures.


Explain it Like I'm 14

Overview

This paper asks a practical question: how can we train LLMs faster and more cheaply? The authors compare two ways to “teach” a model during training, called optimizers. The standard optimizer is AdamW. The new one, Muon, is a simple kind of second‑order optimizer. The paper shows that Muon lets you train faster without wasting data, especially when you use very large batch sizes (lots of data processed in one go across many devices). It also explains a clever way to pick good training settings (hyperparameters) for big models using a method called muP and a “telescoping” search strategy.

What questions did the researchers ask?

  • Can Muon beat AdamW in a fair, practical comparison that considers both compute (how many devices and how much work they do) and time (how long training takes)?
  • Does Muon stay “data‑efficient” at very large batch sizes, meaning it reaches the same quality using fewer training tokens (pieces of text/code)?
  • Can muP, a method for copying good training settings from small models to big models, work well with Muon?
  • Is there a simple, low‑cost way to tune hyperparameters for big models reliably?

How did they do the research?

Comparing optimizers fairly: the compute–time tradeoff

Think of training as a race where you can pick:

  • how many runners you use (devices),
  • how big each stride is (batch size),
  • and how much total energy you spend (compute).

Different choices trade off training time against total cost. The authors plot “iso‑loss curves,” which are lines showing all the ways (different device counts and batch sizes) to reach the same target quality (loss). If Muon’s curve lies “better” than AdamW’s, it means Muon gives you strictly more options: either finish sooner with the same budget or spend less with the same finish time. This “frontier of best choices” is called the Pareto frontier.
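A tiny sketch of what "Pareto frontier" means here, with made-up (training time, compute cost) numbers:

```python
def pareto_frontier(points):
    """Keep only the (time, compute) options not dominated by another option
    that is at least as fast AND at least as cheap."""
    keep = []
    for p in points:
        dominated = any(
            q[0] <= p[0] and q[1] <= p[1] and q != p
            for q in points
        )
        if not dominated:
            keep.append(p)
    return sorted(keep)

# Hypothetical (training time, compute cost) configurations that all
# reach the same target loss -- the numbers are invented:
runs = [(10, 8), (8, 9), (12, 6), (9, 12), (11, 7)]
print(pareto_frontier(runs))  # (9, 12) drops out: (8, 9) is faster AND cheaper
```

An optimizer "expands the frontier" when its options dominate some of the other optimizer's frontier points, giving you strictly better time/cost choices.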

What is Muon?

When you train a model, you repeatedly nudge its parameters in directions that reduce error—like finding the fastest way downhill on a landscape. AdamW looks at the slope (first‑order information). Muon looks a bit deeper at how the landscape curves and how parameters interact (second‑order flavor), but in a very simple way.

Muon’s update roughly says: “move in the single strongest direction suggested by the gradient, but keep the step well‑behaved.” It uses a fast trick (Newton–Schulz) to avoid expensive math and pairs with momentum and weight decay (common training tools). In practice, Muon only stores a small amount of extra information and adds little overhead, especially at large batch sizes.

Plain analogy:

  • AdamW: a careful hiker using the slope to go downhill.
  • Muon: a careful hiker who also notices the shape of the ground and steps in a more effective direction, without overthinking it.

Picking training settings: muP and the telescoping search

Hyperparameters are the knobs you set before training (like learning rate and weight decay). Picking them for a huge model by brute force is very expensive. muP (maximal update parameterization) is a set of rules that help transfer good hyperparameters found on a small model to a larger one so they still work.

The “telescoping” algorithm is a practical tuning strategy:

  • Start with a small model and do a wider hyperparameter search.
  • Double the model’s width (make it bigger) and narrow the search range.
  • Repeat a few times. Each stage costs about the same, and the total extra cost grows slowly, like C·log(N), where C is the cost of training the final big model and N is its width.
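The constant per-stage cost can be seen with a back-of-the-envelope calculation (all numbers invented for illustration):

```python
# Width doubles each stage while the number of sweep points halves,
# so every stage costs about the same and the total tuning overhead
# scales like log2(N) times the cost of one big run.
def stage_cost(width, sweep_points):
    # Pretend one run's cost grows linearly with width; real FLOP
    # scaling also depends on depth, tokens, etc.
    return width * sweep_points

widths = [256, 512, 1024, 2048, 4096]   # doubling each stage
sweeps = [32, 16, 8, 4, 2]              # halving each stage

costs = [stage_cost(w, s) for w, s in zip(widths, sweeps)]
print(costs)  # every stage costs the same: 8192
```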

Analogy:

  • muP: scaling a recipe from cooking for 2 people to 200 people without ruining the taste.
  • Telescoping: zooming in on a map step by step to find the exact address, instead of searching the whole city every time.

What did they find and why is it important?

  • Muon expands the compute–time Pareto frontier compared to AdamW.
    • In simple terms: Muon gives you more “best choice” options—finish faster with the same compute, or use fewer devices for the same finish time.
  • Muon stays data‑efficient at very large batch sizes.
    • They measure a “token ratio”: how many more tokens AdamW needs compared to Muon to reach the same quality.
    • Across different model sizes (up to 4 billion parameters) and datasets (web text and Python code), Muon consistently needed about 10–15% fewer tokens. This advantage did not fade at huge batch sizes; it was flat or even grew as batch size increased.
    • This matters because big training runs often use many devices and large batches to finish quickly. If your optimizer gets worse with bigger batches, you waste data and time. Muon doesn’t.
  • Muon’s overhead is small and shrinks with larger batches.
    • Even though Muon does a bit more math per update than AdamW, that extra cost becomes relatively smaller as batches get larger. Combined with fewer tokens needed, Muon wins in practice.
  • muP works with Muon.
    • The paper confirms that muP’s scaling rules transfer good settings cleanly from small to large models when using Muon, not just AdamW.
  • Telescoping hyperparameter tuning is accurate and cheap.
    • By narrowing the search each time the model width doubles, you keep tuning costs modest and still find near‑optimal settings.
    • They validated this on models up to about 3.7B parameters, training up to 160B tokens at sequence length 8192, and achieved strong results with specific settings (learning rate and weight decay).

Why it’s important:

  • Training large models is expensive. If you can reach the same quality with fewer tokens and less time, you save money and energy.
  • If your optimizer works well at large batch sizes, you can scale across more devices without losing quality.
  • If muP and telescoping reliably tune big models, you cut the time spent on trial‑and‑error.

Implications and impact

  • For teams training LLMs, Muon is a practical drop‑in replacement for AdamW:
    • It’s more flexible for resource planning because it remains data‑efficient at large batch sizes.
    • It can reduce training time without increasing total compute.
  • Combining Muon with muP and telescoping gives a unified recipe:
    • Use Muon for optimization.
    • Use muP to scale hyperparameters from small to big models.
    • Use telescoping to fine‑tune those settings cheaply as you grow the model.
  • Overall, this can make industry‑scale pretraining more economical and faster, helping organizations build capable models with lower cost and shorter timelines.

Key takeaways

  • Muon lets you train faster or cheaper than AdamW at the same quality, especially with very large batch sizes.
  • Muon needs fewer training tokens to reach the same loss, and this advantage doesn’t vanish as batch size grows.
  • muP works with Muon, so you can transfer good hyperparameters from small models to big ones reliably.
  • The telescoping search keeps tuning costs low while staying accurate, adding only a small, logarithmic overhead to the final training cost.
