MuonBP: Faster Muon via Block-Periodic Orthogonalization

Published 19 Oct 2025 in cs.LG and math.OC | (2510.16981v1)

Abstract: Gradient orthogonalization is a simple strategy that shows great utility in speeding up gradient descent. The Muon optimizer (Jordan, Jin, et al., 2024) combines gradient orthogonalization with first-order momentum and achieves significant improvement in data efficiency over Adam/AdamW (Loshchilov and Hutter, 2019) for LLM training. However, when using model parallelism, gradient orthogonalization introduces additional overhead compared to coordinate-wise optimizers (such as AdamW) due to additional gather and scatter operations on gradient matrix shards from different devices. This additional communication can amount to a throughput hit of 5%-10% compared to Adam/AdamW. To remedy this, we propose Muon with Block-Periodic Orthogonalization (MuonBP), which applies orthogonalization independently to matrix shards on each device and periodically performs full orthogonalization to maintain training stability at scale. We show how to adjust the learning rate from the baseline to MuonBP and give convergence guarantees for this algorithm. Crucially, our theory dictates that we use two stepsizes: one for the blockwise orthogonalization steps, and one for the full orthogonalization steps. Our method is simple, requires minimal hyperparameter adjustments, and achieves competitive iteration complexity compared with baseline Muon while providing per-iteration throughput comparable to coordinate-wise methods such as AdamW. When training an 8B model with eight-way tensor parallelism and ZeRO optimizer state sharding, MuonBP achieves 8% throughput increase compared to Muon with no degradation in performance.

Summary

  • The paper introduces a block-periodic orthogonalization method that reduces inter-device communication while boosting training throughput.
  • The methodology leverages dual learning rates and local orthogonalization to achieve up to an 8% throughput improvement over baseline Muon.
  • Empirical results demonstrate lower validation perplexities and faster convergence in large-scale models using eight-way tensor parallelism.

MuonBP: Faster Muon via Block-Periodic Orthogonalization

Introduction

The paper "MuonBP: Faster Muon via Block-Periodic Orthogonalization" (2510.16981) presents an innovation in gradient orthogonalization aimed at optimizing the Muon algorithm's performance when training large-scale LLMs. Gradient orthogonalization, combined with first-order momentum in the Muon optimizer, enhances data efficiency significantly compared to traditional methods like Adam and AdamW. However, the inherent communication costs in model parallelism—stemming from gradient matrix shard operations across devices—pose throughput challenges. This research introduces MuonBP, a variant of Muon with block-periodic orthogonalization, which strategically balances local and full orthogonalization steps to maintain training stability, achieve competitive iteration complexity, and significantly improve throughput.

Methodology

The MuonBP approach introduces a block-periodic orthogonalization strategy: matrix shards residing on individual devices are orthogonalized locally, interleaved with periodic full orthogonalization steps, which cuts the communication overhead of orthogonalization. The theoretical framework provides convergence guarantees by using dual learning rates: one for the blockwise steps and another for the full orthogonalization steps. This adjustment is critical to ensuring that the throughput gains do not come at the cost of convergence or training stability (Figure 1).

Figure 1: 8B model validation perplexities. Comparison of Muon, BlockMuon, and MuonBP across wall-clock time. For a target validation perplexity, MuonBP is ~10-13% faster in wall-clock time to reach it, and at a given time point before the learning-rate decay it yields ~5-7% lower perplexity than the baseline.
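
To make the schedule concrete, the sketch below shows one way a block-periodic update with two stepsizes could be organized in PyTorch for a single row-sharded weight matrix. The function name `muonbp_step`, the period `period`, the learning rates `lr_block`/`lr_full`, and the `orthogonalize` helper (sketched after the next paragraph) are illustrative assumptions, not the authors' implementation; the full step assumes an all-gather of momentum shards over the tensor-parallel group.

```python
# Hedged sketch of a block-periodic Muon-style step (not the paper's code).
# Assumes each rank holds shards `w_shard`, `m_shard`, `grad_shard` of one
# row-sharded weight matrix, plus a tensor-parallel process group `tp_group`
# and an `orthogonalize(M)` routine (e.g. Newton-Schulz, sketched below).
import torch
import torch.distributed as dist


def muonbp_step(w_shard, m_shard, grad_shard, step, lr_block, lr_full,
                beta=0.95, period=8, tp_group=None):
    # First-order momentum buffer update (simplified; Muon variants differ in detail).
    m_shard.mul_(beta).add_(grad_shard)

    if step % period == 0:
        # Periodic full step: gather all shards, orthogonalize the full momentum
        # matrix, and apply this rank's slice with the full-step learning rate.
        world = dist.get_world_size(tp_group)
        rank = dist.get_rank(tp_group)
        gathered = [torch.empty_like(m_shard) for _ in range(world)]
        dist.all_gather(gathered, m_shard, group=tp_group)
        full_update = orthogonalize(torch.cat(gathered, dim=0))
        w_shard.add_(full_update.chunk(world, dim=0)[rank], alpha=-lr_full)
    else:
        # Block step: orthogonalize the local shard only; no inter-device communication.
        w_shard.add_(orthogonalize(m_shard), alpha=-lr_block)
```

The key structural point is that the expensive gather/scatter only happens once every `period` steps, while every other step stays entirely local and uses the separate blockwise stepsize prescribed by the theory.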

The algorithm operates under the assumptions of smooth gradients, bounded variance, and norm equivalence. The analysis draws on Non-Euclidean Trust Region (NTR) optimization principles, allowing the algorithm to exploit different matrix norms efficiently. Block orthogonalization takes place locally and independently on each device, significantly reducing inter-device communication, a common bottleneck in parallelized training. The periodic global orthogonalization ensures that training is not adversely affected by these local operations, thereby preserving the convergence guarantees.
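
For completeness, Muon-style optimizers typically approximate the orthogonalization of the momentum matrix with a few Newton-Schulz iterations rather than an exact SVD. The sketch below is a standard quintic Newton-Schulz variant, included only to make the `orthogonalize` helper in the previous sketch concrete; the coefficients and iteration count are the ones commonly cited for Muon reference code and should be treated as illustrative rather than as values taken from this paper.

```python
# Hedged sketch: approximate orthogonalization via Newton-Schulz iteration,
# in the style used by Muon implementations (coefficients/steps illustrative).
import torch


def orthogonalize(m: torch.Tensor, steps: int = 5) -> torch.Tensor:
    a, b, c = 3.4445, -4.7750, 2.0315      # quintic coefficients commonly used with Muon
    x = m.bfloat16()
    transposed = x.size(0) > x.size(1)
    if transposed:                           # iterate on the "wide" orientation
        x = x.mT
    x = x / (x.norm() + 1e-7)                # normalize so the iteration converges
    for _ in range(steps):
        s = x @ x.mT
        x = a * x + (b * s + c * (s @ s)) @ x
    if transposed:
        x = x.mT
    return x.to(m.dtype)
```

Because this routine is purely local matrix arithmetic, applying it to a shard instead of the full matrix changes nothing except which matrix gets orthogonalized, which is exactly the degree of freedom MuonBP exploits on its block steps.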

Results and Discussion

Empirical evaluations highlight MuonBP’s efficacy in accelerating training without sacrificing performance. Specifically, MuonBP exhibits up to an 8% increase in throughput compared to baseline Muon when training models with eight-way tensor parallelism and ZeRO optimizer state sharding. This improvement matters for large-scale LLM pretraining, aligning with the broader industry shift toward computational efficiency and cost reduction (Figure 2).

Figure 2: 960M model. Comparison of baseline, block, and periodic orthogonal block methods across training steps and wall-clock time.

Across different model scales, MuonBP consistently achieves superior performance metrics—lower validation perplexities in shorter wall-clock times—underscoring its practical benefits in reducing training times for models scaling up to 8B parameters. The strategy offers quantitative improvements in model throughput, essential for real-world applications involving extensive data processing and energy expenditures.

Implications and Future Work

Theoretical analysis shows that the block-periodic orthogonalization approach not only retains the data efficiency of Muon but also achieves communication efficiency comparable to traditional coordinate-wise methods. This methodology provides a template for reducing communication burdens in distributed systems, which is pivotal for scaling neural network training (Figure 3).

Figure 3: 1.2B model. Comparison of baseline, block, and periodic orthogonal block methods across training steps and wall-clock time.

Future developments could explore adaptive strategies for setting orthogonalization periods, potentially refined by real-time monitoring of network and compute resource metrics. Additionally, integrating MuonBP with other parallelization approaches like expert parallelism or advanced load balancing strategies could further leverage its efficiency benefits.

Conclusion

MuonBP offers a compelling enhancement over the traditional Muon algorithm by effectively addressing the communication overhead while maintaining competitive convergence guarantees. The research represents a critical step in optimizing the operational efficiency of large-scale LLM training, paving the way for future innovations in AI training methodologies (Figure 4).

Figure 4: 1.2B model (larger lr), trained to 3x Chinchilla. Comparison of baseline, block, and periodic orthogonal block methods across training steps and wall-clock time.
