
Dion: Distributed Orthonormalized Updates

Published 7 Apr 2025 in cs.LG, cs.AI, and math.OC | (2504.05295v2)

Abstract: Recent work has shown that orthonormal matrix updates speed up neural network optimization, improve training stability, and offer better hyperparameter transfer across model sizes. Applying these updates efficiently when model weights and optimizer states are sharded across a large-scale distributed LLM training system remains a major challenge. We introduce Dion (DIstributed OrthoNormalization), a scalable and communication-efficient orthonormalizing optimizer. Dion leverages low-rank approximation and decoupled momentum buffers, eliminating the need for full gradient synchronization while producing numerically equivalent results. It is compatible with simultaneous DDP, FSDP, and TP parallelism, and it computes an orthonormalized update without unsharding a full parameter matrix on any single device. We evaluate Dion on LLMs from 120M to 3B parameters and find that its benefits improve with increasing model size and batch size.

Summary

Dion: A Communication-Efficient Optimizer for Large Models

The paper entitled "Dion: A Communication-Efficient Optimizer for Large Models" by Kwangjun Ahn and Byron Xu presents a novel optimization method designed to reduce communication overhead in distributed training. As large AI models must be trained across many accelerators to reach acceptable training times, communication (particularly gradient synchronization) becomes a critical factor limiting scalability and flexibility. Dion addresses this challenge with an orthonormalized update rule that minimizes data exchange while remaining compatible with standard distributed training approaches such as Distributed Data Parallel (DDP) and Fully Sharded Data Parallel (FSDP).

Dion's central innovation is an orthonormalized update mechanism paired with decoupled momentum buffers, which together eliminate the full gradient synchronization traditionally required by methods such as Adam. This design yields considerable I/O savings and allows efficient sharding without reconstructing a full parameter matrix on any single device. The reduced communication cost makes Dion particularly attractive for hybrid sharding configurations and for large-scale, geographically dispersed training scenarios in which data parallelism is layered on top of model parallelism.

Quantitatively, Dion's I/O demands are significantly lower than those of conventional optimizers such as Adam and Muon, with detailed metrics provided in the paper (Table 1). For an $m \times n$ weight matrix, Dion's data-parallel I/O cost is $(m+n)r$, markedly less than the $mn$ required by Adam, and its optimizer state occupies $mn + nr$ memory, keeping storage demands manageable by leveraging a low-rank factor $r$.
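A quick back-of-the-envelope comparison makes the Table 1 costs concrete; the matrix size and rank below are hypothetical values chosen only for illustration:

```python
# Illustrative per-step data-parallel I/O for a single m x n weight matrix,
# using the element counts reported in the paper (Table 1).

def dp_io_adam(m: int, n: int) -> int:
    # Adam all-reduces the full gradient: m * n elements.
    return m * n

def dp_io_dion(m: int, n: int, r: int) -> int:
    # Dion communicates low-rank factors: (m + n) * r elements.
    return (m + n) * r

# Hypothetical transformer-style weight with rank r = n / 8.
m, n, r = 4096, 4096, 512
adam_cost = dp_io_adam(m, n)     # 16,777,216 elements
dion_cost = dp_io_dion(m, n, r)  # 4,194,304 elements
print(f"Dion uses {dion_cost / adam_cost:.1%} of Adam's DP I/O")  # 25.0%
```

Because the cost scales with $(m+n)r$ rather than $mn$, the savings grow as the rank fraction $r/\min(m,n)$ shrinks.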

The paper also develops the theoretical formulation of Dion: the algorithm combines power iteration with an error-feedback mechanism to approximate orthonormal updates efficiently. The authors prove that the distributed version of Dion is numerically equivalent to its centralized counterpart, establishing the optimizer's correctness and robustness in a distributed setting (Theorem 1).
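As a rough illustration of the power-iteration-with-error-feedback idea, here is a minimal NumPy sketch. The function name, the QR-based orthonormalization, and the exact factor bookkeeping are our assumptions for exposition, not the authors' implementation:

```python
import numpy as np

def dion_step_sketch(M, Q):
    """Hedged sketch of one rank-r orthonormalized update with error feedback.

    M: (m, n) momentum buffer.
    Q: (n, r) right factor carried over from the previous step
       (a warm start for the amortized power iteration).
    Returns (update, M_residual, Q_new).
    """
    P = M @ Q                   # power-iteration step onto the current subspace
    P, _ = np.linalg.qr(P)      # orthonormalize the left factor, shape (m, r)
    R = M.T @ P                 # right factor capturing M on that basis, (n, r)
    M_residual = M - P @ R.T    # error feedback: retain what rank r missed
    Q_new, _ = np.linalg.qr(R)  # orthonormal right factor for the next step
    update = P @ Q_new.T        # approximately orthonormal update direction
    return update, M_residual, Q_new

# Demo on random data: the update's nonzero singular values are all 1,
# i.e. it is orthonormal on its rank-r subspace.
rng = np.random.default_rng(0)
M = rng.standard_normal((64, 32))
Q, _ = np.linalg.qr(rng.standard_normal((32, 8)))
upd, M_res, Q_new = dion_step_sketch(M, Q)
s = np.linalg.svd(upd, compute_uv=False)
```

Since both `P` and `Q_new` have orthonormal columns, the update `P @ Q_new.T` has unit singular values on its rank-$r$ subspace, while the residual fed back into the momentum buffer preserves the components the low-rank factors missed.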

Empirical evaluations reinforce Dion's competitive performance across varied batch sizes and model configurations, particularly under low-rank approximation. Dion maintains superior convergence rates and stability relative to baselines such as Adam and Muon across batch-size settings, even when the rank is reduced to a fraction of the model dimension. The paper also examines key design choices, such as error feedback, which improves performance at lower ranks by correcting approximation errors over successive steps.

Among related works, Dion distinguishes itself by reducing communication overhead while retaining the synchronous semantics of updates, in contrast to methods built on asynchronous or compressed gradient communication. By working within existing distributed training constructs, Dion offers a path to communication efficiency without major infrastructure overhauls.

In conclusion, Dion represents a promising advancement in communication-efficient optimization for large-scale AI models. By focusing on synchronous update rules and enabling efficient distributed implementation, it opens avenues for further exploration into low-rank optimizers and potential hybrid strategies marrying model efficiency with broader scalability. Future work could extend its principles to various model architectures or incorporate additional compression techniques, maximizing its applicability across different AI domains.
