DiLoCo: Distributed Low-Communication Training of Language Models
Abstract: Large language models (LLMs) have become a critical component in many applications of machine learning. However, standard approaches to training LLMs require a large number of tightly interconnected accelerators, with devices exchanging gradients and other intermediate states at each optimization step. While it is difficult to build and maintain a single computing cluster hosting many accelerators, it might be easier to find several computing clusters, each hosting a smaller number of devices. In this work, we propose a distributed optimization algorithm, Distributed Low-Communication (DiLoCo), that enables training of LLMs on islands of devices that are poorly connected. The approach is a variant of federated averaging in which the number of inner steps is large, the inner optimizer is AdamW, and the outer optimizer is Nesterov momentum. On the widely used C4 dataset, we show that DiLoCo on 8 workers performs as well as fully synchronous optimization while communicating 500 times less. DiLoCo is highly robust to the data distribution of each worker. It is also robust to resources becoming unavailable over time and, conversely, can seamlessly leverage resources that become available during training.
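The abstract describes the algorithm's structure: each worker runs many local inner steps with AdamW and no communication, then the workers' parameter deltas are averaged once per round and applied with Nesterov momentum as the outer step. The sketch below illustrates that two-level loop on a toy numpy problem; it is not the paper's implementation. The function names, the per-worker gradient interface, and the hyperparameter defaults are assumptions for illustration only.

```python
import numpy as np

def adamw_step(theta, grad, state, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, wd=0.01):
    """One AdamW update (the inner optimizer)."""
    state["t"] += 1
    state["m"] = b1 * state["m"] + (1 - b1) * grad
    state["v"] = b2 * state["v"] + (1 - b2) * grad ** 2
    m_hat = state["m"] / (1 - b1 ** state["t"])   # bias-corrected first moment
    v_hat = state["v"] / (1 - b2 ** state["t"])   # bias-corrected second moment
    return theta - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * theta)

def diloco(theta, worker_grad_fns, rounds=10, inner_steps=500,
           outer_lr=0.7, outer_momentum=0.9):
    """DiLoCo-style outer loop (illustrative sketch): every round, each
    worker runs many local AdamW steps without communicating; the workers'
    parameter deltas are then averaged (the only communication) and applied
    to the shared parameters with Nesterov momentum."""
    velocity = np.zeros_like(theta)
    for _ in range(rounds):
        deltas = []
        for grad_fn in worker_grad_fns:               # each "island" of devices
            local = theta.copy()
            state = {"t": 0, "m": np.zeros_like(theta), "v": np.zeros_like(theta)}
            for _ in range(inner_steps):              # local steps, no communication
                local = adamw_step(local, grad_fn(local), state)
            deltas.append(theta - local)              # this worker's "outer gradient"
        outer_grad = np.mean(deltas, axis=0)          # one all-reduce per round
        velocity = outer_momentum * velocity + outer_grad
        theta = theta - outer_lr * (outer_grad + outer_momentum * velocity)
    return theta
```

In this toy form each `grad_fn` stands in for a worker computing gradients on its own data shard; in the actual setting each worker would be a cluster training a transformer, and only the averaging step crosses the slow inter-cluster links, which is where the roughly 500-fold reduction in communication comes from.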