Unicron: Economizing Self-Healing LLM Training at Scale
Abstract: Training large-scale LLMs is increasingly critical in various domains, but it is hindered by frequent failures, leading to significant time and economic costs. Current failure recovery methods in cloud-based settings inadequately address the diverse and complex scenarios that arise, focusing narrowly on eliminating downtime for individual tasks without considering the overall cost impact on a cluster. We introduce Unicron, a workload manager designed for efficient self-healing in large-scale LLM training. Unicron optimizes the training process by minimizing failure-related costs across multiple concurrent tasks within a cluster. Its key features include in-band error detection for real-time error identification without extra overhead, a dynamic cost-aware plan generation mechanism for optimal reconfiguration, and an efficient transition strategy to reduce downtime during state changes. Deployed on a 128-GPU distributed cluster, Unicron demonstrates up to a 1.9x improvement in training efficiency over state-of-the-art methods, significantly reducing failure recovery costs and enhancing the reliability of large-scale LLM training.
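The abstract's central idea is that recovery decisions should minimize failure-related cost across all tasks in the cluster, not just restore the failed task. The sketch below is a minimal, hypothetical illustration of such cost-aware plan selection under simplifying assumptions; the classes (`TrainingTask`, `ReconfigPlan`), the cost model (forgone throughput plus one-off reconfiguration downtime), and all parameter values are invented for illustration and are not Unicron's actual implementation.

```python
# Illustrative sketch only: a cost-aware reconfiguration chooser.
# All names and the cost model are hypothetical simplifications,
# not the Unicron system's real plan-generation mechanism.
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class TrainingTask:
    name: str
    gpus_required: int          # GPUs needed to keep the current configuration
    throughput_per_gpu: float   # useful work per GPU per second (arbitrary units)
    restart_cost_s: float       # downtime paid if the task is reconfigured


@dataclass
class ReconfigPlan:
    description: str
    gpu_assignment: Dict[str, int]   # task name -> GPUs granted under this plan
    tasks_reconfigured: Set[str]     # tasks that must pay their restart cost


def plan_cost(plan: ReconfigPlan, tasks: List[TrainingTask], horizon_s: float) -> float:
    """Estimated cluster-wide cost of a plan over a planning horizon:
    throughput lost by shrunk tasks plus one-off reconfiguration downtime."""
    cost = 0.0
    for t in tasks:
        granted = plan.gpu_assignment.get(t.name, 0)
        lost_gpus = max(t.gpus_required - granted, 0)
        cost += lost_gpus * t.throughput_per_gpu * horizon_s        # forgone useful work
        if t.name in plan.tasks_reconfigured:
            cost += granted * t.throughput_per_gpu * t.restart_cost_s  # downtime
    return cost


def choose_plan(plans: List[ReconfigPlan], tasks: List[TrainingTask],
                horizon_s: float = 3600.0) -> ReconfigPlan:
    """Pick the candidate plan with the lowest estimated cluster-wide cost."""
    return min(plans, key=lambda p: plan_cost(p, tasks, horizon_s))


if __name__ == "__main__":
    tasks = [
        TrainingTask("llm-7b", gpus_required=64, throughput_per_gpu=1.0, restart_cost_s=120.0),
        TrainingTask("llm-13b", gpus_required=64, throughput_per_gpu=0.9, restart_cost_s=300.0),
    ]
    # After losing 8 GPUs, compare shrinking one task versus the other.
    plans = [
        ReconfigPlan("shrink llm-7b", {"llm-7b": 56, "llm-13b": 64}, {"llm-7b"}),
        ReconfigPlan("shrink llm-13b", {"llm-7b": 64, "llm-13b": 56}, {"llm-13b"}),
    ]
    best = choose_plan(plans, tasks)
    print("chosen plan:", best.description)
```

Under this toy model, the chooser trades a task's reconfiguration downtime against the throughput it would forgo by running on fewer GPUs, which is the kind of cluster-level trade-off the abstract attributes to Unicron's plan generation.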