Universal Checkpointing: A Flexible and Efficient Distributed Checkpointing System for Large-Scale DNN Training with Reconfigurable Parallelis

Published 27 Jun 2024 in cs.DC and cs.LG | (2406.18820v3)

Abstract: Deep neural network (DNN) training continues to scale rapidly in terms of model size, data volume, and sequence length, to the point where multiple machines are required to fit large models for training. Different distributed and parallel training strategies have been developed to support large-scale DNN training by partitioning the training state across GPUs. However, existing DNN training systems provide very limited support for reconfiguring parallelism strategies in the middle of the training via checkpointing. This limitation arises because distributed checkpoints are tightly coupled to specific model parallelism and hardware configurations, preventing large-scale training jobs from efficiently adapting to hardware failures or resource elasticity. This paper presents Universal Checkpointing (UCP), a novel checkpointing system that enables flexible and efficient DNN training with reconfigurable parallelism. UCP overcomes challenges in existing systems by decoupling checkpoint structure from parallel training strategies and hardware configurations. In addition, we present a pattern-based reconfiguration pipeline that enables automatic, flexible, and efficient mapping of checkpoint state to various parallelism strategies. Evaluation on a range of DNN models, including state-of-the-art dense and sparse LLMs, shows that UCP enables reconfiguration for a broader set of widely used parallelism strategies than existing solutions while adding negligible reconfiguration cost. UCP has been successfully employed in real LLM training workloads, greatly enhancing their flexibility and resilience to dynamic hardware environments.

Abstract PDF HTML Upgrade to Chat

Citations (5)

View on Semantic Scholar

Summary

The paper introduces a universal checkpointing system for reconfigurable parallelism in large-scale DNN training using atomic checkpoints.
It employs a pattern-based reconfiguration pipeline and optimizations such as nested parallel reconfiguration, redundancy-bypassing loading, and lazy invocation.
Evaluation shows that UCP supports diverse parallelism strategies efficiently with minimal training overhead and enhanced fault recovery.

Universal Checkpointing for Reconfigurable Parallelism in Large-Scale DNN Training

Introduction

The study explores Universal Checkpointing (UCP), a system designed to address the challenges of reconfigurable parallelism in large-scale deep neural network (DNN) training. When training large DNNs, resizing and reconfiguring parallelism strategies without losing training efficiency or accuracy is crucial. Existing checkpointing mechanisms are tightly coupled with specific parallelism strategies, limiting their adaptability. The UCP aims to resolve this limitation by introducing a flexible checkpointing system supporting a broad range of parallelism strategies.

Atomic Checkpoints

UCP departs from conventional tightly-coupled checkpointing systems by leveraging atomic checkpoints, an abstraction that decouples checkpoint data from specific parallelism strategies or hardware configurations. This decoupling allows UCP to offer a flexible and universal format for saving model states. An atomic checkpoint represents a consolidated view of a model's parameters, allowing seamless transitioning between different form factors in which DNN models might be configured or trained.

Figure 1: Existing distributed checkpoints are highly coupled to specific model parallelism strategy. Each worker in ZeRO-3 saves its own unique sharded model parameters (P) and optimizer states (OS). Reconfiguring ZeRO-3 requires altering each sharded parameter and optimizer states.

Pattern-Based Reconfiguration

The pattern-based reconfiguration pipeline constitutes the core of UCP's strategy to achieve seamless parallelism reconfiguration. By recognizing pre-defined tensor patterns in model parameters—such as replicates, sharded, unique, and more—UCP utilizes pattern-based transformation operations, which allow the system to automatically adapt parameter storage and retrieval processes to suit the target parallelism strategy.

Figure 2: Overview of UCP system design. UCP enables flexible and efficient reconfiguration from any Source parallelism strategy, e.g., $P_{src}$ (ZeRO-1 + PP) = (DP=2, PP=4) to any Target parallelism strategy, e.g., $P_{tgt}$ (3D-Parallel) = (DP=2, TP=2, PP=2), via {atomic checkpoints}, {pattern-based reconfiguration pipeline}, and efficient optimizations such as {nested parallel reconfiguration}.

Efficient Reconfiguration

Efficiency in UCP is achieved through three core optimizations:

Nested Parallel Reconfiguration: UCP employs a MapReduce framework to handle reconfiguration, leveraging multiple nodes and processors to distribute the data transformation workload efficiently.
Redundancy-Bypassing Loading: By eliminating redundant loading in DP groups and utilizing fast GPU interconnects, UCP optimizes the loading phase and minimizes I/O pressure.
Lazy Reconfiguration Invocation: The system optimizes performance by triggering reconfiguration only when necessary, thus preserving training speeds and preventing excessive computing overhead.
Figure 3: Illustration of the nested parallel reconfiguration process of UCP. UCP leverages a MapReduce-based approach to utilize the aggregated compute and bandwidth of multi-node multi-processors to convert distributed checkpoints into atomic checkpoints in parallel. Meanwhile, it performs careful load balancing to avoid the straggle problem.

Evaluation and Impact

The evaluation of UCP demonstrates its capability to support a broader set of reconfigurable parallelism strategies, achieving higher flexibility compared to existing systems like DCP and MCP. UCP proves effective across various model sizes, scales efficiently with the size of the networks, and maintains a minimal overhead on the total training time. This system enables efficient fault recovery and dynamic resource utilization in large-scale distributed training environments, improving throughput in potentially unstable hardware scenarios.

Figure 4: Time cost for converting distributed checkpoints to UCP atomic checkpoints with sequential (S.Q.) and nested parallel (N.P) approach across different model sizes.

Conclusion

UCP represents a significant advancement in distributed DNN training by enabling flexible and efficient reconfiguration through atomic checkpoints and a pattern-based reconfiguration framework. It broadens the adaptability of training strategies across different hardware configurations without compromising system efficiency or model accuracy. The development of UCP marks a step forward in managing the complexity and dynamism of large-scale AI training systems.