
Galvatron: Hybrid Parallelism Framework

Updated 27 January 2026
  • Galvatron is an open-source distributed framework that automates hybrid parallelism selection to optimize memory and compute efficiency for large-scale models.
  • It integrates data, tensor, pipeline, sharded data, and sequence parallelism using decision-tree pruning and dynamic programming to minimize iteration time.
  • The framework leverages hardware profiling and dynamic runtime adaptation to adjust strategies based on GPU utilization and communication overhead, boosting throughput.

Galvatron is an automatic, open-source distributed system framework for training large-scale Transformer and Foundation Models utilizing multi-dimensional hybrid parallelism. By integrating data, tensor/model, pipeline, sharded data, and sequence parallelism, in addition to activation recomputation, Galvatron optimizes throughput and resource utilization across heterogeneous GPU clusters. The system employs hardware/model profiling, decision-tree based configuration space pruning, and dynamic programming-based search algorithms, together with dynamic strategy adaptation during runtime, to maximize training efficiency subject to memory and performance constraints (Miao et al., 2022, Gumaan, 13 Mar 2025, Wang et al., 2023, Liu et al., 30 Apr 2025).

1. Motivation and Foundational Paradigms

Transformer models such as BERT, GPT, T5, ViT, and Swin have demonstrated state-of-the-art performance but pose significant challenges for distributed training owing to increased parameter counts and activation sizes. The training of such models on GPU clusters requires navigating multiple orthogonal parallelism modalities: Data Parallelism (DP), Sharded Data Parallelism (SDP/FSDP), Tensor Parallelism (TP), Pipeline Parallelism (PP), and Sequence Parallelism (SP). Legacy frameworks (Megatron-LM, DeepSpeed, FairScale, GShard) support limited combinations and often require expert manual tuning, constraining scalability and efficiency.

Galvatron formalizes the hybrid parallelism selection as an optimization: given a model $M$ with $L$ layers and $N$ GPUs (each with memory capacity $E$), find a layer-wise assignment $\pi = \{\pi_1,\ldots,\pi_L\}$ (where each $\pi_i$ draws from DP, SDP/FSDP, TP, PP, SP, RC) such that (i) per-GPU memory usage does not exceed $E$ and (ii) total iteration time $C(\pi)$ is minimized (Miao et al., 2022).
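
This selection problem can be written compactly as a constrained optimization (restating the prose above; the notation $M_i(\pi)$ for the peak memory on GPU $i$ under assignment $\pi$ is ours, introduced for clarity):

```latex
\pi^{*} \;=\; \arg\min_{\pi = \{\pi_1,\ldots,\pi_L\}} C(\pi)
\qquad \text{subject to} \qquad \max_{1 \le i \le N} M_i(\pi) \;\le\; E,
```

where each $\pi_\ell$ is drawn from the available paradigms (DP, SDP/FSDP, TP, PP, SP, RC) and their compositions.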

2. Architecture and Supported Techniques

Galvatron comprises three core, tightly integrated modules:

  • Profiler: Executes hardware profiling (inter-device bandwidth, GPU FLOPS for GEMM/softmax/layernorm) and model profiling (per-layer FLOPS, activations, parameter and optimizer memory).
  • Search Engine: Encodes each layer’s candidate parallelism choices as nodes in a decision tree, pruned early by device constraints. The cost model for each candidate comprises profiled compute time $T_\mathrm{compute}$, communication time $T_\mathrm{comm}$, and peak memory $M_\mathrm{peak}$. Supported techniques include DP, TP, PP (1D/2D, micro-batch with 1F1B scheduling), SDP (ZeRO-1/2/3/FSDP-style), SP, and activation recomputation (RC) (Liu et al., 30 Apr 2025).
  • Runtime Engine: Implements all parallel modes natively, orchestrates NCCL collectives with fusion and overlap, auto-generates per-layer strategy assignments, and provides high-level APIs for resource-aware model construction.

The configuration space is constructed hierarchically, beginning with PP (determining pipeline stages and “islands”), then integrating the DP/SDP/TP/SP/RC composition per stage. Sequence parallelism becomes essential for extended-context models, while recomputation trades compute for memory by recomputing discarded activations during the backward pass (Liu et al., 30 Apr 2025, Wang et al., 2023).
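
As a rough illustration of how such a cost model scores per-layer candidates, the following sketch combines profiled quantities into a feasibility check and a time estimate. All names, strategies, and numbers here are invented for illustration; this is not Galvatron's actual API.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    """One parallelism choice for a layer (illustrative values only)."""
    name: str
    t_compute: float   # profiled compute time (s)
    t_comm: float      # estimated communication time (s)
    m_peak: float      # peak memory footprint (GB)

def layer_cost(c: Candidate) -> float:
    """Per-iteration time contribution: profiled compute plus estimated communication."""
    return c.t_compute + c.t_comm

candidates = [
    Candidate("DP",  t_compute=1.0, t_comm=0.8, m_peak=10.0),
    Candidate("TP4", t_compute=0.4, t_comm=0.9, m_peak=4.0),
]
budget_gb = 8.0

# Prune candidates that exceed the per-GPU memory budget, then pick the fastest.
feasible = [c for c in candidates if c.m_peak <= budget_gb]
best = min(feasible, key=layer_cost)
print(best.name)  # prints TP4: DP exceeds the 8 GB budget
```

A real search engine evaluates such costs per layer and per strategy, then composes them globally; this snippet only shows the shape of a single layer-level decision.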

3. Strategy Search: Decision-Tree Pruning and Dynamic Programming

Evaluating all combinations of hybrid parallelism at each layer is computationally prohibitive. Galvatron uses decision-tree decomposition (tree height dictated by the number of parallel dimensions, e.g., 3–6), pruned by three key principles: PP is applied first over low-bandwidth links; group sizes within each paradigm are equal; and SDP is not nested with DP (Miao et al., 2022, Wang et al., 2023). Each root-to-leaf path represents a distinct strategy (≤44 per layer with 8 GPUs when activation checkpointing is included), dramatically reducing the effective search set.

Once candidate sets are generated, Galvatron applies dynamic programming:

$$C(\ell, e) = \min_{S_j \in S} \left\{ C(\ell-1,\, e - O(\ell, S_j)) + c(\ell, S_j) + R(\ell, S_{\ell-1}, S_j) \right\}$$

where $e$ tracks the residual memory budget, $c(\ell, S_j)$ is the compute-plus-communication time of strategy $S_j$ at layer $\ell$, and $O(\ell, S_j)$ is its memory usage. The transition cost $R$ accounts for re-layout overhead (Slice-Gather transformations). Complexity is $O(L \cdot E \cdot |S|)$, tractable for $L \sim 100$, $|S| \sim 20$–$40$.
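
A minimal sketch of this recurrence, with memory discretized into integer units and a tiny invented strategy set (a real system would fill the costs from profiling; this is not Galvatron's actual implementation):

```python
import math

# Per-layer candidate strategies: name -> (time cost c, memory units O).
# Values are invented for illustration.
STRATEGIES = {"dp": (1.0, 4), "tp": (0.7, 2), "sdp": (0.9, 1)}

def transition(prev, cur):
    """Re-layout (Slice-Gather) overhead R, charged when the strategy changes."""
    return 0.0 if prev is None or prev == cur else 0.1

def min_iteration_time(num_layers: int, mem_budget: int) -> float:
    """DP over (remaining memory, last strategy) -> minimum total time."""
    # C[(e, s)] = best time so far with e memory units left, last layer using s.
    C = {(mem_budget, None): 0.0}
    for _ in range(num_layers):
        nxt = {}
        for (e, prev), t in C.items():
            for s, (c, o) in STRATEGIES.items():
                if e >= o:  # memory constraint: strategy must fit the budget
                    cand = t + c + transition(prev, s)
                    key = (e - o, s)
                    if cand < nxt.get(key, math.inf):
                        nxt[key] = cand
        C = nxt
    return min(C.values()) if C else math.inf  # inf: no feasible assignment

print(min_iteration_time(num_layers=3, mem_budget=6))
```

With a budget of 6 units, three "tp" layers (cost 0.7, memory 2 each) fit and win; shrinking the budget forces cheaper-memory strategies or infeasibility, mirroring how $e$ threads through the recurrence above.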

Galvatron-BMW extends this by optimizing both time and memory, balancing pipeline splits across stages and micro-batches to minimize overall runtime $T_\mathrm{total}$ together with the maximum per-stage memory consumption $\max_i M_i$. Pareto-efficient splits are constructed by iteratively adjusting pipeline boundaries from memory-balanced toward compute-balanced configurations (Wang et al., 2023).
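
The bi-objective split can be pictured as enumerating contiguous stage boundaries and keeping those not dominated in (max stage compute, max stage memory). This is an illustrative sketch with invented per-layer numbers, not Galvatron-BMW's actual algorithm:

```python
from itertools import combinations

def splits(n_layers: int, n_stages: int):
    """All contiguous partitions of n_layers layers into n_stages stages."""
    for cut in combinations(range(1, n_layers), n_stages - 1):
        bounds = (0, *cut, n_layers)
        yield [range(bounds[i], bounds[i + 1]) for i in range(n_stages)]

def pareto(compute, memory, n_stages):
    """Splits not dominated in (max stage compute time, max stage memory)."""
    scored = []
    for sp in splits(len(compute), n_stages):
        t = max(sum(compute[i] for i in st) for st in sp)  # compute bottleneck
        m = max(sum(memory[i] for i in st) for st in sp)   # memory bottleneck
        scored.append((t, m, [list(st) for st in sp]))
    return [s for s in scored
            if not any(o[0] <= s[0] and o[1] <= s[1] and o[:2] != s[:2]
                       for o in scored)]

compute = [1.0, 1.0, 3.0, 1.0]   # per-layer compute times (invented)
memory  = [4.0, 4.0, 1.0, 1.0]   # per-layer memory (invented)
for t, m, sp in pareto(compute, memory, n_stages=2):
    print(sp, t, m)
```

Here the compute-balanced split and the memory-balanced split survive as the two Pareto points; a bi-objective optimizer would pick between them based on the remaining memory headroom.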

4. Dynamic Runtime Adaptation and System Integration

Unlike prior static frameworks, Galvatron implements runtime monitoring and real-time adjustment using metrics such as per-iteration throughput, GPU utilization, and communication overhead. If metrics degrade below predictive thresholds, the strategy selector proposes new parallel degrees $(p_d, p_t, p_p)$, triggering checkpoint/resharding and process-group reinitialization; this enables seamless adaptation to hardware failures and dynamic resource changes (Gumaan, 13 Mar 2025).
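
A minimal sketch of such a degradation trigger, assuming a simple throughput-window heuristic (the function, window, and threshold are hypothetical; a real system would also weigh GPU utilization and communication overhead):

```python
def should_reconfigure(history, window=5, threshold=0.9):
    """True if recent mean throughput drops below `threshold` x the best seen.

    `history` is a list of per-iteration throughput samples (e.g. tokens/s).
    """
    if len(history) < window:
        return False  # not enough samples to judge
    recent = sum(history[-window:]) / window
    return recent < threshold * max(history)

# Stable throughput, then a sustained drop (e.g. a slow or failing node):
samples = [100, 102, 101, 100, 99, 70, 68, 69, 71, 70]
print(should_reconfigure(samples))  # prints True: degradation detected
```

When this fires, the runtime would checkpoint, reshard to the newly proposed degrees, and reinitialize process groups, as described above.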

Galvatron leverages DeepSpeed ZeRO for memory-efficient sharded states and Megatron-LM for intra-layer tensor parallelism primitives, automatically tuning fusion sizes and using NCCL for collective communication overlap and fusion. The effective iteration time is modeled as $T_\mathrm{iter} = \max(T_\mathrm{comp}, T_\mathrm{comm})$ rather than the additive $T_\mathrm{comp} + T_\mathrm{comm}$, since overlapped collectives hide much of the communication cost.
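
The difference between the two models is easy to quantify in a few lines (a sketch with invented timings, not Galvatron's internal code):

```python
def iter_time(t_comp: float, t_comm: float, overlap: bool) -> float:
    """Model iteration time: with full overlap the slower phase dominates;
    without overlap, compute and communication costs add."""
    return max(t_comp, t_comm) if overlap else t_comp + t_comm

t_comp, t_comm = 1.2, 0.9  # seconds per iteration (invented)
additive = iter_time(t_comp, t_comm, overlap=False)
overlapped = iter_time(t_comp, t_comm, overlap=True)
print(f"communication cost hidden: {additive - overlapped:.1f}s per iteration")
```

Under the max model the 0.9 s of communication is fully hidden behind compute; an additive model would wrongly penalize communication-heavy strategies that in practice overlap well.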

5. Empirical Benchmarks and Comparative Analysis

Galvatron has been benchmarked on NVIDIA H100/A100/RTX clusters (up to 64 GPUs) across NLP (BERT, T5), CV (ViT, Swin), and LM workloads (GPT-3 variants) (Miao et al., 2022, Liu et al., 30 Apr 2025, Wang et al., 2023). Key results include:

Framework      GPUs   Throughput (GPT-3 175B)   Scaling Eff.   GPU Utilization
Megatron-LM    32     1.00× (baseline)          72%            86%
DeepSpeed Z3   32     1.10×                     75%            88%
FairScale      32     1.08×                     73%            87%
Galvatron      32     1.28×                     86%            94%

For BERT-Huge-32 on 8×A100@16GB, Galvatron-BMW accommodates a batch size of 128 (yielding a 5.3× speedup over pure paradigms and 1.5–2.4× over expert hybrid baselines) (Wang et al., 2023). Throughput gains of 1.26×–1.47× over manually tuned Megatron-LM/DeepSpeed are consistently observed across diverse hardware and model sizes (Liu et al., 30 Apr 2025). The cost model estimates plan performance within 5% error provided communication/compute overlap is properly accounted for; neglecting this overlap leads to suboptimal plans.

6. System Interfaces, Usability, and Extensibility

Galvatron offers user-friendly APIs for profiling, strategy search, and hybrid model construction (Python, PyTorch integration). Configuration exposes per-layer overrides and cluster topology, accepting YAML or JSON descriptions of device layout and NVLink/PCIe grouping. Minimal code modifications are required:

from hetu_galvatron import GalvatronProfiler, GalvatronSearch, GalvatronRuntime

# Profile the hardware (bandwidth, FLOPS) and the target model
prof = GalvatronProfiler(hardware_cfg)
hw_stats = prof.profile()
model_stats = prof.profile_model(my_transformer)

# Search for a per-layer hybrid parallelism plan under the memory budget
search = GalvatronSearch(hw_stats, model_stats)
strategy_plan = search.get_hybrid_parallel_configs()

# Construct the distributed model from the plan and train as usual
gp_model = GalvatronRuntime.construct_hybrid_parallel_model(my_transformer, strategy_plan)
trainer = Trainer(gp_model, ...)
trainer.train()

Documentation and codebase are available at https://github.com/PKU-DAIR/Hetu-Galvatron and https://hetu-galvatron.readthedocs.io (Liu et al., 30 Apr 2025).

Planned extensions include learned cost models, reinforcement learning strategy search, finer-grained (per-head) parallelism, MoE/expert partitioning, fault tolerance, and dynamic elasticity support for volatile clusters (Gumaan, 13 Mar 2025).

7. Limitations and Lessons Learned

Galvatron assumes homogeneous GPUs and static model architectures, with batch sizes fixed per iteration. Communication modeling omits global network contention beyond the single-node level. Pipeline strategies are restricted to GPipe-style scheduling (interleaved or branched topologies such as PipeDream are not yet supported). Memory optimizations such as activation checkpointing and quantization are still being integrated into the configuration space (Miao et al., 2022, Wang et al., 2023).

No single parallel paradigm is universally optimal: the balance of activation versus parameter memory determines the best hybrid composition. Activation recomputation is critical in activation-bound scenarios, often doubling the feasible batch size under tight memory budgets. Bi-objective partitioning is necessary to exploit both runtime and memory headroom. The DP search, linear in the memory budget $E$, combined with aggressive decision-tree pruning makes exhaustive hybrid search feasible, taking negligible time compared to multi-day training runs.

Galvatron automates a previously manual, expert-driven process for distributed hybrid parallelism selection and execution, providing a robust framework for state-of-the-art throughput on large-scale GPU clusters in both research and production environments (Miao et al., 2022, Gumaan, 13 Mar 2025, Wang et al., 2023, Liu et al., 30 Apr 2025).
